msc-dissertation-final

Category: Thesis
Development tool: Jupyter Notebook
Upload date: 2020-12-17 07:10:51
Uploader: sh-1993
Description: MSc dissertation (final)

File list:
3.1 - Data Extraction.ipynb (32125, 2020-12-16)
3.2 - Data Cleaning (Part 1).ipynb (161948, 2020-12-16)
3.3 - Data Cleaning (Part 2).ipynb (3311682, 2020-12-16)
3.4 - Feature Engineering.ipynb (140095, 2020-12-16)
3.5 - Text Pre-Processing.ipynb (329101, 2020-12-16)
4.1 - Exploratory Data Analysis of Scam Reports.ipynb (581877, 2020-12-16)
4.2 - Exploratory Data Analysis of Text in Scam Reports.ipynb (455026, 2020-12-16)
4.3 - Exploratory Data Analysis of COVID-19-Related Scams.ipynb (197274, 2020-12-16)
5.1 - Experiment 1 (Classification with Class Imbalance).ipynb (729618, 2020-12-16)
5.2 - Generating Augmented Text.ipynb (43855, 2020-12-16)
5.3 - Experiment 2 (Classification with Class Balanced with Text Augmentation).ipynb (639389, 2020-12-16)
5.4 - Experiment 3 (Classification with Class Balanced with SMOTE).ipynb (641266, 2020-12-16)
5.5 - Analysis of Results.ipynb (332283, 2020-12-16)
6.1 - Training Doc2Vec Models.ipynb (36656, 2020-12-16)
7.1 - Putting It All Together (Part 1).ipynb (8602140, 2020-12-16)
7.2 - Putting It All Together (Part 2).ipynb (105923, 2020-12-16)
Data/ (0, 2020-12-16)
Data/scam_candidate.csv (14102, 2020-12-16)
Data/scam_data_1.csv (3445413, 2020-12-16)
Data/scam_data_2.csv (3432908, 2020-12-16)
Data/scam_data_3.csv (3251174, 2020-12-16)
Data/scam_data_4.csv (8734214, 2020-12-16)
Data/scam_raw_dataset.csv (3436529, 2020-12-16)
Data/scam_type_cat_mapping.pkl (173, 2020-12-16)
Dissertation Report/ (0, 2020-12-16)
Dissertation Report/MSc Dissertation - Zeya Lwin Tun (Final).pdf (14581264, 2020-12-16)
Figures/ (0, 2020-12-16)
Figures/architecture-bilstm-glove.png (21876, 2020-12-16)
Figures/architecture-bilstm.png (22195, 2020-12-16)
Figures/architecture-lstm-glove.png (18906, 2020-12-16)
Figures/architecture-lstm.png (19063, 2020-12-16)
Figures/architecture-rnn-glove.png (18780, 2020-12-16)
Figures/architecture-rnn.png (18780, 2020-12-16)
Figures/f1_score.png (378555, 2020-12-16)
Figures/no_title_aggregated_by_day.png (414368, 2020-12-16)
Figures/no_title_aggregated_by_month.png (520398, 2020-12-16)
Figures/no_title_aggregated_by_year.png (220825, 2020-12-16)
Figures/no_title_boxplots.png (748086, 2020-12-16)
Figures/no_title_by_scam_types.png (715757, 2020-12-16)
Figures/no_title_daily_average.png (207220, 2020-12-16)
... ...

# **Supervised and Unsupervised Applications of Natural Language Processing on Free Text towards Tackling Scams**

## MATH5872M Dissertation in Data Science and Analytics

## University of Leeds, September 2020

### Author: Zeya Lwin Tun

### Supervisors: Dr Daniel Birks and Dr Leonid Bogachev

This repository contains reproducible code written in Python 3.7.7 as part of my Master's dissertation at the University of Leeds. My dissertation is titled **Supervised and Unsupervised Applications of Natural Language Processing on Free Text towards Tackling Scams**.

### Abstract

Scams are becoming increasingly prevalent and a cause of concern globally. In Singapore, scams made up 27.0% of overall crimes in 2019, compared to 17.5% in 2018. In the first half of 2020, a total of S$82 million was cheated from victims, almost twice the amount in the same period of 2019. Besides immediate financial losses, victims of scams also suffer from longer-term emotional and psychological effects. Despite efforts by authorities, victims continue to fall prey, owing partly to more sophisticated means used by scammers. There is therefore a strong need to increase the understanding of scams and how they can be prevented. The research in this dissertation aims to achieve this by drawing lessons from others' scam experiences shared on 'Scam Alert', a Singapore-based website aimed at promoting scam awareness. More specifically, this research harnesses the hidden potential of free text in these scam reports using machine learning and Natural Language Processing (NLP) methods towards the following research goals: finding scam reports with similar modus operandi, extracting common characteristics from similar scam reports and classifying scam reports. In pursuit of these research goals, this dissertation presents novel applications of machine learning and NLP on free text in scam reports in two areas: supervised and unsupervised.
In the supervised application, deep learning techniques are used for multi-class classification of scams. Given class imbalance in the data, text augmentation techniques and the Synthetic Minority Over-sampling Technique (SMOTE) are explored. In addition, the efficacy of using pre-trained Global Vectors (GloVe) word embeddings is examined. Results show that the Long Short-Term Memory model trained without GloVe word embeddings on a dataset balanced with text augmentation outperformed the rest.

In the unsupervised application, the concept of vector semantics is leveraged using doc2vec models to encode scam reports as document embeddings. To evaluate doc2vec models, a new framework known as normalised Similarity-Dissimilarity Quotient (SDQ) is introduced. Normalised SDQ assesses a doc2vec model's ability to infer document embeddings that can recognise similar and dissimilar reports from sets of pre-identified scam reports. Using normalised SDQ, the best-performing doc2vec model is found to be the model trained with 150 epochs, 50-dimensional embeddings and the Distributed Memory Model of Paragraph Vector algorithm.

Findings from both supervised and unsupervised applications lay the foundation for the development of tools towards achieving the research goals. It is envisioned that these tools will sharpen the sense-making capabilities of law enforcement authorities in better understanding how scams operate and in identifying intervention points where scams can be disrupted. With such insights, public education and engagement efforts can be more tailored and effective. These tools can also improve the quality of criminal investigations against scammers, which in turn serves as a deterrent and helps toughen the stance against scams. Additionally, they can nurture a stronger sense of awareness and guardianship within society. After all, a discerning public is the strongest defence against scams.
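
The abstract mentions the Synthetic Minority Over-sampling Technique as one way of balancing classes. As a rough illustration of the core idea only (not the code used in the notebooks, which may well rely on a library implementation), the sketch below creates synthetic minority-class points by interpolating between a sample and one of its nearest minority-class neighbours. The function names `euclidean` and `smote_oversample`, the parameters `n_new`, `k` and `seed`, and the toy data are all hypothetical.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote_oversample(
    minority: Sequence[Vector], n_new: int, k: int = 5, seed: int = 0
) -> List[Vector]:
    """Return n_new synthetic points for the minority class (illustrative sketch).

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic: List[Vector] = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest neighbours of the base sample, excluding itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: euclidean(base, minority[j]),
        )[:k]
        neighbour = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    # Toy 2-D feature vectors standing in for a minority class (assumed data).
    minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for point in smote_oversample(minority_points, n_new=3, k=2, seed=42):
        print(point)
```

Because this interpolation works on numeric feature vectors, applying it to text requires the reports to be vectorised first; text augmentation, by contrast, operates on the raw text itself.
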
### Table of Contents for iPython Notebooks

The .ipynb notebooks are organised as follows:

![](https://github.com/zeyalt/msc-dissertation-final/blob/master/ipython_notebooks_content.png)

### Acknowledgements

National Crime Prevention Council, Singapore.
