Deep_Learning_fake_news

Category: Content generation
Development tool: Jupyter Notebook
File size: 623 KB
Downloads: 0
Upload date: 2018-12-27 16:35:13
Uploader: sh-1993
Description: Fake news detection by deep learning

File list:
LICENSE (1064, 2018-12-28)
Notebooks (0, 2018-12-28)
Notebooks\Making_Emb_.ipynb (21860, 2018-12-28)
Notebooks\News_dataset_by_George McIntire.ipynb (294853, 2018-12-28)
Notebooks\News_dataset_from_opensources.co.ipynb (268315, 2018-12-28)
Notebooks\News_dataset_kaggle.ipynb (273946, 2018-12-28)
imgs (0, 2018-12-28)
imgs\NN.jpg (29578, 2018-12-28)
imgs\SE.jpg (22412, 2018-12-28)
imgs\metadata (20, 2018-12-28)
imgs\r1.jpg (23247, 2018-12-28)
imgs\r2.jpg (22951, 2018-12-28)

# **Application of Transfer Learning - Usage of Sentence Embeddings for Fake News Detection by Deep Learning**

This project uses the Google Universal Sentence Encoder. Deep learning models were trained on the encoder's sentence embeddings, and their performance was documented.
The task was to build a model that classifies news articles into two classes, Real and Fake.
A labelled dataset was available.
The same technique is expected to work for other **NLP classification tasks**.

Research paper - [Universal Sentence Encoder](https://arxiv.org/abs/1803.11175)

Sentence Encoder by Google - [TensorFlow Hub](https://tfhub.dev/google/universal-sentence-encoder/2)

*Key requirements: Python 3+, Keras 2+, TensorFlow 1+.
Training was performed on [Google Colab](https://colab.research.google.com/).*

## Contents

### **Python Notebooks**

- 3 Python notebooks demonstrating model performance on three different datasets.
- 1 Python notebook showing how the embeddings were generated.

## Fake or Real Dataset

3 different datasets were used in this project:

- [News Dataset from Kaggle](https://www.kaggle.com/c/fake-news/data)
- [Dataset available from OpenSources](http://www.opensources.co/)
- [News obtained from George McIntire Fake News Dataset](https://bitbucket.org/WiZar-D/fake_real_dataset)

**Dataset description**

Each dataset consists of labelled news articles in two categories, "Real" and "Fake". Below is an overview of the datasets.

| Dataset | Real | Fake |
| --- | --- | --- |
| Dataset 1: Kaggle | 6200 | 6200 |
| Dataset 2: OpenSources | 6724 | 6162 |
| Dataset 3: Work done by George McIntire | 1900 | 1992 |

**Each corpus contained more news articles than the counts shown above. Only articles with more than 7 and fewer than 50 sentences were used.**
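
The sentence-count filter described above can be sketched as follows. This is a minimal illustration, not the notebooks' code: the regex-based `split_sentences` helper is a hypothetical stand-in for whatever sentence tokenizer was actually used, and the strict bounds follow the wording "more than 7 and fewer than 50".

```python
import re

# Bounds taken from the text: strictly more than 7, strictly fewer than 50.
MIN_SENTENCES, MAX_SENTENCES = 7, 50


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def keep_article(text):
    """Return True if the article's sentence count lies strictly between the bounds."""
    n = len(split_sentences(text))
    return MIN_SENTENCES < n < MAX_SENTENCES


def filter_articles(articles):
    """Keep only the articles that pass the sentence-count filter."""
    return [a for a in articles if keep_article(a)]
```

A real pipeline would probably use a proper sentence tokenizer, since a regex like this mis-splits abbreviations such as "U.S." or "Dr.".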
## **Model Description**

4 deep learning models were developed for the classification task. [Recurrent Neural Networks](https://keras.io/layers/recurrent/) were used.

**Reason for selecting articles with a sentence count in the range (7, 50)**

An LSTM takes as input a 3-D tensor of shape (samples, timesteps, features). Limiting the number of sentences per article lets us design a network whose input is a 3-D tensor of shape (n, 50, 512), where "n" is the number of articles, 50 is the maximum number of sentences per article, and 512 is the size of one sentence embedding. The **Google sentence encoder** returns a sentence embedding of shape (512,).

![Sentence Encoder](imgs/SE.jpg "Universal Sentence Encoder")

The embeddings were then reshaped into [NumPy's multidimensional, homogeneous array](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.ndarray.html). These NumPy arrays made up the input for the neural network.

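
One way to assemble the (n, 50, 512) input is to stack each article's sentence embeddings and pad the missing timesteps. The sketch below assumes zero-padding at the end of each article; the text does not say how shorter articles were padded, so that choice and the `to_tensor` helper name are illustrative only.

```python
import numpy as np


def to_tensor(articles, max_sentences=50, dim=512):
    """Stack per-article embeddings into a (n, max_sentences, dim) array.

    `articles` is a list in which each item has shape (num_sentences, dim).
    Articles shorter than `max_sentences` are zero-padded at the end
    (an assumption); longer ones are truncated.
    """
    batch = np.zeros((len(articles), max_sentences, dim), dtype=np.float32)
    for i, embeddings in enumerate(articles):
        emb = np.asarray(embeddings, dtype=np.float32)[:max_sentences]
        batch[i, : len(emb)] = emb
    return batch


# Usage with random stand-ins for real sentence embeddings.
rng = np.random.default_rng(0)
fake_articles = [rng.normal(size=(8, 512)), rng.normal(size=(49, 512))]
x = to_tensor(fake_articles)
print(x.shape)  # (2, 50, 512)
```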
*(The figure below is taken from a deep learning course on Coursera. A similar version of this network was implemented.)*

![Network](imgs/NN.jpg "Network Architecture")

**3 different architectures were used:**

- Single Layer LSTM Model - an LSTM layer followed by a dense output layer.
- Double Layer LSTM (Connected) Model - 2 LSTM layers, with the output of the first layer fed into the second.
- CNN LSTM Model - one or more CNN layers wrapped in a TimeDistributed layer, followed by an LSTM layer and an output layer.

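
The three architectures listed above could be expressed roughly as follows. This is a minimal sketch, assuming the `tensorflow.keras` API from TensorFlow 2.x rather than the Keras 2 / TensorFlow 1 setup named in the requirements. The layer sizes (64 LSTM units, 16 filters, kernel size 5), the single sigmoid output, and the reshape and pooling steps in the CNN variant are illustrative assumptions, not values taken from the notebooks.

```python
from tensorflow import keras
from tensorflow.keras import layers

TIMESTEPS, FEATURES = 50, 512  # sentences per article, embedding size


def single_lstm(units=64):
    """One LSTM layer followed by a dense output layer."""
    return keras.Sequential([
        keras.Input(shape=(TIMESTEPS, FEATURES)),
        layers.LSTM(units),
        layers.Dense(1, activation="sigmoid"),
    ])


def stacked_lstm(units=64):
    """Two LSTM layers; the first returns the full sequence for the second."""
    return keras.Sequential([
        keras.Input(shape=(TIMESTEPS, FEATURES)),
        layers.LSTM(units, return_sequences=True),
        layers.LSTM(units),
        layers.Dense(1, activation="sigmoid"),
    ])


def cnn_lstm(filters=16, kernel_size=5, units=64):
    """Conv1D applied to each sentence embedding via TimeDistributed, then LSTM."""
    return keras.Sequential([
        keras.Input(shape=(TIMESTEPS, FEATURES)),
        # Add a channel axis so Conv1D can slide over each 512-d embedding.
        layers.Reshape((TIMESTEPS, FEATURES, 1)),
        layers.TimeDistributed(
            layers.Conv1D(filters, kernel_size, activation="relu")),
        # Collapse each sentence's feature map to one vector of size `filters`.
        layers.TimeDistributed(layers.GlobalMaxPooling1D()),
        layers.LSTM(units),
        layers.Dense(1, activation="sigmoid"),
    ])
```

Each builder returns an uncompiled model; for a two-class task it could be compiled with, for example, `model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])`.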
All these models were built using Keras.

## Results

Each model is evaluated with a [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) and with loss and accuracy graphs.

*Sample output*

```
[[1178   67]
 [  63 1172]]
              precision    recall  f1-score   support

        Real       0.95      0.95      0.95      1245
        Fake       0.95      0.95      0.95      1235

   micro avg       0.95      0.95      0.95      2480
   macro avg       0.95      0.95      0.95      2480
weighted avg       0.95      0.95      0.95      2480
```

![Accuracy Graph](imgs/r1.jpg "Accuracy")
![Loss Graph](imgs/r2.jpg "Loss")

From the graphs it is evident that the model has overfitted the data.
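
The per-class figures in the sample report can be recomputed from the confusion matrix alone. The sketch below assumes the scikit-learn convention that rows are true labels and columns are predicted labels, which matches the support values of 1245 and 1235; the `per_class_metrics` helper is hypothetical and written in plain Python for illustration.

```python
def per_class_metrics(cm, labels):
    """Compute precision, recall and F1 per class.

    cm[i][j] is the number of samples of true class i predicted as class j.
    Assumes every class has at least one true and one predicted sample.
    """
    report = {}
    for k, label in enumerate(labels):
        true_positives = cm[k][k]
        predicted = sum(row[k] for row in cm)  # column sum
        actual = sum(cm[k])                    # row sum, i.e. support
        precision = true_positives / predicted
        recall = true_positives / actual
        f1 = 2 * precision * recall / (precision + recall)
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": actual,
        }
    return report


cm = [[1178, 67], [63, 1172]]
report = per_class_metrics(cm, ["Real", "Fake"])
accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))

for label, scores in report.items():
    print(label, {name: round(value, 2) for name, value in scores.items()})
print("accuracy", round(accuracy, 2))
```

All values round to 0.95, in agreement with the report.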
More work is needed to make the model generalize; implementing a more complex and deeper network and training on more data may improve model performance.
More features can be added besides sentence embeddings, such as how much the headline agrees with the content, the source of the article, and so on.
A neural network trained with [ELMo](https://arxiv.org/abs/1802.05365) ([*developed by AllenNLP*](https://allennlp.org/elmo)), a deep contextualized word representation, can also be used.
