## Migrated from [this repo](https://github.com/Johnny-Wish/fake-news-group2-project) of mine
# Fake News Detection Pipeline
## Collaborators Shuheng Liu, Qiaoyi Yin, Yuyuan Fang
Group project materials for fake news detection at Hollis Lab, GEC Academy
## Project Plan
![a](resources/GEC%20Group%20Presentation.jpg)
# Table of Contents
- [Fake News Detection Pipeline](#fake-news-detection-pipeline)
  * [Collaborators Shuheng Liu, Qiaoyi Yin, Yuyuan Fang](#collaborators-shuheng-liu-qiaoyi-yin-yuyuan-fang)
  * [Project Plan](#project-plan)
- [Notice for Collaborators](#notice-for-collaborators)
  * [Doing Train-Test Split](#doing-train-test-split)
  * [Directory to Push Models](#directory-to-push-models)
- [Downloadables](#downloadables)
  * [URL for Different Embeddings Precomputed on Cloud](#url-for-different-embeddings-precomputed-on-cloud)
  * [Hypyertuning Logs, Codes, and Stats](#hypyertuning-logs-codes-and-stats)
- [Quick Walkthrough (Presentation)](#quick-walkthrough-presentation)
  * [Infrastructure for Embeddings](#infrastructure-for-embeddings)
  * [Embedding Computation](#embedding-computation)
    + [URLs](#urls)
  * [Embedding Visualization](#embedding-visualization)
    + [2D T-SNE](#2d-t-sne)
    + [3D T-SNE](#3d-t-sne)
    + [Visualizing Bigram Statistics](#visualizing-bigram-statistics)
  * [Binary Classification](#binary-classification)
    + [Train-Val-Test Split](#train-val-test-split)
    + [Hypertuned Classifiers](#hypertuned-classifiers)
    + [Histogram of CV/Test Scores](#histogram-of-cvtest-scores)
    + [TF-IDF](#tf-idf)
    + [Feature Ranking with Logistic Coefficients](#feature-ranking-with-logistic-coefficients)
    + [Ensemble Learning](#ensemble-learning)
# Notice for Collaborators
## Doing Train-Test Split
Specifying `random_state` in `sklearn.model_selection.train_test_split()` ensures the same split on different datasets
(of the same length) and on different machines.
(See this [link](https://stackoverflow.com/questions/43095076/scikit-learn-train-test-split-can-i-ensure-same-splits-on-different-datasets).)
For the purposes of this project, we use `random_state=58` for every split.
During grid/random search for the best set of hyperparameters, a 75%-25% train-test split is used. 5-fold
cross-validation is run in the training phase on the 75% of samples reserved for training.
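
As a quick check of the claim above, the following minimal sketch splits two toy arrays of the same length with the same `random_state` and confirms that the same rows end up in each part. The arrays `features_a` and `features_b` are made up for illustration and are not part of the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Two hypothetical feature matrices with the same number of rows
n_samples = 20
features_a = np.arange(n_samples).reshape(-1, 1)        # row i holds the value i
features_b = np.arange(n_samples).reshape(-1, 1) * 10   # row i holds the value 10 * i

# Same random_state and same length: the same row indices go to train and test
a_train, a_test = train_test_split(features_a, test_size=0.25, random_state=58)
b_train, b_test = train_test_split(features_b, test_size=0.25, random_state=58)

same_split = (np.array_equal(a_train * 10, b_train)
              and np.array_equal(a_test * 10, b_test))
print(same_split)
```

This is why embeddings computed separately (for example, title vectors and text vectors) can be split independently and still stay aligned row by row.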
## Directory to Push Models
There is a `model/` directory nested under the project. Please name your model as `model_name.py`, and place it under
the `model/` directory (e.g. `model/KNN.py`) before pushing to this repo.
# Downloadables
Before trying to reproduce our results, note that pre-computed embeddings can be downloaded from the URLs below. Downloading them and storing them in the `pretrained/` folder under this repository will save a lot of computation time.
## URL for Different Embeddings Precomputed on Cloud
- [all computed embeddings and labels](https://www.floydhub.com/wish1104/datasets/fake-news-embeddings/5), see list below
- [onehot title & text (sparse matrix)](https://www.floydhub.com/wish1104/projects/fake-news/33/output), scorer:
raw-count
- [onehot title & text (sparse matrix)](https://www.floydhub.com/wish1104/projects/fake-news/35/output), scorer:
raw-count, L2-normalized
- [onehot title & text (sparse matrix)](https://www.floydhub.com/wish1104/projects/fake-news/38/output), scorer:
tfidf
- [onehot title & text (sparse matrix)](https://www.floydhub.com/wish1104/projects/fake-news/41/output), scorer:
tfidf, L2-normalized
- [naive doc2vec title](https://www.floydhub.com/wish1104/projects/fake-news/19/output), normalizer: {L2, mean, None}
- [naive doc2vec text](https://www.floydhub.com/wish1104/projects/fake-news/20/output), normalizer: {L2, mean, None}
- [doc2vec title](https://www.floydhub.com/wish1104/projects/fake-news/21/output), window_size: 13,
min_count:{5, 25, 50}, strategy: {DM, DBOW}, epochs: 100; all six combinations tried
- [doc2vec text](https://www.floydhub.com/wish1104/projects/fake-news/22/output), window_size: 13,
min_count:{5, 25, 50}, strategy: {DM, DBOW}, epochs: 100; all six combinations tried
- [doc2vec title](https://www.floydhub.com/wish1104/projects/fake-news/88/output), window_size: {13, 23}, min_count: 5,
strategy: DBOW, epochs: {200, 500}; all four combinations tried
- [doc2vec text](https://www.floydhub.com/wish1104/projects/fake-news/88/output), window_size: {13, 23}, min_count: 5,
strategy: DBOW, epochs: {200, 500}; all four combinations tried
## Hypyertuning Logs, Codes, and Stats
The logs, code, and stats from hypertuning all simple models (that is, excluding the ensemble model) can be found [here](https://www.floydhub.com/wish1104/projects/fake-news/jobs).
# Quick Walkthrough (Presentation)
*Below is the final presentation, originally implemented in a Jupyter notebook. To find the original presentation file, run the following command in your terminal*
```bash
git log -- "UCB Final Project.ipynb"
```
*or,*
```bash
git checkout f7e1c41
```
*Alternatively, visit [this link which takes you back in history](https://github.com/Johnny-Wish/fake-news-detection-pipeline/blob/f7e1c41c675d8c43a2d0039bcdf2558cdf6748ec/UCB%20Final%20Project.ipynb).*
## Infrastructure for Embeddings
The classes `DocumentSequence` and `DocumentEmbedder` can be found in the sub-package `doc_utils/`. The different ways of computing embeddings (doc2vec, naive doc2vec, one-hot) and their hyperparameter choices are encapsulated in these files. Below is a snapshot of these classes and their methods.
```python
class DocumentSequence:
    def __init__(self, raw_docs, clean=False, sw=None, punct=None): ...

    # setters (only to be called internally)
    def _set_tokenized(self, clean=False, sw=None, punct=None): ...
    def _set_tagged(self): ...
    def _set_dictionary(self): ...
    def _set_bow(self): ...

    # getters (exposed)
    def get_dictionary(self): ...
    dictionary = property(get_dictionary)  # property field of get_dictionary()
    def get_tokenized(self): ...
    tokenized = property(get_tokenized)  # property field of get_tokenized()
    def get_tagged(self): ...
    tagged = property(get_tagged)  # property field of get_tagged()
    def get_bow(self): ...
    bow = property(get_bow)  # property field of get_bow()
```
```python
class DocumentEmbedder:
    def __init__(self, docs: DocumentSequence, pretrained_word2vec=None): ...

    # setters (only to be called internally)
    def _set_word2vec(self): ...
    def _set_doc2vec(self, vector_size=300, window=5, min_count=5, dm=1, epochs=20): ...
    def _set_naive_doc2vec(self, normalizer='l2'): ...
    def _set_tfidf(self): ...
    def _set_onehot(self, scorer='tfidf'): ...

    # getters (exposed)
    def get_onehot(self, scorer='tfidf'): ...
    onehot = property(get_onehot)  # property field of get_onehot()
    def get_doc2vec(self, vectors_size=300, window=5, min_count=5, dm=1, epochs=20): ...
    doc2vec = property(get_doc2vec)  # property field of get_doc2vec()
    def get_naive_doc2vec(self, normalizer='l2'): ...
    naive_doc2vec = property(get_naive_doc2vec)  # property field of get_naive_doc2vec()
    def get_tfidf_score(self): ...
    tfidf = property(get_tfidf_score)  # property field of get_tfidf_score()
```
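
The setter/getter/property layout in the snapshots above can be read as a lazy-computation pattern: a private setter does the work, and the public getter (also exposed as a property) triggers it on first access. The sketch below illustrates that pattern with a hypothetical `LazyTokens` class; it is not the actual `doc_utils` implementation, and the whitespace tokenizer is only a placeholder.

```python
class LazyTokens:
    """Hypothetical stand-in mirroring the getter/setter layout shown above."""

    def __init__(self, raw_docs):
        self.raw_docs = raw_docs
        self._tokenized = None  # nothing is computed until first access

    def _set_tokenized(self):
        # internal setter: lowercase + whitespace split as a placeholder tokenizer
        self._tokenized = [doc.lower().split() for doc in self.raw_docs]

    def get_tokenized(self):
        # compute once, then reuse the cached result
        if self._tokenized is None:
            self._set_tokenized()
        return self._tokenized

    tokenized = property(get_tokenized)  # attribute-style access to the getter


docs = LazyTokens(["Fake news spreads", "Real news informs"])
first_tokens = docs.tokenized[0]
print(first_tokens)
```

Property access always uses the getter's default arguments, so a getter that takes hyperparameters (such as `get_doc2vec`) has to be called explicitly whenever non-default values are wanted.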
```python
import pandas as pd
from string import punctuation
from nltk.corpus import stopwords
df = pd.read_csv("./fake_or_real_news.csv")
# obtain the raw news texts and titles
raw_text = df['text'].values
raw_title = df['title'].values
df['label'] = df['label'].apply(lambda label: 1 if label == "FAKE" else 0)
# build two instances for preprocessing raw data
from doc_utils import DocumentSequence
texts = DocumentSequence(raw_text, clean=True, sw=stopwords.words('english'), punct=punctuation)
titles = DocumentSequence(raw_title, clean=True, sw=stopwords.words('english'), punct=punctuation)
df.head()
```
| | Unnamed: 0 | title | text | label | title_vectors |
|---|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | 1 | [ 1.1533764e-02 4.2144405e-03 1.9692603e-02 ... |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | 1 | [ 0.11267698 0.02518966 -0.00212591 0.021095... |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | 0 | [ 0.04253004 0.04300297 0.01848392 0.048672... |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | 1 | [ 0.10801624 0.11583211 0.02874823 0.061732... |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | 0 | [ 1.69016439e-02 7.13498285e-03 -7.81233795e-... |
## Embedding Computation
### URLs
- [all computed embeddings and labels](https://www.floydhub.com/wish1104/datasets/fake-news-embeddings/5), see list below
- [onehot title & text (sparse matrix)](https://www.floydhub.com/wish1104/projects/fake-news/33/output), scorer:
raw-count
- [onehot title & text (sparse matrix)](https://www.floydhub.com/wish1104/projects/fake-news/35/output), scorer:
raw-count, L2-normalized
- [onehot title & text (sparse matrix)](https://www.floydhub.com/wish1104/projects/fake-news/38/output), scorer:
tfidf
- [onehot title & text (sparse matrix)](https://www.floydhub.com/wish1104/projects/fake-news/41/output), scorer:
tfidf, L2-normalized
- [naive doc2vec title](https://www.floydhub.com/wish1104/projects/fake-news/19/output), normalizer: {L2, mean, None}
- [naive doc2vec text](https://www.floydhub.com/wish1104/projects/fake-news/20/output), normalizer: {L2, mean, None}
- [doc2vec title](https://www.floydhub.com/wish1104/projects/fake-news/21/output), window_size: 13,
min_count:{5, 25, 50}, strategy: {DM, DBOW}, epochs: 100; all six combinations tried
- [doc2vec text](https://www.floydhub.com/wish1104/projects/fake-news/22/output), window_size: 13,
min_count:{5, 25, 50}, strategy: {DM, DBOW}, epochs: 100; all six combinations tried
- [doc2vec title](https://www.floydhub.com/wish1104/projects/fake-news/88/output), window_size: {13, 23}, min_count: 5,
strategy: DBOW, epochs: {200, 500}; all four combinations tried
- [doc2vec text](https://www.floydhub.com/wish1104/projects/fake-news/88/output), window_size: {13, 23}, min_count: 5,
strategy: DBOW, epochs: {200, 500}; all four combinations tried
```python
import numpy as np
from doc_utils import DocumentEmbedder

try:
    # load pre-computed embeddings from pretrained/ if they are available
    from embedding_utils import EmbeddingLoader
    loader = EmbeddingLoader("pretrained/")
    news_embeddings = loader.get_d2v("concat", vec_size=300, win_size=23, min_count=5, dm=0, epochs=500)
    labels = loader.get_label()
except FileNotFoundError as e:
    print(e)
    print("Cannot find existing embeddings, computing new ones now")
    pretrained = "./pretrained/GoogleNews-vectors-negative300.bin"
    text_embedder = DocumentEmbedder(texts, pretrained_word2vec=pretrained)
    title_embedder = DocumentEmbedder(titles, pretrained_word2vec=pretrained)
    text_embeddings = text_embedder.get_doc2vec(vectors_size=300, window=23, min_count=5, dm=0, epochs=500)
    title_embeddings = title_embedder.get_doc2vec(vectors_size=300, window=23, min_count=5, dm=0, epochs=500)
    # concatenate title vectors and text vectors
    news_embeddings = np.concatenate((title_embeddings, text_embeddings), axis=1)
    labels = df['label'].values
```
## Embedding Visualization
```python
from embedding_utils import visualize_embeddings
# visualize the news embeddings in the graph
# MUST run in command line "tensorboard --logdir visual/" and visit localhost:6006 to see the visualization
visualize_embeddings(embedding_values=news_embeddings, label_values=labels, texts = raw_title)
```
```python
print("visit http://localhost:6006 to see the result")
# ATTENTION: This cell must be manually stopped
```
visit http://localhost:6006 to see the result
Some screenshots of TensorBoard are shown below. We visualize the document embeddings with T-SNE projections onto 3D and 2D spaces. Each red data point indicates a piece of FAKE news, and each blue one indicates a piece of real news. As the visualization shows, the two categories are well separated.
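
For readers without TensorBoard at hand, a comparable projection can be sketched with scikit-learn's `TSNE`. This is only an illustration under stated assumptions: `toy_embeddings` and `toy_labels` are synthetic stand-ins for `news_embeddings` and `labels`, and the perplexity value is an arbitrary choice, not a setting used in the project.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(58)

# Synthetic stand-in for news_embeddings: two Gaussian blobs in 50 dimensions
fake_like = rng.normal(loc=1.0, scale=0.5, size=(30, 50))
real_like = rng.normal(loc=-1.0, scale=0.5, size=(30, 50))
toy_embeddings = np.vstack([fake_like, real_like])
toy_labels = np.array([1] * 30 + [0] * 30)  # 1 = fake, 0 = real

# perplexity must be smaller than the number of samples (60 here)
tsne = TSNE(n_components=2, perplexity=10, random_state=58)
projection = tsne.fit_transform(toy_embeddings)
print(projection.shape)
```

Setting `n_components=3` gives the 3D variant; the resulting coordinates can then be scatter-plotted and colored by `toy_labels`.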
### 2D T-SNE
red for fake ones, blue for real ones
![jpg](resources/T-SNE-2D.jpg)
### 3D T-SNE
red for fake ones, blue for real ones
![jpg](resources/T-SNE-3D.jpg)
### Visualizing Bigram Statistics
```python
import itertools
import nltk
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
## Get tokenized words of fake news and real news independently
real_text = df[df['label'] == 0]['text'].values
fake_text = df[df['label'] == 1]['text'].values
sw = [word for word in stopwords.words("english")] + ["``", "“"]
other_puncts = u'.,;《》?!“”‘’@#¥%…&×()——+【】{};;●,。&~、|\s::````'
punct = punctuation + other_puncts
fake_words = DocumentSequence(fake_text, clean=True, sw=sw, punct=punct)
real_words = DocumentSequence(real_text, clean=True, sw=sw, punct=punct)
## Get cleaned text using chain
real_words_all = list(itertools.chain(*real_words.get_tokenized()))
fake_words_all = list(itertools.chain(*fake_words.get_tokenized()))
## Drawing histogram
def plot_most_common_words(num_to_show, words_list, title=""):
    # count adjacent word pairs and plot the most frequent ones as horizontal bars
    bigrams = nltk.bigrams(words_list)
    counter = Counter(bigrams)
    labels = [" ".join(e[0]) for e in counter.most_common(num_to_show)]
    values = [e[1] for e in counter.most_common(num_to_show)]
    indexes = np.arange(len(labels))
    width = 1
    plt.title(title)
    plt.barh(indexes, values, width)
    plt.yticks(indexes + width * 0.2, labels)
    plt.show()
```
```python
plot_most_common_words(20, fake_words_all, "Fake News Most Frequent words")
plot_most_common_words(20, real_words_all, "Real News Most Frequent words")
```
![png](resources/output_15_0.png)
![png](resources/output_16_0.png)
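
The counting step behind these plots can be illustrated on a tiny token list. The sketch below uses `zip` in place of `nltk.bigrams` (both yield adjacent token pairs), and the `tokens` list is invented for the example.

```python
from collections import Counter

# Hypothetical cleaned tokens from a handful of documents
tokens = ["white", "house", "said", "white", "house", "press", "white", "house"]

# Pair each token with its successor, then count the pairs
bigram_counts = Counter(zip(tokens, tokens[1:]))
top_bigram, top_count = bigram_counts.most_common(1)[0]
print(" ".join(top_bigram), top_count)
```

Note that chaining all documents into one flat list, as done above with `itertools.chain`, also counts pairs that straddle document boundaries; for a large corpus this has little effect on the top entries.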
## Binary Classification
### Train-Val-Test Split
(with 75% of data for 5-fold Random CV, 25% for testing)
```python
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.model_selection._search import BaseSearchCV
import pickle as pkl
seed = 58
# perform the split which gets us the train data and the test data
news_train, news_test, labels_train, labels_test = train_test_split(news_embeddings, labels,
                                                                    test_size=0.25,
                                                                    random_state=seed,
                                                                    stratify=labels)
```
### Hypertuned Classifiers
We ran a randomized search (`RandomizedSearchCV`) on the different datasets to find the best hyperparameters.
The classifiers imported here are configured with the near-optimal parameters found in our experiments.
The search process itself is omitted.
```python
from model.hypyertuned_models import mlp, knn, qda, gdb, svc, gnb, rf, lg
from model.hypyertuned_models import classifiers as classifiers_list
```
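
Since the search itself is omitted, here is a rough sketch of what one such run could look like, following the setup stated earlier (75%-25% split, 5-fold CV, `random_state=58`). The synthetic data, the choice of `LogisticRegression`, and the candidate values for `C` are assumptions for illustration only; the actual search spaces are in the hypertuning logs.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

seed = 58

# Synthetic stand-in for the news embeddings and labels
X, y = make_classification(n_samples=200, n_features=20, random_state=seed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=seed, stratify=y)

search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_distributions={"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]},  # hypothetical search space
    n_iter=5,            # number of sampled parameter settings
    cv=5,                # 5-fold cross-validation on the training split
    random_state=seed,
)
search.fit(X_train, y_train)

best_c = search.best_params_["C"]
test_accuracy = search.score(X_test, y_test)  # accuracy of the refit best model
print(best_c, test_accuracy)
```

The held-out 25% is touched only once, after the search has picked its parameters, so the test score is not used for model selection.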
The cell below fits each tuned classifier on the training split and prints its classification report on the test split.
```python
from sklearn.metrics import classification_report
# print details of testing results
for model in classifiers_list:
    model.fit(news_train, labels_train)
    labels_pred = model.predict(news_test)
    # Report the metrics
    target_names = ['Real', 'Fake']
    print(model.__class__.__name__)
    print(classification_report(y_true=labels_test, y_pred=labels_pred, target_names=target_names, digits=3))
```
| Model | Class | precision | recall | f1-score | support |
|---|---|---|---|---|---|
| MLPClassifier | Real | 0.956 | 0.950 | 0.953 | 793 |
| MLPClassifier | Fake | 0.950 | 0.956 | 0.953 | 791 |
| MLPClassifier | avg / total | 0.953 | 0.953 | 0.953 | 1584 |
| KNeighborsClassifier | Real | 0.849 | 0.905 | 0.876 | 793 |
| KNeighborsClassifier | Fake | 0.898 | 0.838 | 0.867 | 791 |
| KNeighborsClassifier | avg / total | 0.874 | 0.872 | 0.872 | 1584 |
| QuadraticDiscriminantAnalysis | Real | 0.784 | 0.995 | 0.877 | 793 |
| QuadraticDiscriminantAnalysis | Fake | 0.993 | 0.726 | 0.839 | 791 |
| QuadraticDiscriminantAnalysis | avg / total | 0.889 | 0.860 | 0.858 | 1584 |
| GradientBoostingClassifier | Real | 0.921 | 0.868 | 0.894 | 793 |
| GradientBoostingClassifier | Fake | 0.875 | 0.925 | 0.899 | 791 |
| GradientBoostingClassifier | avg / total | 0.898 | 0.896 | 0.896 | 1584 |
| SVC | Real | 0.944 | 0.939 | 0.942 | 793 |
| SVC | Fake | 0.940 | 0.944 | 0.942 | 791 |
| SVC | avg / total | 0.942 | 0.942 | 0.942 | 1584 |
| GaussianNB | Real | 0.848 | 0.793 | 0.820 | 793 |
| GaussianNB | Fake | 0.805 | 0.857 | 0.830 | 791 |
| GaussianNB | avg / total | 0.826 | 0.825 | 0.825 | 1584 |
| RandomForestClassifier | Real | 0.868 | 0.805 | 0.835 | 793 |
| RandomForestClassifier | Fake | 0.817 | 0.877 | 0.846 | 791 |
| RandomForestClassifier | avg / total | 0.843 | 0.841 | 0.841 | 1584 |
| LogisticRegression | Real | 0.921 | 0.929 | 0.925 | 793 |
| LogisticRegression | Fake | 0.929 | 0.920 | 0.924 | 791 |
| LogisticRegression | avg / total | 0.925 | 0.925 | 0.925 | 1584 |
### Histogram of CV/Test Scores
![jpg](resources/models_with_best_performance_updated2.jpg)
### TF-IDF
Converting the gensim TF-IDF output into a SciPy sparse matrix:
```python
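from scipy.sparse import csr_matrix

def bow2sparse(tfidf, corpus):
    # each tfidf[line] is a list of (token_id, weight) pairs for one document
    rows = [index for index, line in enumerate(corpus) for _ in tfidf[line]]
    cols = [elem[0] for line in corpus for elem in tfidf[line]]
    data = [elem[1] for line in corpus for elem in tfidf[line]]
    # build a (documents x tokens) matrix in coordinate form
    return csr_matrix((data, (rows, cols)))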
```
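
To make the row/column bookkeeping in `bow2sparse` concrete, here is a toy run of the same construction using hand-written `(token_id, weight)` pairs instead of a gensim model. The `weighted_docs` values are invented for illustration.

```python
from scipy.sparse import csr_matrix

# Hypothetical per-document weights in (token_id, weight) form
weighted_docs = [
    [(0, 0.6), (2, 0.8)],  # document 0 contains tokens 0 and 2
    [(1, 1.0)],            # document 1 contains token 1 only
]

# One entry per (document, token) pair, in matching order
rows = [doc_idx for doc_idx, doc in enumerate(weighted_docs) for _ in doc]
cols = [token_id for doc in weighted_docs for token_id, _ in doc]
data = [weight for doc in weighted_docs for _, weight in doc]

toy_matrix = csr_matrix((data, (rows, cols)))
print(toy_matrix.shape)
print(toy_matrix.toarray())
```

Because no `shape` argument is passed, the matrix dimensions are inferred from the largest row and column index, so a token id that never appears with a nonzero weight at the end of the vocabulary will not get a column.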
```python
from gensim import corpora, models
from scipy.sparse import csr_matrix
tfidf = models.TfidfModel(texts.get_bow())
tfidf_matrix = bow2sparse(tfidf, texts.get_bow())
## split the data
news_train, news_test, labels_train, labels_test = train_test_split(tfidf_matrix,
                                                                    labels,
                                                                    test_size=0.25,
                                                                    random_state=seed)
```
dictionary is not set for
, setting dictionary automatically
```python
from ... ...