fake-news-detection-pipeline

Category: Embedded systems / Microcontrollers / Hardware programming
Development tool: Python
File size: 0KB
Downloads: 0
Upload date: 2018-11-09 02:25:15
Uploader: sh-1993
Description: Pipeline for detecting fake news, covering data ingestion, doc embedding, classifier hypertuning & model ensembling. Quick walkthrough available in README. Execution logs on my FloydHub page.

File list:
LICENSE (11357, 2018-11-08)
cross_validate.py (780, 2018-11-08)
doc_utils/ (0, 2018-11-08)
doc_utils/__init__.py (218, 2018-11-08)
doc_utils/document_embedder.py (9554, 2018-11-08)
doc_utils/document_sequence.py (4304, 2018-11-08)
download_from_google_drive.py (1374, 2018-11-08)
embedding_utils/ (0, 2018-11-08)
embedding_utils/__init__.py (270, 2018-11-08)
embedding_utils/embedding_getter.py (5046, 2018-11-08)
embedding_utils/embedding_loader.py (6076, 2018-11-08)
embedding_utils/embedding_visualizer.py (4571, 2018-11-08)
fake_or_real_news.csv (58064793, 2018-11-08)
floyd_requirements.txt (20, 2018-11-08)
model/ (0, 2018-11-08)
model/__init__.py (632, 2018-11-08)
model/__main__.py (7629, 2018-11-08)
model/ensemble_learning.py (3548, 2018-11-08)
model/gaussian_nb.py (202, 2018-11-08)
model/gradient_boosting.py (608, 2018-11-08)
model/hypertuned_models.py (3785, 2018-11-08)
model/input_specific_model.py (1477, 2018-11-08)
model/knn.py (246, 2018-11-08)
model/logistic_regression.py (334, 2018-11-08)
model/mlp_adam.py (591, 2018-11-08)
model/mlp_config.py (498, 2018-11-08)
model/mlp_lbfgs.py (471, 2018-11-08)
model/mlp_sgd.py (720, 2018-11-08)
model/qda.py (254, 2018-11-08)
model/random_forest.py (482, 2018-11-08)
model/svc.py (348, 2018-11-08)
pretrained/ (0, 2018-11-08)
pretrained/labels.pkl (50839, 2018-11-08)
resources/ (0, 2018-11-08)
resources/Ensemble-Voter.png (32693, 2018-11-08)
resources/GEC Group Presentation.jpg (668818, 2018-11-08)
resources/T-SNE-2D.jpg (376711, 2018-11-08)
... ...

## migrated from [this repo](https://github.com/Johnny-Wish/fake-news-group2-project) of mine

# Fake News Detection Pipeline

## Collaborators Shuheng Liu, Qiaoyi Yin, Yuyuan Fang

Group project materials for fake news detection at Hollis Lab, GEC Academy

## Project Plan

![a](resources/GEC%20Group%20Presentation.jpg)

# Table of Contents

- [Fake News Detection Pipeline](#fake-news-detection-pipeline)
  * [Collaborators Shuheng Liu, Qiaoyi Yin, Yuyuan Fang ](#collaborators-shuheng-liu-qiaoyi-yin-yuyuan-fang)
  * [Project Plan](#project-plan)
- [Notice for Collaborators](#notice-for-collaborators)
  * [Doing Train-Test Split](#doing-train-test-split)
  * [Directory to Push Models](#directory-to-push-models)
- [Downloadables](#downloadables)
  * [URL for Different Embeddings Precomputed on Cloud](#url-for-different-embeddings-precomputed-on-cloud)
  * [Hypertuning Logs, Codes, and Stats](#hypertuning-logs-codes-and-stats)
- [Quick Walkthrough (Presentation)](#quick-walkthrough-presentation)
  * [Infrastructure for Embeddings](#infrastructure-for-embeddings)
  * [Embedding Computation](#embedding-computation)
    + [URLs](#urls)
  * [Embedding Visualization](#embedding-visualization)
    + [2D T-SNE](#2d-t-sne)
    + [3D T-SNE](#3d-t-sne)
    + [Visualizing Bigram Statistics](#visualizing-bigram-statistics)
  * [Binary Classification](#binary-classification)
    + [Train-Val-Test Split](#train-val-test-split)
    + [Hypertuned Classifiers](#hypertuned-classifiers)
    + [Histogram of CV/Test Scores](#histogram-of-cvtest-scores)
    + [TF-IDF](#tf-idf)
    + [Feature Ranking with Logistic Coefficients](#feature-ranking-with-logistic-coefficients)
    + [Ensemble Learning](#ensemble-learning)

# Notice for Collaborators

## Doing Train-Test Split

Specifying `random_state` in `sklearn.model_selection.train_test_split()` ensures the same split across different datasets (of the same length) and across different machines (see this [link](https://stackoverflow.com/questions/43095076/scikit-learn-train-test-split-can-i-ensure-same-splits-on-different-datasets)).

For the purposes of this project, we use `random_state=58` for every split.

While grid/random searching for the best set of hyperparameters, a 75%-25% train-test split is used. 5-fold cross-validation is run in the training phase on the 75% of samples reserved for training.

## Directory to Push Models

There is a `model/` directory nested under the project. Please name your model `model_name.py` and place it under the `model/` directory (e.g. `model/KNN.py`) before pushing to this repo.

# Downloadables

Before trying to reproduce our results, note that pre-computed embeddings can be downloaded from the URLs below. Downloading them and storing them in the `pretrained/` folder under this repository will save a lot of time.

## URL for Different Embeddings Precomputed on Cloud

- [all computed embeddings and labels](https://www.floydhub.com/wish1104/datasets/fake-news-embeddings/5), see list below
- [onehot title & text (sparse matrix)](https://www.floydhub.com/wish1104/projects/fake-news/33/output), scorer: raw-count
- [onehot title & text (sparse matrix)](https://www.floydhub.com/wish1104/projects/fake-news/35/output), scorer: raw-count, L2-normalized
- [onehot title & text (sparse matrix)](https://www.floydhub.com/wish1104/projects/fake-news/38/output), scorer: tfidf
- [onehot title & text (sparse matrix)](https://www.floydhub.com/wish1104/projects/fake-news/41/output), scorer: tfidf, L2-normalized
- [naive doc2vec title](https://www.floydhub.com/wish1104/projects/fake-news/19/output), normalizer: {L2, mean, None}
- [naive doc2vec text](https://www.floydhub.com/wish1104/projects/fake-news/20/output), normalizer: {L2, mean, None}
- [doc2vec title](https://www.floydhub.com/wish1104/projects/fake-news/21/output), window_size: 13, min_count: {5, 25, 50}, strategy: {DM, DBOW}, epochs: 100; all six combinations tried
- [doc2vec text](https://www.floydhub.com/wish1104/projects/fake-news/22/output), window_size: 13, min_count: {5, 25, 50}, strategy: {DM, DBOW}, epochs: 100; all six combinations tried
- [doc2vec title](https://www.floydhub.com/wish1104/projects/fake-news/88/output), window_size: {13, 23}, min_count: 5, strategy: DBOW, epochs: {200, 500}; all four combinations tried
- [doc2vec text](https://www.floydhub.com/wish1104/projects/fake-news/88/output), window_size: {13, 23}, min_count: 5, strategy: DBOW, epochs: {200, 500}; all four combinations tried

## Hypertuning Logs, Codes, and Stats

The logs, codes, and stats from hypertuning all simple models (that is, excluding the Ensemble model) can be found [here](https://www.floydhub.com/wish1104/projects/fake-news/jobs).

# Quick Walkthrough (Presentation)

*Below is the final presentation, originally written as a Jupyter notebook. To locate the original presentation file, run the following command in your terminal*

```bash
git log -- "UCB Final Project.ipynb"
```

*or,*

```bash
git checkout f7e1c41
```

*Alternatively, visit [this link which takes you back in history](https://github.com/Johnny-Wish/fake-news-detection-pipeline/blob/f7e1c41c675d8c43a2d0039bcdf2558cdf6748ec/UCB%20Final%20Project.ipynb).*

## Infrastructure for Embeddings

The classes `DocumentSequence` and `DocumentEmbedder` can be found in the sub-package `doc_utils/`. The different ways of computing embeddings (doc2vec, naive doc2vec, one-hot) and their hyperparameter choices are encapsulated in these files. Below is a snapshot of these classes and their methods.

```python
class DocumentSequence:
    def __init__(self, raw_docs, clean=False, sw=None, punct=None): ...

    # setters (only to be called internally)
    def _set_tokenized(self, clean=False, sw=None, punct=None): ...

    def _set_tagged(self): ...

    def _set_dictionary(self): ...

    def _set_bow(self): ...

    # getters (exposed)
    def get_dictionary(self): ...

    dictionary = property(get_dictionary)  # property field of get_dictionary()

    def get_tokenized(self): ...

    tokenized = property(get_tokenized)  # property field of get_tokenized()

    def get_tagged(self): ...

    tagged = property(get_tagged)  # property field of get_tagged()

    def get_bow(self): ...

    bow = property(get_bow)  # property field of get_bow()
```

```python
class DocumentEmbedder:
    def __init__(self, docs: DocumentSequence, pretrained_word2vec=None): ...

    # setters (only to be called internally)
    def _set_word2vec(self): ...

    def _set_doc2vec(self, vector_size=300, window=5, min_count=5, dm=1, epochs=20): ...

    def _set_naive_doc2vec(self, normalizer='l2'): ...

    def _set_tfidf(self): ...

    def _set_onehot(self, scorer='tfidf'): ...

    # getters (exposed)
    def get_onehot(self, scorer='tfidf'): ...

    onehot = property(get_onehot)  # property field of get_onehot()

    def get_doc2vec(self, vectors_size=300, window=5, min_count=5, dm=1, epochs=20): ...

    doc2vec = property(get_doc2vec)  # property field of get_doc2vec()

    def get_naive_doc2vec(self, normalizer='l2'): ...

    naive_doc2vec = property(get_naive_doc2vec)  # property field of get_naive_doc2vec()

    def get_tfidf_score(self): ...

    tfidf = property(get_tfidf_score)  # property field of get_tfidf_score()
```

```python
import pandas as pd
from string import punctuation
from nltk.corpus import stopwords

df = pd.read_csv("./fake_or_real_news.csv")

# obtain the raw news texts and titles
raw_text = df['text'].values
raw_title = df['title'].values

# encode the labels: 1 for FAKE, 0 for real
df['label'] = df['label'].apply(lambda label: 1 if label == "FAKE" else 0)

# build two instances for preprocessing raw data
from doc_utils import DocumentSequence

texts = DocumentSequence(raw_text, clean=True, sw=stopwords.words('english'), punct=punctuation)
titles = DocumentSequence(raw_title, clean=True, sw=stopwords.words('english'), punct=punctuation)

df.head()
```

|   | Unnamed: 0 | title | text | label | title_vectors |
|---|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | 1 | [ 1.1533764e-02 4.2144405e-03 1.9692603e-02 ... |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | 1 | [ 0.11267698 0.02518966 -0.00212591 0.021095... |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | 0 | [ 0.04253004 0.04300297 0.01848392 0.048672... |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | 1 | [ 0.10801624 0.11583211 0.02874823 0.061732... |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | 0 | [ 1.69016439e-02 7.13498285e-03 -7.81233795e-... |

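
The class snapshots above pair every getter with a `property`, and the underscore-prefixed setters are meant to be called only internally. How the real classes cache their results is not shown, so the following is only a minimal sketch, assuming each getter computes its value on first access and reuses it afterwards. `LazyTokens` and its whitespace tokenizer are hypothetical stand-ins, not the actual `DocumentSequence`.

```python
class LazyTokens:
    """Hypothetical illustration of the getter/property pattern (not the real DocumentSequence)."""

    def __init__(self, raw_docs):
        self.raw_docs = list(raw_docs)
        self._tokenized = None  # nothing is computed until first access

    def _set_tokenized(self):
        # Assumption: naive lowercase + whitespace tokenization, for illustration only
        self._tokenized = [doc.lower().split() for doc in self.raw_docs]

    def get_tokenized(self):
        # Compute once, then reuse the cached result
        if self._tokenized is None:
            self._set_tokenized()
        return self._tokenized

    tokenized = property(get_tokenized)  # attribute-style access to get_tokenized()


docs = LazyTokens(["You Can Smell Fear", "Kerry to go to Paris"])
first = docs.tokenized          # triggers _set_tokenized()
second = docs.get_tokenized()   # returns the cached list

print(first[0])
print(first is second)
```

Exposing both `get_tokenized()` and `tokenized` lets callers pick the method form when they need to pass arguments and the attribute form when the defaults are enough.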

## Embedding Computation

### URLs

The precomputed embeddings are the ones listed under [URL for Different Embeddings Precomputed on Cloud](#url-for-different-embeddings-precomputed-on-cloud).

```python
import numpy as np
from doc_utils import DocumentEmbedder

try:
    # load precomputed embeddings if they are available locally
    from embedding_utils import EmbeddingLoader

    loader = EmbeddingLoader("pretrained/")
    news_embeddings = loader.get_d2v("concat", vec_size=300, win_size=23, min_count=5, dm=0, epochs=500)
    labels = loader.get_label()
except FileNotFoundError as e:
    print(e)
    print("Cannot find existing embeddings, computing new ones now")

    pretrained = "./pretrained/GoogleNews-vectors-negative300.bin"
    text_embedder = DocumentEmbedder(texts, pretrained_word2vec=pretrained)
    title_embedder = DocumentEmbedder(titles, pretrained_word2vec=pretrained)

    text_embeddings = text_embedder.get_doc2vec(vectors_size=300, window=23, min_count=5, dm=0, epochs=500)
    title_embeddings = title_embedder.get_doc2vec(vectors_size=300, window=23, min_count=5, dm=0, epochs=500)

    # concatenate title vectors and text vectors
    news_embeddings = np.concatenate((title_embeddings, text_embeddings), axis=1)
    labels = df['label'].values
```

## Embedding Visualization

```python
from embedding_utils import visualize_embeddings

# visualize the news embeddings in the graph
# MUST run in command line "tensorboard --logdir visual/" and visit localhost:6006 to see the visualization
visualize_embeddings(embedding_values=news_embeddings, label_values=labels, texts=raw_title)
```

```python
print("visit http://localhost:6006 to see the result")
# ATTENTION: This cell must be manually stopped
```

    visit http://localhost:6006 to see the result

Some TensorBoard screenshots are shown below. We visualize the document embeddings with T-SNE projections onto 3D and 2D spaces. Each red data point is a piece of fake news, and each blue one is a piece of real news. The two categories are visibly well separated.
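
The screenshots come from TensorBoard's projector. As a rough offline alternative (a swapped-in technique, not what the notebook ran), scikit-learn's `TSNE` can project embeddings to 2D. The sketch below uses synthetic 50-dimensional vectors in place of the real document embeddings; the cluster centres, sizes, and variable names are all assumptions made for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(58)

# Synthetic stand-ins for document embeddings: two loose clusters (hypothetical data)
real_like = rng.normal(loc=0.0, scale=1.0, size=(30, 50))
fake_like = rng.normal(loc=3.0, scale=1.0, size=(30, 50))
embeddings = np.vstack([real_like, fake_like])
labels = np.array([0] * 30 + [1] * 30)  # 0 = real, 1 = fake, as in the walkthrough

# perplexity must be smaller than the number of samples
projected = TSNE(n_components=2, perplexity=10, random_state=58).fit_transform(embeddings)

print(projected.shape)
```

Colouring a scatter plot of `projected` by `labels` (red for fake, blue for real) would mirror the colour scheme of the screenshots.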

### 2D T-SNE

red for fake ones, blue for real ones

![jpg](resources/T-SNE-2D.jpg)

### 3D T-SNE

red for fake ones, blue for real ones

![jpg](resources/T-SNE-3D.jpg)

### Visualizing Bigram Statistics

```python
import itertools
import nltk
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

# Get tokenized words of fake news and real news independently
real_text = df[df['label'] == 0]['text'].values
fake_text = df[df['label'] == 1]['text'].values

sw = [word for word in stopwords.words("english")] + ["``", "“"]
other_puncts = u'.,;《》?!“”‘’@#¥%…&×()——+【】{};;●,。&~、|\s::````'
punct = punctuation + other_puncts

# label 0 is real news, label 1 is fake news
real_words = DocumentSequence(real_text, clean=True, sw=sw, punct=punct)
fake_words = DocumentSequence(fake_text, clean=True, sw=sw, punct=punct)

# Get cleaned text using chain
real_words_all = list(itertools.chain(*real_words.get_tokenized()))
fake_words_all = list(itertools.chain(*fake_words.get_tokenized()))


# Drawing histogram of the most common bigrams
def plot_most_common_words(num_to_show, words_list, title=""):
    bigrams = nltk.bigrams(words_list)
    counter = Counter(bigrams)
    labels = [" ".join(e[0]) for e in counter.most_common(num_to_show)]
    values = [e[1] for e in counter.most_common(num_to_show)]

    indexes = np.arange(len(labels))
    width = 1

    plt.title(title)
    plt.barh(indexes, values, width)
    plt.yticks(indexes + width * 0.2, labels)
    plt.show()
```

```python
plot_most_common_words(20, fake_words_all, "Fake News Most Frequent words")
plot_most_common_words(20, real_words_all, "Real News Most Frequent words")
```

![png](resources/output_15_0.png)

![png](resources/output_16_0.png)

## Binary Classification

### Train-Val-Test Split

(with 75% of data for 5-fold Random CV, 25% for testing)

```python
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.model_selection._search import BaseSearchCV
import pickle as pkl

seed = 58

# perform the split which gets us the train data and the test data
news_train, news_test, labels_train, labels_test = train_test_split(news_embeddings, labels,
                                                                    test_size=0.25,
                                                                    random_state=seed,
                                                                    stratify=labels)
```

### Hypertuned Classifiers

We ran a randomized search on different datasets to find the best hyperparameters. The classifiers below use the near-optimal parameters found in our experiments. The search process itself is omitted.

```python
from model.hypertuned_models import mlp, knn, qda, gdb, svc, gnb, rf, lg
from model.hypertuned_models import classifiers as classifiers_list
```

We list the best-performing hyperparameters in the following chart.

```python
from sklearn.metrics import classification_report

# print details of testing results
for model in classifiers_list:
    model.fit(news_train, labels_train)
    labels_pred = model.predict(news_test)

    # Report the metrics
    target_names = ['Real', 'Fake']
    print(model.__class__.__name__)
    print(classification_report(y_true=labels_test, y_pred=labels_pred, target_names=target_names, digits=3))
```

    MLPClassifier
                 precision    recall  f1-score   support

           Real      0.956     0.950     0.953       793
           Fake      0.950     0.956     0.953       791

    avg / total      0.953     0.953     0.953      1584

    KNeighborsClassifier
                 precision    recall  f1-score   support

           Real      0.849     0.905     0.876       793
           Fake      0.898     0.838     0.867       791

    avg / total      0.874     0.872     0.872      1584

    QuadraticDiscriminantAnalysis
                 precision    recall  f1-score   support

           Real      0.784     0.995     0.877       793
           Fake      0.993     0.726     0.839       791

    avg / total      0.889     0.860     0.858      1584

    GradientBoostingClassifier
                 precision    recall  f1-score   support

           Real      0.921     0.868     0.894       793
           Fake      0.875     0.925     0.899       791

    avg / total      0.898     0.896     0.896      1584

    SVC
                 precision    recall  f1-score   support

           Real      0.944     0.939     0.942       793
           Fake      0.940     0.944     0.942       791

    avg / total      0.942     0.942     0.942      1584

    GaussianNB
                 precision    recall  f1-score   support

           Real      0.848     0.793     0.820       793
           Fake      0.805     0.857     0.830       791

    avg / total      0.826     0.825     0.825      1584

    RandomForestClassifier
                 precision    recall  f1-score   support

           Real      0.868     0.805     0.835       793
           Fake      0.817     0.877     0.846       791

    avg / total      0.843     0.841     0.841      1584

    LogisticRegression
                 precision    recall  f1-score   support

           Real      0.921     0.929     0.925       793
           Fake      0.929     0.920     0.924       791

    avg / total      0.925     0.925     0.925      1584

### Histogram of CV/Test Scores

![jpg](resources/models_with_best_performance_updated2.jpg)

### TF-IDF

Getting sparse matrix

```python
from scipy.sparse import csr_matrix


def bow2sparse(tfidf, corpus):
    # one (row, col, value) triple per non-zero TF-IDF entry
    rows = [index for index, line in enumerate(corpus) for _ in tfidf[line]]
    cols = [elem[0] for line in corpus for elem in tfidf[line]]
    data = [elem[1] for line in corpus for elem in tfidf[line]]
    return csr_matrix((data, (rows, cols)))
```

```python
from gensim import corpora, models
from scipy.sparse import csr_matrix

tfidf = models.TfidfModel(texts.get_bow())
tfidf_matrix = bow2sparse(tfidf, texts.get_bow())

# split the data
news_train, news_test, labels_train, labels_test = train_test_split(tfidf_matrix, labels, test_size=0.25, random_state=seed)
```

    dictionary is not set for , setting dictionary automatically

```python
from ... ...
```
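
Both splits in this walkthrough reuse `seed = 58`. The sketch below, on small made-up arrays, illustrates why that matters: with the same number of rows, the same `test_size`, and the same `random_state`, `train_test_split` selects the same row positions regardless of what the features contain. The array names and sizes here are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

seed = 58
n_samples = 20

# Two unrelated feature matrices with the same number of rows (hypothetical data)
dense_features = np.arange(n_samples * 3).reshape(n_samples, 3)
other_features = np.arange(n_samples * 2).reshape(n_samples, 2) * 10
row_ids = np.arange(n_samples)  # lets us see which rows land in each part

_, _, ids_a_train, ids_a_test = train_test_split(
    dense_features, row_ids, test_size=0.25, random_state=seed)
_, _, ids_b_train, ids_b_test = train_test_split(
    other_features, row_ids, test_size=0.25, random_state=seed)

# Same seed and same length give the same partition of row indices
same_split = np.array_equal(ids_a_test, ids_b_test) and np.array_equal(ids_a_train, ids_b_train)
print(same_split)
print(len(ids_a_test))
```

Note that passing `stratify` to only one of the two calls would generally produce a different partition, since stratified splitting draws rows per class.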
