nlp-fake-news

Category: Clustering algorithms
Development tool: Jupyter Notebook
File size: 1137KB
Downloads: 0
Upload date: 2023-05-17 11:16:49
Uploader: sh-1993
Description: Fake News Detection using Support Vector Machine (SVM) Classifier

File list:
.DS_Store (6148, 2023-05-17)
.ipynb_checkpoints (0, 2023-05-17)
.ipynb_checkpoints\NLP_Assignment_1-checkpoint.ipynb (14357, 2023-05-17)
.ipynb_checkpoints\NLP_Assignment_1-notebook2-checkpoint.ipynb (525324, 2023-05-17)
LICENSE (1066, 2023-05-17)
bin (0, 2023-05-17)
config (0, 2023-05-17)
coverage.svg (903, 2023-05-17)
data (0, 2023-05-17)
data\error_analysis_report.csv (46762, 2023-05-17)
data\fake_news.tsv (2408372, 2023-05-17)
makefile (482, 2023-05-17)
notebooks (0, 2023-05-17)
notebooks\.DS_Store (6148, 2023-05-17)
notebooks\fake-news-1.ipynb (524467, 2023-05-17)
notebooks\fake-news-2.ipynb (520008, 2023-05-17)
requirements doc.pdf (97352, 2023-05-17)
requirements.in (32, 2023-05-17)
requirements_dev.in (95, 2023-05-17)
src (0, 2023-05-17)
src\main.py (207, 2023-05-17)
tests (0, 2023-05-17)
tests\test_main.py (84, 2023-05-17)

# Fake News Detection using Support Vector Machine (SVM) Classifier

[![code coverage](https://github.com/SaySohail/nlp-fake-news/blob/master/coverage.svg "Code coverage")](https://github.com/SaySohail/nlp-fake-news/blob/master/)
---

## About

The pervasive problem of fake news on online platforms necessitates the development of automatic detection methods. In this coursework, I will build and evaluate a text classifier using the Support Vector Machine (SVM) algorithm to detect fake news. The dataset provided contains over 10,000 statements on current affairs, annotated with labels ranging from 'true' to 'pants on fire' to represent different degrees of fakeness. For the purpose of this project, I will simplify the task to a binary classification problem, distinguishing between 'real' and 'fake' news.

### Simple Data Input and Pre-processing

* Implement the parse_data_line function to extract the label and statement column values from a line of the tab-separated text file (see the sketch after these task descriptions).
* Implement the pre_process function to tokenize the text into a list of words.
* Convert the labels to binary values, where 'REAL' is represented as 1 and 'FAKE' as 0.

### Simple Feature Extraction

* Implement the to_feature_vector function to create a feature vector from a preprocessed text (sketched below).
* Build a global_feature_dict to keep track of all the tokens/feature names present in the dataset.
* Experiment with different feature weighting schemes such as binary values, term frequency, or term frequency-inverse document frequency (TF-IDF).

### Cross-validation on Training Data

* Complete the implementation of the cross_validate function to perform a 10-fold cross-validation on the training data (sketched below).
* Utilize the train_classifier and predict_labels functions to train the SVM classifier and predict labels for validation.
* Calculate evaluation metrics such as precision, recall, F1-score, and accuracy for each fold.
* Compute the average performance metrics and store them in the cv_results variable.

### Error Analysis

* Examine the performance of the classifier using a confusion matrix to identify false positives and false negatives (sketched below).
* Conduct an error analysis on a simple train-test split of the training data.
* Print or record the false positives and false negatives for the 'FAKE' label to investigate why the classifier is misclassifying these instances.
* Provide observations and examples in the report to explain the classifier's confusion.

### Optimizing Pre-processing and Feature Extraction

* Explore various methods to improve the pre-processing stage, considering token filtering, punctuation handling, normalization, lemmatization, and stop-word removal.
* Experiment with different feature types beyond unigram tokens, such as word combinations or character-level features.
* Explore alternative feature weighting schemes and stylistic features such as the number of words per sentence.
* Adjust SVM parameters, such as the cost parameter or per-class weighting.
* Perform feature selection techniques to limit the number of features.
* Utilize external resources such as publicly available lists to enhance the feature set.
* Document the methods attempted and report their impact on the classifier's performance.

### Using Other Metadata in the File

* Modify the load_data function to include additional metadata features from the data file.
* Experiment with incorporating these features into the classification model to optimize performance.
* Record the improvements achieved by different features in a results table in the report.
* Clearly document the exploration process and describe the impact of each feature in the notebook.
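As an illustration of the data-input step, a minimal sketch of what parse_data_line, pre_process, and the label conversion could look like. The column indices and the regular-expression tokenizer are assumptions for illustration, not taken from the repository's notebooks.

```python
import re

def parse_data_line(data_line):
    # Split one tab-separated line into fields; the column positions used here
    # (label in column 1, statement in column 2) are an assumption, not the
    # actual layout of fake_news.tsv.
    fields = data_line.rstrip("\n").split("\t")
    return fields[1], fields[2]

def pre_process(text):
    # Lower-case the statement and split it into simple word tokens.
    return re.findall(r"[a-z0-9']+", text.lower())

def convert_label(label):
    # Map the binary label to an integer: 'REAL' -> 1, anything else ('FAKE') -> 0.
    return 1 if label.upper() == "REAL" else 0
```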
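One possible shape for to_feature_vector and the accompanying global_feature_dict, using plain term-frequency weighting; the notebooks may weight features differently (binary or TF-IDF are the alternatives mentioned above).

```python
# Global dictionary tracking every feature name seen in the dataset.
global_feature_dict = {}

def to_feature_vector(tokens):
    # Build a sparse {feature_name: weight} dict from a list of tokens.
    # Term frequency is used as the weight in this sketch.
    vector = {}
    for token in tokens:
        vector[token] = vector.get(token, 0) + 1
        global_feature_dict[token] = global_feature_dict.get(token, 0) + 1
    return vector

# Example: to_feature_vector(["fake", "news", "news"]) -> {"fake": 1, "news": 2}
```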
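A sketch of the 10-fold cross-validation loop built on scikit-learn's DictVectorizer and LinearSVC; the exact classifier settings, fold splitting, and metric averaging used in the repository may differ.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_classifier(data):
    # data is a list of (feature_dict, label) pairs.
    features, labels = zip(*data)
    model = make_pipeline(DictVectorizer(), LinearSVC())
    model.fit(list(features), list(labels))
    return model

def predict_labels(samples, classifier):
    # samples is a list of feature dicts.
    return classifier.predict(samples)

def cross_validate(dataset, folds=10):
    # Split the data into `folds` contiguous chunks; each chunk is held out once.
    fold_size = len(dataset) // folds
    scores = []
    for i in range(folds):
        validation = dataset[i * fold_size:(i + 1) * fold_size]
        training = dataset[:i * fold_size] + dataset[(i + 1) * fold_size:]
        classifier = train_classifier(training)
        val_features, val_labels = zip(*validation)
        predictions = predict_labels(list(val_features), classifier)
        precision, recall, f1, _ = precision_recall_fscore_support(
            val_labels, predictions, average="weighted", zero_division=0)
        scores.append([precision, recall, f1,
                       accuracy_score(val_labels, predictions)])
    cv_results = np.mean(scores, axis=0)  # averaged precision, recall, F1, accuracy
    return cv_results
```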
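The error-analysis step could be sketched as follows, printing the confusion matrix and collecting the false positives and false negatives for the FAKE class from a held-out split. This is illustrative only and does not reproduce the contents of data/error_analysis_report.csv.

```python
from sklearn.metrics import confusion_matrix

def error_analysis(validation_data, classifier, fake_label=0):
    # validation_data is a list of (feature_dict, gold_label) pairs.
    features, gold = zip(*validation_data)
    predicted = classifier.predict(list(features))
    print(confusion_matrix(gold, predicted))
    # For the FAKE class: false positives are predicted FAKE but really REAL,
    # false negatives are really FAKE but predicted REAL.
    false_positives = [f for f, g, p in zip(features, gold, predicted)
                       if p == fake_label and g != fake_label]
    false_negatives = [f for f, g, p in zip(features, gold, predicted)
                       if g == fake_label and p != fake_label]
    return false_positives, false_negatives
```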
### Conclusion

In this project, I have implemented an SVM classifier for fake news detection. By utilizing text classification techniques, I have developed a model capable of distinguishing between real and fake news. I have explored various strategies to optimize the pre-processing stage, feature extraction, and the inclusion of metadata features. The final model's performance was evaluated using cross-validation and error analysis. The findings and improvements made contribute to enhancing the accuracy and effectiveness of fake news detection systems.

## Project structure

```
Project_folder/
|- bin/        # contains scripts and main files that should be run
|- config/     # config files
|- notebooks/  # notebooks for EDA, exploration, predictions, results and conclusions
|- src/        # source code - contains functions
|- tests/      # test files should mirror the src folder
|- Makefile    # automate tasks through the make utility
```

## Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

### Prerequisites

Set up your environment and install the project dependencies:

```
conda create -n my_project python=3.10
source activate my_project
python -m pip install pip-tools
pip-compile --output-file requirements.txt requirements.in requirements_dev.in
python -m pip install -r requirements.txt
```

### Installing

## Running the tests

Tests are implemented in ./tests. Run them with the following command:

```
make tests
```
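The contents of tests/test_main.py are not reproduced here. Purely as an illustration, a pytest-style test of the following kind is what `make tests` would typically pick up; the imported function and expected output are hypothetical, not taken from the repository.

```python
# Illustrative only -- the real tests/test_main.py is not shown above.
from src.main import pre_process  # hypothetical import; src/main.py's contents are not shown

def test_pre_process_lowercases_and_tokenizes():
    # A word tokenizer for this task should lower-case text and drop punctuation.
    assert pre_process("Fake News spreads fast!") == ["fake", "news", "spreads", "fast"]
```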
