stigmaClassification

Category: Clustering algorithms
Development tool: Jupyter Notebook
File size: 7398KB
Downloads: 0
Upload date: 2022-05-18 16:29:32
Uploader: sh-1993
Description: Automatic classification of stigmatizing mental illness articles in online news journals

File list:
1. preprocessing (0, 2022-05-19)
1. preprocessing\data_labeled.csv (6449094, 2022-05-19)
1. preprocessing\preprocessing.ipynb (41041, 2022-05-19)
2.classification (0, 2022-05-19)
2.classification\bert (0, 2022-05-19)
2.classification\bert\blank.txt (0, 2022-05-19)
2.classification\classification.ipynb (91294, 2022-05-19)
2.classification\classification_bert.ipynb (109328, 2022-05-19)
2.classification\data_preprocessed.pkl (4235459, 2022-05-19)
2.classification\dictionary (0, 2022-05-19)
2.classification\dictionary\liwc.txt (3139843, 2022-05-19)
2.classification\hp_tuning (0, 2022-05-19)
2.classification\hp_tuning\blank.txt (0, 2022-05-19)
2.classification\models (0, 2022-05-19)
2.classification\models\blank.txt (0, 2022-05-19)
2.classification\pre-trained (0, 2022-05-19)
2.classification\pre-trained\blank.txt (0, 2022-05-19)
2.classification\vectors (0, 2022-05-19)
2.classification\vectors\blank.txt (0, 2022-05-19)
3.topic modeling (0, 2022-05-19)
3.topic modeling\data_preprocessed_tm.pkl (4235459, 2022-05-19)
3.topic modeling\topic_modeling.ipynb (36181, 2022-05-19)
4.visualization and analysis (0, 2022-05-19)
4.visualization and analysis\4.analysis.ipynb (217831, 2022-05-19)
4.visualization and analysis\data_preprocessed_va.pkl (4404727, 2022-05-19)
data_collection (0, 2022-05-19)
data_collection\collect_from_api.py (9123, 2022-05-19)
data_collection\requirements.txt (39, 2022-05-19)
data_collection\search.py (2247, 2022-05-19)
data_collection\utils.py (934, 2022-05-19)
data_collection\web_scraping.py (7944, 2022-05-19)

## Classification of stigmatizing articles about mental illness in Portuguese online newspapers, with machine learning and natural language processing

This project automatically classifies articles that stigmatize mental illness in Portuguese online newspapers and automatically detects the topics present in them. The official data source is the Arquivo.pt public repository.

Author: Alina Yanchuk - alinyanchuk@ua.pt - University of Aveiro

### Data Collection program

This program collects data (web pages of Portuguese articles that portray the mental disorders of schizophrenia and psychosis) from the public repository Arquivo.pt and performs web scraping. The result is a CSV file (data.csv) containing the relevant data in structured form.

To execute, run `pip install -r requirements.txt` followed by `python3 collect_from_api.py`.

### 1. Preprocessing

The Preprocessing stage is organized in a Jupyter Notebook, with the relevant steps described in the notebook itself. The input is the data file returned by the Data Collection step, which needs manual class annotation first. The stage returns one data file for the Classification and Topic Modeling tasks and another for the final Visualization and Analysis task.

Note: the data file in the directory (data_labeled.csv) is already manually labeled. After being returned by the Data Collection stage, it also received some manual processing to fully prepare it for the next steps.

### 2. Classification

The Classification stage is organized in 2 Jupyter Notebooks, with the relevant steps described in the notebooks themselves. The input is the file with the cleaned data returned by the Preprocessing step. The first notebook contains 5 traditional machine learning algorithms and 3 deep learning algorithms, along with their hyperparameter tuning and evaluation metrics. The second notebook contains the implementation of BERT (BERTimbau - PT BERT), placed in a separate notebook for organizational reasons and because some of its steps differ.
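
The evaluation metrics mentioned above can be illustrated with a minimal, dependency-free sketch. The README does not say which metrics the notebooks report, so the choice of precision, recall, F1, and accuracy, the function name `binary_metrics`, and the label encoding (1 = stigmatizing, 0 = not stigmatizing) are all assumptions made for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1 and accuracy for a binary task.

    Hypothetical helper: `positive` is the label treated as the
    "stigmatizing" class (assumed here to be 1).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0

    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    # Toy labels, not taken from the project's data.
    gold = [1, 1, 1, 0, 0, 0, 0, 0]
    predicted = [1, 1, 0, 0, 0, 0, 0, 0]
    for name, value in binary_metrics(gold, predicted).items():
        print(f"{name}: {value:.3f}")
```

Reporting precision and recall alongside accuracy is useful when one class is much rarer than the other, because accuracy alone can look high for a model that rarely predicts the minority class.
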

Models implemented:

- Logistic Regression
- Linear SVC
- Multinomial Naive Bayes
- K-Nearest Neighbors
- Random Forest
- Convolutional Neural Network (with GloVe PT 300D)
- Long Short-Term Memory (LSTM) (with GloVe PT 300D)
- Bidirectional Long Short-Term Memory (Bi-LSTM) (with GloVe PT 300D)
- BERTimbau

### 3. Topic Modeling

The Topic Modeling stage is organized in a Jupyter Notebook, with the relevant steps described in the notebook itself. The input is the file with the cleaned data returned by the Preprocessing step.

### 4. Visualization and Analysis

The Visualization and Analysis stage is organized in a Jupyter Notebook, with the relevant steps described in the notebook itself. The input is the file with the cleaned data, prepared for visualization and analysis, returned by the Preprocessing step. This step was done to obtain final insights about the data after automatic Classification and Topic Modeling. It can be adapted to other needs.

About Jupyter Notebooks: https://docs.jupyter.org/en/latest/

Arquivo.pt: https://arquivo.pt/
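
As a rough illustration of the Data Collection step, the sketch below builds a search request for Arquivo.pt and flattens the returned items into CSV text. It is not the code in collect_from_api.py. The endpoint `https://arquivo.pt/textsearch`, the parameter names (`q`, `from`, `to`, `maxItems`, `offset`), the `response_items` key, and the item fields are assumptions based on the public Arquivo.pt API and should be checked against its documentation. The helper names are hypothetical.

```python
import csv
import io
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed full-text search endpoint; verify against the Arquivo.pt API docs.
TEXTSEARCH_ENDPOINT = "https://arquivo.pt/textsearch"


def build_search_url(query, date_from=None, date_to=None,
                     max_items=50, offset=0):
    """Return a search URL for `query` (parameter names are assumptions)."""
    params = {"q": query, "maxItems": max_items, "offset": offset}
    if date_from:
        params["from"] = date_from
    if date_to:
        params["to"] = date_to
    return f"{TEXTSEARCH_ENDPOINT}?{urlencode(params)}"


def fetch_items(url, timeout=30):
    """Download one page of results; needs network access.

    Assumes the JSON payload keeps its results under "response_items".
    """
    with urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return payload.get("response_items", [])


def items_to_csv(items,
                 fields=("title", "originalURL", "tstamp", "linkToArchive")):
    """Serialize result items to CSV text, keeping only `fields`."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for item in items:
        # Missing fields become empty cells; extra fields are dropped.
        writer.writerow({name: item.get(name, "") for name in fields})
    return buffer.getvalue()


if __name__ == "__main__":
    # Prints the request URL only; call fetch_items(url) to query the API.
    url = build_search_url("esquizofrenia", date_from="20100101000000",
                           max_items=10)
    print(url)
```

Keeping URL construction, downloading, and CSV serialization in separate functions lets the first and last be checked offline, with only `fetch_items` depending on the network.
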
