bbc_newsclf
Category: Feature extraction
Development tool: Jupyter Notebook
File size: 8965 KB
Downloads: 0
Upload date: 2021-05-05 22:59:46
Uploader: sh-1993
Description: Simple BBC news classification with TF-IDF bag-of-words and sLDA topic modelling
File list:
LDA_test.py (975, 2021-05-06)
Lits (0, 2021-05-06)
Lits\Machine learning in automated text categorization.pdf (524409, 2021-05-06)
Lits\NIPS-2007-supervised-topic-models-Paper.pdf (323560, 2021-05-06)
Lits\blei03a.pdf (417996, 2021-05-06)
SVC_model.pickle (14760607, 2021-05-06)
X_1000_300.pickle (94746167, 2021-05-06)
dataset (0, 2021-05-06)
dataset\bbc (0, 2021-05-06)
dataset\bbc\business (0, 2021-05-06)
dataset\bbc\business\001.txt (2560, 2021-05-06)
dataset\bbc\business\002.txt (2252, 2021-05-06)
dataset\bbc\business\003.txt (1552, 2021-05-06)
dataset\bbc\business\004.txt (2412, 2021-05-06)
dataset\bbc\business\005.txt (1570, 2021-05-06)
dataset\bbc\business\006.txt (1187, 2021-05-06)
dataset\bbc\business\007.txt (1669, 2021-05-06)
dataset\bbc\business\008.txt (1922, 2021-05-06)
dataset\bbc\business\009.txt (1494, 2021-05-06)
dataset\bbc\business\010.txt (1449, 2021-05-06)
dataset\bbc\business\011.txt (1144, 2021-05-06)
dataset\bbc\business\012.txt (1847, 2021-05-06)
dataset\bbc\business\013.txt (1830, 2021-05-06)
dataset\bbc\business\014.txt (2981, 2021-05-06)
dataset\bbc\business\015.txt (3808, 2021-05-06)
dataset\bbc\business\016.txt (1393, 2021-05-06)
dataset\bbc\business\017.txt (1299, 2021-05-06)
dataset\bbc\business\018.txt (1002, 2021-05-06)
dataset\bbc\business\019.txt (1733, 2021-05-06)
dataset\bbc\business\020.txt (3854, 2021-05-06)
dataset\bbc\business\021.txt (2046, 2021-05-06)
dataset\bbc\business\022.txt (1933, 2021-05-06)
dataset\bbc\business\023.txt (1267, 2021-05-06)
dataset\bbc\business\024.txt (1954, 2021-05-06)
dataset\bbc\business\025.txt (2704, 2021-05-06)
dataset\bbc\business\026.txt (1829, 2021-05-06)
dataset\bbc\business\027.txt (1620, 2021-05-06)
... ...
# Simple news clf
- [**Project Repository**](https://bitbucket.org/4r2eBurger/news_clf/src/master/)
- [**Project Report**](https://bitbucket.org/4r2eBurger/news_clf/src/master/report_latex/CWI_part2_report.pdf)
## File Description
### Source Code
- [`text_processing.py`](https://bitbucket.org/4r2eBurger/news_clf/src/master/text_processing.py): A module of plain-text processing functions, such as POS counting and TF-IDF weight calculation.
- [`prepocessing.py`](https://bitbucket.org/4r2eBurger/news_clf/src/master/prepocessing.py): A module that processes the raw data and transforms it into sample vectors that can be used directly for training.
- [`one_shot_eval.py`](https://bitbucket.org/4r2eBurger/news_clf/src/master/one_shot_eval.py): A module that conducts simple train-dev-test evaluation.
- [`fs_n_training.py`](https://bitbucket.org/4r2eBurger/news_clf/src/master/fs_n_training.py): A module that conducts feature selection, grid-search with cross-validation.
- [`main.py`](https://bitbucket.org/4r2eBurger/news_clf/src/master/main.py): A module that combines the steps of the processing-training-validation-evaluation pipeline.
- [`main.ipynb`](https://bitbucket.org/4r2eBurger/news_clf/src/master/main.ipynb): processing-training-validation-evaluation pipeline demo.
- [`unseen_test.py`](https://bitbucket.org/4r2eBurger/news_clf/src/master/unseen_test.py): A module that runs a simple evaluation of the trained models by predicting on up-to-date, unseen BBC news documents.
- [`utils.py`](https://bitbucket.org/4r2eBurger/news_clf/src/master/utils.py): A module that includes simple file IO functions.
- [`sLDA_test.py`](https://bitbucket.org/4r2eBurger/news_clf/src/master/sLDA_test.py): (optional) Test case of supervised topic modelling, implemented with [Tomotopy Library](https://github.com/bab2min/tomotopy).
- [`sLDA_test.ipynb`](https://bitbucket.org/4r2eBurger/news_clf/src/master/sLDA_test.ipynb): (optional) supervised topic modelling demo.
- [`LDA_test.py`](https://bitbucket.org/4r2eBurger/news_clf/src/master/LDA_test.py): (optional) LDA topic modelling test case.
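
As an illustration of the TF-IDF weighting mentioned for `text_processing.py`, here is a minimal standard-library sketch. It is not the repository's implementation: the function name `tf_idf` and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` are assumptions chosen for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document.

    Assumed variant: raw term frequency times a smoothed IDF,
    log((1 + N) / (1 + df)) + 1, where N is the number of documents
    and df is the number of documents containing the term.
    """
    n_docs = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return weights


# Two toy "documents", already tokenised on whitespace.
docs = [
    "shares rise as profits grow".split(),
    "team wins final as fans cheer".split(),
]
weights = tf_idf(docs)
```

A term that occurs in every document (here `as`) receives the lowest IDF, so terms specific to one category carry more weight in the resulting bag-of-words vector.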
### Dataset
- `dataset`(DIR): Raw dataset that is used in experimental pipeline.
- `unseen_dat`(DIR): Raw dataset that is used in trained models evaluation.
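
The file listing shows the raw data laid out as `dataset/bbc/<category>/<NNN>.txt`. A minimal sketch of reading that layout into parallel text and label lists might look like the following; `load_bbc` is a hypothetical helper, not a function from the repository, and the UTF-8 decoding with `errors="replace"` is an assumption about the file encoding.

```python
from pathlib import Path


def load_bbc(root):
    """Read every <category>/<name>.txt under root into (texts, labels)."""
    texts, labels = [], []
    for category_dir in sorted(Path(root).iterdir()):
        if not category_dir.is_dir():
            continue
        for txt in sorted(category_dir.glob("*.txt")):
            # Assumption: files are (mostly) UTF-8; undecodable bytes
            # are replaced rather than raising an error.
            texts.append(txt.read_text(encoding="utf-8", errors="replace"))
            # The directory name (e.g. "business") serves as the label.
            labels.append(category_dir.name)
    return texts, labels


# Only runs when executed from the project root, where dataset/bbc exists.
if Path("dataset/bbc").is_dir():
    texts, labels = load_bbc("dataset/bbc")
    print(len(texts), "documents,", len(set(labels)), "categories")
```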
### Cached Files
- `SVC_model.pickle`: Serialisation of best fine-tuned SVC model.
- `softmax_model.pickle`: Serialisation of best fine-tuned Softmax Regression model.
- `vocab.pickle`: Serialisation of news contents vocabulary.
- `vocab_title.pickle`: Serialisation of news headlines vocabulary.
- `X_1000_300.pickle`: Serialisation of sample set X.
- `y_1000_300.pickle`: Serialisation of sample set y.
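
The cached files are ordinary Python pickles, so they can be read back with the standard `pickle` module. The helpers below are a hedged sketch; `load_pickle` and `save_pickle` are hypothetical names and may differ from what `utils.py` provides. Unpickling the model files will most likely also require the scikit-learn version listed under Prerequisites, since pickled estimators are not guaranteed to load across versions.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Deserialise one cached object. Only use on files you trust."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def save_pickle(obj, path):
    """Serialise obj to path using the highest available pickle protocol."""
    with Path(path).open("wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


# Only runs when executed from the project root, where the cache exists.
if Path("vocab.pickle").is_file():
    vocab = load_pickle("vocab.pickle")
    print(type(vocab))
```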
## Get Started
### Prerequisites
**OS**: OSX/Linux
The project must be run in a [**Conda**](https://docs.conda.io/projects/conda/en/latest/index.html) environment.
#### Conda dependencies
- python 3 `conda install python=3.9`
- scikit-learn 0.23.2 `conda install -c conda-forge scikit-learn=0.23.2`
- numpy `conda install numpy`
- pandas `conda install pandas`
- nltk `conda install nltk`
- matplotlib `conda install matplotlib`
#### Pip dependencies
- [Tomotopy](https://github.com/bab2min/tomotopy): `pip install tomotopy` (in conda environment)
### Steps
1. Clone or download the project.
2. Enter the source directory: `cd [user DIR]/news_clf`
3. Create a new conda environment: `conda create --name news_clf python=3.9 pandas numpy scikit-learn=0.23.2 nltk matplotlib pip`
4. Activate the environment: `conda activate news_clf`
#### One-shot Train-Dev-Test evaluation
Run `python one_shot_eval.py`
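
For readers unfamiliar with a train-dev-test split, the idea can be sketched with the standard library alone. This is not the code in `one_shot_eval.py`: the function name, the 60/20/20 proportions, and the per-label stratification are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def train_dev_test_split(labels, dev=0.2, test=0.2, seed=0):
    """Return (train, dev, test) index lists, stratified by label."""
    # Group sample indices by their class label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train_idx, dev_idx, test_idx = [], [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_dev = int(len(idxs) * dev)
        n_test = int(len(idxs) * test)
        dev_idx += idxs[:n_dev]
        test_idx += idxs[n_dev:n_dev + n_test]
        train_idx += idxs[n_dev + n_test:]
    return train_idx, dev_idx, test_idx


# Toy labels: ten samples for each of two categories.
labels = ["business"] * 10 + ["sport"] * 10
train_idx, dev_idx, test_idx = train_dev_test_split(labels)
```

The dev portion is used to compare model settings, while the test portion is held back for a single final measurement.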
#### Feature selection, Training and GridSearch with Cross-validation evaluation
Run `python main.py`, or open the Jupyter notebook demo `main.ipynb`.
#### Test on up-to-date small samples
Run `python unseen_test.py`
#### Demo of sLDA topic modelling with linear SVC (Optional)
Run `python sLDA_test.py`, or open the Jupyter notebook demo `sLDA_test.ipynb`.
* Tested on OSX and in an Ubuntu container; it may fail on Linux due to an [unresolved bug](https://github.com/bab2min/tomotopy#history) in the tomotopy library.