20newsgroups_GBDT
Category: Numerical algorithms / Artificial intelligence
Development tool: Python
File size: 69098KB
Downloads: 0
Upload date: 2019-05-25 08:45:05
Uploader: sh-1993
Description: 20newsgroups_GBDT
File list:
.idea (0, 2019-05-25)
.idea\20newsgroups_GBDT.iml (398, 2019-05-25)
.idea\misc.xml (288, 2019-05-25)
.idea\modules.xml (286, 2019-05-25)
.idea\vcs.xml (180, 2019-05-25)
.idea\workspace.xml (18144, 2019-05-25)
__pycache__ (0, 2019-05-25)
__pycache__\data_process.cpython-36.pyc (4278, 2019-05-25)
__pycache__\gbdt_model.cpython-36.pyc (1058, 2019-05-25)
data (0, 2019-05-25)
data\x_test.pkl (9600162, 2019-05-25)
data\x_train.pkl (38392962, 2019-05-25)
data\y_test.pkl (32159, 2019-05-25)
data\y_train.pkl (128135, 2019-05-25)
data_process.py (4606, 2019-05-25)
data_tfidf (0, 2019-05-25)
data_tfidf\x_test.pkl (6642354, 2019-05-25)
data_tfidf\x_train.pkl (27135198, 2019-05-25)
data_tfidf\y_test.pkl (32159, 2019-05-25)
data_tfidf\y_train.pkl (128135, 2019-05-25)
gbdt_model.py (1631, 2019-05-25)
img (0, 2019-05-25)
img\default_fasttext.png (196190, 2019-05-25)
img\default_tfidf.png (198918, 2019-05-25)
img\default_tfidf_new.png (76846, 2019-05-25)
main.py (1455, 2019-05-25)
# 20newsgroups_GBDT
This is an implementation of GBDT multi-class classification with scikit-learn. The dataset is [20news-19997.tar.gz](http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz).
## Usage
### Requirements
- python 3.6
- nltk
- scikit-learn
- numpy
- pickle (part of the Python standard library, no installation needed)
## Run
### 1. Generate train data & test data
Download 20news-19997.tar.gz and extract it into the project root (e.g. './20_newsgroups'), then download crawl-300d-2M.vec and put it in './vectors/'.
```
python data_process.py
```
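This produces two directories, each containing four pickle files (x_train, x_test, y_train, y_test):
- ./data_tfidf: documents represented as TF-IDF features
- ./data: documents represented with fastText word embeddings, using the pretrained file [crawl-300d-2M.vec](https://www.kaggle.com/yekenot/fasttext-crawl-300d-2m)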
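
As a rough illustration of how the TF-IDF variant of these pickle files could be produced, here is a minimal sketch. It is not the code in data_process.py. The toy documents, the 75/25 split, the `stop_words` setting, and the temporary output directory are all assumptions made so the example runs on its own; only the four file names come from the file list.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-in for the newsgroup posts (assumption); the real data
# would be read from the extracted ./20_newsgroups directory.
docs = [
    "the gpu renders graphics quickly",
    "image rendering on a graphics card",
    "the team won the hockey game",
    "hockey players skate on ice",
    "graphics drivers need an update",
    "the goalie saved the game",
    "a new graphics card was released",
    "the hockey season starts soon",
]
labels = [0, 0, 1, 1, 0, 1, 0, 1]

# Hold out part of the data for testing (split ratio is an assumption).
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, random_state=0, stratify=labels
)

# Fit the vocabulary and IDF weights on the training set only,
# then reuse them to transform the test set.
vectorizer = TfidfVectorizer(stop_words="english")
x_train = vectorizer.fit_transform(docs_train)
x_test = vectorizer.transform(docs_test)

# Write the four pickle files into a temporary "data_tfidf" directory.
out_dir = os.path.join(tempfile.mkdtemp(), "data_tfidf")
os.makedirs(out_dir)
arrays = {"x_train": x_train, "x_test": x_test,
          "y_train": y_train, "y_test": y_test}
for name, obj in arrays.items():
    with open(os.path.join(out_dir, name + ".pkl"), "wb") as f:
        pickle.dump(obj, f)
```

Fitting the vectorizer on the training documents only keeps test-set vocabulary statistics out of the features.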
### 2. Start training GBDT
```
python main.py
```
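
For orientation, the training step might look roughly like the sketch below. It is not the contents of main.py or gbdt_model.py. It assumes scikit-learn's `GradientBoostingClassifier` and substitutes synthetic data from `make_classification` for the pickled features so that it runs without the dataset. The `n_estimators=60` and `max_depth=6` values are the ones named in the third result heading; everything else is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic 3-class data as a stand-in (assumption); in the project the
# features and labels would be loaded from the generated pickle files.
x, y = make_classification(
    n_samples=300, n_features=20, n_informative=10,
    n_classes=3, random_state=0,
)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y
)

# Gradient-boosted decision trees for multi-class classification.
clf = GradientBoostingClassifier(n_estimators=60, max_depth=6, random_state=0)
clf.fit(x_train, y_train)

# Evaluate on the held-out split.
y_pred = clf.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print("accuracy:", accuracy)
print(classification_report(y_test, y_pred))
```

`GradientBoostingClassifier` also accepts the sparse matrices that `TfidfVectorizer` produces, so the TF-IDF features would not need to be converted to dense arrays first.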
## Result
### 1. GBDT + fasttext:
![image](https://github.com/hanyc0914/20newsgroups_GBDT/blob/master/img/default_fasttext.png)
### 2. GBDT + TF-IDF:
![image](https://github.com/hanyc0914/20newsgroups_GBDT/blob/master/img/default_tfidf.png)
### 3. GBDT(n_estimators: 60, max_depth: 6) + TF-IDF
![image](https://github.com/hanyc0914/20newsgroups_GBDT/blob/master/img/default_tfidf_new.png)