# SMP 2018 (1st prize)
This contest asked participants to distinguish human-written from machine-written articles, and we won first place out of 240 teams.
# Task description
Given an article, we need to create algorithms that judge the type of its author (automatic summarization, machine translation, robot writer, or human writer).
For more details, see [SMP EUPT 2018](https://www.biendata.com/competition/smpeupt2018/)
## 1. Setup
* tensorflow >= 1.4.0
* keras >= 1.2.0
* gensim
* scikit-learn
You may also need **keras.utils.vis_utils** for model visualization.
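The dependency list above could be captured in a `requirements.txt` along these lines (a sketch, not shipped with the repo; `pydot` is listed only because `keras.utils.vis_utils` needs it, together with a system Graphviz install, for model plots):

```
tensorflow>=1.4.0
keras>=1.2.0
gensim
scikit-learn
pydot        # only needed for keras.utils.vis_utils model plots
```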
## 2. Data Preprocessing
- `my_utils/`: data preprocessing utilities
- `my_utils/data`: converts the original data to CSV files
- `my_utils/data_preprocess`: creates data sequences and batches as input for the deep learning models
- `my_utils/w2v_process`: builds the vocabularies and pre-trained embeddings for words and characters
- `my_utils/metrics`: calculates the precision, recall and F1 scores for each category of author
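As a sketch of what the metrics step computes, per-category precision, recall and F1 can be written directly with NumPy (the function name and label encoding here are illustrative, not the repo's actual API in `my_utils/metrics`):

```python
import numpy as np

def per_class_metrics(y_true, y_pred, labels):
    """Compute precision, recall and F1 for each author category."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = {}
    for label in labels:
        tp = np.sum((y_pred == label) & (y_true == label))  # true positives
        fp = np.sum((y_pred == label) & (y_true != label))  # false positives
        fn = np.sum((y_pred != label) & (y_true == label))  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores
```

In the contest the four categories would be the four author types; here they can be any hashable labels.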
## 3. Models
There are 12 models in total, each combining word representations with character representations.
The best model we devised, `word_rcnn_char_cgru`, is inspired by two papers:
* [A Hybrid Framework for Text Modeling with Convolutional RNN](http://xueshu.baidu.com/s?wd=paperuri%3A%288fa9aee951dcbd75f9259bc0f6bee7d6%29&filter=sc_long_sign&tn=SE_xueshusource_2kduw22v&sc_vurl=http%3A%2F%2Fdl.acm.org%2Fcitation.cfm%3Fid%3D30***140&ie=utf-8&sc_us=15226213875739465170)
* [A C-LSTM Neural Network for Text Classification](http://xueshu.baidu.com/s?wd=paperuri%3A%28e3c8a546d601***116***2a41cca6f2ad8%29&filter=sc_long_sign&tn=SE_xueshusource_2kduw22v&sc_vurl=http%3A%2F%2Farxiv.org%2Fpdf%2F1511.08630&ie=utf-8&sc_us=5294540248844921011)
Here are the scores of the different models:

| model | off-line | on-line |
| :---: | :---: | :---: |
| word_char_cnn | 0.***88 | 0.***49 |
| word_char_rnn | 0.***94 | 0.***63 |
| deep_word_char_cnn | 0.***87 | 0.***28 |
| word_rcnn_char_rnn | 0.***99 | 0.***79 |
| word_rnn_char_rcnn | 0.9902 | 0.***72 |
| word_char_cgru | 0.***96 | 0.***61 |
| word_cgru_char_rcnn | 0.9904 | untested |
| word_rcnn_char_cgru | 0.9910 | 0.***82 |
| word_cgru_char_rnn | 0.***87 | untested |
| word_rnn_char_cgru | 0.***99 | untested |
| word_rnn_char_cnn | 0.***97 | 0.***62 |
| word_char_rcnn | 0.***94 | 0.***84 |
* Note that rcnn comes from `A Hybrid Framework for Text Modeling with Convolutional RNN`, while cgru comes from `A C-LSTM Neural Network for Text Classification`.

The source code derives from https://github.com/fuliucansheng/360
We use `model` to define the model architectures and `train` to train them.
## 4. Ensemble
We use LightGBM to ensemble the 12 models together with extra statistical features (see `ensemble`); more details can be found at https://github.com/TFknight/SMP-2018-Ensemble-Guide
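The blending idea can be sketched with a plain weighted average over each model's class-probability matrix (the repo's `ensemble` scripts instead feed these probabilities, plus statistical features, into LightGBM; the function name and weights below are purely illustrative):

```python
import numpy as np

def blend_probabilities(prob_list, weights):
    """Weighted average of per-model class-probability matrices.

    prob_list: list of (n_samples, n_classes) arrays, one per model.
    weights:   one non-negative weight per model (e.g. its validation score).
    Returns the predicted class index for each sample.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()      # normalise weights to sum to 1
    stacked = np.stack(prob_list)          # (n_models, n_samples, n_classes)
    blended = np.tensordot(weights, stacked, axes=1)  # weighted average
    return blended.argmax(axis=1)
```

A gradient-boosted blender like LightGBM can learn non-linear interactions between model outputs that a fixed weighted average cannot.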
On the test dataset, we adopt a simple but efficient voting mechanism for ensembling (see `evaluate/predict`).
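A minimal majority-vote sketch of that mechanism (illustrative only, not the exact code in `evaluate/predict.py`; ties here go to the label counted first):

```python
from collections import Counter

def majority_vote(model_predictions):
    """model_predictions: list of per-model label sequences, equal length.

    Returns one label per sample, chosen by majority vote across models.
    """
    n_samples = len(model_predictions[0])
    final = []
    for i in range(n_samples):
        votes = [preds[i] for preds in model_predictions]
        final.append(Counter(votes).most_common(1)[0][0])
    return final
```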
## 5. Main files
- `my_utils/`: data preprocessing utilities
- `my_utils/data`: converts the original data to CSV files
- `my_utils/data_preprocess`: creates data sequences and batches as input for the deep learning models
- `my_utils/w2v_process`: builds the vocabularies and pre-trained embeddings for words and characters
- `my_utils/metrics`: calculates the precision, recall and F1 scores for each category of author
- `model/`: deep learning model definitions
- `model/deepzoo`: keeps all the models
- `init/config.py`: stores the paths of models, data and so on
- `train`: trains the models
- `figure`: saves the model visualizations
# Acknowledgment
Thanks to my teammates in `GDUFS-iiip` for all their efforts.
We hope more people will join our lab: `Data Mining Lab in GDUFS (广外数据挖掘实验室)`.