smp2018

Category: Robotics / Intelligent Manufacturing
Development tool: Python
File size: 2859 KB
Downloads: 0
Upload date: 2018-08-10 12:59:45
Uploader: sh-1993
Description: SMP 2018 contest (distinguish human-written from robot-written news articles)

File list:
Author Identification Base on Media Contents.pdf (688510, 2018-08-10)
ensemble (0, 2018-08-10)
ensemble\smp_lgb_blending.py (8941, 2018-08-10)
ensemble\smp_sta_all_vec.py (7956, 2018-08-10)
evaluate (0, 2018-08-10)
evaluate\get_pre_probe.py (1900, 2018-08-10)
evaluate\predict.py (8352, 2018-08-10)
figure (0, 2018-08-10)
figure\c_gru.png (139631, 2018-08-10)
figure\capsule_gru.png (117818, 2018-08-10)
figure\char_cnn.png (177793, 2018-08-10)
figure\word_char_capsule_gru.png (135658, 2018-08-10)
figure\word_char_rcnn.png (188871, 2018-08-10)
figure\word_han.png (58608, 2018-08-10)
figure\word_rcnn.png (115561, 2018-08-10)
gdufs_iiip_smp2018.ppt (1634800, 2018-08-10)
init (0, 2018-08-10)
init\__pycache__ (0, 2018-08-10)
init\__pycache__\config.cpython-35.pyc (5701, 2018-08-10)
init\config.py (8391, 2018-08-10)
model (0, 2018-08-10)
model\Attention.py (2531, 2018-08-10)
model\Capsule.py (2673, 2018-08-10)
model\__pycache__ (0, 2018-08-10)
model\__pycache__\Attention.cpython-35.pyc (2690, 2018-08-10)
model\__pycache__\Capsule.cpython-35.pyc (2660, 2018-08-10)
model\__pycache__\deepzoo.cpython-35.pyc (22456, 2018-08-10)
model\deepzoo.py (31942, 2018-08-10)
my_utils (0, 2018-08-10)
my_utils\__pycache__ (0, 2018-08-10)
my_utils\__pycache__\data.cpython-35.pyc (7807, 2018-08-10)
my_utils\__pycache__\data_preprocess.cpython-35.pyc (6678, 2018-08-10)
my_utils\__pycache__\metrics.cpython-35.pyc (583, 2018-08-10)
my_utils\data.py (12245, 2018-08-10)
my_utils\data_preprocess.py (7607, 2018-08-10)
my_utils\metrics.py (375, 2018-08-10)
my_utils\w2v_process.py (7685, 2018-08-10)
... ...

# SMP 2018 (1st prize)

This contest is about distinguishing human-written from machine-generated articles, and we won first place out of 240 teams.

# Task description

Given an article, we need to create algorithms that judge the type of its author (automatic summary, machine translation, robot writer, or human writer). For more details, see [SMP EUPT 2018](https://www.biendata.com/competition/smpeupt2018/)

## 1. Set up

* tensorflow >= 1.4.0
* keras >= 1.2.0
* gensim
* scikit-learn
You may need **keras.utils.vis_utils** for model visualization.

## 2. Data Preprocessing

- `my_utils/`: for data preprocessing
- `my_utils/data`: convert the origin data to a csv file
- `my_utils/data_preprocess`: create data sequences and batches as input for the deep learning models
- `my_utils/w2v_process`: get the vocabularies and pre-trained embeddings for words and chars
- `my_utils/metrics`: calculate the precision, recall, and F1 score for each category of author

## 3. Models

There are 12 models in total that combine word representations and character representations. The best model, `word rcnn char cgru`, which we devised, is inspired by two papers:

* [A Hybrid Framework for Text Modeling with Convolutional RNN](http://xueshu.baidu.com/s?wd=paperuri%3A%288fa9aee951dcbd75f9259bc0f6bee7d6%29&filter=sc_long_sign&tn=SE_xueshusource_2kduw22v&sc_vurl=http%3A%2F%2Fdl.acm.org%2Fcitation.cfm%3Fid%3D30***140&ie=utf-8&sc_us=15226213875739465170)
* [A C-LSTM Neural Network for Text Classification](http://xueshu.baidu.com/s?wd=paperuri%3A%28e3c8a546d601***116***2a41cca6f2ad8%29&filter=sc_long_sign&tn=SE_xueshusource_2kduw22v&sc_vurl=http%3A%2F%2Farxiv.org%2Fpdf%2F1511.08630&ie=utf-8&sc_us=5294540248844921011)

Here are the scores of the different models:

model | off-line | on-line
:---: | :---: | :---:
word_char_cnn | 0.***88 | 0.***49
word_char_rnn | 0.***94 | 0.***63
deep_word_char_cnn | 0.***87 | 0.***28
word_rcnn_char_rnn | 0.***99 | 0.***79
word_rnn_char_rcnn | 0.9902 | 0.***72
word_char_cgru | 0.***96 | 0.***61
word_cgru_char_rcnn | 0.9904 | untested
word_rcnn_char_cgru | 0.9910 | 0.***82
word_cgru_char_rnn | 0.***87 | untested
word_rnn_char_cgru | 0.***99 | untested
word_rnn_char_cnn | 0.***97 | 0.***62
word_char_rcnn | 0.***94 | 0.***84

* Note that rcnn comes from `A Hybrid Framework for Text Modeling with Convolutional RNN`, while cgru comes from `A C-LSTM Neural Network for Text Classification`
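To make the naming concrete, here is a minimal sketch of the two-branch idea behind `word rcnn char cgru`: a word branch where a GRU's hidden states are re-convolved (RCNN-style), and a char branch where a convolution feeds a GRU (C-GRU-style). All layer sizes, sequence lengths, and function names below are illustrative assumptions, not the repository's actual values (which live in `model/deepzoo.py`).

```python
# Hypothetical sketch of a word-RCNN + char-C-GRU classifier.
# Hyperparameters here are made up for illustration only.
from tensorflow.keras import layers, Model

def build_word_rcnn_char_cgru(word_vocab=50000, char_vocab=5000,
                              word_len=200, char_len=400,
                              embed_dim=128, n_classes=4):
    # word branch: embedding -> BiGRU -> Conv1D over GRU states (RCNN)
    word_in = layers.Input(shape=(word_len,), name="word_ids")
    w = layers.Embedding(word_vocab, embed_dim)(word_in)
    w = layers.Bidirectional(layers.GRU(128, return_sequences=True))(w)
    w = layers.Conv1D(128, 3, padding="same", activation="relu")(w)
    w = layers.GlobalMaxPooling1D()(w)

    # char branch: embedding -> Conv1D -> BiGRU (C-GRU)
    char_in = layers.Input(shape=(char_len,), name="char_ids")
    c = layers.Embedding(char_vocab, embed_dim)(char_in)
    c = layers.Conv1D(128, 3, padding="same", activation="relu")(c)
    c = layers.Bidirectional(layers.GRU(128))(c)

    # merge both views and classify into the 4 author types
    x = layers.concatenate([w, c])
    x = layers.Dense(256, activation="relu")(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return Model([word_in, char_in], out)

model = build_word_rcnn_char_cgru()
```

The two branches deliberately apply convolution and recurrence in opposite orders, which is the complementary pairing the two cited papers suggest.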
The source code derives from https://github.com/fuliucansheng/360
We use `model` to create the model architectures and `train` to train them.

## 4. Ensemble
We use LightGBM to ensemble the 12 models combined with extra statistical features; the code is in `ensemble`. More details can be found at https://github.com/TFknight/SMP-2018-Ensemble-Guide
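As a rough illustration of this blending setup (not the repository's exact code), each base model outputs a probability distribution over the 4 author classes for every article; concatenating all 12 models' probabilities (12 × 4 = 48 columns) with extra statistical features yields the matrix fed to LightGBM. The array shapes and feature names below are assumptions.

```python
# Illustrative blending-feature construction; random data stands in for
# the saved predictions of the 12 base models.
import numpy as np

rng = np.random.default_rng(0)
n_articles, n_models, n_classes = 5, 12, 4

# stand-in for the out-of-fold class probabilities of each base model
model_probs = [rng.dirichlet(np.ones(n_classes), size=n_articles)
               for _ in range(n_models)]
# stand-in for hand-crafted statistics (e.g. text length, punctuation counts)
stat_feats = rng.random((n_articles, 3))

# final feature matrix: 12 * 4 probability columns + 3 statistical columns
X = np.hstack(model_probs + [stat_feats])   # shape: (n_articles, 51)

# with labels y, a LightGBM blender would then be trained roughly as:
#   import lightgbm as lgb
#   booster = lgb.train({"objective": "multiclass", "num_class": 4},
#                       lgb.Dataset(X, label=y))
```

Training the blender on out-of-fold predictions (rather than in-fold ones) is what keeps it from simply memorizing the base models' training-set behavior.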
On the test dataset, we only adopt a simple but efficient voting mechanism for ensembling, which is in `evaluate/predict`.

## 5. Main files

- `my_utils/`: for data preprocessing
- `my_utils/data`: convert the origin data to a csv file
- `my_utils/data_preprocess`: create data sequences and batches as input for the deep learning models
- `my_utils/w2v_process`: get the vocabularies and pre-trained embeddings for words and chars
- `my_utils/metrics`: calculate the precision, recall, and F1 score for each category of author
- `model/`: for creating deep learning models
- `deepzoo`: for keeping all models
- `init/config.py`: for saving the paths of models, data, and so on
- `train`: for training models
- `figure`: for saving the visualizations of models

# Acknowledgment
Thanks to all my teammates in `GDUFS-iiip` for their efforts.
We hope that more people will join our lab: `Data Mining Lab in GDUFS (广外数据挖掘实验室)`
