BertCommentSum
Category: Feature extraction
Development tool: Python
File size: 324KB
Downloads: 0
Upload date: 2022-12-08 10:42:59
Uploader: sh-1993
Description: Bert Abstractive Summarization of Online News Discussion Threads
File list:
BETO (0, 2021-06-10)
LICENSE (1065, 2021-06-10)
bert_data (0, 2021-06-10)
bert_job.sh (1217, 2021-06-10)
bert_job_nofinetune.sh (1241, 2021-06-10)
bert_job_nofinetune_nolikes.sh (1283, 2021-06-10)
bert_job_nofinetune_notitle.sh (1279, 2021-06-10)
bert_job_nofinetune_notitle_nolikes.sh (1322, 2021-06-10)
bert_job_notitles.sh (1251, 2021-06-10)
bert_job_notitles_nolikes.sh (1274, 2021-06-10)
bert_job_tricomment.sh (1181, 2021-06-10)
bert_models (0, 2021-06-10)
bert_models\cased (0, 2021-06-10)
bert_models\cased\config.json (313, 2021-06-10)
bert_models\cased\vocab.txt (242120, 2021-06-10)
bert_models\uncased (0, 2021-06-10)
bert_models\uncased\config.json (313, 2021-06-10)
bert_models\uncased\vocab.txt (248047, 2021-06-10)
checker.py (264, 2021-06-10)
data (0, 2021-06-10)
data_sample.json (68096, 2021-06-10)
format_to_lines.sh (555, 2021-06-10)
json_data (0, 2021-06-10)
logs (0, 2021-06-10)
models (0, 2021-06-10)
requirements.txt (748, 2021-06-10)
results (0, 2021-06-10)
src (0, 2021-06-10)
src\cal_rouge.py (4275, 2021-06-10)
... ...
# Bert Abstractive Comment Summarization
**This code is based on PreSumm from the EMNLP 2019 paper [Text Summarization with Pretrained Encoders](https://arxiv.org/abs/1908.08345) ([repo](https://github.com/nlpyang/PreSumm/))**
**Python version**: This code runs on Python 3.6
**Package Requirements**: torch==1.1.0 transformers tensorboardX multiprocess pyrouge
To encode text longer than 512 tokens (for example, 800 tokens), set `-max_pos 800` during both preprocessing and training.
Some code is borrowed from [OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py).
## BERT Model used
[BETO](https://github.com/dccuchile/beto) from [Spanish Pre-Trained BERT Model and Evaluation Data](https://users.dcc.uchile.cl/~jperez/papers/pml4dc2020.pdf)
### Preprocessing
#### Sentence Splitting and Tokenization
```
python preprocess.py -mode tokenize -raw_path RAW_PATH -save_path TOKENIZED_PATH
```
* `RAW_PATH` is the directory containing the story files (`../raw_stories`); `TOKENIZED_PATH` is the target directory for the generated JSON files (`../merged_stories_tokenized`)
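The tokenize step turns each raw story file into a JSON list of tokenized sentences (upstream PreSumm delegates this to Stanford CoreNLP; the snippet below is only an illustrative stand-in using a naive regex tokenizer to show the output shape):

```python
import json
import re

def naive_tokenize(text):
    """Illustrative stand-in for CoreNLP: split on terminal punctuation
    into sentences, then split each sentence into word/punct tokens."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences if s]

doc = "El modelo resume hilos de noticias. Los comentarios aportan contexto."
tokenized = naive_tokenize(doc)
print(json.dumps(tokenized, ensure_ascii=False))
```

Each story thus becomes a list of sentences, each sentence a list of tokens, which is the structure the later preprocessing stages consume.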
#### Format to Simpler Json Files
```
python preprocess.py -mode format_to_lines -raw_path RAW_PATH -save_path JSON_PATH -n_cpus 1 -use_bert_basic_tokenizer false -map_path MAP_PATH
```
* `RAW_PATH` is the directory containing the tokenized files (`../merged_stories_tokenized`); `JSON_PATH` is the target directory for the generated JSON files (`../json_data/cnndm`); `MAP_PATH` is the directory containing the URL mapping files (`../urls`)
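The "simpler JSON" files follow the upstream PreSumm convention (an assumption based on that repo, not verified against this fork): each example is a dict holding tokenized source sentences under `src` and target summary sentences under `tgt`, serialized one object per line. A minimal sketch of one record:

```python
import json

# Assumed record layout, mirroring PreSumm's format_to_lines output:
# source sentences (here, thread comments) and the target summary.
record = {
    "src": [["the", "thread", "discusses", "the", "article", "."],
            ["users", "debate", "the", "headline", "."]],
    "tgt": [["readers", "react", "to", "the", "news", "."]],
}

line = json.dumps(record)      # one JSON object per example
restored = json.loads(line)
print(restored["tgt"][0])
```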
#### Format to PyTorch Files
```
python preprocess.py -mode format_to_bert -raw_path JSON_PATH -save_path BERT_DATA_PATH -lower -n_cpus 1 -log_file ../logs/preprocess.log
```
* `JSON_PATH` is the directory containing json files (`../json_data`), `BERT_DATA_PATH` is the target directory to save the generated binary files (`../bert_data`)
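Conceptually, `format_to_bert` maps each example into BERT input ids: every sentence is wrapped with `[CLS]`/`[SEP]`, tokens are looked up in the vocabulary, and the sequence is truncated at `-max_pos` (512 by default). The sketch below illustrates this with a toy vocabulary; it is not the repo's actual code:

```python
# Illustrative sketch (not the repo's implementation) of BERT binarization:
# [CLS]/[SEP] wrapping, vocab lookup with [UNK] fallback, max_pos truncation.
MAX_POS = 512

vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "users": 4, "debate": 5, "the": 6, "headline": 7, ".": 8}

def to_bert_ids(sentences, vocab, max_pos=MAX_POS):
    ids = []
    for sent in sentences:
        ids.append(vocab["[CLS]"])
        ids.extend(vocab.get(tok, vocab["[UNK]"]) for tok in sent)
        ids.append(vocab["[SEP]"])
    return ids[:max_pos]

src = [["users", "debate", "the", "headline", "."]]
print(to_bert_ids(src, vocab))  # [2, 4, 5, 6, 7, 8, 3]
```

The real pipeline uses the shipped `bert_models/*/vocab.txt` files and saves the results as PyTorch binaries in `BERT_DATA_PATH`.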
## Model Training
**First run: use a single GPU the first time so the code can download the BERT model (`-visible_gpus -1`). Once the download finishes, you can kill the process and rerun with multiple GPUs.**
### Abstractive Setting
#### BertAbs
```
python train.py -task abs -mode train -bert_data_path BERT_DATA_PATH -dec_dropout 0.2 -model_path MODEL_PATH -sep_optim true -lr_bert 0.002 -lr_dec 0.2 -save_checkpoint_steps 2000 -batch_size 140 -train_steps 200000 -report_every 50 -accum_count 5 -use_bert_emb true -use_interval true -warmup_steps_bert 20000 -warmup_steps_dec 10000 -max_pos 512 -visible_gpus 0,1,2,3 -log_file ../logs/abs_bert_cnndm
```
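The `-warmup_steps_bert 20000` and `-lr_bert 0.002` flags control a Noam-style schedule: per the PreSumm paper, the learning rate rises linearly for `warmup` steps and then decays as `step^-0.5`, with separate schedules for the BERT encoder and the decoder (`-sep_optim true`). A small sketch of that formula (an interpretation of the paper, not code from this repo):

```python
# Noam-style warmup: lr = base_lr * min(step^-0.5, step * warmup^-1.5).
# Peaks at step == warmup, which is why warmup_steps_bert (20000) is
# larger than warmup_steps_dec (10000): the pretrained encoder should
# be updated more cautiously than the freshly initialized decoder.
def noam_lr(base_lr, step, warmup):
    return base_lr * min(step ** -0.5, step * warmup ** -1.5)

print(noam_lr(0.002, 20000, 20000))  # peak encoder learning rate
```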