BertCommentSum

Category: Feature Extraction
Development tool: Python
File size: 324KB
Downloads: 0
Upload date: 2022-12-08 10:42:59
Uploader: sh-1993
Description: Bert Abstractive Summarization of Online News Discussion Threads

File list:
BETO (0, 2021-06-10)
LICENSE (1065, 2021-06-10)
bert_data (0, 2021-06-10)
bert_job.sh (1217, 2021-06-10)
bert_job_nofinetune.sh (1241, 2021-06-10)
bert_job_nofinetune_nolikes.sh (1283, 2021-06-10)
bert_job_nofinetune_notitle.sh (1279, 2021-06-10)
bert_job_nofinetune_notitle_nolikes.sh (1322, 2021-06-10)
bert_job_notitles.sh (1251, 2021-06-10)
bert_job_notitles_nolikes.sh (1274, 2021-06-10)
bert_job_tricomment.sh (1181, 2021-06-10)
bert_models (0, 2021-06-10)
bert_models\cased (0, 2021-06-10)
bert_models\cased\config.json (313, 2021-06-10)
bert_models\cased\vocab.txt (242120, 2021-06-10)
bert_models\uncased (0, 2021-06-10)
bert_models\uncased\config.json (313, 2021-06-10)
bert_models\uncased\vocab.txt (248047, 2021-06-10)
checker.py (264, 2021-06-10)
data (0, 2021-06-10)
data_sample.json (68096, 2021-06-10)
format_to_lines.sh (555, 2021-06-10)
json_data (0, 2021-06-10)
logs (0, 2021-06-10)
models (0, 2021-06-10)
requirements.txt (748, 2021-06-10)
results (0, 2021-06-10)
src (0, 2021-06-10)
src\cal_rouge.py (4275, 2021-06-10)
... ...

# Bert Abstractive Comment Summarization

**This code is based on PreSumm from the EMNLP 2019 paper [Text Summarization with Pretrained Encoders](https://arxiv.org/abs/1908.08345) ([repo](https://github.com/nlpyang/PreSumm/)).**

**Python version**: This code is written for Python 3.6.

**Package requirements**: torch==1.1.0, transformers, tensorboardX, multiprocess, pyrouge

To encode a text longer than 512 tokens (for example 800), set `max_pos` to 800 during both preprocessing and training.

Some code is borrowed from [ONMT](https://github.com/OpenNMT/OpenNMT-py).

## BERT Model Used

[BETO](https://github.com/dccuchile/beto), from [Spanish Pre-Trained BERT Model and Evaluation Data](https://users.dcc.uchile.cl/~jperez/papers/pml4dc2020.pdf).

### Preprocessing

#### Sentence Splitting and Tokenization

```
python preprocess.py -mode tokenize -raw_path RAW_PATH -save_path TOKENIZED_PATH
```

* `RAW_PATH` is the directory containing story files (`../raw_stories`); `TOKENIZED_PATH` is the target directory for the generated tokenized files (`../merged_stories_tokenized`).

#### Format to Simpler Json Files

```
python preprocess.py -mode format_to_lines -raw_path RAW_PATH -save_path JSON_PATH -n_cpus 1 -use_bert_basic_tokenizer false -map_path MAP_PATH
```

* `RAW_PATH` is the directory containing tokenized files (`../merged_stories_tokenized`); `JSON_PATH` is the target directory to save the generated json files (`../json_data/cnndm`); `MAP_PATH` is the directory containing the urls files (`../urls`).

#### Format to PyTorch Files

```
python preprocess.py -mode format_to_bert -raw_path JSON_PATH -save_path BERT_DATA_PATH -lower -n_cpus 1 -log_file ../logs/preprocess.log
```

* `JSON_PATH` is the directory containing json files (`../json_data`); `BERT_DATA_PATH` is the target directory to save the generated binary files (`../bert_data`).

## Model Training

**First run: use a single GPU the first time so the code can download the BERT model. Pass `-visible_gpus -1`; after the download finishes, you can kill the process and rerun the code with multiple GPUs.**

### Abstractive Setting

#### BertAbs

```
python train.py -task abs -mode train -bert_data_path BERT_DATA_PATH -dec_dropout 0.2 -model_path MODEL_PATH -sep_optim true -lr_bert 0.002 -lr_dec 0.2 -save_checkpoint_steps 2000 -batch_size 140 -train_steps 200000 -report_every 50 -accum_count 5 -use_bert_emb true -use_interval true -warmup_steps_bert 20000 -warmup_steps_dec 10000 -max_pos 512 -visible_gpus 0,1,2,3 -log_file ../logs/abs_bert_cnndm
```
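As a reference, the three preprocessing steps above can be chained into one script. This is a minimal sketch using the example directories mentioned above (`../raw_stories`, `../merged_stories_tokenized`, `../json_data`, `../urls`, `../bert_data`); the variable names and the `cnndm` shard prefix are illustrative assumptions, so adapt them to your own layout.

```
# Sketch: full preprocessing pipeline end to end (paths are examples, adjust as needed)
RAW_PATH=../raw_stories                     # raw story files
TOKENIZED_PATH=../merged_stories_tokenized  # output of the tokenize step
JSON_DIR=../json_data                       # simpler json shards
BERT_DATA_PATH=../bert_data                 # binary files for training
MAP_PATH=../urls                            # urls / split mapping files

# 1. Sentence splitting and tokenization
python preprocess.py -mode tokenize -raw_path $RAW_PATH -save_path $TOKENIZED_PATH

# 2. Format to simpler json files
python preprocess.py -mode format_to_lines -raw_path $TOKENIZED_PATH -save_path $JSON_DIR/cnndm \
  -n_cpus 1 -use_bert_basic_tokenizer false -map_path $MAP_PATH

# 3. Format to PyTorch files
python preprocess.py -mode format_to_bert -raw_path $JSON_DIR -save_path $BERT_DATA_PATH \
  -lower -n_cpus 1 -log_file ../logs/preprocess.log
```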

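For the first run described above, the same BertAbs command can be launched as a single process so the pretrained BERT weights download cleanly; only `-visible_gpus` changes. A sketch, not an additional configuration:

```
# Sketch: first run with -visible_gpus -1 so the BERT weights can download;
# kill the process once the download finishes and relaunch the multi-GPU command above.
python train.py -task abs -mode train -bert_data_path BERT_DATA_PATH -dec_dropout 0.2 \
  -model_path MODEL_PATH -sep_optim true -lr_bert 0.002 -lr_dec 0.2 \
  -save_checkpoint_steps 2000 -batch_size 140 -train_steps 200000 -report_every 50 \
  -accum_count 5 -use_bert_emb true -use_interval true -warmup_steps_bert 20000 \
  -warmup_steps_dec 10000 -max_pos 512 -visible_gpus -1 -log_file ../logs/abs_bert_cnndm
```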