News-summarization

Category: Feature extraction
Development tool: Jupyter Notebook
File size: 206KB
Downloads: 0
Upload date: 2020-02-24 02:42:02
Uploader: sh-1993
Description: A deep learning NLP project that summarizes news stories

File list:
Final_project_biLSTM_final.ipynb (229879, 2020-02-24)
Final_project_generate_embeddings.ipynb (169243, 2020-02-24)
Final_project_preprocessing.ipynb (383576, 2020-02-24)
model_diag.jpg (68341, 2020-02-24)

## News article summarization

Content summarization is an important research area in Natural Language Processing, and summarizing news articles is an important branch of it. News publications like Associated Press, Bloomberg and Reuters are actively working on automating stories in different beats such as finance and sports. Readers often don't have time to read through entire articles, and reading only the headline and sub-heading does not inform them of all the important points discussed in the article. It is also not feasible for news publications to prepare summaries of every article they publish. News apps could therefore benefit from built-in features that summarize articles for their readers.

In this project, we attempt to generate summaries of news articles by extracting the important sentences from each article. Our work was inspired by a research paper on Content Selection in Deep Learning Models of Summarization by Prof. Kathleen McKeown and Chris Kedzie: https://arxiv.org/pdf/1810.12343v2.pdf

Code from their research is available in the following repositories:

* https://github.com/kedz/nnsum
* https://github.com/kedz/summarization-datasets

The former repository is for model training and evaluation, and the latter is for preprocessing the different datasets used in the paper.

While the research paper uses corpora from different fields, such as news, personal stories, meetings and medical reports, we concentrate only on extractive summaries of news articles, using the CNN/Daily Mail corpus. Our analysis is of a smaller scale, so we chose a bidirectional LSTM model for this project. We used 23,000 articles to train the sentence embeddings and the model, 5,000 articles for validation and 5,000 articles for testing. The ROUGE-1 score for our project is 0.351.

We have used three notebooks for this project: two for preprocessing the data and one for training and evaluating the model.
The following image shows the architecture of the project.

![Architecture](https://github.com/haoshuai999/News-summarization/blob/master/model_diag.jpg)
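
For readers unfamiliar with the ROUGE-1 metric reported above, the following is a minimal sketch of how unigram overlap between a generated summary and a reference summary can be computed. It is an illustration only: the function name `rouge_1`, the lowercase whitespace tokenization, and the example sentences are assumptions, and the README does not say which tool or which variant (recall, precision or F1) produced the 0.351 figure.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """Unigram-overlap scores between a candidate and a reference summary.

    Hypothetical helper for illustration; tokenization is a simple
    lowercase whitespace split, which is an assumption.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()

    # Clipped overlap: each unigram counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())

    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example (made-up sentences, not taken from the CNN/Daily Mail corpus).
scores = rouge_1("the cat sat on the mat", "the cat is on the mat")
print(scores)
```

In this toy example, five of the six unigrams in each sentence overlap, so precision, recall and F1 all equal 5/6. A corpus-level score is typically an average of such per-article scores over the test set.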
