summarizer

Category: Feature extraction
Development tool: JavaScript
File size: 881KB
Downloads: 0
Upload date: 2017-03-23 19:52:05
Uploader: sh-1993
Description: A simple web app that summarizes any news article.

File list:
.DS_Store (6148, 2014-10-31)
api.py (1638, 2014-10-31)
gitflow (0, 2014-10-31)
static (0, 2014-10-31)
static\.DS_Store (6148, 2014-10-31)
static\css (0, 2014-10-31)
static\css\bootstrap.min.css (109518, 2014-10-31)
static\css\clean-blog.css (8770, 2014-10-31)
static\css\clean-blog.min.css (7065, 2014-10-31)
static\css\cover.css (3763, 2014-10-31)
static\css\style.css (269, 2014-10-31)
static\flash (0, 2014-10-31)
static\flash\ZeroClipboard.swf (4036, 2014-10-31)
static\fonts (0, 2014-10-31)
static\fonts\glyphicons-halflings-regular.eot (20335, 2014-10-31)
static\fonts\glyphicons-halflings-regular.svg (62927, 2014-10-31)
static\fonts\glyphicons-halflings-regular.ttf (41280, 2014-10-31)
static\fonts\glyphicons-halflings-regular.woff (23320, 2014-10-31)
static\img (0, 2014-10-31)
static\img\about-bg.jpg (33097, 2014-10-31)
static\img\contact-bg.jpg (290070, 2014-10-31)
static\img\home-bg.jpg (172779, 2014-10-31)
static\img\octocat-spinner-smil.min.svg (3941, 2014-10-31)
static\img\post-bg.jpg (140909, 2014-10-31)
static\img\post-sample-image.jpg (115144, 2014-10-31)
static\js (0, 2014-10-31)
static\js\bootstrap.min.js (31819, 2014-10-31)
static\js\clean-blog.js (42004, 2014-10-31)
static\js\clean-blog.min.js (17300, 2014-10-31)
static\js\docs.min.js (31663, 2014-10-31)
static\js\ie-emulation-modes-warning.js (2132, 2014-10-31)
static\js\ie10-viewport-bug-workaround.js (694, 2014-10-31)
static\js\script.js (277, 2014-10-31)
summarizer.py (3538, 2014-10-31)
templates (0, 2014-10-31)
templates\index.html (5085, 2014-10-31)
templates\post.html (4572, 2014-10-31)
... ...

## Synopsis

Summarize any news article.

The user flow is as follows:

- Choose a URL
- Enter the URL in the textbox
- Click on the “Summarize” button
- Read your summary

## Instructions

Once you have downloaded all the files, run this command:

```
python api.py
```

Then open `http://localhost:5000`.

## Features

- NLP using `TextBlob`
- Responsive design
- Articles are extracted using the `newspaper` library
- Main image and title are extracted as well
- Backend: `Flask` microframework

## Algorithm:

1. I split the text into sentences.
2. I calculate an individual score for each sentence and store it in a key-value dictionary, where the sentence itself is the key and the value is the total score. The total score is the sum of the sentence's intersections with all the other sentences in the text (not including itself).
3. I split the text into paragraphs (paragraphs have a minimum size of 3 sentences).
4. I choose the best sentence from each paragraph according to the sentence dictionary.

## Why is that working?

- The first (and obvious) reason is that a paragraph is a logical atomic unit of the text. In simple words: there is probably a very good reason why the author decided to split the text that way.
- If two sentences have a good intersection, they probably hold the same information. So if one sentence has a good intersection with many other sentences, it probably holds some information from each one of them. In other words, this is probably a key sentence in the text.

## Future improvements:

- Find a better way to extract article text
- Modify the intersection function and see how it improves the summarizer
- Play with the minimum size of a paragraph
- Improve UX/UI
- Group paragraphs considering their intersection
- Use only nouns in Jaccard distance
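
The four steps in the Algorithm section can be sketched in plain Python as follows. This is a minimal illustration, not the code in `summarizer.py`: the function names (`split_sentences`, `intersection_score`, `rank_sentences`, `summarize`) are hypothetical, sentences are split with a naive regular expression rather than `TextBlob`, and the intersection is assumed to be the number of shared words divided by the average number of distinct words in the two sentences.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def split_paragraphs(text):
    """Paragraphs are assumed to be separated by one or more blank lines."""
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]


def intersection_score(a, b):
    """Assumed intersection: shared words over the average distinct-word count."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / ((len(words_a) + len(words_b)) / 2)


def rank_sentences(text):
    """Map each sentence to the sum of its intersections with all other sentences."""
    sentences = [s for p in split_paragraphs(text) for s in split_sentences(p)]
    scores = {}
    for i, sentence in enumerate(sentences):
        scores[sentence] = sum(
            intersection_score(sentence, other)
            for j, other in enumerate(sentences)
            if i != j  # a sentence is not compared with itself
        )
    return scores


def summarize(text, min_sentences=3):
    """Return the best-scoring sentence of every paragraph that is long enough."""
    scores = rank_sentences(text)
    summary = []
    for paragraph in split_paragraphs(text):
        sentences = split_sentences(paragraph)
        if len(sentences) < min_sentences:
            continue  # paragraphs below the minimum size contribute nothing
        summary.append(max(sentences, key=lambda s: scores[s]))
    return summary


if __name__ == "__main__":
    article = (
        "The cat sat on the mat. The cat likes the mat. Dogs bark loudly.\n\n"
        "Rain fell on the city. The city was wet from rain. Nobody went out.\n\n"
        "Short one."
    )
    for line in summarize(article):
        print(line)
```

Because the scoring function is isolated in `intersection_score`, swapping in a different measure (for example a Jaccard index over nouns only, as listed under future improvements) would not require changes to the ranking or selection steps.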
