NodeJsNewsCrawler

Category: Search engine
Development tool: JavaScript
File size: 472KB
Downloads: 0
Upload date: 2017-05-23 02:16:33
Uploader: sh-1993
Description: Search engine for news in Node.js

File list:
.node-persist (0, 2017-05-23)
.node-persist\storage (0, 2017-05-23)
.node-persist\storage\5d7e39248b8539d31db2992106fdc551 (41094, 2017-05-23)
.vscode (0, 2017-05-23)
.vscode\tasks.json (620, 2017-05-23)
CreateHeadlines.js (1519, 2017-05-23)
Frontend (0, 2017-05-23)
Frontend\index.html (4024, 2017-05-23)
Frontend\script.js (5895, 2017-05-23)
Frontend\style.css (1142, 2017-05-23)
Includes (0, 2017-05-23)
Includes\AnalyzeModule (0, 2017-05-23)
Includes\AnalyzeModule\SentimentModule (0, 2017-05-23)
Includes\AnalyzeModule\SentimentModule\SentimentAnalysis.js (2422, 2017-05-23)
Includes\AnalyzeModule\SentimentModule\testSentiment.js (185, 2017-05-23)
Includes\AnalyzeModule\testEntityExtraction.js (5263, 2017-05-23)
Includes\Article.js (929, 2017-05-23)
Includes\DataAPI.js (11248, 2017-05-23)
Includes\DataManager.js (6492, 2017-05-23)
Includes\DatabaseModule (0, 2017-05-23)
Includes\DatabaseModule\ElasticAPI.js (2754, 2017-05-23)
Includes\DatabaseModule\Tests (0, 2017-05-23)
Includes\DatabaseModule\Tests\testIndexArticles.js (167, 2017-05-23)
Includes\DatabaseModule\Tests\testInitDb.js (99, 2017-05-23)
Includes\Download.js (938, 2017-05-23)
Includes\Facebook.js (3314, 2017-05-23)
Includes\HeadlineWriter.js (4746, 2017-05-23)
Includes\Link.js (1847, 2017-05-23)
Includes\LinkScanner.js (1716, 2017-05-23)
Includes\ProcessLink.js (2833, 2017-05-23)
Includes\RetrievalModule (0, 2017-05-23)
Includes\RetrievalModule\Analyze (0, 2017-05-23)
Includes\RetrievalModule\Analyze\Article.js (3682, 2017-05-23)
Includes\RetrievalModule\Analyze\LinkCollection.js (1471, 2017-05-23)
Includes\RetrievalModule\Download (0, 2017-05-23)
Includes\RetrievalModule\Download\DownloadQueue.js (2841, 2017-05-23)
Includes\RetrievalModule\Test (0, 2017-05-23)
Includes\RetrievalModule\Test\testArticle.js (612, 2017-05-23)
... ...

# NewsCrawler

This software scans a given set of news sources and extracts their headlines. The headlines are processed and saved for later analysis. The data can be accessed from a web frontend or a REST API. It's all about the analysis of news trends.

## Features

* Downloads article titles from the news sources listed in data/sources.json
* Saves them in a Redis database
* Creates an inverted index for search
* Creates left/right neighbour relations for every word (and day)
* Creates same headline relations for every word (and day)
* Counts the occurrences of every word (and day)
* Finds the most popular words for the day
* Provides a REST API for data access
* Web frontend which supports most of the backend functionalities
* The frontend is responsive
* Example news bot for Facebook

## Rest API

These are the API endpoints

```
/api/search/:query
/api/rightneighbour/:word
/api/rightneighbour/:word/:day
/api/leftneighbour/:word
/api/leftneighbour/:word/:day
/api/sameheadline/:query
/api/sameheadline/:query/:day
/api/count/:word
/api/popularwords/:day
/api/popularwords/
/api/popularwordhistory/:word
/api/link/:id
/api/sources/
```

To get today's value for the `:day` parameter, you can use the following function

``` javascript
// Returns the number of whole days since the Unix epoch (UTC)
const getToday = function () {
    return Math.floor(Date.now() / 1000 / 60 / 60 / 24);
};
```

## Backend Dependencies (npm)

* express
* express-rest
* cheerio
* redis
* request
* node-schedule
* fbgraph
* nlp_compromise

## Frontend Dependencies

* bootstrap
* jQuery

## Setup

Download the files, install Redis, install the npm dependencies and run 'Start.js'. You can also specify the news sources in data/sources.json. After startup, navigate with your favorite browser to http://localhost:3000 to see the frontend. If you want to use the bots, you have to create the config.json in the /data folder. There is an example config in the same folder.
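
As a rough illustration of how the endpoints listed under Rest API could be addressed once the server is reachable at http://localhost:3000, the sketch below builds request URLs. The `buildApiUrl` helper and the `BASE_URL` constant are hypothetical names introduced here for illustration; they are not part of the project. The response format is not documented in this README, so the sketch stops at constructing the URL.

```javascript
// Assumption: the server runs locally on port 3000, as described under Setup.
const BASE_URL = 'http://localhost:3000';

// Day index as used by the API: whole days since the Unix epoch (UTC).
// The optional parameter only exists to make the function easy to test.
function getToday(nowMs = Date.now()) {
    return Math.floor(nowMs / 1000 / 60 / 60 / 24);
}

// Hypothetical helper: joins an endpoint name and its path segments,
// e.g. ('rightneighbour', word, day) -> /api/rightneighbour/:word/:day.
// Each segment is URL-encoded so that queries with spaces stay valid.
function buildApiUrl(endpoint, ...segments) {
    const encoded = segments.map((s) => encodeURIComponent(String(s)));
    return [BASE_URL, 'api', endpoint, ...encoded].join('/');
}

console.log(buildApiUrl('rightneighbour', 'trump', getToday()));
console.log(buildApiUrl('search', 'press conference'));
console.log(buildApiUrl('popularwords', getToday()));
```

The resulting strings can be opened in a browser or passed to any HTTP client.
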
## Usage

To run the webserver and the downloader / processor, just execute the 'Start.js' file

```
> node Start.js
```

## Facebook bot

It is possible to create news bots for Facebook pages with this framework. I created an example bot. Create a /data/config.json file for your Facebook page and run

```
> node WhoGotKilledBot.js
```

If you want to see a bot in action: https://www.facebook.com/pg/whogotkilled/

The WhoGotKilledBot searches for news containing the word "killed" and tries to extract who got killed and where. It posts these articles to the Facebook page every 10 minutes and avoids duplicate posts. There is also a TrumpNews bot.

## Generating the most recent headlines

Try generating the most important headlines of the day by executing 'CreateHeadlines.js'. This is highly experimental

```
> node CreateHeadlines.js
```

```
[ [ 'steven', 'mnuchin' ],
  [ 'womens', 'march' ],
  [ 'novak', 'djokovic', 'upset' ],
  [ 'largest', 'student', 'loan' ],
  [ 'agriculture', 'secretary' ],
  [ 'los', 'angeles' ],
  [ 'developer', 'rick', 'perry' ],
  [ 'treasury', 'pick' ],
  [ 'press', 'conference' ],
  [ 'chelsea', 'manning' ],
  [ 'avalanche', 'buries', 'hotel' ],
  [ 'second', 'round' ],
  [ 'full', 'article' ],
  [ 'takes', 'office' ],
  [ 'australian', 'open' ],
  [ 'least', '30', 'firefighters', 'killed' ],
  [ 'tehran', 'high', 'rise', 'collapses' ],
  [ 'peoples', 'choice', 'awards' ],
  [ 'first', 'lady' ],
  [ 'more', 'than', '100', 'lapd' ],
  [ 'donald', 'trumps', 'inauguration', 'day' ],
  ...
```

## Frontend

![alt tag](https://raw.githubusercontent.com/MoritzGoeckel/NodeJSNewsCrawler/master/docs/newsscreen.PNG)
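
To make the word counts and left/right neighbour relations from the Features list more concrete, here is a minimal sketch of how such relations could be derived from a list of headlines. It is not the project's implementation: in-memory `Map` objects stand in for Redis, the tokenizer is a simplification, the per-day dimension is left out, and the names `tokenize`, `bump` and `analyzeHeadlines` as well as the sample headlines are made up for illustration.

```javascript
// Simplified tokenizer (an assumption): lowercase, keep runs of letters/digits.
function tokenize(headline) {
    return headline.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Increments a counter in a nested Map: outer key -> inner key -> count.
function bump(map, outer, inner) {
    if (!map.has(outer)) map.set(outer, new Map());
    const innerMap = map.get(outer);
    innerMap.set(inner, (innerMap.get(inner) || 0) + 1);
}

// Builds word counts and neighbour relations for a list of headlines.
function analyzeHeadlines(headlines) {
    const counts = new Map();          // word -> number of occurrences
    const rightNeighbours = new Map(); // word -> (following word -> count)
    const leftNeighbours = new Map();  // word -> (preceding word -> count)

    for (const headline of headlines) {
        const words = tokenize(headline);
        words.forEach((word, i) => {
            counts.set(word, (counts.get(word) || 0) + 1);
            if (i > 0) bump(leftNeighbours, word, words[i - 1]);
            if (i < words.length - 1) bump(rightNeighbours, word, words[i + 1]);
        });
    }
    return { counts, rightNeighbours, leftNeighbours };
}

// Made-up sample headlines, for illustration only.
const result = analyzeHeadlines([
    'Novak Djokovic upset at Australian Open',
    'Australian Open: second round results',
]);
console.log(result.counts.get('australian'));
console.log([...result.rightNeighbours.get('australian')]);
```

Following the most frequent right neighbour from word to word yields short sequences of words; whether 'CreateHeadlines.js' works this way is not stated in this README.
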
