News-Event-Classification

Category: matlab programming
Development tool: Python
File size: 0KB
Downloads: 0
Upload date: 2015-01-04 15:17:19
Uploader: sh-1993
Description: Classification and detection of polarizing events in the news

File list:
Event_Classify.py (3304, 2015-01-04)
Exploratory notebook.ipynb (12422853, 2015-01-04)
attach_event_score.py (2046, 2015-01-04)
attach_meta_data.py (1470, 2015-01-04)
attach_subtopics.py (1672, 2015-01-04)
classify_and_score_articles.py (4141, 2015-01-04)
classify_new_document.py (2012, 2015-01-04)
clean_scraped_data.py (6212, 2015-01-04)
combine_multiple_topics.py (313, 2015-01-04)
data/ (0, 2015-01-04)
data/nyt_abortion_data.csv (4919, 2015-01-04)
data/nyt_marijuana_data.csv (60825, 2015-01-04)
event_app.py (13992, 2015-01-04)
explore_nmf_topics.py (2278, 2015-01-04)
fonts/ (0, 2015-01-04)
fonts/glyphicons-halflings-regular.eot (20335, 2015-01-04)
fonts/glyphicons-halflings-regular.svg (62926, 2015-01-04)
fonts/glyphicons-halflings-regular.ttf (41280, 2015-01-04)
fonts/glyphicons-halflings-regular.woff (23320, 2015-01-04)
get_event_score_ranges.py (974, 2015-01-04)
graphs/ (0, 2015-01-04)
graphs/final_abortion.png (80731, 2015-01-04)
graphs/final_aca.png (59332, 2015-01-04)
graphs/final_ferguson.png (45783, 2015-01-04)
graphs/final_gay.png (72458, 2015-01-04)
graphs/final_gun_control.png (97198, 2015-01-04)
graphs/final_immigration.png (38789, 2015-01-04)
graphs/final_marijuana.png (132683, 2015-01-04)
graphs/final_palestine.png (38234, 2015-01-04)
graphs/final_terrorism.png (53371, 2015-01-04)
pickles/ (0, 2015-01-04)
pickles/google_data.pkl (8704, 2015-01-04)
pickles/last_update.pkl (76, 2015-01-04)
pickles/model.pkl (512100, 2015-01-04)
pickles/vectorizer.pkl (1908816, 2015-01-04)
presentation/ (0, 2015-01-04)
presentation/20- TedPetrou.key (757336, 2015-01-04)
presentation/bigger_Plot.png (45307, 2015-01-04)
presentation/event_score.csv (114403, 2015-01-04)
... ...

News Event Classification
=========

Access the app here: [www.NewVentify.com](http://www.newventify.com)

#### Screenshot of app
![alt tag](https://github.com/tdpetrou/News-Event-Classification/blob/master/static/preview.png)

### App Purpose
Online news sites are good at producing relevant articles when given a search term. Some sites even classify articles into subtopics. What is missing is a measure of the degree to which an event has occurred. The image below is a Fox News search result that is typical of major news outlets. Though the articles returned to the user are relevant to the search term, there is no way to search by the degree of the event within the topic searched.

![alt tag](https://raw.githubusercontent.com/tdpetrou/News-Event-Classification/master/static/news_search.png)

Instead of showing articles by relevance, NewVentify returns articles ranked by the degree of the events they contain.

### Specifics
NewVentify presents the user with a list of major political and social issues (Obamacare, gun control, abortion, marijuana, etc.) to choose from. Once a major topic is selected, the user is presented with a choice of subtopics. If the user has chosen marijuana, the subtopics 'drug cartels', 'marijuana legalization', 'drugs in sports', 'drugs and kids', etc. are offered. The last step requires the user to choose a date range for the articles. The articles returned are split into two categories, each showing the highest-ranked articles for one event. For instance, for the drug cartel subtopic, articles best representing the seizure, capture and general confiscation of illegal drugs are displayed on one side of the screen. On the opposite side of the screen are articles best representing catastrophes caused by drug cartels.
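The two-sided result described above can be sketched in a few lines of Python. This is only an illustration, not the app's actual code: the record fields (`title`, `published`, `event_score`), the function name `split_by_event`, the sample articles, and the convention that positive and negative scores mark the two opposing events are all assumptions.

```python
from datetime import date

# Hypothetical article records; field names and values are illustrative only.
articles = [
    {"title": "Cartel leader captured", "published": date(2014, 11, 3), "event_score": 4.5},
    {"title": "Record drug seizure at border", "published": date(2014, 11, 10), "event_score": 3.0},
    {"title": "Cartel violence leaves town in ruins", "published": date(2014, 11, 12), "event_score": -4.0},
    {"title": "Old story outside the range", "published": date(2013, 1, 1), "event_score": 5.0},
    {"title": "Neutral policy explainer", "published": date(2014, 11, 15), "event_score": 0.0},
]


def split_by_event(articles, start, end, top_n=5):
    """Return (positive_side, negative_side) for articles published in [start, end].

    Assumes a positive score marks one event and a negative score the other;
    each side is ordered from strongest to weakest and capped at top_n.
    """
    in_range = [a for a in articles if start <= a["published"] <= end]
    positive = sorted(
        (a for a in in_range if a["event_score"] > 0),
        key=lambda a: a["event_score"],
        reverse=True,
    )[:top_n]
    negative = sorted(
        (a for a in in_range if a["event_score"] < 0),
        key=lambda a: a["event_score"],
    )[:top_n]
    return positive, negative


left, right = split_by_event(articles, date(2014, 11, 1), date(2014, 11, 30))
print([a["title"] for a in left])
print([a["title"] for a in right])
```

Articles with a score of exactly zero are dropped from both sides in this sketch, on the assumption that they represent neither event.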
### Step by step technical walk-through
A large collection of articles must be obtained to train the model that identifies subtopics.

* Articles from New York Times, NPR, Fox News, MSNBC and Google News are scraped, cleaned and combined into a single csv. Articles on 8 major topics were obtained through either the news site's API or its search engine.
* For each of the major topics a list of subtopics was generated. This was an iterative process using natural language processing to create a feature matrix of tf-idf scores, which was then split into topics with non-negative matrix factorization (NMF).
* Bar plots showing NMF scores next to the top 15 words for each subtopic were generated to make it easy to label each topic. A sample bar plot for words in the 4 subtopics from the gun category is shown below.

![alt tag](https://github.com/tdpetrou/News-Event-Classification/blob/master/static/nmf_words_by_topic.png)

* Once a final group of subtopics was selected, the NMF and tf-idf objects for each subtopic were saved (pickled) for use on future articles.
* Having topics is nice, but there needs to be a way to detect polarizing events within these articles. Another round of NMF can do this, but it does not work as well as a customized dictionary. Sentiment dictionaries do not work here either, as each subtopic contains domain-specific words that could mean very different things elsewhere. For instance, 'captured' in the 'drug cartel' category would be a very positive event, but possibly a negative one elsewhere.
* To deal with this domain specificity, a list of words is generated for each subtopic and graded on a scale from -5 to 5. Two events per subtopic were generated.
* Now that a training set of articles has been created, new articles can be scraped, assigned subtopics and event scores, and stored online in a MySQL database.
* This MySQL database is accessed through AJAX requests, which then display the articles by event type on the web.
* A daily cron job is run to execute scraping, cleaning, labeling and writing to the database.
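
The dictionary-based event scoring in the walk-through can be illustrated with a minimal sketch. The word grades below are invented for the example, and averaging the grades of matched words is an assumption: the description above says only that words are graded from -5 to 5, not how they are combined into an article score. The names `CARTEL_GRADES` and `event_score` are hypothetical.

```python
import re

# Hypothetical grades for a 'drug cartels' subtopic on the -5 to 5 scale.
# The real word list and grades are not given in this README.
CARTEL_GRADES = {
    "captured": 5,
    "seized": 4,
    "confiscated": 3,
    "kidnapped": -4,
    "killed": -4,
    "massacre": -5,
}


def event_score(text, grades):
    """Average the grades of dictionary words found in the text.

    Returns 0.0 when no graded word appears. Averaging is an assumed
    aggregation rule chosen for this sketch.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [grades[token] for token in tokens if token in grades]
    return sum(hits) / len(hits) if hits else 0.0


print(event_score("Police captured the leader and seized drugs", CARTEL_GRADES))  # 4.5
print(event_score("Gunmen killed four and kidnapped two", CARTEL_GRADES))  # -4.0
```

Because each subtopic gets its own dictionary, the same word can carry a different grade, or none at all, in another subtopic, which is the domain specificity the walk-through describes.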
