News-Article-LDA

Category: Feature extraction
Development tool: Python
File size: 4 KB
Downloads: 0
Upload date: 2019-01-22 01:26:57
Uploader: sh-1993
Description: Basic (unsupervised) LDA model involving text cleaning of news articles

File list:
lda_model.py (3677, 2019-01-22)
text_cleaning.py (2779, 2019-01-22)

# News-Article-LDA

This is a basic (unsupervised) LDA model built with Python's gensim library. Most of the work lies in preparing the data for the model: news article data can be extremely messy, both because of the scraping process and because of irrelevant terms that may skew the identified topics and keywords.

# Observed Performance

Using a dataset of 4776 news articles about Airbnb, collected from the top 12 US news publications, the model was able to distinguish the following topics and keywords (the model defaults to 10 topics, of which 6 are displayed, with 10 keywords per displayed topic):

Topic 0:

0.021*"trump" + 0.021*"said" + 0.018*"state" + 0.015*"presid" + 0.013*"tax" + 0.011*"bill" + 0.010*"would" + 0.010*"polit" + 0.010*"hous" + 0.010*"support"

Topic 1:

0.016*"travel" + 0.008*"trip" + 0.008*"hotel" + 0.007*"day" + 0.006*"com" + 0.006*"tour" + 0.005*"restaur" + 0.005*"flight" + 0.005*"one" + 0.005*"like"

Topic 2:

0.013*"internet" + 0.013*"appl" + 0.012*"facebook" + 0.012*"googl" + 0.011*"media" + 0.011*"app" + 0.011*"ad" + 0.010*"post" + 0.009*"user" + 0.009*"amazon"

Topic 3:

0.052*"airbnb" + 0.030*"rental" + 0.023*"hotel" + 0.021*"citi" + 0.018*"home" + 0.016*"rent" + 0.016*"said" + 0.015*"host" + 0.012*"new" + 0.012*"term"

Topic 4:

0.011*"one" + 0.010*"peopl" + 0.009*"say" + 0.008*"time" + 0.007*"go" + 0.007*"year" + 0.007*"like" + 0.006*"would" + 0.006*"get" + 0.006*"think"

Topic 5:

0.019*"home" + 0.016*"said" + 0.012*"hous" + 0.009*"live" + 0.009*"citi" + 0.008*"build" + 0.008*"year" + 0.007*"park" + 0.007*"rent" + 0.007*"space"

# Note about Next Steps

I am currently looking into semi-supervised LDA models as described in the following resources:

http://www.cs.cmu.edu/~bbd/Ramnath-ecml-paper.pdf

https://medium.freecodecamp.org/how-we-changed-unsupervised-lda-to-semi-supervised-guidedlda-e36a95f3a1***

as well as into reducing the corpus to nouns or other specific part-of-speech tags using spaCy, as described in:

http://aclweb.org/anthology/U15-1013

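
The topic strings listed under Observed Performance use the `weight*"term"` format that gensim prints. To compare topics programmatically, for example to spot stems shared between topics, they can be parsed back into pairs. The sketch below uses only the standard library; the `parse_topic` helper and the `TERM` pattern are hypothetical names, and the two strings are shortened copies of Topic 1 and Topic 3 above.

```python
import re

# Matches one `weight*"term"` item, e.g. 0.052*"airbnb"
TERM = re.compile(r'(\d*\.?\d+)\*"([^"]+)"')


def parse_topic(line):
    """Parse a printed topic string into a list of (term, weight) pairs."""
    return [(term, float(weight)) for weight, term in TERM.findall(line)]


topic_1 = '0.016*"travel" + 0.008*"trip" + 0.008*"hotel" + 0.007*"day"'
topic_3 = '0.052*"airbnb" + 0.030*"rental" + 0.023*"hotel" + 0.021*"citi"'

terms_1 = dict(parse_topic(topic_1))
terms_3 = dict(parse_topic(topic_3))

# Stems that appear in both topics hint at overlap between them.
shared = sorted(set(terms_1) & set(terms_3))
print(shared)
```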