Headline-Similarity-by-Clustering

Category: Clustering algorithms
Development tool: Jupyter Notebook
File size: 3786KB
Downloads: 0
Upload date: 2019-09-22 11:12:50
Uploader: sh-1993
Description: Clustering and Matching News Headlines using K-means

File list:
.ipynb_checkpoints (0, 2019-09-22)
.ipynb_checkpoints\Clustering -checkpoint.ipynb (760458, 2019-09-22)
Clustering and Matching.ipynb (522366, 2019-09-22)
Clustering and Matching.py (4444, 2019-09-22)
news_training.xlsx (3685338, 2019-09-22)

# Clustering-Headlines

Clustering and Matching News Headlines using K-means

A 75K corpus of news headlines scraped from a news channel's API was used to build a model that clusters the headlines and matches new headlines to their closest counterparts in the corpus. The headlines were cleaned and processed by tokenizing, removing stop words, and lemmatizing. Features from the processed headlines were then vectorized and clustered with k-means; the elbow method put the optimum number of clusters at 3. A new headline is processed the same way, and the fitted model predicts its cluster. Its best matches are the corpus headlines in that cluster with the smallest Euclidean distance from the processed headline's coordinate point.
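The pipeline described above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the repository's code. The function names (`preprocess`, `vectorize`, `kmeans`, `best_match`), the sample headlines, and the tiny stop-word list are all hypothetical. Lemmatization is omitted, and a plain bag-of-words count vector is assumed because the description does not say which vectorizer was used.

```python
import math
import random
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the project's list is not given).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "for", "and", "as", "after", "at"}


def preprocess(headline):
    """Lowercase, tokenize on letters, drop stop words (lemmatization omitted)."""
    return [t for t in re.findall(r"[a-z]+", headline.lower()) if t not in STOP_WORDS]


def vectorize(tokens, vocab):
    """Bag-of-words count vector over a fixed vocabulary (assumed feature scheme)."""
    counts = Counter(tokens)
    return [float(counts[w]) for w in vocab]


def nearest(point, centroids):
    """Index of the centroid closest to `point` by Euclidean distance."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels, inertia)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[i])  # keep the old centroid if a cluster is empty
        if updated == centroids:
            break
        centroids = updated
    # Recompute labels so they are consistent with the final centroids.
    labels = [nearest(p, centroids) for p in points]
    inertia = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, inertia


def best_match(query, headlines, vocab, vectors, centroids, labels):
    """Predict the query's cluster, then return the closest headline in that cluster."""
    q = vectorize(preprocess(query), vocab)
    cluster = nearest(q, centroids)
    # Fall back to the whole corpus if the predicted cluster has no members.
    candidates = [i for i, lab in enumerate(labels) if lab == cluster] or list(range(len(labels)))
    best = min(candidates, key=lambda i: math.dist(q, vectors[i]))
    return cluster, headlines[best]


# Made-up sample corpus, standing in for the scraped headlines.
headlines = [
    "Stocks rally as markets rebound",
    "Markets slide after stocks tumble",
    "Team wins championship final",
    "Striker scores twice in cup final",
    "Storm brings heavy rain to coast",
    "Heavy rain floods coastal towns",
]
docs = [preprocess(h) for h in headlines]
vocab = sorted({t for d in docs for t in d})
vectors = [vectorize(d, vocab) for d in docs]

# Elbow method: inspect how inertia falls as k grows and pick the bend.
for k in range(1, 5):
    print(k, round(kmeans(vectors, k)[2], 2))

centroids, labels, _ = kmeans(vectors, 3)
print(best_match("Rain floods towns on the coast", headlines, vocab, vectors, centroids, labels))
```

On a real corpus the elbow is read off a plot of inertia against k. A library implementation of vectorization and k-means would normally replace the hand-written versions here.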
