news-combinator
Category: matlab programming
Development tool: C++
File size: 0KB
Downloads: 0
Upload date: 2020-09-30 21:29:06
Uploader: sh-1993
Description: Uses crawlers to fetch news, combines similar stories, and displays their comments from different websites.
File list:
chnsegmt/ (0, 2016-11-26)
chnsegmt/basicfuncs.py (2523, 2016-11-26)
chnsegmt/categorize.py (3907, 2016-11-26)
chnsegmt/extracttags.py (1401, 2016-11-26)
chnsegmt/findsimilarpassage.py (2280, 2016-11-26)
chnsegmt/getabstract.py (3970, 2016-11-26)
chnsegmt/jieba_example/ (0, 2016-11-26)
chnsegmt/jieba_example/1.log (89762, 2016-11-26)
chnsegmt/jieba_example/2.log (77951, 2016-11-26)
chnsegmt/jieba_example/dict/ (0, 2016-11-26)
chnsegmt/jieba_example/dict/userdict.txt (110, 2016-11-26)
chnsegmt/jieba_example/docs/ (0, 2016-11-26)
chnsegmt/jieba_example/docs/000913.json (5064, 2016-11-26)
chnsegmt/jieba_example/docs/000913.tags (72, 2016-11-26)
chnsegmt/jieba_example/docs/example/ (0, 2016-11-26)
chnsegmt/jieba_example/docs/example/1-1-29416628.json (3819, 2016-11-26)
chnsegmt/jieba_example/docs/example/1-1-29416628.tags (87, 2016-11-26)
chnsegmt/jieba_example/docs/example/1-辽宁阜新官员涉嫌淫乱事件举报者被刑拘.txt (966, 2016-11-26)
chnsegmt/jieba_example/docs/example/18届三中UTF8.txt (66140, 2016-11-26)
chnsegmt/jieba_example/docs/example/18届三中全会.txt (44267, 2016-11-26)
chnsegmt/jieba_example/docs/example/4-English.txt (5593, 2016-11-26)
chnsegmt/jieba_example/docs/example/中英文混杂示例.txt (157, 2016-11-26)
chnsegmt/jieba_example/docs/example/用户词典.txt (966, 2016-11-26)
chnsegmt/jieba_example/jb_f1_cut.py (532, 2016-11-26)
chnsegmt/jieba_example/jb_f2_userdict.py (708, 2016-11-26)
chnsegmt/jieba_example/jb_f3_extracttags.py (523, 2016-11-26)
chnsegmt/jieba_example/jb_f4_posseg.py (289, 2016-11-26)
chnsegmt/jieba_example/jb_f5_parallel.py (355, 2016-11-26)
chnsegmt/jieba_example/jb_f6_tokenize.py (661, 2016-11-26)
chnsegmt/jieba_example/jb_f7_chnanalyserforwhoosh.py (1725, 2016-11-26)
chnsegmt/jieba_example/test_extracttags.py (139, 2016-11-26)
chnsegmt/jieba_example/test_optionparser.py (417, 2016-11-26)
chnsegmt/jieba_example/tmp/ (0, 2016-11-26)
chnsegmt/jieba_example/tmp/MAIN_WRITELOCK (0, 2016-11-26)
chnsegmt/jieba_example/tmp/MAIN_umsn2xib579lmm0w.seg (12036, 2016-11-26)
chnsegmt/jieba_example/tmp/_MAIN_1.toc (1764, 2016-11-26)
chnsegmt/main.py (1209, 2016-11-26)
... ...
News Crawler
============
A repository for crawling news and then combining similar stories.
It is currently written in Python, using the Scrapy framework for crawling and Jieba for Chinese word segmentation.
Version 1.0 is in the current directory.
Version 2.0 is in the `reconstruction` directory.
TODO
====
1. Find a suitable Chinese word segmentation tool
2. Split a JSON file into small files, each containing only one news item
3. Read some articles about SVM (SVM was not used in the end; TF-IDF and cosine similarity were used instead)
4. Try to categorize different news
5. Build another IDF dictionary from web news
6. Categorize similar passages that appear on Sina and NetEase but not on Tencent
7. Build a website that displays the results and shows the comments
8. Improve categorization performance
9. Make the website more beautiful!
10. Reconstruct this project with PHP
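
Item 3 above mentions TF-IDF and cosine similarity as the way similar news is matched. The following is only a minimal sketch of that general technique, not the code in `findsimilarpassage.py`. It assumes every article has already been segmented into a list of tokens (for example by Jieba). The function names, the smoothed IDF formula, and the `0.5` threshold are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def build_idf(docs):
    """Compute a smoothed IDF value for every term in a tokenized corpus."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    # Assumed smoothing: log((1 + N) / (1 + df)) + 1, which is always positive.
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0
            for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Map a token list to a sparse {term: tf * idf} vector."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: (c / total) * idf.get(t, 0.0) for t, c in counts.items()}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def find_similar(docs, threshold=0.5):
    """Return (i, j, score) for every document pair scoring >= threshold."""
    idf = build_idf(docs)
    vectors = [tfidf_vector(d, idf) for d in docs]
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            score = cosine_similarity(vectors[i], vectors[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    # Toy corpus of pre-segmented "articles" (hypothetical data).
    corpus = [
        ["city", "opens", "new", "subway", "line"],
        ["new", "subway", "line", "opens", "in", "city"],
        ["team", "wins", "football", "final"],
    ]
    for i, j, score in find_similar(corpus):
        print(i, j, round(score, 3))
```

Comparing every pair is quadratic in the number of articles, which is acceptable for a small daily crawl. A larger corpus would probably need an inverted index or a similar shortcut to limit the number of comparisons.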
Website
=======
http://news.reetsee.com