Information_retrieva_Projectl-

Category: Clustering algorithms
Development tool: Python
File size: 5906KB
Downloads: 0
Upload date: 2016-08-02 14:20:13
Uploader: sh-1993
Description: News retrieval: a crawler performs targeted collection from 3-4 web pages and implements extraction, retrieval, and indexing of the page information. The collection must contain at least 10 pages; results can be sorted by attributes such as time, relevance, and popularity, and similar topics are clustered automatically. Supported features: related search recommendations, snippet generation, and result preview (hovering the mouse over a result shows a pre...

File list:
.idea (0, 2016-08-02)
.idea\Information_retrieva_Projectl-.iml (398, 2016-08-02)
.idea\codeStyleSettings.xml (270, 2016-08-02)
.idea\misc.xml (208, 2016-08-02)
.idea\modules.xml (312, 2016-08-02)
.idea\vcs.xml (180, 2016-08-02)
.idea\workspace.xml (42009, 2016-08-02)
Dictionary.py (9320, 2016-08-02)
LICENSE (1098, 2016-08-02)
News_Recommend.py (12612, 2016-08-02)
config.py (2087, 2016-08-02)
crawl (0, 2016-08-02)
crawl\__init__.py (0, 2016-08-02)
crawl\items.py (699, 2016-08-02)
crawl\pipelines.py (506, 2016-08-02)
crawl\scrapy.cfg (260, 2016-08-02)
crawl\settings.py (2992, 2016-08-02)
crawl\spiders (0, 2016-08-02)
crawl\spiders\__init__.py (161, 2016-08-02)
crawl\spiders\netease_spider.py (4184, 2016-08-02)
crawl\spiders\toutiao_spider.py (2451, 2016-08-02)
data (0, 2016-08-02)
data\stopword.txt (8116, 2016-08-02)
data\test.txt~ (0, 2016-08-02)
inverted_files.py (8590, 2016-08-02)
main.py (2379, 2016-08-02)
merge_inverted_files.py (7387, 2016-08-02)
report.pdf (5833012, 2016-08-02)
scrapy.cfg (254, 2016-08-02)
screenshot (0, 2016-08-02)
screenshot\2016-05-29 011426屏幕截图.png (274704, 2016-08-02)
screenshot\2016-05-29 15_59_57____________.png (214916, 2016-08-02)
screenshot\2016-05-29 16_00_35____________.png (246821, 2016-08-02)
screenshot\2016-05-29 20_10_07____________.png (14915, 2016-08-02)
screenshot\2016-05-29 20_40_30____________.png (74777, 2016-08-02)
similar_doc.py (11809, 2016-08-02)
web (0, 2016-08-02)
... ...

# Information_retrieva_Projectl-

News retrieval: targeted collection from 3-4 web pages, with extraction, retrieval, and indexing of the page information. The collection must contain at least 10 pages; results can be sorted by attributes such as time, relevance, and popularity, and similar topics are clustered automatically. Required features: related search recommendations, snippet generation, and result preview (hovering the mouse over a result shows a preview).

# Dependencies

scrapy
Install: `pip install Scrapy`

webpy
Install: `sudo easy_install web.py`
Website: http://webpy.org/

jieba
Install: `pip install jieba`
Website: https://pypi.python.org/pypi/jieba

Data: 100,000 NetEase news pages, the inverted index, and other data are available on Baidu Netdisk: http://pan.baidu.com/s/1gfkDb4B
After downloading, place the data folder in the Information_retrieva_Projectl- directory.

# Usage

Interactive query: on Linux, cd into the web/ folder and run `python main.py` in a terminal.
Then open http://0.0.0.0:8080/ in a browser.

# References

1. scrapy manual: http://scrapy-chs.readthedocs.org/zh_CN/1.0/intro/tutorial.html
2. webpy manual: http://webpy.org/

# Screenshots

![image](https://github.com/Google1234/Information_retrieva_Projectl-/raw/master/screenshot/2016-05-29%2020_10_07____________.png)
![image](https://github.com/Google1234/Information_retrieva_Projectl-/raw/master/screenshot/2016-05-29%2020_40_30____________.png)
![image](https://github.com/Google1234/Information_retrieva_Projectl-/raw/master/screenshot/2016-05-29%20011426%E5%B1%8F%E5%B9%95%E6%88%AA%E5%9B%BE.png)
![image](https://github.com/Google1234/Information_retrieva_Projectl-/raw/master/screenshot/2016-05-29%2015_59_57____________.png)

For more technical details and study materials, see the report file (report.pdf).
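To illustrate the indexing, relevance ranking, and snippet generation mentioned above, here is a minimal sketch. It is not the project's code: the function names, the sample documents, and the simple TF-IDF scoring are all assumptions made for illustration, and a regex tokenizer stands in for jieba word segmentation so the example runs without extra dependencies.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase word tokenizer (hypothetical stand-in for jieba segmentation)."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs):
    """Map each term to a posting dict {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Rank documents by a simple TF-IDF score summed over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties are broken by ascending doc_id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def make_snippet(text, query, width=40):
    """Return a window of the text around the earliest query-term match.

    Matching is by substring, so this is only a rough approximation.
    """
    lowered = text.lower()
    positions = [lowered.find(term) for term in tokenize(query)]
    positions = [p for p in positions if p >= 0]
    if not positions:
        return text[:width]
    start = max(0, min(positions) - width // 2)
    return text[start:start + width]


# Made-up sample documents, for illustration only.
docs = {
    1: "Stock markets rally as tech shares climb",
    2: "Local team wins the football final",
    3: "Tech giants report record earnings and shares jump",
}
index = build_inverted_index(docs)
for doc_id, score in search(index, len(docs), "tech shares"):
    print(doc_id, round(score, 3), make_snippet(docs[doc_id], "tech shares"))
```

Sorting by time or popularity would only change the sort key in `search`, provided each document carried those fields; the sketch keeps relevance only to stay short.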
