news-webscraping

Category: Big Data
Development tool: Python
File size: 45 KB
Downloads: 0
Upload date: 2019-11-06 16:45:41
Uploader: sh-1993
Description: A Scrapy-based news crawler. It uses Redis and MongoDB to avoid duplicate crawling and to store the scraped data, and uses a proxy pool to get around anti-scraping measures. The saved fields are title, time, body, URL, author, source, and source URL. The crawl targets are four portal sites: NetEase, Tencent, Sina, and Sohu. The crawled section is news ...
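The deduplicate-then-store flow described above can be sketched as follows. This is a minimal illustration, not the project's actual `pipelines.py`: the class names, attribute names, the Redis key `news:seen`, and the choice of a SHA-1 URL fingerprint are all assumptions. The in-memory `FakeRedis` and `FakeCollection` stand in for a redis-py client and a PyMongo collection so the sketch runs without any servers; they mimic only `sadd` and `insert_one`.

```python
import hashlib
from dataclasses import dataclass, asdict


@dataclass
class NewsItem:
    """The seven saved fields from the description (attribute names are assumed)."""
    title: str
    time: str
    body: str
    url: str
    author: str
    source: str
    source_url: str


def url_fingerprint(url: str) -> str:
    """Hash the URL so the dedup set stores fixed-length keys (assumed scheme)."""
    return hashlib.sha1(url.strip().encode("utf-8")).hexdigest()


class DedupStorePipeline:
    """Hypothetical pipeline: skip URLs already seen, store the rest."""

    def __init__(self, redis_client, collection, key="news:seen"):
        self.redis = redis_client      # needs a sadd(key, value) method
        self.collection = collection   # needs an insert_one(doc) method
        self.key = key

    def process_item(self, item: NewsItem) -> bool:
        # SADD returns the number of members newly added: 0 means a duplicate.
        if self.redis.sadd(self.key, url_fingerprint(item.url)) == 0:
            return False
        self.collection.insert_one(asdict(item))
        return True


class FakeRedis:
    """In-memory stand-in that mimics only the return value of SADD."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, *values):
        members = self._sets.setdefault(key, set())
        before = len(members)
        members.update(values)
        return len(members) - before


class FakeCollection:
    """In-memory stand-in for a MongoDB collection."""

    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


if __name__ == "__main__":
    collection = FakeCollection()
    pipeline = DedupStorePipeline(FakeRedis(), collection)
    item = NewsItem("Example title", "2019-11-07 10:00", "Example body",
                    "https://example.com/a.html", "Example author",
                    "Example source", "https://example.com/")
    print(pipeline.process_item(item))  # first sighting of this URL
    print(pipeline.process_item(item))  # same URL again
    print(len(collection.docs))
```

In a real Scrapy project the method signature would be `process_item(self, item, spider)`, and a duplicate would typically be rejected by raising `scrapy.exceptions.DropItem` rather than returning `False`.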

File list:
Dockerfile (241, 2019-11-07)
crontab_task (33, 2019-11-07)
log (2227, 2019-11-07)
news (0, 2019-11-07)
news\__pycache__ (0, 2019-11-07)
news\__pycache__\__init__.cpython-37.pyc (129, 2019-11-07)
news\__pycache__\items.cpython-37.pyc (443, 2019-11-07)
news\__pycache__\middlewares.cpython-37.pyc (4913, 2019-11-07)
news\__pycache__\pipelines.cpython-37.pyc (2416, 2019-11-07)
news\__pycache__\settings.cpython-37.pyc (861, 2019-11-07)
news\items.py (530, 2019-11-07)
news\middlewares.py (6334, 2019-11-07)
news\pipelines.py (2911, 2019-11-07)
news\settings.py (4151, 2019-11-07)
news\spiders (0, 2019-11-07)
news\spiders\__init__.py (161, 2019-11-07)
news\spiders\__pycache__ (0, 2019-11-07)
news\spiders\__pycache__\__init__.cpython-37.pyc (137, 2019-11-07)
news\spiders\__pycache__\a163.cpython-37.pyc (4617, 2019-11-07)
news\spiders\__pycache__\qq.cpython-37.pyc (3171, 2019-11-07)
news\spiders\__pycache__\sina.cpython-37.pyc (2122, 2019-11-07)
news\spiders\__pycache__\sohu.cpython-37.pyc (2674, 2019-11-07)
news\spiders\a163.py (7785, 2019-11-07)
news\spiders\qq.py (4874, 2019-11-07)
news\spiders\sina.py (4580, 2019-11-07)
news\spiders\sohu.py (3363, 2019-11-07)
news\tools (0, 2019-11-07)
news\tools\__pycache__ (0, 2019-11-07)
news\tools\__pycache__\__init__.cpython-37.pyc (135, 2019-11-07)
news\tools\a163 (0, 2019-11-07)
news\tools\a163\__pycache__ (0, 2019-11-07)
news\tools\a163\__pycache__\__init__.cpython-37.pyc (140, 2019-11-07)
news\tools\a163\__pycache__\get_attribute.cpython-37.pyc (942, 2019-11-07)
news\tools\a163\__pycache__\get_data_163.cpython-37.pyc (428, 2019-11-07)
news\tools\a163\__pycache__\not_news_url.cpython-37.pyc (332, 2019-11-07)
news\tools\a163\get_attribute.py (1285, 2019-11-07)
news\tools\a163\get_data_163.py (235, 2019-11-07)
news\tools\a163\not_news_url.py (149, 2019-11-07)
news\tools\general (0, 2019-11-07)
... ...
