Covid19-News-Crawl

Category: Biomedical technology
Development tool: Python
File size: 44KB
Downloads: 0
Upload date: 2023-05-14 11:35:48
Uploader: sh-1993
Description: Covid19-News-Crawl, news data collection from provincial government websites across China (mainly covering press conferences)

File list:
.DS_Store (6148, 2020-10-17)
.vscode (0, 2020-10-17)
.vscode\settings.json (80, 2020-10-17)
covid_19 (0, 2020-10-17)
covid_19\__init__.py (0, 2020-10-17)
covid_19\items.py (603, 2020-10-17)
covid_19\middlewares.py (5041, 2020-10-17)
covid_19\pipelines.py (1587, 2020-10-17)
covid_19\settings.py (3768, 2020-10-17)
covid_19\spiders (0, 2020-10-17)
covid_19\spiders\__init__.py (161, 2020-10-17)
covid_19\spiders\anhuiSpider.py (2517, 2020-10-17)
covid_19\spiders\beijingSpider.py (2767, 2020-10-17)
covid_19\spiders\chongqingSpider.py (2759, 2020-10-17)
covid_19\spiders\fujianSpider.py (2108, 2020-10-17)
covid_19\spiders\fujiandjzwSpider.py (1745, 2020-10-17)
covid_19\spiders\gansuSpider.py (2491, 2020-10-17)
covid_19\spiders\gansudtSpider.py (2714, 2020-10-17)
covid_19\spiders\guangdongSpider.py (2609, 2020-10-17)
covid_19\spiders\guangxiSpider.py (2241, 2020-10-17)
covid_19\spiders\hainanSpider.py (2735, 2020-10-17)
covid_19\spiders\heilongjiangSpider.py (2630, 2020-10-17)
covid_19\spiders\henanSpider.py (2107, 2020-10-17)
covid_19\spiders\hubeiSpider.py (3148, 2020-10-17)
covid_19\spiders\hunanSpider.py (2168, 2020-10-17)
covid_19\spiders\jiangsuSpider.py (3092, 2020-10-17)
covid_19\spiders\jiangxiSpider.py (2741, 2020-10-17)
covid_19\spiders\jilinSpider.py (2264, 2020-10-17)
covid_19\spiders\liaoningSpider.py (3492, 2020-10-17)
covid_19\spiders\ningxiaSpider.py (2476, 2020-10-17)
covid_19\spiders\nmgSpider.py (2550, 2020-10-17)
covid_19\spiders\qinghaiSpider.py (1998, 2020-10-17)
covid_19\spiders\shan_xiSpider.py (2741, 2020-10-17)
covid_19\spiders\shandongSpider.py (2181, 2020-10-17)
covid_19\spiders\shanghaiSpider.py (2609, 2020-10-17)
covid_19\spiders\shanxiSpider.py (3351, 2020-10-17)
covid_19\spiders\sichuanSpider.py (2216, 2020-10-17)
... ...

## COVID-19 Information Collection Project

This crawler scrapes the historical COVID-19 press conferences published on the government websites of different provinces, for use in data analysis. The technology stack is Scrapy, Selenium, and MongoDB.

**Download a chromedriver matching your environment**
-------------

    sudo mv ~/Downloads/chromedriver /usr/bin
    vi ~/.bash_profile
    export PATH=$PATH:/usr/local/bin/ChromeDriver

**Download MongoDB**
-------------

**Enter /usr/local**

    cd /usr/local

**Download**

    sudo curl -O https://fastdl.mongodb.org/osx/mongodb-osx-ssl-x86_***-4.0.9.tgz

**Extract**

    sudo tar -zxvf mongodb-osx-ssl-x86_***-4.0.9.tgz

**Rename the directory to mongodb**

    sudo mv mongodb-osx-x86_***-4.0.9/ mongodb

**After installation, update bash_profile**

    export PATH=/usr/local/mongodb/bin:$PATH

**Data storage path:**

    sudo mkdir -p /usr/local/var/mongodb

**Log file path:**

    sudo mkdir -p /usr/local/var/log/mongodb

**Make sure the permissions are correct**

    sudo chown <username> /usr/local/var/mongodb
    sudo chown <username> /usr/local/var/log/mongodb

**Start the MongoDB service in the background**

*Remember to reload the configuration before starting: source ~/.bash_profile*

    mongod --dbpath /usr/local/var/mongodb --logpath /usr/local/var/log/mongodb/mongo.log --fork

**Install the Python packages**
-------------

    pip install selenium
    pip install scrapy
    pip install xlwt
    pip install pymongo

flashtext is a keyword-matching library whose matching time does not grow with the number of keywords; this project uses it to clean the data stored in MongoDB. If you are interested in how it is implemented, see the author's [paper](https://arxiv.org/pdf/1711.00046.pdf).

    pip install flashtext

**Create a logs directory under the project root to store the log files**
-------------

    mkdir logs
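As an illustration of the cleaning step mentioned above, here is a minimal sketch of using flashtext together with pymongo to normalize text already stored in MongoDB. The database and collection names (`covid_19`, `news`), the field names (`content`, `content_clean`), and the keyword mappings are assumptions for illustration, not taken from this repository.

    # Minimal sketch: normalize keywords in documents stored in MongoDB.
    # Database/collection names, field names and keyword mappings are
    # illustrative assumptions, not this repository's actual schema.
    from flashtext import KeywordProcessor
    from pymongo import MongoClient

    keyword_processor = KeywordProcessor()
    # Map variant spellings of a province to one canonical form.
    keyword_processor.add_keyword("内蒙古自治区", "内蒙古")
    keyword_processor.add_keyword("广西壮族自治区", "广西")

    client = MongoClient("mongodb://localhost:27017")
    collection = client["covid_19"]["news"]

    for doc in collection.find({}, {"content": 1}):
        cleaned = keyword_processor.replace_keywords(doc.get("content", ""))
        collection.update_one({"_id": doc["_id"]}, {"$set": {"content_clean": cleaned}})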

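For readers new to Scrapy, the sketch below shows what a MongoDB storage pipeline (the role covid_19/pipelines.py plays in a project like this) typically looks like. The setting names MONGO_URI and MONGO_DATABASE and the use of the spider name as the collection name are assumptions, not this repository's actual code.

    # Sketch of a Scrapy item pipeline that writes items into MongoDB.
    # MONGO_URI / MONGO_DATABASE and the per-spider collection are
    # illustrative assumptions; a pipeline like this would be enabled
    # through ITEM_PIPELINES in settings.py.
    import pymongo


    class MongoPipeline:
        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
                mongo_db=crawler.settings.get("MONGO_DATABASE", "covid_19"),
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # One collection per provincial spider, e.g. "hubei".
            self.db[spider.name].insert_one(dict(item))
            return item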