Covid19-News-Crawl
Category: Biomedical technology
Development tool: Python
File size: 44KB
Downloads: 0
Upload date: 2023-05-14 11:35:48
Uploader: sh-1993
Description: Covid19-News-Crawl collects news data from provincial government websites across China (mainly press conferences).
File list:
.DS_Store (6148, 2020-10-17)
.vscode (0, 2020-10-17)
.vscode\settings.json (80, 2020-10-17)
covid_19 (0, 2020-10-17)
covid_19\__init__.py (0, 2020-10-17)
covid_19\items.py (603, 2020-10-17)
covid_19\middlewares.py (5041, 2020-10-17)
covid_19\pipelines.py (1587, 2020-10-17)
covid_19\settings.py (3768, 2020-10-17)
covid_19\spiders (0, 2020-10-17)
covid_19\spiders\__init__.py (161, 2020-10-17)
covid_19\spiders\anhuiSpider.py (2517, 2020-10-17)
covid_19\spiders\beijingSpider.py (2767, 2020-10-17)
covid_19\spiders\chongqingSpider.py (2759, 2020-10-17)
covid_19\spiders\fujianSpider.py (2108, 2020-10-17)
covid_19\spiders\fujiandjzwSpider.py (1745, 2020-10-17)
covid_19\spiders\gansuSpider.py (2491, 2020-10-17)
covid_19\spiders\gansudtSpider.py (2714, 2020-10-17)
covid_19\spiders\guangdongSpider.py (2609, 2020-10-17)
covid_19\spiders\guangxiSpider.py (2241, 2020-10-17)
covid_19\spiders\hainanSpider.py (2735, 2020-10-17)
covid_19\spiders\heilongjiangSpider.py (2630, 2020-10-17)
covid_19\spiders\henanSpider.py (2107, 2020-10-17)
covid_19\spiders\hubeiSpider.py (3148, 2020-10-17)
covid_19\spiders\hunanSpider.py (2168, 2020-10-17)
covid_19\spiders\jiangsuSpider.py (3092, 2020-10-17)
covid_19\spiders\jiangxiSpider.py (2741, 2020-10-17)
covid_19\spiders\jilinSpider.py (2264, 2020-10-17)
covid_19\spiders\liaoningSpider.py (3492, 2020-10-17)
covid_19\spiders\ningxiaSpider.py (2476, 2020-10-17)
covid_19\spiders\nmgSpider.py (2550, 2020-10-17)
covid_19\spiders\qinghaiSpider.py (1998, 2020-10-17)
covid_19\spiders\shan_xiSpider.py (2741, 2020-10-17)
covid_19\spiders\shandongSpider.py (2181, 2020-10-17)
covid_19\spiders\shanghaiSpider.py (2609, 2020-10-17)
covid_19\spiders\shanxiSpider.py (3351, 2020-10-17)
covid_19\spiders\sichuanSpider.py (2216, 2020-10-17)
... ...
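## Epidemic Information Collection Project
This crawler collects the historical COVID-19 press conferences published on the government websites of different regions, for use in data analysis. The technology stack is scrapy, selenium, and mongodb.
**Download the latest chromedriver (it must match the installed Chrome version)**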
-------------
sudo mv ~/Downloads/chromedriver /usr/local/bin
vi ~/.bash_profile
export PATH=$PATH:/usr/local/bin
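To confirm that the driver is actually reachable on `PATH` after editing `~/.bash_profile`, a small check like the following may help. It is a minimal sketch that uses only the standard library; the helper name `find_driver` is hypothetical and not part of this project.

```python
import shutil
from typing import Optional


def find_driver(name: str = "chromedriver") -> Optional[str]:
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("chromedriver not found on PATH; check ~/.bash_profile")
else:
    print(f"chromedriver found at {path}")
```

If the check reports that the driver is missing, reloading the profile with `source ~/.bash_profile` or opening a new terminal is usually the first thing to try.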
**Download MongoDB**
-------------
**Go to /usr/local**
cd /usr/local
**Download**
sudo curl -O https://fastdl.mongodb.org/osx/mongodb-osx-ssl-x86_***-4.0.9.tgz
**Extract**
sudo tar -zxvf mongodb-osx-ssl-x86_***-4.0.9.tgz
**Rename the directory to mongodb**
sudo mv mongodb-osx-x86_***-4.0.9/ mongodb
**After installation, update bash_profile**
export PATH=/usr/local/mongodb/bin:$PATH
**Data directory:**
sudo mkdir -p /usr/local/var/mongodb
**Log directory:**
sudo mkdir -p /usr/local/var/log/mongodb
**Set ownership**
sudo chown <username> /usr/local/var/mongodb
sudo chown <username> /usr/local/var/log/mongodb
**Start the MongoDB service in the background**
*Before starting, remember to reload the configuration: source ~/.bash_profile*
mongod --dbpath /usr/local/var/mongodb --logpath /usr/local/var/log/mongodb/mongo.log --fork
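Because `--fork` sends the server to the background, it can be useful to verify that it is accepting connections before running the spiders. The sketch below assumes the server listens on MongoDB's default port 27017 on localhost (the command above does not set a port); `is_port_open` is a hypothetical helper, not part of this project.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 27017,
                 timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


print("mongod reachable:", is_port_open())
```

This only checks that something is listening on the port. If it reports `False`, the log file passed to `--logpath` is the place to look for the reason.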
**Install Python packages**
-------------
pip install selenium
pip install scrapy
pip install xlwt
pip install pymongo
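flashtext is a text-matching library whose matching time stays roughly constant as the number of keywords grows (it scales with the length of the text, not with the size of the keyword list). This project uses it to clean the data stored in MongoDB. If you are interested in how it works, see the author's [paper](https://arxiv.org/pdf/1711.00046.pdf).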
pip install flashtext
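The idea behind this kind of matcher can be illustrated with a small character trie. The sketch below is a simplified stand-in written for illustration only: it is not flashtext's code, it ignores word boundaries, and the class name `TinyKeywordMatcher` and the sample keywords are assumptions. Its cost depends on the text length and the longest keyword, not on how many keywords are registered.

```python
from typing import Dict, List

_END = "_end_"  # multi-character marker key; never collides with a single character


class TinyKeywordMatcher:
    """Simplified trie matcher; illustrative only, not flashtext's implementation."""

    def __init__(self) -> None:
        self._trie: Dict = {}

    def add_keyword(self, keyword: str, clean_name: str = "") -> None:
        """Register a keyword; matches are reported as clean_name (or the keyword)."""
        node = self._trie
        for ch in keyword:
            node = node.setdefault(ch, {})
        node[_END] = clean_name or keyword

    def extract_keywords(self, text: str) -> List[str]:
        """Scan text left to right, reporting the longest keyword at each position."""
        found: List[str] = []
        i = 0
        while i < len(text):
            node = self._trie
            match, match_end = None, i
            j = i
            # Walk the trie as far as the text allows, remembering the longest match.
            while j < len(text) and text[j] in node:
                node = node[text[j]]
                j += 1
                if _END in node:
                    match, match_end = node[_END], j
            if match is not None:
                found.append(match)
                i = match_end  # continue after the matched keyword
            else:
                i += 1
        return found


matcher = TinyKeywordMatcher()
matcher.add_keyword("新冠肺炎", "COVID-19")
matcher.add_keyword("新闻发布会", "press conference")
print(matcher.extract_keywords("湖北省召开新冠肺炎疫情防控新闻发布会"))
# ['COVID-19', 'press conference']
```

For real cleaning work the flashtext package itself is the better choice; the sketch only shows why adding more keywords does not slow down the scan.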
**Create a logs directory in the project root to store log files**
-------------
mkdir logs
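The same directory can also be created from Python when a script needs to write its own log file. This is a minimal sketch using the standard `logging` module; how the project itself configures logging is not shown in this README, so the function name `setup_file_logger` and the log file name are assumptions.

```python
import logging
from pathlib import Path


def setup_file_logger(root: Path, name: str = "covid_19") -> logging.Logger:
    """Create root/logs if needed and return a logger writing to logs/<name>.log."""
    log_dir = root / "logs"
    log_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p logs`
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.FileHandler(log_dir / f"{name}.log", encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Calling `setup_file_logger(Path.cwd())` from the project root would then write messages to `logs/covid_19.log`.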