ECommerceCrawlers
Category: Artificial Intelligence / Neural Networks / Deep Learning
Development tool: Python
File size: 7036KB
Downloads: 0
Upload date: 2023-02-15 22:59:39
Uploader: sh-1993
Description: ECommerceCrawlers, hands-on crawlers for a variety of websites and e-commerce data. Includes: Taobao products, WeChat official accounts, Dianping, Qichacha, recruitment sites, Xianyu, Alibaba tasks, Cnblogs, Weibo, Baidu Tieba, Douban movies, Baotu, Quanjing, Douban music, a provincial drug administration, Sohu News, machine-learning text collection, fofa asset collection, Autohome, ...
File list:
.DS_Store (10244, 2022-04-26)
CODE_OF_CONDUCT.md (6304, 2022-04-26)
DianpingCrawler (0, 2022-04-26)
DianpingCrawler\capth.png (47732, 2022-04-26)
DianpingCrawler\clear_data.py (1899, 2022-04-26)
DianpingCrawler\dazhong.py (8284, 2022-04-26)
DianpingCrawler\demo.py (1214, 2022-04-26)
East_money (0, 2022-04-26)
East_money\east_money (0, 2022-04-26)
East_money\east_money\__init__.py (0, 2022-04-26)
East_money\east_money\items.py (390, 2022-04-26)
East_money\east_money\middlewares.py (1883, 2022-04-26)
East_money\east_money\pipelines.py (681, 2022-04-26)
East_money\east_money\settings.py (3292, 2022-04-26)
East_money\east_money\spiders (0, 2022-04-26)
East_money\east_money\spiders\__init__.py (161, 2022-04-26)
East_money\east_money\spiders\east_spider.py (2753, 2022-04-26)
East_money\main.py (80, 2022-04-26)
East_money\questions.txt (554, 2022-04-26)
East_money\scrapy.cfg (264, 2022-04-26)
LICENSE (1071, 2022-04-26)
OthertCrawler (0, 2022-04-26)
OthertCrawler\0x01baidutieba (0, 2022-04-26)
OthertCrawler\0x01baidutieba\0x01baidutieba.py (1997, 2022-04-26)
OthertCrawler\0x02douban (0, 2022-04-26)
OthertCrawler\0x02douban\0x02douban.py (1074, 2022-04-26)
OthertCrawler\0x03alitask (0, 2022-04-26)
OthertCrawler\0x03alitask\0x03alitask.py (5058, 2022-04-26)
OthertCrawler\0x03alitask\alitask.py (3421, 2022-04-26)
OthertCrawler\0x03alitask\阿里v任务.md (1534, 2022-04-26)
... ...
# A simple scrapy example
## Crawling the Cnblogs homepage: [srapy_cnblog](https://www.cnblogs.com/sitehome/p/1)
First, create the project from the command line as usual, then write each of the files in turn.
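The scaffolding step can be sketched as follows (the project name `cnblog` and spider name `cnblog_spider` are inferred from the imports and `name` attribute used later):

```shell
# Create the scrapy project skeleton (project name assumed: cnblog)
scrapy startproject cnblog
cd cnblog
# Generate a spider skeleton bound to the cnblogs.com domain
scrapy genspider cnblog_spider cnblogs.com
```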
### First, write the item file, defining the fields to scrape based on the target content. Code:
```python
import scrapy

class CnblogItem(scrapy.Item):
    title = scrapy.Field()  # the scraped title
    link = scrapy.Field()   # the scraped link
```
### Next, write the spider file in the spiders directory (this is the key part), here named cnblog_spider. Code:
```python
import scrapy
from cnblog.items import CnblogItem

class CnblogSpiderSpider(scrapy.Spider):
    name = "cnblog_spider"
    allowed_domains = ["cnblogs.com"]
    url = 'https://www.cnblogs.com/sitehome/p/'
    offset = 1
    start_urls = [url + str(offset)]

    def parse(self, response):
        item = CnblogItem()
        # extract titles and links with XPath
        item['title'] = response.xpath('//a[@class="titlelnk"]/text()').extract()
        item['link'] = response.xpath('//a[@class="titlelnk"]/@href').extract()
        yield item
        print("Page {0} crawled".format(self.offset))
        if self.offset < 10:  # how many pages to crawl
            self.offset += 1
            url2 = self.url + str(self.offset)  # build the next page URL
            print(url2)
            yield scrapy.Request(url=url2, callback=self.parse)
```
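The two XPath expressions can be tried outside scrapy as well. A minimal standalone sketch using only the standard library's limited XPath support, with a made-up HTML snippet standing in for the real page:

```python
# Standalone check of the XPath idea. The real spider uses scrapy's
# response.xpath; ElementTree's XPath subset handles this simple case.
import xml.etree.ElementTree as ET

html = """<div>
  <a class="titlelnk" href="https://www.cnblogs.com/post/1">Post one</a>
  <a class="titlelnk" href="https://www.cnblogs.com/post/2">Post two</a>
  <a class="other" href="https://example.com">ignored</a>
</div>"""

root = ET.fromstring(html)
anchors = root.findall(".//a[@class='titlelnk']")
titles = [a.text for a in anchors]        # text() in the spider
links = [a.get('href') for a in anchors]  # @href in the spider
print(titles)  # ['Post one', 'Post two']
print(links)
```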
### Write the pipelines file, which saves the scraped data to a TXT file.
```python
class FilePipeline(object):
    def process_item(self, item, spider):
        data = ''
        with open('cnblog.txt', 'a', encoding='utf-8') as f:
            titles = item['title']
            links = item['link']
            for title, link in zip(titles, links):
                data += title + ' ' + link + '\n'
            f.write(data)
            # no explicit f.close() needed: the with-block closes the file
        return item
```
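The pairing logic inside process_item can be exercised on its own with a plain dict standing in for the Item (the sample data here is hypothetical):

```python
# Reproduce the pipeline's formatting step without scrapy:
# zip the parallel title/link lists and join each pair into one line.
def format_item(item):
    data = ''
    for title, link in zip(item['title'], item['link']):
        data += title + ' ' + link + '\n'
    return data

item = {'title': ['First post', 'Second post'],
        'link': ['/p/1', '/p/2']}
print(format_item(item))  # 'First post /p/1\nSecond post /p/2\n'
```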
### Modify the settings file
```python
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    # newly added User-Agent
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win***; x***) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
}

# newly modified
ITEM_PIPELINES = {
    'cnblog.pipelines.FilePipeline': 300,  # save results to the txt file
}
```
### Finally, write a main file. Scrapy cannot be debugged directly in an IDE, but we can write our own entry script; running it lets us debug in the IDE like any ordinary project. Code:
```python
from scrapy import cmdline

# --nolog suppresses log output; remove it to see detailed information
cmdline.execute("scrapy crawl cnblog_spider --nolog".split())
```
That completes this example. Run main.py, and a file named cnblog.txt will be generated containing the scraped content.