ECommerceCrawlers

Category: Artificial Intelligence / Neural Networks / Deep Learning
Development tool: Python
File size: 7036 KB
Downloads: 0
Upload date: 2023-02-15 22:59:39
Uploader: sh-1993
Description: ECommerceCrawlers, practical crawlers for a variety of websites and e-commerce data. Includes: Taobao products, WeChat official accounts, Dianping, Qichacha, recruitment sites, Xianyu, Alibaba V-Task, Cnblogs, Weibo, Baidu Tieba, Douban movies, Baotu, Quanjing, Douban music, a provincial drug administration, Sohu News, machine-learning text collection, FOFA asset collection, Autohome, ...

File list:
.DS_Store (10244, 2022-04-26)
CODE_OF_CONDUCT.md (6304, 2022-04-26)
DianpingCrawler (0, 2022-04-26)
DianpingCrawler\capth.png (47732, 2022-04-26)
DianpingCrawler\clear_data.py (1899, 2022-04-26)
DianpingCrawler\dazhong.py (8284, 2022-04-26)
DianpingCrawler\demo.py (1214, 2022-04-26)
East_money (0, 2022-04-26)
East_money\east_money (0, 2022-04-26)
East_money\east_money\__init__.py (0, 2022-04-26)
East_money\east_money\items.py (390, 2022-04-26)
East_money\east_money\middlewares.py (1883, 2022-04-26)
East_money\east_money\pipelines.py (681, 2022-04-26)
East_money\east_money\settings.py (3292, 2022-04-26)
East_money\east_money\spiders (0, 2022-04-26)
East_money\east_money\spiders\__init__.py (161, 2022-04-26)
East_money\east_money\spiders\east_spider.py (2753, 2022-04-26)
East_money\main.py (80, 2022-04-26)
East_money\questions.txt (554, 2022-04-26)
East_money\scrapy.cfg (264, 2022-04-26)
LICENSE (1071, 2022-04-26)
OthertCrawler (0, 2022-04-26)
OthertCrawler\0x01baidutieba (0, 2022-04-26)
OthertCrawler\0x01baidutieba\0x01baidutieba.py (1997, 2022-04-26)
OthertCrawler\0x02douban (0, 2022-04-26)
OthertCrawler\0x02douban\0x02douban.py (1074, 2022-04-26)
OthertCrawler\0x03alitask (0, 2022-04-26)
OthertCrawler\0x03alitask\0x03alitask.py (5058, 2022-04-26)
OthertCrawler\0x03alitask\alitask.py (3421, 2022-04-26)
OthertCrawler\0x03alitask\阿里v任务.md (1534, 2022-04-26)
... ...

# A simple Scrapy example

## Crawling the Cnblogs front page: [scrapy_cnblog](https://www.cnblogs.com/sitehome/p/1)

As usual, create the project from the command line, then write each file in turn.

### First, write the item file. Based on the content to be crawled, define the fields to extract. The code is as follows:

```python
import scrapy


class CnblogItem(scrapy.Item):
    title = scrapy.Field()  # the post title to extract
    link = scrapy.Field()   # the post link to extract
```

### Write the spider file in the spiders directory (this is the key part). Here it is named cnblog_spider. The code is as follows:

```python
import scrapy
from cnblog.items import CnblogItem


class CnblogSpiderSpider(scrapy.Spider):
    name = "cnblog_spider"
    allowed_domains = ["cnblogs.com"]
    url = 'https://www.cnblogs.com/sitehome/p/'
    offset = 1
    start_urls = [url + str(offset)]

    def parse(self, response):
        item = CnblogItem()
        # extract titles and links with XPath
        item['title'] = response.xpath('//a[@class="titlelnk"]/text()').extract()
        item['link'] = response.xpath('//a[@class="titlelnk"]/@href').extract()
        yield item
        print("Page {0} crawled".format(self.offset))
        if self.offset < 10:  # how many pages to crawl
            self.offset += 1
            url2 = self.url + str(self.offset)  # build the next page URL
            print(url2)
            yield scrapy.Request(url=url2, callback=self.parse)
```

### Write the pipelines file, which writes the crawled data to a TXT file:

```python
class FilePipeline(object):
    def process_item(self, item, spider):
        data = ''
        with open('cnblog.txt', 'a', encoding='utf-8') as f:
            titles = item['title']
            links = item['link']
            for i, j in zip(titles, links):
                data += i + ' ' + j + '\n'
            f.write(data)
        return item
```

### Update the settings file:

```python
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    # newly added user-agent
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win***; x***) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
}

# newly modified
ITEM_PIPELINES = {
    'cnblog.pipelines.FilePipeline': 300,  # saves the results to a txt file
}
```

### Write a main file. Scrapy cannot be debugged directly in an IDE, but we can write our own entry file; running it lets us debug the project like an ordinary program. The code is as follows:

```python
from scrapy import cmdline

# --nolog suppresses the log output; remove it to see the details
cmdline.execute("scrapy crawl cnblog_spider --nolog".split())
```

Now the example is complete. Running main.py produces a cnblog.txt file containing the crawled content.
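The pipeline above reopens cnblog.txt once per scraped item. As an aside, Scrapy pipelines also expose `open_spider`/`close_spider` hooks, so the file can be opened once per crawl instead. The following is a minimal sketch of that variant, not the repository's code; the file name matches the tutorial, everything else is illustrative:

```python
class FilePipeline:
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('cnblog.txt', 'a', encoding='utf-8')

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # write one "title link" line per scraped entry
        for title, link in zip(item['title'], item['link']):
            self.file.write(title + ' ' + link + '\n')
        return item
```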
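Similarly, instead of shelling out through `cmdline.execute` in main.py, the crawl can be run in-process with Scrapy's `CrawlerProcess`, which makes stepping through the spider in a debugger straightforward. This is a sketch under the assumption that the project is named `cnblog` and the spider class lives in `cnblog/spiders/cnblog_spider.py`, as in the tutorial; adjust the import to your actual layout:

```python
# run_crawl.py -- run the spider in-process so it can be debugged like any script
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# assumed module path based on the tutorial's project layout
from cnblog.spiders.cnblog_spider import CnblogSpiderSpider

if __name__ == '__main__':
    process = CrawlerProcess(get_project_settings())  # loads settings.py
    process.crawl(CnblogSpiderSpider)
    process.start()  # blocks until the crawl finishes
```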
