scrapy-newsutils
Category: Utility libraries
Development tool: Python
File size: 0KB
Downloads: 0
Upload date: 2023-09-21 15:37:21
Uploader: sh-1993
Description: Model abstractions and NLP tasks for happy news scraping.
File list:
demo/ (0, 2023-09-21)
demo/__init__.py (0, 2023-09-21)
demo/commands/ (0, 2023-09-21)
demo/commands/__init__.py (0, 2023-09-21)
demo/commands/crawlall.py (2609, 2023-09-21)
demo/default_settings.py (3059, 2023-09-21)
demo/items.py (130, 2023-09-21)
demo/middlewares.py (3650, 2023-09-21)
demo/scrapy.cfg (249, 2023-09-21)
demo/settings.py (435, 2023-09-21)
demo/spiders.json (884, 2023-09-21)
demo/spiders/ (0, 2023-09-21)
demo/spiders/__init__.py (161, 2023-09-21)
demo/spiders/gn_africaguinee.py (768, 2023-09-21)
demo/spiders/gn_guineematin.py (795, 2023-09-21)
requirements.txt (123, 2023-09-21)
setup.cfg (938, 2023-09-21)
setup.py (95, 2023-09-21)
src/ (0, 2023-09-21)
src/newsutils/ (0, 2023-09-21)
src/newsutils/__init__.py (46, 2023-09-21)
src/newsutils/appsettings.py (13340, 2023-09-21)
src/newsutils/conf/ (0, 2023-09-21)
src/newsutils/conf/__init__.py (191, 2023-09-21)
src/newsutils/conf/constants.py (1526, 2023-09-21)
src/newsutils/conf/globals.py (2719, 2023-09-21)
src/newsutils/conf/mixins.py (6256, 2023-09-21)
src/newsutils/conf/post_item.py (2065, 2023-09-21)
src/newsutils/conf/posts.py (6422, 2023-09-21)
src/newsutils/conf/utils.py (3353, 2023-09-21)
src/newsutils/console.py (3102, 2023-09-21)
src/newsutils/crawl/ (0, 2023-09-21)
src/newsutils/crawl/__init__.py (112, 2023-09-21)
src/newsutils/crawl/commands.py (1380, 2023-09-21)
src/newsutils/crawl/day.py (5531, 2023-09-21)
src/newsutils/crawl/items.py (928, 2023-09-21)
src/newsutils/crawl/pipelines.py (5166, 2023-09-21)
src/newsutils/crawl/spiders.py (9598, 2023-09-21)
... ...
# scrapy-newsutils
This Python library makes scraping online articles a breeze. What you can do:
* Download articles from any source format into a standardized `Post` structure.
* Gather posts by affinity, and generate titles and summaries from single articles or groups of articles.
* Publish summarized posts to Telegram, Facebook and Twitter.
* Initialise scrapers dynamically from database settings.
* Create schedules to start crawl jobs.
It provides the following two major components for news fetching:
- `newsutils.crawl` leverages Scrapy to crawl news.
- `newsutils.ezines` downloads news by performing HTTP requests to news API endpoints.
## Misc. Features
* Skips duplicate articles while crawling.
* Customizable names for article data fields.
* Customizable strategies for text extraction (e.g. `summary_as_text`).
* Comes with default settings that can be overridden from `.env` or `settings.py` (see the sketch after this list).
* Supplies fully-functional base classes for building customized commands, spiders, and configuration that consume NLP pipelines.
* A mechanism that aborts a command if another instance is already running.
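For example, a minimal override might look like the following sketch (only `CRAWL_DB_URI` is mentioned in this README; the fallback value is a placeholder, not a documented default):
```python
# settings.py — minimal override sketch. CRAWL_DB_URI is referenced by
# this project; the fallback URI below is a placeholder for illustration.
import os

CRAWL_DB_URI = os.environ.get("CRAWL_DB_URI", "mongodb://localhost:27017/crawl")
```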
## Usage
* Set up the Python environment
```shell
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
```
* Define a posts spider manually.
The traditional way: declare the spider subclass in a module inside `settings.SPIDER_MODULES`.
```shell
cat > spiders/gn_guineematin.py << 'EOF'
from newsutils.crawl.spiders import BasePostCrawler


class GnGuineeMatin(BasePostCrawler):
    """
    Scrapes news posts at 'https://guineematin.com'
    """

    # by `scrapy.spider.Spider`
    name = 'gn-guineematin'
    allowed_domains = ['guineematin.com']
    start_urls = ['https://guineematin.com/']

    # by `newsutils.scrapy.base.spiders.NewsSpider`
    country_code = 'GN'
    language = 'fr'
    post_images = "//figure/img/@src"
    post_texts = {
        "featured": '//*[(@id = "tdi_82")]//a | //*[(@id = "tdi_84")]//a',
        "default": '//*[contains(concat( " ", @class, " " ), concat( " ", "td-animation-stack", " " ))]//a',
    }

    # optional crawl date filters
    # days_from = '2022-04-19'
    # days_to = '2022-04-25'
    # days = ['2022-04-12', '2022-04-09']
EOF
```
* Define a posts spider dynamically.
The initialisation context is read from the database as part of the project settings;
cf. `settings.CRAWL_DB_URI`. E.g., create the following `spiders.json` and import it into MongoDB (see the `mongoimport` sketch below).
```shell
cat > spiders.json << 'EOF'
[
  {
    "name": "gn-guineematin",
    "allowed_domains": ["guineematin.com"],
    "start_urls": ["https://guineematin.com/"],
    "country_code": "GN",
    "language": "fr",
    "post_images": "//figure/img/@src",
    "post_texts": {
      "featured": "//*[(@id = \"tdi_82\")]//a | //*[(@id = \"tdi_84\")]//a",
      "default": "//*[contains(concat( \" \", @class, \" \" ), concat( \" \", \"td-animation-stack\", \" \" ))]//a"
    }
  }
]
EOF
```
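One way to import the file into MongoDB is `mongoimport`; this sketch assumes `$CRAWL_DB_URI` names the target database and that spiders live in a `spiders` collection (the collection name is an assumption, not documented in this README):
```shell
# Assumes $CRAWL_DB_URI names the crawl database; the "spiders"
# collection name is an assumption.
mongoimport --uri "$CRAWL_DB_URI" --collection spiders --jsonArray --file spiders.json
```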
* Run the spider (from the project directory)
```shell
# write scraped posts to a JSON file (-O overwrites it)
scrapy crawl gn-guineematin -O gn-guineematin.json
```
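The README also mentions creating schedules to start crawl jobs; a plain `cron` entry is one simple way to do that (all paths below are placeholders):
```shell
# Hypothetical crontab entry: run the spider every 6 hours.
# Adjust the project and virtualenv paths to your environment.
0 */6 * * * cd /path/to/project && /path/to/venv/bin/scrapy crawl gn-guineematin -O gn-guineematin.json
```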
### Optimisations
* Supports `conda` envs suited for machine learning.
* The `scrapy nlp` command downloads 2 GB+ of model data!
It is recommended to mount the NLP data directories as a volume when using Docker.
Cf. the example multistage `Dockerfile` in the `leeram-news/newsbot` project.
* Throttled e-zine API request rate at thesportsdb.com.
Configurable through environment variables.
* [TODO] Skip NLP inference, i.e. do not generate a metapost if one with the same version
(i.e. the same detected siblings) already exists in the db.
* [TODO] Create a single source of truth for settings: `settings.py` and the envs loaded by the `run.py` script,
e.g. `SIMILARITY_RELATED_THRESHOLD`, etc. (see the `run.py` sketch after this list).
* [TODO] [NER as middleware](https://github.com/vu3jej/scrapy-corenlp) -> new post field
* [TODO] [Scrapy DeltaFetch](https://github.com/ScrapeOps/python-scrapy-playbook/tree/master/2_Scrapy_Guides) ensures that scrapers only ever crawl the content once
* [TODO] Bulk (batch) insert to the database:
append items to a buffer during `process_item()` and only flush to the db in the pipeline's `close_spider()` (see the pipeline sketch after this list).
https://jerrynsh.com/5-useful-tips-while-working-with-python-scrapy/
* [TODO] Move the code that fetches a post's `.images` and `.top_image` to a pipeline/middleware.
Currently, images are parsed/downloaded even for duplicate posts (see the middleware sketch after this list)!
https://github.com/scrapy/scrapy/issues/2436
https://doc.scrapy.org/en/latest/topics/spider-middleware.html#scrapy.spidermiddlewares.SpiderMiddleware.process_spider_output
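A common pattern for the single-source-of-truth item above, sketched with `python-dotenv` (the default value here is a placeholder, not this project's actual threshold):
```python
# run.py — sketch only: load .env once so settings.py reads everything
# from os.environ. The 0.4 fallback is a placeholder value.
import os

from dotenv import load_dotenv

load_dotenv()  # expose .env values through os.environ

SIMILARITY_RELATED_THRESHOLD = float(
    os.environ.get("SIMILARITY_RELATED_THRESHOLD", "0.4")
)
```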
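The bulk-insert item above could look like this pipeline sketch (`CRAWL_DB_URI` is the setting referenced by this README; the `posts` collection name is an assumption):
```python
# Sketch of the batch-insert idea: buffer items in process_item() and
# flush with insert_many() instead of one insert per item. The "posts"
# collection name is an assumption.
from itemadapter import ItemAdapter
from pymongo import MongoClient


class BulkInsertPipeline:

    BATCH_SIZE = 100  # flush periodically to bound memory usage

    def open_spider(self, spider):
        # get_default_database() requires the URI to name a database
        self.client = MongoClient(spider.settings.get("CRAWL_DB_URI"))
        self.posts = self.client.get_default_database()["posts"]
        self.buffer = []

    def process_item(self, item, spider):
        self.buffer.append(ItemAdapter(item).asdict())
        if len(self.buffer) >= self.BATCH_SIZE:
            self.posts.insert_many(self.buffer)
            self.buffer = []
        return item

    def close_spider(self, spider):
        if self.buffer:  # final flush of any remaining items
            self.posts.insert_many(self.buffer)
        self.client.close()
```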
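And the image-fetching item above suggests a spider middleware along these lines (a sketch: the `link` dedup key is an assumption, since this library's post fields are configurable):
```python
# Sketch: drop duplicate posts in process_spider_output(), before any
# image parsing/downloading runs. The "link" key is an assumed dedup
# field, not this library's documented API.
from itemadapter import ItemAdapter, is_item


class SkipDuplicatePostsMiddleware:

    def __init__(self):
        self.seen = set()

    def process_spider_output(self, response, result, spider):
        for obj in result:
            if is_item(obj):
                key = ItemAdapter(obj).get("link")  # assumed dedup key
                if key is not None:
                    if key in self.seen:
                        continue  # duplicate: skip before images are fetched
                    self.seen.add(key)
            yield obj
```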
### Feature request
* Bypass scraper blocking.
Refs: [1](https://scrapfly.io/blog/web-scraping-with-scrapy/)
Test the various plugins for proxy management (see the settings sketch after this list), e.g.:
- [scrapy-rotating-proxies](https://github.com/TeamHG-Memex/scrapy-rotating-proxies),
- [scrapy-fake-useragent](https://github.com/alecxe/scrapy-fake-useragent), for randomizing User-Agent headers.
* Browser emulation and scraping dynamic pages (JS) using:
- scrapy-selenium (+GCP):
[1](https://youtu.be/2LwrUu9yTAo),
[2](https://www.roelpeters.be/how-to-deploy-a-scraping-script-and-selenium-in-google-cloud-run/)
- [scrapy-playwright](https://pypi.org/project/scrapy-playwright/)
- JS support via [Splash](https://splash.readthedocs.io/en/stable/faq.html) \
Won't do: seems to require running in a Docker container.
* Migrate to distributed scraping, e.g. [Frontera](https://github.com/scrapinghub/frontera)
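As a starting point for the proxy-management item above, this `settings.py` sketch enables both plugins following their READMEs (verify the middleware paths and priorities against the versions you install; proxy hosts are placeholders):
```python
# settings.py — sketch enabling both plugins named above. Middleware
# paths/priorities follow each plugin's README; verify against the
# installed versions. Proxy hosts are placeholders.
ROTATING_PROXY_LIST = [
    "proxy1.example.com:8000",
    "proxy2.example.com:8031",
]

DOWNLOADER_MIDDLEWARES = {
    # scrapy-rotating-proxies
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
    # scrapy-fake-useragent (disable Scrapy's built-in user-agent middleware)
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 400,
}
```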