scrapy-newsutils

Category: Utility library
Development language: Python
File size: 0KB
Downloads: 0
Upload date: 2023-09-21 15:37:21
Uploader: sh-1993
Description: Model abstractions and NLP tasks for happy news scraping.

File list (size, date):
demo/ (0, 2023-09-21)
demo/__init__.py (0, 2023-09-21)
demo/commands/ (0, 2023-09-21)
demo/commands/__init__.py (0, 2023-09-21)
demo/commands/crawlall.py (2609, 2023-09-21)
demo/default_settings.py (3059, 2023-09-21)
demo/items.py (130, 2023-09-21)
demo/middlewares.py (3650, 2023-09-21)
demo/scrapy.cfg (249, 2023-09-21)
demo/settings.py (435, 2023-09-21)
demo/spiders.json (884, 2023-09-21)
demo/spiders/ (0, 2023-09-21)
demo/spiders/__init__.py (161, 2023-09-21)
demo/spiders/gn_africaguinee.py (768, 2023-09-21)
demo/spiders/gn_guineematin.py (795, 2023-09-21)
requirements.txt (123, 2023-09-21)
setup.cfg (938, 2023-09-21)
setup.py (95, 2023-09-21)
src/ (0, 2023-09-21)
src/newsutils/ (0, 2023-09-21)
src/newsutils/__init__.py (46, 2023-09-21)
src/newsutils/appsettings.py (13340, 2023-09-21)
src/newsutils/conf/ (0, 2023-09-21)
src/newsutils/conf/__init__.py (191, 2023-09-21)
src/newsutils/conf/constants.py (1526, 2023-09-21)
src/newsutils/conf/globals.py (2719, 2023-09-21)
src/newsutils/conf/mixins.py (6256, 2023-09-21)
src/newsutils/conf/post_item.py (2065, 2023-09-21)
src/newsutils/conf/posts.py (6422, 2023-09-21)
src/newsutils/conf/utils.py (3353, 2023-09-21)
src/newsutils/console.py (3102, 2023-09-21)
src/newsutils/crawl/ (0, 2023-09-21)
src/newsutils/crawl/__init__.py (112, 2023-09-21)
src/newsutils/crawl/commands.py (1380, 2023-09-21)
src/newsutils/crawl/day.py (5531, 2023-09-21)
src/newsutils/crawl/items.py (928, 2023-09-21)
src/newsutils/crawl/pipelines.py (5166, 2023-09-21)
src/newsutils/crawl/spiders.py (9598, 2023-09-21)
... ...

# scrapy-newsutils

This Python library makes scraping online articles a breeze. What you can do:

* Download articles from any format into a standardized `Post` structure.
* Gather posts by affinity; generate authentic titles and summaries from single articles or groups of articles, etc.
* Publish summarized posts to Telegram, Facebook and Twitter.
* Initialise scrapers dynamically from database settings.
* Create schedules to start crawl jobs.

It provides the following two major components for news fetching:

- `newsutils.crawl` leverages Scrapy to crawl news.
- `newsutils.ezines` downloads news by performing HTTP requests to news API endpoints.

## Misc. Features

* Skips duplicate articles while crawling.
* Customizable names for article data fields.
* Customizable strategies for text extraction, e.g. `summary_as_text`.
* Comes with default settings overridable from the `.env` or `settings.py`.
* Supplies fully-functional base classes for building customized commands, spiders and configuration that consume NLP pipelines.
* Mechanism to quit running a command if another instance is already running.

## Usage

* Set up the Python env:

```shell
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
```

* Define a posts spider manually. This is the traditional way: define the spider subclass in a module inside `settings.SPIDER_MODULES`:

```shell
cat <<'EOF' > spiders/gn_guineematin.py
from newsutils.crawl.spiders import BasePostCrawler


class GnGuineeMatin(BasePostCrawler):
    """
    Scrapes news posts at 'https://guineematin.com'
    """

    # by `scrapy.spider.Spider`
    name = 'gn-guineematin'
    allowed_domains = ['guineematin.com']
    start_urls = ['https://guineematin.com/']

    # by `newsutils.scrapy.base.spiders.NewsSpider`
    country_code = 'GN'
    language = 'fr'
    post_images = "//figure/img/@src"
    post_texts = {
        "featured": '//*[(@id = "tdi_82")]//a | //*[(@id = "tdi_84")]//a',
        "default": '//*[contains(concat( " ", @class, " " ), concat( " ", "td-animation-stack", " " ))]//a',
    }
    # days_from = '2022-04-19'
    # days_to = '2022-04-25'
    # days = ['2022-04-12', '2022-04-09']
EOF
```

* Define a posts spider dynamically. The initialisation context is read from the database as part of the project, cf. `settings.CRAWL_DB_URI`. E.g., run the following, then import the generated `spiders.json` into MongoDB (a small import sketch follows the code block):

```shell
cat <<'EOF' > spiders.json
[
  {
    "name": "gn-guineematin",
    "allowed_domains": ["guineematin.com"],
    "start_urls": ["https://guineematin.com/"],
    "country_code": "GN",
    "language": "fr",
    "post_images": "//figure/img/@src",
    "post_texts": {
      "featured": "//*[(@id = \"tdi_82\")]//a | //*[(@id = \"tdi_84\")]//a",
      "default": "//*[contains(concat( \" \", @class, \" \" ), concat( \" \", \"td-animation-stack\", \" \" ))]//a"
    }
  }
]
EOF
```
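For instance, a minimal import sketch using `pymongo`. This is an assumption-laden illustration: the local URI is a placeholder for your `settings.CRAWL_DB_URI`, and the `spiders` collection name is hypothetical.

```python
# Hypothetical helper: load spiders.json into the crawl database.
import json

from pymongo import MongoClient

CRAWL_DB_URI = "mongodb://localhost:27017/crawl"  # placeholder for settings.CRAWL_DB_URI


def import_spiders(path="spiders.json"):
    with open(path, encoding="utf-8") as fp:
        definitions = json.load(fp)
    client = MongoClient(CRAWL_DB_URI)
    db = client.get_default_database()  # db name is taken from the URI
    # the "spiders" collection name is an assumption, not the library's schema
    inserted = db["spiders"].insert_many(definitions).inserted_ids
    print(f"imported {len(inserted)} spider definition(s)")


if __name__ == "__main__":
    import_spiders()
```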
* Run the spider (chdir to the project directory first):

```shell
# redirects output to a json file
scrapy crawl gn-guineematin -O gn-guineematin.json
```

### Optimisations

* Supports `conda` envs suited for machine learning.
* The `scrapy nlp` command downloads 2G+ of model data! It is recommended to mount the NLP data directories as a volume when using Docker. Cf. the example multistage `Dockerfile` in the `leeram-news/newsbot` project.
* Throttles the e-zines API request rate at thesportsdb.com; configurable through env vars.
* [TODO] Skip NLP inference, i.e. quit generating a metapost if a metapost with the same version already exists in the db, i.e. iff the same siblings are detected.
* [TODO] Create a single source of truth for settings: `settings.py` vs. envs loaded by the `run.py` script, e.g. `SIMILARITY_RELATED_THRESHOLD`, etc.
* [TODO] [NER as middleware](https://github.com/vu3jej/scrapy-corenlp) -> new post field (a middleware sketch appears after this list).
* [TODO] [Scrapy DeltaFetch](https://github.com/ScrapeOps/python-scrapy-playbook/tree/master/2_Scrapy_Guides) ensures that your scrapers only ever crawl the content once.
* [TODO] Bulk (batch) insert to the database: append to some bucket during `process_item()`, and only flush to the db during the pipeline's `close_spider()` (see the pipeline sketch right after this list).
  https://jerrynsh.com/5-useful-tips-while-working-with-python-scrapy/
* [TODO] Move the code for fetching a post's `.images` and `.top_image` to a Pipeline/Middleware; it currently parses/downloads images even for duplicate posts!
  https://github.com/scrapy/scrapy/issues/2436
  https://doc.scrapy.org/en/latest/topics/spider-middleware.html#scrapy.spidermiddlewares.SpiderMiddleware.process_spider_output
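A minimal sketch of the bulk-insert idea, assuming a MongoDB backend via `pymongo`; the `posts` collection name and the fallback URI are illustrative assumptions, not this library's actual pipeline.

```python
# Sketch only: buffer items in process_item(), flush once in close_spider().
from pymongo import MongoClient


class BulkInsertPipeline:

    def open_spider(self, spider):
        # CRAWL_DB_URI is read from the project settings; the fallback is a placeholder
        uri = spider.settings.get("CRAWL_DB_URI", "mongodb://localhost:27017/crawl")
        self.client = MongoClient(uri)
        self.db = self.client.get_default_database()
        self.batch = []

    def process_item(self, item, spider):
        # accumulate instead of issuing one write per item
        self.batch.append(dict(item))
        return item

    def close_spider(self, spider):
        if self.batch:
            self.db["posts"].insert_many(self.batch)  # single round-trip
        self.client.close()
```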

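Likewise, a minimal sketch of the NER-as-middleware idea using spaCy; the `text` and `entities` field names, and the `en_core_web_sm` model, are assumptions (post field names are customizable in this library).

```python
# Sketch only: a spider middleware that tags scraped items with named entities.
# Enable it via the SPIDER_MIDDLEWARES setting.
import spacy
from scrapy import Item


class NerMiddleware:

    def __init__(self):
        # assumes `python -m spacy download en_core_web_sm` was run beforehand
        self.nlp = spacy.load("en_core_web_sm")

    def process_spider_output(self, response, result, spider):
        for element in result:
            # only touch items (dicts, or Items declaring an "entities" field)
            if isinstance(element, (Item, dict)) and element.get("text"):
                doc = self.nlp(element["text"])
                element["entities"] = [(ent.text, ent.label_) for ent in doc.ents]
            yield element
```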
### Feature request

* Bypass scraper blocking. Refs: [1](https://scrapfly.io/blog/web-scraping-with-scrapy/). Test the various plugins for proxy management (a settings sketch follows this list), e.g.:
  - [scrapy-rotating-proxies](https://github.com/TeamHG-Memex/scrapy-rotating-proxies),
  - [scrapy-fake-useragent](https://github.com/alecxe/scrapy-fake-useragent), for randomizing user-agent headers.
* Browser emulation and scraping dynamic (JS) pages using:
  - scrapy-selenium (+GCP): [1](https://youtu.be/2LwrUu9yTAo), [2](https://www.roelpeters.be/how-to-deploy-a-scraping-script-and-selenium-in-google-cloud-run/)
  - [scrapy-playwright](https://pypi.org/project/scrapy-playwright/)
  - JS support via [Splash](https://splash.readthedocs.io/en/stable/faq.html). Won't do: seems to require running in a Docker container??
* Migrate to distributed scraping, e.g. [Frontera](https://github.com/scrapinghub/frontera).
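A possible `settings.py` sketch for trying those two proxy plugins together (assumes `pip install scrapy-rotating-proxies scrapy-fake-useragent`; the proxy list entries are placeholders):

```python
# Sketch only: rotate proxies and randomize the User-Agent header.
ROTATING_PROXY_LIST = [
    "proxy1.example.com:8031",  # placeholder proxies
    "proxy2.example.com:8032",
]

DOWNLOADER_MIDDLEWARES = {
    # scrapy-rotating-proxies: proxy rotation + ban detection
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
    # scrapy-fake-useragent: replace the stock user-agent middleware
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 400,
}
```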