fundus

python NLP RSS sitemap crawler

Category: Natural Language Processing
Development tool: Python
File size: 0KB
Downloads: 0
Upload date: 2024-04-01 15:07:30
Uploader: sh-1993

Description: A very simple news crawler with a funny name

File list:

docs/
resources/logo/
scripts/
src/fundus/
tests/
CODE_OF_CONDUCT.md
LICENSE
MANIFEST.in
pyproject.toml

A very simple news crawler in Python. Developed at Humboldt University of Berlin.

[Quick Start](#quick-start) | [Tutorials](#tutorials) | [News Sources](https://github.com/flairNLP/fundus/blob/master/docs/supported_publishers.md)

---

Fundus is:

* **A static news crawler.** Fundus lets you crawl online news articles with only a few lines of Python code, be it from live websites or the CC-NEWS dataset.
* **An open-source Python package.** Fundus is built on the idea of building something together. We welcome your contribution to help Fundus [grow](https://github.com/flairNLP/fundus/blob/master/docs/how_to_contribute.md)!

## Quick Start

To install from pip, simply do:

```
pip install fundus
```

Fundus requires Python 3.8+.

## Example 1: Crawl a bunch of English-language news articles

Let's use Fundus to crawl 2 articles from publishers based in the US.

```python
from fundus import PublisherCollection, Crawler

# initialize the crawler for news publishers based in the US
crawler = Crawler(PublisherCollection.us)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
```

That's it!

If you run this code, it should print out something like this:

```console
Fundus-Article:
- Title: "Feinstein's Return Not Enough for Confirmation of Controversial New [...]"
- Text: "Democrats jammed three of President Joe Biden's controversial court nominees through committee votes on Thursday thanks to a last-minute [...]"
- URL: https://freebeacon.com/politics/feinsteins-return-not-enough-for-confirmation-of-controversial-new-hampshire-judicial-nominee/
- From: FreeBeacon (2023-05-11 18:41)

Fundus-Article:
- Title: "Northwestern student government freezes College Republicans funding over [...]"
- Text: "Student government at Northwestern University in Illinois "indefinitely" froze the funds of the university's chapter of College Republicans [...]"
- URL: https://www.foxnews.com/us/northwestern-student-government-freezes-college-republicans-funding-poster-critical-lgbtq-community
- From: FoxNews (2023-05-09 14:37)
```

This printout tells you that you successfully crawled two articles!

For each article, the printout details:
- the "Title" of the article, i.e. its headline
- the "Text", i.e. the main article body text
- the "URL" from which it was crawled
- the news source it is "From"

## Example 2: Crawl a specific news source

Maybe you want to crawl a specific news source instead.
Let's crawl news articles from Washington Times only:

```python
from fundus import PublisherCollection, Crawler

# initialize the crawler for Washington Times
crawler = Crawler(PublisherCollection.us.WashingtonTimes)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
```

## Example 3: Crawl articles from CC-NEWS

If you're not familiar with CC-NEWS, check out their [paper](https://paperswithcode.com/dataset/cc-news).

```python
from fundus import PublisherCollection, CCNewsCrawler

# initialize the crawler for news publishers based in the US
crawler = CCNewsCrawler(*PublisherCollection.us)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
```

## Tutorials

We provide **quick tutorials** to get you started with the library:

1. [**Tutorial 1: How to crawl news with Fundus**](https://github.com/flairNLP/fundus/blob/master/docs/1_getting_started.md)
2. [**Tutorial 2: How to crawl articles from CC-NEWS**](https://github.com/flairNLP/fundus/blob/master/docs/2_crawl_from_cc_news.md)
3. [**Tutorial 3: The Article Class**](https://github.com/flairNLP/fundus/blob/master/docs/3_the_article_class.md)
4. [**Tutorial 4: How to filter articles**](https://github.com/flairNLP/fundus/blob/master/docs/4_how_to_filter_articles.md)
5. [**Tutorial 5: How to search for publishers**](https://github.com/flairNLP/fundus/blob/master/docs/5_how_to_search_for_publishers.md)

If you wish to contribute, check out these tutorials:

1. [**How to contribute**](https://github.com/flairNLP/fundus/blob/master/docs/how_to_contribute.md)
2. [**How to add a publisher**](https://github.com/flairNLP/fundus/blob/master/docs/how_to_add_a_publisher.md)

## Currently Supported News Sources

You can find the publishers currently supported [**here**](https://github.com/flairNLP/fundus/blob/master/docs/supported_publishers.md).
Also: **Adding a new publisher is easy - consider contributing to the project!**

## Contact

Please email your questions or comments to [**Max Dallabetta**](mailto:max.dallabetta@googlemail.com?subject=[GitHub]%20Fundus).

## Contributing

Thanks for your interest in contributing! There are many ways to get involved; start with our [contributor guidelines](https://github.com/flairNLP/fundus/blob/master/docs/how_to_contribute.md) and then check these [open issues](https://github.com/flairNLP/fundus/issues) for specific tasks.

## License

[MIT](https://github.com/flairNLP/fundus/blob/master/LICENSE)
