seldonite

Category: Cloud computing
Development tool: Python
File size: 66KB
Downloads: 0
Upload date: 2023-03-31 23:47:19
Uploader: sh-1993
Description: A News Article Collection Library

File list:
LICENSE (1077, 2023-04-01)
Makefile (130, 2023-04-01)
environment.yml (588, 2023-04-01)
examples (0, 2023-04-01)
examples\collect (0, 2023-04-01)
examples\collect\collect_by_keyword.py (533, 2023-04-01)
examples\collect\collect_countries.py (594, 2023-04-01)
examples\collect\collect_political_articles.py (521, 2023-04-01)
examples\embed (0, 2023-04-01)
examples\embed\news2vec_embed.py (800, 2023-04-01)
examples\graphs (0, 2023-04-01)
examples\graphs\build_entity_dag.py (812, 2023-04-01)
examples\graphs\build_news2vec.py (843, 2023-04-01)
examples\nlp (0, 2023-04-01)
examples\nlp\entities.py (546, 2023-04-01)
examples\nlp\tfidf.py (613, 2023-04-01)
examples\sources (0, 2023-04-01)
examples\sources\common_crawl_news.py (970, 2023-04-01)
examples\sources\google_news.py (749, 2023-04-01)
examples\sources\mongo_news.py (648, 2023-04-01)
examples\sources\news_crawl_news.py (1061, 2023-04-01)
examples\visualize (0, 2023-04-01)
examples\visualize\visualize_entity_dag.py (544, 2023-04-01)
seldonite (0, 2023-04-01)
seldonite\__init__.py (0, 2023-04-01)
seldonite\analyze.py (3148, 2023-04-01)
seldonite\base.py (453, 2023-04-01)
seldonite\collect.py (9362, 2023-04-01)
seldonite\commoncrawl (0, 2023-04-01)
seldonite\commoncrawl\__init__.py (0, 2023-04-01)
seldonite\commoncrawl\cc-index-schema-flat.json (8040, 2023-04-01)
seldonite\commoncrawl\cc_index_fetch_news.py (2919, 2023-04-01)
seldonite\commoncrawl\fetch_news.py (2476, 2023-04-01)
seldonite\commoncrawl\sparkcc.py (15707, 2023-04-01)
seldonite\embed.py (7950, 2023-04-01)
seldonite\filters (0, 2023-04-01)
seldonite\filters\__init__.py (812, 2023-04-01)
... ...

# Seldonite

### A News Article Collection and Processing Library

Define a news source, set your search method, and collect news articles or create news graphs.

Usage:

```python
import os

# graphs is imported as well because graphs.Graph is used below
from seldonite import sources, collect, graphs, run

# AWS credentials are read from the environment
aws_access_key = os.environ['AWS_ACCESS_KEY']
aws_secret_key = os.environ['AWS_SECRET_KEY']

# Define the news source
source = sources.news.CommonCrawl(aws_access_key, aws_secret_key)

# Restrict collection to specific sites and keywords
collector = collect.Collector(source) \
    .on_sites(['cbc.ca', 'bbc.com']) \
    .by_keywords(['afghanistan', 'withdrawal'])

# Build a TF-IDF graph from the collected articles
graph = graphs.Graph(collector) \
    .build_tfidf_graph()

# Run the pipeline and fetch the results as pandas DataFrames
articles_df, words_df, edges_df = run.Runner(graph) \
    .to_pandas()
```

Please see the wiki for more detail on sources and methods.

## Setup

To install seldonite in editable mode, along with its dependencies, via conda:

```
conda env create -f ./environment.yml
```

This library relies on several third-party libraries. Limited setup instructions for them follow.

### spaCy

To use NLP methods that require spaCy, download the English model:

```
python -m spacy download en_core_web_sm
```

### Spark

To make Python dependencies available to Spark executors, use the dependency packaging script:

```
bash ./seldonite/spark/package_pyspark_deps.sh
```

## Tests

We use `pytest`. To run the tests, run this command from the top-level directory:

```
pytest
```

## Credits

* Spark-based pipeline adapted from Sebastian Nagel's [cc-pyspark](https://github.com/commoncrawl/cc-pyspark) library.
* Heuristics for news articles adapted from [news-please](https://github.com/fhamborg/news-please).
* Political news classifier taken from [Political-News-Filter](https://github.com/lukasgebhard/Political-News-Filter).
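
For readers unfamiliar with the term, the sketch below illustrates one common way to represent a TF-IDF graph like the one returned in the usage example: a bipartite graph of article nodes and word nodes, with each edge weighted by the word's TF-IDF score in that article. This is not Seldonite's implementation. The function name `tfidf_graph_sketch`, the whitespace tokenizer, and the plain tuple output are all assumptions made for illustration, using only the standard library.

```python
import math
from collections import Counter


def tfidf_graph_sketch(articles):
    """Build a bipartite article-word graph weighted by TF-IDF.

    Hypothetical helper, not part of Seldonite. `articles` maps an
    article id to its text. Returns (article_nodes, word_nodes, edges),
    where each edge is a tuple (article_id, word, weight).
    """
    # Naive whitespace tokenization; a real pipeline would use a proper tokenizer.
    tokens = {aid: text.lower().split() for aid, text in articles.items()}
    n_docs = len(tokens)

    # Document frequency: the number of articles that contain each word.
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))

    edges = []
    for aid, words in tokens.items():
        counts = Counter(words)
        for word, count in counts.items():
            tf = count / len(words)
            idf = math.log(n_docs / doc_freq[word])
            weight = tf * idf
            # Words that appear in every article get weight 0 and are dropped.
            if weight > 0:
                edges.append((aid, word, weight))

    article_nodes = sorted(tokens)
    word_nodes = sorted({word for _, word, _ in edges})
    return article_nodes, word_nodes, edges
```

Calling `tfidf_graph_sketch({"a1": "troops withdrawal afghanistan", "a2": "election results afghanistan"})` yields edges only for the words that distinguish the two articles. The three returned collections loosely mirror the articles, words, and edges tables in the usage example.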
