seldonite
Category: Cloud computing
Development tool: Python
File size: 66KB
Downloads: 0
Upload date: 2023-03-31 23:47:19
Uploader: sh-1993
Description: A News Article Collection Library
File list:
LICENSE (1077, 2023-04-01)
Makefile (130, 2023-04-01)
environment.yml (588, 2023-04-01)
examples (0, 2023-04-01)
examples\collect (0, 2023-04-01)
examples\collect\collect_by_keyword.py (533, 2023-04-01)
examples\collect\collect_countries.py (594, 2023-04-01)
examples\collect\collect_political_articles.py (521, 2023-04-01)
examples\embed (0, 2023-04-01)
examples\embed\news2vec_embed.py (800, 2023-04-01)
examples\graphs (0, 2023-04-01)
examples\graphs\build_entity_dag.py (812, 2023-04-01)
examples\graphs\build_news2vec.py (843, 2023-04-01)
examples\nlp (0, 2023-04-01)
examples\nlp\entities.py (546, 2023-04-01)
examples\nlp\tfidf.py (613, 2023-04-01)
examples\sources (0, 2023-04-01)
examples\sources\common_crawl_news.py (970, 2023-04-01)
examples\sources\google_news.py (749, 2023-04-01)
examples\sources\mongo_news.py (648, 2023-04-01)
examples\sources\news_crawl_news.py (1061, 2023-04-01)
examples\visualize (0, 2023-04-01)
examples\visualize\visualize_entity_dag.py (544, 2023-04-01)
seldonite (0, 2023-04-01)
seldonite\__init__.py (0, 2023-04-01)
seldonite\analyze.py (3148, 2023-04-01)
seldonite\base.py (453, 2023-04-01)
seldonite\collect.py (9362, 2023-04-01)
seldonite\commoncrawl (0, 2023-04-01)
seldonite\commoncrawl\__init__.py (0, 2023-04-01)
seldonite\commoncrawl\cc-index-schema-flat.json (8040, 2023-04-01)
seldonite\commoncrawl\cc_index_fetch_news.py (2919, 2023-04-01)
seldonite\commoncrawl\fetch_news.py (2476, 2023-04-01)
seldonite\commoncrawl\sparkcc.py (15707, 2023-04-01)
seldonite\embed.py (7950, 2023-04-01)
seldonite\filters (0, 2023-04-01)
seldonite\filters\__init__.py (812, 2023-04-01)
... ...
# Seldonite
### A News Article Collection and Processing Library
Define a news source, set your search method, and collect news articles or create news graphs.
Usage:
```python
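import os
# `graphs` is used below, so it must be imported too
from seldonite import sources, collect, graphs, run

# Read AWS credentials from the environment
aws_access_key = os.environ['AWS_ACCESS_KEY']
aws_secret_key = os.environ['AWS_SECRET_KEY']
source = sources.news.CommonCrawl(aws_access_key, aws_secret_key)

# Restrict collection to specific sites and keywords
collector = collect.Collector(source) \
    .on_sites(['cbc.ca', 'bbc.com']) \
    .by_keywords(['afghanistan', 'withdrawal'])

# Build a TF-IDF graph from the collected articles
graph = graphs.Graph(collector) \
    .build_tfidf_graph()

# Run the pipeline and return the results as pandas DataFrames
articles_df, words_df, edges_df = run.Runner(graph) \
    .to_pandas()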
```
Please see the wiki for more detail on sources and methods.
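The usage example returns three data frames: articles, words, and edges. As a rough illustration of what a TF-IDF graph of that shape can contain, here is a minimal standard-library sketch. It is not Seldonite's implementation: `build_tfidf_edges`, `min_weight`, the whitespace tokenizer, and the sample articles are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_tfidf_edges(articles, min_weight=0.0):
    """Build a bipartite article-word graph weighted by TF-IDF.

    `articles` maps a (hypothetical) article id to its text.
    Returns (article_nodes, word_nodes, edges), where each edge is
    (article_id, word, weight).
    """
    # Naive whitespace tokenization; a real pipeline would use a proper tokenizer.
    tokens = {aid: text.lower().split() for aid, text in articles.items()}
    n_docs = len(tokens)

    # Document frequency: the number of articles each word appears in.
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))

    edges = []
    for aid, words in tokens.items():
        for word, count in Counter(words).items():
            tf = count / len(words)
            idf = math.log(n_docs / doc_freq[word])
            weight = tf * idf
            # Words that occur in every article get idf == 0 and are dropped.
            if weight > min_weight:
                edges.append((aid, word, weight))

    article_nodes = sorted(tokens)
    word_nodes = sorted({word for _, word, _ in edges})
    return article_nodes, word_nodes, edges


# Made-up sample data, for illustration only.
articles = {
    "a1": "troops begin withdrawal from afghanistan",
    "a2": "afghanistan talks continue",
    "a3": "markets rally as talks continue",
}
article_nodes, word_nodes, edges = build_tfidf_edges(articles)
```

Each edge links one article to one word, so the three return values play the same roles as the article, word, and edge tables in the usage example. Raising `min_weight` prunes low-information words and keeps the graph smaller.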
## Setup
To install seldonite in editable mode, along with its dependencies, via conda:
```
conda env create -f ./environment.yml
```
This library relies on several third-party libraries. Brief setup instructions for each follow:
### spaCy
To use the NLP methods that rely on spaCy, download the English model:
```
python -m spacy download en_core_web_sm
```
### Spark
To make Python dependencies available to Spark executors, use the dependency packaging script:
```
bash ./seldonite/spark/package_pyspark_deps.sh
```
## Tests
We use `pytest`.
To run the tests, run this command from the top-level directory:
```
pytest
```
## Credits
* Spark-based pipeline adapted from Sebastian Nagel's [cc-pyspark](https://github.com/commoncrawl/cc-pyspark) library.
* Heuristics for news articles adapted from [news-please](https://github.com/fhamborg/news-please).
* Political news classifier taken from [Political-News-Filter](https://github.com/lukasgebhard/Political-News-Filter).