newsnlp
Category: Tool library
Development tool: Python
File size: 0KB
Downloads: 0
Upload date: 2023-09-21 16:00:08
Uploader: sh-1993
Description: News summarization, categorization and ad suppression made easy.
File list:
demo/ (0, 2023-01-03)
demo/__init__.py (0, 2023-01-03)
demo/sum.py (1560, 2023-01-03)
demo/tfidf.py (2246, 2023-01-03)
doc/ (0, 2023-01-03)
doc/lec38.pdf (981026, 2023-01-03)
setup.cfg (1397, 2023-01-03)
setup.py (95, 2023-01-03)
src/ (0, 2023-01-03)
src/__init__.py (0, 2023-01-03)
src/metrics.py (850, 2023-01-03)
src/newsnlp/ (0, 2023-01-03)
src/newsnlp/__init__.py (127, 2023-01-03)
src/newsnlp/base.py (645, 2023-01-03)
src/newsnlp/categorizer.py (1187, 2023-01-03)
src/newsnlp/summarizer.py (2588, 2023-01-03)
src/newsnlp/translator.py (309, 2023-01-03)
src/newsnlp/utils/ (0, 2023-01-03)
src/newsnlp/utils/__init__.py (0, 2023-01-03)
src/newsnlp/utils/token_prep.py (4350, 2023-01-03)
src/newsnlp/utils/vocab.py (1596, 2023-01-03)
src/newsnlp/vect/ (0, 2023-01-03)
src/newsnlp/vect/__init__.py (35, 2023-01-03)
src/newsnlp/vect/tfidf.py (5623, 2023-01-03)
src/requirements.txt (2498, 2023-01-03)
# newsnlp
Python library that transforms news data using a variety of NLP-based processing tasks.
You may:
* Generate a summary, caption, and category for scraped content using NLP.
* Gather posts by affinity, e.g. all similar titles from a bunch of websites.
* Generate meta-summaries from posts grouped by affinity (the concept of sibling posts) using NLP.
* Remove ads and undesirable content using NLP.
* Download and cache pre-trained Transformer models from HuggingFace for the summarization tasks (check the supported languages).
* Use Anaconda envs for scientific computing.
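
To make "gathering posts by affinity" concrete, here is a minimal, standard-library-only sketch that groups titles by TF-IDF cosine similarity. It is an illustration under assumptions, not the code in `src/newsnlp/vect/tfidf.py`: the function names (`tokenize`, `tfidf_vectors`, `cosine`, `group_by_affinity`), the smoothed IDF formula, the greedy grouping strategy and the `0.3` threshold are all hypothetical choices.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption) so ubiquitous terms keep a small weight.
        vectors.append({
            t: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_by_affinity(titles, threshold=0.3):
    """Greedy grouping: a title joins the first group whose seed is similar enough."""
    vectors = tfidf_vectors(titles)
    groups = []  # each group is a list of indices; the first index is the seed
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return [[titles[i] for i in g] for g in groups]
```

Calling `group_by_affinity(titles)` on a list of scraped titles returns a list of groups, each holding the titles considered siblings of the group's first entry.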
## Features
- Uses [Playwright](https://playwright.dev/) to inflate dynamic content from websites (e.g. ads, JavaScript) before processing.
- Downloads and caches pretrained NLP models locally, suitable for fast inference.
- Pretrained deep learning Ad detection model.
- Pretrained Transformer-based LLM for the summarization tasks (only French is supported so far).
- Ad detection partially implements the [(Kushmerick, 1999)](./doc/kushmerick99learning.pdf) paper,
but relies on deep learning rather than statistical fitting.
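
As a rough illustration of the kind of input an ad detector might consume, the sketch below builds a small feature dict for one image element from its URL, alt text and geometry. This is loosely in the spirit of the cues discussed in the paper and is not the library's model: `AD_HINTS`, `image_features` and the chosen features are hypothetical, and a deep learning model would learn its own representation rather than use a hand-written hint list.

```python
import re
from urllib.parse import urljoin, urlparse

# Hypothetical hint words; a trained model would learn such cues from data.
AD_HINTS = ("ad", "ads", "banner", "click", "sponsor", "promo")


def image_features(src, page_url, alt="", width=None, height=None):
    """Build a small feature dict for one <img> element (illustrative only)."""
    # Resolve relative image URLs against the page before comparing hosts.
    src_host = urlparse(urljoin(page_url, src)).netloc
    page_host = urlparse(page_url).netloc
    # Alphabetic terms found in the image URL and alt text.
    terms = set(re.findall(r"[a-z]+", (src + " " + alt).lower()))
    return {
        "aspect_ratio": width / height if width and height else None,
        "same_host": src_host == page_host,
        "hint_count": sum(1 for hint in AD_HINTS if hint in terms),
    }
```

A wide banner served from a third-party host would typically show a high aspect ratio, `same_host` set to `False`, and a non-zero `hint_count`; a classifier would take such values (or learned equivalents) as input.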
## Dev
- Env setup
```shell
conda create --name newsbot python=3
conda activate newsbot  # run the installs below inside the new env
pip-sync  # provided by pip-tools
```
- Summariser's dependencies: `sentencepiece`
```shell
conda install -c conda-forge sentencepiece
conda install pytorch torchvision -c pytorch
#conda install -c conda-forge transformers
conda install -c huggingface transformers
```
- TF-IDF dependencies: `spacy`
```shell
conda install -c conda-forge spacy
python -m spacy download en_core_web_sm
```
## TODO
### Optimisations
- Use an optimised TF-IDF implementation from spaCy or scikit-learn
- Compute and store only half of the symmetric TF-IDF similarity matrix
- Resume vectorization of corpus where last task left off.
This implies saving vectorization result to disk, and merging with docs newly added to the db.
- Cython ??
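
The half-matrix idea can be sketched as follows: since cosine similarity is symmetric, it is enough to compute each unordered pair once and look up the mirrored entry on demand. This is a minimal sketch assuming documents are already vectorized as sparse `{term: weight}` dicts; `pairwise_similarity` and `similarity` are hypothetical names, not functions from this library.

```python
import math
from itertools import combinations


def pairwise_similarity(vectors):
    """Cosine similarity for each unordered pair (i, j) with i < j."""
    norms = [math.sqrt(sum(w * w for w in v.values())) for v in vectors]
    sims = {}
    # combinations() yields only the strict upper triangle: n * (n - 1) / 2 pairs.
    for i, j in combinations(range(len(vectors)), 2):
        a, b = vectors[i], vectors[j]
        if len(a) > len(b):
            a, b = b, a  # iterate over the shorter sparse vector
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        denom = norms[i] * norms[j]
        sims[(i, j)] = dot / denom if denom else 0.0
    return sims


def similarity(sims, i, j):
    """Look up sim(i, j) from the half matrix, in either index order."""
    if i == j:
        return 1.0  # diagonal, assuming non-empty vectors
    return sims[(i, j) if i < j else (j, i)]
```

For `n` documents this stores `n * (n - 1) / 2` values instead of `n * n`, and halves the number of dot products.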
### Feature requests
- Multi-language support: done 07/06/2023
- The summarizer currently supports at most 1024 words of input. Find a more powerful model, or push the model's capacity?
- Require `conda` in setuptools/pyproject.toml?
- Look into the [sumy](https://github.com/miso-belica/sumy) summariser.
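
One possible workaround for the 1024-word limit, short of switching models, is to split long inputs into windows and summarize each window separately. The sketch below is only an assumption about how that could look: `chunk_words` and `summarize_long` are hypothetical helpers, and `summarize` stands in for whatever callable wraps the actual model.

```python
def chunk_words(text, max_words=1024):
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize_long(text, summarize, max_words=1024):
    """Summarize each chunk with `summarize` (any str -> str callable) and join the results."""
    partial_summaries = [summarize(chunk) for chunk in chunk_words(text, max_words)]
    return " ".join(partial_summaries)
```

Splitting on word boundaries ignores sentence and paragraph structure, so a chunk may start mid-sentence; splitting on sentences first would likely give better summaries at the cost of a little more code.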