newsnlp

Category: Tool library
Development tool: Python
File size: 0KB
Downloads: 0
Upload date: 2023-09-21 16:00:08
Uploader: sh-1993
Description: News summarization, categorization and ad suppression made easy.

File list:
demo/ (0, 2023-01-03)
demo/__init__.py (0, 2023-01-03)
demo/sum.py (1560, 2023-01-03)
demo/tfidf.py (2246, 2023-01-03)
doc/ (0, 2023-01-03)
doc/lec38.pdf (981026, 2023-01-03)
setup.cfg (1397, 2023-01-03)
setup.py (95, 2023-01-03)
src/ (0, 2023-01-03)
src/__init__.py (0, 2023-01-03)
src/metrics.py (850, 2023-01-03)
src/newsnlp/ (0, 2023-01-03)
src/newsnlp/__init__.py (127, 2023-01-03)
src/newsnlp/base.py (645, 2023-01-03)
src/newsnlp/categorizer.py (1187, 2023-01-03)
src/newsnlp/summarizer.py (2588, 2023-01-03)
src/newsnlp/translator.py (309, 2023-01-03)
src/newsnlp/utils/ (0, 2023-01-03)
src/newsnlp/utils/__init__.py (0, 2023-01-03)
src/newsnlp/utils/token_prep.py (4350, 2023-01-03)
src/newsnlp/utils/vocab.py (1596, 2023-01-03)
src/newsnlp/vect/ (0, 2023-01-03)
src/newsnlp/vect/__init__.py (35, 2023-01-03)
src/newsnlp/vect/tfidf.py (5623, 2023-01-03)
src/requirements.txt (2498, 2023-01-03)

# newsnlp

Python library that transforms news data using a variety of NLP-based processing tasks. You may:

* Generate a summary, caption and category for scraped content using NLP.
* Gather posts by affinity, e.g. all similar titles from a bunch of websites.
* Generate meta-summaries from posts grouped by affinity (concept of sibling posts) using NLP.
* Remove ads and undesirable content using NLP.
* Download and cache pre-trained Transformer models from HuggingFace for the summarization tasks (check supported languages).
* Use Anaconda envs for scientific computing.

## Features

- Uses [Playwright](https://playwright.dev/) to inflate dynamic content from websites (e.g. ads, JavaScript) before processing.
- Downloads and caches pretrained NLP models locally, suitable for fast inference.
- Pretrained deep learning ad detection model.
- Pretrained Transformer-based LLM for the summarization tasks (only French is supported so far).
- Ad detection partially implements the [(Kushmerick, 1999)](./doc/kushmerick99learning.pdf) paper, but relies on deep learning rather than statistical fitting.

## Dev

- Env setup

```shell
conda create --name newsbot python=3
pip-sync
```

- Summariser's dependencies: `sentencepiece`

```shell
conda install -c conda-forge sentencepiece
conda install pytorch torchvision -c pytorch
#conda install -c conda-forge transformers
conda install -c huggingface transformers
```

- TF-IDF deps: `spacy`

```shell
conda install -c conda-forge spacy
python -m spacy download en_core_web_sm
```

## TODO

### Optimisations

- Use an optimised TF-IDF implementation from spaCy or scikit-learn.
- Utilise only half of the symmetric TF-IDF matrix.
- Resume vectorization of the corpus where the last task left off. This implies saving the vectorization result to disk and merging it with docs newly added to the db.
- Cython?

### Feature requests

- Multiple languages support - OK 07/06/2023
- The summarizer currently supports only 1024 words max. Find a more powerful model? Push model capacity?
- Require `conda` in setuptools/pyproject.toml?
- What is the [sumy](https://github.com/miso-belica/sumy) summariser?
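
The "gather posts by affinity" feature and the "use only half of the symmetric matrix" optimisation can be illustrated together. The sketch below is not the code in `src/newsnlp/vect/tfidf.py`. It is a minimal, dependency-free illustration that assumes the symmetric matrix in question holds pairwise cosine similarities between TF-IDF vectors. The function names (`tfidf_vectors`, `upper_triangle_similarities`, `group_by_affinity`), the whitespace tokenizer, the smoothed IDF formula and the `threshold` value are all hypothetical choices made for the example.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one sparse, L2-normalised TF-IDF vector (dict) per document."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)  # smoothed IDF
            for term, count in tf.items()
        }
        # After normalisation, a dot product equals the cosine similarity.
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


def upper_triangle_similarities(vectors):
    """Compute sim(i, j) for i < j only, since sim(j, i) is identical."""
    sims = {}
    for i, j in combinations(range(len(vectors)), 2):
        a, b = vectors[i], vectors[j]
        if len(a) > len(b):
            a, b = b, a  # iterate over the shorter vector
        sims[(i, j)] = sum(w * b.get(t, 0.0) for t, w in a.items())
    return sims


def group_by_affinity(docs, threshold=0.3):
    """Group document indices whose similarity reaches the threshold."""
    sims = upper_triangle_similarities(tfidf_vectors(docs))
    parent = list(range(len(docs)))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), sim in sims.items():
        if sim >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for idx in range(len(docs)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())


titles = [
    "government announces new climate plan",
    "new climate plan announced by government",
    "local team wins football cup final",
    "football cup final won by local team",
]
print(group_by_affinity(titles))  # [[0, 1], [2, 3]]
```

For `n` documents this evaluates `n * (n - 1) / 2` pairs instead of `n * n`, which is the saving the TODO item points at. A real implementation would likely swap the whitespace tokenizer for the spaCy pipeline installed above.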
