antiSMI-Collector

Category: Database systems
Development tool: Python
File size: 0KB
Downloads: 0
Upload date: 2023-09-07 20:42:07
Uploader: sh-1993
Description: News parser

File list:
.dockerignore (38, 2024-01-06)
Dockerfile (324, 2024-01-06)
docker-compose.yml (189, 2024-01-06)
img/ (0, 2024-01-06)
img/AntiSMI structure small.png (129524, 2024-01-06)
img/AntiSMI structure.png (218196, 2024-01-06)
img/Parser stats.png (127946, 2024-01-06)
img/antismi_common.png (446603, 2024-01-06)
imports/ (0, 2024-01-06)
imports/__init__.py (0, 2024-01-06)
imports/imports.py (1363, 2024-01-06)
imports/phrase_dicts.py (2398, 2024-01-06)
main.py (2486, 2024-01-06)
pyproject.toml (288, 2024-01-06)
requirements.txt (1105, 2024-01-06)
scripts/ (0, 2024-01-06)
scripts/__init__.py (0, 2024-01-06)
scripts/cook.py (6674, 2024-01-06)
scripts/db.py (10019, 2024-01-06)
scripts/fix.py (2062, 2024-01-06)
scripts/shop.py (11555, 2024-01-06)
scripts/taste.py (5353, 2024-01-06)

# antiSMI-Collector

![Parser stats](https://github.com/maxlethal/antiSMI-Collector/blob/master/img/Parser%20stats.png?raw=true)

## Table of contents
* [About](#about)
* [Stats](#stats)
* [Stack](#stack)
* [ML models](#ml-models)
* [Development Tools](#development-tools)
* [Code structure](#code-structure)

## About
The Collector is one of three parts of the [AntiSMI Project](https://github.com/maxlethal/antiSMI-Project). It continuously collects fresh news from various sources, processes it and stores it for use by the Project's other parts:
* [Bot](https://github.com/maxlethal/antiSMI-Bot) - creates and sends a personal smart news digest via the Telegram interface
* **Observer** - researches social trends, builds dashboards and creates NLP models

During news processing, trained machine learning models categorize each news item and generate its summary and title.

## Stats
* **Start:** 2022-07-01 [project suspended for 2 months in 2022]
* **Capacity:** 40 news agencies, 500 news items/day
* **Bot database capacity:** > 100,000 news articles [07.2022 - today]
* **Archive base capacity:** > 1.5 million articles [08.1999 - 04.2019]

## Stack
* **Language:** Python, SQL
* **Databases:** PostgreSQL, SQLAlchemy
* **Validation:** pydantic
* **Logging:** loguru
* **BI:** Apache Superset
* **Scraping:** requests, Beautiful Soup 4

## ML models:
- **Summarization:**
  - mBart, Seq2Seq, pre-trained [news summary]
  - ruT5, pre-trained [headline]
- **Categorization:**
  - fasttext, supervised pre-training, 7 classes (categories)

**Clustering** problems are solved by [AntiSMI-Bot](https://github.com/maxlethal/antiSMI-Bot).
## Development Tools:
- PyCharm
- Docker
- GitHub
- Linux shell

## Code structure:
**main.py** serves as an interface for managing three modules from the _scripts_ directory:
- **shop.py**: parses news from news agencies (similar to news "shopping")
- **cook.py**: generates titles and summaries for the news collected by shop.py (similar to news "cooking")
- **taste.py**: validates the "prepared" news against the field types used in the databases and records only correct data

There are two additional modules in the scripts folder:
- **db.py**: mandatory auxiliary module, contains a set of variables, classes and functions for working with databases
- **fix.py**: optional auxiliary module, fixes problems with article summaries in the Bot's "past machine" module
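
The shop, cook and taste flow described above could be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: the function signatures, the `EXPECTED_TYPES` field layout and the `fake_agency` source are all hypothetical. The title and summary logic is a plain-text placeholder standing in for the ML models, and a simple `isinstance` check stands in for the pydantic validation the project uses, so the sketch runs on the standard library alone.

```python
from datetime import datetime

# Hypothetical field layout; the real schema is defined in scripts/db.py.
EXPECTED_TYPES = {
    "url": str,
    "agency": str,
    "published": datetime,
    "text": str,
    "title": str,
    "summary": str,
}


def shop(sources):
    """'Shopping': gather raw items from every source.

    Each source is a callable returning a list of dicts. The real module
    downloads and parses agency pages; callables keep this sketch offline.
    """
    raw = []
    for fetch in sources:
        raw.extend(fetch())
    return raw


def cook(raw_items):
    """'Cooking': add a title and a summary to every raw item.

    Placeholder logic only: the first sentence stands in for the generated
    headline and the first 200 characters for the generated summary.
    """
    cooked = []
    for item in raw_items:
        enriched = dict(item)  # copy so the raw item is left untouched
        text = item.get("text")
        if isinstance(text, str):
            enriched["title"] = text.split(".")[0].strip()
            enriched["summary"] = text[:200].strip()
        cooked.append(enriched)
    return cooked


def taste(cooked_items):
    """'Tasting': keep only items whose fields all have the expected type."""
    return [
        item
        for item in cooked_items
        if all(
            isinstance(item.get(name), kind)
            for name, kind in EXPECTED_TYPES.items()
        )
    ]


def run_pipeline(sources):
    """Chain the three stages in the order main.py is described as managing."""
    return taste(cook(shop(sources)))


def fake_agency():
    """Hypothetical source: one well-formed item and one with a bad date."""
    return [
        {
            "url": "https://example.com/a",
            "agency": "Example",
            "published": datetime(2024, 1, 6),
            "text": "First sentence. Second sentence.",
        },
        {
            "url": "https://example.com/b",
            "agency": "Example",
            "published": "yesterday",  # wrong type, rejected by taste()
            "text": "Bad date. More text.",
        },
    ]


stored = run_pipeline([fake_agency])
```

Here `stored` contains only the first item, because the second one carries a string where a `datetime` is expected. Keeping validation as the last stage means that malformed records never reach the database layer, which matches the "recording only correct data" role described for taste.py.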
