antiSMI-Collector
Category: Database systems
Development tool: Python
File size: 0KB
Downloads: 0
Upload date: 2023-09-07 20:42:07
Uploader: sh-1993
Description: News parser
File list:
.dockerignore (38, 2024-01-06)
Dockerfile (324, 2024-01-06)
docker-compose.yml (189, 2024-01-06)
img/ (0, 2024-01-06)
img/AntiSMI structure small.png (129524, 2024-01-06)
img/AntiSMI structure.png (218196, 2024-01-06)
img/Parser stats.png (127946, 2024-01-06)
img/antismi_common.png (446603, 2024-01-06)
imports/ (0, 2024-01-06)
imports/__init__.py (0, 2024-01-06)
imports/imports.py (1363, 2024-01-06)
imports/phrase_dicts.py (2398, 2024-01-06)
main.py (2486, 2024-01-06)
pyproject.toml (288, 2024-01-06)
requirements.txt (1105, 2024-01-06)
scripts/ (0, 2024-01-06)
scripts/__init__.py (0, 2024-01-06)
scripts/cook.py (6674, 2024-01-06)
scripts/db.py (10019, 2024-01-06)
scripts/fix.py (2062, 2024-01-06)
scripts/shop.py (11555, 2024-01-06)
scripts/taste.py (5353, 2024-01-06)
# antiSMI-Collector
![Parser stats](https://github.com/maxlethal/antiSMI-Collector/blob/master/img/Parser%20stats.png?raw=true)
## Table of contents
* [About](#about)
* [Stats](#stats)
* [Stack](#stack)
* [ML models](#ml-models)
* [Development Tools](#development-tools)
* [Code structure](#code-structure)
## About
The Collector is one of three parts of the [AntiSMI Project](https://github.com/maxlethal/antiSMI-Project).
It is designed to continuously collect fresh news from various sources, then process and store it for use by the other parts of the Project:
* [Bot](https://github.com/maxlethal/antiSMI-Bot) - creates and sends a personalized smart news digest via the Telegram interface
* **Observer** - researches social trends, builds dashboards, and creates NLP models
During processing, trained machine learning models categorize each news item and generate its summary and title.
## Stats
* **Start:** 2022-07-01 [project suspended for 2 months in 2022]
* **Capacity:** 40 news agencies, 500 news items/day
* **Bot database capacity:** > 100,000 news articles [07.2022 - today]
* **Archive base capacity:** > 1.5 million articles [08.1999 - 04.2019]
## Stack
* **Language:** Python, SQL
* **Databases:** PostgreSQL, SQLAlchemy
* **Validation:** pydantic
* **Logging:** loguru
* **BI:** Apache Superset
* **Scraping:** requests, Beautiful Soup 4
## ML models:
- **Summarization:**
  - mBart, Seq2Seq, pre-trained [news summary]
  - ruT5, pre-trained [headline]
- **Categorization:**
  - fastText, supervised pre-training, 7 classes (categories)
**Clustering** problems are solved by [AntiSMI-Bot](https://github.com/maxlethal/antiSMI-Bot).
## Development Tools:
- PyCharm
- Docker
- GitHub
- Linux shell
## Code structure:
**main.py** serves as an interface for managing 3 modules from the _scripts_ directory:
- **shop.py**: parses news from news agencies (similar to news "shopping")
- **cook.py**: generates titles and summaries for news collected by the shop.py module (similar to news "cooking")
- **taste.py**: validates "prepared" news against the field types of the databases and records only valid data
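
The shop, cook, taste flow described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: the `News` record, its field names, and the functions `shop`, `cook`, `taste` and `run_pipeline` are hypothetical, the ML models are replaced by plain callables passed in by the caller, and a simple standard-library type check stands in for the pydantic validation that the project uses.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class News:
    """Hypothetical record shape; the real field names are not given in this README."""
    url: str
    date: datetime
    text: str
    title: str = ""
    summary: str = ""
    category: str = ""


def shop(raw_items):
    """'Shopping': wrap raw scraped dicts into News records (stand-in for shop.py)."""
    return [News(url=i["url"], date=i["date"], text=i["text"]) for i in raw_items]


def cook(news, summarize, make_title, categorize):
    """'Cooking': fill in summary, title and category using injected callables."""
    for item in news:
        item.summary = summarize(item.text)
        item.title = make_title(item.text)
        item.category = categorize(item.text)
    return news


def taste(news):
    """'Tasting': keep only records whose fields have the expected types and are non-empty."""
    valid = []
    for item in news:
        strings_ok = all(
            isinstance(value, str) and value.strip()
            for value in (item.url, item.text, item.title, item.summary, item.category)
        )
        if strings_ok and isinstance(item.date, datetime):
            valid.append(item)
    return valid


def run_pipeline(raw_items, summarize, make_title, categorize):
    """Chain the three stages in the order the README describes."""
    return taste(cook(shop(raw_items), summarize, make_title, categorize))


# Usage with placeholder "models" (plain lambdas, not real ML models).
raw = [
    {"url": "https://example.com/a", "date": datetime(2024, 1, 6),
     "text": "First story. More detail follows."},
    {"url": "https://example.com/b", "date": "2024-01-06",  # wrong type: str, not datetime
     "text": "Second story."},
]
fresh = run_pipeline(
    raw,
    summarize=lambda t: t.split(".")[0] + ".",
    make_title=lambda t: t.split(".")[0],
    categorize=lambda t: "uncategorized",
)
print([n.url for n in fresh])  # ['https://example.com/a']
```

Passing the models in as callables keeps the stages independent, so each one can be tested without loading the summarization or categorization models.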
There are two additional modules in the scripts folder:
- **db.py**: mandatory auxiliary module, contains a set of variables, classes and functions for working with databases
- **fix.py**: optional auxiliary module, fixes problems with article summaries in the Bot's "past machine" module