datatank_news_papers_data_pipeline
Category: Docker
Development tool: Python
File size: 0KB
Downloads: 0
Upload date: 2023-08-21 23:09:37
Uploader: sh-1993
Description: Data pipeline for Datatank's news papers project.
File list:
build_push_images.sh (2407, 2023-08-21)
dags/ (0, 2023-08-21)
dags/scraping_pipeline.py (2178, 2023-08-21)
docker-compose.yaml (2224, 2023-08-21)
setup_project.sh (109, 2023-08-21)
src/ (0, 2023-08-21)
src/preprocessing/ (0, 2023-08-21)
src/preprocessing/Dockerfile (135, 2023-08-21)
src/preprocessing/main.py (2135, 2023-08-21)
src/preprocessing/requirements.txt (1210, 2023-08-21)
src/preprocessing/utils/ (0, 2023-08-21)
src/preprocessing/utils/__init__.py (0, 2023-08-21)
src/preprocessing/utils/embeddings.py (848, 2023-08-21)
src/preprocessing/utils/languages.py (404, 2023-08-21)
src/preprocessing/utils/polarity.py (637, 2023-08-21)
src/preprocessing/utils/source.py (153, 2023-08-21)
src/scrapers/ (0, 2023-08-21)
src/scrapers/Dockerfile (136, 2023-08-21)
src/scrapers/demorgen/ (0, 2023-08-21)
src/scrapers/demorgen/main.py (4371, 2023-08-21)
src/scrapers/demorgen/requirements.txt (700, 2023-08-21)
src/scrapers/dhnet/ (0, 2023-08-21)
src/scrapers/dhnet/main.py (2513, 2023-08-21)
src/scrapers/dhnet/requirements.txt (296, 2023-08-21)
src/scrapers/hln/ (0, 2023-08-21)
src/scrapers/hln/main.py (3164, 2023-08-21)
src/scrapers/hln/requirements.txt (296, 2023-08-21)
src/scrapers/knack/ (0, 2023-08-21)
src/scrapers/knack/main.py (3023, 2023-08-21)
src/scrapers/knack/requirements.txt (296, 2023-08-21)
src/scrapers/lalibre/ (0, 2023-08-21)
src/scrapers/lalibre/main.py (2393, 2023-08-21)
src/scrapers/lalibre/requirements.txt (680, 2023-08-21)
src/scrapers/lavenir/ (0, 2023-08-21)
src/scrapers/lavenir/main.py (2545, 2023-08-21)
src/scrapers/lavenir/requirements.txt (296, 2023-08-21)
src/scrapers/lecho/ (0, 2023-08-21)
src/scrapers/lecho/main.py (3017, 2023-08-21)
... ...
# Datatank News Papers Data Pipeline
This is a data pipeline that scrapes newspapers and their articles for the [Datatank](https://datatank.org/) project.
The goal is to build a database of newspapers and their articles that can be used for further analysis.
The scraped data is stored in a MongoDB database.
## Contributors
**This project was not done by me; it is the result of the hard work of the BeCode Bouman6 learners. All credit goes to them!**
## Technologies used
- [Python 3.11](https://www.python.org/)
- [Docker](https://www.docker.com/)
- [MongoDB](https://www.mongodb.com/)
- [Airflow](https://airflow.apache.org/)
## How to run
1. Build & push the Docker images to Docker Hub
```bash
# You need to modify the build_push_images.sh file to include your Docker Hub username
bash build_push_images.sh
```
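The script presumably loops over the scraper folders and builds and pushes one image per scraper. A dry-run sketch of that idea is below; the `DOCKER_USER` value, the image-naming scheme, and the echoed commands are illustrative assumptions, not the script's exact contents:

```bash
#!/usr/bin/env bash
# Dry-run sketch: prints the commands a build-and-push loop would run.
# Replace the echo calls with real `docker build` / `docker push` calls.
DOCKER_USER="your-dockerhub-username"   # placeholder: your Docker Hub username
SCRAPERS="demorgen dhnet hln knack lalibre lavenir lecho"

built_images=""
for name in $SCRAPERS; do
    image="$DOCKER_USER/scraper-$name:latest"   # hypothetical naming scheme
    echo "docker build -t $image -f src/scrapers/Dockerfile src/scrapers/$name"
    echo "docker push $image"
    built_images="$built_images $image"
done
```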
2. Initialise the Airflow database
```bash
# Create these folders first so they end up with the right permissions
mkdir -p ./dags ./logs ./plugins ./config
# Add your user's UID to the .env file
echo -e "AIRFLOW_UID=$(id -u)" > .env
echo -e "AIRFLOW_GID=0" >> .env
# Run the airflow init command
docker-compose up -d airflow-init
```
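After these commands, the generated `.env` file should contain something like the following (the UID value varies per user):

```
AIRFLOW_UID=1000
AIRFLOW_GID=0
```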
3. Run Airflow
```bash
docker-compose up -d
```
4. Set the MongoDB connection URI in the Airflow UI's Variables tab
```
The URI should be in the following format:
mongodb://<username>:<password>@<host>:<port>/
The Airflow variable should be named: mongodb_uri
```
5. Connect to the Airflow UI: [http://localhost:8080](http://localhost:8080). The default username and password are both `airflow`.
6. Trigger the `news_papers_data_pipeline` DAG
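Before saving the MongoDB URI as an Airflow variable, it can be sanity-checked with Python's standard library; the credentials below are placeholders:

```python
from urllib.parse import urlparse

# Placeholder credentials; substitute your own values.
uri = "mongodb://user:password@localhost:27017/"

parts = urlparse(uri)
assert parts.scheme == "mongodb", "URI must start with mongodb://"
assert parts.username and parts.password, "URI must embed credentials"
assert parts.hostname and parts.port, "URI must name a host and port"
print(f"host={parts.hostname} port={parts.port}")
```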
## How to add a new scraper
1. Create a new scraper in the `scrapers` folder with:
- A `main.py` file that contains the scraper's logic
- A `requirements.txt` file that contains the scraper's dependencies
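A minimal skeleton for such a scraper's `main.py` might look like the following; the function names and the sample article are hypothetical placeholders, not the repo's code. A real scraper would fetch and parse the newspaper's live pages and write the results to MongoDB:

```python
"""Hypothetical skeleton for src/scrapers/<name>/main.py."""


def scrape() -> list[dict]:
    """Fetch articles from the target site.

    Stubbed with a placeholder record here; a real scraper would
    download and parse pages using the libraries pinned in its
    requirements.txt file.
    """
    return [
        {
            "title": "Placeholder headline",
            "url": "https://example.com/article",
            "text": "Placeholder body text.",
        }
    ]


def store(articles: list[dict]) -> None:
    """Persist articles; the real pipeline would write to MongoDB
    using the connection URI held in the mongodb_uri variable."""
    for article in articles:
        print(f"storing: {article['title']}")


if __name__ == "__main__":
    store(scrape())
```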
2. Add the scraper to the `scraper_names` list in the `dags/scraping_pipeline.py` file
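Presumably the DAG turns each entry of `scraper_names` into a containerized task running the matching image. A small sketch of that naming convention follows; the exact list and image-name scheme in `dags/scraping_pipeline.py` may differ:

```python
# Hypothetical mirror of the list in dags/scraping_pipeline.py.
scraper_names = ["demorgen", "dhnet", "hln", "knack", "lalibre", "lavenir", "lecho"]


def image_for(name: str, docker_user: str = "your-dockerhub-username") -> str:
    """Image tag the DAG would hand to a Docker-based Airflow task
    (the naming scheme is an assumption, not taken from the repo)."""
    return f"{docker_user}/scraper-{name}:latest"


images = [image_for(n) for n in scraper_names]
```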
3. Run the Docker image build & push script
```bash
bash build_push_images.sh
```
4. Run the pipeline on Airflow