datatank_news_papers_data_pipeline

Category: Docker
Development tool: Python
File size: 0KB
Downloads: 0
Upload date: 2023-08-21 23:09:37
Uploader: sh-1993
Description: Data pipeline for Datatank's news papers project.

File list:
build_push_images.sh (2407, 2023-08-21)
dags/ (0, 2023-08-21)
dags/scraping_pipeline.py (2178, 2023-08-21)
docker-compose.yaml (2224, 2023-08-21)
setup_project.sh (109, 2023-08-21)
src/ (0, 2023-08-21)
src/preprocessing/ (0, 2023-08-21)
src/preprocessing/Dockerfile (135, 2023-08-21)
src/preprocessing/main.py (2135, 2023-08-21)
src/preprocessing/requirements.txt (1210, 2023-08-21)
src/preprocessing/utils/ (0, 2023-08-21)
src/preprocessing/utils/__init__.py (0, 2023-08-21)
src/preprocessing/utils/embeddings.py (848, 2023-08-21)
src/preprocessing/utils/languages.py (404, 2023-08-21)
src/preprocessing/utils/polarity.py (637, 2023-08-21)
src/preprocessing/utils/source.py (153, 2023-08-21)
src/scrapers/ (0, 2023-08-21)
src/scrapers/Dockerfile (136, 2023-08-21)
src/scrapers/demorgen/ (0, 2023-08-21)
src/scrapers/demorgen/main.py (4371, 2023-08-21)
src/scrapers/demorgen/requirements.txt (700, 2023-08-21)
src/scrapers/dhnet/ (0, 2023-08-21)
src/scrapers/dhnet/main.py (2513, 2023-08-21)
src/scrapers/dhnet/requirements.txt (296, 2023-08-21)
src/scrapers/hln/ (0, 2023-08-21)
src/scrapers/hln/main.py (3164, 2023-08-21)
src/scrapers/hln/requirements.txt (296, 2023-08-21)
src/scrapers/knack/ (0, 2023-08-21)
src/scrapers/knack/main.py (3023, 2023-08-21)
src/scrapers/knack/requirements.txt (296, 2023-08-21)
src/scrapers/lalibre/ (0, 2023-08-21)
src/scrapers/lalibre/main.py (2393, 2023-08-21)
src/scrapers/lalibre/requirements.txt (680, 2023-08-21)
src/scrapers/lavenir/ (0, 2023-08-21)
src/scrapers/lavenir/main.py (2545, 2023-08-21)
src/scrapers/lavenir/requirements.txt (296, 2023-08-21)
src/scrapers/lecho/ (0, 2023-08-21)
src/scrapers/lecho/main.py (3017, 2023-08-21)
... ...

# Datatank News Papers Data Pipeline

This is a data pipeline that scrapes newspapers and their articles for the [Datatank](https://datatank.org/) project. The goal is to build a database of newspapers and their articles that can be used for further analysis. The data is stored in a MongoDB database.

## Contributors

**This project was not done by me. It is the result of the hard work of the BeCode Bouman6 learners! All credit goes to them.**

## Technologies used

- [Python 3.11](https://www.python.org/)
- [Docker](https://www.docker.com/)
- [MongoDB](https://www.mongodb.com/)
- [Airflow](https://airflow.apache.org/)

## How to run

1. Build & push the Docker images to Docker Hub

   ```bash
   # Modify the build_push_images.sh file first to include your Docker Hub username
   bash build_push_images.sh
   ```

2. Initialise the Airflow database

   ```bash
   # Create these folders up front so they end up with the right permissions
   mkdir -p ./dags ./logs ./plugins ./config

   # Add your user's UID to the .env file
   echo -e "AIRFLOW_UID=$(id -u)" > .env
   echo -e "AIRFLOW_GID=0" >> .env

   # Run the Airflow init command
   docker-compose up -d airflow-init
   ```

3. Run Airflow

   ```bash
   docker-compose up -d
   ```

4. Set the MONGO_URI variable in the Airflow UI's Variables tab

   ```
   The URI should be in the following format: mongodb://<username>:<password>@<host>:<port>/<database>
   The Airflow variable should be named: mongodb_uri
   ```

   (A small connection check for this URI is sketched at the end of this document.)

5. Connect to the Airflow UI: [http://localhost:8080](http://localhost:8080)

   The default username and password are both `airflow`.

6. Trigger the `news_papers_data_pipeline` DAG

## How to add a new scraper

1. Create a new scraper in the `scrapers` folder with:
   - A `main.py` file that contains the scraper's logic (an example skeleton is sketched below)
   - A `requirements.txt` file that contains the scraper's dependencies
2. Add the scraper to the `scraper_names` list in the `dags/scraping_pipeline.py` file (an illustrative DAG sketch also follows below)
3. Run the Docker image build & push script

   ```bash
   bash build_push_images.sh
   ```

4. Run the pipeline on Airflow
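
## Example scraper skeleton (illustrative)

The repository's scrapers are not reproduced here, so the following is only a minimal sketch of what a new scraper's `main.py` could look like. It assumes the container receives the MongoDB connection string through a `MONGO_URI` environment variable and uses `requests`, `beautifulsoup4` and `pymongo`; the source name, URL, CSS selector and database/collection names are hypothetical.

```python
"""Hypothetical skeleton for a new scraper's main.py (not the project's actual code)."""
import os

import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

# Assumption: the container receives the MongoDB URI via an environment variable.
MONGO_URI = os.environ["MONGO_URI"]
SOURCE = "example_paper"               # hypothetical source name
INDEX_URL = "https://www.example.com"  # hypothetical newspaper front page


def scrape_articles() -> list[dict]:
    """Fetch the front page and build one document per article link."""
    html = requests.get(INDEX_URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    for link in soup.select("a.article-link"):  # hypothetical CSS selector
        articles.append(
            {
                "source": SOURCE,
                "url": link["href"],
                "title": link.get_text(strip=True),
            }
        )
    return articles


def store_articles(articles: list[dict]) -> None:
    """Upsert the scraped documents into MongoDB, keyed on the article URL."""
    client = MongoClient(MONGO_URI)
    collection = client["news_papers"]["articles"]  # hypothetical db/collection names
    for article in articles:
        collection.update_one({"url": article["url"]}, {"$set": article}, upsert=True)


if __name__ == "__main__":
    store_articles(scrape_articles())
```

Whatever libraries the scraper imports (here `requests`, `beautifulsoup4`, `pymongo`) must also be listed in its `requirements.txt` so the Docker image can install them.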

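## Scraping DAG sketch (illustrative)

The real `dags/scraping_pipeline.py` is not shown above; the sketch below only illustrates one plausible shape for it, assuming one Docker image per scraper (pushed by `build_push_images.sh` under a placeholder Docker Hub username) and `DockerOperator` tasks that pass the `mongodb_uri` Airflow variable to each container. The schedule, image tags and the preprocessing wiring are assumptions, not the project's confirmed configuration.

```python
"""Illustrative sketch only; the project's real dags/scraping_pipeline.py may differ."""
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.providers.docker.operators.docker import DockerOperator

DOCKER_HUB_USER = "your-dockerhub-username"  # placeholder, matching build_push_images.sh
scraper_names = ["demorgen", "dhnet", "hln", "knack", "lalibre", "lavenir", "lecho"]

with DAG(
    dag_id="news_papers_data_pipeline",
    start_date=datetime(2023, 8, 1),
    schedule="@daily",  # assumed schedule
    catchup=False,
) as dag:
    # One containerised task per scraper; each one receives the MongoDB URI
    # from the `mongodb_uri` Airflow variable set in the UI.
    scraper_tasks = [
        DockerOperator(
            task_id=f"scrape_{name}",
            image=f"{DOCKER_HUB_USER}/{name}:latest",
            environment={"MONGO_URI": Variable.get("mongodb_uri")},
        )
        for name in scraper_names
    ]

    # Assumed: the preprocessing image (src/preprocessing) runs after all scrapers.
    preprocessing = DockerOperator(
        task_id="preprocessing",
        image=f"{DOCKER_HUB_USER}/preprocessing:latest",
        environment={"MONGO_URI": Variable.get("mongodb_uri")},
    )

    scraper_tasks >> preprocessing
```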
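
## Checking the MongoDB URI (optional)

Before pasting the connection string into the `mongodb_uri` Airflow variable, you can verify it locally. This snippet assumes `pymongo` is installed; the credentials, host, port and database name are placeholders to replace with your own.

```python
"""Optional sanity check for the value you plan to store as the mongodb_uri variable."""
import sys

from pymongo import MongoClient
from pymongo.errors import PyMongoError

# Placeholder URI; replace user, password, host, port and database with your own.
uri = "mongodb://user:password@localhost:27017/news_papers"

try:
    client = MongoClient(uri, serverSelectionTimeoutMS=5000)
    client.admin.command("ping")  # forces an actual round trip to the server
    print("Connection OK, databases:", client.list_database_names())
except PyMongoError as exc:
    sys.exit(f"Connection failed: {exc}")
```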