NewsAPI_Wordcloud_Airflow

Category: OA office systems
Development tool: Jupyter Notebook
File size: 0KB
Downloads: 0
Upload date: 2023-09-07 17:11:57
Uploader: sh-1993
Description: NewsAPI Wordcloud Airflow

File list:
Dockerfile (87, 2023-09-07)
NewsAPI_Keyword_wordcloud.ipynb (469033, 2023-09-07)
data_flow_diagram.png (137735, 2023-09-07)
data_pipeline.py (13152, 2023-09-07)
datapipeline_flowchart.png (55198, 2023-09-07)
docker-compose.yaml (4391, 2023-09-07)
requirements.txt (1460, 2023-09-07)

# News Data Extraction and Visualization Project

This project is designed to extract, process, and visualize news data from various sources, providing valuable insights into daily news trends. By leveraging an ETL (Extract, Transform, Load) pipeline and interactive visualization tools, this project aims to streamline the process of accessing and comprehending news data.

## Table of Contents

- [Project Overview](#project-overview)
- [Getting Started](#getting-started)
- [Prerequisites](#prerequisites)
- [Project Structure](#project-structure)
- [Data Pipeline](#data-pipeline)
- [Visualization](#visualization)

## Project Overview

In today's information-rich environment, understanding and analyzing daily news data is challenging. This project addresses this challenge by:

- Establishing an automated ETL data pipeline for data extraction, cleaning, transformation, and loading.
- Ensuring data accuracy, optimal storage, and efficiency in managing incremental updates.
- Creating interactive visualizations to present insights into trending topics, sentiment analysis, and regional news coverage.
## Getting Started

### Prerequisites

To run this project, you will need:

- [Python](https://www.python.org/downloads/)
- [Apache Airflow](https://airflow.apache.org/docs/stable/start.html) for task automation
- [PostgreSQL](https://www.postgresql.org/download/) for data storage
- Required Python libraries (install using `pip install -r requirements.txt`)

## Project Structure

- `data_extraction/`: Contains code for data extraction from NewsAPI.
- `data_cleaning/`: Includes scripts for data cleaning and normalization.
- `database/`: Manages PostgreSQL database setup and schema design.
- `visualization/`: Houses code for generating interactive visualizations.
- `airflow/`: Defines Apache Airflow DAGs for task automation.

## Data Pipeline

This project follows a structured ETL (Extract, Transform, Load) process:

- Extract: News data is obtained from NewsAPI and passed on for cleaning and transformation.
- Transform: Data is cleaned, normalized, and structured into a star schema for efficient querying.
- Load: Cleaned data is loaded into a PostgreSQL database with primary and foreign key relations.

## Visualization

Interactive visualizations are generated to depict daily and overall word clouds. News source statistics are also presented, showing the number of articles per source.
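The following is a minimal sketch of how the load step and the two visualization inputs (word frequencies and per-source article counts) might fit together. It is not the project's actual code: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs without a database server, and the table names (`dim_source`, `fact_article`), the record fields, the stopword list, and the sample articles are all hypothetical.

```python
import re
import sqlite3
from collections import Counter

# Hypothetical sample records; the field names are assumptions,
# not the project's actual schema or real NewsAPI output.
ARTICLES = [
    {"source": "BBC News", "title": "Markets rally as inflation cools", "published": "2023-09-06"},
    {"source": "BBC News", "title": "Inflation data surprises markets", "published": "2023-09-07"},
    {"source": "Reuters", "title": "Storm closes airports", "published": "2023-09-07"},
]

# Tiny illustrative stopword list (assumption).
STOPWORDS = {"as", "the", "a", "of", "and", "in"}


def load(conn, articles):
    """Load articles into a small star schema: one dimension, one fact table."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS dim_source (
            source_id INTEGER PRIMARY KEY,
            name      TEXT UNIQUE NOT NULL
        );
        CREATE TABLE IF NOT EXISTS fact_article (
            article_id INTEGER PRIMARY KEY,
            source_id  INTEGER NOT NULL REFERENCES dim_source(source_id),
            title      TEXT NOT NULL,
            published  TEXT NOT NULL
        );
    """)
    for a in articles:
        # Insert the source once, then look up its key for the fact row.
        conn.execute("INSERT OR IGNORE INTO dim_source (name) VALUES (?)", (a["source"],))
        source_id = conn.execute(
            "SELECT source_id FROM dim_source WHERE name = ?", (a["source"],)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO fact_article (source_id, title, published) VALUES (?, ?, ?)",
            (source_id, a["title"], a["published"]),
        )
    conn.commit()


def articles_per_source(conn):
    """Return {source name: article count} by joining fact and dimension tables."""
    rows = conn.execute("""
        SELECT s.name, COUNT(*)
        FROM fact_article f
        JOIN dim_source s ON s.source_id = f.source_id
        GROUP BY s.name
    """)
    return dict(rows.fetchall())


def word_frequencies(conn, day=None):
    """Count title words overall, or for a single day if `day` is given."""
    if day is None:
        rows = conn.execute("SELECT title FROM fact_article")
    else:
        rows = conn.execute("SELECT title FROM fact_article WHERE published = ?", (day,))
    counts = Counter()
    for (title,) in rows:
        words = re.findall(r"[a-z']+", title.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts


conn = sqlite3.connect(":memory:")
load(conn, ARTICLES)
print(articles_per_source(conn))
print(word_frequencies(conn).most_common(2))
print(word_frequencies(conn, day="2023-09-07"))
```

If the `wordcloud` package is used for rendering, a frequency mapping like the one returned by `word_frequencies` can be passed to `WordCloud().generate_from_frequencies(...)`. Whether the project builds its word clouds this way is an assumption.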
