Global-News-Analysis-Using-Cloud-Tools
Category: Cloud computing
Development tool: Jupyter Notebook
File size: 11183KB
Downloads: 0
Upload date: 2020-06-03 05:14:00
Uploader: sh-1993
Description: Ensemble classification of global news data from gdeltproject.org, gathered using Google Cloud BigQuery
File list:
20200101trim.csv (4040877, 2020-05-16)
20200201trim.csv (4030912, 2020-05-16)
20200301trim.csv (4068607, 2020-05-16)
20200401trim.csv (4014692, 2020-05-16)
20200505trim.csv (4021224, 2020-05-16)
May 5 data EDA.ipynb (275341, 2020-05-16)
code (0, 2020-05-16)
code\00-bigquery.ipynb (12761, 2020-05-16)
code\01-cleaning-part1.ipynb (23546, 2020-05-16)
code\02-eda-cleaning-part2.ipynb (2737875, 2020-05-16)
code\03-pca-and-modeling.ipynb (64292, 2020-05-16)
code\04-histogram-and-initial-time-series-plotly.ipynb (7904357, 2020-05-16)
data (0, 2020-05-16)
data\cvec_master.csv (3209769, 2020-05-16)
data\master_clean.csv (3304496, 2020-05-16)
data\raw (0, 2020-05-16)
data\raw\april_raw.csv (1238095, 2020-05-16)
data\raw\feb_raw.csv (1228901, 2020-05-16)
data\raw\jan_raw.csv (1038621, 2020-05-16)
data\raw\march_raw.csv (976975, 2020-05-16)
data\raw\master_raw.csv (5107322, 2020-05-16)
data\raw\may_raw.csv (624862, 2020-05-16)
data\trimmed-data (0, 2020-05-16)
data\trimmed-data\trimmed_data.txt (1, 2020-05-16)
slidedeck (0, 2020-05-16)
slidedeck\00-Project 5 CS NB - final pt1.pptx (25891821, 2020-05-16)
slidedeck\01-Project 5 CS NB - final pt2.pptx (9784752, 2020-05-16)
slidedeck\ppt.txt (1, 2020-05-16)
# GA-DSI-Project-5 Executive Summary
### Nate Bukowski and Colin Simon
# Contents:
- [Problem Statement](https://github.com/colinsimon/Global-News-Analysis-Using-Cloud-Tools/blob/master/#Problem-Statement)
- [Data Summary](https://github.com/colinsimon/Global-News-Analysis-Using-Cloud-Tools/blob/master/#Data-Summary)
- [Mapping](https://github.com/colinsimon/Global-News-Analysis-Using-Cloud-Tools/blob/master/#Mapping)
- [Models and Techniques](https://github.com/colinsimon/Global-News-Analysis-Using-Cloud-Tools/blob/master/#Models-and-Techniques)
- [Conclusions, Limitations and Recommendations](https://github.com/colinsimon/Global-News-Analysis-Using-Cloud-Tools/blob/master/#Conclusions,-Limitations-and-Recommendations)
# Problem Statement
Use data science to identify commodity supply events in global news data.
# Data Summary
**Data Source:**
- The data for this project was pulled using Google Cloud Platform's BigQuery.
- The underlying source is the GDELT Project (gdeltproject.org).
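As a rough illustration of how one day of GDELT records might be pulled through BigQuery, here is a minimal sketch. It is not the query from `code/00-bigquery.ipynb`: the table name `gdelt-bq.gdeltv2.gkg`, the selected columns, the integer `YYYYMMDDHHMMSS` date format and the helper names `day_bounds`, `build_query` and `fetch_day` are all assumptions to check against GDELT's public BigQuery dataset. Running `fetch_day` also assumes the `google-cloud-bigquery` package and valid GCP credentials.

```python
from datetime import date

# Assumed table: GDELT's public Global Knowledge Graph on BigQuery.
GKG_TABLE = "gdelt-bq.gdeltv2.gkg"


def day_bounds(day: date) -> tuple[int, int]:
    """Return the inclusive YYYYMMDDHHMMSS integer range covering one day."""
    prefix = day.year * 10000 + day.month * 100 + day.day
    return prefix * 1_000_000, prefix * 1_000_000 + 235959


def build_query(limit: int = 1000) -> str:
    """Build a parameterised query for one day of records (assumed columns)."""
    return (
        "SELECT DATE, DocumentIdentifier, V2Themes, V2Locations, V2Tone\n"
        f"FROM `{GKG_TABLE}`\n"
        "WHERE DATE BETWEEN @start_ts AND @end_ts\n"
        f"LIMIT {int(limit)}"
    )


def fetch_day(day: date, limit: int = 1000):
    """Run the query and return a DataFrame; needs credentials to execute."""
    # Imported lazily so the pure-Python helpers above work offline.
    from google.cloud import bigquery

    start, end = day_bounds(day)
    config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("start_ts", "INT64", start),
            bigquery.ScalarQueryParameter("end_ts", "INT64", end),
        ]
    )
    job = bigquery.Client().query(build_query(limit), job_config=config)
    return job.to_dataframe()


print(build_query(limit=500))
```

Because BigQuery bills by bytes scanned, a row limit alone may not reduce cost; restricting the scanned date range is usually the more important lever.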
**Datasets Analyzed:**
- [master_clean.csv](https://github.com/colinsimon/Global-News-Analysis-Using-Cloud-Tools/blob/master/./data/master_clean.csv)
  - 1,100 rows, 11 columns
# Mapping
- For this project, mapping serves three purposes:
1. EDA: To understand the data
2. To present the data to any audience
3. Analytics: as a data science technique in its own right. Mapping is not predictive, but given the nature of the dataset it is crucial for locating events.
# Models and Techniques
- The goal of our modeling was to use K-Means Clustering, PCA and CountVectorizer to find the classification model that best predicts the month of the year in which an article was written. For each of the following models, a pipeline was run through a GridSearch over various parameters:
- K-Nearest Neighbors
- Random Forest
- Logistic Regression
- Support Vector Machine
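The grid search over the four models can be sketched roughly as follows. This is an illustrative sketch, not the code from `code/03-pca-and-modeling.ipynb`: the toy documents, the month labels, the parameter grid and the helper names `to_dense` and `run_searches` are all invented for the example, and the K-Means step is left out. It assumes scikit-learn is installed.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC

# Toy stand-in data: invented theme strings and labels, NOT project data.
docs = [
    "oil supply cut", "oil pipeline outage", "crude oil export ban",
    "oil refinery strike", "oil output falls", "oil tanker delay",
    "wheat harvest drought", "wheat export quota", "grain wheat shortage",
    "wheat crop flood", "wheat price surge", "wheat silo fire",
]
months = ["jan"] * 6 + ["feb"] * 6


def to_dense(matrix):
    """PCA needs a dense array; CountVectorizer returns a sparse matrix."""
    return matrix.toarray()


MODELS = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "rf": RandomForestClassifier(n_estimators=25, random_state=42),
    "logreg": LogisticRegression(max_iter=1000),
    "svm": SVC(),
}

# Hypothetical grid; the project's actual parameters are not documented here.
PARAM_GRID = {
    "cvec__ngram_range": [(1, 1), (1, 2)],
    "pca__n_components": [2, 3],
}


def run_searches(texts, labels, cv=2):
    """Grid-search the same vectorize -> PCA -> classify pipeline per model."""
    results = {}
    for name, model in MODELS.items():
        pipe = Pipeline([
            ("cvec", CountVectorizer()),
            ("dense", FunctionTransformer(to_dense, accept_sparse=True)),
            ("pca", PCA(random_state=42)),
            ("clf", model),
        ])
        search = GridSearchCV(pipe, PARAM_GRID, cv=cv)
        search.fit(texts, labels)
        results[name] = (search.best_params_, search.best_score_)
    return results


results = run_searches(docs, months)
for name, (params, score) in results.items():
    print(f"{name}: cv accuracy={score:.2f}, best params={params}")
```

Putting the vectorizer and PCA inside the pipeline means each cross-validation fold fits them on its own training split only, which avoids leaking information from the held-out split.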
Below are the accuracy scores for the best-performing models:
- **K-Nearest Neighbors:**
  - Train accuracy: 58.9%
  - Test accuracy: 37.5%
- **Random Forest:**
  - Train accuracy: 100%
  - Test accuracy: 44.6%
- **Logistic Regression:**
  - Train accuracy: 69.6%
  - Test accuracy: 36.9%
- **Support Vector Machine:**
  - Train accuracy: 99.3%
  - Test accuracy: 42.4%
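To put these scores in context, the short sketch below computes the train/test gap for each model (a rough indicator of overfitting) and the lift of test accuracy over a chance baseline. The baseline is an assumption: it treats the problem as five equally sized month classes (January to May, matching the raw data files), which the source does not confirm. The helper name `summarize` is hypothetical.

```python
# Accuracy scores (percent) as reported above: (train, test).
scores = {
    "K-Nearest Neighbors": (58.9, 37.5),
    "Random Forest": (100.0, 44.6),
    "Logistic Regression": (69.6, 36.9),
    "Support Vector Machine": (99.3, 42.4),
}

# Assumption: five month classes of roughly equal size.
N_CLASSES = 5
chance = 100.0 / N_CLASSES


def summarize(scores, chance):
    """Return (model, train-test gap, test lift over chance), best lift first."""
    rows = []
    for name, (train, test) in scores.items():
        rows.append((name, round(train - test, 1), round(test - chance, 1)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


for name, gap, lift in summarize(scores, chance):
    print(f"{name:<24} gap={gap:5.1f}  lift over chance={lift:5.1f}")
```

Under that assumption every model beats chance, but the gaps of more than 50 points for Random Forest and Support Vector Machine suggest that both largely memorize the training set.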
# Conclusions, Limitations and Recommendations:
*Conclusions:*
- This is a tremendously powerful dataset that updates many times a day.
- Event hotspots can be located.
- The month in which an article was written can be predicted from its location, tone and theme, though only with modest accuracy (best test accuracy: 44.6%).
*Limitations:*
- We only used English-language articles.
- Our modeling dataset was small (1,100 data points).
- Multiple levels of machine learning were used.
- Cloud computing is expensive.
- The dataset could not be used for time series modeling.
*Recommendations:*
- Research the Google and GDELT algorithms and NLP classifications.
- A nonprofit or significant sponsorship angle is needed.
- Further cleaning and feature engineering to allow for time series modeling and to improve the clustering model.
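One possible first step toward the time series recommendation is to turn article timestamps into an evenly spaced daily count series. The sketch below assumes timestamps in GDELT's `YYYYMMDDHHMMSS` integer format; the sample values and the helper names `parse_gdelt_timestamp` and `daily_counts` are invented for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def parse_gdelt_timestamp(value):
    """Parse a YYYYMMDDHHMMSS value (int or str) into a datetime."""
    return datetime.strptime(str(value), "%Y%m%d%H%M%S")


def daily_counts(timestamps):
    """Count articles per calendar day, filling missing days with zero."""
    days = Counter(parse_gdelt_timestamp(t).date() for t in timestamps)
    if not days:
        return []
    current, last = min(days), max(days)
    series = []
    while current <= last:
        series.append((current, days.get(current, 0)))
        current += timedelta(days=1)
    return series


# Invented sample timestamps, not project data.
sample = [20200505120000, 20200505133000, 20200507090000]
for day, count in daily_counts(sample):
    print(day.isoformat(), count)
```

Filling the gaps with explicit zeros matters because most time series models expect a regular interval between observations.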