Global-News-Analysis-Using-Cloud-Tools

bigquery gcloud gdelt aws-storage googlecloud

Category: Cloud computing
Development tool: Jupyter Notebook
File size: 11183KB
Downloads: 0
Upload date: 2020-06-03 05:14:00
Uploader: sh-1993

Description: Ensemble classification of global news data from gdeltproject.org, gathered using Google Cloud BigQuery

File list:

20200101trim.csv (4040877, 2020-05-16)
20200201trim.csv (4030912, 2020-05-16)
20200301trim.csv (4068607, 2020-05-16)
20200401trim.csv (4014692, 2020-05-16)
20200505trim.csv (4021224, 2020-05-16)
May 5 data EDA.ipynb (275341, 2020-05-16)
code (0, 2020-05-16)
code\00-bigquery.ipynb (12761, 2020-05-16)
code\01-cleaning-part1.ipynb (23546, 2020-05-16)
code\02-eda-cleaning-part2.ipynb (2737875, 2020-05-16)
code\03-pca-and-modeling.ipynb (64292, 2020-05-16)
code\04-histogram-and-initial-time-series-plotly.ipynb (7904357, 2020-05-16)
data (0, 2020-05-16)
data\cvec_master.csv (3209769, 2020-05-16)
data\master_clean.csv (3304496, 2020-05-16)
data\raw (0, 2020-05-16)
data\raw\april_raw.csv (1238095, 2020-05-16)
data\raw\feb_raw.csv (1228901, 2020-05-16)
data\raw\jan_raw.csv (1038621, 2020-05-16)
data\raw\march_raw.csv (976975, 2020-05-16)
data\raw\master_raw.csv (5107322, 2020-05-16)
data\raw\may_raw.csv (624862, 2020-05-16)
data\trimmed-data (0, 2020-05-16)
data\trimmed-data\trimmed_data.txt (1, 2020-05-16)
slidedeck (0, 2020-05-16)
slidedeck\00-Project 5 CS NB - final pt1.pptx (25891821, 2020-05-16)
slidedeck\01-Project 5 CS NB - final pt2.pptx (9784752, 2020-05-16)
slidedeck\ppt.txt (1, 2020-05-16)

# GA-DSI-Project-5 Executive Summary

### Nate Bukowski and Colin Simon

# Contents:

- [Problem Statement](https://github.com/colinsimon/Global-News-Analysis-Using-Cloud-Tools/blob/master/#Problem-Statement)
- [Data Summary](https://github.com/colinsimon/Global-News-Analysis-Using-Cloud-Tools/blob/master/#Data-Summary)
- [Mapping](https://github.com/colinsimon/Global-News-Analysis-Using-Cloud-Tools/blob/master/#Mapping)
- [Models and Techniques](https://github.com/colinsimon/Global-News-Analysis-Using-Cloud-Tools/blob/master/#Models-and-Techniques)
- [Conclusions, Limitations and Recommendations](https://github.com/colinsimon/Global-News-Analysis-Using-Cloud-Tools/blob/master/#Conclusions,-Limitations-and-Recommendations)

# Problem Statement

Use data science to identify commodity supply events in global news data.

# Data Summary

**Data Source:**

- The data for this project was pulled using Google Cloud Platform's BigQuery.
- GDELT

**Datasets Analyzed:**

- [master_clean.csv](https://github.com/colinsimon/Global-News-Analysis-Using-Cloud-Tools/blob/master/./data/master_clean.csv)
  - 1,100 rows, 11 columns

# Mapping

- For this project, mapping serves three purposes:
  1. EDA: To understand the data
  2. To present the data to any audience
  3. As a means of analytics/data science. While not predictive, given the nature of the dataset, this is crucial.

# Models and Techniques

- The goal of our modeling was to combine K-Means Clustering, PCA and CountVectorizer with a classification model, and to find the model that performed best at classifying the month of the year in which an article was written.
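
As a rough illustration of how CountVectorizer, PCA and K-Means can be chained, here is a minimal sketch using scikit-learn. It is not the notebooks' actual code: the `themes` strings, the `build_features` helper and all parameter values are hypothetical stand-ins chosen for the example, not values taken from `master_clean.csv`.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for article theme text (assumption, not project data).
themes = [
    "oil supply pipeline disruption",
    "oil price export pipeline",
    "wheat harvest drought export",
    "wheat drought crop supply",
    "copper mine strike supply",
    "copper mine export strike",
]


def build_features(texts, n_components=2, n_clusters=3, seed=42):
    """Count tokens, reduce the counts with PCA, then cluster the reduced rows."""
    # Bag-of-words token counts, one row per document.
    counts = CountVectorizer().fit_transform(texts)
    # Convert to a dense array before PCA; acceptable at this toy size.
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(
        counts.toarray()
    )
    # Assign each document to one of n_clusters groups.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(
        reduced
    )
    return reduced, labels


reduced, labels = build_features(themes)
print(reduced.shape, labels)
```

The reduced components and the cluster label could then be passed as features to a downstream classifier; whether the project combined them this way is an assumption.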
- Each of the following models was placed in a pipeline and tuned with a GridSearch over various parameters:
  - K-Nearest Neighbors
  - Random Forest
  - Logistic Regression
  - Support Vector Machine

Below are the accuracy scores for the best performing models:

- **K-Nearest Neighbors:**
  - Train accuracy: 58.9%
  - Test accuracy: 37.5%
- **Random Forest:**
  - Train accuracy: 100%
  - Test accuracy: 44.6%
- **Logistic Regression:**
  - Train accuracy: 69.6%
  - Test accuracy: 36.9%
- **Support Vector Machine:**
  - Train accuracy: 99.3%
  - Test accuracy: 42.4%

# Conclusions, Limitations and Recommendations:

*Conclusions:*

- This is a tremendously powerful dataset that updates many times a day.
- Event hotspots can be located.
- Date can be predicted from the location, tone and theme of an article.

*Limitations:*

- We only used English articles.
- Our modeling dataset was small (1,100 data points).
- Multiple Machine Learning levels used.
- Cloud computing is expensive.
- Dataset could not be used for time series modeling.

*Recommendations:*

- Research Google, GDELT algorithms and NLP classifications.
- Nonprofit or significant sponsorship angle needed.
- Cleaning/engineering to allow for time-series modeling and improvement to clustering model.
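
The grid search described under Models and Techniques could look roughly like the sketch below. It runs on synthetic data from `make_classification`, with five classes standing in for the months; the parameter grids, the scaling step and the `candidates` and `results` names are assumptions made for illustration and are not taken from the project notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 5 classes play the role of months.
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=6, n_classes=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Hypothetical parameter grids, one per model family named in the summary.
candidates = {
    "knn": (KNeighborsClassifier(), {"clf__n_neighbors": [3, 7, 15]}),
    "rf": (RandomForestClassifier(random_state=42), {"clf__max_depth": [3, None]}),
    "logreg": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1.0]}),
    "svm": (SVC(), {"clf__C": [0.1, 1.0, 10.0]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # Scale features, then fit the classifier; grid keys target the "clf" step.
    pipe = Pipeline([("scale", StandardScaler()), ("clf", estimator)])
    search = GridSearchCV(pipe, grid, cv=3)
    search.fit(X_train, y_train)
    # score() uses the best estimator refit on the full training split.
    results[name] = {
        "best_params": search.best_params_,
        "train_acc": search.score(X_train, y_train),
        "test_acc": search.score(X_test, y_test),
    }

for name, r in results.items():
    gap = r["train_acc"] - r["test_acc"]
    print(f"{name}: train={r['train_acc']:.3f} test={r['test_acc']:.3f} gap={gap:.3f}")
```

Printing the train/test gap alongside each score makes overfitting easy to spot. A large gap, such as the Random Forest result reported above (100% train against 44.6% test), usually indicates that the model has memorized the training set.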
