covidnews-ml-pt
Category: Artificial Intelligence / Neural Networks / Deep Learning
Development tool: R
File size: 44622KB
Downloads: 0
Upload date: 2020-06-16 12:38:47
Uploader: sh-1993
Description: covidnews-ml-pt, a news aggregator for covid19 news in Portuguese newspaper articles using GDELT and machine learning
File list:
.Rproj.user (0, 2020-06-16)
.Rproj.user\EF0FA498 (0, 2020-06-16)
.Rproj.user\EF0FA498\sources (0, 2020-06-16)
.Rproj.user\EF0FA498\sources\prop (0, 2020-06-16)
.Rproj.user\EF0FA498\sources\prop\4427322D (59, 2020-06-16)
.Rproj.user\EF0FA498\sources\prop\INDEX (2704, 2020-06-16)
.Rproj.user\shared (0, 2020-06-16)
.Rproj.user\shared\notebooks (0, 2020-06-16)
.Rproj.user\shared\notebooks\patch-chunk-names (0, 2020-06-16)
api (0, 2020-06-16)
api\deploy_api.R (426, 2020-06-16)
api\entrypoint.R (72, 2020-06-16)
api\plumber.R (358, 2020-06-16)
covidnews-ml-pt.Rproj (205, 2020-06-16)
covidnews.log (88188, 2020-06-16)
covidnews.sh (593, 2020-06-16)
covidtrain.log (5083, 2020-06-16)
covidtrain.sh (279, 2020-06-16)
cronjobs.txt (441, 2020-06-16)
data (0, 2020-06-16)
data\covidpred_pt_2020-04-05-11-44.csv (157787, 2020-06-16)
data\covidpred_pt_2020-04-05-12-47.csv (164197, 2020-06-16)
data\covidpred_pt_2020-04-05-13-43.csv (104242, 2020-06-16)
data\covidpred_pt_2020-04-05-14-46.csv (224556, 2020-06-16)
data\covidpred_pt_2020-04-05-15-44.csv (136684, 2020-06-16)
data\covidpred_pt_2020-04-05-16-49.csv (230785, 2020-06-16)
data\covidpred_pt_2020-04-05-17-45.csv (101811, 2020-06-16)
data\covidpred_pt_2020-04-05-18-49.csv (323796, 2020-06-16)
data\covidpred_pt_2020-04-05-19-44.csv (146030, 2020-06-16)
data\covidpred_pt_2020-04-05-20-13.csv (161846, 2020-06-16)
data\covidpred_pt_2020-04-05-20-18.csv (161846, 2020-06-16)
data\covidpred_pt_2020-04-05-21-37.csv (91868, 2020-06-16)
data\covidpred_pt_2020-04-05-21-38.csv (91868, 2020-06-16)
data\covidpred_pt_2020-04-05-21-52.csv (126181, 2020-06-16)
data\covidpred_pt_2020-04-05-21-53.csv (126181, 2020-06-16)
data\covidpred_pt_2020-04-05-21-54.csv (126181, 2020-06-16)
... ...
Covid-19 news aggregator for Portuguese news
================
## Description
This repo contains a scraper pipeline which:
1. Pulls all news from available Portuguese domains for a given time
   range from [GDELT’s API](https://www.gdeltproject.org/), using
   [gdeltr2’s wrapper functions](https://github.com/abresler/gdeltr2).
   Data is collected and updated every 2 hours. For more details,
   see `scrape-parse-classify/1_gdelt_pull.R`
2. Parses each news article with Python using
   [news-please](https://github.com/fhamborg/news-please) or, if
   that fails,
   [newspaper3k](https://newspaper.readthedocs.io/en/latest/), see
   `scrape-parse-classify/2_news_parse.py`
3. Predicts whether or not a news article is about covid-19 using a
   trained random forests model (more details below),
   see `scrape-parse-classify/3_classify&push.R`
4. Automatically pushes the data to `data/... .csv` in this repository
   as an individual csv file, see
   `scrape-parse-classify/3_classify&push.R`
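
As a rough illustration of step 1, the sketch below builds a GDELT article-list query URL in Python. It is not the repository's code, which uses gdeltr2 from R. The endpoint and the values `artlist`, `PO`, `100 minutes` and `250` are taken from the columns shown in the output example (`urlGDELTV2FTAPI`, `modeSearch`, `sourcecountrySearch`, `periodtimeSearch`, `countMaximumRecords`). The query parameter names (`query`, `mode`, `maxrecords`, `timespan`, `format`) and the `sourcecountry:` operator are assumptions about the GDELT DOC 2.0 API, and `build_gdelt_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the `urlGDELTV2FTAPI` column of the output.
GDELT_DOC_ENDPOINT = "https://api.gdeltproject.org/api/v2/doc/doc"


def build_gdelt_url(source_country="PO", minutes=100, max_records=250):
    """Build an article-list query URL (parameter names are assumptions)."""
    params = {
        "query": f"sourcecountry:{source_country}",  # restrict to one source country
        "mode": "artlist",                           # plain list of articles
        "maxrecords": max_records,                   # upper bound on returned articles
        "timespan": f"{minutes}min",                 # look-back window
        "format": "json",
    }
    return f"{GDELT_DOC_ENDPOINT}?{urlencode(params)}"


url = build_gdelt_url()
print(url)
```

The look-back window in this sketch mirrors the `100 minutes` value in the sample output; whether gdeltr2 encodes it the same way is not confirmed here.
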
To run the entire pipeline use
`scrape-parse-classify/4_get_covid19_news.R`.
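
The fallback in step 2 (try one parser, fall back to another when it fails) can be sketched generically. The example below is a minimal illustration and not the contents of `2_news_parse.py`: the parsers are passed in as plain callables, and `parse_with_fallback`, `primary_parser` and `secondary_parser` are hypothetical names standing in for wrappers around news-please and newspaper3k.

```python
def parse_with_fallback(url, parsers):
    """Try each (name, parser) pair in order; return the first usable result."""
    errors = []
    for name, parser in parsers:
        try:
            article = parser(url)
        except Exception as exc:  # a failing parser should not stop the pipeline
            errors.append((name, repr(exc)))
            continue
        # Treat a result without body text as a failure and move on.
        if article and article.get("maintext"):
            return {"parser": name, **article}
        errors.append((name, "empty result"))
    return {"parser": None, "maintext": None, "errors": errors}


# Stand-in parsers for illustration only (hypothetical, no network access).
def primary_parser(url):
    raise RuntimeError("primary parser could not extract the article")


def secondary_parser(url):
    return {"title": "Example title", "maintext": "Example body text."}


result = parse_with_fallback(
    "https://example.org/news/1",
    [("primary", primary_parser), ("secondary", secondary_parser)],
)
print(result["parser"])
```

Recording which parser produced each article makes it easier to audit extraction quality later, which is why the sketch keeps the parser name in the result.
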
## Details on the model
After some experimentation I settled on a random forests model using
[ranger](https://cran.r-project.org/web/packages/ranger/index.html).
- **Corpus**: more than 6000 news articles from Portuguese outlets
  were collected using Factiva. An article was considered to be about
  coronavirus if it carried the Factiva label **Novel
  Coronaviruses**. You can find the corpus in
  `train/data/labeled_data`. To collect it, empty queries (“e”) for
  Newspapers in Europe in Portuguese were used.
- **Sampling**: each Factiva query retrieved all news for a random
  date between 01/01/2018 and 01/04/2020. The data was then labeled
  based on the presence of the Factiva coronavirus label ('Novel
  Coronaviruses') and the article's date.
- **Features**: word n-grams from unigrams to 5-grams, with stop words
  removed and tokens stemmed, represented as a tf-idf vector. Only
  terms which appeared in at least 20 documents were kept.
  Furthermore, all named entities present in the document were
  extracted using
  [dbpedia’s spotlight api](https://www.dbpedia-spotlight.org/api),
  aggregated by dbpedia macro-category, and the counts for each
  category were added as features.
- **10-fold cross-validation repeated 3 times** for hyper-parameter tuning
- Latest **model specification**
| sample\_size | train\_prop\_covid | mtry | n\_tree | min\_node\_size | splitrule | model\_type |
| -----------: | -----------------: | ---: | ------: | --------------: | :-------- | :------------- |
| 5479 | 0.4347048 | 204 | 500 | 20 | gini | classification |
- Latest **model metrics**
| model\_accuracy | model\_kappa | model\_f1 | model\_precision | model\_recall |
| --------------: | -----------: | --------: | ---------------: | ------------: |
| 0.9***6357 | 0.928081 | 0.9686674 | 0.9704992 | 0.9668425 |
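
The text features described above (n-grams, a minimum document frequency, tf-idf weighting) can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the repository's R training code: stemming and the dbpedia entity counts are omitted, the tf-idf variant (relative term frequency times `log(N / df)`) is one of several common definitions and may differ from the one used in training, and `ngrams` and `tfidf_features` are hypothetical helper names.

```python
import math
from collections import Counter


def ngrams(tokens, max_n):
    """All contiguous n-grams of length 1..max_n, joined with underscores."""
    return [
        "_".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_features(docs, stop_words, max_n=5, min_doc_freq=20):
    """Return (vocabulary, rows); rows[i][term] is the tf-idf weight in doc i."""
    term_lists = []
    for doc in docs:
        tokens = [t for t in doc.lower().split() if t not in stop_words]
        term_lists.append(ngrams(tokens, max_n))

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    vocabulary = sorted(t for t, df in doc_freq.items() if df >= min_doc_freq)
    keep = set(vocabulary)

    n_docs = len(docs)
    rows = []
    for terms in term_lists:
        counts = Counter(t for t in terms if t in keep)
        total = len(terms) or 1  # term count before filtering; avoids division by zero
        rows.append({
            t: (c / total) * math.log(n_docs / doc_freq[t])
            for t, c in counts.items()
        })
    return vocabulary, rows


# Tiny made-up corpus; the thresholds are lowered so the filter is visible.
docs = [
    "novo caso de covid em lisboa",
    "novo caso de covid no porto",
    "benfica vence no porto",
]
vocab, rows = tfidf_features(
    docs, stop_words={"de", "em", "no"}, max_n=2, min_doc_freq=2
)
print(vocab)
```

With the settings from the text, the call would use `max_n=5` and `min_doc_freq=20` on the full corpus. Terms below the document-frequency threshold (here, for example, `lisboa`) never enter the vocabulary.
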
## Output example
``` r
dplyr::glimpse(read.csv(list.files("data", full.names = TRUE)[1]))
```
    ## Observations: 24
    ## Variables: 38
    ## $ prediction_covid_topic 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
    ## $ title "Testes \"de prevencao\" em lares arrancaram e...
    ## $ description "Os testes de \"prevencao\" do covid-19 arranc...
    ## $ authors Ceu Neves, Global Media Group, Global Media Gr...
    ## $ date_download 2020-04-05 13:40:19, 2020-04-05 13:40:20, 2020...
    ## $ date_publish 2020-04-05 11:15:00, 2020-04-05 11:33:00, 2020...
    ## $ maintext "Os testes ao covid-19 em lares de idosos, lan...
    ## $ url https://www.dn.pt/pais/testes-de-prevencao-em-...
    ## $ gdelt_title "Testes ′′ de preveno ′′ em lares arrancaram em ...
    ## $ gdelt_url https://www.dn.pt/pais/testes-de-prevencao-em-...
    ## $ modeSearch artlist, artlist, artlist, artlist, artlist, a...
    ## $ sourcecountrySearch PO, PO, PO, PO, PO, PO, PO, PO, PO, PO, PO, PO...
    ## $ termSearch NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
    ## $ periodtimeSearch 100 minutes, 100 minutes, 100 minutes, 100 min...
    ## $ isOR FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS...
    ## $ countMaximumRecords 250, 250, 250, 250, 250, 250, 250, 250, 250, 2...
    ## $ urlGDELTV2FTAPI https://api.gdeltproject.org/api/v2/doc/doc?qu...
    ## $ urlArticleMobile https://www.dn.pt/pais/amp/testes-de-prevencao...
    ## $ datetimeArticle 2020-04-05T09:15:00Z, 2020-04-05T09:15:00Z, 20...
    ## $ urlImage "https://static.globalnoticias.pt/dn/image.asp...
    ## $ domainArticle dn.pt, tsf.pt, tsf.pt, observador.pt, observad...
    ## $ languageArticle Portuguese, Portuguese, Portuguese, Portuguese...
    ## $ countryArticle Portugal, Portugal, Portugal, Portugal, Portug...
    ## $ pred_input "Os testes ao covid-19 em lares de idosos, lan...
    ## $ model_accuracy 0.9***6357, 0.9***6357, 0.9***6357, 0.9***6357, 0....
    ## $ model_kappa 0.928081, 0.928081, 0.928081, 0.928081, 0.9280...
    ## $ model_f1 0.9686674, 0.9686674, 0.9686674, 0.9686674, 0....
    ## $ model_precision 0.9704992, 0.9704992, 0.9704992, 0.9704992, 0....
    ## $ model_recall 0.9668425, 0.9668425, 0.9668425, 0.9668425, 0....
    ## $ model "Random Forests\n('ranger' package, R 3.5.3.)\...
    ## $ sample_size 5479, 5479, 5479, 5479, 5479, 5479, 5479, 5479...
    ## $ train_prop_covid 0.4347048, 0.4347048, 0.4347048, 0.4347048, 0....
    ## $ mtry 204, 204, 204, 204, 204, 204, 204, 204, 204, 2...
    ## $ n_tree 500, 500, 500, 500, 500, 500, 500, 500, 500, 5...
    ## $ min_node_size 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20...
    ## $ splitrule gini, gini, gini, gini, gini, gini, gini, gini...
    ## $ model_type classification, classification, classification...
    ## $ fitted_on 2020-04-05, 2020-04-05, 2020-04-05, 2020-04-05...
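
For readers consuming the csv files outside R, a minimal Python sketch of filtering by the classifier's prediction might look like this. The column names (`prediction_covid_topic`, `title`, `url`, `domainArticle`) come from the output above. The sample rows are made up, `covid_articles` is a hypothetical helper, and the sketch assumes the prediction is stored as the text `1` for covid-19 articles.

```python
import csv
import io
from collections import Counter


def covid_articles(csv_file):
    """Yield rows that the classifier labeled as covid-19 news."""
    for row in csv.DictReader(csv_file):
        # Assumption: the prediction column holds "1" for covid-19 articles.
        if row["prediction_covid_topic"] == "1":
            yield row


# Made-up sample with a subset of the real columns, so the sketch runs offline.
sample = io.StringIO(
    "prediction_covid_topic,title,url,domainArticle\n"
    "1,Example covid headline,https://example.org/a,example.org\n"
    "0,Example sports headline,https://example.org/b,example.org\n"
    "1,Another covid headline,https://example.net/c,example.net\n"
)
rows = list(covid_articles(sample))
per_domain = Counter(row["domainArticle"] for row in rows)
print(len(rows), dict(per_domain))
```

To run it on a real file, replace `sample` with an opened file from the `data` folder, for instance `open("data/covidpred_pt_2020-04-05-11-44.csv", newline="", encoding="utf-8")`; the encoding is an assumption.
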