covidnews-ml-pt

Category: Artificial Intelligence / Neural Networks / Deep Learning
Development tool: R
File size: 44622 KB
Downloads: 0
Upload date: 2020-06-16 12:38:47
Uploader: sh-1993
Description: covidnews-ml-pt, a news aggregator for covid19 news in Portuguese newspaper articles using GDELT and machine learning

File list:
.Rproj.user (0, 2020-06-16)
.Rproj.user\EF0FA498 (0, 2020-06-16)
.Rproj.user\EF0FA498\sources (0, 2020-06-16)
.Rproj.user\EF0FA498\sources\prop (0, 2020-06-16)
.Rproj.user\EF0FA498\sources\prop\4427322D (59, 2020-06-16)
.Rproj.user\EF0FA498\sources\prop\INDEX (2704, 2020-06-16)
.Rproj.user\shared (0, 2020-06-16)
.Rproj.user\shared\notebooks (0, 2020-06-16)
.Rproj.user\shared\notebooks\patch-chunk-names (0, 2020-06-16)
api (0, 2020-06-16)
api\deploy_api.R (426, 2020-06-16)
api\entrypoint.R (72, 2020-06-16)
api\plumber.R (358, 2020-06-16)
covidnews-ml-pt.Rproj (205, 2020-06-16)
covidnews.log (88188, 2020-06-16)
covidnews.sh (593, 2020-06-16)
covidtrain.log (5083, 2020-06-16)
covidtrain.sh (279, 2020-06-16)
cronjobs.txt (441, 2020-06-16)
data (0, 2020-06-16)
data\covidpred_pt_2020-04-05-11-44.csv (157787, 2020-06-16)
data\covidpred_pt_2020-04-05-12-47.csv (164197, 2020-06-16)
data\covidpred_pt_2020-04-05-13-43.csv (104242, 2020-06-16)
data\covidpred_pt_2020-04-05-14-46.csv (224556, 2020-06-16)
data\covidpred_pt_2020-04-05-15-44.csv (136684, 2020-06-16)
data\covidpred_pt_2020-04-05-16-49.csv (230785, 2020-06-16)
data\covidpred_pt_2020-04-05-17-45.csv (101811, 2020-06-16)
data\covidpred_pt_2020-04-05-18-49.csv (323796, 2020-06-16)
data\covidpred_pt_2020-04-05-19-44.csv (146030, 2020-06-16)
data\covidpred_pt_2020-04-05-20-13.csv (161846, 2020-06-16)
data\covidpred_pt_2020-04-05-20-18.csv (161846, 2020-06-16)
data\covidpred_pt_2020-04-05-21-37.csv (91868, 2020-06-16)
data\covidpred_pt_2020-04-05-21-38.csv (91868, 2020-06-16)
data\covidpred_pt_2020-04-05-21-52.csv (126181, 2020-06-16)
data\covidpred_pt_2020-04-05-21-53.csv (126181, 2020-06-16)
data\covidpred_pt_2020-04-05-21-54.csv (126181, 2020-06-16)
... ...

Covid-19 news aggregator for Portuguese news
================

## Description

This repo contains a scraper pipeline which:

1.  Pulls all news from available Portuguese domains for a given time range using [GDELT’s API](https://www.gdeltproject.org/) through [gdeltr2’s wrapper functions](https://github.com/abresler/gdeltr2). Data is collected and updated every 2 hours. For more details, see `scrape-parse-classify/1_gdelt_pull.R`.
2.  Parses each news article with Python using [news-please](https://github.com/fhamborg/news-please) or, if that fails, [newspaper3k](https://newspaper.readthedocs.io/en/latest/); see `scrape-parse-classify/2_news_parse.py`.
3.  Predicts whether or not a news article is about covid-19 using a trained random forest model (more details below); see `scrape-parse-classify/3_classify&push.R`.
4.  Automatically pushes the data to `data/... .csv` in this repository as an individual csv file; see `scrape-parse-classify/3_classify&push.R`.

To run the entire pipeline, use `scrape-parse-classify/4_get_covid19_news.R`.

## Details on the model

After some experimentation I settled on a random forest model using [ranger](https://cran.r-project.org/web/packages/ranger/index.html).

  - **Corpus**: more than 6000 news articles from Portuguese outlets were collected using Factiva. An article was considered to be about coronavirus if it carried the Factiva label **Novel Coronaviruses**. The labeled data is in `train/data/labeled_data`. To collect the corpus, empty queries (“e”) for Newspapers in Europe in Portuguese were used.
  - **Sampling**: I queried Factiva for all news on a random date between 01/01/2018 and 01/04/2020 and labeled the data based on the presence of the coronavirus Factiva label ('Novel Coronaviruses') and its date.
  - **Features**: unigrams to 5-grams of tokenized, stemmed words with stop words removed, represented as a tf-idf vector. Only words which appeared in at least 20 documents were kept. Furthermore, all named entities present in the document were extracted using [dbpedia’s spotlight api](https://www.dbpedia-spotlight.org/api) and aggregated by dbpedia macro-category, and the counts for each category were added as features.
  - **10-fold cross-validation repeated 3 times** for hyper-parameter tuning
  - Latest **model specification**

| sample\_size | train\_prop\_covid | mtry | n\_tree | min\_node\_size | splitrule | model\_type     |
| -----------: | -----------------: | ---: | ------: | --------------: | :-------- | :------------- |
|         5479 |          0.4347048 |  204 |     500 |              20 | gini      | classification |

  - Latest **model metrics**

| model\_accuracy | model\_kappa | model\_f1 | model\_precision | model\_recall |
| --------------: | -----------: | --------: | ---------------: | ------------: |
|      0.9***6357 |     0.928081 | 0.9686674 |        0.9704992 |     0.9668425 |

## Output example

``` r
dplyr::glimpse(read.csv(list.files("data", full.names = TRUE)[1]))
```

    ## Observations: 24
    ## Variables: 38
    ## $ prediction_covid_topic 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
    ## $ title                  "Testes \"de prevencao\" em lares arrancaram e...
    ## $ description            "Os testes de \"prevencao\" do covid-19 arranc...
    ## $ authors                Ceu Neves, Global Media Group, Global Media Gr...
    ## $ date_download          2020-04-05 13:40:19, 2020-04-05 13:40:20, 2020...
    ## $ date_publish           2020-04-05 11:15:00, 2020-04-05 11:33:00, 2020...
    ## $ maintext               "Os testes ao covid-19 em lares de idosos, lan...
    ## $ url                    https://www.dn.pt/pais/testes-de-prevencao-em-...
    ## $ gdelt_title            "Testes ′′ de preveno ′′ em lares arrancaram em ...
    ## $ gdelt_url              https://www.dn.pt/pais/testes-de-prevencao-em-...
    ## $ modeSearch             artlist, artlist, artlist, artlist, artlist, a...
    ## $ sourcecountrySearch    PO, PO, PO, PO, PO, PO, PO, PO, PO, PO, PO, PO...
    ## $ termSearch             NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
    ## $ periodtimeSearch       100 minutes, 100 minutes, 100 minutes, 100 min...
    ## $ isOR                   FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS...
    ## $ countMaximumRecords    250, 250, 250, 250, 250, 250, 250, 250, 250, 2...
    ## $ urlGDELTV2FTAPI        https://api.gdeltproject.org/api/v2/doc/doc?qu...
    ## $ urlArticleMobile       https://www.dn.pt/pais/amp/testes-de-prevencao...
    ## $ datetimeArticle        2020-04-05T09:15:00Z, 2020-04-05T09:15:00Z, 20...
    ## $ urlImage               "https://static.globalnoticias.pt/dn/image.asp...
    ## $ domainArticle          dn.pt, tsf.pt, tsf.pt, observador.pt, observad...
    ## $ languageArticle        Portuguese, Portuguese, Portuguese, Portuguese...
    ## $ countryArticle         Portugal, Portugal, Portugal, Portugal, Portug...
    ## $ pred_input             "Os testes ao covid-19 em lares de idosos, lan...
    ## $ model_accuracy         0.9***6357, 0.9***6357, 0.9***6357, 0.9***6357, 0....
    ## $ model_kappa            0.928081, 0.928081, 0.928081, 0.928081, 0.9280...
    ## $ model_f1               0.9686674, 0.9686674, 0.9686674, 0.9686674, 0....
    ## $ model_precision        0.9704992, 0.9704992, 0.9704992, 0.9704992, 0....
    ## $ model_recall           0.9668425, 0.9668425, 0.9668425, 0.9668425, 0....
    ## $ model                  "Random Forests\n('ranger' package, R 3.5.3.)\...
    ## $ sample_size            5479, 5479, 5479, 5479, 5479, 5479, 5479, 5479...
    ## $ train_prop_covid       0.4347048, 0.4347048, 0.4347048, 0.4347048, 0....
    ## $ mtry                   204, 204, 204, 204, 204, 204, 204, 204, 204, 2...
    ## $ n_tree                 500, 500, 500, 500, 500, 500, 500, 500, 500, 5...
    ## $ min_node_size          20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20...
    ## $ splitrule              gini, gini, gini, gini, gini, gini, gini, gini...
    ## $ model_type             classification, classification, classification...
    ## $ fitted_on              2020-04-05, 2020-04-05, 2020-04-05, 2020-04-05...
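
For readers who consume the output files outside R, the following is a minimal Python sketch that keeps only the rows predicted to be about covid-19. The column names (`prediction_covid_topic`, `title`, `url`, `domainArticle`) come from the output listing; the helper name `covid_rows`, the sample rows, and the reading of `1` as "about covid-19" are assumptions made for illustration.

```python
import csv
import io


def covid_rows(csv_file):
    """Yield rows whose prediction_covid_topic equals 1 from an open CSV file."""
    reader = csv.DictReader(csv_file)
    for row in reader:
        # Assumption: 1 marks an article the model predicted to be about covid-19.
        if row.get("prediction_covid_topic", "").strip() == "1":
            yield row


# Tiny in-memory stand-in for a data/covidpred_pt_*.csv file (made-up rows).
sample = io.StringIO(
    "prediction_covid_topic,title,url,domainArticle\n"
    "1,Example covid headline,https://example.org/a,example.org\n"
    "0,Example sports headline,https://example.org/b,example.org\n"
)

selected = list(covid_rows(sample))
for row in selected:
    print(row["domainArticle"], row["title"], row["url"])
```

To read a real file, pass a handle opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory sample. The UTF-8 encoding is an assumption, since the listing does not state how the csv files are encoded.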

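Step 2 of the pipeline tries news-please first and falls back to newspaper3k when that fails. The sketch below illustrates this try-in-order pattern; it is not the contents of `scrape-parse-classify/2_news_parse.py`. The function names and the returned dict layout are hypothetical. The library calls are the standard entry points of each package (`NewsPlease.from_url`, and `Article.download` followed by `Article.parse`), imported lazily so the block loads even when neither package is installed.

```python
def parse_with_newsplease(url):
    """Parse an article with news-please (imported lazily)."""
    from newsplease import NewsPlease

    article = NewsPlease.from_url(url)
    return {"title": article.title, "maintext": article.maintext}


def parse_with_newspaper(url):
    """Parse an article with newspaper3k (imported lazily)."""
    from newspaper import Article

    article = Article(url)
    article.download()
    article.parse()
    return {"title": article.title, "maintext": article.text}


def parse_with_fallback(url, parsers=(parse_with_newsplease, parse_with_newspaper)):
    """Try each parser in order; return the first result with non-empty text."""
    for parser in parsers:
        try:
            result = parser(url)
        except Exception:
            # Deliberately broad: any download, parse, or import failure
            # simply moves on to the next parser.
            continue
        if result and result.get("maintext"):
            return result
    return None  # every parser failed or returned no text


# Usage (performs network requests, so it is left commented out):
# parsed = parse_with_fallback("https://www.dn.pt/...")
```

Treating an empty `maintext` as a failure is a design choice made for this sketch: a parser can return without raising and still extract nothing usable, and in that case the next parser gets a chance.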