Scrapping_CH

Category: Natural language processing
Development tool: HTML
File size: 4186KB
Downloads: 0
Upload date: 2020-08-03 19:48:44
Uploader: sh-1993
Description: Scrapping_CH, scraping project for NLP analysis - Chilean Press Presidency

File list:
Markdown Pin_era.Rmd (11996, 2020-08-04)
Pi_era-Speeches.RData (1892021, 2020-08-04)
assets (0, 2020-08-04)
assets\Lists of urls.png (92196, 2020-08-04)
assets\Screenshot 2020-07-19 at 12.13.08.png (236956, 2020-08-04)
assets\Untitled-3c75d9f5.png (193199, 2020-08-04)
assets\Untitled-85967513.png (236956, 2020-08-04)
assets\captura.png (253439, 2020-08-04)
assets\graph1.png (62734, 2020-08-04)
assets\speech.png (293244, 2020-08-04)
plotly.html (3553110, 2020-08-04)
scrapping.R (2718, 2020-08-04)

# Scraping Speeches from the Press Presidential Website of Chile

**Date**: 20.07.2020
**Author**: Andres Ponce

### Table of contents

1. [Introduction](https://github.com/andrespnc/Scrapping_CH/blob/master/#introduction)
2. [Scraping Presidential Press Website](https://github.com/andrespnc/Scrapping_CH/blob/master/#paragraph1)
3. [Analysing text & structured topic modelling](https://github.com/andrespnc/Scrapping_CH/blob/master/#paragraph2)
4. [Final Thoughts](https://github.com/andrespnc/Scrapping_CH/blob/master/#paragraph3)

## Introduction

As a Public Policy graduate I see great challenges in **_open government policies and, in particular, access to public data_**. I started this project with the idea of using coding skills to gather and analyze public sources available to any citizen.

As of January 2020, the presidential press website of Chile [prensa.presidencia](https://prensa.presidencia.cl/discursos.aspx) contained a large collection of official speeches by President Piñera, covering the period from his election to January 2020. These releases reflect the president's communication strategy, even if they are subject to editorial control by presidential staff.

This project, run entirely in R, consists of two parts. The first is the scraping strategy, which gathers speeches from March 2018 to January 2020. The second is structural topic modelling with date as a covariate, to understand how topic proportions change over time.

## Scraping Presidential Press Website

The scraping process takes advantage of the URL structure of `https://prensa.presidencia.cl/discursos.aspx`, using the `Rcrawler` [^1] and `rvest` [^2] libraries. The speeches tab "discursos" contained 97 pages at the time of scraping. Each of these pages has at most 6 links to speeches, which gives 582 separate pages, each containing a single speech. Every speech page follows the same URL pattern ending in a numeric id, for example `https://prensa.presidencia.cl/discurso.aspx?id=135058`. This pattern is used to identify speech pages in the `Rcrawler` output.
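The filtering step can be illustrated with base R alone. This is a minimal sketch, not the code in `scrapping.R`: the `crawled_urls` vector is a made-up stand-in for whatever the crawler returns, and the regular expression is an assumption derived from the example URL above.

```r
# Hypothetical crawler output: a character vector of URLs found on the site.
crawled_urls <- c(
  "https://prensa.presidencia.cl/discursos.aspx",
  "https://prensa.presidencia.cl/discurso.aspx?id=135058",
  "https://prensa.presidencia.cl/discurso.aspx?id=135058",  # duplicate link
  "https://prensa.presidencia.cl/"
)

# Speech pages end in "discurso.aspx?id=<number>" (assumed from the example URL).
speech_pattern <- "^https://prensa\\.presidencia\\.cl/discurso\\.aspx\\?id=[0-9]+$"

# Keep only speech pages and drop repeated links.
speech_urls <- unique(crawled_urls[grepl(speech_pattern, crawled_urls)])

# Extract the numeric id, which is handy as a key for each speech.
speech_ids <- as.integer(sub(".*id=", "", speech_urls))
```

Anchoring the pattern with `^` and `$` keeps the index page `discursos.aspx` out of the result, since it differs from a speech page only by one letter.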

Scraping text is straightforward with `rvest`. I created a function that chains three steps: `read_html()`, `html_nodes()`, and `html_text()`. I used the same process to retrieve other useful information, such as the date and title of each speech.
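A function along those lines might look like the sketch below. The CSS selectors (`h1`, `.fecha`, `.texto p`) and the sample page are placeholders invented for illustration, since the real page structure is not described here; only the three `rvest` calls come from the text above.

```r
library(rvest)

# Selectors are hypothetical defaults, not the ones used in scrapping.R.
scrape_speech <- function(page,
                          title_css = "h1",
                          date_css  = ".fecha",
                          body_css  = ".texto p") {
  doc <- read_html(page)  # accepts a URL, a file path, or an HTML string
  # Select nodes by CSS selector and return their trimmed text.
  grab <- function(css) trimws(html_text(html_nodes(doc, css)))
  data.frame(
    title = grab(title_css)[1],
    date  = grab(date_css)[1],
    text  = paste(grab(body_css), collapse = "\n"),
    stringsAsFactors = FALSE
  )
}

# Offline example with a made-up page, so the sketch runs without network access.
sample_page <- '<html><body>
  <h1>Sample speech</h1>
  <span class="fecha">20 JUL 2020</span>
  <div class="texto"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>'
speech <- scrape_speech(sample_page)
```

Returning a one-row data frame per page makes it easy to combine all speeches afterwards, for example with `do.call(rbind, lapply(urls, scrape_speech))`.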

## Analysing text & structured topic modelling

The 582 scraped speeches are shown by monthly count below:

_Monthly count of presidential speeches since Piñera took office in March 2018_

|_Year_|_Jan_|_Feb_|_Mar_|_Apr_|_May_|_Jun_|_Jul_|_Aug_|_Sep_|_Oct_|_Nov_|_Dec_|_Total_|
|:-----|----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|------:|
| _2018_ | | | 18 | 30 | 27 | 30 | 27 | 39 | 27 | 50 | 25 | 29 | __302__ |
| _2019_ | 37 | 22 | 18 | 20 | 29 | 22 | 27 | 25 | 25 | 23 | 11 | 12 | __271__ |
| _2020_ | 9 | | | | | | | | | | | | __9__ |
|_**Total**_| __46__ | __22__ | __36__ | __50__ | __56__ | __52__ | __54__ | __64__ | __52__ | __73__ | __36__ | __41__ | __582__ |

By plotting the number of times the Chilean president appeared in public to give speeches, I found a substantial decrease from November 2019 onward. This coincides with the social uprising that Chile experienced during this period, which led the government to an all-time minimum rate of public support according to **_CADEM_** [^3].
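A year-by-month table of this kind can be produced with base R. The sketch below uses a handful of made-up dates rather than the scraped data, and assumes the speech dates have already been parsed into `Date` values.

```r
# Made-up speech dates for illustration only.
dates <- as.Date(c("2018-03-12", "2018-03-28", "2018-04-02",
                   "2019-11-05", "2020-01-09"))

# Cross-tabulate year against month; fixing the factor levels keeps
# months with no speeches in the table as zero counts.
monthly <- table(
  Year  = format(dates, "%Y"),
  Month = factor(format(dates, "%m"), levels = sprintf("%02d", 1:12))
)

# Add row and column totals (labelled "Sum" by default).
counts <- addmargins(monthly)
```

Computing the margins from the data instead of by hand avoids arithmetic slips in the totals row.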

In the second step I apply structural topic modelling with the `stm` package. The `stm` package allows the researcher to estimate a model using document covariates. In this case I used date to see how the proportion of topics varies across time (months). I chose `Plotly` _(hosted in Plotly Chart Studio: [Click here](https://chart-studio.plotly.com/~Andres1***6/1.embed?share_key=hkHUmY5lfL9zZc8nYvfVga))_ to visualize topic trends over time.

For instance, **_Security & Crime_** is a recurrent topic in the president's speeches. Its proportion spikes at the end of 2018 and again at the end of 2019, both periods in which the government faced police brutality scandals: first over the killing of an unarmed indigenous civilian, and then over police repression during the social upheaval.
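The modelling step could be sketched as follows. This is an illustration under stated assumptions rather than the project's actual code: `speeches` is assumed to be a data frame with `text` and `date` columns, `K = 10` is an arbitrary number of topics, and the month covariate enters linearly for simplicity (a smooth term such as `s()` from `stm` is a common alternative). The helper names `month_index` and `fit_speech_stm` are hypothetical.

```r
# Months elapsed since a reference month (0 for the reference month itself).
month_index <- function(dates, origin = as.Date("2018-03-01")) {
  d <- as.POSIXlt(dates)
  o <- as.POSIXlt(origin)
  (d$year - o$year) * 12L + (d$mon - o$mon)
}

# `speeches`: assumed data frame with columns `text` (character) and `date` (Date).
fit_speech_stm <- function(speeches, K = 10) {
  speeches$months_elapsed <- month_index(speeches$date)

  # Tokenise, remove Spanish stop words, stem, and align the metadata.
  processed <- stm::textProcessor(speeches$text, metadata = speeches,
                                  language = "spanish")
  prepped <- stm::prepDocuments(processed$documents, processed$vocab,
                                processed$meta)

  # Topic prevalence is allowed to vary with the month covariate.
  fit <- stm::stm(prepped$documents, prepped$vocab, K = K,
                  prevalence = ~ months_elapsed, data = prepped$meta)

  # Mean topic proportion per month, ready to be plotted as trends.
  trends <- aggregate(as.data.frame(fit$theta),
                      by = list(months_elapsed = prepped$meta$months_elapsed),
                      FUN = mean)
  list(fit = fit, trends = trends)
}
```

Calling `fit_speech_stm(speeches)` requires the `stm` package and its text-processing dependencies to be installed; the `trends` element holds one row per month and one column per topic, which maps directly onto a line chart.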

## Final Thoughts

This exercise had no other purpose than to train coding skills and apply empirical methods to text data and, more specifically, to data that should be available to all citizens. However, it is important to point out that by the end of this project, **the speeches from President Piñera were no longer available on the press website**. It is only possible to access speeches for the current month, with no option to access past speeches.

[^1]: https://github.com/salimk/Rcrawler
[^2]: https://github.com/tidyverse/rvest
[^3]: https://www.cadem.cl/
