## Scraping the Presidential Press Website

The scraping process takes advantage of the URL structure of `https://prensa.presidencia.cl/discursos.aspx`, using the Rcrawler library [^1] and Rvest [^2]. The speeches tab ("discursos") contained 97 pages at the time of scraping. Each of these pages has at most 6 links to speeches, for a total of 582 separate pages, each containing a single speech. Every speech page follows the same URL pattern ending in a numeric id, for example `https://prensa.presidencia.cl/discurso.aspx?id=135058`. This pattern is used to identify speech pages in the `Rcrawler` output.
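
As a minimal sketch of that filtering step, the base R snippet below keeps only the URLs that match the speech pattern. The `crawled_urls` vector is a hypothetical stand-in for the crawler's output (the last URL is made up for illustration), and the regular expression is one plausible way to express the pattern, not necessarily the one used in this project.

```r
# Hypothetical sample of crawled URLs; a real run would take these
# from the crawler's output instead.
crawled_urls <- c(
  "https://prensa.presidencia.cl/discursos.aspx",
  "https://prensa.presidencia.cl/discurso.aspx?id=135058",
  "https://prensa.presidencia.cl/discurso.aspx?id=135058",  # duplicate link
  "https://prensa.presidencia.cl/some-other-page.aspx"      # made-up non-speech page
)

# Speech pages end in "discurso.aspx?id=<number>"; the index page
# "discursos.aspx" does not match because of the extra "s".
speech_pattern <- "/discurso\\.aspx\\?id=[0-9]+$"

# Keep matching URLs and drop duplicates.
speech_urls <- unique(crawled_urls[grepl(speech_pattern, crawled_urls)])

# Extract the numeric id, which is handy as a key for each speech.
speech_ids <- as.integer(sub(".*id=", "", speech_urls))
```

Anchoring the pattern with `$` and escaping the dot and the question mark keeps the match strict, so index pages and other sections of the site are excluded.
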

Scraping text is straightforward with Rvest. I wrote a function that chains three steps: `read_html()`, `html_nodes()`, and `html_text()`. I used the same process to retrieve other useful information, such as the date and title of each speech.
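
The function described above might look roughly like the sketch below. The CSS selectors (`.title`, `.date`, `.body p`) and the inline sample page are assumptions made for illustration; the real selectors have to be read from the markup of the speech pages.

```r
library(rvest)

# Read one page and pull out title, date and body text.
# `page` can be a URL or a literal HTML string; the three CSS selectors
# are supplied by the caller because they depend on the site's markup.
scrape_speech <- function(page, text_css, title_css, date_css) {
  doc <- read_html(page)

  # Same three-step chain for every field: select nodes, extract text,
  # then collapse multiple nodes (e.g. paragraphs) into one string.
  extract <- function(css) {
    doc %>%
      html_nodes(css) %>%
      html_text(trim = TRUE) %>%
      paste(collapse = "\n")
  }

  data.frame(
    title = extract(title_css),
    date  = extract(date_css),
    text  = extract(text_css),
    stringsAsFactors = FALSE
  )
}

# Hypothetical page used only to exercise the function offline.
sample_page <- paste0(
  '<html><body>',
  '<h1 class="title">Sample title</h1>',
  '<span class="date">2019-01-15</span>',
  '<div class="body"><p>First paragraph.</p><p>Second paragraph.</p></div>',
  '</body></html>'
)

speech <- scrape_speech(sample_page,
                        text_css  = ".body p",
                        title_css = ".title",
                        date_css  = ".date")
```

Applied to the list of speech URLs (for example with `lapply()` followed by `do.call(rbind, ...)`), a function like this yields one row per speech.
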

## Analysing text & structural topic modelling

The 582 scraped speeches are shown by monthly count as follows:

_Monthly count of presidential speeches since Piñera took office in March 2018_

|_Year_|_Jan_|_Feb_|_Mar_|_Apr_|_May_|_Jun_|_Jul_|_Aug_|_Sep_|_Oct_|_Nov_|_Dec_|_Total_|
|:-----|----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|------:|
| _2018_ | | | 18 | 30 | 27 | 30 | 27 | 39 | 27 | 50 | 25 | 29 | __302__ |
| _2019_ | 37 | 22 | 18 | 20 | 29 | 22 | 27 | 25 | 25 | 23 | 11 | 12 | __271__ |
| _2020_ | 9 | | | | | | | | | | | | __9__ |
|_**Total**_|__46__|__22__ | __36__ | __50__ | __56__ | __52__ | __54__ | __***__ | __52__ | __73__ | __36__ | __39__ | __582__ |

Plotting the number of public speeches given by the Chilean president each month shows a substantial decrease from November 2019 onward. This coincides with the social uprising that Chile experienced during this period, which drove the government to an all-time minimum rate of public support according to **_CADEM_** [^3].
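
A year-by-month cross-tabulation like the one above can be produced in base R along these lines. The `speech_dates` vector is a small made-up sample, not the scraped data, and the object names are placeholders.

```r
# Made-up sample of speech dates; the real vector would hold one date
# per scraped speech.
speech_dates <- as.Date(c("2018-03-12", "2018-03-20", "2018-04-02",
                          "2019-01-15", "2020-01-07"))

years <- format(speech_dates, "%Y")

# Use a factor with all twelve levels so that months without speeches
# still appear as zero-count columns.
months <- factor(format(speech_dates, "%m"),
                 levels = sprintf("%02d", 1:12),
                 labels = month.abb)

# Cross-tabulate and add row/column totals (labelled "Sum").
monthly_counts <- addmargins(table(Year = years, Month = months))
```

Printing `monthly_counts` gives a year-by-month grid with a `Sum` row and column, which maps directly onto the layout of the table.
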

In the second step I apply structural topic modelling with the `stm` package. The `stm` package allows the researcher to estimate a model using document covariates. In this case I used the date to see how the proportion of each topic varies over time (by month). I chose `Plotly` _(hosted in Plotly Chart Studio: [Click here](https://github.com/andrespnc/Scrapping_CH/blob/master/https://chart-studio.plotly.com/~Andres1***6/1.embed?share_key=hkHUmY5lfL9zZc8nYvfVga))_ to visualize topic trends over time. For instance, **_Security & Crime_** is a recurrent topic in the president's speeches. Its proportion spikes at the end of 2018 and again at the end of 2019, coinciding with police brutality scandals faced by the government: first the killing of an unarmed indigenous civilian, and then police repression during the social upheaval.
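
The sketch below shows one way such a model could be set up with `stm`, using a month index as the prevalence covariate. Several choices here are assumptions rather than details from this project: the input data frame `speeches` with columns `text` and `date`, the number of topics `k`, the spline `s()` over time, the spectral initialisation, and Spanish preprocessing.

```r
library(stm)

# `speeches` is assumed to be a data frame with a character column
# `text` and a Date column `date` (both names are placeholders).
fit_speech_topics <- function(speeches, k = 10) {
  # Months elapsed since January of the earliest year; this is the
  # time covariate for topic prevalence.
  year  <- as.integer(format(speeches$date, "%Y"))
  month <- as.integer(format(speeches$date, "%m"))
  speeches$month_index <- (year - min(year)) * 12 + month

  # Tokenise, remove stop words and stem (Spanish is an assumption).
  processed <- textProcessor(speeches$text, metadata = speeches,
                             language = "spanish", verbose = FALSE)
  prepared <- prepDocuments(processed$documents, processed$vocab,
                            processed$meta, verbose = FALSE)

  # Topic prevalence is allowed to vary smoothly with the month index.
  model <- stm(prepared$documents, prepared$vocab, K = k,
               prevalence = ~ s(month_index), data = prepared$meta,
               init.type = "Spectral", verbose = FALSE)

  # Regress topic proportions on time; with no left-hand side in the
  # formula, all topics are included.
  effects <- estimateEffect(~ s(month_index), model,
                            metadata = prepared$meta)

  list(model = model, effects = effects, meta = prepared$meta)
}
```

With a real corpus, `fit <- fit_speech_topics(speeches)` followed by `plot(fit$effects, "month_index", method = "continuous", topics = 1)` would draw the estimated proportion of the first topic over time. The same estimates could then be exported to another plotting tool.
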

## Final Thoughts

This exercise had no purpose other than to practise coding skills and apply empirical methods to text data, and more specifically to data that should be available to all citizens. However, it is important to point out that by the end of this project, **the speeches of President Piñera are no longer available on the press website**. Only speeches from the current month can be accessed, with no option to browse past speeches.

[^1]: https://github.com/salimk/Rcrawler
[^2]: https://github.com/tidyverse/rvest
[^3]: https://www.cadem.cl/