television-news-analyser

Category: Energy industry (electric power, oil, coal)
Development tool: Jupyter Notebook
File size: 0KB
Downloads: 0
Upload date: 2023-09-10 15:38:55
Uploader: sh-1993
Description: Scrape French TV news to analyse humanity's biggest challenge: fossil energies and climate change.

File list:
.dockerignore (292, 2024-01-04)
.scalafmt.conf (201, 2024-01-04)
CNAME (29, 2024-01-04)
Dockerfile (263, 2024-01-04)
LICENSE (18092, 2024-01-04)
build.sbt (1978, 2024-01-04)
data-news-csv/ (0, 2024-01-04)
data-news-csv/year=2013/ (0, 2024-01-04)
data-news-csv/year=2013/part-00000-d964c139-19ed-47cf-b389-49b1f624aa7c.c000.csv.gz (3991161, 2024-01-04)
data-news-csv/year=2014/ (0, 2024-01-04)
data-news-csv/year=2014/part-00000-d964c139-19ed-47cf-b389-49b1f624aa7c.c000.csv.gz (6461348, 2024-01-04)
data-news-csv/year=2015/ (0, 2024-01-04)
data-news-csv/year=2015/part-00000-d964c139-19ed-47cf-b389-49b1f624aa7c.c000.csv.gz (7177484, 2024-01-04)
data-news-csv/year=2016/ (0, 2024-01-04)
data-news-csv/year=2016/part-00000-d964c139-19ed-47cf-b389-49b1f624aa7c.c000.csv.gz (7529225, 2024-01-04)
data-news-csv/year=2017/ (0, 2024-01-04)
data-news-csv/year=2017/part-00000-d964c139-19ed-47cf-b389-49b1f624aa7c.c000.csv.gz (9411616, 2024-01-04)
data-news-csv/year=2018/ (0, 2024-01-04)
data-news-csv/year=2018/part-00000-d964c139-19ed-47cf-b389-49b1f624aa7c.c000.csv.gz (11042587, 2024-01-04)
data-news-csv/year=2019/ (0, 2024-01-04)
data-news-csv/year=2019/part-00000-d964c139-19ed-47cf-b389-49b1f624aa7c.c000.csv.gz (10506283, 2024-01-04)
data-news-csv/year=2020/ (0, 2024-01-04)
data-news-csv/year=2020/part-00000-d964c139-19ed-47cf-b389-49b1f624aa7c.c000.csv.gz (10269626, 2024-01-04)
data-news-csv/year=2021/ (0, 2024-01-04)
data-news-csv/year=2021/part-00000-d964c139-19ed-47cf-b389-49b1f624aa7c.c000.csv.gz (11364300, 2024-01-04)
data-news-csv/year=2022/ (0, 2024-01-04)
data-news-csv/year=2022/part-00000-d964c139-19ed-47cf-b389-49b1f624aa7c.c000.csv.gz (13628588, 2024-01-04)
data-news-csv/year=2023/ (0, 2024-01-04)
data-news-csv/year=2023/part-00000-d964c139-19ed-47cf-b389-49b1f624aa7c.c000.csv.gz (10297382, 2024-01-04)
data-news-gargantext-tsv/ (0, 2024-01-04)
data-news-gargantext-tsv/part-00000-76d92978-9fb7-4b9a-8606-d3299d3c633d-c000.csv (60338955, 2024-01-04)
... ...

# [TV news analyser](https://polomarcus.github.io/television-news-analyser/) | https://observatoire.climatmedias.org/

[![List news urls containing global warming everyday at 5am](https://github.com/polomarcus/television-news-analyser/actions/workflows/save-data.yml/badge.svg)](https://github.com/polomarcus/television-news-analyser/actions/workflows/save-data.yml)

[Scrape](https://en.wikipedia.org/wiki/Data_scraping) France 2, France 3, and TF1 TV news to analyse humanity's biggest challenge: fossil energies and **climate change**, and [analyse the data on a website](https://polomarcus.github.io/television-news-analyser/).

![metabaseexample](https://user-images.githubusercontent.com/4059615/203794161-12fa4267-252f-41a5-af26-d0cad55eceed.png)

### Data sources

- HTML pages:
  * TF1: https://www.tf1info.fr/emission/le-20h-11001/extraits/
  * France 2: https://www.francetvinfo.fr/replay-jt/france-2/20-heures/jt-de-20h-du-jeudi-30-decembre-2021_4876025.html
  * France 3: https://www.francetvinfo.fr/replay-jt/france-3/19-20/jt-de-19-20-du-vendredi-15-avril-2022_5045866.html

### Data sinks

* JSON: https://github.com/polomarcus/television-news-analyser/tree/main/data-news-json/
* CSV, or rather compressed Tab Separated Values (TSV) ([if you don't know how to uncompress this data](https://www.wikihow.com/Extract-a-Gz-File)): https://github.com/polomarcus/television-news-analyser/tree/main/data-news-csv/

JSON data can be stored inside Postgres and displayed on a Metabase dashboard (read "Run" in this readme), or can be found on [this website](https://observatoire.climatmedias.org/).

## Run

### Requirements

* [docker compose](https://docs.docker.com/compose/install/)
* Optional: if you want to code, you need the Scala build tool (SBT)

### Spin up Postgres, Metabase and nginx, and load data to PG using SBT

#### Docker Compose without SBT (Scala build tool)

```
# with docker compose - no need for sbt
./init-stack-with-data.sh
# this script does this:
docker-compose -f src/test/docker/docker-compose.yml up -d --build app
```

#### Init Metabase to explore with SQL

After you have run the project with docker compose, you can check Metabase at http://localhost:3000 with a few steps:
1. configure an account
2. configure the PostgreSQL data source (user/password - host: postgres - database name: metabase) (see docker-compose for details)
3. you're good to go: "Ask a simple question", then select your data source and the "Aa_News" table

#### Jupyter Notebook

Some examples are inside [example.ipynb](https://github.com/polomarcus/television-news-analyser/blob/main/example.ipynb), but I preferred to use Metabase dashboards and SQL visualisations.

### To scrape data from 3 pages of the France 2 website

```
sbt "runMain com.github.polomarcus.main.TelevisionNewsAnalyser 3"
```

### To store the JSON data in PG and explore it with Metabase

```
sbt "runMain com.github.polomarcus.main.SaveTVNewsToPostgres"
```

### To update data for the website alone

```
sbt "runMain com.github.polomarcus.main.UpdateNews"
```

## How does it run automatically every day?

The latest France 2, France 3 and TF1 replays are scraped with [a GitHub Action](https://github.com/polomarcus/television-news-analyser/actions/workflows/save-data.yml), then these news items are stored inside [this folder](https://github.com/polomarcus/television-news-analyser/tree/main/data-news-json/), partitioned by media and by date.
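Those stored items can be decoded with [circe](https://circe.github.io/circe/parsing.html), which is listed under "Libraries documentation" below. Here is a minimal sketch, assuming circe-generic is on the classpath; the record shape and field names are illustrative (only `containsWordGlobalWarming`, title and description are mentioned in this readme), not the project's actual model:

```scala
import io.circe.generic.auto._
import io.circe.parser.decode

object ReadNewsJsonExample extends App {
  // Illustrative shape of one scraped news item; field names are assumptions.
  case class News(
      title: String,
      description: String,
      date: String,
      media: String,
      containsWordGlobalWarming: Boolean
  )

  // A hand-written sample record, standing in for one line of the JSON data folder.
  val raw =
    """{"title":"JT de 20h","description":"Canicule et réchauffement climatique","date":"2023-07-01","media":"France 2","containsWordGlobalWarming":true}"""

  decode[News](raw) match {
    case Right(news) => println(s"${news.media} | ${news.title} | climate: ${news.containsWordGlobalWarming}")
    case Left(err)   => println(s"Could not decode news item: $err")
  }
}
```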
If a news title or description contains a ["global warming keyword"](https://github.com/polomarcus/television-news-analyser/blob/main/src/main/scala/com/github/polomarcus/utils/TextService.scala#L9), it is marked as such with `containsWordGlobalWarming: Boolean`.

Some results can be found on this repo's website: https://polomarcus.github.io/television-news-analyser/ | https://observatoire.climatmedias.org/

### To check the GitHub Action

1. Click here: https://github.com/polomarcus/television-news-analyser/actions/workflows/save-data.yml
2. Click on the last workflow run, called "Get news from websites", then on "click-here-to-see-data"
3. Click on "List France 2 news urls containing global warming (see end)" to see France 2's urls
4. Click on "List TF1 news urls containing global warming (see end)" to see TF1's urls:

![Urls are listed on the github action workflow](https://user-images.githubusercontent.com/4059615/151147733-3313174a-e2fd-486e-85e7-81272ec0957c.png)

## Check out the project website locally (https://observatoire.climatmedias.org/)

Go to http://localhost:8080. The sources are inside the `docs` folder.

## Test

```
# first, be sure to have docker compose up with ./init-stack-with-data.sh
sbt test # it will parse some localhost pages from test/resources/
```

### Test only one method

```
sbt> testOnly ParserTest -- -z parseFranceTelevisionHome
```

## Libraries documentation

* https://github.com/ruippeixotog/scala-scraper
* https://circe.github.io/circe/parsing.html
* [Have multiple threads to handle Futures](http://stackoverflow.com/questions/15285284/how-to-configure-a-fine-tuned-thread-pool-for-futures)
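The last link above is about giving Futures a dedicated, fine-tuned thread pool instead of relying on the global one. A minimal sketch of that idea, using only the standard library; the pool size and `fetchPage` are illustrative, not this repo's actual code:

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}

object ScrapingPoolExample extends App {
  // Dedicated fixed-size pool so long-running scraping Futures
  // do not starve scala.concurrent.ExecutionContext.global.
  implicit val scrapingPool: ExecutionContextExecutorService =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

  // Hypothetical fetch: in the real project this would be an HTTP / scala-scraper call.
  def fetchPage(url: String): Future[String] =
    Future(s"<html>fetched $url</html>")

  val pages = Future.sequence(
    List("https://www.tf1info.fr", "https://www.francetvinfo.fr").map(fetchPage)
  )

  println(s"Fetched ${Await.result(pages, 10.seconds).size} pages")
  scrapingPool.shutdown() // let the JVM exit once the pool is done
}
```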
