peskas.timor.data

Category: Operating system development
Development tool: R
File size: 0KB
Downloads: 0
Upload date: 2023-09-29 15:04:06
Uploader: sh-1993
Description: Peskas data pipeline for East Timor

File list (size in bytes, date):
.Rbuildignore (265, 2023-12-19)
.dockerignore (13, 2023-12-19)
DESCRIPTION (1602, 2023-12-19)
Dockerfile (1274, 2023-12-19)
Dockerfile.prod (1550, 2023-12-19)
LICENSE.md (34900, 2023-12-19)
NAMESPACE (3343, 2023-12-19)
NEWS.md (11251, 2023-12-19)
R/ (0, 2023-12-19)
R/airtable.R (7789, 2023-12-19)
R/calculate-nutrients.R (6101, 2023-12-19)
R/calculate-weights.R (15774, 2023-12-19)
R/clean-raw-data.R (17074, 2023-12-19)
R/cloud-storage.R (7725, 2023-12-19)
R/estimate-catch.R (12133, 2023-12-19)
R/export-dataverse.R (11669, 2023-12-19)
R/format-public-data.R (34233, 2023-12-19)
R/get-cloud-files.R (8167, 2023-12-19)
R/globals.R (68, 2023-12-19)
R/google-drive.R (717, 2023-12-19)
... ...

# peskas.timor.data.pipeline

[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental) [![CRAN status](https://www.r-pkg.org/badges/version/peskas.timor.data.pipeline)](https://CRAN.R-project.org/package=peskas.timor.data.pipeline) [![Codecov test coverage](https://codecov.io/gh/WorldFishCenter/peskas.timor.data.pipeline/branch/master/graph/badge.svg)](https://codecov.io/gh/WorldFishCenter/peskas.timor.data.pipeline?branch=master) [![R build status](https://github.com/WorldFishCenter/peskas.timor.data.pipeline/workflows/R-CMD-check/badge.svg)](https://github.com/WorldFishCenter/peskas.timor.data.pipeline/actions)

The goal of peskas.timor.data.pipeline is to implement, deploy, and execute the data and modelling pipelines that underpin Peskas-East Timor, the small-scale fisheries analytics system for East Timor.

## The pipeline is an R package

peskas.timor.data.pipeline is structured as an R package because this makes it easier to write production-grade software. Specifically, structuring the code as an R package:

- allows us to better handle system and package dependencies,
- forces us to split the code into functions,
- makes it easier to document the code, and
- makes it easier to test the code.

We make heavy use of [tidyverse style conventions](https://engineering-shiny.org) and the [usethis](https://usethis.r-lib.org) package to automate tasks during project setup and deployment. For more information about the rationale for structuring the pipeline as a package, see [Chapter 3](https://engineering-shiny.org/structuring-project.html#structuring-your-app_) of [*Engineering Production-Grade Shiny Apps*](https://engineering-shiny.org). The book focuses on Shiny applications, but the rationale also applies to data pipelines and production-ready code in general. The best place to learn more about package development is probably the [*R packages*](https://r-pkgs.org) book by Hadley Wickham and Jenny Bryan.

## The pipeline runs on GitHub Actions

While each step in the pipeline is defined as a function in the package, these functions are deployed and integrated using [GitHub Actions](https://docs.github.com/en/actions/learn-github-actions). This allows us to take advantage of best practices in continuous integration and deployment (CI/CD) and to automatically link the code to its execution. However, these workflow functions work almost as scripts because they don't take parameters and are used for their side effects. Each job in the pipeline is defined in the workflow file [`.github/workflows/data-pipeline.yaml`](https://github.com/WorldFishCenter/peskas.timor.data.pipeline/blob/main/.github/workflows/data-pipeline.yaml) and can be seen in the figure below. Note that additional workflows exist to test the package in multiple environments and to build the documentation website.

![](man/figures/pipeline.png)

The figure above illustrates the jobs that are part of the pipeline workflow. Note that not all of them are implemented yet. Generally, the artifacts produced by each job are stored in a cloud storage container and retrieved from cloud storage by the next job in the pipeline. When storing, a job's artifacts are versioned using the function `add_version()`, which generally includes a timestamp and the commit SHA in the name. This approach allows us to trace each artifact back to a unique run of the pipeline. When retrieving, jobs can call `cloud_object_name()` to obtain the latest or a specific version of an artifact.
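As an illustration of this storage/retrieval pattern, here is a minimal sketch of what a job might do. The argument names (`prefix`, `extension`, `version`) and the surrounding upload/download steps are assumptions for illustration only; the actual interfaces are defined in `R/cloud-storage.R`.

``` r
library(peskas.timor.data.pipeline)

# When storing: stamp the artifact name with a timestamp and the commit SHA
# (argument names here are hypothetical, not the documented signature)
versioned_name <- add_version("raw-landings", extension = "csv")
# e.g. something like "raw-landings__20231219120000_ab12cd3.csv"
# ... write the data locally and upload `versioned_name` to the storage bucket ...

# When retrieving (in a downstream job): look up the most recent matching artifact
latest_name <- cloud_object_name(prefix = "raw-landings", version = "latest")
# ... download `latest_name` from the bucket and carry on with the next step ...
```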
## Environment parameters are specified in the config file

The parameters that determine how the pipeline is run are specified in [`inst/conf.yml`](https://github.com/WorldFishCenter/peskas.timor.data.pipeline/blob/main/inst/conf.yml). This file can be accessed using `system.file("conf.yml", package = "peskas.timor.data.pipeline")`. Using this file, as opposed to, for example, hard-coding the parameters, allows us to easily switch parameters depending on the environment. We use the [config](https://github.com/rstudio/config) package to read the configuration file. We use three different environments (see below). To determine which environment to use, the config package checks the environment variable `R_CONFIG_ACTIVE`; a minimal example of reading the configuration is sketched after the list below.

- Remote development environment (default): The development environment is the "default" configuration. This environment should be used when the code is running in the cloud. One characteristic of this environment is that it uses cloud storage buckets that differ from those used by the real application, which makes it ideal for testing the code and the pipeline before they are deployed into production. Because this environment is designed to run in the cloud, it indicates that API tokens and authentication files should be read from environment variables. This works well when the code runs in GitHub Actions, because the workflow has instructions to read authentication details from [GitHub secrets](https://docs.github.com/en/actions/reference/encrypted-secrets) and pass them to R as environment variables.

- Local development environment: The "local" environment is similar to the default environment in that it is used for development and therefore uses resources suited to testing the code. The main difference is that the authentication information is read not from environment variables but from local files. Specifically, authentication files should live in a directory called `auth`, which should never be committed to git. It is possible to run `Sys.setenv(R_CONFIG_ACTIVE = "local")` in the R console to ensure the local environment is activated the next time `conf.yml` is read. An even easier alternative is to add the key-value pair `R_CONFIG_ACTIVE=local` to the `.Renviron` file in the project directory.

- Production environment: The production environment is similar to the default environment in that it is designed to run in the cloud and to read authentication details from environment variables. It differs in that it uses cloud resources that should be used exclusively in production, once things have been tested. This environment is active when `R_CONFIG_ACTIVE=production`; this environment variable is set by the pipeline workflow file when the code executes from the "main" git branch.
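Here is a minimal sketch of how the active environment is selected and the configuration read. The `Sys.setenv()` call and the placeholder key at the end are for illustration; the actual keys are those defined in `inst/conf.yml`.

``` r
library(config)

# Select the local development environment before the configuration is read
# (in production this variable is set by the GitHub Actions workflow instead)
Sys.setenv(R_CONFIG_ACTIVE = "local")

# Locate the configuration file shipped with the package
conf_path <- system.file("conf.yml", package = "peskas.timor.data.pipeline")

# config::get() reads the section matching R_CONFIG_ACTIVE and falls back to
# the "default" section when the variable is unset
params <- config::get(file = conf_path)

# `params` is a named list; the key below is a placeholder, not an actual key
# params$storage$bucket
```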
## We use docker containers

We use docker containers to make it easier to run and develop the code.

- Development: We use the main [`Dockerfile`](https://github.com/WorldFishCenter/peskas.timor.data.pipeline/blob/main/Dockerfile) for development. It is based on the rocker/geospatial image and spins up an RStudio Server instance with quite a large number of packages. To start an instance of this container, simply go to the project's directory and run `docker-compose up -d --build` from the terminal.

- Production: We use [`Dockerfile.prod`](https://github.com/WorldFishCenter/peskas.timor.data.pipeline/blob/main/Dockerfile.prod) to run the code in production. This image is based on a more lightweight version of R and installs only the required packages.

The first job in the pipeline builds this container and the other steps use it to run the code. This allows us to run the code under the same environment regardless of the cloud computing infrastructure that runs it.

## Logging

We use the [logger](https://daroczig.github.io/logger/) package to log events in production.
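As a brief illustration of the kind of log calls this enables (the threshold, messages, and values below are examples, not the pipeline's actual logging code):

``` r
library(logger)

# Emit INFO-level messages and above; DEBUG messages are suppressed
log_threshold(INFO)

log_info("Starting ingestion of raw landings data")
n_rows <- 120  # placeholder value for illustration
log_info("Retrieved {n_rows} survey submissions")
log_warn("Some submissions are missing catch weights")
```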
