# `domestic-heating-data`
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7322967.svg)](https://doi.org/10.5281/zenodo.7322967)
Data pipelines for [Centre for Net Zero's agent-based model of domestic heating](https://github.com/centrefornetzero/domestic-heating-abm).
The pipelines transform and combine publicly available datasets to produce data relevant to the decisions households in England and Wales make about their heating system.
Read the [post on our tech blog](https://www.centrefornetzero.org/how-we-use-bigquery-dbt-and-github-actions-to-manage-data-at-cnz/) for a longer description of how this works.
## Where can I download the data?
The datasets we use are publicly available but released under their own licences and copyright restrictions.
Here we publish code to transform the datasets.
You need to obtain the datasets yourself and use this code to transform them. The `README` for each dataset in [`cnz/models/staging`](cnz/models/staging) contains a link to download the original data.
If you wish to cite this GitHub repository, or to download the joined dataset (after carefully reading the terms of the licences and restrictions), refer to our [Zenodo page for this dataset](https://zenodo.org/record/7322967#.Y3OGWezP16o).
## `dim_household_agents`
[`dim_household_agents`](cnz/models/marts/domestic_heating/dim_household_agents.sql) is the ultimate output of the models.
Each row describes a household we can model in our ABM.
`dim_household_agents` queries [`dim_households`](cnz/models/marts/domestic_heating/dim_households.sql), which contains all the households in `dim_household_agents` _and_ those with insufficient data for us to include in the ABM.
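The relationship between the two models is essentially a filter. A minimal sketch of the idea, assuming hypothetical column names (`property_type`, `heating_system`) — the real inclusion criteria live in the model SQL:

```sql
-- Illustrative only: dim_household_agents keeps the subset of
-- dim_households with enough data for the ABM. The column names
-- below are hypothetical, not the model's actual fields.
select *
from {{ ref("dim_households") }}
where property_type is not null
  and heating_system is not null
```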
## Supported databases
We use [BigQuery](https://cloud.google.com/bigquery).
We haven't tested the project with other databases, but we expect it would work with databases like PostgreSQL with little or no modification to the queries.
As of writing there are 168 data tests to help you make any changes with confidence.
## dbt set up
If you are new to dbt, first read the [introduction in the dbt documentation](https://docs.getdbt.com/docs/introduction).
You need Python 3.9 and [`pipenv`](https://github.com/pypa/pipenv).
If you don't have them, see [our instructions for macOS](https://gist.github.com/tomwphillips/715d4fd452ef5d52b4708c0fc5d4f30f).
To set up the project, run:
1. Clone this repo.
2. `cp .env.template .env`.
3. Fill in the values in `.env`.
4. `./scripts/bootstrap.sh`.
Once you've loaded the data into BigQuery you can:
1. Test all your sources: `dbt test --models "source:*"`
2. Run all the models: `dbt run`
3. Test all the models: `dbt test --exclude "source:*"`
If that all succeeded, you should now be able to query `dim_household_agents`.
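For example, a quick sanity check on the output table. The dataset name here is a placeholder for whatever BigQuery dataset you configured in `.env`:

```sql
-- Count the household agents produced by the pipeline.
-- Replace `your_dataset` with your configured BigQuery dataset.
select count(*) as n_households
from your_dataset.dim_household_agents
```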
## CNZ's dbt style guide
Our `models` directory is organised into two folders: `staging` and `marts`.
It is inspired by how [Fishtown Analytics structure their dbt projects](https://discourse.getdbt.com/t/how-we-structure-our-dbt-projects/355).
### Staging
`staging` is organised by source, e.g. `epc` or `nsul`.
Staging models take raw data, clean it up (fill in/drop missing values, recast types, rename fields, etc.) and make it available for further use.
By doing this we only have to clean up a dataset once.
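A staging model typically looks something like the sketch below. This is a hypothetical example, not a real model from the repo — the source, table, and field names are illustrative:

```sql
-- Sketch of a typical staging model (hypothetical fields).
-- Staging models rename, recast, and clean raw source data so
-- downstream marts never touch it directly.
select
    upper(postcode) as postcode,                          -- normalise text
    cast(inspection_date as date) as inspection_date,     -- recast types
    coalesce(property_type, 'unknown') as property_type   -- fill missing values
from {{ source('epc', 'certificates') }}
where postcode is not null                                -- drop unusable rows
```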
```
└── models
    ├── staging
    │   ├── nsul
    │   ├── epc
    │   └── ...
    └── marts
```
Within each source folder, you will find:
* `src_