This is a pipeline for appending place- and date- based geomarker data from multiple data sources at multiple geographic and temporal resolutions, with a specific focus on Hamilton County, Ohio:
%%{init: { "fontFamily": "arial" } }%%
graph TD
classDef id fill:#fff,stroke:#000,stroke-width:1px;
classDef input fill:#fff,stroke:#000,stroke-width:1px;
classDef data fill:#e8e8e8,stroke:#000,stroke-width:1px;
hosp[(hospitalizations)]:::input
hosp --> date(date related to address):::id
hosp --> address(cleaned and parsed \naddress components):::input
daily("weather, air quality, pollen/mold, \nseasonality, instructional days"):::data
date --join by date --> daily
address --street range \ngeocoding--> geocode(geocoded \ncoordinates):::id
address --probabilistic \nmatching--> parcel(parcel \nidentifier):::id
pd(point-level \ndata resources, e.g.,\n greenspace, traffic,\n air pollution, \nhospital access):::data
tract_data(tract-level data resources, e.g.,\nChild Opportunity Index,\n Material Deprivation Index):::data
geocode --spatial intersection \nwith census tract --> tract_data
geocode -- geomarker \nassessment library --> pd
parcel_data(tax auditor databases,\nhousing code violations,\nhousing health pedigree):::data
parcel --join by parcel identifier--> parcel_data
See the metadata for information about specific data elements and sources.
Notes:
- Patient addresses are constructed by pasting together
ADDRESS
,CITY
,STATE
,ZIP
fields hh_acs_measures
has annual measures, but measures from 2019 are used here
Data | Min Date | Max Date | Frequency | Data Availability Lag |
---|---|---|---|---|
AQI | 2015-01-01 | 2022-10-01 | daily | 6 months |
weather | 2015-01-01 | 2022-09-30 | daily | 6 months |
pollen and mold | 2021-02-17 | 2022-12-09 | random (skips weekends, some random days missing) | unknown |
shotspotter (only for Avondale, E. Price Hill, and W. Price Hill) | 2017-08-16 | 2023-01-03 | daily | daily |
- Clone github repository to destination; manually move input health data into place (
DR1767_r2.csv
) - Install
pak
(install.packages("pak", repos = sprintf("https://r-lib.github.io/p/pak/stable/%s/%s/%s", .Platform$pkgType, R.Version()$os, R.Version()$arch))
) - Install all packages from DESCRIPTION file by running
pak::pak()
orremotes::install_deps()
in the project root from R. (If you are on a linux machine, speed up installation to use binaries hosted by Posit, by settingoptions("repos" = c("CRAN" = "https://packagemanager.rstudio.com/all/__linux__/jammy/latest"))
, substitutingjammy
for your specific linux version.) - Install required python libraries. (Use
reticulate::py_config()
to check on available python environments):
reticulate::py_install("usaddress", pip = TRUE)
reticulate::py_install("dedupe", pip = TRUE)
reticulate::py_install("dedupe-variable-address", pip = TRUE)
- Use
make
to create targets defined inMakefile
ormake tdr
to create the final output as a tabular data resource.docker
is required to run thegeocode
andgeomark
targets.
To specify the python executable to use for the pipeline without using R, set an environment variable in the shell being used to call make
:
export RETICULATE_PYTHON=~/.virtualenvs/r-parcel/bin/python
make all
Notes:
/data
is for any output data associated with a participant and/or the raw hospital admissions file; any*.rds
or*.csv
files in this directory will always be git ignored/data-raw
is for raw (e.g., violations spreadsheet) or intermediate data products (e.g., daily AQI) that are tracked using gitMakefile
defines the pipeline, see other high level targets there