This repository accompanies Grist's State Trust Lands Benefitting Prisons investigation. It allows users to build and modify the dataset underlying the project. For more details on our methodology, please view `METHODOLOGY.md` in our initial investigation. A user guide for working with similar datasets to the STLbp project is available at `USER-GUIDE.md` in the initial repo. Final built STLbp datasets are available in the `public_data` folder of this repo.
The investigation was written and reported by Alleen Brown, Clayton Aldern, and Maria Parazo Rose. This repository was authored by Parker Ziegler, Clayton Aldern, and Maria Parazo Rose.
Looking for the code behind the first story in this series? That repo is here. Looking for the code behind the interactives in the project? That repo is here.
This project currently requires Python <= 3.10.4. To set the appropriate Python version locally, consider using a Python version manager like `pyenv`.
This project uses Git LFS to store `.zip`, `.dbf`, `.shp`, and `.geojson` files remotely on GitHub rather than directly in the repository source. In order to access these files and build the datasets locally, you'll need to do the following:
- Install Git LFS. Follow the installation instructions for your operating system.
- At the terminal, run the following two commands (without typing the dollar sign):
```sh
$ git lfs install
$ git lfs pull
```
Together, these commands will install the Git LFS configuration, fetch LFS changes from the remote, and replace pointer files locally with the actual data.
After completing the above, install Python dependencies. At the terminal, run the following command (again, omit the dollar sign):
```sh
$ pip install -e .
```
This only needs to be done the first time you run the script.
All functionality is orchestrated by a single top-level command, `run.py`. To see a listing of all available command options, run the following command:
```sh
$ DATA=data python run.py --help
```
`run.py` expects a particular directory structure for the datasets, which is already enforced by this repository's directory structure. As such, you should not move any files from their current locations.
The STLbp dataset is built in stages, with the output of each stage becoming the input of the next stage.
The following assumes your data directory is itself called `data`; all paths will refer to it as such.
To execute Stage 1, run the following command at the terminal:
```sh
$ DATA=data python run.py stl-stage-1
```
This command:
- Gathers the raw data from both remote state-run servers and the input data directory.
- Collects all fields-of-interest from the raw data and normalizes the naming across datasets.
- Unifies all individual states' data into one large dataset.
Individual state datasets are written to `data/stl_dataset/step_1/output/merged/<state-abbreviation>.[csv,geojson]`. The unified dataset is written to `data/stl_dataset/step_1/output/merged/all-states.[csv,geojson]`.
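The gather/normalize/unify flow above can be sketched with a toy example. This is an illustrative sketch only: the raw field names and rename maps below are invented for demonstration, not the ones `run.py` actually uses.

```python
# Hypothetical sketch of Stage 1's normalize-and-unify pattern.
# Field names and rename maps are invented for illustration.
RENAME_MAPS = {
    "NM": {"ACRES_GIS": "acres", "LandUse": "activity"},
    "AZ": {"gis_acres": "acres", "use_code": "activity"},
}

def normalize(state, records):
    """Rename one state's raw fields to the shared schema."""
    rename = RENAME_MAPS[state]
    out = []
    for rec in records:
        row = {rename.get(k, k): v for k, v in rec.items()}
        row["state"] = state  # tag rows so the unified dataset stays traceable
        out.append(row)
    return out

def unify(per_state):
    """Concatenate all normalized state datasets into one list of rows."""
    merged = []
    for state, records in per_state.items():
        merged.extend(normalize(state, records))
    return merged

rows = unify({
    "NM": [{"ACRES_GIS": 640, "LandUse": "grazing"}],
    "AZ": [{"gis_acres": 320, "use_code": "agriculture"}],
})
```

Tagging each row with its source state is what makes a later per-state breakdown of the unified `all-states` dataset possible.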
To execute Stage 2, run the following command at the terminal:
```sh
$ DATA=data PYTHONHASHSEED=42 python run.py stl-stage-2
```
This command matches activity information to the parcels from the unified multi-state dataset output in Stage 1 (`data/stl_dataset/step_1/output/merged/all-states.[csv,geojson]`). The `data` directory for Stage 2 already includes state-specific information about the activities occurring on all parcels.
The output of this stage is written to `data/stl_dataset/step_2/output/stl_dataset_extra_activities.[csv,geojson]`.
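The activity-matching step can be sketched as a keyed left join. This is an illustrative sketch only: the join key of `(state, parcel_id)` and the field names are assumptions, not the actual schema. Sorting the matched activities, as below, is one way to keep output deterministic; pinning `PYTHONHASHSEED` in the command above addresses the same reproducibility concern at the interpreter level.

```python
# Hypothetical sketch of Stage 2's activity-matching step.
# The (state, parcel_id) join key is an assumption for illustration.
from collections import defaultdict

def attach_activities(parcels, activities):
    """Left-join activity descriptions onto parcels by (state, parcel_id)."""
    by_key = defaultdict(list)
    for act in activities:
        by_key[(act["state"], act["parcel_id"])].append(act["activity"])
    for parcel in parcels:
        key = (parcel["state"], parcel["parcel_id"])
        # Sort so output order is stable across runs.
        parcel["activities"] = sorted(by_key.get(key, []))
    return parcels

matched = attach_activities(
    parcels=[
        {"state": "MT", "parcel_id": "A1"},
        {"state": "MT", "parcel_id": "B2"},
    ],
    activities=[
        {"state": "MT", "parcel_id": "A1", "activity": "timber"},
        {"state": "MT", "parcel_id": "A1", "activity": "grazing"},
    ],
)
```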
Stage 2.5 enriches the unified dataset from Stage 2 (`data/stl_dataset/step_2/output/stl_dataset_extra_activities.[csv,geojson]`) with land-cession information for each parcel. In previous investigations, this step was a manual effort; it has since been automated. The new dataset will be named `stl_dataset_extra_activities_plus_cessions.csv` and located at `data/stl_dataset/step_2_5/output/` (though, as evidenced by the Stage 3 input details, the file name is not important to the code).
To execute Stage 2.5, run the following command at the terminal:
```sh
$ DATA=data python run.py stl-stage-2-5
```
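The enrichment pattern of Stage 2.5 can be sketched as adding one column via a lookup, then writing the result to a new CSV. The cession numbers, the `parcel_id` join, and the column names here are invented for illustration; the real stage derives cessions from its own inputs.

```python
# Hypothetical sketch of a Stage 2.5-style enrichment: add a cession
# column to each parcel row, then write the result as CSV.
import csv
import io

CESSIONS = {"A1": "344", "B2": "717"}  # parcel_id -> cession number (invented)

def add_cessions(rows):
    """Annotate each row with its cession number, if one is known."""
    for row in rows:
        row["cession"] = CESSIONS.get(row["parcel_id"], "")
    return rows

rows = add_cessions([
    {"parcel_id": "A1", "state": "MT"},
    {"parcel_id": "C3", "state": "MT"},  # no known cession -> empty field
])

# Serialize the enriched rows (an in-memory stand-in for writing
# the stage's output CSV to disk).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["parcel_id", "state", "cession"])
writer.writeheader()
writer.writerows(rows)
output = buf.getvalue()
```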
To execute Stage 3, run the following command at the terminal:
```sh
$ DATA=data python run.py stl-stage-3
```
This command will take the first CSV found in `data/stl_dataset/step_2_5/output/` and augment it with the prices paid for each parcel of land. The `data` directory for Stage 3 already contains a CSV with the listing of prices paid for land (`data/stl_dataset/step_3/input/Cession_Data.csv`).
The output of this stage is written to `data/stl_dataset/step_3/output/stl_dataset_extra_activities_plus_cessions_plus_prices.[csv,geojson]`.
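The "first CSV found" behavior described above can be sketched as follows. How `run.py` actually orders candidates is not documented here, so sorted glob order is an assumption chosen because it is deterministic; a temporary directory stands in for `data/stl_dataset/step_2_5/output/`.

```python
# Sketch of picking the first CSV in a directory, as Stage 3 does with
# its input folder. Sorted name order is an assumed, deterministic choice.
import csv
import tempfile
from pathlib import Path

def first_csv(directory):
    """Return the first CSV in a directory, in sorted name order."""
    candidates = sorted(Path(directory).glob("*.csv"))
    if not candidates:
        raise FileNotFoundError(f"no CSV files in {directory}")
    return candidates[0]

with tempfile.TemporaryDirectory() as tmp:
    # Create two candidate files; sorted order picks "a_dataset.csv".
    for name in ("b_dataset.csv", "a_dataset.csv"):
        Path(tmp, name).write_text("parcel_id\nA1\n")
    chosen = first_csv(tmp)
    with open(chosen, newline="") as f:
        rows = list(csv.DictReader(f))
```

Because the stage keys on "first CSV found" rather than a fixed file name, the Stage 2.5 output's name is free to change without breaking Stage 3.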
To execute Stage 4, run the following command at the terminal:
```sh
$ DATA=data python run.py stl-stage-4
```
This command will calculate summaries connecting cessions to tribes using the output of Stage 3 (`data/stl_dataset/step_3/output/stl_dataset_extra_activities_plus_cessions_plus_prices.[csv,geojson]`).
The outputs of this stage are two files, written to:
- `data/stl_dataset/step_4/output/tribe-summary.csv`
- `data/stl_dataset/step_4/output/tribe-summary-condensed.csv`
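A summary of this kind can be sketched as a group-by-and-aggregate over the parcel rows. The field names (`tribe`, `acres`, `price`) and the specific aggregates are assumptions for illustration; the real summary columns live in the two output files above.

```python
# Hypothetical sketch of a Stage 4-style per-tribe summary.
# Field names and aggregates are invented for illustration.
from collections import defaultdict

def tribe_summary(parcels):
    """Aggregate parcel count, acreage, and price paid per tribe."""
    summary = defaultdict(lambda: {"parcels": 0, "acres": 0.0, "price": 0.0})
    for p in parcels:
        row = summary[p["tribe"]]
        row["parcels"] += 1
        row["acres"] += p["acres"]
        row["price"] += p["price"]
    return dict(summary)

summary = tribe_summary([
    {"tribe": "Tribe A", "acres": 640.0, "price": 800.0},
    {"tribe": "Tribe A", "acres": 320.0, "price": 400.0},
    {"tribe": "Tribe B", "acres": 160.0, "price": 0.0},
])
```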