# Quantitative Methods
```{r method-libs, echo=FALSE, message=FALSE}
library(here)
library(knitr)
library(tidyverse)
library(lubridate)
library(sf)
library(arrow)
library(kableExtra)
`%!in%` <- Negate(`%in%`)  # convenience "not in" operator
```
How might the results of a quantitative approach to evaluating urban quality at the
parcel level align with the values identified in the workshops described in
Chapter 3? In this chapter, we propose one such approach, using
readily-available parcel-level data for Allegheny County, Pennsylvania.
## Data
We obtained data on property addresses, land uses, assessed values (for both
land and buildings), and sale prices
from @allegheny_county_office_of_property_assessments_allegheny_2022, which
includes information on 582,116 properties in Allegheny County.
We also obtained latitude and longitude coordinates for each property from a
geocoder file provided by @western_pennsylvania_regional_data_center_geocoders_2021.
Over 99.5 percent of properties included in the assessment dataset are included
in the geocoder file. Properties without geocoded locations are excluded from
our analysis.
Potential development sites were identified as those
1. classified as "residential" (residential properties with one to
four housing units) or "commercial" (which includes mixed-use developments
and residential properties with more than four housing units), and
2. with a land use description in one of 59 possible categories^[One site (3008
Phillip Dr in Clairton) is missing a land use description in the assessment data.
We checked this address on Zillow to determine that this is a single-family home
and classified it as such in our data.]. The most common of these are listed in Table \@ref(tab:list-site-uses).^[The land use descriptions that were
classified as potential development sites but are not listed in Table
\@ref(tab:list-site-uses), which combine to represent less than one percent of all sites,
are "RIGHTOF WAY - RESIDENTIAL", "CONDOMINIUM UNIT", "DWG USED AS OFFICE",
"APART:20-39 UNITS", "CONDO GARAGE UNITS", "COMMON AREA", "CONDO DEVELOPMENTAL
LAND", "CONDEMNED/BOARDED-UP", "CONDOMINIUM OFFICE BUILDING", "INDEPENDENT LIVING
(SENIORS)", "DWG USED AS RETAIL", "OTHER COMMERCIAL", "MOBILE HOMES/TRAILER PKS",
"RIGHT OF WAY - COMMERCIAL", "GROUP HOME", "TOTAL/MAJOR FIRE DAMAGE - COMM",
"OTHER COMMERCIAL HOUSING", "TOTAL/MAJOR FIRE DAMAGE", "COMM APRTM CONDOS 5-19
UNITS", "MUNICIPAL URBAN RENEWAL", "COMMERCIAL LAND", "CAMPGROUNDS", "COMMON AREA
OR GREENBELT", "CHARITABLE EXEMPTION/HOS/HOMES", "INCOME PRODUCING PARKING LOT",
"DWG APT CONVERSION", ">10 ACRES VACANT", "MINOR FIRE DAMAGE", "COMM APRTM CONDOS
20-39 UNITS", "COMMERCIAL/UTILITY",
"H.O.A RECREATIONS AREA", "COMM APRTM CONDOS 40+ UNITS", "MINOR FIRE DAMAGE - COMM",
"OTHER", "OTHER RESIDENTIAL STRUCTURE", "OWNED BY METRO HOUSING AU", "RESIDENTIAL VACANT
LAND", "HUD PROJ #221", and "VACANT LAND 0-9 ACRES"].
```{r list-site-uses, echo=FALSE, message=FALSE, tidy=FALSE}
sites <- here("02_data",
"sites.csv") %>%
read_csv(show_col_types = FALSE)
site_use_summary <- sites %>%
group_by(USEDESC) %>%
summarise(`Number of potential sites` = n()) %>%
arrange(desc(`Number of potential sites`)) %>%
mutate(`Percent of potential sites` = 100 *
(`Number of potential sites` / sum(`Number of potential sites`))) %>%
mutate(`Cumulative percent of potential sites` =
cumsum(`Percent of potential sites`))
kable(
head(site_use_summary, 20),
caption = 'Most common land uses categorized as potential sites',
booktabs = TRUE,
digits = 1,
format.args = (list(big.mark = ','))) %>%
kable_styling(full_width = TRUE)
```
Potential development sites were further filtered to exclude those with missing data
on the most recent sale (about one percent of all sites).^[Four sites had sales
prices listed that were unreasonably high. 3039 Liberty Avenue in Pittsburgh is
listed as having sold for \$511,945,000 on August 30, 2021. Zillow lists this
property as having sold on that date for \$511,945
(https://www.zillow.com/homedetails/3039-W-Liberty-Ave-Pittsburgh-PA-15216/2070262638_zpid/, accessed 5/4/2022),
so the value was corrected for what appears to have been a typo. 220 Hyeholde Dr
in Coraopolis is listed as having sold for \$28,100,000 in 1967. This may also
be a typo, and it also does not seem to be the most recent sale. Zillow lists
this home as having sold for \$350,000 in 2004
(https://www.zillow.com/homes/220-hyeholde-dr,-Coraopolis,-PA_rb/11552817_zpid/,
accessed 5/4/2022), so the data was corrected to add that as the most recent sale.
Two other sites were identified as having unreasonably high sales values: 1339
Arlington Avenue in Pittsburgh is a three-bedroom single-family home that is
listed as having sold for \$57,010,813 in 1976 and a 0.06-acre vacant lot with
tax ID 0165G00270000000 is listed as having sold for \$24,920,232 in 1936. The
sales data for these sites were treated as missing.]
The focus of this analysis is on potential development sites rather than on
properties. Some properties in the assessor dataset are condominiums, where
multiple properties share a single parcel of land. We aggregated these to the
site level by identifying all properties with an assessed building value
greater than zero, a land value of zero, and a land use description that did
not indicate the land was vacant. If multiple such properties share an address,
we classified all properties at that address as a condominium and aggregated
them to the parcel level. This led to a final sample of 506,405 sites.
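A minimal sketch of the condominium flag is shown below (not evaluated). The data
frame `properties` and the column names other than `PARID` and `USEDESC` are
placeholders for illustration, and `vacant_uses` stands in for the use descriptions
that indicate vacant land.
```{r condo-flag-sketch, eval=FALSE}
# Illustrative only: properties, building_value, land_value, address, and
# vacant_uses are placeholder names.
vacant_uses <- c("RESIDENTIAL VACANT LAND", "VACANT LAND 0-9 ACRES")

properties <- properties %>%
  mutate(maybe_condo = building_value > 0 &
           land_value == 0 &
           USEDESC %!in% vacant_uses) %>%
  group_by(address) %>%
  # every property at an address with more than one flagged property is
  # treated as part of a condominium
  mutate(is_condo = sum(maybe_condo) > 1) %>%
  ungroup()
```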
### Tax assessment data
Three variables (total assessed fair market value, assessed fair market
value of the building, and lot area) were taken directly from the county
tax assessment data for use in our analysis. We also included the most
recent listed sales price, adjusted for inflation.
To aggregate properties identified as condominiums to the site level, we summed
the total values for lot area, assessed land value, assessed building value, and
inflation-adjusted sale price. We log-transformed these four variables prior to
including them in our analysis. Their distributions are shown in
Figure \@ref(fig:assessor-hist).
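The sketch below (not evaluated) shows one way the site-level aggregation,
inflation adjustment, and log transformation might be implemented. The CPI lookup
file and the `sale_price`, `sale_year`, and `address` columns are placeholders,
and the adjustment uses a generic annual price index rather than any particular
deflator.
```{r site-vars-sketch, eval=FALSE}
# Illustrative only: cpi_annual.csv (columns year, index) and the sale_price,
# sale_year, and address columns are placeholders.
cpi <- here("02_data", "cpi_annual.csv") %>%
  read_csv(show_col_types = FALSE)
base_index <- cpi$index[cpi$year == max(cpi$year)]

sites <- properties %>%
  group_by(site_id = if_else(is_condo, address, PARID)) %>%
  summarise(lot_area = sum(lot_area),
            land_value = sum(land_value),
            total_value = sum(land_value + building_value),
            sale_price = sum(sale_price),
            sale_year = max(sale_year)) %>%
  left_join(cpi, by = c("sale_year" = "year")) %>%
  mutate(price_adj = sale_price * base_index / index,
         total_value_log = log(total_value),
         land_value_log = log(land_value),
         lot_area_log = log(lot_area),
         price_log = log(price_adj))
```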
```{r assessor-hist, echo=FALSE, warning=FALSE, message=FALSE, tidy=FALSE, fig.asp=1.0, out.width='100%', fig.cap='Distribution of variables from tax assessor database'}
assessor_vals_long <- sites %>%
select(total_value_log,
land_value_log,
lot_area_log,
price_log) %>%
rename(`Total assessed value` = total_value_log,
`Land value` = land_value_log,
`Lot area (sf)` = lot_area_log,
`Inflation-adjusted price` = price_log) %>%
pivot_longer(everything(),
names_to = "variable",
values_to = " ")
ggplot(assessor_vals_long) +
geom_histogram(aes(x = ` `),
bins = 50) +
facet_wrap(facets = vars(variable),
nrow = 2) +
scale_y_continuous(name = "Number of parcels") +
scale_x_continuous(breaks = breaks <- log(c(1, 100, 10000, 1000000)),
labels = c("1", "100", "10,000", "1,000,000")) +
theme_minimal()
```
### Accessibility data
Accessibility was calculated from each of the 518,032 sites in our sample to
each of several location types described below.
#### Destination parcels
We used land use codes from the county assessor parcel data to identify
_destination parcels_ that residents might value access to. The most common
land use codes of identified destination parcels are listed in Table \@ref(tab:dest-uses).
```{r dest-uses, echo=FALSE, out.width='100%'}
dest_parcels <- here("02_data",
"all-parcels.csv") %>%
read_csv(show_col_types = FALSE) %>%
filter(Category == "destination") %>%
group_by(USEDESC) %>%
summarise(`Number of identified destinations` = n()) %>%
arrange(desc(`Number of identified destinations`)) %>%
mutate(`Percent of identified destinations` = 100 *
(`Number of identified destinations` /
sum(`Number of identified destinations`))) %>%
mutate(`Cumulative percent of identified destinations` =
cumsum(`Percent of identified destinations`))
kable(
head(dest_parcels, 20),
caption = 'Land uses identified as potential destinations',
booktabs = TRUE,
digits = 2,
format.args = (list(big.mark = ','))
) %>%
kable_styling(full_width = TRUE)
```
#### Job locations
We identified _job locations_ based on data from a Longitudinal
Employer-Household Dynamics (LEHD) dataset published by the United States Census
Bureau [@united_states_census_bureau_lehd_2021]. The LEHD dataset provides the
total number of jobs in each census block in the United States, based on
employment tax records. The location of each job was defined as the centroid of
the block in which it was located. We downloaded job location data for
Pennsylvania and filtered it to include locations in the Pittsburgh metropolitan
area (Allegheny, Armstrong, Beaver, Butler, Fayette, Washington, and
Westmoreland counties).
In addition to calculating the accessibility to jobs of all categories, we also
calculated accessibility to several subsets of jobs. We disaggregated jobs by
earnings, reasoning that the usefulness of a job might vary depending on how
well it matches a worker's skills or wage expectations. _High-paying job locations_
are a subset of job locations where the worker earns more than \$3333 per month.
_Low-paying job locations_ are those where the worker earns \$1250 per month or less.
We also disaggregated jobs by employment industry, based on the North
American Industry Classification System (NAICS), reasoning that the presence of
jobs in particular industries might represent a shopping or recreation destination.
_Retail job locations_ are a subset of job locations in NAICS sector 44-45
(retail trade); _Entertainment job locations_ are those in NAICS sector 71
(arts, entertainment, and recreation); and _Hospitality job locations_ are
those in NAICS sector 72 (accommodation and food services).
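The sketch below (not evaluated) illustrates how these subsets might be pulled
from a LODES Workplace Area Characteristics (WAC) file. The file name is a
placeholder, and the column names (C000, CE01, CE03, CNS07, CNS17, and CNS18)
and county FIPS codes follow our reading of the LODES documentation and should
be verified against the file actually used.
```{r lehd-subsets-sketch, eval=FALSE}
# Illustrative only: file name is a placeholder; verify LODES column names and
# county FIPS codes against the downloaded file.
metro_counties <- c("003", "005", "007", "019", "051", "125", "129")

job_locations <- here("02_data", "pa_wac.csv") %>%
  read_csv(col_types = cols(w_geocode = col_character())) %>%
  filter(str_sub(w_geocode, 3, 5) %in% metro_counties) %>%
  transmute(w_geocode,
            all_jobs = C000,       # all jobs
            hi_pay_jobs = CE03,    # earnings above $3,333 per month
            lo_pay_jobs = CE01,    # earnings of $1,250 per month or less
            retail_jobs = CNS07,   # retail trade (NAICS 44-45)
            entrnmt_jobs = CNS17,  # arts, entertainment, and recreation (NAICS 71)
            hosplty_jobs = CNS18)  # accommodation and food services (NAICS 72)
```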
Finally, we identified three location types that correspond with common non-work
trips: schools, grocery stores, and parks. _Grocery store locations_ were
identified as vendors participating in the Supplemental Nutrition Program for
Women, Infants, and Children (WIC). WIC vendor locations and _school locations_
were obtained from the Allegheny County GIS portal
[@allegheny_county_office_of_information_technology_allegheny_2018;
@allegheny_county_office_of_information_technology_allegheny_2020].
_Park locations_ were taken from the Pennsylvania Geospatial Data Clearinghouse
[@pennsylvania_department_of_conservation_and_natural_resources_pennsylvania_2015].
Park locations were downloaded for Pennsylvania and filtered to Allegheny County.
We used the r5r package in the R programming language [@pereira_r5r_2021] to
calculate accessibility to each destination type described above,
for each of four transportation modes (walking, cycling, driving, and transit).
The r5r package calculates accessibility as the weighted total number of
destinations reachable by a given mode, where destinations are weighted
according to a decay function, such that destinations that can be reached within
less time are assigned greater weight. We used a logistic decay function, as
illustrated in Figure \@ref(fig:show-decay-func). For motorized modes, the decay
function had a mean (inflection) of 40 minutes and a standard deviation of 10
minutes. For non-motorized modes, the decay function had a mean of 20 minutes
and a standard deviation of 5 minutes.
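The sketch below (not evaluated) shows the shape of one such call with the r5r
package, using walking access to grocery stores as an example. Argument names
follow the r5r documentation for the version available at the time of writing and
should be checked against the installed version; the network path and the origin
and destination data frames are placeholders.
```{r r5r-access-sketch, eval=FALSE}
# Illustrative only: network path, site_points, and grocery_points are
# placeholders; argument names may differ across r5r versions.
library(r5r)

r5r_core <- setup_r5(data_path = here("02_data", "network"))  # OSM + GTFS inputs

walk_grocery_access <- accessibility(
  r5r_core,
  origins = site_points,          # one row per site: id, lon, lat
  destinations = grocery_points,  # grocery stores with an opportunity count
  opportunities_colname = "n_stores",
  mode = "WALK",
  decay_function = "logistic",
  cutoffs = 20,                   # inflection point (minutes), non-motorized
  decay_value = 5,                # standard deviation (minutes)
  max_trip_duration = 120,
  verbose = FALSE)
```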
```{r show-decay-func, echo=FALSE, fig.cap='Decay functions for accessibility calculations', out.width='80%', fig.asp=.75, fig.align='center', fig.alt='Plot illustrating two logistic decay functions.'}
# Sample the decay curves at 0.1-minute steps; 8.380076 is a normalizing
# constant that rescales the sampled weights to run from roughly 1 at zero
# travel time down to roughly 0 at the end of the evaluated range.
travel_time <- seq(0, 100, by = 0.1)
weight_motor <- (8.380076 - cumsum(dlogis(travel_time,
                                          location = 40,
                                          scale = 20,
                                          log = FALSE))) / 8.380076
weight_active <- (8.380076 - cumsum(dlogis(travel_time[1:510],
                                           location = 20,
                                           scale = 10,
                                           log = FALSE))) / 8.380076
logistic_sample <- tibble(
`Travel time from site` = c(travel_time, travel_time[1:510]),
`Weight assigned to destination` = c(weight_motor, weight_active),
Mode = c(rep("Motorized", 1001), rep("Active", 510)))
ggplot(logistic_sample) +
geom_line(aes(x = `Travel time from site`,
y = `Weight assigned to destination`,
lty = Mode)) +
theme_minimal()
```
```{r, echo=FALSE}
accessibility <- here("02_data",
"access.parquet") %>%
read_parquet()
accessibility_long <- accessibility %>%
pivot_longer(!PARID, names_to = "type", values_to = "value") %>%
mutate(mode = rep(c(rep("bike", 10),
rep("walk", 10),
rep("car", 10),
rep("transit", 10)),
length(accessibility$PARID))) %>%
mutate(destination = rep(c("Total\njobs",
"High-\npaying\njobs",
"Low-\npaying\njobs",
"Retail\njobs",
"Enter-\ntainment\njobs",
"Hospit-\nality\njobs",
"Amenity\nparcels",
"Grocery\nstores",
"Schools",
"Parks"),
length(accessibility$PARID) * 4)) %>%
filter(value > 0)
```
Calculating accessibility metrics for a combination of four transportation
modes and ten destination types yields 40 different accessibility variables. Figure \@ref(fig:access-dist)
illustrates the distributions of each of these variables.
```{r access-dist, echo=FALSE, fig.cap='Distributions of accessibility variables', out.width='100%', fig.asp=.75, fig.align='center', fig.alt='Histograms of 40 accessibility variables.'}
# Share of sites with zero accessibility for each mode/destination combination,
# used to annotate panels (zeros are excluded from the log-scale histograms).
dest_labels <- c(dest_parcels = "Amenity\nparcels",
                 entrnmt_jobs = "Enter-\ntainment\njobs",
                 grocery = "Grocery\nstores",
                 hi_pay_jobs = "High-\npaying\njobs",
                 hosplty_jobs = "Hospit-\nality\njobs",
                 all_jobs = "Total\njobs",
                 lo_pay_jobs = "Low-\npaying\njobs",
                 park = "Parks",
                 school = "Schools",
                 retail_jobs = "Retail\njobs")
zero_labels <- accessibility %>%
  select(!PARID) %>%
  summarise(across(everything(), ~ 100 * sum(.x == 0) / n())) %>%
  pivot_longer(everything(),
               names_to = "type",
               values_to = "pct_zero_num") %>%
  mutate(pct_zero = map_chr(pct_zero_num, ~ prettyNum(.x, digits = 0)),
         mode = str_extract(type, "(?<=access_)[a-z]+"),
         destination = unname(dest_labels[str_remove(type, "^access_[a-z]+_")]),
         x = 0,
         y = 175000,
         label = paste0(pct_zero, "% zero")) %>%
  filter(pct_zero_num > 0)
ggplot(accessibility_long) +
geom_histogram(aes(x = log(value)),
color = "gray",
fill = "gray",
bins = 20) +
facet_grid(rows = vars(mode),
cols = vars(destination),
scales = "free_x") +
scale_x_continuous(name = "Accessibility (log-transformed) (values of zero are excluded)") +
scale_y_continuous(name = "Number of sites",
labels = scales::label_comma()) +
theme_minimal() +
theme(axis.text.x = element_blank()) +
geom_text(aes(x = x, y = y,
label = label),
data = zero_labels,
size = 2.5,
hjust = 0)
```
### Disamenity proximity
```{r bad-uses, include=FALSE}
parcels <- here("02_data",
"all-parcels.csv") %>%
read_csv(show_col_types = FALSE)
big_coal <- parcels %>%
filter(PARID == "0187E00175000000")
bad_parcels <- parcels %>%
filter(Category == "disamenity") %>%
filter((x != big_coal$x &
y != big_coal$y) |
PARID == big_coal$PARID)
bad_parcel_list <- bad_parcels %>%
group_by(USEDESC) %>%
summarise(`Number of identified locations` = n()) %>%
arrange(desc(`Number of identified locations`)) %>%
mutate(`Percent of identified locations` = 100 *
(`Number of identified locations` /
sum(`Number of identified locations`))) %>%
mutate(`Cumulative percent of identified locations` =
cumsum(`Percent of identified locations`))
```
We categorized several land uses in the county assessor data as
disamenities. The land use codes we used to identify disamenities are
listed in Table \@ref(tab:bad-use-list)^[289 properties related to coal mining
(with land use descriptions of either "COAL RIGHTS, WORKING INTERESTS" or
"COAL LAND, SURFACE RIGHTS") are co-located and are treated as a single
site.].
```{r bad-use-list, echo=FALSE, message=FALSE}
kable(
bad_parcel_list,
caption = 'Land uses identified as disamenities',
booktabs = TRUE,
digits = 2,
format.args = (list(big.mark = ','))
) %>%
kable_styling(full_width = TRUE)
```
We included a disamenity proximity index in our analysis that we calculated
as the logarithm of the average distance from each site to the ten closest
disamenity sites. The distribution of this index is shown in Figure \@ref(fig:bad-prox).
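A minimal sketch of this calculation (not evaluated) is shown below. It assumes
`sites_sf` and `disamenity_sf` are sf point layers in a common projected
coordinate system; `nngeo::st_nn()` is used to find the ten nearest disamenity
sites and their distances.
```{r disamenity-prox-sketch, eval=FALSE}
# Illustrative only: sites_sf and disamenity_sf are placeholder sf layers in a
# shared projected CRS.
library(nngeo)

nn <- st_nn(sites_sf, disamenity_sf, k = 10, returnDist = TRUE)

bad_proximity <- sites_sf %>%
  mutate(log_avg_disamentiy_dist =      # column name as used in the saved file
           log(map_dbl(nn$dist, mean)))
```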
```{r bad-prox, echo=FALSE, fig.cap='Distribution of average distance to nearest ten disamenity sites', out.width='80%', fig.asp=.75, fig.align='center'}
bad_proximity <- here("02_data",
"bad_proximity.csv") %>%
read_csv(show_col_types = FALSE)
ggplot(bad_proximity) +
geom_histogram(aes(x = log_avg_disamentiy_dist),
fill = "gray",
color = "gray",
bins = 50) +
scale_x_continuous(name = "Disamenity proximity index") +
scale_y_continuous(name = "Number of sites",
labels = scales::label_comma()) +
theme_minimal() +
theme(axis.text.x = element_blank())
```
### Density
To represent the residential density around each site, we used the sf [@sf], nngeo
[@nngeo] and tidycensus [@tidycensus] R packages to determine the smallest circular
buffer around each site containing a population of at least two thousand people,
based on the 2020 census. In denser places, a buffer with a smaller radius would
encompass two thousand residents. In more sparsely-populated places, a buffer
containing two thousand residents would be larger. The distribution of radii for
two-thousand-person site buffers is shown in Figure \@ref(fig:radii).
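One possible implementation is sketched below (not evaluated). It assigns each
2020 census block's population to its centroid and finds the distance at which
the cumulative population around a site first reaches two thousand; the object
names, the projection, and the centroid-assignment simplification are assumptions
for illustration.
```{r buffer-radius-sketch, eval=FALSE}
# Illustrative only: assigns block populations to centroids; the projection
# (EPSG 2272, US feet) and object names are placeholders.
library(tidycensus)

blocks <- get_decennial(geography = "block",
                        variables = "P1_001N",  # 2020 total population
                        state = "PA",
                        county = "Allegheny",
                        year = 2020,
                        geometry = TRUE) %>%
  st_transform(2272)

block_pts <- st_centroid(st_geometry(blocks))
block_pop <- blocks$value

# Smallest radius (feet) at which the cumulative population of blocks,
# ordered by distance from the site, reaches the target
smallest_buffer_radius <- function(site_pt, block_pts, block_pop, target = 2000) {
  dists <- as.numeric(st_distance(block_pts, site_pt))
  ord <- order(dists)
  dists[ord][which(cumsum(block_pop[ord]) >= target)[1]]
}
```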
```{r radii, echo=FALSE, fig.cap='Histogram of radii of buffer containing 2000 residents', out.width='80%', fig.asp=.75, fig.align='center', fig.alt='Buffer radius histogram'}
buffers <- here("02_data",
"diversity.parquet") %>%
read_parquet()
ggplot(buffers) +
geom_histogram(aes(x = radius / 5280),
fill = "gray",
color = "gray",
bins = 50) +
scale_x_continuous(name = "Radius of buffer including two thousand residents (miles)") +
scale_y_continuous(name = "Number of sites",
labels = scales::label_comma()) +
theme_minimal()
```
### Population diversity
The two-thousand-resident buffers described above were also used as a basis to
estimate the racial diversity of residents in the immediate vicinity. For each
buffer, we calculated the percentage of residents who identified in the 2020
census as non-Hispanic white, non-Hispanic Black, and Hispanic. The distributions
of these variables are shown in Figure \@ref(fig:divers-hist).
```{r divers-hist, echo=FALSE, fig.cap='Distributions of population diversity variables', out.width='80%', fig.asp=.75, fig.align='center', fig.alt='Diversity histograms'}
diversity <- buffers %>%
select(pct_white, pct_black, pct_hispanic) %>%
rename(`Non-Hispanic white` = pct_white,
`Non-Hispanic Black` = pct_black,
Hispanic = pct_hispanic) %>%
pivot_longer(everything(),
names_to = "Race/ethnicity",
values_to = "Percent of population")
ggplot(diversity) +
geom_histogram(aes(x = `Percent of population`,
color = `Race/ethnicity`),
fill = NA,
alpha = 0.5,
bins = 50) +
scale_y_continuous(name = "Number of sites",
labels = scales::label_comma()) +
scale_x_continuous(breaks = breaks <- seq(0, 1, by=0.1),
labels = paste0((breaks * 100), "%")) +
theme_minimal()
```
### Land use diversity
We also calculated the total number of different land uses within each
two-thousand-resident buffer and used this as a measure of land-use diversity.
The distribution of this measure is shown in Figure \@ref(fig:land-divers).
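A minimal sketch of this count (not evaluated), assuming `site_buffers` is an sf
layer of the two-thousand-resident buffers with a `site_id` column and
`parcels_sf` is an sf point layer of all parcels with their `USEDESC` values:
```{r land-use-count-sketch, eval=FALSE}
# Illustrative only: site_buffers and parcels_sf are placeholder sf layers in
# a shared projected CRS.
n_uses_by_site <- site_buffers %>%
  st_join(parcels_sf, join = st_contains) %>%
  st_drop_geometry() %>%
  group_by(site_id) %>%
  summarise(n_uses = n_distinct(USEDESC))
```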
```{r land-divers, echo=FALSE, fig.cap='Histogram of land use diversity', out.width='80%', fig.asp=.75, fig.align='center', fig.alt='Land-use diversity histogram'}
ggplot(buffers) +
geom_histogram(aes(x = n_uses),
bins = 20,
color = "lightgray",
fill = "gray") +
scale_y_continuous(name = "Number of sites",
labels = scales::label_comma()) +
scale_x_continuous(name = "Number of unique land use types within\na 2,000-person buffer") +
theme_minimal()
```
## Index development
The methods described above yielded a set of fifty parcel-level variables,
forty of which are accessibility metrics, for each of 506,405 parcels. We used
the EFAtools R package [@EFAtools] to develop a set of parcel-level indices from
these variables using factor analysis. The Kaiser-Meyer-Olkin criterion for the
dataset is 0.9, suggesting a "marvellous" case for factor analysis [@kaiser1974index].
We determined the appropriate number of factors based on the Kaiser-Guttman
criterion and the Hull method. The Hull method suggests an optimal number of
factors that balances model fit against the number of parameters, with the goal
of keeping only major factors. Potential solutions with various numbers of factors
are plotted on a graph of goodness-of-fit versus degrees of freedom, where the
optimal solution will be on the boundary of a convex hull [@lorenzo2011hull]. The
Kaiser-Guttman criterion is a recommendation to retain as many factors as there
are sample eigenvalues greater than one [@guttman1954some].
We computed factor loadings
using an oblimin rotation. We applied these loadings to calculate a set of index
scores (one for each factor) for each potential development site.
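The sketch below (not evaluated) outlines this workflow with EFAtools. The data
frame `site_vars` stands in for the fifty parcel-level variables, the number of
factors shown is illustrative, and the function and argument names follow the
EFAtools documentation and should be checked against the installed version.
```{r efa-sketch, eval=FALSE}
# Illustrative only: site_vars and the number of factors are placeholders;
# verify function arguments against the installed EFAtools version.
library(EFAtools)

KMO(site_vars)                          # sampling adequacy (0.9 in our data)
N_FACTORS(site_vars,
          criteria = c("KGC", "HULL"))  # Kaiser-Guttman criterion and Hull method

efa_fit <- EFA(site_vars,
               n_factors = 5,           # use the number of factors retained
               rotation = "oblimin")

index_scores <- FACTOR_SCORES(site_vars, f = efa_fit)$scores
```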
## Index validation
The indices we developed through factor analysis might represent dimensions of
urban quality. If they are valid quality metrics, one might expect them to
be predictive of an activity associated with desirable locations for development.
We hypothesize that more desirable locations for development might be those where
plans have been made for new development activity and that building permits for
new construction or demolition are indicative of such plans.
We estimated a logistic regression model using the indices developed through
factor analysis as independent variables and predicting the likelihood that
a site in the city of Pittsburgh was issued a building permit for either
construction or demolition over a one-year period (June 2021 - May 2022). Building
permit data were obtained from @western_pennsylvania_regional_data_center_permits_2022.
Out of 269,151 potential residential development sites in Pittsburgh, 139 (one
twentieth of one percent) had building permits issued for new construction or
demolition of a residential structure during the study period.
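A minimal sketch of this model (not evaluated), assuming `pgh_sites` holds one
row per Pittsburgh site with its index scores and a 0/1 indicator for a
construction or demolition permit issued during the study period (the index
variable names are placeholders):
```{r permit-model-sketch, eval=FALSE}
# Illustrative only: pgh_sites and the index_* column names are placeholders.
permit_model <- glm(permit_issued ~ index_1 + index_2 + index_3,
                    data = pgh_sites,
                    family = binomial(link = "logit"))
summary(permit_model)
```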
## Combined index
If the indices we developed represent distinct dimensions of urban quality,
the relative importance of each dimension (and its associated index) might vary
depending on the values of the individual or institution seeking to assess urban
quality. However, the results of the regression analysis might offer insight into
the typical or average values of active housing developers and property owners in
Pittsburgh.
We used the coefficients estimated in the regression model described above
(which predicts the likelihood of a recent building permit) to generate weights
for each index, scaled such that the largest coefficient represents a weight
of one hundred percent. We used these weights to calculate a combined index
value for each site.
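A minimal sketch of the weighting step (not evaluated), continuing the
placeholder names from the sketches above:
```{r combined-index-sketch, eval=FALSE}
# Illustrative only: rescale the logistic regression coefficients so the
# largest equals one, then take a weighted sum of the index scores.
index_cols <- c("index_1", "index_2", "index_3")

weights <- coef(permit_model)[index_cols] / max(coef(permit_model)[index_cols])

sites$combined_index <-
  as.numeric(as.matrix(sites[index_cols]) %*% weights)
```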