# Quantitative Methods
```{r method-libs, echo=FALSE, message=FALSE}
library(here)
library(knitr)
library(tidyverse)
library(lubridate)
library(sf)
library(arrow)
library(kableExtra)
`%!in%` <- Negate(`%in%`)  # convenience "not in" operator
```
How might the results of a quantitative approach to evaluating urban quality at the
parcel level align with the values identified in the workshops described in
Chapter 3? In this chapter, we propose one such approach, using
readily-available parcel-level data for Allegheny County, Pennsylvania.
## Data
We obtained data on property addresses, land uses, assessed values (for both
land and buildings), and sale prices
from @allegheny_county_office_of_property_assessments_allegheny_2022, which
includes information on 582,116 properties in Allegheny County.
We also obtained latitude and longitude coordinates for each property from a
geocoder file provided by @western_pennsylvania_regional_data_center_geocoders_2021.
Over 99.5 percent of properties included in the assessment dataset are included
in the geocoder file. Properties without geocoded locations are excluded from
our analysis.
Potential development sites were identified as those
1. classified as "residential" (residential properties with one to
four housing units) or "commercial" (which includes mixed-use developments
and residential properties with more than four housing units), and
2. with a land use description in one of 59 possible categories^[One site (3008
Phillip Dr in Clairton) is missing a land use description in the assessment data.
We checked this address on Zillow to determine that this is a single-family home
and classified it as such in our data.]. The most common of these are listed in Table \@ref(tab:list-site-uses).^[The land use descriptions that were
classified as potential development sites but are not listed in Table
\@ref(tab:list-site-uses), which combine to represent less than one percent of all sites,
are "RIGHTOF WAY - RESIDENTIAL", "CONDOMINIUM UNIT", "DWG USED AS OFFICE",
"APART:20-39 UNITS", "CONDO GARAGE UNITS", "COMMON AREA", "CONDO DEVELOPMENTAL
LAND", "CONDEMNED/BOARDED-UP", "CONDOMINIUM OFFICE BUILDING", "INDEPENDENT LIVING
(SENIORS)", "DWG USED AS RETAIL", "OTHER COMMERCIAL", "MOBILE HOMES/TRAILER PKS",
"RIGHT OF WAY - COMMERCIAL", "GROUP HOME", "TOTAL/MAJOR FIRE DAMAGE - COMM",
"OTHER COMMERCIAL HOUSING", "TOTAL/MAJOR FIRE DAMAGE", "COMM APRTM CONDOS 5-19
UNITS", "MUNICIPAL URBAN RENEWAL", "COMMERCIAL LAND", "CAMPGROUNDS", "COMMON AREA
OR GREENBELT", "CHARITABLE EXEMPTION/HOS/HOMES", "INCOME PRODUCING PARKING LOT",
"DWG APT CONVERSION", ">10 ACRES VACANT", "MINOR FIRE DAMAGE", "COMM APRTM CONDOS
20-39 UNITS", "COMMERCIAL/UTILITY",
"H.O.A RECREATIONS AREA", "COMM APRTM CONDOS 40+ UNITS", "MINOR FIRE DAMAGE - COMM",
"OTHER", "OTHER RESIDENTIAL STRUCTURE", "OWNED BY METRO HOUSING AU", "RESIDENTIAL VACANT
LAND", "HUD PROJ #221", and "VACANT LAND 0-9 ACRES"].
```{r list-site-uses, echo=FALSE, message=FALSE, tidy=FALSE}
sites <- here("02_data",
"sites.csv") %>%
read_csv(show_col_types = FALSE)
site_use_summary <- sites %>%
group_by(USEDESC) %>%
summarise(`Number of potential sites` = n()) %>%
arrange(desc(`Number of potential sites`)) %>%
mutate(`Percent of potential sites` = 100 *
(`Number of potential sites` / sum(`Number of potential sites`))) %>%
mutate(`Cumulative percent of potential sites` =
cumsum(`Percent of potential sites`))
kable(
head(site_use_summary, 20),
caption = 'Most common land uses categorized as potential sites',
booktabs = TRUE,
digits = 1,
format.args = (list(big.mark = ','))) %>%
kable_styling(full_width = TRUE)
```
Potential development sites were further filtered to exclude those with missing data
on the most recent sale (about one percent of all sites).^[Four sites had sales
prices listed that were unreasonably high. 3039 Liberty Avenue in Pittsburgh is
listed as having sold for \$511,945,000 on August 30, 2021. Zillow lists this
property as having sold on that date for \$511,945
(https://www.zillow.com/homedetails/3039-W-Liberty-Ave-Pittsburgh-PA-15216/2070262638_zpid/, accessed 5/4/2022),
so the value was corrected for what appears to have been a typo. 220 Hyeholde Dr
in Coraopolis is listed as having sold for \$28,100,000 in 1967. This may also
be a typo, and it also does not seem to be the most recent sale. Zillow lists
this home as having sold for \$350,000 in 2004
(https://www.zillow.com/homes/220-hyeholde-dr,-Coraopolis,-PA_rb/11552817_zpid/,
accessed 5/4/2022), so the data was corrected to add that as the most recent sale.
Two other sites were identified as having unreasonably high sales values: 1339
Arlington Avenue in Pittsburgh is a three-bedroom single-family home that is
listed as having sold for \$57,010,813 in 1976 and a 0.06-acre vacant lot with
tax ID 0165G00270000000 is listed as having sold for \$24,920,232 in 1936. The
sales data for these sites were treated as missing.]
The focus of this analysis is on potential development sites rather than on
properties. Some properties in the assessor dataset are condominiums, where
multiple properties share a single parcel of land. We aggregated these to the
site level by identifying all properties with an assessed building value
greater than zero, a land value of zero, and a land use description that did
not indicate the land was vacant. If multiple such properties share an address,
we classified all properties at that address as a condominium and aggregated
them to the parcel level. This led to a final sample of 506,405 sites.
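A minimal sketch of the condominium flag is shown below (not evaluated). The data
frame `properties` and the column names other than `PARID` and `USEDESC` are
placeholders for illustration, and `vacant_uses` stands in for the use descriptions
that indicate vacant land.
```{r condo-flag-sketch, eval=FALSE}
# Illustrative only: properties, building_value, land_value, address, and
# vacant_uses are placeholder names.
vacant_uses <- c("RESIDENTIAL VACANT LAND", "VACANT LAND 0-9 ACRES")

properties <- properties %>%
  mutate(maybe_condo = building_value > 0 &
           land_value == 0 &
           USEDESC %!in% vacant_uses) %>%
  group_by(address) %>%
  # every property at an address with more than one flagged property is
  # treated as part of a condominium
  mutate(is_condo = sum(maybe_condo) > 1) %>%
  ungroup()
```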
### Tax assessment data
Three variables (total assessed fair market value, assessed fair market
value of the building, and lot area) were taken directly from the county
tax assessment data for use in our analysis. We also included the most
recent listed sales price, adjusted for inflation.
To aggregate properties identified as condominiums to the site level, we summed
the total values for lot area, assessed land value, assessed building value, and
inflation-adjusted sale price. We log-transformed these four variables prior to
including them in our analysis. Their distributions are shown in
Figure \@ref(fig:assessor-hist).
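The sketch below (not evaluated) shows one way the site-level aggregation,
inflation adjustment, and log transformation might be implemented. The CPI lookup
file and the `sale_price`, `sale_year`, and `address` columns are placeholders,
and the adjustment uses a generic annual price index rather than any particular
deflator.
```{r site-vars-sketch, eval=FALSE}
# Illustrative only: cpi_annual.csv (columns year, index) and the sale_price,
# sale_year, and address columns are placeholders.
cpi <- here("02_data", "cpi_annual.csv") %>%
  read_csv(show_col_types = FALSE)
base_index <- cpi$index[cpi$year == max(cpi$year)]

sites <- properties %>%
  group_by(site_id = if_else(is_condo, address, PARID)) %>%
  summarise(lot_area = sum(lot_area),
            land_value = sum(land_value),
            total_value = sum(land_value + building_value),
            sale_price = sum(sale_price),
            sale_year = max(sale_year)) %>%
  left_join(cpi, by = c("sale_year" = "year")) %>%
  mutate(price_adj = sale_price * base_index / index,
         total_value_log = log(total_value),
         land_value_log = log(land_value),
         lot_area_log = log(lot_area),
         price_log = log(price_adj))
```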
```{r assessor-hist, echo=FALSE, warning=FALSE, message=FALSE, tidy=FALSE, fig.asp=1.0, out.width='100%', fig.cap='Distribution of variables from tax assessor database'}
assessor_vals_long <- sites %>%
select(total_value_log,
land_value_log,
lot_area_log,
price_log) %>%
rename(`Total assessed value` = total_value_log,
`Land value` = land_value_log,
`Lot area (sf)` = lot_area_log,
`Inflation-adjusted price` = price_log) %>%
pivot_longer(everything(),
names_to = "variable",
values_to = " ")
ggplot(assessor_vals_long) +
geom_histogram(aes(x = ` `),
bins = 50) +
facet_wrap(facets = vars(variable),
nrow = 2) +
scale_y_continuous(name = "Number of parcels") +
scale_x_continuous(breaks = breaks <- log(c(1, 100, 10000, 1000000)),
labels = c("1", "100", "10,000", "1,000,000")) +
theme_minimal()
```
### Accessibility data
Accessibility was calculated from each of the 518,032 sites in our sample to
each of several location types described below.
#### Destination parcels
We used land use codes from the county assessor parcel data to identify
_destination parcels_ that residents might value access to. The most common
land use codes of identified destination parcels are listed in Table \@ref(tab:dest-uses).
```{r dest-uses, echo=FALSE, out.width='100%'}
dest_parcels <- here("02_data",
"all-parcels.csv") %>%
read_csv(show_col_types = FALSE) %>%
filter(Category == "destination") %>%
group_by(USEDESC) %>%
summarise(`Number of identified destinations` = n()) %>%
arrange(desc(`Number of identified destinations`)) %>%
mutate(`Percent of identified destinations` = 100 *
(`Number of identified destinations` /
sum(`Number of identified destinations`))) %>%
mutate(`Cumulative percent of identified destinations` =
cumsum(`Percent of identified destinations`))
kable(
head(dest_parcels, 20),
caption = 'Land uses identified as potential destinations',
booktabs = TRUE,
digits = 2,
format.args = (list(big.mark = ','))
) %>%
kable_styling(full_width = TRUE)
```
#### Job locations
We identified _job locations_ based on data from a Longitudinal
Employer-Household Dynamics (LEHD) dataset published by the United States Census
Bureau [@united_states_census_bureau_lehd_2021]. The LEHD dataset provides the
total number of jobs in each census block in the United States, based on
employment tax records. The location of each job was defined as the centroid of
the block in which it was located. We downloaded job location data for
Pennsylvania and filtered it to include locations in the Pittsburgh metropolitan
area (Allegheny, Armstrong, Beaver, Butler, Fayette, Washington, and
Westmoreland counties).
In addition to calculating the accessibility to jobs of all categories, we also
calculated accessibility to several subsets of jobs. We disaggregated jobs by
earnings, reasoning that the usefulness of a job might vary depending on how
well it matches a worker's skills or wage expectations. _High-paying job locations_
are a subset of job locations where the worker earns more than \$3333 per month.
_Low-paying job locations_ are those where the worker earns \$1250 per month or less.
We also disaggregated jobs by employment industry, based on the North
American Industry Classification System (NAICS), reasoning that the presence of
jobs in particular industries might represent a shopping or recreation destination.
_Retail job locations_ are a subset of job locations in NAICS sector 44-45
(retail trade); _Entertainment job locations_ are those in NAICS sector 71
(arts, entertainment, and recreation); and _Hospitality job locations_ are
those in NAICS sector 72 (accommodation and food services).
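The sketch below (not evaluated) illustrates how these subsets might be pulled
from a LODES Workplace Area Characteristics (WAC) file. The file name is a
placeholder, and the column names (C000, CE01, CE03, CNS07, CNS17, and CNS18)
and county FIPS codes follow our reading of the LODES documentation and should
be verified against the file actually used.
```{r lehd-subsets-sketch, eval=FALSE}
# Illustrative only: file name is a placeholder; verify LODES column names and
# county FIPS codes against the downloaded file.
metro_counties <- c("003", "005", "007", "019", "051", "125", "129")

job_locations <- here("02_data", "pa_wac.csv") %>%
  read_csv(col_types = cols(w_geocode = col_character())) %>%
  filter(str_sub(w_geocode, 3, 5) %in% metro_counties) %>%
  transmute(w_geocode,
            all_jobs = C000,       # all jobs
            hi_pay_jobs = CE03,    # earnings above $3,333 per month
            lo_pay_jobs = CE01,    # earnings of $1,250 per month or less
            retail_jobs = CNS07,   # retail trade (NAICS 44-45)
            entrnmt_jobs = CNS17,  # arts, entertainment, and recreation (NAICS 71)
            hosplty_jobs = CNS18)  # accommodation and food services (NAICS 72)
```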
Finally, we identified three location types that correspond with common non-work
trips: schools, grocery stores, and parks. _Grocery store locations_ were
identified as vendors participating in the Supplemental Nutrition Program for
Women, Infants, and Children (WIC). WIC vendor locations and _school locations_
were obtained from the Allegheny County GIS portal
[@allegheny_county_office_of_information_technology_allegheny_2018;
@allegheny_county_office_of_information_technology_allegheny_2020].
_Park locations_ were taken from the Pennsylvania Geospatial Data Clearinghouse
[@pennsylvania_department_of_conservation_and_natural_resources_pennsylvania_2015].
Park locations were downloaded for Pennsylvania and filtered to Allegheny County.
We used the r5r package in the R programming language [@pereira_r5r_2021] to
calculate accessibility to each destination type described above,
for each of four transportation modes (walking, cycling, driving, and transit).
The r5r package calculates accessibility as the weighted total number of
destinations reachable by a given mode, where destinations are weighted
according to a decay function, such that destinations that can be reached within
less time are assigned greater weight. We used a logistic decay function, as
illustrated in Figure \@ref(fig:show-decay-func). For motorized modes, the decay
function had a mean (inflection) of 40 minutes and a standard deviation of 10
minutes. For non-motorized modes, the decay function had a mean of 20 minutes
and a standard deviation of 5 minutes.
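The sketch below (not evaluated) shows the shape of one such call with the r5r
package, using walking access to grocery stores as an example. Argument names
follow the r5r documentation for the version available at the time of writing and
should be checked against the installed version; the network path and the origin
and destination data frames are placeholders.
```{r r5r-access-sketch, eval=FALSE}
# Illustrative only: network path, site_points, and grocery_points are
# placeholders; argument names may differ across r5r versions.
library(r5r)

r5r_core <- setup_r5(data_path = here("02_data", "network"))  # OSM + GTFS inputs

walk_grocery_access <- accessibility(
  r5r_core,
  origins = site_points,          # one row per site: id, lon, lat
  destinations = grocery_points,  # grocery stores with an opportunity count
  opportunities_colname = "n_stores",
  mode = "WALK",
  decay_function = "logistic",
  cutoffs = 20,                   # inflection point (minutes), non-motorized
  decay_value = 5,                # standard deviation (minutes)
  max_trip_duration = 120,
  verbose = FALSE)
```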
```{r show-decay-func, echo=FALSE, fig.cap='Decay functions for accessibility calculations', out.width='80%', fig.asp=.75, fig.align='center', fig.alt='Plot illustrating two logistic decay functions.'}
# Sample the decay curves at 0.1-minute steps; 8.380076 is a normalizing
# constant that rescales the sampled weights to run from roughly 1 at zero
# travel time down to roughly 0 at the end of the evaluated range.
travel_time <- seq(0, 100, by = 0.1)
weight_motor <- (8.380076 - cumsum(dlogis(travel_time,
                                          location = 40,
                                          scale = 20,
                                          log = FALSE))) / 8.380076
weight_active <- (8.380076 - cumsum(dlogis(travel_time[1:510],
                                           location = 20,
                                           scale = 10,
                                           log = FALSE))) / 8.380076
logistic_sample <- tibble(
`Travel time from site` = c(travel_time, travel_time[1:510]),
`Weight assigned to destination` = c(weight_motor, weight_active),
Mode = c(rep("Motorized", 1001), rep("Active", 510)))
ggplot(logistic_sample) +
geom_line(aes(x = `Travel time from site`,
y = `Weight assigned to destination`,
lty = Mode)) +
theme_minimal()
```
```{r, echo=FALSE}
accessibility <- here("02_data",
"access.parquet") %>%
read_parquet()
accessibility_long <- accessibility %>%
pivot_longer(!PARID, names_to = "type", values_to = "value") %>%
mutate(mode = rep(c(rep("bike", 10),
rep("walk", 10),
rep("car", 10),
rep("transit", 10)),
length(accessibility$PARID))) %>%
mutate(destination = rep(c("Total\njobs",
"High-\npaying\njobs",
"Low-\npaying\njobs",
"Retail\njobs",
"Enter-\ntainment\njobs",
"Hospit-\nality\njobs",
"Amenity\nparcels",
"Grocery\nstores",
"Schools",
"Parks"),
length(accessibility$PARID) * 4)) %>%
filter(value > 0)
```
Calculating accessibility metrics for a combination of four transportation
modes and ten destination types yields 40 different accessibility variables. Figure \@ref(fig:access-dist)
illustrates the distributions of each of these variables.
```{r access-dist, echo=FALSE, fig.cap='Distributions of accessibility variables', out.width='100%', fig.asp=.75, fig.align='center', fig.alt='Histograms of 40 accessibility variables.'}
# Share of sites with zero accessibility for each mode/destination combination,
# used to annotate panels (zeros are excluded from the log-scale histograms).
dest_labels <- c(dest_parcels = "Amenity\nparcels",
                 entrnmt_jobs = "Enter-\ntainment\njobs",
                 grocery = "Grocery\nstores",
                 hi_pay_jobs = "High-\npaying\njobs",
                 hosplty_jobs = "Hospit-\nality\njobs",
                 all_jobs = "Total\njobs",
                 lo_pay_jobs = "Low-\npaying\njobs",
                 park = "Parks",
                 school = "Schools",
                 retail_jobs = "Retail\njobs")
zero_labels <- accessibility %>%
  select(!PARID) %>%
  summarise(across(everything(), ~ 100 * sum(.x == 0) / n())) %>%
  pivot_longer(everything(),
               names_to = "type",
               values_to = "pct_zero_num") %>%
  mutate(pct_zero = map_chr(pct_zero_num, ~ prettyNum(.x, digits = 0)),
         mode = str_extract(type, "(?<=access_)[a-z]+"),
         destination = unname(dest_labels[str_remove(type, "^access_[a-z]+_")]),
         x = 0,
         y = 175000,
         label = paste0(pct_zero, "% zero")) %>%
  filter(pct_zero_num > 0)
ggplot(accessibility_long) +
geom_histogram(aes(x = log(value)),
color = "gray",
fill = "gray",
bins = 20) +
facet_grid(rows = vars(mode),
cols = vars(destination),
scales = "free_x") +
scale_x_continuous(name = "Accessibility (log-transformed) (values of zero are excluded)") +
scale_y_continuous(name = "Number of sites",
labels = scales::label_comma()) +
theme_minimal() +
theme(axis.text.x = element_blank()) +
geom_text(aes(x = x, y = y,
label = label),
data = zero_labels,
size = 2.5,
hjust = 0)
```
### Disamenity proximity
```{r bad-uses, include=FALSE}
parcels <- here("02_data",
"all-parcels.csv") %>%
read_csv(show_col_types = FALSE)
big_coal <- parcels %>%
filter(PARID == "0187E00175000000")
bad_parcels <- parcels %>%
filter(Category == "disamenity") %>%
filter((x != big_coal$x &
y != big_coal$y) |
PARID == big_coal$PARID)
bad_parcel_list <- bad_parcels %>%
group_by(USEDESC) %>%
summarise(`Number of identified locations` = n()) %>%
arrange(desc(`Number of identified locations`)) %>%
mutate(`Percent of identified locations` = 100 *
(`Number of identified locations` /
sum(`Number of identified locations`))) %>%
mutate(`Cumulative percent of identified locations` =
cumsum(`Percent of identified locations`))
```
We categorized several land uses in the county assessor data as
disamenities. The land use codes we used to identify disamenities are
listed in Table \@ref(tab:bad-use-list)^[289 properties related to coal mining
(with land use descriptions of either "COAL RIGHTS, WORKING INTERESTS" or
"COAL LAND, SURFACE RIGHTS") are co-located and are treated as a single
site.].
```{r bad-use-list, echo=FALSE, message=FALSE}
kable(
bad_parcel_list,
caption = 'Land uses identified as disamenities',
booktabs = TRUE,
digits = 2,
format.args = (list(big.mark = ','))
) %>%
kable_styling(full_width = TRUE)
```
We included a disamenity proximity index in our analysis that we calculated
as the logarithm of the average distance from each site to the ten closest
disamenity sites. The distribution of this index is shown in Figure \@ref(fig:bad-prox).
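A minimal sketch of this calculation (not evaluated) is shown below. It assumes
`sites_sf` and `disamenity_sf` are sf point layers in a common projected
coordinate system; `nngeo::st_nn()` is used to find the ten nearest disamenity
sites and their distances.
```{r disamenity-prox-sketch, eval=FALSE}
# Illustrative only: sites_sf and disamenity_sf are placeholder sf layers in a
# shared projected CRS.
library(nngeo)

nn <- st_nn(sites_sf, disamenity_sf, k = 10, returnDist = TRUE)

bad_proximity <- sites_sf %>%
  mutate(log_avg_disamentiy_dist =      # column name as used in the saved file
           log(map_dbl(nn$dist, mean)))
```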
```{r bad-prox, echo=FALSE, fig.cap='Distribution of average distance to nearest ten disamenity sites', out.width='80%', fig.asp=.75, fig.align='center'}
bad_proximity <- here("02_data",
"bad_proximity.csv") %>%
read_csv(show_col_types = FALSE)
ggplot(bad_proximity) +
geom_histogram(aes(x = log_avg_disamentiy_dist),
fill = "gray",
color = "gray",
bins = 50) +
scale_x_continuous(name = "Disamenity proximity index") +
scale_y_continuous(name = "Number of sites",
labels = scales::label_comma()) +
theme_minimal() +
theme(axis.text.x = element_blank())
```
### Density
To represent the residential density around each site, we used the sf [@sf], nngeo
[@nngeo] and tidycensus [@tidycensus] R packages to determine the smallest circular
buffer around each site containing a population of at least two thousand people,
based on the 2020 census. In denser places, a buffer with a smaller radius would
encompass two thousand residents. In more sparsely-populated places, a buffer
containing two thousand residents would be larger. The distribution of radii for
two-thousand-person site buffers is shown in Figure \@ref(fig:radii).
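One possible implementation is sketched below (not evaluated). It assigns each
2020 census block's population to its centroid and finds the distance at which
the cumulative population around a site first reaches two thousand; the object
names, the projection, and the centroid-assignment simplification are assumptions
for illustration.
```{r buffer-radius-sketch, eval=FALSE}
# Illustrative only: assigns block populations to centroids; the projection
# (EPSG 2272, US feet) and object names are placeholders.
library(tidycensus)

blocks <- get_decennial(geography = "block",
                        variables = "P1_001N",  # 2020 total population
                        state = "PA",
                        county = "Allegheny",
                        year = 2020,
                        geometry = TRUE) %>%
  st_transform(2272)

block_pts <- st_centroid(st_geometry(blocks))
block_pop <- blocks$value

# Smallest radius (feet) at which the cumulative population of blocks,
# ordered by distance from the site, reaches the target
smallest_buffer_radius <- function(site_pt, block_pts, block_pop, target = 2000) {
  dists <- as.numeric(st_distance(block_pts, site_pt))
  ord <- order(dists)
  dists[ord][which(cumsum(block_pop[ord]) >= target)[1]]
}
```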
```{r radii, echo=FALSE, fig.cap='Histogram of radii of buffer containing 2000 residents', out.width='80%', fig.asp=.75, fig.align='center', fig.alt='Buffer radius histogram'}
buffers <- here("02_data",
"diversity.parquet") %>%
read_parquet()
ggplot(buffers) +
geom_histogram(aes(x = radius / 5280),
fill = "gray",
color = "gray",
bins = 50) +
scale_x_continuous(name = "Radius of buffer including two thousand residents (miles)") +
scale_y_continuous(name = "Number of sites",
labels = scales::label_comma()) +
theme_minimal()
```
### Population diversity
The two-thousand-resident buffers described above were also used as a basis to
estimate the racial diversity of residents in the immediate vicinity. For each
buffer, we calculated the percentage of residents who identified in the 2020
census as non-Hispanic white, non-Hispanic Black, and Hispanic. The distributions
of these variables are shown in Figure \@ref(fig:divers-hist).
```{r divers-hist, echo=FALSE, fig.cap='Distributions of population diversity variables', out.width='80%', fig.asp=.75, fig.align='center', fig.alt='Diversity histograms'}
diversity <- buffers %>%
select(pct_white, pct_black, pct_hispanic) %>%
rename(`Non-Hispanic white` = pct_white,
`Non-Hispanic Black` = pct_black,
Hispanic = pct_hispanic) %>%
pivot_longer(everything(),
names_to = "Race/ethnicity",
values_to = "Percent of population")
ggplot(diversity) +
geom_histogram(aes(x = `Percent of population`,
color = `Race/ethnicity`),
fill = NA,
alpha = 0.5,
bins = 50) +
scale_y_continuous(name = "Number of sites",
labels = scales::label_comma()) +
scale_x_continuous(breaks = breaks <- seq(0, 1, by=0.1),
labels = paste0((breaks * 100), "%")) +
theme_minimal()
```
### Land use diversity
We also calculated the total number of different land uses within each
two-thousand-resident buffer and used this as a measure of land-use diversity.
The distribution of this measure is shown in Figure \@ref(fig:land-divers).
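A minimal sketch of this count (not evaluated), assuming `site_buffers` is an sf
layer of the two-thousand-resident buffers with a `site_id` column and
`parcels_sf` is an sf point layer of all parcels with their `USEDESC` values:
```{r land-use-count-sketch, eval=FALSE}
# Illustrative only: site_buffers and parcels_sf are placeholder sf layers in
# a shared projected CRS.
n_uses_by_site <- site_buffers %>%
  st_join(parcels_sf, join = st_contains) %>%
  st_drop_geometry() %>%
  group_by(site_id) %>%
  summarise(n_uses = n_distinct(USEDESC))
```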
```{r land-divers, echo=FALSE, fig.cap='Histogram of land use diversity', out.width='80%', fig.asp=.75, fig.align='center', fig.alt='Land-use diversity histogram'}
ggplot(buffers) +
geom_histogram(aes(x = n_uses),
bins = 20,
color = "lightgray",
fill = "gray") +
scale_y_continuous(name = "Number of sites",
labels = scales::label_comma()) +
scale_x_continuous(name = "Number of unique land use types within\na 2,000-person buffer") +
theme_minimal()
```
## Index development
The methods described above yielded a set of fifty parcel-level variables,
forty of which are accessibility metrics, for each of 506,405 parcels. We used
the EFAtools R package [@EFAtools] to develop a set of parcel-level indices from
these variables using factor analysis. The Kaiser-Meyer-Olkin criterion for the
dataset is 0.9, suggesting a "marvellous" case for factor analysis [@kaiser1974index].
We determined the appropriate number of factors based on the Kaiser-Guttman
criterion and the Hull method. The Hull method suggests an optimal number of
factors that balances model fit against the number of parameters, with the goal
of keeping only major factors. Potential solutions with various numbers of factors
are plotted on a graph of goodness-of-fit versus degrees of freedom, where the
optimal solution will be on the boundary of a convex hull [@lorenzo2011hull]. The
Kaiser-Guttman criterion is a recommendation to retain as many factors as there
are sample eigenvalues greater than one [@guttman1954some].
We computed factor loadings
using an oblimin rotation. We applied these loadings to calculate a set of index
scores (one for each factor) for each potential development site.
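The sketch below (not evaluated) outlines this workflow with EFAtools. The data
frame `site_vars` stands in for the fifty parcel-level variables, the number of
factors shown is illustrative, and the function and argument names follow the
EFAtools documentation and should be checked against the installed version.
```{r efa-sketch, eval=FALSE}
# Illustrative only: site_vars and the number of factors are placeholders;
# verify function arguments against the installed EFAtools version.
library(EFAtools)

KMO(site_vars)                          # sampling adequacy (0.9 in our data)
N_FACTORS(site_vars,
          criteria = c("KGC", "HULL"))  # Kaiser-Guttman criterion and Hull method

efa_fit <- EFA(site_vars,
               n_factors = 5,           # use the number of factors retained
               rotation = "oblimin")

index_scores <- FACTOR_SCORES(site_vars, f = efa_fit)$scores
```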
## Index validation
The indices we developed through factor analysis might represent dimensions of
urban quality. If they are valid quality metrics, one might expect them to
be predictive of an activity associated with desirable locations for development.
We hypothesize that more desirable locations for development might be those where
plans have been made for new development activity and that building permits for
new construction or demolition are indicative of such plans.
We estimated a logistic regression model using the indices developed through
factor analysis as independent variables and predicting the likelihood that
a site in the city of Pittsburgh was issued a building permit for either
construction or demolition over a one-year period (June 2021 - May 2022). Building
permit data were obtained from @western_pennsylvania_regional_data_center_permits_2022.
Out of 269,151 potential residential development sites in Pittsburgh, 139 (one
twentieth of one percent) had building permits issued for new construction or
demolition of a residential structure during the study period.
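A minimal sketch of this model (not evaluated), assuming `pgh_sites` holds one
row per Pittsburgh site with its index scores and a 0/1 indicator for a
construction or demolition permit issued during the study period (the index
variable names are placeholders):
```{r permit-model-sketch, eval=FALSE}
# Illustrative only: pgh_sites and the index_* column names are placeholders.
permit_model <- glm(permit_issued ~ index_1 + index_2 + index_3,
                    data = pgh_sites,
                    family = binomial(link = "logit"))
summary(permit_model)
```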
## Combined index
If the indices we developed represent distinct dimensions of urban quality,
the relative importance of each dimension (and its associated index) might vary
depending on the values of the individual or institution seeking to assess urban
quality. However, the results of the regression analysis might offer insight into
the typical or average values of active housing developers and property owners in
Pittsburgh.
We used the coefficients estimated in the regression model described above
(which predicts the likelihood of a recent building permit) to generate weights
for each index, scaled such that the largest coefficient represents a weight
of one hundred percent. We used these weights to calculate a combined index
value for each site.
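A minimal sketch of the weighting step (not evaluated), continuing the
placeholder names from the sketches above:
```{r combined-index-sketch, eval=FALSE}
# Illustrative only: rescale the logistic regression coefficients so the
# largest equals one, then take a weighted sum of the index scores.
index_cols <- c("index_1", "index_2", "index_3")

weights <- coef(permit_model)[index_cols] / max(coef(permit_model)[index_cols])

sites$combined_index <-
  as.numeric(as.matrix(sites[index_cols]) %*% weights)
```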