---
title: "Grad Assignment"
output: html_document
date: "2023-11-19"
author: Adriana Navarro
dataset: Maryland Department of the Environment - Water and Science Administration (WSA) Compliance
link: https://opendata.maryland.gov/Energy-and-Environment/Maryland-Department-of-the-Environment-Water-and-S/hxmu-urvx
project details: https://docs.google.com/document/d/1rvEymxmdoU72bAFqYEfobgc9is7ko2aZNxu5QZj-WKU/edit#heading=h.m8hwzhwk8ccv
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
1. Who created and maintains the data? What state program or law is it connected to?
**Answer** The Maryland Department of the Environment (MDE); the listed owner is "Sharmila." The dataset is connected to MDE's Annual Enforcement and Compliance Report, required by the Environment Article §1-301(d), Annotated Code of Maryland. I tried to find a full name attached to the dataset, but MDE keeps passing me from person to person who they say would know more.
2. What does each row represent?
**Answer** Each row represents one Water and Science Administration (WSA) inspection.
3. What is the chronological and geographic scope of this data?
**Answer** The data set was created in 2021 and was last updated Nov. 25, 2023. As for the geographic scope, I grouped and summarised the county column and found it covers every Maryland county plus Baltimore City, along with three other categories: "Not Yet Determined", "Outside of Maryland", and "Statewide".
4. If this data contains aggregates (totals), can you find itemized data that those totals are derived from? Do the totals match your own calculations using the itemized data?
**Answer** The WSA data on its own does not contain aggregates.
**Questions for data owner**
1. Why is the inspection_reason column blank? Can I have the data?
* Note: I called, I emailed, I got in contact with the department (the dataset only has a first name attached to it, and I couldn't link it to an employee), and I've kind of just been tossed from place to place looking for someone who can help me find this. I might need to MPIA it, worst comes to worst. I was really hoping to get a good story idea out of it, though, since the people at the department told me this column is supposed to say whether the inspection was routine or prompted by a complaint.
2. Why are there reports from other states?
3. What is the 00000 zip code -- is this just blank? And what are "206 7", "20619619" and "33333-3333"? (Cleaned -- could not match 33333 with a zip code in Baltimore)
# Story Ideas
1. The MTA-Purple Line has the highest number of reports -- and the highest number of noncompliance results. By percentage, that's 48% of its inspections -- I would guess not an acceptable amount even for a large project. A large majority of these are also storm water related, which can affect how likely an area is to flood. Now, because the project stretches from Bethesda to New Carrollton and these reports are only attributed to Bethesda in a broad sense, it's possible they aren't clumped in one area. If I had more time, I would likely request the reports related to the Purple Line (the link that should have shown them was broken). I have a chart farther down that shows the distribution of inspection types related to the Purple Line.
2. Many of these noncompliance reports (including ones that required corrective actions) are in counties that have seen the majority of flood events from 2019 to 2023. In fact, the top three counties with the highest number of flood events (NWS: Montgomery, Prince George's and Frederick) all had the same top issue: NPDES (National Pollutant Discharge Elimination System) Construction Activity. That basically translates to problems keeping pollutants from construction sites from discharging into the water system. Most of the Purple Line construction issues are logged under Montgomery County even though the site crosses county lines, which means PG's actual count could be higher but logged in Montgomery depending on where the inspection took place. I would love to talk to the Department of the Environment as well as the NWS about how flooding plays a role in this, since a noncompliant site likely wouldn't be protected from rainwater sweeping up pollutants. Some more questions: Why are Montgomery, PG and Frederick prone to flooding over other areas around the Chesapeake Bay?
3. I started looking into unauthorized sites that were inspected and found to be noncompliant. Within them I found an "Alleged unauthorized clearing of forested wetlands" in Dundalk in Baltimore, MD. There are 13 inspections, all noncompliant. Some of the types were sediment erosion and others were along the lines of storm water management during construction. I started looking into all the unauthorized events. I think the story here is likely less about one specific violation and more about why these things occur, using this data to find them. Since the addresses are listed, they're places I could visit and ask around about. One was a discharge into a storm drain here in PG near the deconstruction of NAPA; there are also a few unauthorized bridges. Why were they built, what community needs are not being met, and why?
Note: I joined this data set with Census data -- both race/ethnicity and then median income -- because I wanted to know: Are there more noncompliance site condition results in zip codes with a majority Black population? What about in zip codes that have a lower median income than Maryland's? However, I found no relationship between the compliance data and either of those census measures. If I were to continue working on this, specifically relating to census data, I would next look at the age demographics and the "prices asked" data for home sales (sketched farther down). The age portion I'm curious about to see whether a lot of these noncompliance outcomes occur near zip codes with a college or zip codes with a high concentration of older folks. For the housing data, I want to know if houses near areas that received noncompliance reports were typically cheaper than areas away from them.
```{r}
library(tidyverse)
library(lubridate) #mdy(), year(), month() and day() used in the cleaning below
library(dplyr)
library(janitor)
library(stringr)
library(ggplot2)
library(ggthemes)
library(tidycensus) #get_acs() for the census pulls below
library(sf)
library(tigris)
options(tigris_use_cache = TRUE)
```
```{r}
#Here I am cleaning the data. I read it in, cleaned the names, separated the location column into three columns listing the city, state and zip code of the complaint, and did a lot of work on the new zip column. I then dropped the document and inspection_reason columns because they were empty, though I've been trying to reach the creators of the spreadsheet to see if I can get that data. I'm very interested in the inspection_reason column and was disappointed to see it was completely useless here. The document links are invalid, making that column more or less useless as well.
#I commented out the two lines of code I used to check whether there was any reason to keep the inspection_reason column. Unfortunately, it's all NA, so I took it out. I am still trying to get a hold of the data.
compliance <- read_csv("data/compliance_dec.csv") |>
clean_names() |>
separate(city_state_zip, into = c("city", "state", "zip"), sep = ",") |>
#group_by(inspection_reason) |>
#summarise(count=n())
mutate(inspection_date = mdy(inspection_date),
zip = str_replace_all(zip, c("-.*$" = "", "20619619" = "20619", "206 4" = "20674")),
zip = str_trim(zip),
year = year(inspection_date),
month = month(inspection_date, label = TRUE),
day = day(inspection_date),
zip = case_when(zip == "" | zip == "00000" | zip == "33333" ~ NA, .default = zip),
         #consolidating the many Agricopia lot entries under one site name
         site_name = case_when(site_name %in% c(
           "Agricopia-Section 4-Lots 315-316",
           "Agricopia-Section 4 Lots 319 & 320",
           "Agricopia - Section 4 & 5 - Lots 329-330",
           "Agricopia - Section 4 - Lots 313-314",
           "Agricopia - Section 4 & 5 - Lots 383-384",
           "Agricopia - Section 4 - Lots 315-316",
           "Agricopia - Section 4 - Lots 317-318",
           "Agricopia - Section 4 - Lots 325-326",
           "Agricopia - Section 4 - Lots 327-328",
           "Agricopia - Section 4C-2 - Lots 505-510",
           "Agricopia - Section 7 - Lot 194",
           "Agricopia-Section 4-Lots 317-318",
           "Agricopia - Section 7 - Lot 195",
           "Agricopia - Section 7, 8, & 9",
           "Agricopia -Section 4C-4 - Lots 532-538, Lots525-531",
           "Agricopia /Section 6B Open Space O",
           "Agricopia – Section 7A",
           "Agricopia Lot 222",
           "Agricopia Section $ Lots 323 & 324",
           "Agricopia Section 4C-1 Townhomes Lots 465-470",
           "Agricopia Section 6B - Lots 413-418 & 423-424",
           "Agricopia Section 7 Lot 220",
           "Agricopia Sections 4 & 5 - Lots 379-380",
           "Agricopia- Lot 221",
           "Agricopia-Lot 152",
           "Agricopia Section 4 Lots 321-322"
         ) ~ "Agricopia", .default = site_name),
         site_name = case_when(site_name == "P & J Contracting Company, Inc." ~ "P & J Contracting Company, Inc", .default = site_name)
) |>
select(-document, -inspection_reason)
compliance
```
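A quick sanity check on the zip cleaning above -- this is my own addition, and it assumes a cleaned zip should be exactly five digits, with everything else now NA. An empty result means the cleaning held.
```{r}
#Sanity check (my addition): any zip left that isn't exactly five digits?
compliance |>
  filter(!is.na(zip), !str_detect(zip, "^\\d{5}$")) |>
  select(site_name, city, state, zip)
```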
```{r}
#Here I'm checking whether these are just Maryland reports and what the zip codes look like. I used this to better clean the compliance dataframe. Cleaning checklist: change the 00000 zip codes to NA, change the zip code "20619619" to "20619" (I double-checked this and it is California, Maryland), and change zip code "206 4" to "20674" after grabbing the address from that entry and looking it up.
#Data smells: zip codes "", "00000", "206 7", "20619619" and "33333-3333". Why are some in Illinois?? The reports are in DC, GA, IL, MD, VA, WV, YY (typo?), and some have no state available.
#Question: Which zip code has the highest number of reports?
zip_count <- compliance |>
group_by(zip) |>
summarise(count = n()) |>
arrange(desc(count))
zip_count
state_zip <- compliance |>
filter(county == "Outside of Maryland") |>
  group_by(zip, county, state)
state_zip
```
```{r}
# Here I wanted to group the counties to get a count of how many there were/the geographical scope of the data. There ended up being a few extra entries that weren't Maryland counties
# Question: What county has the highest count of compliance forms? Does this match with the zip code above? St. Mary's has the highest count of compliance forms, but the zip code with the highest count is in Talbot County.
compliance_counties <- compliance |>
group_by(county) |>
summarise(count = n()) |>
arrange(desc(count))
compliance_counties
#Because of this, I'm going to create the md_reports dataframe, which will include just the Maryland reports. The ones outside of the state don't seem too significant, though there are a few on the borders. My theory is that the odd locations, like the one in Atlanta, Georgia, are listed because a company's headquarters is probably there. It's also why I'm hesitant to treat the Purple Line as just Bethesda.
md_reports <- compliance |>
filter(state == "MD")
md_reports
```
```{r}
#What counties have the most noncompliance reports? I'm including "Corrective Actions Required" since it shows something still needed to be fixed.
county_noncomp_count <- md_reports |>
filter(site_condition == "Noncompliance" | site_condition == "Corrective Actions Required") |>
group_by(county) |>
summarise(count = n()) |>
arrange(desc(count))
county_noncomp_count
```
```{r}
#Here I want to get a sense of the different columns and what all their entries are.
#The recommended_actions column
compliance_actions <- compliance |>
group_by(recommended_actions) |>
summarise(count =n()) |>
arrange(desc(count))
compliance_actions
#The site_condition column
compliance_site_condition <- compliance |>
group_by(site_condition) |>
summarise(count =n()) |>
arrange(desc(count))
compliance_site_condition
```
```{r}
#What site has the most "Corrective Actions Required" outcomes in its site_condition column? A residence hall at Bowie State University has the most. I would probably need to FOIA the records, or just ask, since they seemed pretty willing to help out when I called. If I had the inspection_reason column, I would have loved to see how many requests came from college campuses.
action_required <- compliance |>
filter(site_condition == "Corrective Actions Required") |>
group_by(site_name) |>
summarise(count = n()) |>
arrange(desc(count))
action_required
```
```{r}
#A recycling company (World Recycling Company, 11 counts of significant noncompliance) received the most "Significant Noncompliance" counts within the site_status column. However, combined with Noncompliance, it's a little further down the list.
sig_non <- compliance |>
filter(site_status == "Significant Noncompliance") |>
group_by(site_name) |>
summarise(count = n()) |>
arrange(desc(count))
sig_non
sig_norm_noncom <- compliance |>
filter(site_status == "Significant Noncompliance" | site_status == "Noncompliance") |>
group_by(site_name) |>
summarise(count = n()) |>
arrange(desc(count))
sig_norm_noncom
```
```{r}
#Wowowowowow the Purple Line has over 500 total reports, making it the site with the highest count of reports. This makes sense seeing how it's a large project that spans across different zip codes.
names <- compliance |>
group_by(site_name) |>
summarise(total_count =n()) |>
arrange(desc(total_count))
names
```
```{r}
# Does the MTA-Purple Line have the highest count of noncompliance as well? Yep. Note: It had over 260 counts when I used the site_status column instead. I'm using the site_condition column since it appears to be the intended one -- the site_status column should say whether the case is "active" or "closed."
noncompliance_condition <- compliance |>
filter(site_condition == "Noncompliance") |>
group_by(site_name) |>
summarise(condition_count = n()) |>
arrange(desc(condition_count))
noncompliance_condition
noncompliance_status <- compliance |>
filter(site_status == "Noncompliance") |>
group_by(site_name) |>
summarise(status_count = n()) |>
arrange(desc(status_count))
noncompliance_status
```
```{r}
#Because the Purple Line is such an outlier, I want to see what percentage of each site's inspections ended in noncompliance. About 48% of the Purple Line's inspection results are noncompliant.
status_merge <- names |> left_join(noncompliance_status, by = "site_name") |>
mutate(perc_noncom = (status_count/total_count)*100) |>
arrange(desc(perc_noncom))
status_merge
```
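One caveat on that ranking: a site with only a handful of inspections can hit 100% on a single bad visit. Here's the same ranking with a minimum-inspection floor -- the cutoff of 20 is my own arbitrary choice.
```{r}
#Same ranking, but only for sites with at least 20 inspections (arbitrary cutoff, my addition)
status_merge |>
  filter(total_count >= 20) |>
  arrange(desc(perc_noncom))
```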
```{r}
#I just wanted to make this dataframe for future coding
purple_line <- compliance |>
filter(site_name == "MTA-Purple Line")
purple_line
```
```{r}
#What inspection types make up the Purple Line's noncompliance results?
purple_line_types <- purple_line |>
filter(site_status == "Noncompliance") |>
group_by(inspection_type) |>
summarise(count =n()) |>
mutate(percent = count/(sum(count))*100) |>
arrange(desc(count))
purple_line_types
write_csv(purple_line_types, "data/purple_line_types.csv")
#Exporting this for a datawrapper map. See below.
```
```{r}
#I'm more comfortable using Datawrapper, but I wanted to try making this a bar chart in ggplot.
#Here's the datawrapper map: https://datawrapper.dwcdn.net/vrh1F/2/
#And here's the ggplot chart:
purple_line_types |>
ggplot() +
geom_bar(aes(x =reorder(inspection_type, percent), weight = percent), fill = "lightblue", color = "lightblue") +
coord_flip() +
theme_economist() +
labs(
title="Majority of Purple Line's noncompliance results were storm water management",
x = "noncompliance types",
y = "percent",
caption = "source: Maryland Department of the Environment"
) +
theme(
plot.title = element_text(size = 12, margin = margin(b = 15), hjust = 1.3),
axis.title.y = element_text(margin = margin(r = 10)),
axis.title.x = element_text(margin = margin(t = 10)),
plot.margin = margin(t = 30, r = 10, b = 10, l = 10)
)
```
```{r}
#Construction on the Purple Line actually paused around the pandemic and restarted during the summer of 2022. What was going on during that period of limbo? Why were there still noncompliance reports/was it not taken care of during that time? How did this impact the surrounding area/environment? This can be especially concerning since storm water management is the top inspection type for this site. Did this impact flash flood risk? (This chunk counts statewide noncompliance per year; the next one narrows to the Purple Line.)
noncompliance_per_year <- compliance |>
filter(site_status == "Noncompliance") |>
group_by(year) |>
summarise(count = n()) |>
arrange(year)
noncompliance_per_year
```
```{r}
#Was there a year with more noncompliance issues than others? Yes, 2023. Construction began Aug. 2017, paused Sept. 2020 and restarted in the summer of 2022. Still, these noncompliances have been growing over time.
purple_line_time <- purple_line |>
filter(site_condition == "Noncompliance"| site_condition == "Corrective Actions Required") |>
group_by(year) |>
summarise(count = n()) |>
arrange(year)
purple_line_time
```
```{r}
# Here I'm exploring the types of site conditions found along the purple line.
purple_compliant_condition <- purple_line |>
group_by(site_condition) |>
summarise(count = n()) |>
arrange(desc(count))
purple_compliant_condition
```
```{r}
#Again, more exploring. I try not to use the site_status column, but I wanted to see what it totaled to for these. I grouped together the statuses that made the most sense. Everything else I labeled as other. (The recoding happens before the grouping so the counts collapse cleanly.)
purple_compliant_status <- purple_line |>
  mutate(
    site_status = case_when(site_status == "Out of Compliance- NOV" | site_status == "Out of Comp HPV NOV Issued" | site_status == "Corrective Actions Required" ~ "Noncompliance", .default = site_status),
    site_status = case_when(site_status == "Satisfactory/Compliance" | site_status == "No Violations Observed" ~ "Compliance", .default = site_status),
    site_status = case_when(site_status == "Additional Investigation Requi" | site_status == "Not Applicable" | site_status == "Not Evaluated" | site_status == "Compliance Assistance Rendered" ~ "Other", .default = site_status)
  ) |>
  group_by(site_status) |>
  summarise(count = n()) |>
  arrange(desc(count))
purple_compliant_status
```
# Okay, time to start exploring another data set. Are there more noncompliance results in zip codes with a higher Black population? The data shows this isn't quite true.
```{r}
#Here I'm pulling all the data on race and ethnicity from the census. I'll join it with MD data later to weed out the non-MD reports.
mdzip <- get_acs(geography="zcta", variables = "B01003_001", year=2021) |>
clean_names() |>
rename(zip = geoid, total_pop = estimate) |>
select(-name, -variable, -moe)
mdzip
md_white <- get_acs(geography="zcta", variables = "B02001_002", year=2021) |>
clean_names() |>
rename(zip = geoid, white_pop = estimate) |>
select(-name, -variable, -moe)
md_black <- get_acs(geography="zcta", variables = "B02001_003", year=2021) |>
clean_names() |>
rename(zip = geoid, black_pop = estimate) |>
select(-name, -variable, -moe)
md_indigenous <- get_acs(geography="zcta", variables = "B02001_004", year=2021)|>
clean_names() |>
rename(zip = geoid, indigenous_pop = estimate) |>
select(-name, -variable, -moe)
md_asian <- get_acs(geography="zcta", variables = "B02001_005", year=2021)|>
clean_names() |>
rename(zip = geoid, asian_pop = estimate) |>
select(-name, -variable, -moe)
md_hawaii_pi <- get_acs(geography="zcta", variables = "B02001_006", year=2021)|>
clean_names() |>
rename(zip = geoid, hawaii_pi_pop = estimate) |>
select(-name, -variable, -moe)
md_other <- get_acs(geography="zcta", variables = "B02001_007", year=2021)|>
clean_names() |>
rename(zip = geoid, other_pop = estimate) |>
select(-name, -variable, -moe)
md_two_more <- get_acs(geography="zcta", variables = "B02001_008", year=2021)|>
clean_names() |>
rename(zip = geoid, two_more_pop = estimate) |>
select(-name, -variable, -moe)
md_hispanic <- get_acs(geography="zcta", variables = "B03001_003", year=2021)|>
clean_names() |>
rename(zip = geoid, hispanic_pop = estimate) |>
select(-name, -variable, -moe)
```
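The nine pulls above all follow the same pattern, so here's a more compact sketch of the same idea using a named variable vector and tidycensus's output = "wide". The exact column suffixes after clean_names() are worth double-checking with names() before relying on this.
```{r}
#A compact alternative to the separate get_acs() calls above (a sketch, not a replacement)
race_vars <- c(total_pop = "B01003_001", white_pop = "B02001_002", black_pop = "B02001_003")
acs_wide <- get_acs(geography = "zcta", variables = race_vars, year = 2021, output = "wide") |>
  clean_names() |>
  rename(zip = geoid) |>
  select(zip, ends_with("_e")) |>          #keep the estimate columns, drop the MOEs
  rename_with(\(x) str_remove(x, "_e$"))   #total_pop_e -> total_pop, etc.
acs_wide
```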
# Joining the demographic data with the noncompliance data
```{r}
#The raw counts by race/ethnicity weren't very meaningful on their own, so I made columns showing each group's percentage of the population instead. Since "Hispanic" is an ethnicity and not a race, there will be overlap.
census_data <- mdzip |>
left_join(md_white, by = "zip") |>
left_join(md_black, by = "zip") |>
left_join(md_indigenous, by = "zip") |>
left_join(md_asian, by = "zip") |>
left_join(md_hawaii_pi, by = "zip") |>
left_join(md_other, by = "zip") |>
left_join(md_two_more, by = "zip") |>
left_join(md_hispanic, by = "zip") |>
mutate(perc_white = (white_pop/total_pop)*100) |>
mutate(perc_black = (black_pop/total_pop)*100) |>
mutate(perc_indigenous = (indigenous_pop/total_pop)*100) |>
mutate(perc_asian = (asian_pop/total_pop)*100) |>
mutate(perc_hawaii_pi = (hawaii_pi_pop/total_pop)*100) |>
mutate(perc_other = (other_pop/total_pop)*100) |>
mutate(perc_two_more = (two_more_pop/total_pop)*100) |>
mutate(perc_hispanic = (hispanic_pop/total_pop)*100) |>
select(-white_pop, -black_pop, -indigenous_pop, -asian_pop, -hawaii_pi_pop, -other_pop, -two_more_pop, -hispanic_pop)
census_data
```
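Following up on the note at the top about next steps: a sketch of how I'd pull the age and home-price measures. B01002_001 is median age; B25077_001 (median value of owner-occupied homes) is a stand-in for the "prices asked" data, since I haven't confirmed the exact table code for that.
```{r}
#Sketch of the follow-up census pulls mentioned in the intro note (my addition).
#B25077_001 is a stand-in for "prices asked" -- the exact table would need confirming.
extra_vars <- c(median_age = "B01002_001", median_home_value = "B25077_001")
md_age_value <- get_acs(geography = "zcta", variables = extra_vars, year = 2021, output = "wide") |>
  clean_names() |>
  rename(zip = geoid) |>
  select(zip, median_age_e, median_home_value_e)
census_data_plus <- census_data |>
  left_join(md_age_value, by = "zip")
census_data_plus
```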
# I want to focus on noncompliance records for this, possibly zeroing in on issues that were fixed versus ones that haven't been resolved.
```{r}
#Let's look at just the Maryland stuff right now. I'm clumping the out-of-compliance statuses (NOVs and corrective actions) together as "Noncompliance" since they're along the same vein.
count_noncompliance <- md_reports |>
mutate(
site_status = case_when(site_status == "Out of Compliance- NOV" | site_status == "Out of Comp HPV NOV Issued"| site_status == "Corrective Actions Required" ~ "Noncompliance", .default = site_status)
) |>
filter(site_status == "Noncompliance") |>
group_by(zip, site_status) |>
summarise(noncompliance_count = n()) |>
select(-site_status) |>
arrange(zip)
count_noncompliance
count_compliance <- md_reports |>
mutate(
site_status = case_when(site_status == "Satisfactory/Compliance" | site_status == "No Violations Observed" ~ "Compliance", .default = site_status)
) |>
filter(site_status == "Compliance") |>
group_by(zip, site_status) |>
summarise(compliance_count = n()) |>
select(-site_status) |>
arrange(zip)
count_compliance
other_reports <- md_reports |>
mutate(
site_status = case_when(site_status == "Additional Investigation Requi" | site_status == "Not Applicable"| site_status == "Not Evaluated" | site_status == "Compliance Assistance Rendered" ~ "Other", .default = site_status)
) |>
filter(site_status == "Other") |>
group_by(zip, site_status) |>
summarise(other_count = n()) |>
select(-site_status) |>
arrange(zip)
other_reports
md_reports_status <- count_noncompliance |>
left_join(count_compliance, by = "zip") |>
left_join(other_reports, by = "zip") |>
mutate(
noncompliance_count = ifelse(is.na(noncompliance_count), 0, noncompliance_count),
compliance_count = ifelse(is.na(compliance_count), 0, compliance_count),
other_count = ifelse(is.na(other_count), 0, other_count),
total = noncompliance_count + compliance_count + other_count,
perc_non_comp = (noncompliance_count/total)*100
)
md_reports_status
```
# Okay, well, this gives me a few zip codes worth a closer look farther down. Now I'm going to take the above and merge it with the census data.
```{r}
md_noncompliant_zip <- md_reports_status |>
left_join(census_data, by = "zip") |>
# filter(perc_black >= 50.00) |>
select(zip, perc_non_comp, noncompliance_count, compliance_count, perc_black, perc_white) |>
arrange(desc(perc_non_comp))
md_noncompliant_zip
```
# Is there a trend where higher percentages of noncompliance outcomes fall in majority-Black zip codes?
```{r}
ggplot(md_noncompliant_zip, aes(x=perc_black, y=perc_non_comp)) +
geom_point()+
labs(
title="Sites that received WSA noncompliant results",
x = "percent of zip population that identifies as Black/African American",
y = "percent of noncompliant results",
caption = "source: Maryland Department of the Environment, U.S. Census") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_text(check_overlap = TRUE, size = 2, aes(label = zip), hjust = 0.5, vjust = -0.5)
```
# Welp, doesn't look like my theory was correct. What about income?
```{r}
# Attempt at income
mdzip_income <- get_acs(geography="zcta", variables = "B19013_001", year=2021) |>
clean_names() |>
rename(zip = geoid, median_income = estimate) |>
select(-name, -variable, -moe)
mdzip_income
```
```{r}
#Now grabbing noncompliance again.
md_reports_noncompliance <- md_reports |>
filter(site_condition == "Noncompliance") |>
group_by(zip) |>
summarise(count_of_noncompliant = n()) |>
arrange(desc(count_of_noncompliant))
md_reports_noncompliance
```
```{r}
md_income_noncompliance <- md_reports_status |>
left_join(mdzip_income, by = "zip") |>
arrange(desc(perc_non_comp))
md_income_noncompliance
```
```{r}
ggplot(md_income_noncompliance, aes(x=median_income, y=perc_non_comp)) +
geom_point()+
labs(
title="Sites that received WSA noncompliant results compared to median income",
x = "median household income",
y = "percent of noncompliant results",
caption = "source: Maryland Department of the Environment, U.S. Census") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_text(check_overlap = TRUE, size = 2, aes(label = zip), hjust = 0.5, vjust = -0.5)
```
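To put a number on what both scatterplots show, here's a quick pair of correlation tests -- my own addition, running Pearson correlations on the same dataframes the plots use (NAs get dropped automatically).
```{r}
#Pearson correlations backing up the two scatterplots above (my addition)
cor.test(~ perc_black + perc_non_comp, data = md_noncompliant_zip)
cor.test(~ median_income + perc_non_comp, data = md_income_noncompliance)
```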
# Okay, so really no relationship here overall. Let's take a look at some specific zip codes just to see their demographics.
Population density by zip code:

* 21601: 216
* 21801: 708
* 21740: 884.72
* 21613: 110.36
* 21921: 488.08
* 21666: 557
* 20646: 229 (La Plata)
* 20817: 2,687.56 (Bethesda)

Bethesda is the one outlier among the wealthier zip codes -- MTA Purple Line.
```{r}
#Is it possible to compare per capita, or are the numbers too small?
#md_reports_zip was never built earlier, so I'm creating it here as a count of reports per zip (my reconstruction of what the join below expects).
md_reports_zip <- md_reports |>
  group_by(zip) |>
  summarise(number_of_reports = n())
md_zip_map <- md_reports_zip |>
  left_join(census_data, by = "zip") |>
  mutate(reports_per_capita = (number_of_reports/total_pop)*10000) |> #per 10,000 residents
  select(zip, reports_per_capita)
md_zip_map
#I think they might be too small to compare per capita by zip
#write_csv(md_zip_map, "data/md_map.csv")
```
```{r}
#These were outliers when I did a general total rather than a percent.
md_outliers_zip <- md_reports |>
  filter(zip %in% c("21601", "21801", "21740", "21613", "21921", "21666", "20646", "20817"),
         site_condition %in% c("Noncompliance", "Significant Noncompliance")) |>
  group_by(zip, inspection_type) |>
  summarise(count = n()) |>
  arrange(desc(count))
md_outliers_zip
#write_csv(md_outliers_zip, "data/zip_outliers.csv")
```
Lots of sediment erosion in 20646 (486 reports), 21601 (418 reports) and 21801 (405 reports). I'm just going to poke around in these zip codes to see what's there.
```{r}
#This one is in Baltimore City, which made me interested
zip_21213 <- md_reports |>
filter(zip == "21213") |>
group_by(site_name, inspection_type) |>
summarise(count =n()) |>
arrange(desc(count))
zip_21213
#P & J Contracting Company is standing out the most to me here.
```
```{r}
#Here's a zip code with a majority Black population
zip_20613 <- md_reports |>
  filter(zip == "20613") |>
  group_by(site_condition) |>
summarise(count = n()) |>
arrange(desc(count))
zip_20613
```
```{r}
#Agricopia is a neighborhood in La Plata, which is in this zip. Agricopia had a good amount of noncompliance reports overall (still behind the Purple Line)
zip_20646 <- md_reports |>
filter(zip == "20646" & site_condition == "Noncompliance" | site_condition == "Significant Noncompliance", inspection_type == "Sediment_Erosion") |>
group_by(site_name) |>
summarise(count = n()) |>
arrange(desc(count))
zip_20646
```
```{r}
#Checking out this Talbot County zip code. Easton Technology Center stood out to me here.
zip_21601 <- md_reports |>
  filter(zip == "21601", site_condition %in% c("Noncompliance", "Significant Noncompliance"), site_name == "Easton Technology Center Lot 11") |>
group_by(inspection_type) |>
summarise(count = n()) |>
arrange(desc(count))
zip_21601
```
# I FOUND FLOOD EVENT DATA
```{r}
#What areas have had the majority of flood events? I found NWS data covering 2019 through Aug. 31, 2023. I want to compare this to the counties with noncompliance results: if a location's stormwater management is poor, it could contribute to flash flooding, and if it's noncompliant with NPDES regulations, that could mean more pollutants in water sources. This shows me which counties to focus on if I want to look deeper into flooding. I would love to find a map that shows local water quality over time, but I couldn't find one in time that would work.
flood_events <- read_csv("data/storm_data_search_results.csv") |>
clean_names() |>
rename(county = cz_name_str) |>
select(county, begin_location, begin_date) |>
mutate(begin_date = mdy(begin_date),
         county = str_to_title(str_remove_all(county, fixed(" CO."))),
county = case_when (county == "Baltimore City (C)" ~ "Baltimore City", .default = county),
begin_location= str_to_title(begin_location))
flood_events
md_flood_event_count <- flood_events |>
group_by(county) |>
summarise(flood_event_count = n()) |>
arrange(desc(flood_event_count))
md_flood_event_count
```
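Before mapping, a quick side-by-side of the flood counts and the county noncompliance counts from earlier -- a sketch of the comparison I described above. Note the county spellings may need reconciling first: the NWS names lose their apostrophes, so "Prince Georges" won't match "Prince George's" without a fix.
```{r}
#Lining flood events up against county noncompliance counts (my addition).
#Watch the county spellings: NWS drops apostrophes, so some rows may not join cleanly.
flood_vs_noncomp <- md_flood_event_count |>
  left_join(county_noncomp_count, by = "county") |>
  rename(noncompliance_count = count) |>
  arrange(desc(flood_event_count))
flood_vs_noncomp
```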
```{r}
counties <- counties()
```
```{r}
md_counties <- counties |>
filter(STATEFP == "24") |>
clean_names() |>
rename(county = name) |>
  mutate(
    geoid = as.character(geoid),
    county = str_replace(county, "Baltimore County", "Baltimore"),
    county = case_when(
      county == "Baltimore" & namelsad == "Baltimore city" ~ "Baltimore City", TRUE ~ county))
md_counties
```
```{r}
flood_counties <- md_counties |> left_join(md_flood_event_count, join_by(county)) |>
# select(-county.x) |>
# rename(county = county.y) |>
mutate(county = str_replace(county, "Baltimore County", "Baltimore")) |>
select(-namelsad)
flood_counties
```
```{r}
county_centroids <- st_centroid(flood_counties)
county_centroids_df <- as.data.frame(st_coordinates(county_centroids))
county_centroids_df$county <- flood_counties$county
county_centroids_df
```
```{r}
county_centroids_df$county
```
```{r}
#The grey counties appear to have had no flood events from 2019-2023.
ggplot() +
geom_sf(data=flood_counties, aes(fill=flood_event_count)) +
geom_text(aes(x = X, y = Y, label = county), data = county_centroids_df, size = 3, check_overlap = TRUE) +
theme_minimal()
```
```{r}
#What are the top noncompliance investigation types for the top 5 flood counties? (Frederick, Montgomery, Baltimore, Prince George's, St. Mary's)
Baltimore_top_inspection_type <- md_reports |>
filter(site_condition == "Noncompliance"| site_condition == "Corrective Actions Required", county == "Baltimore", inspection_type == "NPDES Industrial Stormwater") |>
group_by(inspection_type) |>
summarise(count = n()) |>
arrange(desc(count)) |>
mutate(county = "Baltimore")
Baltimore_top_inspection_type
Frederick_top_inspection_type <- md_reports |>
filter(site_condition == "Noncompliance"| site_condition == "Corrective Actions Required", county == "Frederick", inspection_type == "NPDES Construction Activity") |>
group_by(inspection_type) |>
summarise(count = n()) |>
arrange(desc(count)) |>
mutate(county = "Frederick")
Frederick_top_inspection_type
Montgomery_top_inspection_type <- md_reports |>
filter(site_condition == "Noncompliance"| site_condition == "Corrective Actions Required", county == "Montgomery", inspection_type == "NPDES Construction Activity") |>
group_by(inspection_type) |>
summarise(count = n()) |>
arrange(desc(count)) |>
mutate(county = "Montgomery")
Montgomery_top_inspection_type
PrinceGeorges_top_inspection_type <- md_reports |>
filter(site_condition == "Noncompliance"| site_condition == "Corrective Actions Required", county == "Prince George's", inspection_type == "NPDES Construction Activity") |>
group_by(inspection_type) |>
summarise(count = n()) |>
arrange(desc(count)) |>
mutate(county = "Prince George's")
PrinceGeorges_top_inspection_type
StMarys_top_inspection_type <- md_reports |>
filter(site_condition == "Noncompliance"| site_condition == "Corrective Actions Required", county == "St. Mary's", inspection_type == "Tidal Wetlands") |>
group_by(inspection_type) |>
summarise(count = n()) |>
arrange(desc(count)) |>
mutate(county = "St. Mary's")
StMarys_top_inspection_type
top_types_in_flood <- bind_rows(Baltimore_top_inspection_type,
Frederick_top_inspection_type,
Montgomery_top_inspection_type,
PrinceGeorges_top_inspection_type,
StMarys_top_inspection_type) |>
arrange(desc(count)) |>
select(county, everything())
#^This last line I snagged from ChatGPT. I asked it "How do I bring the last column up to be the first column?" It suggested the select() function with everything() to avoid listing every single column after.
top_types_in_flood
```
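The five blocks above hard-code each county's top inspection type, which means they'd quietly go stale if the data changed. Here's a sketch that computes the top type per county instead (slice_max will also surface ties, which the hard-coded version can't).
```{r}
#Computing the top noncompliance inspection type per flood county instead of hard-coding it (a sketch)
top5_flood_counties <- c("Frederick", "Montgomery", "Baltimore", "Prince George's", "St. Mary's")
top_types_computed <- md_reports |>
  filter(site_condition %in% c("Noncompliance", "Corrective Actions Required"),
         county %in% top5_flood_counties) |>
  count(county, inspection_type, name = "count") |>
  group_by(county) |>
  slice_max(count, n = 1) |>
  ungroup() |>
  arrange(desc(count))
top_types_computed
```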
# ^Now we have the top issues in the top five counties that have seen the majority of flood events since 2019.
```{r}
inspection_types <- md_reports |>
group_by(inspection_type) |>
summarise(inspection_type_count = n()) |>
arrange(desc(inspection_type_count))
inspection_types
```
```{r}
#Baltimore City was pretty dark on the map, meaning it didn't have a lot of flood events during that timespan, but I still want to see what is causing the majority of these noncompliance findings. P & J Contracting Company, Inc. stands out.
bc_storm_water <- md_reports |>
filter(county == "Baltimore City",
inspection_type == "NPDES Industrial Stormwater",
site_status == "Noncompliance",
year != "2018"
) |>
group_by(site_name) |>
summarise(count = n()) |>
arrange(desc(count))
bc_storm_water
```
```{r}
#Now let's make a dataframe of just the corrective_actions_required reports in MD. How are they spread out across the state? College Park, MD, has the most.
compliance_corrective_required <- md_reports |>
filter(site_condition == "Corrective Actions Required")
required_zip <- compliance_corrective_required |>
group_by(zip) |>
summarise(number_of_corrective_actions_required = n()) |>
arrange(desc(number_of_corrective_actions_required)) |>
mutate(total = sum(number_of_corrective_actions_required),
percent_of_corrections = (number_of_corrective_actions_required/total)*100) |>
select(-total)
required_zip
```
```{r}
#What are all of the inspections from unauthorized sources? Where are they located? I decided to make a Google map of this since I had the addresses and could narrow things down to see exactly where the events happened. There seems to be a corridor running from southeast of Baltimore to northwest of the city. This isn't visible in the zip code map. https://www.google.com/maps/d/edit?hl=en&mid=15blNyKGxHlsnJw9fq7eU0PvGLTEqltI&ll=38.970523016438676%2C-77.36076905598128&z=8
#Here's the zip map: https://datawrapper.dwcdn.net/70Unb/1/
unauthorized <- md_reports |>
mutate(is_unauthorized = str_detect(site_name, "(?i)unauthorized")) |>
filter(is_unauthorized == "TRUE")
zip_unauthorized <- unauthorized |>
filter(site_condition == "Noncompliance" | site_condition == "Corrective Actions Required") |>
group_by(zip) |>
summarise(unauthorized_count = n()) |>
arrange(desc(unauthorized_count))
zip_unauthorized
#write_csv(zip_unauthorized, "data/zipcode_unauthorized.csv")
```
```{r}
#What does this look like at a county level? Not my favorite, especially since there are a few near the border counties. I want to know if they were done by the Maryland border or if it was deeper in the state. https://datawrapper.dwcdn.net/pU9Q3/2/
county_unauthorized <- unauthorized |>
filter(site_condition == "Noncompliance" | site_condition == "Corrective Actions Required") |>
group_by(county) |>
summarise(count = n()) |>
arrange(desc(count))
county_unauthorized
#write_csv(county_unauthorized, "data/county_unauthorized.csv")
```
```{r}
#What's the spread of inspection types among the unauthorized incidents?
unauthorized_incident_types <- unauthorized |>
group_by(inspection_type) |>
summarise(count = n()) |>
arrange(desc(count))
unauthorized_incident_types
```
```{r}
#What are the demographics of those zip codes? Still mostly in white zip codes
unauthorized_demo <- zip_unauthorized |>
left_join(census_data, by = "zip") |>
select(zip, unauthorized_count, total_pop, perc_black, perc_white)
unauthorized_demo
```
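To back up the "mostly white zip codes" read with a number, here's a sketch that buckets the zips by racial majority and totals the unauthorized reports in each bucket. Zips missing census data drop out, and the 50% cutoff is my own choice.
```{r}
#Share of unauthorized reports by zip-code racial majority (my addition; 50% cutoff is arbitrary)
unauthorized_demo |>
  filter(!is.na(perc_white)) |>
  mutate(majority = case_when(perc_white > 50 ~ "majority white",
                              perc_black > 50 ~ "majority Black",
                              .default = "no majority")) |>
  group_by(majority) |>
  summarise(reports = sum(unauthorized_count)) |>
  mutate(share = (reports/sum(reports))*100)
```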
```{r}
#Speaking of flooding, how big is the state/federal stormwater management issue in each zip code? Each county? A handful of zip codes in Montgomery, Baltimore, St. Mary's and PG County had some of the highest SWM noncompliance counts.
state_fed_swm_zip <- md_reports |>
left_join(count_noncompliance, by = "zip") |>
filter(inspection_type == "State_federal SWM", site_condition == "Noncompliance") |>
group_by(zip, county, noncompliance_count) |>
summarise(swm_count = n()) |>
mutate(swm_percent_of_noncompliance = (swm_count/noncompliance_count)*100) |>
arrange(desc(swm_percent_of_noncompliance))
state_fed_swm_zip
```
```{r}
#Anne Arundel has the largest percentage at 15%, followed by Montgomery, Harford, PG County and Baltimore. Both Montgomery and PG County were among the top three counties for flood events from 2019-2023.
state_fed_swm_county <- md_reports |>
left_join(county_noncomp_count, by = "county") |>
filter(inspection_type == "State_federal SWM", site_condition == "Noncompliance") |>
group_by(county, count) |>
summarise(swm_count = n()) |>
mutate(swm_percent_of_noncompliance = (swm_count/count)*100) |>
arrange(desc(swm_percent_of_noncompliance))
state_fed_swm_county
```
```{r}
#What is Baltimore City's biggest noncompliance issue overall?
Baltimore_City_noncompliance <- md_reports |>
filter(county == "Baltimore City", site_condition == "Noncompliance") |>
group_by(inspection_type) |>
summarise(Baltimore_City = n()) |>
arrange(desc(Baltimore_City))
Baltimore_City_noncompliance
```
```{r}
#What are the demographics of zip codes that require action? Still pretty white.
demographics_near_required_actions <- required_zip |>
left_join(census_data, by = "zip") |>
arrange(desc(percent_of_corrections))
demographics_near_required_actions
```