# NYC Crash Data
## First Glance
Consider the NYC Crash Data in January 2023.
```{python}
import pandas as pd
jan23 = pd.read_csv("data/nyc_crashes_202301.csv")
jan23.head()
jan23.describe()
```
Frequency tables for categorical variables.
```{python}
jan23["BOROUGH"].value_counts(dropna=False)
```
Some tables are too long to display in full by default, so we temporarily lift the row limit.
```{python}
# jan23["CONTRIBUTING FACTOR VEHICLE 1"].value_counts(dropna=False)
with pd.option_context('display.max_rows', None):
print(jan23["VEHICLE TYPE CODE 1"].value_counts(dropna=False))
```
Cross-tabulations of two categorical variables:
```{python}
pd.crosstab(index = jan23["CONTRIBUTING FACTOR VEHICLE 1"],
columns = jan23["BOROUGH"], dropna = False)
```
## Some Cleaning
Questions from Dr. Douglas Bates:
+ The `CRASH_DATE`s are all in the correct month and there are no missing values.
+ There are no missing values in the `CRASH_TIME`s, but there are 117 values of exactly `00:00:00`. Is this a matter of bad luck when the clock strikes midnight?
+ Over 1/3 of the `ZIP_CODE` and `BOROUGH` values are missing. There are the same number of missing values in these columns - do they always co-occur (checked below)? If `LATITUDE` and `LONGITUDE` are available, can they be used to infer the `ZIP_CODE`?
+ There are 178 unique non-missing `ZIP_CODE` values as stated in the Jamboree description. (“Trust, but verify.”) Is there really a zip code of 10000 in New York?
+ There are 20 values of 0.0 for `LATITUDE` and `LONGITUDE`. These are obviously incorrect - should they be coded as missing?
+ Is it redundant to keep the `LOCATION` in addition to `LATITUDE` and `LONGITUDE`?
+ The `COLLISION_ID` is unique to each row and can be used as a key. The values are not consecutive - why not?
+ The `NUMBER_OF_...` columns seem reasonable. A further consistency check is suggested in the Jamboree tasks.
+ In the `CONTRIBUTING_FACTOR_...` columns, is `Unspecified` different from missing?
+ The codes in the `VEHICLE_TYPE_CODE_...` columns are the usual hodge-podge of results from “free-form” data entry. Should `unk`, `UNK`, `UNKNOWN`, and `Unknown` be converted to missing?
+ In contrast, the codes in the `CONTRIBUTING_FACTOR_...` columns appear to be standardized (not sure why `Illnes` isn’t `Illness`).
```{python}
with pd.option_context('display.max_rows', None):
print(jan23["CRASH TIME"].value_counts())
```
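The chunk above tabulates the crash times; as a further check on some of the other questions, here is a small sketch using the raw column names seen above:
```{python}
# do missing ZIP CODE and BOROUGH always co-occur?
print(pd.crosstab(jan23["ZIP CODE"].isna(), jan23["BOROUGH"].isna()))
# number of unique non-missing ZIP CODE values
print(jan23["ZIP CODE"].nunique())
# rows with LATITUDE and LONGITUDE both exactly 0.0
print(((jan23["LATITUDE"] == 0.0) & (jan23["LONGITUDE"] == 0.0)).sum())
```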
For example, here are some cleaning steps:
```{python}
import numpy as np

# treat `Unspecified` contributing factors as missing
factor_cols = [f"CONTRIBUTING FACTOR VEHICLE {i}" for i in range(1, 6)]
jan23[factor_cols] = jan23[factor_cols].replace("Unspecified", np.nan)
# zero coordinates are invalid; recode them as missing
jan23[["LATITUDE", "LONGITUDE"]] = (
    jan23[["LATITUDE", "LONGITUDE"]].replace(0.0, np.nan))
jan23.describe()
```
By the data dictionary, `OFF STREET NAME` is the street address of the collision
site. Some records have an `OFF STREET NAME` but missing `LATITUDE` and
`LONGITUDE`. The missing coordinates could be filled in by geocoding the street
address with a geocoding package.
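The original notes do not name a specific geocoding package; as a minimal sketch, the free Nominatim geocoder from the `geopy` package (an assumption here, not part of the original workflow) could be used with an illustrative address:
```{python}
#| eval: false
from geopy.geocoders import Nominatim

# hypothetical example; Nominatim requires a user_agent and is rate-limited
geolocator = Nominatim(user_agent="nyc_crash_demo")
loc = geolocator.geocode("350 5th Ave, Manhattan, NY")  # illustrative address
if loc is not None:
    print(loc.latitude, loc.longitude)
```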
## Filling the Missing Zip Codes by Reverse Geocoding
The package `uszipcode` provides a powerful and easy-to-use programmable
zip code database in Python, with information on 42,724 US zip codes
crawled from <data.census.gov>.
See [its documentation](https://uszipcode.readthedocs.io/index.html) for details.
```{python}
from uszipcode import SearchEngine
sr = SearchEngine()
sr.by_zipcode("10001")
```
We can use `uszipcode` to reverse geocode a point from its coordinates. The
returned zip code can then be used to fill in missing zip codes.
```{python}
z = sr.by_coordinates(40.769993, -73.915825, radius = 1)
z[0].zipcode
z[0].median_home_value
```
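These notes do not apply the reverse geocoding to the data frame itself; a minimal sketch (the helper `fill_zip` is hypothetical, and row-wise lookups over the whole data set can be slow) might look like this:
```{python}
#| eval: false
def fill_zip(row):
    """Infer a zip code from the coordinates when ZIP CODE is missing."""
    if (pd.isna(row["ZIP CODE"]) and pd.notna(row["LATITUDE"])
            and pd.notna(row["LONGITUDE"])):
        res = sr.by_coordinates(row["LATITUDE"], row["LONGITUDE"], radius=5)
        if res:  # by_coordinates may return an empty list
            return float(res[0].zipcode)  # ZIP CODE is stored as a float
    return row["ZIP CODE"]

jan23["ZIP CODE"] = jan23.apply(fill_zip, axis=1)
```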
Once we have found the zipcode, we can find its borough. See
the [complete NYC zip code list](https://bklyndesigns.com/new-york-city-zip-code/).
```{python}
def nyczip2borough_scalar(zipcode):
    """Map a single NYC zip code to its borough name."""
    nzip = int(zipcode)
    if nzip >= 10001 and nzip <= 10282:
        return "MANHATTAN"
    elif nzip >= 10301 and nzip <= 10314:
        return "STATEN ISLAND"
    elif nzip >= 10451 and nzip <= 10475:
        return "BRONX"
    elif nzip >= 11004 and nzip <= 11109:
        return "QUEENS"
    elif nzip >= 11351 and nzip <= 11697:
        return "QUEENS"
    elif nzip >= 11201 and nzip <= 11256:
        return "BROOKLYN"
    else:
        return np.nan
```
Let's try it out:
```{python}
nyczip2borough_scalar(z[0].zipcode)
```
Here is a vectorized version:
```{python}
import numpy as np
import pandas as pd
from typing import Union

def nyczip2borough(zips: Union[np.ndarray, pd.Series]) -> Union[np.ndarray, pd.Series]:
    """Map an array or Series of NYC zip codes to borough names."""
    is_series = isinstance(zips, pd.Series)   # remember the input type
    vals = zips.values if is_series else zips
    condlist = [
        (vals >= 10001) & (vals <= 10282),
        (vals >= 10301) & (vals <= 10314),
        (vals >= 10451) & (vals <= 10475),
        (vals >= 11004) & (vals <= 11109),
        (vals >= 11351) & (vals <= 11697),
        (vals >= 11201) & (vals <= 11256),
    ]
    choicelist = [
        "MANHATTAN",
        "STATEN ISLAND",
        "BRONX",
        "QUEENS",
        "QUEENS",
        "BROOKLYN",
    ]
    result = np.select(condlist, choicelist, default=np.nan)
    # return a Series aligned with the input index when a Series was given
    return pd.Series(result, index=zips.index) if is_series else result
```
Try it out:
```{python}
nyczip2borough(jan23["ZIP CODE"].dropna().head(10))
```
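The missing boroughs are not actually filled in these notes; a minimal sketch of doing so (not applied to the data here) could be:
```{python}
#| eval: false
# infer the borough where BOROUGH is missing but ZIP CODE is present
mask = jan23["BOROUGH"].isna() & jan23["ZIP CODE"].notna()
inferred = nyczip2borough(jan23.loc[mask, "ZIP CODE"])
# np.select may coerce the NaN default to the string "nan"; keep real labels only
jan23.loc[mask, "BOROUGH"] = inferred.where(inferred != "nan")
```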
The `uszipcode` package provides databases at the zip code level from the US
Census. Such information could be merged with the NYC crash data for further
analysis.
```{python}
from uszipcode import SearchEngine, SimpleZipcode
import os
# specify where the zip code database file is stored
db_file = os.path.join(os.getenv("HOME"), "simple_db.sqlite")
search = SearchEngine(db_file_path=db_file)
search.by_zipcode("10030")
```
By default, the SQLite database of US zip codes is stored under `$HOME/.uszipcode`;
here we pointed it to `db_file` instead. It can be read into a pandas `DataFrame`.
```{python}
import sqlite3
import pandas as pd
# connect to the zip code database file created above
con = sqlite3.connect(db_file)
zipdf = pd.read_sql_query("SELECT * from simple_zipcode", con)
zipdf.info()
```
The zip code dataframe can be merged with the crash dataframe, as done in the prediction section below.
## Map the Crash Sites
We can do this with package `gmplot`. See instructions from
[this tutorial](https://www.tutorialspoint.com/plotting-google-map-using-gmplot-package-in-python).
```{python}
import gmplot
import numpy as np

# prepare the geocodes, dropping rows where either coordinate is missing
coords = jan23[["LATITUDE", "LONGITUDE"]].dropna()
latitude = coords["LATITUDE"].values
longitude = coords["LONGITUDE"].values
# center of the map and zoom level
gmap = gmplot.GoogleMapPlotter(40.769737, -73.91244, 14)
# plot heatmap
gmap.heatmap(latitude, longitude)
gmap.scatter(latitude, longitude, c = 'r', marker = True)
# gmap.scatter(latitude, longitude, '#FF0000', size = 50, marker = False)
# Your Google API key
# gmap.apikey = "put your key here"
# save it to html
gmap.draw("nycrashmap.html")
```
## Predicting Injuries
Let's start by importing the cleaned data. We define `INJURY` as an indicator of
personal injury and create an `HOUR` variable for the crash hour.
```{python}
jan23 = pd.read_csv("data/nyc_crashes_202301_cleaned.csv")
jan23["INJURY"] = [1 if x > 0 else 0 for x in jan23["NUMBER OF PERSONS INJURED"]]
jan23["HOUR"] = jan23["CRASH TIME"].str.split(":").str[0].astype(int)
```
We import zip-code-level information from the `uszipcode` database and merge it with the crash data.
```{python}
zipdf = pd.read_sql_query("SELECT * from simple_zipcode WHERE state = 'NY'", con)
zipdf["zipcode"] = zipdf["zipcode"].astype(float)
crashdf = pd.merge(jan23, zipdf, left_on = "ZIP CODE", right_on = "zipcode")
```
Let's select a few columns that might be useful:
```{python}
crashdf = crashdf[["INJURY", "HOUR", "BOROUGH", "LATITUDE", "LONGITUDE",
"zipcode", "population_density", "median_home_value",
"median_household_income"]]
crashdf.info()
```
### Preparation
Categorical variables need to be coded appropriately. Ordinal variables can be
coded with `OrdinalEncoder` and nominal variables with `OneHotEncoder`, both
from `sklearn.preprocessing`. We want to treat `HOUR`, `BOROUGH`, and `zipcode`
as nominal categorical variables. One way to encode them is with `pd.get_dummies`.
```{python}
catVars = ["HOUR", "BOROUGH"] #, "zipcode"]
crashdf = pd.get_dummies(crashdf, columns = catVars, prefix = catVars)
crashdf.shape
```
It is important to note that we get one dummy variable for each level of a
categorical variable. For example, we get 5 dummies for `BOROUGH` and 24 dummies
for `HOUR`. If these dummies were to be used in a regression model, the model
matrix would not be of full rank; in other words, there would be perfect
collinearity among the covariates, and an error would occur in the model
fitting. For regularized regression models and many machine learning methods
(e.g., decision trees, random forests, SVMs), this is less of an issue.
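For an unregularized regression, a common remedy is to drop one reference level
per categorical variable; here is a toy illustration with a made-up dataframe
(not `crashdf`), using the `drop_first` option of `pd.get_dummies`:
```{python}
# toy example: `drop_first=True` drops one reference level per variable,
# which avoids the perfect collinearity described above
toy = pd.DataFrame({"BOROUGH": ["QUEENS", "BRONX", "QUEENS", "MANHATTAN"]})
pd.get_dummies(toy, columns=["BOROUGH"], drop_first=True)
```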
We need to standardize the continuous covariates so that their impacts are
comparable. Standardization also makes the numerical computation more stable:
if one variable has a variance 100 times larger than the others, it might
dominate the objective function and make it hard for the optimization
algorithm to find the optimum. This preprocessing can be done with the
`sklearn.preprocessing` module.
```{python}
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
conVars = ["population_density", "median_home_value", "median_household_income"]
crashdf[conVars] = scaler.fit_transform(crashdf[conVars])
crashdf.columns
```