# NYC Crash Data
## First Glance
Consider the NYC Crash Data in January 2023.
```{python}
import pandas as pd
jan23 = pd.read_csv("data/nyc_crashes_202301.csv")
jan23.head()
jan23.describe()
```
Frequency tables for categorical variables.
```{python}
jan23["BOROUGH"].value_counts(dropna=False)
```
Some tables are too long to display in full by default, so we temporarily lift the row limit.
```{python}
# jan23["CONTRIBUTING FACTOR VEHICLE 1"].value_counts(dropna=False)
with pd.option_context('display.max_rows', None):
print(jan23["VEHICLE TYPE CODE 1"].value_counts(dropna=False))
```
Cross-tabulations of two categorical variables:
```{python}
pd.crosstab(index = jan23["CONTRIBUTING FACTOR VEHICLE 1"],
columns = jan23["BOROUGH"], dropna = False)
```
## Some Cleaning
Questions from Dr. Douglas Bates:
+ The `CRASH_DATE`s are all in the correct month and there are no missing values.
+ There are no missing values in the `CRASH_TIME`s, but there are 117 values of exactly `00:00:00`. Is this a matter of bad luck when the clock strikes midnight?
+ Over 1/3 of the `ZIP_CODE` and `BOROUGH` values are missing. There are the same number of missing values in these columns - do they always co-occur (checked below)? If `LATITUDE` and `LONGITUDE` are available, can they be used to infer the `ZIP_CODE`?
+ There are 178 unique non-missing `ZIP_CODE` values as stated in the Jamboree description. (“Trust, but verify.”) Is there really a zip code of 10000 in New York?
+ There are 20 values of 0.0 for `LATITUDE` and `LONGITUDE`. These are obviously incorrect - should they be coded as missing?
+ Is it redundant to keep the `LOCATION` in addition to `LATITUDE` and `LONGITUDE`?
+ The `COLLISION_ID` is unique to each row and can be used as a key. The values are not consecutive - why not?
+ The `NUMBER_OF_...` columns seem reasonable. A further consistency check is suggested in the Jamboree tasks.
+ In the `CONTRIBUTING_FACTOR_...` columns, is `Unspecified` different from missing?
+ The codes in the `VEHICLE_TYPE_CODE_...` columns are the usual hodge-podge of results from “free-form” data entry. Should `unk`, `UNK`, `UNKNOWN`, and `Unknown` be converted to missing?
+ In contrast, the codes in the `CONTRIBUTING_FACTOR_...` columns appear to be standardized (not sure why `Illnes` isn’t `Illness`).
```{python}
with pd.option_context('display.max_rows', None):
print(jan23["CRASH TIME"].value_counts())
```
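The chunk above tabulates the crash times; as a further check on some of the other questions, here is a small sketch using the raw column names seen above:
```{python}
# do missing ZIP CODE and BOROUGH always co-occur?
print(pd.crosstab(jan23["ZIP CODE"].isna(), jan23["BOROUGH"].isna()))
# number of unique non-missing ZIP CODE values
print(jan23["ZIP CODE"].nunique())
# rows with LATITUDE and LONGITUDE both exactly 0.0
print(((jan23["LATITUDE"] == 0.0) & (jan23["LONGITUDE"] == 0.0)).sum())
```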
For example, here are some cleaning steps:
```{python}
import numpy as np

# treat `Unspecified` contributing factors as missing
factor_cols = [f"CONTRIBUTING FACTOR VEHICLE {i}" for i in range(1, 6)]
jan23[factor_cols] = jan23[factor_cols].replace("Unspecified", np.nan)
# zero coordinates are invalid; recode them as missing
jan23[["LATITUDE", "LONGITUDE"]] = (
    jan23[["LATITUDE", "LONGITUDE"]].replace(0.0, np.nan))
jan23.describe()
```
By the data dictionary, `OFF STREET NAME` is the street address of the collision
site. Some records have an `OFF STREET NAME` but missing `LATITUDE` and
`LONGITUDE`. The missing coordinates could be filled in by geocoding the street
address with a geocoding package.
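The original notes do not name a specific geocoding package; as a minimal sketch, the free Nominatim geocoder from the `geopy` package (an assumption here, not part of the original workflow) could be used with an illustrative address:
```{python}
#| eval: false
from geopy.geocoders import Nominatim

# hypothetical example; Nominatim requires a user_agent and is rate-limited
geolocator = Nominatim(user_agent="nyc_crash_demo")
loc = geolocator.geocode("350 5th Ave, Manhattan, NY")  # illustrative address
if loc is not None:
    print(loc.latitude, loc.longitude)
```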
## Filling the Missing Zip Codes by Reverse Geocoding
The package `uszipcode` provides a powerful and easy-to-use programmable
zip code database in Python, with information on 42,724 US zip codes
crawled from <data.census.gov>.
See [its documentation](https://uszipcode.readthedocs.io/index.html) for details.
```{python}
from uszipcode import SearchEngine
sr = SearchEngine()
sr.by_zipcode("10001")
```
We can use `uszipcode` to reverse geocode a point from its coordinates. The
returned zip code can then be used to fill in missing zip codes.
```{python}
z = sr.by_coordinates(40.769993, -73.915825, radius = 1)
z[0].zipcode
z[0].median_home_value
```
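These notes do not apply the reverse geocoding to the data frame itself; a minimal sketch (the helper `fill_zip` is hypothetical, and row-wise lookups over the whole data set can be slow) might look like this:
```{python}
#| eval: false
def fill_zip(row):
    """Infer a zip code from the coordinates when ZIP CODE is missing."""
    if (pd.isna(row["ZIP CODE"]) and pd.notna(row["LATITUDE"])
            and pd.notna(row["LONGITUDE"])):
        res = sr.by_coordinates(row["LATITUDE"], row["LONGITUDE"], radius=5)
        if res:  # by_coordinates may return an empty list
            return float(res[0].zipcode)  # ZIP CODE is stored as a float
    return row["ZIP CODE"]

jan23["ZIP CODE"] = jan23.apply(fill_zip, axis=1)
```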
Once we have found the zipcode, we can find its borough. See
the [complete NYC zip code list](https://bklyndesigns.com/new-york-city-zip-code/).
```{python}
def nyczip2borough_scalar(zipcode):
    """Map a single NYC zip code to its borough name."""
    nzip = int(zipcode)
    if nzip >= 10001 and nzip <= 10282:
        return "MANHATTAN"
    elif nzip >= 10301 and nzip <= 10314:
        return "STATEN ISLAND"
    elif nzip >= 10451 and nzip <= 10475:
        return "BRONX"
    elif nzip >= 11004 and nzip <= 11109:
        return "QUEENS"
    elif nzip >= 11351 and nzip <= 11697:
        return "QUEENS"
    elif nzip >= 11201 and nzip <= 11256:
        return "BROOKLYN"
    else:
        return np.nan
```
Let's try it out:
```{python}
nyczip2borough_scalar(z[0].zipcode)
```
Here is a vectorized version:
```{python}
import numpy as np
import pandas as pd
from typing import Union

def nyczip2borough(zips: Union[np.ndarray, pd.Series]) -> Union[np.ndarray, pd.Series]:
    """Map an array or Series of NYC zip codes to borough names."""
    is_series = isinstance(zips, pd.Series)   # remember the input type
    vals = zips.values if is_series else zips
    condlist = [
        (vals >= 10001) & (vals <= 10282),
        (vals >= 10301) & (vals <= 10314),
        (vals >= 10451) & (vals <= 10475),
        (vals >= 11004) & (vals <= 11109),
        (vals >= 11351) & (vals <= 11697),
        (vals >= 11201) & (vals <= 11256),
    ]
    choicelist = [
        "MANHATTAN",
        "STATEN ISLAND",
        "BRONX",
        "QUEENS",
        "QUEENS",
        "BROOKLYN",
    ]
    result = np.select(condlist, choicelist, default=np.nan)
    # return a Series aligned with the input index when a Series was given
    return pd.Series(result, index=zips.index) if is_series else result
```
Try it out:
```{python}
nyczip2borough(jan23["ZIP CODE"].dropna().head(10))
```
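The missing boroughs are not actually filled in these notes; a minimal sketch of doing so (not applied to the data here) could be:
```{python}
#| eval: false
# infer the borough where BOROUGH is missing but ZIP CODE is present
mask = jan23["BOROUGH"].isna() & jan23["ZIP CODE"].notna()
inferred = nyczip2borough(jan23.loc[mask, "ZIP CODE"])
# np.select may coerce the NaN default to the string "nan"; keep real labels only
jan23.loc[mask, "BOROUGH"] = inferred.where(inferred != "nan")
```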
The `uszipcode` package provides databases at the zip code level from the US
Census. Such information could be merged with the NYC crash data for further
analysis.
```{python}
from uszipcode import SearchEngine, SimpleZipcode
import os
# specify where the zip code database file is stored
db_file = os.path.join(os.getenv("HOME"), "simple_db.sqlite")
search = SearchEngine(db_file_path=db_file)
search.by_zipcode("10030")
```
By default, the SQLite database of US zip codes is stored under `$HOME/.uszipcode`;
here we pointed it to `db_file` instead. It can be read into a pandas `DataFrame`.
```{python}
import sqlite3
import pandas as pd
# connect to the zip code database file created above
con = sqlite3.connect(db_file)
zipdf = pd.read_sql_query("SELECT * from simple_zipcode", con)
zipdf.info()
```
The zip code dataframe can be merged with the crash dataframe, as done in the prediction section below.
## Map the Crash Sites
We can do this with package `gmplot`. See instructions from
[this tutorial](https://www.tutorialspoint.com/plotting-google-map-using-gmplot-package-in-python).
```{python}
import gmplot
import numpy as np

# prepare the geocodes, dropping rows where either coordinate is missing
coords = jan23[["LATITUDE", "LONGITUDE"]].dropna()
latitude = coords["LATITUDE"].values
longitude = coords["LONGITUDE"].values
# center of the map and zoom level
gmap = gmplot.GoogleMapPlotter(40.769737, -73.91244, 14)
# plot heatmap
gmap.heatmap(latitude, longitude)
gmap.scatter(latitude, longitude, c = 'r', marker = True)
# gmap.scatter(latitude, longitude, '#FF0000', size = 50, marker = False)
# Your Google API key
# gmap.apikey = "put your key here"
# save it to html
gmap.draw("nycrashmap.html")
```
## Predicting Injuries
Let's start by importing the cleaned data. We define `INJURY` as an indicator of
personal injury and create an `HOUR` variable for the crash hour.
```{python}
jan23 = pd.read_csv("data/nyc_crashes_202301_cleaned.csv")
jan23["INJURY"] = [1 if x > 0 else 0 for x in jan23["NUMBER OF PERSONS INJURED"]]
jan23["HOUR"] = jan23["CRASH TIME"].str.split(":").str[0].astype(int)
```
We import zip-code-level information from the `uszipcode` database and merge it with the crash data.
```{python}
zipdf = pd.read_sql_query("SELECT * from simple_zipcode WHERE state = 'NY'", con)
zipdf["zipcode"] = zipdf["zipcode"].astype(float)
crashdf = pd.merge(jan23, zipdf, left_on = "ZIP CODE", right_on = "zipcode")
```
Let's select a few columns that might be useful:
```{python}
crashdf = crashdf[["INJURY", "HOUR", "BOROUGH", "LATITUDE", "LONGITUDE",
"zipcode", "population_density", "median_home_value",
"median_household_income"]]
crashdf.info()
```
### Preparation
Categorical variables need to be coded appropriately. Ordinal variables can be
coded with `OrdinalEncoder` and nominal variables with `OneHotEncoder`, both
from `sklearn.preprocessing`. We want to treat `HOUR`, `BOROUGH`, and `zipcode`
as nominal categorical variables. One way to encode them is with `pd.get_dummies`.
```{python}
catVars = ["HOUR", "BOROUGH"] #, "zipcode"]
crashdf = pd.get_dummies(crashdf, columns = catVars, prefix = catVars)
crashdf.shape
```
It is important to note that we get one dummy variable for each level of a
categorical variable. For example, we get 5 dummies for `BOROUGH` and 24 dummies
for `HOUR`. If these dummies were to be used in a regression model, the model
matrix would not be of full rank; in other words, there would be perfect
collinearity among the covariates, and an error would occur in the model
fitting. For regularized regression models and many machine learning methods
(e.g., decision trees, random forests, SVMs), this is less of an issue.
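For an unregularized regression, a common remedy is to drop one reference level
per categorical variable; here is a toy illustration with a made-up dataframe
(not `crashdf`), using the `drop_first` option of `pd.get_dummies`:
```{python}
# toy example: `drop_first=True` drops one reference level per variable,
# which avoids the perfect collinearity described above
toy = pd.DataFrame({"BOROUGH": ["QUEENS", "BRONX", "QUEENS", "MANHATTAN"]})
pd.get_dummies(toy, columns=["BOROUGH"], drop_first=True)
```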
We need to standardize the continuous covariates so that their impacts are
comparable. Standardization also makes the numerical computation more stable:
if one variable has a variance 100 times larger than the others, it might
dominate the objective function and make it hard for the optimization
algorithm to find the optimum. This preprocessing can be done with the
`sklearn.preprocessing` module.
```{python}
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
conVars = ["population_density", "median_home_value", "median_household_income"]
crashdf[conVars] = scaler.fit_transform(crashdf[conVars])
crashdf.columns
```