# Exercises
1. Pick up Git basics and set up an account at GitHub if you don't have
one. Please practice the tips on Git in the notes. Make sure you have at
least 10 commits in the repo, each with an informative message. Keep checking
the status of your repo with `git status`. My grader will grade the repo.
    1. Clone the `ids-s23` repo to your own computer.
    1. Add your name and wishes to the Wishlist; commit with an informative message.
    1. Remove the `Last, First` entry from the list; commit.
    1. Create a new file called `add.qmd` containing a few lines of text; commit.
    1. Remove `add.qmd` (pretending that this is by accident); commit.
    1. Recover the accidentally removed file `add.qmd`; add a long line (a
       paragraph without a hard break); add a short line (under 80 characters);
       commit.
    1. Change one word in the long line and one word in the short line; use
       `git diff` to see the difference from the last commit; commit.
    1. Put the repo into the GitHub Classroom homework repo with `git remote add` and `git push`.
1. Get ready for contributing to the class notes.
    1. Create a fork of the `ids-s23` repo into your own GitHub account.
    1. Clone it to your local computer.
    1. Make a new branch to experiment with your changes.
    1. Check out your branch and add your wishes to the wish list; push to your
       GitHub account.
    1. Make a pull request to my `ids-s23` repo from your fork at GitHub. Make
       sure you have clear messages to document the changes.
1. Write a function to demonstrate the Monty Hall problem through
simulation. The function takes two arguments `ndoors` and
`ntrials`, representing the number of doors in the experiment and
the number of trials in a simulation, respectively. The function
should return the proportion of wins for both the switch and
no-switch strategies. Apply your function with 3 doors and 5 doors,
both with 1000 trials. Include sufficient text around the code to explain
them.
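
One possible shape for the Monty Hall simulation is sketched below. The exercise does not say how the game generalizes beyond three doors, so this sketch assumes that the host opens exactly one goat door and that a switching player moves to a random remaining closed door; the `seed` argument is an extra added for reproducibility.

```python
import random


def monty_hall(ndoors, ntrials, seed=None):
    """Simulate the Monty Hall game; return (switch_rate, stay_rate).

    Assumed rules: the host opens exactly one goat door that is not the
    player's pick, and a switching player moves to a random closed door.
    """
    if ndoors < 3:
        raise ValueError("ndoors must be at least 3")
    rng = random.Random(seed)
    doors = range(ndoors)
    switch_wins = stay_wins = 0
    for _ in range(ntrials):
        prize = rng.randrange(ndoors)
        pick = rng.randrange(ndoors)
        # Host opens one door that hides a goat and is not the player's pick.
        opened = rng.choice([d for d in doors if d != pick and d != prize])
        # Switching: move to a random door that is neither picked nor opened.
        switched = rng.choice([d for d in doors if d != pick and d != opened])
        stay_wins += pick == prize
        switch_wins += switched == prize
    return switch_wins / ntrials, stay_wins / ntrials


for ndoors in (3, 5):
    switch_rate, stay_rate = monty_hall(ndoors, 1000, seed=2023)
    print(f"{ndoors} doors: switch {switch_rate:.3f}, stay {stay_rate:.3f}")
```

Under these assumed rules the theoretical winning probabilities are 2/3 (switch) and 1/3 (stay) with three doors, and 4/15 (switch) and 1/5 (stay) with five doors, which gives a reference for judging the simulated proportions.
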
1. Write a function to do a Monte Carlo approximation of $\pi$. The
function takes a Monte Carlo sample size `n` as input, and returns
a point estimate of $\pi$ and a 95% confidence interval. Apply your
function with sample sizes 1000, 2000, 4000, and 8000. Repeat the experiment
1000 times for each sample size and check the empirical probability that the
confidence intervals cover the true value of $\pi$. Comment on
the results.
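
A minimal sketch of the Monte Carlo approximation of $\pi$ follows. It assumes the common quarter-circle method (the share of uniform points in the unit square with $x^2 + y^2 \le 1$ estimates $\pi/4$) and a normal-approximation (Wald) interval; the exercise does not prescribe either choice, and the helper `coverage` is a hypothetical name.

```python
import math
import random


def mc_pi(n, rng=random):
    """Return (estimate, lower, upper) for pi from n uniform points.

    The share of points in the unit square that fall inside the quarter
    circle estimates pi / 4; the interval is a 95% Wald (normal) interval.
    """
    hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1 for _ in range(n))
    p_hat = hits / n
    se = 4 * math.sqrt(p_hat * (1 - p_hat) / n)
    estimate = 4 * p_hat
    return estimate, estimate - 1.96 * se, estimate + 1.96 * se


def coverage(n, nrep=1000, seed=2023):
    """Empirical share of nrep intervals that contain the true pi."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(nrep):
        _, lower, upper = mc_pi(n, rng)
        covered += lower <= math.pi <= upper
    return covered / nrep


# nrep is kept small here for speed; the exercise asks for 1000 replicates.
for n in (1000, 2000, 4000, 8000):
    print(n, coverage(n, nrep=200))
```

If the interval behaves as intended, the empirical coverage should be close to 0.95 for every sample size, while the interval width shrinks at the rate $1/\sqrt{n}$.
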
1. Find the first 10-digit prime number occurring in consecutive
digits of $e$. This was a
[Google recruiting ad](http://mathworld.wolfram.com/news/2004-10-13/google/).
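
One way to approach the prime search is sketched here using only the standard library: the `decimal` module supplies the digits of $e$ and trial division checks primality. The cap of 500 digits and the function names are assumptions made for illustration, and the leading 2 is treated as part of the digit string.

```python
from decimal import Decimal, localcontext


def is_prime(n):
    """Trial division; fast enough for 10-digit integers."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def first_prime_in_e(ndigits=10, precision=500):
    """Return the first ndigits-digit prime in the digits of e, or None.

    Only the first `precision` digits are searched (an arbitrary cap).
    """
    with localcontext() as ctx:
        ctx.prec = precision + 10  # a few guard digits against rounding
        digits = str(Decimal(1).exp()).replace(".", "")[:precision]
    for i in range(len(digits) - ndigits + 1):
        window = digits[i:i + ndigits]
        # Skip windows with a leading zero: they are not ndigits-digit numbers.
        if window[0] != "0" and is_prime(int(window)):
            return int(window)
    return None


print(first_prime_in_e())
```
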
1. The NYC motor vehicle collisions data with documentation is available from
[NYC Open
Data](https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95).
The raw data needs some cleaning. (JY: Add variable name cleaning next year.)
    1. Use the filter from the website to download the crash data of January
       2023; save it under a directory `data` with an informative name
       (e.g., `nyc_crashes_202301.csv`).
    1. Get basic summaries of each variable: missing percentage; descriptive
       statistics for continuous variables; frequency tables for discrete
       variables.
    1. Do the `LATITUDE` and `LONGITUDE` values all look legitimate? If not
       (e.g., zeroes), code them as missing values.
    1. If `OFF STREET NAME` is not missing, are there any missing `LATITUDE` and
       `LONGITUDE`? If so, geocode the addresses.
    1. (Optional) Are the missing patterns of `ON STREET NAME` and `LATITUDE` the same?
       Summarize the missing patterns by a cross table. If `ON STREET NAME` and
       `CROSS STREET NAME` are available, use geocoding by intersection to fill
       the `LATITUDE` and `LONGITUDE`.
    1. Are `ZIP CODE` and `BOROUGH` always missing together? If `LATITUDE` and
       `LONGITUDE` are available, use reverse geocoding to fill the `ZIP CODE`
       and `BOROUGH`.
    1. Print the whole frequency table of
       `CONTRIBUTING FACTOR VEHICLE 1`.
       Convert lowercase letters to uppercase and check the frequencies again.
    1. Given an opportunity to meet the data provider, what suggestions do
       you have to make the data better based on your data exploration
       experience?
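
Several of the cleaning steps above can be sketched with pandas. The rows below are made-up toy data standing in for the downloaded file, so the printed numbers say nothing about the real crashes; only the column names come from the exercise.

```python
import numpy as np
import pandas as pd

# Toy rows (invented for illustration). The real data would be read with
# something like pd.read_csv("data/nyc_crashes_202301.csv").
crashes = pd.DataFrame({
    "LATITUDE": [40.71, 0.0, np.nan, 40.65],
    "LONGITUDE": [-73.99, 0.0, np.nan, -73.95],
    "BOROUGH": ["MANHATTAN", None, None, "BROOKLYN"],
    "ZIP CODE": ["10002", None, "11201", "11226"],
    "CONTRIBUTING FACTOR VEHICLE 1": [
        "Unspecified", "UNSPECIFIED", "Driver Inattention/Distraction", None,
    ],
})

# Code implausible coordinates (zeroes) as missing values.
coords = ["LATITUDE", "LONGITUDE"]
crashes[coords] = crashes[coords].replace(0, np.nan)

# Percentage of missing values for each variable.
missing_pct = crashes.isna().mean() * 100

# Cross table of the missing patterns of two variables.
missing_pattern = pd.crosstab(
    crashes["ZIP CODE"].isna(), crashes["BOROUGH"].isna()
)

# Frequencies of the contributing factor after converting to uppercase.
factor = crashes["CONTRIBUTING FACTOR VEHICLE 1"].str.upper()
factor_counts = factor.value_counts(dropna=False)

print(missing_pct.round(1), missing_pattern, factor_counts, sep="\n\n")
```

Off-diagonal counts in the cross table flag rows where one variable is missing and the other is not, which are the candidates for geocoding or reverse geocoding.
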
1. Except for the first problem, use the cleaned data set with missing geocode
imputed (`data/nyc_crashes_202301_cleaned.csv`).
    1. Construct a contingency table for missingness in geocode (latitude and
       longitude) by borough. Is the missing pattern the same across boroughs?
       Formulate a hypothesis and test it.
    1. Construct an `hour` variable with integer values from 0 to 23. Plot the
       histogram of the number of crashes by `hour`. Plot it by borough.
    1. Overlay the locations of the crashes on a map of NYC. The map could be a
       static map or a Google map.
    1. Create a new variable `injury` which is 1 if the number of persons
       injured is 1 or more, and 0 otherwise. Construct a cross table for
       `injury` versus borough. Test the null hypothesis that the two variables are
       not associated.
    1. Merge the crash data with the zip code database.
    1. Fit a logistic model with `injury` as the outcome variable and covariates
       that are available in the data or can be engineered from the data. For
       example, zip code level covariates can be obtained by merging with the
       zip code database.
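
The `hour` and `injury` variables and the cross table from the items above might be built as follows. This is a sketch on invented toy rows: the column names `CRASH TIME` and `NUMBER OF PERSONS INJURED` and the `H:MM` time format are assumptions about the file and should be checked against the actual data.

```python
import pandas as pd

# Toy rows (invented). "CRASH TIME" and "NUMBER OF PERSONS INJURED" are
# assumed column names; adjust them to match the cleaned file.
crashes = pd.DataFrame({
    "CRASH TIME": ["0:05", "8:30", "14:45", "23:59", "8:10", "17:20"],
    "BOROUGH": ["BRONX", "BROOKLYN", "BRONX", "QUEENS", "BROOKLYN", "QUEENS"],
    "NUMBER OF PERSONS INJURED": [0, 2, 1, 0, 0, 0],
})

# Hour of day as an integer from 0 to 23, taken from the "H:MM" string.
crashes["hour"] = crashes["CRASH TIME"].str.split(":").str[0].astype(int)

# injury is 1 when at least one person was injured, and 0 otherwise.
crashes["injury"] = (crashes["NUMBER OF PERSONS INJURED"] >= 1).astype(int)

# Cross table of injury versus borough.
table = pd.crosstab(crashes["injury"], crashes["BOROUGH"])

# Pearson chi-squared statistic from observed and expected counts.
observed = table.to_numpy()
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0)
expected = row_totals * col_totals / observed.sum()
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(table)
print(f"chi2 = {chi2:.3f} on {dof} degrees of freedom")
```

A p-value follows from comparing the statistic with a chi-squared distribution on `dof` degrees of freedom; `scipy.stats.chi2_contingency` performs the whole test in one call if SciPy is available.
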
1. Using the cleaned NYC crash data, perform classification of `injury` with
a support vector machine and compare the results with the benchmark from
regularized logistic regression. Use the last week's data as testing data.
    1. Explain the parameters you used in your fitting for each method.
    2. Explain the confusion matrix result from each fit.
    3. Compare the performance of the two approaches in terms of accuracy,
       precision, recall, F1-score, and AUC.
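
To make the comparison metrics concrete, the sketch below computes them from scratch for 0/1 labels. The labels and scores are invented toy values, not model output, and the function names are hypothetical; in practice a library such as scikit-learn provides equivalent functions.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: scores are thresholded at 0.5 to obtain predicted labels.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]
y_pred = [int(s >= 0.5) for s in scores]
print(classification_metrics(y_true, y_pred), auc(y_true, scores))
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas AUC is computed from the scores themselves, so the two methods should be compared on both kinds of measure.
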
1. (Mid-term team project) The [NYC Open Data of 311 Service Requests](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9) contains
all requests from 2010 to present. We consider a subset of it with request
time between 00:00:00 01/15/2023 and 24:00:00 01/21/2023. The subset is
available in CSV format as `data/nyc311_011523-012123_by022023.csv`. Read the
data dictionary to understand the meaning of the variables.
    1. Clean the data: fill missing fields as much as possible; check for
       obvious data entry errors (e.g., can `Closed Date` be earlier than
       `Created Date`?); summarize your suggestions to the data curator in
       several bullet points.
    1. Remove requests that are not made to NYPD and create a new variable
       `duration`, which represents the time period from the `Created Date` to
       `Closed Date`. Note that `duration` may be censored for some
       requests. Visualize the distribution of uncensored `duration` by
       weekdays/weekends and by borough, and test whether the distributions
       are the same across weekdays/weekends of their creation
       and across boroughs.
    1. Define a binary variable `over3h` which is 1 if `duration` is greater
       than 3 hours. Note that it can be obtained even for censored `duration`.
       Build a model to predict `over3h`.
       If your model has tuning parameters, justify their choices. Apply
       this model to the 311 requests of NYPD in the week of 01/22/2023. Assess
       the performance of your model.
    1. Now you know the data quite well. Come up with a research question of
       interest that can be answered by the data, which could be analytics or
       visualizations. Perform the needed analyses and answer your question.
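
For the `duration` and `over3h` items of the 311 project, a possible construction is sketched below on invented toy rows. It assumes an `Agency` column identifies NYPD requests and that open requests are censored at a data extraction time `as_of`; the date used here is only a guess based on the `by022023` part of the file name.

```python
import pandas as pd

# Assumed extraction time of the data; open requests are censored here.
as_of = pd.Timestamp("2023-02-20")

# Toy rows (invented). "Agency" is an assumed column name.
requests = pd.DataFrame({
    "Agency": ["NYPD", "NYPD", "DOT", "NYPD"],
    "Created Date": pd.to_datetime([
        "2023-01-15 08:00", "2023-01-16 22:30",
        "2023-01-17 09:00", "2023-01-21 13:00",
    ]),
    "Closed Date": pd.to_datetime([
        "2023-01-15 09:15", "2023-01-17 03:00",
        "2023-01-17 10:00", None,
    ]),
})

# Keep NYPD requests only.
nypd = requests[requests["Agency"] == "NYPD"].copy()

# A request without a closed date is censored: its duration is only known
# to be at least the time elapsed until as_of.
nypd["censored"] = nypd["Closed Date"].isna()
end = nypd["Closed Date"].fillna(as_of)
nypd["duration"] = (end - nypd["Created Date"]).dt.total_seconds() / 3600

# over3h is known for a censored request whenever its lower bound exceeds 3.
nypd["over3h"] = (nypd["duration"] > 3).astype(int)

# Weekend indicator of the creation date (Saturday = 5, Sunday = 6).
nypd["weekend"] = nypd["Created Date"].dt.dayofweek >= 5

print(nypd[["duration", "censored", "over3h", "weekend"]])
```

A censored request whose elapsed time is still below 3 hours would have an undetermined `over3h`; with an extraction time weeks after the request window this case should not arise, but it is worth checking.
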