---
title: "Data Wrangling I"
author: "[INSERT YOUR NAME]"
date: "`r Sys.Date()`"
output: pdf_document
editor_options:
  chunk_output_type: console
  markdown:
    wrap: 72
---
To demonstrate data wrangling we will use `flights`, a tibble in the
**nycflights13** R package. It includes characteristics of all flights
departing from New York City (JFK, LGA, EWR) in 2013.
```{r load-packages, message = FALSE}
library(tidyverse)
library(nycflights13) # includes the flights data
```
The data frame has over 336,000 observations (rows), `r nrow(flights)`
observations to be exact, so we will **not** view the entire data frame.
Instead we'll use the commands below to help us explore the data.
```{r glimpse-data}
glimpse(flights)
```
```{r column-names}
names(flights)
```
```{r explore-data}
head(flights)
```
The `head()` function returns "A tibble: 6 x 19" and then the first six rows of the `flights` data.
## Tibble vs. data frame
A **tibble** is an opinionated version of the `R` data frame. In other words, all tibbles are data frames, but not all data frames are tibbles!
There are two main differences between a tibble and a data frame:

1.  When you print a tibble, the first ten rows and all of the columns
    that fit on the screen will display, along with the type of each
    column.

    Let's look at the differences in the output when we type `flights`
    (tibble) in the console versus typing `cars` (data frame) in the
    console.

2.  Tibbles are somewhat stricter than data frames when it comes to
    subsetting data. If you use `$` to access a variable that doesn't
    exist in a tibble, you will get a warning message along with
    `NULL`. If you do the same with a data frame, you will get `NULL`
    with no message at all.

```{r tibble-v-data-frame}
flights$apple
cars$apple
```
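
The same comparison can be sketched on a small made-up data frame, which is quicker to inspect than `flights`. The objects `df` and `tbl` are hypothetical and exist only for this illustration. In recent tibble releases, `$` on a missing column warns and returns `NULL` rather than stopping, so the sketch checks for a warning.

```{r toy-tibble-vs-df, message = FALSE}
library(tibble)

# Hypothetical toy data, not part of nycflights13
df  <- data.frame(speed = c(4, 7, 8))
tbl <- as_tibble(df)

# Data frame: a missing column silently returns NULL
missing_df <- df$apple
is.null(missing_df)

# Tibble: record whether accessing a missing column raises a warning
got_warning <- tryCatch({ tbl$apple; FALSE }, warning = function(w) TRUE)
got_warning

# Data frames partially match column names with `$`; tibbles do not
partial_df  <- df$sp                     # matches `speed`
partial_tbl <- suppressWarnings(tbl$sp)  # NULL, no partial matching
```

The partial matching shown in the last two lines is another reason tibbles are considered stricter: a typo in a column name is less likely to go unnoticed.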
## Data wrangling with `dplyr`
**dplyr** is the primary package in the tidyverse for data wrangling.
[Click here](https://dplyr.tidyverse.org/) for the dplyr reference page.
[Click
here](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf)
for the dplyr cheatsheet.
Quick summary of key dplyr functions[^1]:
[^1]: From [dplyr
vignette](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html)
**Rows:**
- `filter()`: chooses rows based on column values.
- `slice()`: chooses rows based on location.
- `arrange()`: changes the order of the rows.
- `sample_n()`: takes a random subset of the rows.
**Columns:**
- `select()`: changes whether or not a column is included.
- `rename()`: changes the name of columns.
- `mutate()`: changes the values of columns and creates new columns.
**Groups of rows:**
- `summarise()`: collapses a group into a single row.
- `count()`: counts the unique values of one or more variables.
- `group_by()`: performs calculations separately for each value of a variable.
## `select()`
- Make a data frame that only contains the variables `dep_delay` and `arr_delay`.
```{r select-vars}
select(flights, dep_delay, arr_delay)
```
- Make a data frame that keeps every variable except `dep_delay`.
```{r exclude-vars}
# add code here
```
- Make a data frame that includes all variables from `year` through `dep_delay` (inclusive). These are all variables that provide information about the departure of each flight.
```{r include-range}
## add code
```
- Use the `select` helper `contains()` to make a data frame that includes the variables associated with the arrival, i.e., contains the string "arr_" in the name.
```{r arr-vars}
# add code
```
## The pipe
Before looking at more data wrangling functions, let's introduce the pipe. The **pipe**, `%>%`, is a technique for passing information from one process to
another. We will use `%>%` mainly in dplyr pipelines to pass the output of the previous line of code as the first argument of the next line of code.
When reading code "in English", say "and then" whenever you see a pipe.
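
As a small illustration of the "and then" reading, the sketch below runs the same computation twice on a made-up numeric vector `x` (not part of the flights data): once with nested calls and once with the pipe.

```{r toy-pipe, message = FALSE}
library(dplyr) # makes %>% available

x <- c(1, 4, 9, 16) # made-up values for illustration

# Nested calls are read from the inside out
nested <- round(mean(sqrt(x)), 1)

# With the pipe: take x, and then sqrt, and then mean, and then round
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```

Each step receives the previous result as its first argument, so `round(1)` here means `round(<previous result>, 1)`.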
**Question 1 (4 minutes)** The following code is equivalent to which line of code? Submit your response in Ed Discussion: https://edstem.org/us/courses/8027/discussion/590071
```{r pipe-demo}
flights %>%
  select(dep_delay, arr_delay) %>%
  head()
```
## `slice()`
- Select the first five rows of the `flights` data frame.
```{r slice}
flights %>%
  slice(1:5)
```
- Select the last two rows of the `flights` data frame.
```{r last-two}
flights %>%
  slice((n() - 1):n())
```
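
dplyr (version 1.0.0 and later) also provides `slice_head()` and `slice_tail()`, which may read more clearly than index arithmetic. These are an alternative to the `slice()` calls above, not what the exercise uses. A minimal sketch on a made-up tibble `toy`:

```{r toy-slice-helpers, message = FALSE}
library(dplyr)

# Hypothetical toy data for illustration
toy <- tibble(id = 1:6, value = c(10, 20, 30, 40, 50, 60))

# First two rows
first_two <- toy %>%
  slice_head(n = 2)

# Last two rows, written two ways
last_two <- toy %>%
  slice_tail(n = 2)

last_two_by_index <- toy %>%
  slice((n() - 1):n())

first_two$id
last_two$id
last_two_by_index$id
```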
## `arrange()`
- Let's arrange the data by departure delay, so the flights with the shortest departure delays will be at the top of the data frame. **What does it mean for the `dep_delay` to have a negative value?**
```{r arrange-delays}
flights %>%
  arrange(dep_delay)
```
- Now let's arrange the data by descending departure delay, so the flights with the longest departure delays will be at the top.
```{r arrange-delays-desc}
## add code
```
- **Question 2 (5 minutes)**: Create a data frame that only includes the plane tail number (`tailnum`), carrier (`carrier`), and departure delay for the flight with the longest departure delay. What is the plane tail number (`tailnum`) for this flight? Submit your response on Ed Discussion: https://edstem.org/us/courses/8027/discussion/590079
```{r max-delay}
## add code
```
## `filter()`
- Filter the data frame by selecting the rows where the destination airport is RDU.
```{r rdu}
flights %>%
  filter(dest == "RDU")
```
- We can also filter using more than one condition. Here we select all rows where the destination airport is RDU and the arrival delay is less than 0.
```{r rdu-ontime}
flights %>%
  filter(dest == "RDU", arr_delay < 0)
```
We can do more complex tasks using logical operators:
| operator | definition |
|:--------------|:-----------------------------|
| `<` | is less than? |
| `<=` | is less than or equal to? |
| `>` | is greater than? |
| `>=` | is greater than or equal to? |
| `==` | is exactly equal to? |
| `!=` | is not equal to? |
| `x & y` | is x AND y? |
| `x \| y` | is x OR y? |
| `is.na(x)` | is x NA? |
| `!is.na(x)` | is x not NA? |
| `x %in% y` | is x in y? |
| `!(x %in% y)` | is x not in y? |
| `!x` | is not x? |
The final operator only makes sense if `x` is logical (TRUE / FALSE).
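
To see a few of these operators in action, here is a minimal sketch on a made-up tibble `toy`. Its column names mirror `flights`, but the values are invented for illustration.

```{r toy-logical-ops, message = FALSE}
library(dplyr)

# Hypothetical toy data for illustration
toy <- tibble(
  dest      = c("RDU", "GSO", "BOS", "RDU", "ATL"),
  arr_delay = c(-5, 12, NA, 30, -2)
)

# %in%: keep rows whose destination is one of several values
in_nc <- toy %>%
  filter(dest %in% c("RDU", "GSO"))

# |: keep rows where at least one condition is TRUE
atl_or_late <- toy %>%
  filter(dest == "ATL" | arr_delay > 20)

# !is.na(): keep rows where the arrival delay is recorded
has_delay <- toy %>%
  filter(!is.na(arr_delay))

nrow(in_nc)
nrow(atl_or_late)
nrow(has_delay)
```

Note that `filter()` keeps only rows where the condition is `TRUE`, so the BOS row, whose comparison evaluates to `NA`, is dropped from `atl_or_late`.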
- **Question 3 (4 minutes)**: Describe what the code is doing in words. Submit your response in Ed Discussion: https://edstem.org/us/courses/8027/discussion/590083
```{r nc-early}
flights %>%
  filter(dest %in% c("RDU", "GSO"),
         arr_delay < 0 | dep_delay < 0)
```
## `count()`
- Create a frequency table of the destination locations for flights from New York.
```{r count-dest}
flights %>%
  count(dest)
```
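
`count()` also accepts a `sort = TRUE` argument that puts the most frequent values first. A minimal sketch on a made-up tibble `toy` (invented values, for illustration only):

```{r toy-count, message = FALSE}
library(dplyr)

# Hypothetical toy data for illustration
toy <- tibble(dest = c("RDU", "BOS", "RDU", "ATL", "BOS", "RDU"))

# Default: one row per destination, ordered by destination
by_dest <- toy %>%
  count(dest)

# sort = TRUE: ordered from most to least frequent
by_freq <- toy %>%
  count(dest, sort = TRUE)

by_dest
by_freq
```

The counts are stored in a column named `n`, which can be used in later steps such as `arrange()`.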
- Which month had the fewest flights? How many flights were there in that month?
```{r count-month}
## add code
```
- **Question 4 (5 minutes)**: On which date (month + day) was there the largest number of flights? How many flights were there on that day? Submit your response on Ed Discussion: https://edstem.org/us/courses/8027/discussion/590086
```{r count-date}
## add code
```
## `mutate()`
Use `mutate()` to create a new variable.
- In the code chunk below, `air_time` (minutes in the air) is converted to hours, and then a new variable `mph` is created, giving the average speed of the flight in miles per hour.
```{r calculate-mph}
flights %>%
  mutate(hours = air_time / 60,
         mph = distance / hours) %>%
  select(air_time, distance, hours, mph)
```
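
The arithmetic is easier to check by hand on a few rows. The sketch below repeats the same `mutate()` on a made-up tibble `toy` with round numbers (invented values, not real flights).

```{r toy-mutate, message = FALSE}
library(dplyr)

# Hypothetical toy data: air time in minutes, distance in miles
toy <- tibble(
  air_time = c(60, 90, 120),
  distance = c(300, 600, 1000)
)

toy_speed <- toy %>%
  mutate(hours = air_time / 60,   # 1, 1.5, 2
         mph = distance / hours)  # uses `hours`, created just above

toy_speed
```

Within a single `mutate()` call, a later expression can refer to a column created earlier in the same call, which is why `mph` can use `hours` here.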
- **Question (4 minutes)**: Create a new variable to calculate the percentage of flights in each month. What percentage of flights take place in July?
```{r months-perc}
## add code
```
## `summarize()` / `summarise()`
`summarise()` collapses the rows into summary statistics and removes columns irrelevant to the calculation.
Be sure to name your columns!
```{r find-mean-delay}
flights %>%
  summarise(mean_dep_delay = mean(dep_delay))
```
**Question:** Why did this code return `NA`?
Let's fix it.
```{r find-mean-delay-no-na}
flights %>%
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE))
```
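
The effect of `na.rm = TRUE` can be seen on a short made-up vector `delays` (invented values for illustration):

```{r toy-na-rm}
# Hypothetical delays in minutes, with one missing value
delays <- c(5, NA, 10)

mean_with_na <- mean(delays)                # NA: one unknown value makes the mean unknown
mean_no_na   <- mean(delays, na.rm = TRUE)  # 7.5: the NA is dropped before averaging
n_missing    <- sum(is.na(delays))          # number of missing values

mean_with_na
mean_no_na
n_missing
```

Counting the missing values alongside the mean is a useful habit, since `na.rm = TRUE` changes how many observations the summary is based on.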
### `group_by()`
`group_by()` is used for grouped operations. It's very powerful when
paired with `summarise()` to calculate summary statistics by group.
Here we find the mean and standard deviation of departure delay for each month.
```{r delays-by-month}
flights %>%
  group_by(month) %>%
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE),
            sd_dep_delay = sd(dep_delay, na.rm = TRUE))
```
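
A grouped summary returns one row per group. The sketch below shows this on a made-up tibble `toy` with two months (invented values), using `n()` to count the rows in each group.

```{r toy-group-summarise, message = FALSE}
library(dplyr)

# Hypothetical toy data for illustration
toy <- tibble(
  month     = c(1, 1, 2, 2, 2),
  dep_delay = c(10, NA, 0, 4, 8)
)

toy_summary <- toy %>%
  group_by(month) %>%
  summarise(n_flights = n(),                                # rows per month, including NA delays
            mean_dep_delay = mean(dep_delay, na.rm = TRUE)) # mean of the non-missing delays

toy_summary
```

Note that `n()` counts every row in the group, so month 1 reports two flights even though only one of them has a recorded delay.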
- **Question 5 (4 minutes)**: What is the median departure delay for each of the airports around NYC (`origin`)? Which airport has the shortest median departure delay? Submit your response on Ed Discussion: https://edstem.org/us/courses/8027/discussion/590091
```{r dep-origin}
## add code
```
## Additional Practice
(1) Create a new dataset that only contains flights that do not have a missing departure time. Include the columns `year`, `month`, `day`,
`dep_time`, `dep_delay`, and `dep_delay_hours` (the departure delay in hours). *Hint: Note you may need to use `mutate()` to make one or more of these variables.*
```{r add-practice-1}
```
(2) For each airplane (uniquely identified by `tailnum`), use a
`group_by()` paired with `summarize()` to find the sample size,
mean, and standard deviation of flight distances. Then include only the top 5 and bottom 5 airplanes in terms of mean distance traveled per flight in the final data frame.
```{r add-practice-2}
```