---
title: "Intro to R Markdown and Tidyverse"
author: "Monica Alexander"
date: "11 January 2022"
output:
  pdf_document:
    toc: true
    number_sections: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
# By the end of this lab you should know the basics of
- RStudio Projects
- R Markdown
- Main tidyverse functions
- `ggplot`
# RStudio Projects
RStudio projects are associated with R working directories. They are good to use for several reasons:
- Each project has its own working directory, which makes dealing with file paths easier
- They make it easy to divide your work into multiple contexts
- You can open multiple projects at one time in separate windows
To make a new project in RStudio, go to File --> New Project. If you've already set up a repo for this class, then select 'Existing Directory' and choose the folder that will contain all your class materials. This will open a new RStudio window, named after your folder.
In future, when you want to do work for this class, go to your class folder and open the .Rproj file. This will open an RStudio window, with the correct working directory, and show the files you were last working on.
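A quick way to check the working directory from the console is sketched below. It uses only base R; the `data` subfolder is a hypothetical name used for illustration.

```{r}
# Print the current working directory; when RStudio is opened via the
# .Rproj file, this is the project folder
project_dir <- getwd()
project_dir

# Relative paths are resolved against the working directory.
# "data" is a hypothetical subfolder; file.path() only builds the path
# and does not require the folder to exist
data_dir <- file.path(project_dir, "data")
data_dir
```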
# R Markdown
This is an R Markdown document. R Markdown allows you to create nicely formatted documents (HTML, PDF, Word) that also include code and the output of that code. This is good because it's reproducible, and it also makes reports easier to update when new data come in. Each of the grey chunks contains R code, just like a normal R script. You can choose to run each chunk separately, or knit the whole document using the Knit button above, which creates your document.
To start a new R Markdown file in RStudio, go to File --> New File --> R Markdown, then select Document and the format you want to compile the document to (I chose PDF). Notice that this and the other inputs (title, author) are used to create the YAML header, the bit at the start of the document. You can edit this, as I have here, for example to include a table of contents.
There are various options for output code, results, etc. For example, if you don't want your final report to include the code (but just the output, e.g. graphs or tables) then you can specify `echo=FALSE` at the beginning of the chunk within the curly brackets (or set global options like I have done above).
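As a minimal sketch of a chunk-level option, the chunk below sets `echo=FALSE`, so the knitted document shows the result but not the code. The vector `lab_values` is made up for illustration.

```{r, echo=FALSE}
# echo=FALSE hides this code in the knitted document; only the output is shown
lab_values <- c(2, 4, 6)
mean(lab_values)
```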
## Writing math
Writing equations is essentially the same as in LaTeX. You can write inline equations using the \$ e.g. $y = ax+b$. You can write equations on a separate line with two \$s e.g.
$$
y = ax + b
$$
In pdf documents you can have numbered equations using
\begin{equation}
y = ax + b
\end{equation}
Getting Greek letters, symbols, subscripts, bars etc. is the same as in LaTeX. A few examples are below:
- $Y_{i,j}$
- $\bar{X} = \frac{\sum_{i = 1}^n X_i}{n}$
- $\alpha \beta \gamma$
- $X \rightarrow Y$
- $Y \sim N(\mu, \sigma^2)$
# Tidyverse
Load the packages that we'll be using:
```{r}
#install.packages("tidyverse")
library(tidyverse)
```
```{r}
library(dplyr)
library(readr)
```
On top of the base R functionality, there are lots of packages that different people have made to improve the usability of the language. One of the most successful suites of packages is now called the 'tidyverse'. The tidyverse contains a range of functionality that helps to manipulate and visualize data.
Read in mortality rates for Ontario. These data come from the [Canadian Human Mortality Database](http://www.bdlc.umontreal.ca/chmd/prov/ont/ont.htm).
```{r}
dm <- read_table("https://www.prdh.umontreal.ca/BDLC/data/ont/Mx_1x1.txt", skip = 2, col_types = "ddddd")
head(dm)
```
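The object `dm` is a data frame, or tibble. Every column of a tibble can be a different data type (e.g. integers and characters), although here `col_types = "ddddd"` reads all five columns as doubles. Values that cannot be parsed as numbers are read as `NA`.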
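To illustrate columns of different types, here is a small sketch with a made-up tibble (`toy_types` and its values are illustrative, not part of the Ontario data).

```{r}
library(tibble)

toy_types <- tibble(
  year = c(1935L, 1936L),        # integer
  sex = c("Female", "Male"),     # character
  rate = c(0.0021, 0.0025)       # double
)

# class of each column
sapply(toy_types, class)
```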
## Important tidyverse functions
You should feel comfortable using the following functions
- The pipe `%>%`
- `filter`
- `select`
- `arrange`
- `mutate`
- `group_by`
- `summarize`
- `pivot_longer` and `pivot_wider`
## Piping, filtering, selecting, arranging
A central part of manipulating tibbles is using the `%>%` operator. This is a pipe, and should be read as saying 'and then'.
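To make the 'and then' reading concrete, the sketch below compares a nested call with the equivalent piped call. The vector `toy_rates` is made up for illustration.

```{r}
library(dplyr)

toy_rates <- c(0.004, 0.002, 0.009)

# nested call: read from the inside out
nested <- round(mean(toy_rates), 3)

# piped call: take toy_rates, and then take the mean, and then round
piped <- toy_rates %>%
  mean() %>%
  round(3)

# both forms run the same computation
identical(nested, piped)
```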
For example, say we just want to pull out mortality rates for 1935. We would take our tibble *and then* filter to only include 1935:
```{r}
dm %>%
  filter(Year == 1935) # two equals signs test for equality
```
You can also filter by more than one condition; say we just wanted to look at 10 year olds in 1935:
```{r}
dm %>%
  filter(Year == 1935, Age == 10)
```
If we only wanted to look at 10 year olds in 1935 who were female, we could filter *and then* select the female column.
```{r}
dm %>%
  filter(Year == 1935, Age == 10) %>%
  select(Female)
```
You can also remove columns by selecting the negative of that column name.
```{r}
dm %>%
  select(-Total)
```
Sort the tibble according to a particular column using `arrange`, for example, Year in descending order:
```{r}
dm %>%
  arrange(-Year)
```
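The minus sign only works for numeric columns. As a small sketch on a made-up tibble (`toy_sex` is illustrative), `desc()` sorts in descending order for character columns as well:

```{r}
library(dplyr)

toy_sex <- tibble(
  sex = c("Female", "Male", "Total"),
  rate = c(0.002, 0.003, 0.0025)
)

# desc() sorts in descending order and, unlike a minus sign,
# also works on character columns
toy_sex %>%
  arrange(desc(sex))
```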
NOTE: none of the above operations save their results. To keep a result, assign it to a new object:
```{r}
dm_filter <- dm %>% filter(Year==1935)
dm_filter
```
## Grouping, summarizing, mutating
In addition to `filter` and `select`, two useful functions are `mutate`, which allows you to create new variables, and `summarize`, which allows you to produce summary statistics. These are particularly powerful when combined with `group_by()` which allows you to do any operation on a tibble by group.
For example, let's create a new variable that is the ratio of male to female mortality at each age and year:
```{r}
dm %>%
  mutate(mf_ratio = Male/Female)
```
Now, let's calculate the mean female mortality rate by age over all the years. To do this, we need to `group_by` Age, and then use `summarize` to calculate the mean:
```{r}
# mean female mortality rate by age over all years
dm %>%
  group_by(Age) %>%
  summarize(mean_mort = mean(Female))
```
Mean female mortality rate over all ages and years:
```{r}
dm %>%
  summarize(mean_mort = mean(Female, na.rm = TRUE))
```
Mean of males and females by age:
```{r}
dm %>%
  group_by(Age) %>%
  summarize(mean_male = mean(Male),
            mean_female = mean(Female))
```
Alternatively, using `summarize_at` (superseded by `across` since dplyr 1.0.0):
```{r}
dm %>%
  group_by(Age) %>%
  summarize_at(vars(Female:Male), mean)
```
Now using `across`:
```{r}
dm %>%
  group_by(Age) %>%
  summarize(across(Female:Male, mean))
```
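`group_by()` also combines with `mutate`, which the examples above do not show. The sketch below uses a small made-up tibble (`toy_mort` and its values are illustrative, not taken from the Ontario data) to express each rate relative to the mean rate for its age.

```{r}
library(dplyr)

toy_mort <- tibble(
  Year = c(2000, 2001, 2000, 2001),
  Age = c(30, 30, 60, 60),
  Female = c(0.001, 0.003, 0.010, 0.030)
)

# mutate() after group_by() computes mean(Female) within each age
# and keeps one row per original row (unlike summarize)
toy_relative <- toy_mort %>%
  group_by(Age) %>%
  mutate(rel_to_age_mean = Female / mean(Female)) %>%
  ungroup()
toy_relative
```

Calling `ungroup()` at the end is a design choice: it stops later operations on `toy_relative` from being silently applied by group.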
## Pivoting
We often need to switch between wide and long data formats. The `dm` tibble is currently in wide format. To get it in long format we can use `pivot_longer`:
```{r}
dm_long <- dm %>%
  pivot_longer(Female:Total, names_to = "sex", values_to = "mortality")
dm_long
```
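Going the other way uses `pivot_wider`. The sketch below is a round trip on a made-up tibble (`toy_wide` is illustrative): it reshapes to long format and back again.

```{r}
library(tidyr)
library(dplyr)

toy_wide <- tibble(
  Year = c(2000, 2001),
  Female = c(0.001, 0.002),
  Male = c(0.002, 0.004)
)

toy_long <- toy_wide %>%
  pivot_longer(Female:Male, names_to = "sex", values_to = "mortality")

# pivot_wider() reverses the reshape: one column per value of `sex`
toy_back <- toy_long %>%
  pivot_wider(names_from = sex, values_from = mortality)
toy_back
```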
## Using ggplot
You can plot things in R using the base `plot` function, but plots using `ggplot` are much prettier.
Say we wanted to plot the mortality rates for 30 year old males over time. In the function `ggplot`, we need to specify our data (in this case, a filtered version of `dm`), an x axis (Year) and y axis (Male). The axes are defined within the `aes()` function, which stands for 'aesthetics'.
First let's get our data:
```{r}
d_to_plot <- dm %>%
  filter(Age == 30) %>%
  select(Year, Male)
d_to_plot
```
Now start the ggplot:
```{r}
p <- ggplot(data = d_to_plot, aes(x = Year, y = Male))
p
```
Notice the object `p` is just an empty box. The key to ggplot is layering: we now want to specify that we want a line plot using `geom_line()`:
```{r}
p + geom_line()
```
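The layering idea can be checked directly. The sketch below uses made-up data (`toy_series` is illustrative) and inspects the `layers` element of the plot object. Relying on that element is an assumption about ggplot2's internal structure, not something the lab requires.

```{r}
library(ggplot2)

toy_series <- data.frame(
  Year = 2000:2004,
  Male = c(0.0020, 0.0019, 0.0018, 0.0018, 0.0017)
)

# the base object holds the data and aesthetic mapping but no layers
p_toy <- ggplot(data = toy_series, aes(x = Year, y = Male))
length(p_toy$layers)

# adding a geom appends one layer
p_toy_line <- p_toy + geom_line()
length(p_toy_line$layers)
```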
Let's change the color of the line, and the y-axis label, and give the plot a title:
```{r}
p +
  geom_line(col = "purple") +
  labs(y = "Mortality", title = "Male mortality rates in Ontario over time")
```
### More than one group
Now say we wanted to have trends for 30-year-old males and females on one plot. The easiest way to do this is to first reshape our data so it's in long format: instead of having a column for each sex, we have one column indicating the sex, and another column indicating the mortality rate:
```{r}
dm_to_long <- dm %>%
  pivot_longer(Female:Total, names_to = "sex", values_to = "mortality") %>%
  filter(Age == 30, sex != "Total")
```
Now we can do a similar plot to before but we now have an added component in the `aes()` function: color, which is determined by sex:
```{r}
dm_to_long %>%
  ggplot(aes(Year, mortality, color = sex)) +
  geom_line()
```
### Faceting
A neat thing about ggplot is that it's relatively easy to create 'facets' or smaller graphs divided by groups. Say we wanted to look at trends for 30 year olds and 60 year olds for both males and females. Let's get the data ready to plot:
```{r}
dm_to_plot <- dm %>%
  select(-Total) %>%
  filter(Age == 30 | Age == 60) %>%
  pivot_longer(Female:Male, names_to = "sex", values_to = "mortality")
dm_to_plot
```
Now let's plot, with a separate facet for each sex:
```{r}
dm_to_plot %>%
  ggplot(aes(Year, mortality, color = as.factor(Age))) +
  facet_grid(~sex) +
  geom_line() +
  scale_color_brewer(name = "Age", palette = "Set1")
```
# Lab Exercises
1. Plot the ratio of male to female mortality rates over time for ages 10, 20, 30 and 40 (different color for each age) and change the theme
```{r}
dm %>%
  filter(Age %in% c(10, 20, 30, 40)) %>%
  ggplot(aes(Year, Male/Female, color = as.factor(Age))) +
  geom_line() +
  theme_classic() +
  scale_color_brewer(name = "Age", palette = "Set1")
```
2. Find the age that has the highest female mortality rate each year
```{r}
dm %>%
  filter(!is.na(Age)) %>% # drop rows where Age could not be parsed
  group_by(Year) %>%
  filter(Female == max(Female, na.rm = TRUE)) %>%
  select(Year, Age)
```
3. Use the `summarize_at()` function OR `summarize(across())` to calculate the standard deviation of mortality rates by age for the Male, Female and Total populations.
```{r}
dm %>%
  group_by(Age) %>%
  summarize(across(Female:Total, sd))
```
4. The Canadian HMD also provides population sizes over time (https://www.prdh.umontreal.ca/BDLC/data/ont/Population.txt). Use these to calculate the population weighted average mortality rate separately for males and females, for every year. Make a nice line plot showing the result (with meaningful labels/titles) and briefly comment on what you see (1 sentence). Hint: `left_join` will probably be useful here.
```{r}
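# population counts by year, age and sex
df <- read_table("https://www.prdh.umontreal.ca/BDLC/data/ont/Population.txt", skip = 2)
df$Age <- as.numeric(df$Age) # non-numeric ages become NA
# after the join, .x columns are mortality rates and .y columns are population counts
dff <- dm %>% left_join(df, by = c("Year", "Age"))
df2 <- dff %>%
  group_by(Year) %>%
  summarize(
    Male_mean = weighted.mean(Male.x, Male.y, na.rm = TRUE),
    Female_mean = weighted.mean(Female.x, Female.y, na.rm = TRUE)
  )
df3 <- df2 %>%
  pivot_longer(Male_mean:Female_mean, names_to = "sex", values_to = "averaged_mortality")
df3 %>%
  ggplot(aes(Year, averaged_mortality, color = as.factor(sex))) +
  geom_line() +
  scale_color_brewer(name = "Sex", palette = "Set1") +
  labs(y = "Population-weighted mean mortality rate",
       title = "Population-weighted mean mortality rates in Ontario over time")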
```
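For reference, the population-weighted mean used in exercise 4 is $\sum_a m_a P_a / \sum_a P_a$, where $m_a$ is the mortality rate and $P_a$ the population at age $a$ (this notation is introduced here for illustration). The sketch below checks that `weighted.mean` matches this formula on made-up numbers and shows how `left_join` lines up rates and populations. The tibbles `toy_mx` and `toy_pop`, their values, and the `suffix` names are all assumptions.

```{r}
library(dplyr)

toy_mx <- tibble(Year = c(2000, 2000), Age = c(30, 60), Female = c(0.001, 0.010))
toy_pop <- tibble(Year = c(2000, 2000), Age = c(30, 60), Female = c(3000, 1000))

# columns with the same name in both tables get the suffixes given here
toy_joined <- toy_mx %>%
  left_join(toy_pop, by = c("Year", "Age"), suffix = c("_mx", "_pop"))

# compare weighted.mean() with the formula written out by hand
toy_weighted <- toy_joined %>%
  group_by(Year) %>%
  summarize(
    weighted = weighted.mean(Female_mx, Female_pop),
    by_hand = sum(Female_mx * Female_pop) / sum(Female_pop)
  )
toy_weighted
```

Setting `suffix` explicitly is a design choice: it gives more readable column names than the default `.x` and `.y`.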