1_rmd_tidy_intro_additions.Rmd

---
title: "Intro to R Markdown and Tidyverse"
author: "Monica Alexander"
date: "11 January 2022"
output: 
  pdf_document:
    toc: true
    number_sections: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```

# By the end of this lab you should know the basics of

- RStudio Projects
- R Markdown
- Main tidyverse functions
- `ggplot`

# RStudio Projects

RStudio projects are associated with R working directories. They are good to use for several reasons:

- Each project has their own working directory, so make dealing with file paths easier
- Make it easy to divide your work into multiple contexts
- Can open multiple projects at one time in separate windows

To make a new project in RStudio, go to File --> New Project. If you've already set up a repo for this class, then select 'Existing Directory' and choose the folder that will contain all your class materials. This will open a new RStudio window, that will be called the name of your folder. 

In future, when you want to do work for this class, go to your class folder and open the .Rproj file. This will open an RStudio window, with the correct working directory, and show the files you were last working on. 

# R Markdown

This is an R Markdown document. R Markdown allows you to create nicely formatted documents (HTML, PDF, Word) that also include code and output of the code. This is good because it's reproducible, and also makes reports easier to update when new data comes in. Each of the grey chunks contains R code, just like a normal R script. You can choose to run each chunk separately, or knit the whole document using Knit the button above, which creates your document. 

To start a new R Markdown file in Rstudio, go to File --> New File --> R Markdown, then select Document and whatever you want to compile the document as (I chose pdf). Notice that this and the other inputs (title, author) are used to create the 'yaml', the bit at the start of the document. You can edit this, like I have for example to include table of contents. 

There are various options for output code, results, etc. For example, if you don't want your final report to include the code (but just the output, e.g. graphs or tables) then you can specify `echo=FALSE` at the beginning of the chunk within the curly brackets (or set global options like I have done above). 

## Writing math

Writing equations is essentially the same as in LaTeX. You can write inline equations using the \$ e.g. $y = ax+b$. You can write equations on a separate line with two \$s e.g. 

$$
y = ax + b
$$
In pdf documents you can have numbered equations using

\begin{equation}
y = ax + b
\end{equation}

Getting greek letters, symbols, subscripts, bars etc is the same as LaTeX. A few examples are below

- $Y_{i,j}$
- $\bar{X} = \frac{\sum_{i = 1}^n X_i}{n}$
- $\alpha \beta \gamma$
- $X \rightarrow Y$
- $Y \sim N(\mu, \sigma^2)$

# Tidyverse

Read in some packages that we'll be using:

```{r}
#install.packages("tidyverse")
library(tidyverse)
```

```{r}
library(dplyr)
library(readr)
```


On top of the base R functionality, there's lots of different packages that different people have made to improve the usability of the language. One of the most successful suite of packages is now called the 'tidyverse'. The tidyverse contains a range of functionality that help to manipulate and visualize data. 

Read in mortality rates for Ontario. These data come from the [Canadian Human Mortality Database](http://www.bdlc.umontreal.ca/chmd/prov/ont/ont.htm). 

```{r}
dm <- read_table("https://www.prdh.umontreal.ca/BDLC/data/ont/Mx_1x1.txt", skip = 2, col_types = "ddddd")
head(dm)
```

The object `dm` is a data frame, or tibble. Every column can be a different data type (e.g. we have integers and characters). 

## Important tidyverse functions

You should feel comfortable using the following functions

- The pipe `%>%`
- `filter`
- `select`
- `arrange`
- `mutate`
- `group_by`
- `summarize`
- `pivot_longer` and `pivot_wider`


## Piping, filtering, selecting, arranging

A central part of manipulating tibbles is using the `%>%` function. This is a pipe, but should be read as saying 'and then'. 

For example, say we just want to pull out mortality rates for 1935. We would take our tibble *and then* filter to only include 1935:

```{r}
dm %>% 
  filter(Year==1935) # two equals signs logical
```

You can also filter by more than one condition; say we just wanted to look at 10 year olds in 1935:

```{r}
dm %>% 
  filter(Year==1935, Age==10)
```

If we only wanted to look at 10 year olds in 1935 who were female, we could filter *and then* select the female column.

```{r}
dm %>% 
  filter(Year==1935, Age==10) %>% 
  select(Female)
```
You can also remove columns by selecting the negative of that column name. 

```{r}
dm %>% 
  select(-Total)
```

Sort the tibble according to a particular column using `arrange`, for example, Year in descending order:

```{r}
dm %>% 
  arrange(-Year)
```

NOTE: none of the above operations are saving. 

```{r}
dm_filter <- dm %>% filter(Year==1935)
dm_filter
```



## Grouping, summarizing, mutating

In addition to `filter` and `select`, two useful functions are `mutate`, which allows you to create new variables, and `summarize`, which allows you to produce summary statistics. These are particularly powerful when combined with `group_by()` which allows you to do any operation on a tibble by group. 

For example, let's create a new variable that is the ratio of male to female mortality at each age and year:

```{r}
dm %>% 
  mutate(mf_ratio = Male/Female)
```


Now, let's calculate the mean female mortality rate by age over all the years. To do this, we need to `group_by` Age, and then use `summarize` to calculate the mean:

```{r}
# mean mortality rate by age group over all years
dm %>% 
  group_by(Age) %>% 
  summarize(mean_mort = mean(Female))
```

Mean female mortality rate over all ages and years:

```{r}
dm %>% 
  summarize(mean_mort = mean(Female, na.rm = TRUE))
```

Mean of males and females by age

```{r}
dm %>% 
  group_by(Age) %>% 
  summarize(mean_male = mean(Male),
            mean_female = mean(Female))
```

Alternatively

```{r}
dm %>% 
  group_by(Age) %>% 
  summarize_at(vars(Female:Male), mean)
```

Now using `across`

```{r}
dm %>% 
  group_by(Age) %>% 
  summarize(across(Female:Male,mean))
```



## Pivoting

We often need to switch between wide and long data format. The `dm` tibble is currently in wide format. To get it in long format we can use `pivot_longer`

```{r}
dm_long <- dm %>% 
  pivot_longer(Female:Total, names_to = "sex", values_to = "mortality")
dm_long
```



## Using ggplot

You can plot things in R using the base `plot` function, but plots using `ggplot` are much prettier. 

Say we wanted to plot the mortality rates for 30 year old males over time. In the function `ggplot`, we need to specify our data (in this case, a filtered version of dm), an x axis (Year) and y axis (Male). The axes are defined withing the `aes()` function, which stands for 'aesthetics'.

First let's get our data:

```{r}
d_to_plot <- dm %>% 
  filter(Age==30) %>% 
  select(Year, Male)
d_to_plot
```

Now start the ggplot:

```{r}
p <- ggplot(data = d_to_plot, aes(x = Year, y = Male))
p
```

Notice the object `p` is just an empty box. The key to ggplot is layering: we now want to specify that we want a line plot using `geom_line()`:

```{r}
p + geom_line()
```

Let's change the color of the line, and the y-axis label, and give the plot a title:

```{r}
p +
  geom_line(col = "purple") +
  labs(y = "Mortality", title = "Male mortality rates in Ontario over time")
```


### More than one group

Now say we wanted to have trends for 30-year old males and females on the one plot. The easiest way to do this is to first reshape our data so it's in long format: so instead of having a column for each sex, we have one column indicating the sex, and another column indicating the Mx value

```{r}
dm_to_long <- dm %>% 
  pivot_longer(Female:Total, names_to = "sex", values_to = "mortality") %>% 
  filter(Age == 30, sex!="Total") 
```

Now we can do a similar plot to before but we now have an added component in the `aes()` function: color, which is determined by sex:

```{r}
dm_to_long %>% 
  ggplot(aes(Year, mortality, color = sex)) + 
  geom_line()
```

### Faceting

A neat thing about ggplot is that it's relatively easy to create 'facets' or smaller graphs divided by groups. Say we wanted to look at trends for 30 year olds and 60 year olds for both males and females. Let's get the data ready to plot:

```{r}
dm_to_plot <- dm %>% 
  select(-Total) %>% 
  filter(Age==30|Age==60) %>% 
  pivot_longer(Female:Male, names_to = "sex", values_to = "mortality") 
dm_to_plot
```

Now let's plot, with a separate facet for each sex:

```{r}
dm_to_plot %>% 
  ggplot(aes(Year, mortality, color = as.factor(Age))) + facet_grid(~sex) +
  geom_line() + 
  scale_color_brewer(name = "Age", palette = "Set1")
```

# Lab Exercises

1. Plot the ratio of male to female mortality rates over time for ages 10,20,30 and 40 (different color for each age) and change the theme 

```{r}
dm %>% filter(Age==10|Age==20|Age==30|Age==40) %>% ggplot(aes(Year, Male/Female, color = as.factor(Age))) + 
  geom_line() + theme_classic()+
  scale_color_brewer(name = "Age", palette = "Set1")
```



2. Find the age that has the highest female mortality rate each year

```{r}
dm[(!is.na(dm$Age)),] %>% 
  group_by(Year) %>% filter(Female==max(Female,na.rm=1)) %>% select(Age)
```

3. Use the `summarize_at()` function OR `summarize(across())` to calculate the standard deviation of mortality rates by age for the Male, Female and Total populations. 

```{r}
dm %>% 
  group_by(Age) %>% 
  summarize(across(Female:Total,sd))
```


4. The Canadian HMD also provides population sizes over time (https://www.prdh.umontreal.ca/BDLC/data/ont/Population.txt). Use these to calculate the population weighted average mortality rate separately for males and females, for every year. Make a nice line plot showing the result (with meaningful labels/titles) and briefly comment on what you see (1 sentence). Hint: `left_join` will probably be useful here. 

```{r}
df <- read_table('https://www.prdh.umontreal.ca/BDLC/data/ont/Population.txt',skip=2)
df$Age <- as.numeric(df$Age)
dff <- dm %>% left_join(df,by=c('Year','Age')) 
df2 <- dff%>% group_by(Year) %>% summarize(
  Male_mean = weighted.mean(Male.x,Male.y,na.rm=T),
  Female_mean = weighted.mean(Female.x,Female.y,na.rm=T)
  )

df3 <- df2 %>% pivot_longer(Male_mean:Female_mean, names_to = "sex", values_to = "averaged_mortality") 
  
df3 %>% ggplot(aes(Year, averaged_mortality, color = as.factor(sex))) + 
  geom_line() + 
  scale_color_brewer(name = "Sex", palette = "Set1")
```