03-Tidy.rmd

---
title: "Tidy Data"
output: html_document
---

```{r setup}
library(tidyverse)

# Toy data
cases <- tribble(
  ~Country, ~"2011", ~"2012", ~"2013",
      "FR",    7000,    6900,    7000,
      "DE",    5800,    6000,    6200,
      "US",   15000,   14000,   13000
)

pollution <- tribble(
       ~city,   ~size, ~amount,
  "New York", "large",      23,
  "New York", "small",      14,
    "London", "large",      22,
    "London", "small",      16,
   "Beijing", "large",     121,
   "Beijing", "small",     56
)


bp_systolic <- tribble(
  ~ subject_id,  ~ time_1, ~ time_2, ~ time_3,
             1,       120,      118,      121,
             2,       125,      131,       NA,
             3,       141,       NA,       NA 
)

bp_systolic2 <- tribble(
  ~ subject_id,  ~ time, ~ systolic,
             1,       1,        120,
             1,       2,        118,
             1,       3,        121,
             2,       1,        125,
             2,       2,        131,
             3,       1,        141
)

```

## Tidy and untidy data

`table1` is tidy:
```{r}
table1 
```

For example, it's easy to add a rate column with `mutate()`:
```{r}
table1 %>%
  mutate(rate = cases/population)
```

`table2` isn't tidy, the count column really contains two variables:
```{r}
table2
```

It makes it very hard to manipulate.


## Your Turn 1

Is `bp_systolic` tidy?
Yes, it is.

```{r}
bp_systolic 
```

## Your Turn 2

Using `bp_systolic2` with `group_by()`, and `summarise()`:

* Find the average systolic blood pressure for each subject
* Find the last time each subject was measured

```{r}
bp_systolic2 %>% 
  group_by(subject_id) %>% 
  summarise(avg_sys = mean(systolic), 
            last_measurement = max(time))
```

## Your Turn 3

On a sheet of paper, draw how the cases data set would look if it had the same values grouped into three columns: **country**, **year**, **n**

## Your Turn 4

Use `pivot_longer()` to reorganize `table4a` into three columns: **country**, **year**, and **cases**.

```{r}
table4a %>% 
  pivot_longer(2:3, names_to = "year", values_to = "cases")
```
## Just for fun

```{r}
table4a %>%
  gather(key = "year", value = "n", 2:3)
```

## Your Turn 5

On a sheet of paper, draw how this data set would look if it had the same values grouped into three columns: **city**, **large**, **small**

## Your Turn 6

Use `pivot_wider()` to reorganize `table2` into four columns: **country**, **year**, **cases**, and **population**.

```{r}
table2 %>% 
  pivot_wider(names_from = "type", values_from = "count")
```

***

# Take Aways

Data comes in many formats but R prefers just one: _tidy data_.

A data set is tidy if and only if:

1. Every variable is in its own column
2. Every observation is in its own row
3. Every value is in its own cell (which follows from the above)

What is a variable and an observation may depend on your immediate goal.

<!-- This file by Charlotte Wickham is licensed under a Creative Commons Attribution 4.0 International License, adapted from the orignal work at https://github.com/rstudio/master-the-tidyverse by RStudio. -->