
# Tidy data and its friends {#tidy_data}

```{r include = FALSE}
# Caching this markdown file
#knitr::opts_chunk$set(cache = TRUE)
```

## Setup

-   Check that your `dplyr` package is up to date by typing `packageVersion("dplyr")`. If the installed version is older than 1.0.0, update it by typing `install.packages("dplyr")`. You may need to restart R for the update to take effect.

```{r}

# Reinstall dplyr only if the installed version is older than 1.0.0
if (packageVersion("dplyr") >= "1.0.0") {
  message("The installed version of dplyr is greater than or equal to 1.0.0")
} else {
  install.packages("dplyr")
}

if (!require("pacman")) install.packages("pacman")

pacman::p_load(
  tidyverse, # the tidyverse framework
  skimr, # skimming data
  here, # computational reproducibility
  #infer, # statistical inference
  tidymodels, # statistical modeling
  gapminder, # toy data
  nycflights13, # for exercise
  ggthemes, # additional themes
  ggrepel, # non-overlapping text labels for ggplots
  patchwork, # arranging ggplots
  broom, # tidying model outputs
  waldo # side-by-side object comparison
)
```

## Base R data structure 

The rest of the chapter follows the basic structure in [the Data Wrangling Cheat Sheet](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) created by RStudio.

To make the best use of the R language, you'll need a strong understanding of the basic data types and data structures and how to operate on them. R is an **object-oriented** language, so the importance of this cannot be overstated.

It is **critical** to understand because these are the objects you will manipulate on a day-to-day basis in R, and they are not always as easy to work with as they sound at the outset. Dealing with object conversions is one of the most common sources of frustration for beginners.

> To understand computations in R, two slogans are helpful:
>
> - Everything that exists is an object.
> - Everything that happens is a function call.
>
> __John Chambers__, the creator of S (the mother of R)

1. [Atomic classes](#atomic-classes) introduces you to R's one-dimensional or atomic classes and data structures. R has five basic atomic classes: logical, integer, numeric, complex, and character. Social scientists rarely use the complex class.

2. [Attributes](#attributes) takes a small detour to discuss attributes, R's flexible metadata specification. Here, you'll learn about factors, an important data structure created by setting attributes of an atomic vector. R has many data structures: vector, list, matrix, data frame, factors, tables.


![Concept map for data types. By Meghan Sposato, Brendan Cullen, Monica Alonso.](https://github.com/rstudio/concept-maps/raw/master/en/data-types.svg)


### 1D data: Vectors 

#### Atomic classes

`R`'s main atomic classes are:

* character (or a "string" in Python and Stata)
* numeric (integer or float)
* integer (just integer)
* logical (booleans)

| Example | Type |
| ------- | ---- |
| "a", "swc" | character |
| 2, 15.5 | numeric | 
| `2L` (the `L` suffix denotes an integer) | integer |
| `TRUE`, `FALSE` | logical |

Like Python, R is dynamically typed. There are a few differences in terminology, however, that are pertinent. 

- First, "types" in Python are referred to as "classes" in R. 

What is a class?

![from https://brilliant.org/](https://ds055uzetaobb.cloudfront.net/brioche/uploads/pJZt3mh3Ht-prettycars.png?width=2400)

- Second, R has different names for the types string, integer, and float --- specifically **character**, **integer** (not different), and **numeric**. Because there is no "float" class in R, users tend to default to the "numeric" class when working with numerical data.

The function for recovering object classes is `class()`. Add the `L` suffix to a number to make it an explicit integer. See more in the [R language definition](https://cran.r-project.org/doc/manuals/R-lang.html).

```{r}
class(3)
class(3L)
class("Three")
class(FALSE)
```

### Data structures

R's base data structures can be organized by their dimensionality (1d, 2d, or nd) and whether they're homogeneous (all contents must be of the same type) or heterogeneous (the contents can be of different types). This gives rise to the five data structures most often used in data analysis:

|    | Homogeneous   | Heterogeneous |
|----|---------------|---------------|
| 1d | Atomic vector | List          |
| 2d | Matrix        | Data frame    |
| nd | Array         |               |

Each data structure has its own specifications and behavior. For our purposes, an important thing to remember is that R is generally **faster** (more efficient) when working with homogeneous (**vectorized**) data.
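As a rough illustration of what "vectorized" means here, the sketch below computes the same sum of squares twice: once with an explicit loop and once with a single expression applied to the whole vector. The variable names are made up for this example, and any speed difference you observe will depend on your machine and R version.

```{r}
nums <- 1:10000

# Element-by-element loop
total_loop <- 0
for (n in nums) {
  total_loop <- total_loop + n^2
}

# Vectorized: the arithmetic is applied to the whole vector at once
total_vec <- sum(nums^2)

# Both approaches give the same answer
all.equal(total_loop, total_vec)
```

The vectorized version is shorter and usually faster, because the looping happens in compiled code rather than in R itself.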

#### Vector properties

Vectors have three common properties:

* Class, `class()`, or what type of object it is (same as `type()` in Python).
* Length, `length()`, how many elements it contains (same as `len()` in Python).
* Attributes, `attributes()`, additional arbitrary metadata.

Atomic vectors and lists differ in the types of their elements: all elements of an atomic vector must be the same type, whereas the elements of a list can have different types.

#### Creating different types of atomic vectors

Remember, there are four common types of vectors:

* `logical`
* `integer`
* `numeric` (same as `double`)
* `character`

You can create an empty vector with `vector()`. (By default, the mode is `logical`. You can be more explicit, as shown in the examples below.) It is more common to use direct constructors such as `character()`, `numeric()`, etc.

```{r}

x <- vector()

# with a length and type
vector("character", length = 10)

## character vector of length 5
character(5)

numeric(5)

logical(5)
```

Atomic vectors are usually created with `c()`, which is short for concatenate:

```{r}

x <- c(1, 2, 3)

x

length(x)
```

`x` is a numeric vector. These are the most common kind. You can also have logical vectors. 

```{r}

y <- c(TRUE, TRUE, FALSE, FALSE)

y
```

Finally, you can have character vectors:

```{r}

kim_family <- c("Jae", "Sun", "Jane")

is.integer(kim_family) # integer?

is.character(kim_family) # character?

is.atomic(kim_family) # atomic?

typeof(kim_family) # what's the type?
```

**Short exercise: Create and examine your vector**  

Create a character vector called `fruit` containing 4 of your favorite fruits. Then evaluate its structure using the commands below.

```{r, eval = FALSE}

# First, create your fruit vector
# YOUR CODE HERE
fruit <- NULL # replace NULL with your own c(...)

# Examine your vector
length(fruit)
class(fruit)
str(fruit)
```

**Add elements**

You can add elements to the end of a vector by passing the original vector into the `c` function, like the following:

```{r}

z <- c("Beyonce", "Kelly", "Michelle", "LeToya")

z <- c(z, "Farrah")

z
```

More examples of vectors

```{r}

x <- c(0.5, 0.7)

x <- c(TRUE, FALSE)

x <- c("a", "b", "c", "d", "e")

x <- 9:100
```

You can also create vectors as a sequence of numbers:

```{r}

series <- 1:10

series

seq(10)

seq(1, 10, by = 0.1)
```

Atomic vectors are always flat, even if you nest `c()`'s:

```{r eval = TRUE}

c(1, c(2, c(3, 4)))

# the same as
c(1, 2, 3, 4)
```

**Types and Tests**

Given a vector, you can determine its class with `class`, or check if it's a specific type with an "is" function: `is.character()`, `is.numeric()`, `is.integer()`, `is.logical()`, or, more generally, `is.atomic()`.

```{r }

char_var <- c("harry", "sally")

class(char_var)

is.character(char_var)

is.atomic(char_var)

num_var <- c(1, 2.5, 4.5)

class(num_var)

is.numeric(num_var)

is.atomic(num_var)
```

NB: `is.vector()` does not test if an object is a vector. Instead, it returns `TRUE` only if the object is a vector with no attributes apart from names. Use `is.atomic(x) || is.list(x)` to test if an object is actually a vector.
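A minimal sketch of this behavior, using an arbitrary attribute called `note` that exists only for this example:

```{r}
v <- 1:3
is.vector(v) # TRUE: no attributes at all

# Add an arbitrary attribute (the name "note" is just an example)
attr(v, "note") <- "still an atomic vector"

is.vector(v) # FALSE, because of the extra attribute
is.atomic(v) || is.list(v) # TRUE
```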

**Coercion**

All atomic vector elements must be the same type, so when you attempt to combine different types, they will be __coerced__ to the **most flexible type.** Types from least to most flexible are: logical < integer < double < character.

For example, combining a character and an integer yields a character:

```{r}
str(c("a", 1))
```

**Guess what the following do without running them first**

```{r, eval = FALSE}

c(1.7, "a")

c(TRUE, 2)

c("a", TRUE)
```

Notice that when a logical vector is coerced to an integer or double, `TRUE` becomes 1, and `FALSE` becomes 0. This is very useful in conjunction with `sum()` and `mean()`:

```{r}

x <- c(FALSE, FALSE, TRUE)

as.numeric(x)

# Total number of TRUEs
sum(x)

# Proportion that is TRUE
mean(x)
```

Coercion often happens automatically. This is called implicit coercion. Most mathematical functions (`+`, `log`, `abs`, etc.) will coerce to a numeric or integer, and most logical operations (`&`, `|`, `any`, etc) will coerce to a logical. You will usually get a warning message if the coercion might lose information. 

```{r}

1 < "2"

"1" > 2
```
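A few more cases of implicit coercion may help make the rule concrete. The sketch below is illustrative only. Note in particular the last line: when a number is compared with a string, the number is turned into a string first, so the comparison is made character by character.

```{r}
# Arithmetic coerces logicals to numbers
TRUE + TRUE

# Logical operators coerce numbers to logicals (0 is FALSE, non-zero is TRUE)
0 | 3

# Comparing a number with a string coerces the number to a string,
# so the comparison is alphabetical rather than numeric
10 < "9"
```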

You can also coerce vectors explicitly with `as.character()`, `as.numeric()`, `as.integer()`, or `as.logical()`. Example:

```{r}

x <- 0:6

as.numeric(x)

as.logical(x)

as.character(x)
```

Sometimes coercions, especially nonsensical ones, won’t work. In that case the result is `NA`, and `as.numeric()` also issues a warning.

```{r}

x <- c("a", "b", "c")

as.numeric(x)

as.logical(x)
```

**Short Exercise**

```{r, eval=FALSE}

# 1. Create a vector containing the sequence of numbers from 1 to 10.

# 2. Coerce that vector into a character vector

# 3. Add the element "11" to the end of the vector

# 4. Coerce it back to a numeric vector.
```

#### Lists

Lists are also vectors, but they differ from atomic vectors because their elements can be of any type. In short, they are generic vectors. Lists are sometimes called recursive vectors, because a list can contain other lists. This makes them fundamentally different from atomic vectors.

You construct lists by using `list()` instead of `c()`:

```{r}

x <- list(1, "a", TRUE, c(4, 5, 6))

x
```

You can coerce other objects using `as.list()`. You can test for a list with `is.list()`

```{r}

x <- 1:10

x <- as.list(x)

is.list(x)

length(x)
```

`c()` will combine several lists into one. If given a combination of atomic vectors and lists, `c()` (con**c**atenate) will coerce the vectors to lists before combining them. Compare the results of `list()` and `c()`:

```{r}

x <- list(list(1, 2), c(3, 4))

y <- c(list(1, 2), c(3, 4))

str(x)

str(y)
```

You can turn a list into an atomic vector with `unlist()`. If the elements of a list have different types, `unlist()` uses the same coercion rules as `c()`.

```{r}

x <- list(list(1, 2), c(3, 4))

x

unlist(x)
```

Lists are used to build up many of the more complicated data structures in R. For example, both data frames and linear model objects (as produced by `lm()`) are lists:

```{r}

is.list(mtcars)

mod <- lm(mpg ~ wt, data = mtcars)

is.list(mod)
```

For this reason, lists are handy inside functions. You can "staple" together many different kinds of results into a single object that a function can return.
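As a small sketch of this idea, the function below returns three different kinds of results in one list. `describe_numbers()` and its element names are made up for this example.

```{r}
# describe_numbers() is a hypothetical helper written for this example
describe_numbers <- function(values) {
  list(
    n = length(values),
    average = mean(values),
    range = range(values)
  )
}

result <- describe_numbers(c(2, 4, 9))

str(result)

result$average
```

The caller gets a single object back and can pick out whichever piece it needs by name.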

A list does not print to the console like a vector. Instead, each element of the list starts on a new line.

```{r}
x.vec <- c(1, 2, 3)
x.list <- list(1, 2, 3)
x.vec
x.list
```

For lists, elements are **indexed by double brackets**. Single brackets will still return a(nother) list. (We'll talk more about subsetting and indexing in the fourth lesson.)
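A quick sketch of the difference, using a throwaway list named `band`:

```{r}
band <- list("Beyonce", 3, c(TRUE, FALSE))

# Single brackets keep the list wrapper
str(band[3])

# Double brackets return the element itself
str(band[[3]])
```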

**Exercises**

1. What are the four basic types of atomic vectors? How does a list differ from an atomic vector?

2. Why is `1 == "1"` true? Why is `-1 < FALSE` true? Why is `"one" < 2` false?

3. Create three vectors and then combine them into a list.

4.  If `x` is a list, what is the class of `x[1]`?  How about `x[[1]]`?


### Attributes

Attributes provide additional information about the data to you, the user, and to R. We've already seen the following three attributes in action:

* Names (`names(x)`), a character vector giving each element a name. 

* Dimensions (`dim(x)`), used to turn vectors into matrices.

* Class (`class(x)`), used to implement the S3 object system.

**Additional tips**

In an object-oriented system, a [class](https://www.google.com/search?q=what+is+class+programming&oq=what+is+class+programming&aqs=chrome.0.0l6.3543j0j4&sourceid=chrome&ie=UTF-8) (an extensible program-code-template) defines a type of object: what its properties are, how it behaves, and how it relates to other types of objects. Therefore, technically, an object is an [instance](https://en.wikipedia.org/wiki/Instance_(computer_science)) (or occurrence) of a class. A method is a function associated with a particular type of object.

#### Names

You can name a vector when you create it:

```{r}

x <- c(a = 1, b = 2, c = 3)
```

You can also modify an existing vector: 

```{r}

x <- 1:3

names(x)

names(x) <- c("e", "f", "g")

names(x)
```

Names don't have to be unique. However, character subsetting, described in the next lesson, is the most important reason to use names, and it is most useful when the names are unique. (For Python users: when names are unique, a named vector behaves much like a Python dictionary, with the names acting as keys.)

Not all elements of a vector need to have a name. If some names are missing, `names()` will return an empty string for those elements. If all names are missing, `names()` will return `NULL`.

```{r}

y <- c(a = 1, 2, 3)

names(y)

z <- c(1, 2, 3)

names(z)
```

You can create a new vector without names using `unname(x)`, or remove names in place with `names(x) <- NULL`.
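The difference between the two approaches is easiest to see side by side. This is a minimal sketch with a made-up vector called `scores`:

```{r}
scores <- c(a = 1, b = 2, c = 3)

# unname() returns a copy without names; `scores` itself is unchanged
unname(scores)
names(scores)

# Assigning NULL removes the names from `scores` itself
names(scores) <- NULL
names(scores)
```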

#### Factors

Factors are special vectors that represent categorical data. Factors can be ordered (ordinal variable) or unordered (nominal or categorical variable) and are important for modeling functions such as `lm()` and `glm()` and also in plot methods.

**Quiz**

1. If you want to enter dummy variables (Democrats = 1, Non-democrats = 0) in your regression model, should you use a numeric or factor variable?

Factors can only contain pre-defined values. Set the allowed values with the `levels` argument of `factor()`, and inspect or change them with `levels()`. Note that a factor's levels will always be character values.


```{r}

x <- c("a", "b", "b", "a")

x <- factor(c("a", "b", "b", "a"))

x

class(x)

levels(x)

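# Values that are not in the levels become NA (with a warning)
x[2] <- "c"

# NB: before R 4.1.0, c() on factors returned the integer codes;
# since R 4.1.0 it returns a factor with the combined levels
c(factor("a"), factor("b"))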

rep(1:5, rep(6, 5))
```

Factors are pretty much integers that have labels on them. Underneath, they are really integer codes (1, 2, 3...).

```{r}

x <- factor(c("a", "b", "b", "a"))

str(x)
```

They are better than using simple integer labels because factors are what are called self-describing. For example, `democrat` and `republican` are more descriptive than `1`s and `2`s.

Factors are useful when you know the possible values a variable may take, even if you don't see all values in a given dataset. Using a factor instead of a character vector makes it obvious when some groups contain no observations:

```{r}

party_char <- c("democrat", "democrat", "democrat")

party_char

party_factor <- factor(party_char, levels = c("democrat", "republican"))

party_factor

table(party_char) # shows only democrats

table(party_factor) # shows republicans too
```

Sometimes factors can be left unordered. Example: `democrat`, `republican`.

Other times you might want factors to be ordered (or ranked). Example: `low`, `medium`, `high`. 

```{r}

x <- factor(c("low", "medium", "high"))

str(x)

is.ordered(x)

y <- ordered(c("low", "medium", "high"), levels = c("high", "medium", "low"))

is.ordered(y)
```
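One practical payoff of an ordered factor is that comparisons and sorting follow the level order. The sketch below uses a made-up `rating` variable with the levels given from lowest to highest:

```{r}
rating <- factor(
  c("low", "high", "medium", "low"),
  levels = c("low", "medium", "high"),
  ordered = TRUE
)

rating

# Comparisons and sorting follow the level order, not the alphabet
rating < "high"

sort(rating)
```

With an unordered factor, the `<` comparison would instead return `NA` with a warning.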

While factors look (and often behave) like character vectors, they are integers. So be careful when treating them like strings. Some string functions (like `gsub()` and `grepl()`) will coerce factors to strings, while others (like `nchar()`) will throw an error, and still others (like `c()`, when a factor is combined with a non-factor) will use the underlying integer values.

```{r}

x <- c("a", "b", "b", "a")

x

is.factor(x)

x <- as.factor(x)

x

c(x, "c")
```

For this reason, it's usually best to explicitly convert factors to character vectors if you need string-like behavior. There was a memory advantage to using factors instead of character vectors in early versions of R, but this is no longer the case.
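The same caution applies to numbers stored as factor labels. In this sketch, `years` is a made-up factor whose labels look numeric:

```{r}
years <- factor(c("2010", "2020", "2010"))

# as.numeric() on a factor returns the internal integer codes
as.numeric(years)

# Convert to character first to recover the labels, then to numbers
as.character(years)

as.numeric(as.character(years))
```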

Unfortunately, before R 4.0.0, most base data loading functions (such as `read.csv()`) automatically converted character vectors to factors. This is suboptimal, because there's no way for those functions to know the set of all possible levels or their optimal order. If you work with an older version of R and this becomes a problem, use the argument `stringsAsFactors = FALSE` to suppress this behavior and manually convert character vectors to factors using your knowledge of the data.

**More attributes**

All R objects can have arbitrary additional attributes used to store metadata about the object. Attributes can be considered a named list (with unique names). Attributes can be accessed individually with `attr()` or all at once (as a list) with `attributes()`.

```{r}

y <- 1:10

attr(y, "my_attribute") <- "This is a vector"

attr(y, "my_attribute")

# attributes() returns all attributes as a list; str() shows its structure
str(attributes(y))
```

**Exercises**

1.  What happens to a factor when you modify its levels? 
    
```{r, results = "hide"}

f1 <- factor(letters)

levels(f1) <- rev(levels(f1))

f1
```

2.  What does this code do? How do `f2` and `f3` differ from `f1`?

```{r, results = "hide"}

f2 <- rev(factor(letters))

f3 <- factor(letters, levels = rev(letters))
```

### 2D data: Matrices and dataframes 

1. Matrices: data structures for storing 2d data in which every element has the same class.
2. Dataframes: the most important data structure for storing data in R, because a dataframe can hold different kinds of (2d) data.

#### Matrices

Matrices are created when we combine multiple vectors with the same class (e.g., numeric). This creates a dataset with rows and columns. By definition, if you want to combine multiple classes of vectors, you want a dataframe. You can coerce a matrix to become a dataframe and vice-versa, but as with all vector coercions, the results can be unpredictable, so be sure you know how each variable (column) will convert.

```{r}

m <- matrix(nrow = 2, ncol = 2)

m

dim(m)
```

Matrices are filled column-wise. 

```{r}

m <- matrix(1:6, nrow = 2, ncol = 3)

m
```

Other ways to construct a matrix

```{r}

m <- 1:10

dim(m) <- c(2, 5)

m

dim(m) <- c(5, 2)

m
```

You can transpose a matrix (or dataframe) with `t()`

```{r}

m <- 1:10

dim(m) <- c(2, 5)

m

t(m)
```

Another way is to bind columns or rows using `cbind()` and `rbind()`.

```{r}

x <- 1:3

y <- 10:12

cbind(x, y)

# or

rbind(x, y)
```

You can also use the `byrow` argument to specify how the matrix is filled. From R's own documentation:

```{r}

mdat <- matrix(c(1, 2, 3, 11, 12, 13),
  nrow = 2,
  ncol = 3,
  byrow = TRUE,
  dimnames = list(
    c("row1", "row2"),
    c("C.1", "C.2", "C.3")
  )
)
mdat
```   

Notice that we gave `names` to the dimensions in `mdat`.

```{r}

dimnames(mdat)

rownames(mdat)

colnames(mdat)
```

#### Dataframes 

A data frame is an essential data type in R. It's pretty much the **de facto** data structure for most tabular data and what we use for statistics. 

##### Creation

You create a data frame using `data.frame()`, which takes named vectors as input:

```{r}
vec1 <- 1:3
vec2 <- c("a", "b", "c")
df <- data.frame(vec1, vec2)
df
str(df)
```

Beware: before R 4.0.0, `data.frame()` turned strings into factors by default. If you are using an older version of R, use `stringsAsFactors = FALSE` to suppress this behavior as needed:

```{r}
df <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
str(df)
```

In reality, we rarely type up our datasets ourselves, and certainly not in R. The most common way to make a data.frame is by reading a file with `read.csv()` (from the `utils` package that ships with R), `read.dta()` (from the `foreign` package, if you're using a Stata file), or some other data import function.

##### Structure and Attributes

Under the hood, a data frame is a list of equal-length vectors. This makes it a 2-dimensional structure, so it shares properties of both the matrix and the list. 

```{r}
vec1 <- 1:3
vec2 <- c("a", "b", "c")
df <- data.frame(vec1, vec2)

str(df)
```
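To make the "list of equal-length vectors" idea concrete, the sketch below treats a small, made-up data frame called `people` first as a list and then as a matrix:

```{r}
people <- data.frame(id = 1:3, name = c("a", "b", "c"))

# List-like behavior: each column is one element of the list
is.list(people)

sapply(people, class)

people[["name"]]

# Matrix-like behavior: it has rows and columns
dim(people)
```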

This means that a dataframe has `names()`, `colnames()`, and `rownames()`, although `names()` and `colnames()` are the same thing. 

**Summary**

- Set column names: `names()` in data frame, `colnames()` in matrix 
- Set row names: `row.names()` in data frame, `rownames()` in matrix

```{r}
vec1 <- 1:3
vec2 <- c("a", "b", "c")
df <- data.frame(vec1, vec2)

# these two are equivalent
names(df)
colnames(df)

# change the colnames
colnames(df) <- c("Number", "Character")
df
```

```{r}
names(df) <- c("Number", "Character")
df
```

```{r}
# change the rownames
rownames(df)
rownames(df) <- c("donut", "pickle", "pretzel")
df
```

The `length()` of a dataframe is the length of the underlying list and so is the same as `ncol()`; `nrow()` gives the number of rows. 

```{r}
vec1 <- 1:3
vec2 <- c("a", "b", "c")
df <- data.frame(vec1, vec2)

# these two are equivalent - number of columns
length(df)

ncol(df)

# get number of rows
nrow(df)

# get number of both columns and rows
dim(df)
```

##### Testing and coercion

To check if an object is a dataframe, use `class()` or test explicitly with `is.data.frame()`:

```{r}
class(df)
is.data.frame(df)
```

You can coerce an object to a dataframe with `as.data.frame()`:

* A vector will create a one-column dataframe.

* A list will create one column for each element; it's an error if they're 
  not all the same length.
  
* A matrix will create a data frame with the same number of columns and rows as the matrix.
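The three cases above can be sketched as follows. The object names are arbitrary.

```{r}
# A vector becomes a one-column data frame
as.data.frame(c(10, 20, 30))

# A list becomes one column per element
as.data.frame(list(x = 1:2, y = c("a", "b")))

# A matrix keeps its dimensions
mat <- matrix(1:6, nrow = 2)

dim(as.data.frame(mat))
```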

##### Combining dataframes

You can combine dataframes using `cbind()` and `rbind()`:

```{r}
df <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

cbind(df, data.frame(z = 3:1))
rbind(df, data.frame(x = 10, y = "z"))
```

When combining column-wise, the number of rows must match, but row names are ignored. When combining row-wise, both the number and names of columns must match. (If you want to combine rows that don't have the same columns, other functions/packages in R can help.)
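Here is a minimal sketch of what happens when the column names do not match. The data frames `df1` and `df2` are made up for this example, and `tryCatch()` is used only so that the error message can be shown without stopping the document from knitting.

```{r}
df1 <- data.frame(x = 1:2, y = c("a", "b"))
df2 <- data.frame(x = 3, z = "c") # column `z` does not exist in df1

# rbind() stops with an error; tryCatch() captures the message
rbind_message <- tryCatch(
  rbind(df1, df2),
  error = function(e) conditionMessage(e)
)

rbind_message
```

If you need to stack data frames with different columns, `dplyr::bind_rows()` is one option: it fills the missing columns with `NA`.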

It's a common mistake to try and create a dataframe by `cbind()`ing vectors together. This doesn't work because `cbind()` will create a matrix unless one of the arguments is already a dataframe. Instead use `data.frame()` directly:

```{r}
bad <- (cbind(x = 1:2, y = c("a", "b")))
bad
str(bad)

good <- data.frame(
  x = 1:2, y = c("a", "b"),
  stringsAsFactors = FALSE
)
good
str(good)
```

The conversion rules for `cbind()` are complicated and best avoided by ensuring all inputs are of the same type.

**Other objects**

Missing values are specified with `NA`, which is a logical vector of length 1. `NA` will always be coerced to the correct type if used inside `c()`.

```{r}
x <- c(NA, 1)
x
typeof(NA)
typeof(x)
```

`Inf` is infinity. You can have either positive or negative infinity.

```{r}
1 / 0
1 / Inf
```

`NaN` means Not a number. It's an undefined value.

```{r}
0 / 0
```

### Subset

When working with data, you'll need to subset objects early and often. Luckily, R's subsetting operators are powerful and fast. Mastery of subsetting allows you to succinctly express complex operations in a way that few other languages can match. Subsetting is hard to learn because you need to master several interrelated concepts:

* The three subsetting operators, `[`, `[[`, and `$`.

* Important differences in behavior for different objects (e.g., vectors, lists, factors, matrices, and data frames).

* The use of subsetting in conjunction with assignment.

This unit helps you master subsetting by starting with the simplest type of subsetting: subsetting an atomic vector with `[`. It then gradually extends your knowledge to more complicated data types (like dataframes and lists) and then to the other subsetting operators, `[[` and `$`. You'll then learn how subsetting and assignment can be combined to modify parts of an object, and, finally, you'll see a large number of useful applications.

#### Atomic vectors

Let's explore the different types of subsetting with a simple vector, `x`. 

```{r}
x <- c(2.1, 4.2, 3.3, 5.4)
```

Note that the number after the decimal point gives the original position in the vector.

**NB:** In R, positions start at 1, unlike Python, which starts at 0. Fun!

There are several things that you can use to subset a vector. This section covers four of them:

##### Positive integers

```{r}
x <- c(2.1, 4.2, 3.3, 5.4)
x
x[1]
x[c(3, 1)]

# `order(x)` gives the positions of smallest to largest values.
order(x)
x[order(x)]
x[c(1, 3, 2, 4)]

# Duplicated indices yield duplicated values
x[c(1, 1)]
```

##### Negative integers

```{r}
x <- c(2.1, 4.2, 3.3, 5.4)
x[-1]
x[-c(3, 1)]
```

You can't mix positive and negative integers in a single subset:

```{r, error = TRUE}
x <- c(2.1, 4.2, 3.3, 5.4)
x[c(-1, 2)]
```

##### Logical vectors

```{r}
x <- c(2.1, 4.2, 3.3, 5.4)

x[c(TRUE, TRUE, FALSE, FALSE)]
```

This is probably the most useful type of subsetting because you write the expression that creates the logical vector.

```{r}
x <- c(2.1, 4.2, 3.3, 5.4)

# this returns a logical vector
x > 3
x

# use a conditional statement to create an implicit logical vector
x[x > 3]
```

You can combine conditional statements with `&` (and), `|` (or), and `!` (not)

```{r}
x <- c(2.1, 4.2, 3.3, 5.4)

# combining two conditional statements with &
x > 3 & x < 5

x[x > 3 & x < 5]

# combining two conditional statements with |
x < 3 | x > 5
x[x < 3 | x > 5]

# combining conditional statements with !
# (`!x > 5` is evaluated as `!(x > 5)`)
!(x > 5)
x[!(x > 5)]
```

Another way to generate implicit conditional statements is using the `%in%` operator, which works like the `in` keyword in Python.

```{r}
# generate implicit logical vectors through the %in% operator
x %in% c(3.3, 4.2)
x
x[x %in% c(3.3, 4.2)]
```

##### Character vectors

```{r}
x <- c(2.1, 4.2, 3.3, 5.4)

# apply names
names(x) <- c("a", "b", "c", "d")
x

# subset using names
x[c("d", "c", "a")]

# Like integer indices, you can repeat indices
x[c("a", "a", "a")]

# Careful! names are always matched exactly
x <- c(abc = 1, def = 2)
x
x[c("a", "d")]
```

###### More on string operations 

```{r}
firstName <- "Jae Yeon"
lastName <- "Kim"
```

Unlike in Python, R does not have a reserved operator for string concatenation such as `+`.  Furthermore, using the usual concatenation operator ```c()``` on two or more character strings will not create a single character string, but rather a **vector** of character strings. 

```{r}
fullName <- c(firstName, lastName)

print(fullName)

length(fullName)
```

To combine two or more character strings into one larger character string, we use the ```paste()``` function.  This function takes character strings or vectors and joins their values into a single character string, with each value separated by a string chosen by the user through the `sep` argument (a single space by default).

```{r eval = FALSE}
fullName <- paste(firstName, lastName)

print(fullName)

fullName <- paste(firstName, lastName, sep = "+")

print(fullName)

fullName <- paste(firstName, lastName, sep = "___")
print(fullName)
```
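When the inputs are vectors rather than single strings, `paste()` works element by element, and the `collapse` argument is what joins the results into one string. A short sketch, using a made-up vector `names_vec`:

```{r}
names_vec <- c("Jae", "Sun", "Jane")

# `sep` joins the arguments element by element; the result is still a vector
paste(names_vec, "Kim", sep = " ")

# `collapse` joins the elements of the result into one string
paste(names_vec, collapse = ", ")

# paste0() is paste() with sep = ""
paste0("id_", 1:3)
```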

As with Python, R can also extract substrings based on the index position of its characters.  There are, however, three critical differences.  First, **index positions in R start at 1**.  This is in contrast to Python, where indexation begins at 0.  

Second, **object subsets using index positions in R contain all the elements in the specified range**.  If some object called ```data``` contains five elements, ```data[2:4]``` will return the elements at the second, third, and fourth positions.  By contrast, the same subset in Python would return the objects at the third and fourth positions (or second and third positions, depending upon whether your index starts at 0 or 1).  

Third, **R does not allow indexing of character strings**. Instead, you must use the ```substr()``` function.  Note that this function must receive both the ```start``` and ```stop``` arguments.  So if you want to get all the characters between some index and the end of the string, you must use the ```nchar()``` function, which will tell you the length of a character string.

```{r}

fullName <- paste(firstName, lastName)

# this won't work like in Python
fullName[1] # R sees the string as a unitary object - it can't be indexed this way
fullName[1:4]

# So use this instead
substr(x = fullName, start = 1, stop = 2)
substr(x = fullName, start = 5, stop = 5)
substr(x = fullName, start = 1, stop = 10)
substr(x = fullName, start = 11, stop = nchar(fullName))
```
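If passing `nchar()` as the stop position feels clumsy, base R also offers `substring()`, whose `last` argument defaults to a very large number and so effectively means "to the end of the string". The sketch below compares the two on a made-up variable `full_name`:

```{r}
full_name <- "Jae Yeon Kim"

# substr() with nchar() as the stop position
substr(full_name, start = 10, stop = nchar(full_name))

# substring() runs to the end of the string when `last` is omitted
substring(full_name, first = 10)
```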

Like Python, R has a number of string methods, though these exist as individual rather than "mix-and-match" functions. For example:

```{r}
toupper(x = fullName)
tolower(x = fullName)

strsplit(x = fullName, split = " ")
strsplit(x = fullName, split = "n")

gsub(pattern = "Kim", replacement = "Choi", x = fullName)
gsub(pattern = "Jae Yeon", replacement = "Danny", x = fullName)

# Note the importance of cases! This doesn't throw an error, so you won't realize your function didn't work unless you double-check several entries.

gsub(pattern = " ", replacement = "", x = fullName) # The same function is used for replacements and stripping
```

#### Lists

Subsetting a list works in the same way as subsetting an atomic vector. Using `[` will always return a list; `[[` and `$`, as described below, let you pull out the list's components.

```{r}
l <- list("a" = 1, "b" = 2)
l

l[1]
l[[1]]
l["a"]
```

#### Matrices

The most common way of subsetting matrices (2d) is a simple generalization of 1d subsetting: you supply a 1d index for each dimension, separated by a comma. Blank subsetting is now useful because it lets you keep all rows or all columns.

```{r}
a <- matrix(1:9, nrow = 3)
colnames(a) <- c("A", "B", "C")
a

# rows come first, then columns
a[c(1, 2), ]
a[c(TRUE, FALSE, TRUE), c("B", "A")]
a[0, -2]
a[c(1, 2), -2]
```

#### Data frames

Data from data frames can be addressed like matrices (with row and column indicators separated by a comma).

```{r}
df <- data.frame(x = 4:6, y = 3:1, z = letters[1:3])
df

# return only the rows where x == 6
df[df$x == 6, ]

# return the first and third row
df[c(1, 3), ]

# return the first and third row and the first and second column
df[c(1, 3), c(1, 2)]
```

Data frames possess the characteristics of both lists and matrices: if you subset with a single vector, they behave like lists and return only the columns.

```{r}
# There are two ways to select columns from a data frame
# Like a list:
df[c("x", "z")]
# Like a matrix
df[, c("x", "z")]
```

But there's a significant difference when selecting a single column: matrix subsetting simplifies by default, list subsetting does not.

```{r}
(df["x"])
class((df["x"]))

(df[, "x"])
class((df[, "x"]))
```
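If you want matrix-style subsetting to keep the data frame structure even for a single column, you can pass `drop = FALSE`. A minimal sketch that recreates the same `df` so it can run on its own:

```{r}
df <- data.frame(x = 4:6, y = 3:1, z = letters[1:3])

# Matrix-style subsetting of one column simplifies to a vector...
str(df[, "x"])

# ...unless you ask it not to with drop = FALSE
str(df[, "x", drop = FALSE])
```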

See the bottom section on [Simplifying and Preserving to know more](#simplify-preserve)

#### Subsetting operators 

There are two other subsetting operators: `[[` and `$`. 

* `[[` is similar to `[`, except it can only return a single value, and it allows you to pull pieces out of a list. 
* `$` is a useful shorthand for `[[` combined with character subsetting. 

##### `[[`

You need `[[` when working with lists. When `[` is applied to a list it always returns a list: it never gives you the list's contents. To get the contents, you need `[[`:

>  "If list `x` is a train carrying objects, then `x[[5]]` is
> the object in car 5; `x[4:6]` is a train of cars 4-6." 
>
> --- @RLangTip

Because data frames are lists of columns, you can use `[[` to extract a column from data frames:

```{r}
mtcars

# these two are equivalent
mtcars[[1]]
mtcars[, 1]

# which differs from this:
mtcars[1]
```

##### `$`

`$` is a shorthand operator, where `x$y` is equivalent to `x[["y", exact = FALSE]]`.  It's often used to access variables in a data frame:

```{r}
# these two are equivalent
mtcars[["cyl"]]
mtcars$cyl
```
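The `exact = FALSE` part means that `$` does partial matching on names, which `[[` does not do by default. The sketch below shows the difference on a small list; `settings` and `tolerance` are made-up names for this example.

```{r}
# A small list with a made-up element name
settings <- list(tolerance = 0.01)

# `$` matches partial names
settings$tol

# `[[` matches exactly by default, so this returns NULL
settings[["tol"]]

# `[[` with exact = FALSE behaves like `$`
settings[["tol", exact = FALSE]]
```

Partial matching can hide typos, so spelling names out in full is usually the safer habit.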

One common mistake with `$` is to try and use it when you have the name of a column stored in a variable:

```{r}
var <- "cyl"
# Doesn't work - mtcars$var translated to mtcars[["var"]]
mtcars$var

# Instead use [[
mtcars[[var]]
```

#### Subassignment

All subsetting operators can be combined with an assignment operator to modify selected values of the input vector.

```{r, error = TRUE}
x <- 1:5
x
x[c(1, 2)] <- 2:3
x

# The length of the LHS needs to match the RHS!
x[-1] <- 4:1
x

x[1] <- 4:1 # Warning: only the first element of the RHS is used

# This is mostly useful when conditionally modifying vectors
df <- data.frame(a = c(1, 10, NA))
df
df$a[df$a < 5] <- 0
df
```

## Tidyverse

### The Big Picture

> "Tidy data sets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table." - Hadley Wickham

1.  Variables -\> **Columns**
2.  Observations -\> **Rows**
3.  Values -\> **Cells**

![Tidy Data Example (Source: R for Data Science)](https://garrettgman.github.io/images/tidy-1.png)

If dataframes are tidy, it's easy to transform, visualize, model, and program them using tidyverse packages (a whole workflow).

![Tidyverse: an opinionated collection of R packages](https://miro.medium.com/max/960/0*mlPyX0NE0WQwEzpS.png)

-   Nevertheless, don't be **religious**.

> In summary, tidy data is a useful conceptual idea and is often the right way to go for general, small data sets, but may not be appropriate for all problems. - Jeff Leek

For instance, in many data science applications, linear algebra-based computations are essential (e.g., [Principal Component Analysis](https://www.math.upenn.edu/~kazdan/312S13/JJ/PCA-JJ.pdf)). These computations are optimized to work on matrices, not tidy data frames (for more information, read [Jeff Leek's blog post](https://simplystatistics.org/2016/02/17/non-tidy-data/)).

This is what tidy data looks like.

```{r}
library(tidyverse)

table1
```


**Additional tips** 

There are so many different ways of looking at data in R. Can you discuss the pros and cons of each approach? Which one do you prefer and why?


* `str(table1)`

* `glimpse(table1)`: similar to `str()` but with cleaner output 

* `skim(table1)`: `str()` + `summary()` + more 


- The big picture 
    - Tidying data with **tidyr**
    - Processing data with **dplyr**
    
These two packages don't do anything that base R can't, but they simplify the most common data manipulation tasks. Plus, they are fast, consistent, and more readable.

Practically, this approach is useful because you will have a consistent data format across all the projects you're working on. Also, tidy data works well with key packages (e.g., `dplyr`, `ggplot2`) in R.

Computationally, this approach is useful for vectorized programming because "different variables from the same observation are always paired". Vectorized means that a function operates on a whole vector at once, applying the same operation to each element without an explicit loop.
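To make the idea of vectorization concrete, here is a small sketch with made-up numbers (the variable names are hypothetical). The vectorized expression and the explicit loop compute the same thing; the vectorized version pairs the elements by position for you.

```{r}
heights_cm <- c(160, 172, 185)
weights_kg <- c(55, 70, 90)

# Vectorized: one expression operates on every element, paired by position
bmi_vectorized <- weights_kg / (heights_cm / 100)^2

# The same computation with an explicit loop
bmi_loop <- numeric(length(heights_cm))
for (i in seq_along(heights_cm)) {
  bmi_loop[i] <- weights_kg[i] / (heights_cm[i] / 100)^2
}

all.equal(bmi_vectorized, bmi_loop)
```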

## Tidying (tidyr)

### Reshaping

**Signs of messy datasets**

1. Column headers are values, not variable names.
2. Multiple variables are stored in one column.
3. Variables are stored in both rows and columns.
4. Multiple types of observational units are stored in the same table.
5. A single observational unit is stored in multiple tables.

Let's take a look at the cases of untidy data.

![Messy Data Case 1 (Source: R for Data Science)](https://garrettgman.github.io/images/tidy-5.png)

-   Make It Longer

    | Col1 | Col2 | Col3 |
    |------|------|------|
    |      |      |      |
    |      |      |      |
    |      |      |      |

**Challenge**: Why is this data not tidy?

```{r}

table4a
```

-   Let's pivot (rotate by 90 degrees).


![Concept map for pivoting. By Florian Schmoll, Monica Alonso.](https://github.com/rstudio/concept-maps/raw/master/en/pivoting.svg)


-   [`pivot_longer()`](https://tidyr.tidyverse.org/reference/pivot_longer.html) increases the number of rows (longer) and decreases the number of columns. The inverse function is `pivot_wider()`. These functions improve the usability of `gather()` and `spread()`.

![What pivot\_longer() does (Source: <https://www.storybench.org>)](https://www.storybench.org/wp-content/uploads/2019/08/pivot-longer-image.png)


![Concept map for pipe operator. By Jeroen Janssens, Monica Alonso.](https://education.rstudio.com/blog/2020/09/concept-maps/pipe-operator.png)

- The pipe operator `%>%` originally comes from the `magrittr` package. The idea behind the pipe operator is [similar to](https://www.datacamp.com/community/tutorials/pipe-r-tutorial) what we learned about composing functions in high school: given f: B -> C and g: A -> B, the composition can be expressed as $f(g(x))$. The pipe operator chains operations. When reading the pipe operator, read it as "and then" (Wickham's recommendation). In RStudio, the keyboard shortcut is Ctrl + Shift + M (Cmd + Shift + M on macOS). The key idea here is to avoid creating temporary variables and to focus on verbs (functions). We'll learn more about this functional programming paradigm later on.
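As a quick illustration of "and then" reading, the sketch below computes the same value twice, once with nested calls and once with the pipe. The vector `x` is a made-up example.

```{r}
library(magrittr)

x <- c(1, 4, 9)

# Nested calls: read from the inside out
nested_result <- round(mean(sqrt(x)), 1)

# Piped calls: read top to bottom as "take x, and then ..."
piped_result <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested_result, piped_result)
```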

```{r}

table4a

# Old way, less intuitive
table4a %>%
  gather(
    key = "year", # Current column names
    value = "cases", # The values matched to cases
    c("1999", "2000")
  ) # Selected columns
```

```{r}

# New way, more intuitive
table4a %>%
  pivot_longer(
    cols = c("1999", "2000"), # Columns to pivot
    names_to = "year", # The old column names go into a new column called year
    values_to = "cases" # The old cell values go into a new column called cases
  )
```

-   There's another problem, did you catch it?

-   The data type of the `year` variable should be `numeric`, not `character`. By default, `pivot_longer()` stores the former column names as character strings.

-   You can fix this problem by using the `names_transform` argument.

```{r}

table4a %>%
  pivot_longer(
    cols = c("1999", "2000"), # Columns to pivot
    names_to = "year", # The old column names go into a new column called year
    values_to = "cases", # The old cell values go into a new column called cases
    names_transform = list(year = readr::parse_number) # Convert year to numeric
  )
```

**Additional tips**

`parse_number()` also keeps only numeric information in a variable.

```{r}

parse_number("reply1994")
```

A flat file (e.g., CSV) is a rectangular collection of strings. [Parsing](https://cran.r-project.org/web/packages/readr/vignettes/readr.html) determines the type of each column and turns it into a vector of a more specific type. The tidyverse has `parse_*()` functions (from the `readr` package) that are flexible and fast (e.g., `parse_integer()`, `parse_double()`, `parse_logical()`, `parse_datetime()`, `parse_date()`, `parse_time()`, `parse_factor()`, etc).
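Here is a short sketch of a few of these parsers applied to character vectors. The input strings are made up for illustration; each call turns strings into a vector of a more specific type.

```{r}
library(readr)

ints <- parse_integer(c("1", "2", "3")) # character -> integer
dbls <- parse_double(c("1.5", "2.25")) # character -> double
lgls <- parse_logical(c("TRUE", "FALSE", "NA")) # character -> logical
dates <- parse_date("2020-01-31") # character -> Date (ISO 8601 by default)

class(ints)
class(dbls)
class(lgls)
class(dates)
```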

-   Let's do another practice.

**Challenge**

1.  Why is this data not tidy? (This exercise comes from the [`pivot` function vignette](https://tidyr.tidyverse.org/articles/pivot.html).) Too long or too wide?

```{r}

billboard
```

2.  How can you fix it? Which pivot?

```{r}

# Old way
billboard %>%
  gather(
    key = "week",
    value = "rank",
    starts_with("wk") # Select columns with a tidyselect helper
  ) %>%
  drop_na() # Drop NAs
```

-   Note that `pivot_longer()` is more versatile than `gather()`.

```{r}

# New way
billboard %>%
  pivot_longer(
    cols = starts_with("wk"), # Select columns with a tidyselect helper
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE # Drop NAs
  )
```

-   Make It Wider

-   Why is this data not tidy?

```{r}
table2
```

-   Each observation is spread across two rows.

-   How can you fix it?: `pivot_wider()`.

**Two differences between `pivot_longer()` and `pivot_wider()`**

-   In `pivot_longer()`, the arguments are named `names_to` and `values_to` (*to*).

-   In `pivot_wider()`, this pattern is reversed. The arguments are named `names_from` and `values_from` (*from*).

-   `pivot_longer()` typically takes three arguments (`cols`, `names_to`, `values_to`). Only `cols` is strictly required; `names_to` and `values_to` default to `"name"` and `"value"`.

-   `pivot_wider()` typically takes two arguments (`names_from`, `values_from`), which default to `name` and `value`.

![What pivot\_wider() does (Source: <https://www.storybench.org>)](https://www.storybench.org/wp-content/uploads/2019/08/pivot-wider-image.png)

```{r}

# Old way
table2 %>%
  spread(
    key = type,
    value = count
  )
```

```{r}
# New way
table2 %>%
  pivot_wider(
    names_from = type, # first
    values_from = count # second
  )
```

Sometimes, a consultee comes to me and asks: "I don't have missing values in my original dataframe. Then R said that I had missing values after doing some data transformations. What happened?"

Here's an answer.

Values can be missing in two ways.

-   *Implicit missing values*: simply not present in the data.

-   *Explicit missing values*: flagged with NA

**Challenge**

The example comes from [*R for Data Science*](https://r4ds.had.co.nz/tidy-data.html).

```{r}


stocks <- tibble(
  year = c(2019, 2019, 2019, 2020, 2020, 2020),
  qtr = c(1, 2, 3, 2, 3, 4),
  return = c(1, 2, 3, NA, 2, 3)
)

stocks
```

-   Where is the explicit missing value?

-   Does `stocks` have implicit missing values?

```{r}
# implicit missing values become explicit
stocks %>%
  pivot_wider(
    names_from = year,
    values_from = return
  )
```
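Pivoting is not the only way to surface implicit missing values. As a sketch, `tidyr::complete()` (not covered above) adds a row for every combination of the listed variables. The toy tibble `stocks_toy` below is made up for illustration.

```{r}
library(tidyr)
library(tibble)

stocks_toy <- tibble(
  year = c(2019, 2019, 2020),
  qtr = c(1, 2, 2),
  return = c(0.5, NA, 0.3) # one explicit NA
)

# complete() adds a row for every year-qtr combination and
# fills the combinations absent from the data with NA
stocks_completed <- stocks_toy %>%
  complete(year, qtr)

stocks_completed
```

The combination (2020, 1) was implicitly missing in `stocks_toy`; after `complete()`, it appears as a row with an explicit `NA`.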

**Challenge**

-   This exercise comes from the [`pivot` function vignette](https://tidyr.tidyverse.org/articles/pivot.html).

-   Could you make `station` a series of dummy variables using `pivot_wider()`?

```{r}
fish_encounters
```

1.  Which pivot should you use?

2.  Are there explicit missing values?

3.  How could you turn these NAs into 0s? Check the `values_fill` argument in the `pivot_wider()` function.

-   Separate

![Messy Data Case 2 (Source: R for Data Science)](https://garrettgman.github.io/images/tidy-6.png)

```{r}

# Toy example
df <- data.frame(x = c(NA, "Dad.apple", "Mom.orange", "Daughter.banana"))

df
```

```{r}

# Separate
df %>%
  separate(x, into = c("Name", "Preferred_fruit"))

# Don't need the first variable

df %>%
  separate(x, into = c(NA, "Preferred_fruit"))
```

**Practice**

```{r}
table3
```

-   Note the `sep` argument. You can specify how to separate joined values.

```{r}
table3 %>%
  separate(rate,
    into = c("cases", "population"),
    sep = "/"
  )
```

-   Note the `convert` argument. You can specify whether to convert the new values automatically.

```{r}
table3 %>%
  separate(rate,
    into = c("cases", "population"),
    sep = "/",
    convert = TRUE
  ) # cases and population become integers
```

-   Unite

`pivot_longer()` \<-\> `pivot_wider()`

`separate()` \<-\> `unite()`

```{r}

# Create a toy example
df <- data.frame(
  name = c("Jae", "Sun", "Jane", NA),
  birthmonth = c("April", "April", "June", NA)
)

# Include missing values
df %>% unite(
  "contact",
  c("name", "birthmonth")
)

# Do not include missing values
df %>% unite("contact",
  c("name", "birthmonth"),
  na.rm = TRUE
)
```

### Filling

This is a relatively lesser-known function of the tidyr package. However, I found it super useful for completing time-series data. For instance, how can you replace the NAs in the following example? (This use case is drawn from the [tidyr package documentation](https://tidyr.tidyverse.org/reference/fill.html).)

```{r}
# Example
stock <- tibble::tribble(
  ~quarter, ~year, ~stock_price,
  "Q1", 2000, 10000,
  "Q2", NA, 10001, # Replace NA with 2000
  "Q3", NA, 10002, # Replace NA with 2000
  "Q4", NA, 10003, # Replace NA with 2000
  "Q1", 2001, 10004,
  "Q2", NA, 10005, # Replace NA with 2001
  "Q3", NA, 10006, # Replace NA with 2001
  "Q4", NA, 10007, # Replace NA with 2001
)

fill(stock, year)
```

Let's take a slightly more complex example. 

```{r}
# Example
yelp_rate <- tibble::tribble(
  ~neighborhood, ~restaurant_type, ~popularity_rate,
  "N1", "Chinese", 5,
  "N2", NA, 4,
  "N3", NA, 3,
  "N4", NA, 2,
  "N1", "Indian", 1,
  "N2", NA, 2,
  "N3", NA, 3,
  "N4", NA, 4,
  "N1", "Mexican", 5
)

fill(yelp_rate, restaurant_type) # default is .direction = "down"
fill(yelp_rate, restaurant_type, .direction = "up")
```

## Manipulating (dplyr)

![Concept map for dplyr. By Monica Alonso, Greg Wilson.](https://education.rstudio.com/blog/2020/09/concept-maps/dplyr.png)

`dplyr` is often more convenient than the base R approaches to data processing:

- fast to run (due to the C++ backend) and intuitive to type
- works well with tidy data and databases (thanks to [`dbplyr`](https://dbplyr.tidyverse.org/))

### Rearranging

-   Arrange

-   Order rows

```{r}

dplyr::arrange(mtcars, mpg) # Low to High (default)

dplyr::arrange(mtcars, desc(mpg)) # High to Low
```

-   Rename

-   Rename columns

```{r}

df <- tibble(y = c(2011, 2012, 2013))

df %>%
  rename(
    Year = # NEW name
      y
  ) # OLD name
```

### Subset observations (rows)

-   Choose row by logical condition

-   Single condition

```{r}
starwars %>%
  dplyr::filter(gender == "feminine") %>%
  arrange(desc(height))
```

The following filtering example was inspired by [Suzan Baert's dplyr blog post](https://suzan.rbind.io/2018/02/dplyr-tutorial-3/).

-   Multiple conditions (numeric)

```{r}

# First example
starwars %>%
  dplyr::filter(height < 180, height > 160) %>%
  nrow()

# Same as above
starwars %>%
  dplyr::filter(height < 180 & height > 160) %>%
  nrow()

# Not same as above
starwars %>%
  dplyr::filter(height < 180 | height > 160) %>%
  nrow()
```

**Challenge**

(1) Use `filter(between())` to find characters whose heights are between 160 and 180 and (2) count the number of these observations.

-   Minimal reproducible example

```{r}

df <- tibble(
  heights = c(160:180),
  char = rep("none", length(c(160:180)))
)

df %>%
  dplyr::filter(between(heights, 161, 179))
```
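One detail worth keeping in mind: `between()` includes both boundaries, unlike the strict `<` and `>` comparisons used earlier. A minimal sketch with a made-up vector:

```{r}
library(dplyr)

x <- c(159, 160, 170, 180, 181)

# between() includes both boundaries
inside <- between(x, 160, 180)
inside

# Equivalent base R expression
identical(inside, x >= 160 & x <= 180)
```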

-   Multiple conditions (character)

```{r}

# Filter names include ars; `grepl` is a base R function

starwars %>%
  dplyr::filter(grepl("ars", tolower(name)))

# Or, if you prefer the tidyverse way (`str_detect()` comes from stringr)

starwars %>%
  dplyr::filter(str_detect(tolower(name), "ars"))

# Filter brown and black hair_color

starwars %>%
  dplyr::filter(hair_color %in% c("black", "brown"))
```

**Challenge**

Use `str_detect()` to find characters whose names include "Han".

-   Choose row by position (row index)

```{r}

starwars %>%
  arrange(desc(height)) %>%
  slice(1:6)
```

-   Sample by a fraction

```{r}

# For reproducibility
set.seed(1234)

# Old way

starwars %>%
  sample_frac(0.10,
    replace = FALSE
  ) # Without replacement

# New way

starwars %>%
  slice_sample(
    prop = 0.10,
    replace = FALSE
  )
```

-   Sample by number

```{r}

# Old way

starwars %>%
  sample_n(20,
    replace = FALSE
  ) # Without replacement

# New way

starwars %>%
  slice_sample(
    n = 20,
    replace = FALSE
  ) # Without replacement
```

-   Top 10 rows ordered by height

```{r}

# Old way
starwars %>%
  top_n(10, height)

# New way
starwars %>%
  slice_max(height, n = 10) # Variable first, Argument second
```

### Subset variables (columns)

```{r}

names(msleep)
```

-   Select only numeric columns

```{r}

# Only numeric
msleep %>%
  dplyr::select(where(is.numeric))
```

**Challenge**

Use `select(where())` to find only non-numeric columns

-   Select the columns that include "sleep" in their names

```{r}

msleep %>%
  dplyr::select(contains("sleep"))
```

-   Select the columns that include either "sleep" or "wt" in their names

-   Base R way

`grepl()` is one of the base R pattern-matching functions.

```{r}

msleep[grepl("sleep|wt", names(msleep))]
```

**Challenge**

Use `select(matches())` to find columns whose names include either "sleep" or "wt".

-   Select the columns that start with "b"

```{r}

msleep %>%
  dplyr::select(starts_with("b"))
```

-   Select the columns that end with "wt"

```{r}

msleep %>%
  dplyr::select(ends_with("wt"))
```

-   Select the columns using both beginning and end string patterns

The key idea is that you can use Boolean operators (`!`, `&`, `|`) to combine different string pattern matching statements.

```{r}

msleep %>%
  dplyr::select(starts_with("b") & ends_with("wt"))
```

-   Select the `order` column and move it before everything else

```{r}

# By specifying a column
msleep %>%
  dplyr::select(order, everything())
```

-   Select variables from a character vector.

```{r}

msleep %>%
  dplyr::select(any_of(c("name", "order"))) %>%
  colnames()
```

-   Select the variables named in character + number pattern

```{r}

msleep$week8 <- NA

msleep$week12 <- NA

msleep$week_extra <- 0

msleep %>%
  dplyr::select(num_range("week", c(1:12)))
```

**Additional tips**

`msleep` data has nicely cleaned column names. But real-world data are usually messier. The `janitor` package is useful to fix this kind of problem.

```{r}

messy_df <- tibble::tribble(
  ~"ColNum1", ~"COLNUM2", ~"COL & NUM3",
  1, 2, 3
)

messy_df

pacman::p_load(janitor)

janitor::clean_names(messy_df)
```

`janitor::tabyl()` is helpful for doing crosstabulation and a nice alternative to `table()` function. 

```{r}

# Frequency table; The default output class is table
table(gapminder$country)

# Frequency table (unique value, n, percentage)
janitor::tabyl(gapminder$country)

# If you want to add percentage ...
gapminder %>%
  tabyl(country) %>%
  adorn_pct_formatting(digits = 0, affix_sign = TRUE)
```


### Create variables 

```{r include=FALSE, eval=FALSE}

mutate(
  .data, # data.frame
  ...
) # new column

mutate(mtcars, column0 = 0)
```

#### Change values using conditions 

You can think of `case_when()` (multiple conditions) as an extended version of `ifelse()` (binary conditions). 

```{r}

mtcars <- mtcars %>%
  mutate(cyl_dummy = case_when(
    cyl > median(cyl) ~ "High", # if condition
    cyl < median(cyl) ~ "Low", # else if condition
    TRUE ~ "Median"
  )) # else condition

mtcars %>% pull(cyl_dummy)
```

#### Change values manually 

```{r}

mtcars %>%
  mutate(cyl_dummy = recode(cyl_dummy, # Target column
    "High" = "2", # Old - New
    "Low" = "0",
    "Median" = "1"
  )) %>%
  pull(cyl_dummy)
```


### Counting

-   How many observations (country-year rows) are in each continent?

```{r}
gapminder %>%
  count(continent)
```

-   Let's arrange the result.

```{r}

# Just add a new argument `sort = TRUE`
gapminder %>%
  count(continent, sort = TRUE)

# Same as above; How nice!
gapminder %>%
  count(continent) %>%
  arrange(desc(n))
```

**Challenge**

Count the number of observations per `continent` and `year` and arrange them in descending order.

Let's take a deeper look at how things work under the hood.

-   `tally()` works similarly to `nrow()`: it calculates the total number of cases in a dataframe.

-   `count()` = `group_by()` + `tally()`

```{r}

gapminder %>%
  tally()
```
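The identity `count()` = `group_by()` + `tally()` can be checked on a small example. This is a sketch on a made-up tibble (`toy` is a hypothetical name), not a description of how dplyr implements `count()` internally.

```{r}
library(dplyr)

toy <- tibble(
  group = c("a", "a", "b", "b", "b"),
  value = 1:5
)

# One step
counted <- toy %>%
  count(group)

# Two steps
tallied <- toy %>%
  group_by(group) %>%
  tally()

counted
tallied
```

Both results have one row per group and the same `n` column.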

-   `add_tally()` = `mutate(n = n())`

**Challenge**

What does n in the below example represent?

```{r}

gapminder %>%
  dplyr::select(continent, country) %>%
  add_tally()
```

-   `add_count`

Add count as a column.

```{r}

# Add count as a column
gapminder %>%
  group_by(continent) %>%
  add_count(year)
```

**Challenge**

Do cases 1 and 2 in the below code chunk produce the same outputs? If so, why?

```{r}

# Case 1
gapminder %>%
  group_by(continent, year) %>%
  count()

# Case 2
gapminder %>%
  group_by(continent) %>%
  count(year)
```

`count()` is a simple function, but it is still helpful for learning an essential concept underlying complex data wrangling: the split-apply-combine strategy. For more information, read Wickham's (2011) article ["The Split-Apply-Combine Strategy for Data Analysis"](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.182.5667&rep=rep1&type=pdf), published in the *Journal of Statistical Software* (especially pages 7-8). [`plyr`](https://github.com/hadley/plyr) was the (now retired) package that demonstrated this idea, which has since evolved in two directions: [dplyr](https://dplyr.tidyverse.org/) (for data frames) and [purrr](https://purrr.tidyverse.org/) (for lists).
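To see the three steps separately, here is a sketch of split-apply-combine written in base R on the built-in `mtcars` data. The object names (`pieces`, `means`, `combined`) are made up for illustration; `group_by()` + `summarise()` performs the same steps for you.

```{r}
# Split: one data frame per number of cylinders
pieces <- split(mtcars, mtcars$cyl)

# Apply: compute the mean mpg within each piece
means <- lapply(pieces, function(piece) mean(piece$mpg))

# Combine: bind the results back into one data frame
combined <- data.frame(
  cyl = names(means),
  mean_mpg = unlist(means),
  row.names = NULL
)

combined
```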

### Summarizing

#### Basic

- Create a summary
- Think of `summarise()` as an extended version of `count()`.

```{r}

gapminder %>%
  group_by(continent) %>%
  summarise(
    n = n(),
    mean_gdp = mean(gdpPercap),
    sd_gdp = sd(gdpPercap)
  )

tablea <- gapminder %>%
  group_by(continent) %>%
  summarise(
    n = n(),
    mean_gdp = mean(gdpPercap),
    sd_gdp = sd(gdpPercap)
  )
```

-   Produce publishable tables

```{r}
pacman::p_load(
  kableExtra,
  flextable
)

# For HTML and LaTeX
tablea %>% kableExtra::kable()

# For HTML and MS Office suite
tablea %>% flextable::flextable()
```

#### Scoped summaries

-   Old way

-   `summarise_all()`

```{r}

# Create a wide-shaped data example
wide_gapminder <- gapminder %>%
  dplyr::filter(continent == "Europe") %>%
  pivot_wider(
    names_from = country,
    values_from = gdpPercap
  )

# Apply summarise_all
wide_gapminder %>%
  dplyr::select(-c(1:4)) %>%
  summarise_all(mean, na.rm = TRUE)
```

-   `summarise_if()`: using a logical condition

```{r}

wide_gapminder %>%
  summarise_if(is.double, mean, na.rm = TRUE)
```

-   `summarise_at()`

-   `vars()` works like `select()`

```{r}

wide_gapminder %>%
  summarise_at(vars(-c(1:4)),
    mean,
    na.rm = TRUE
  )

wide_gapminder %>%
  summarise_at(vars(contains("life")),
    mean,
    na.rm = TRUE
  )
```

**Additional tips**


![Concept map for regular expressions. By Monica Alonso, Greg Wilson.](https://github.com/rstudio/concept-maps/raw/master/en/regular-expressions.svg)


-   New way

-   `summarise()` + `across()`


![Concept map for across. By Emma Vestesson](https://github.com/rstudio/concept-maps/raw/master/en/across.svg)


-   If you find using `summarise_all()`, `summarise_if()` and `summarise_at()` confusing, here's a solution: use `summarise()` with `across()`.

-   `summarise_all()`

```{r}

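# Extra arguments go inside a lambda; passing them through across() is deprecated in recent dplyr versions
wide_gapminder %>%
  summarise(across(Albania:`United Kingdom`, ~ mean(.x, na.rm = TRUE)))

wide_gapminder %>%
  summarise(across(-c(1:4), ~ mean(.x, na.rm = TRUE)))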
```

-   `summarise_if()`

```{r}

# Wrap predicate functions in where()
wide_gapminder %>%
  summarise(across(where(is.double), ~ mean(.x, na.rm = TRUE)))
```

-   `summarise_at()`

```{r}

wide_gapminder %>%
  summarise(across(
    -c(1:4),
    ~ mean(.x, na.rm = TRUE)
  ))

wide_gapminder %>%
  summarise(across(
    contains("life"),
    ~ mean(.x, na.rm = TRUE)
  ))

wide_gapminder %>%
  summarise(across(contains("A", ignore.case = FALSE)))
```

Note that this workshop does not cover creating and manipulating variables using `mutate()` because many techniques you learned from playing with `summarise()` can be directly applied to `mutate()`.

**Challenge**

1.  Summarize the average GDP of countries whose names start with the letter "A".

2.  Turn the summary dataframe into a publishable table using either `kableExtra` or `flextable` package.

#### Tabulation (TBD)

### Grouping

#### Grouped summaries

- Calculate the mean of `gdpPercap`.

- Some functions are designed to work together. For instance, the `group_by()` function defines the strata you will use for summary statistics. Then, use `summarise()` to obtain summary statistics.

```{r}
gapminder %>%
  group_by(continent) %>% #
  summarise(mean_gdp = mean(gdpPercap))
```

-   Calculate multiple summary statistics.

```{r}
gapminder %>%
  group_by(continent) %>% #
  summarise(
    mean_gdp = mean(gdpPercap),
    count = n()
  )
```

**Optional**

-   Other summary statistics

1.  Measures of location and spread: `median(x)` (location), `sd(x)`, `IQR(x)`, `mad(x)` (the median absolute deviation)

```{r}
# The interquartile range = the difference between the 75th and 25th percentiles

gapminder %>%
  group_by(continent) %>% #
  summarise(IQR_gdp = IQR(gdpPercap))
```

2.  Measures of rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`

```{r}
gapminder %>%
  group_by(continent) %>% #
  summarise(
    min_gdp = min(gdpPercap),
    max_gdp = max(gdpPercap)
  )
```

3.  Measures of position: `first(x)`, `last(x)`, `nth(x, 2)`

```{r}
gapminder %>%
  group_by(continent) %>%
  summarise(
    first_gdp = first(gdpPercap),
    last_gdp = last(gdpPercap)
  )

gapminder %>%
  group_by(continent) %>%
  arrange(gdpPercap) %>% # Adding arrange
  summarise(
    first_gdp = first(gdpPercap),
    last_gdp = last(gdpPercap)
  )
```

4.  Measures of counts: `n()` (all rows; it takes no arguments), `sum(!is.na(x))` (only non-missing rows), `n_distinct(x)` (the number of distinct values)

```{r}
gapminder %>%
  group_by(continent) %>%
  summarise(ns = n())
```
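The three count measures answer different questions, which is easiest to see when a column has duplicates and missing values. A minimal sketch on a made-up tibble (`toy_counts` is a hypothetical name):

```{r}
library(dplyr)

toy_counts <- tibble(
  group = c("a", "a", "a", "b", "b"),
  x = c(1, 1, NA, 2, 3)
)

count_summary <- toy_counts %>%
  group_by(group) %>%
  summarise(
    n_rows = n(), # all rows in the group
    n_non_missing = sum(!is.na(x)), # rows where x is not NA
    n_unique = n_distinct(x, na.rm = TRUE) # distinct non-missing values
  )

count_summary
```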

5.  Counts and proportions of logical values: `sum(condition about x)` (the number of TRUEs in x), `mean(condition about x)` (the proportion of TRUEs in x)

```{r}
gapminder %>%
  group_by(continent) %>%
  summarise(rich_countries = mean(gdpPercap > 20000))
```

**Additional tips**

Also, check out window functions such as `cumsum()` and `lag()`. Window functions are a variant of aggregate functions: they take a vector as input and return a vector of the same length as output. 

```{r}

vec <- c(1:10)

# Typical aggregate function
sum(vec) # The output length is one

# Window function
cumsum(vec) # The output length is ten

# Let's compare them side-by-side (namespaced to avoid masking by other packages)
waldo::compare(
  sum(vec),
  cumsum(vec)
)
```
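Window functions are most useful inside `mutate()`, where the output must have one value per row. Below is a sketch with made-up monthly revenue figures (the tibble `sales` and its columns are hypothetical).

```{r}
library(dplyr)

sales <- tibble(
  month = 1:4,
  revenue = c(10, 15, 12, 20)
)

sales_windowed <- sales %>%
  mutate(
    running_total = cumsum(revenue), # cumulative sum up to each row
    previous = lag(revenue), # value from the previous row (NA for the first row)
    change = revenue - previous # month-over-month difference
  )

sales_windowed
```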

### Joining

Relational data = multiple tables of data

![Relational data example](https://d33wubrfki0l68.cloudfront.net/245292d1ea724f6c3fd8a92063dcd7bfb9758d02/5751b/diagrams/relational-nycflights.png)

**Key ideas**

- A **primary key** "uniquely identifies an observation in its table"

```{r}

# Example
planes$tailnum %>% head()
```

Verify the primary key.

`tailnum` should be unique. 

**Challenge**

What outcome do you expect?

```{r}
planes %>%
  count(tailnum) %>%
  dplyr::filter(n > 1)
```

**Optional**

If a dataframe doesn't have a primary key, you can add one, called a **surrogate** key.

```{r}

# Toy example
df <- tibble(
  x = c(1:3),
  y = c(4:6)
)

# Add a row_index column
df <- df %>% rowid_to_column("ID")
```

- A **foreign** key "uniquely identifies an observation in another table."

```{r}

flights$tailnum %>% head()
```

For joining, don't be distracted by other details and focus on KEYS!

#### Mutating joins

> "Add new variables to one data frame from matching observations in another"

Using a simple toy example is great because it is easy to see how things work in such a narrow context.

-   Toy example

```{r}

# Table 1
x <- tibble(
  key = c(1:4),
  val_x = c("x1", "x2", "x3", "x4")
)

# Table 2
y <- tibble(
  key = c(1:5),
  val_y = c("y1", "y2", "y3", "y4", "y5")
)
```

-   Inner Join

`inner_join()` keeps the matched values in both tables. If the left table is a subset of the right table, then `left_join()` is the same as `inner_join()`.

**Challenge**

Which keys are going to be shared?

```{r}

inner_join(x, y)
```

![Mutating joins](https://d33wubrfki0l68.cloudfront.net/aeab386461820b029b7e7606ccff1286f623bae1/ef0d4/diagrams/join-venn.png)

-   Left Join

`left_join()`, `right_join()` and `full_join()` are outer join functions. Unlike `inner_join()`, outer join functions keep observations that appear in at least one of the tables.

`left_join()` keeps all observations in the left table and only the matched observations from the right table.

```{r}

left_join(x, y)
```

-   Right Join

`right_join()` does the opposite. 

```{r}

right_join(x, y)
```

-   Full Join

`full_join()` keeps the observations from both tables. If an observation has no match in the other table, the columns coming from that table are filled with NAs.

```{r}

full_join(x, y)
```
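A compact way to compare the four mutating joins is to count the rows each one returns. The sketch below uses two made-up tables whose keys only partly overlap (`left_tbl` has keys 1-4, `right_tbl` has keys 3-6); the names are hypothetical.

```{r}
library(dplyr)

left_tbl <- tibble(key = 1:4, val_x = paste0("x", 1:4))
right_tbl <- tibble(key = 3:6, val_y = paste0("y", 3:6))

join_sizes <- c(
  inner = nrow(inner_join(left_tbl, right_tbl, by = "key")), # keys 3, 4
  left = nrow(left_join(left_tbl, right_tbl, by = "key")), # keys 1-4
  right = nrow(right_join(left_tbl, right_tbl, by = "key")), # keys 3-6
  full = nrow(full_join(left_tbl, right_tbl, by = "key")) # keys 1-6
)

join_sizes
```

Specifying `by = "key"` explicitly also silences the "Joining, by" message and documents which column is the key.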

#### Filtering joins

> Filter observations from one data frame based on whether they match an observation in the other table.

-   Semi Join

In SQL, this kind of filtering is typically written as a subquery.

-   Filtering without joining

```{r}

# Create the list of the top 10 destinations
top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  top_n(10)

# Filter
filtered <- flights %>%
  dplyr::filter(dest %in% top_dest$dest)
```

-   Using semi join: only keep (INCLUDE) the rows that were matched between the two tables

```{r}

joined <- flights %>%
  semi_join(top_dest)

head(filtered == joined)
```

-   Anti Join

`anti_join()` does the opposite: it excludes the rows that were matched between the two tables. This is a handy technique for filtering out stopwords in computational text analysis.

```{r}

flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(tailnum, sort = TRUE)
```
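The stopword use case can be sketched with two tiny made-up tables. The word lists here are invented for illustration and are not a real stopword lexicon.

```{r}
library(dplyr)

words <- tibble(word = c("the", "tidy", "data", "of", "r"))
stopwords_toy <- tibble(word = c("the", "of"))

# semi_join(): keep the rows of `words` that have a match
kept <- semi_join(words, stopwords_toy, by = "word")

# anti_join(): keep the rows of `words` that have no match
without_stopwords <- anti_join(words, stopwords_toy, by = "word")

kept
without_stopwords
```

Neither function adds columns from the second table; both only filter the rows of the first.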

## Modeling (broom)

### Nesting

#### nest

The following example comes from [R for Data Science](https://r4ds.had.co.nz/many-models.html) by Garrett Grolemund and Hadley Wickham.

-   How can you run multiple models simultaneously? Using a nested data frame.

```{=html}

<iframe width="560" height="315" src="https://www.youtube.com/embed/rz3_FDVt9eg" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

<p> Hadley Wickham: Managing many models with R </p>
```

-   **Grouped data: each row = an observation**

-   **Nested data: each row = a group**

**Challenge**

Why did we use `country` and `continent` for nesting variables in the following example?

```{r}

nested <- gapminder %>%
  group_by(country, continent) %>%
  nest()

head(nested)

nested$data %>% pluck(1)
```

-   Custom function

```{r}

lm_model <- function(df) {
  lm(lifeExp ~ year, data = df)
}
```

-   Apply function to the nested data

```{r}

# Apply lm_model to the nested data

nested <- nested %>%
  mutate(models = map(data, lm_model)) # Add the list object as a new column

head(nested)
```

S3 is part of R's object-oriented systems. If you need further information, check out [this section](http://adv-r.had.co.nz/S3.html) in Hadley's Advanced R.

#### unnest

- glance() 

`glance()` function from `broom` package inspects the quality of a statistical model.

**Additional tips**

-   `broom::glance(model)`: for evaluating model quality and/or complexity
-   `broom::tidy(model)`: for extracting each coefficient in the model (the estimates + its variability)
-   `broom::augment(model, data)`: for getting extra values (residuals, and influence statistics). A convenient tool if you want to plot fitted values and raw data together. 

[Broom: Converting Statistical Models to Tidy Data Frames by David Robinson](https://www.youtube.com/watch?v=7VGPUBWGv6g&ab_channel=Work-Bench)

```{r}

glanced <- nested %>%
  mutate(glance = map(models, broom::glance))

# Pluck the first item on the list
glanced$glance %>% pluck(1)

# Pull p.value
glanced$glance %>%
  pluck(1) %>%
  pull(p.value)
```

`unnest()` unpacks the list objects stored in the `glance` column.

```{r}

glanced %>%
  unnest(glance) %>%
  arrange(r.squared)

glanced %>%
  unnest(glance) %>%
  ggplot(aes(continent, r.squared)) +
  geom_jitter(width = 0.5)
```

- tidy() 

```{r}
nested <- gapminder %>%
  group_by(continent) %>%
  nest()

nested <- nested %>%
  mutate(models = map(data, ~ lm(lifeExp ~ year + country, data = .)))

tidied <- nested %>%
  mutate(tidied = map(models, broom::tidy))

model_out <- tidied %>%
  unnest(tidied) %>%
  mutate(term = str_replace(term, "country", "")) %>%
  dplyr::select(continent, term, estimate, p.value) %>%
  mutate(p_threshold = ifelse(p.value < 0.05, 1, 0))

model_out %>%
  dplyr::filter(p_threshold == 1) %>%
  pull(term) %>%
  unique()
model_out %>%
  dplyr::filter(p_threshold == 0) %>%
  pull(term) %>%
  unique()
```


### Mapping

We tasted a bit of how the `map()` function works. Let's dig into it in more depth, as this family of functions is useful. See Rebecca Barter's excellent tutorial on the `purrr` package for more information. In her words, this is "the tidyverse's answer to apply functions for iteration". The `map()` function can take a vector (of any type), a list, or a dataframe as input.

```{r}

multiply <- function(x) {
  x * x
}

df <- list(
  first_obs = rnorm(7, 1, sd = 1),
  second_obs = rnorm(7, 2, sd = 2)
) # normal distribution
```
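As a minimal sketch of how `map()` treats a list like the one above, the example below uses fixed numbers instead of random draws so the results are easy to verify by hand. The names `square` and `toy_list` are made up for illustration.

```{r}
library(purrr)

square <- function(x) {
  x * x
}

toy_list <- list(
  first_obs = c(1, 2, 3),
  second_obs = c(4, 5, 6)
)

# map() applies the function to each element and always returns a list
squared <- map(toy_list, square)
squared

# map_dbl() returns a double vector, so each result must have length one
sums <- map_dbl(toy_list, sum)
sums
```

The suffix of a `map_*()` variant tells you the type of the output, which is the main difference between the members of this family.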

**Challenge**

Try `map_df(.x = df, .f = multiply)` and tell me what's the difference between the output you got and what you saw earlier.

If you want to know more about the power and joy of functional programming in R (e.g., `purrr::map()`), then please take ["How to Automate Repeated Things in R"](https://github.com/dlab-berkeley/R-functional-programming) workshop.

## Visualizing (ggplot2)

- The following material is adapted from Kieran Healy's excellent book (2019) on [data visualization](https://socviz.co/) and Hadley Wickham's equally excellent book on [ggplot2](https://ggplot2-book.org/). For more theoretical discussions, I recommend reading [The Grammar of Graphics](https://link.springer.com/book/10.1007%2F0-387-28695-0) by Leland Wilkinson.

- Why should we care about data visualization? More precisely, why should we learn the grammar of statistical graphics?
- Sometimes, pictures are better tools than words in 1) exploring, 2) understanding, and 3) explaining data.

### Motivation 

[Anscombe](https://en.wikipedia.org/wiki/Frank_Anscombe)'s quartet comprises four datasets that are nearly identical in their descriptive statistics but look quite different when presented graphically.

```{r}
# Set theme
theme_set(theme_minimal())
```

```{r}

# Data
anscombe
```

```{r}

# Correlation
cor(anscombe)[c(1:4), c(5:8)]
```
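The diagonal of the matrix above (`x1` with `y1`, `x2` with `y2`, and so on) holds the correlations within each of the four datasets, and they are nearly identical. As a quick additional check, sketched here with base R only, the column means and standard deviations are close as well. The object name `anscombe_summary` is made up for this illustration.

```{r}
# Means and standard deviations of every column in the built-in anscombe data
anscombe_summary <- rbind(
  mean = sapply(anscombe, mean),
  sd = sapply(anscombe, sd)
)

# Rounded for easier comparison across the x columns and across the y columns
round(anscombe_summary, 2)
```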

```{r}

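# gather the x columns, then the y columns; every x is crossed with every y,
# so the diagonal panels of the grid below are the actual quartet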
anscombe_processed <- anscombe %>%
  gather(x_name, x_value, x1:x4) %>%
  gather(y_name, y_value, y1:y4)

# plot
anscombe_processed %>%
  ggplot(aes(x = x_value, y = y_value)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  facet_grid(x_name ~ y_name) +
  theme_bw() +
  labs(
    x = "X values",
    y = "Y values",
    title = "Anscombe's quartet"
  )
```
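The grid above crosses every x column with every y column, so only its diagonal panels show the quartet itself. As an alternative sketch, `tidyr::pivot_longer()` with the special `".value"` name keeps each `x` paired with its own `y`, which yields exactly four panels. The object name `anscombe_long` is made up for this illustration.

```{r}
library(tidyr)
library(ggplot2)

# Split each column name into a variable ("x" or "y") and a set number ("1" to "4");
# ".value" tells pivot_longer() to keep x and y as separate output columns
anscombe_long <- anscombe %>%
  pivot_longer(
    cols = everything(),
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)"
  )

# One panel per dataset
anscombe_long %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~set) +
  labs(title = "Anscombe's quartet")
```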

### The grammar of graphics 

- Components of the grammar of graphics:

    - data
    - aesthetic attributes (color, shape, size)
    - geometric objects (points, lines, bars)
    - stats (summary stats)
    - scales (map values in the data space)
    - coord (data coordinates)
    - facet (facetting specifications)
    
No worries about new terms. We're going to learn them by actually plotting. 

- Workflow: 

    1. Tidy data 
    2. Mapping 
    3. Geom 
    4. Co-ordinates and scales 
    5. Labels and guides
    6. Themes
    7. Save files 

### Mapping and geom

- `aes` (aesthetic mappings or aesthetics) tells which variables (x, y) in your data should be represented by which visual elements (color, shape, size) in the plot.

- `geom_` tells the type of plot you are going to use 

### Basic aes (x, y)

```{r}

p <- ggplot(
  data = gapminder,
  mapping = aes(x = gdpPercap, y = lifeExp)
) # ggplot(), like R functions in general, also accepts positional arguments, so you don't need to name data and mapping each time you use ggplot2.

p

p + geom_point()

p + geom_point() + geom_smooth() # geom_smooth has calculated a smoothed line;
# the shaded area is the 95% confidence interval around the line
```

### Univariate distribution

- `geom_histogram()`: For the probability distribution of a continuous variable. Bins divide the entire range of values into a series of intervals (see [the Wiki entry](https://en.wikipedia.org/wiki/Histogram)). 
- `geom_density()`: Also for the probability distribution of a continuous variable. It calculates a [kernel density estimate](https://en.wikipedia.org/wiki/Kernel_density_estimation) of the underlying distribution. 

#### Histogram 

```{r}

data(midwest) # load midwest dataset

midwest
```

```{r, eval = FALSE}
midwest %>%
  ggplot(aes(x = area)) +
  geom_point() # fails: geom_point() requires a y aesthetic.
```

```{r}
midwest %>%
  ggplot(aes(x = area)) +
  geom_histogram() # stat_bin() uses 30 bins (or "buckets") by default.

midwest %>%
  ggplot(aes(x = area)) +
  geom_histogram(bins = 10) # only 10 bins.

ggplot(
  data = subset(midwest, state %in% c("OH", "IN")),
  mapping = aes(x = percollege, fill = state)
) +
  geom_histogram(alpha = 0.7, bins = 20) +
  scale_fill_viridis_d()
```

#### Density 

```{r}
midwest %>%
  ggplot(aes(x = area, fill = state, color = state)) +
  geom_density(alpha = 0.3) +
  scale_color_viridis_d() +
  scale_fill_viridis_d()
```

### Advanced aes (size, color)

- There's also a `fill` aesthetic (mostly used in `geom_bar()`). The `color` aesthetic affects the appearance of lines and points; `fill` is for the filled areas of bars, polygons, and in some cases, the interior of a smoother's standard error ribbon.

- The property size/color/fill represents... 

```{r}
ggplot(
  data = gapminder,
  mapping = aes(
    x = gdpPercap, y = lifeExp,
    size = pop
  )
) +
  geom_point()
```

```{r}
ggplot(
  data = gapminder,
  mapping = aes(
    x = gdpPercap, y = lifeExp,
    size = pop,
    color = continent
  )
) +
  geom_point() +
  scale_color_viridis_d()
```

```{r}
# try red instead of "red"
ggplot(
  data = gapminder,
  mapping = aes(
    x = gdpPercap, y = lifeExp,
    size = pop,
    color = "red"
  )
) +
  geom_point()
```
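The plot above does not come out red because `color = "red"` sits inside `aes()`, where it is treated as data rather than as a literal color. The sketch below contrasts mapping with setting; the object names `p_mapped` and `p_set` are made up for this illustration.

```{r}
library(ggplot2)
library(gapminder)

# Mapping: inside aes(), "red" is treated as a one-level variable,
# so ggplot2 picks a default palette color and draws a legend for it
p_mapped <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = "red")) +
  geom_point()

# Setting: outside aes(), "red" is a literal color applied to every point
p_set <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "red")

p_mapped
p_set
```

A rule of thumb: put a variable inside `aes()` when the data should drive the appearance, and put a constant outside `aes()` when every point should look the same.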

Aesthetics can also be set or mapped per geom.

```{r}
p + geom_point() +
  geom_smooth()

p + geom_point(alpha = 0.3) + # alpha controls transparency
  geom_smooth(color = "red", se = FALSE, size = 2)

p + geom_point(alpha = 0.3) + # alpha controls transparency
  geom_smooth(color = "red", se = FALSE, size = 2, method = "lm")
```

```{r}
ggplot(
  data = gapminder,
  mapping = aes(
    x = gdpPercap, y = lifeExp,
    color = continent
  )
) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", color = "red") +
  labs(
    x = "GDP per capita", # this plot has no log scale on the x-axis
    y = "Life Expectancy",
    title = "A Gapminder Plot",
    subtitle = "Data points are country-years",
    caption = "Source: Gapminder"
  )

ggplot(
  data = gapminder,
  mapping = aes(
    x = gdpPercap, y = lifeExp,
    color = continent,
    fill = continent
  )
) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", color = "red") +
  labs(
    x = "GDP per capita", # this plot has no log scale on the x-axis
    y = "Life Expectancy",
    title = "A Gapminder Plot",
    subtitle = "Data points are country-years",
    caption = "Source: Gapminder"
  ) +
  scale_color_viridis_d() +
  scale_fill_viridis_d()
```

### Co-ordinates and scales 

```{r}

p + geom_point() +
  coord_flip() # coord_type
```

The data is heavily bunched up against the left side. 

```{r}

p + geom_point() # without scaling

p + geom_point() +
  scale_x_log10() # scales the axis of a plot to a log 10 basis

p + geom_point() +
  geom_smooth(method = "lm") +
  scale_x_log10()
```


### Labels and guides 

The `scales` package has some useful premade formatting functions. You can either load `scales` or just grab the function you need from the package using `scales::`.

```{r}

p + geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", color = "red") +
  scale_x_log10(labels = scales::dollar) +
  labs(
    x = "log GDP",
    y = "Life Expectancy",
    title = "A Gapminder Plot",
    subtitle = "Data points are country-years",
    caption = "Source: Gapminder"
  )
```
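To see what a formatter such as `scales::dollar` does to the axis labels, it can be called directly on a few numbers. This is a small sketch; `scales::comma` and `scales::percent` are shown as two further examples from the same package, and the object name `dollar_labels` is made up for this illustration.

```{r}
# Each formatter takes a numeric vector and returns a character vector of labels
dollar_labels <- scales::dollar(c(1000, 25000))
dollar_labels # "$1,000" "$25,000"

scales::comma(1234567) # "1,234,567"
scales::percent(0.25) # "25%"
```

Passing the function itself (without parentheses) to the `labels` argument of a scale lets `ggplot2` apply it to the axis breaks.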

### Themes

```{r}
p + geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", color = "red") +
  scale_x_log10(labels = scales::dollar) +
  labs(
    x = "log GDP",
    y = "Life Expectancy",
    title = "A Gapminder Plot",
    subtitle = "Data points are country-years",
    caption = "Source: Gapminder"
  ) +
  theme_economist()
```

### ggsave 

```{r eval = FALSE}
figure_example <- p + geom_point(alpha = 0.3) +
  geom_smooth(method = "gam", color = "red") +
  scale_x_log10(labels = scales::dollar) +
  labs(
    x = "log GDP",
    y = "Life Expectancy",
    title = "A Gapminder Plot",
    subtitle = "Data points are country-years",
    caption = "Source: Gapminder"
  ) +
  theme_economist()

# ggsave() takes the file name first and the plot second, so name both arguments
ggsave(filename = here("outputs", "figure_example.png"), plot = figure_example)
```
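`ggsave()` also lets you control the size and resolution of the saved figure. The sketch below writes a small plot to a temporary file so that it runs without an `outputs` folder; the built-in `mtcars` data and the names `p_demo` and `out_file` are stand-ins chosen for this illustration.

```{r}
library(ggplot2)

# A throwaway plot built from a dataset that ships with R
p_demo <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Write to a temporary location; width and height are in inches by default
out_file <- file.path(tempdir(), "figure_demo.png")

ggsave(
  filename = out_file,
  plot = p_demo,
  width = 6,
  height = 4,
  dpi = 300
)

file.exists(out_file)
```

The file type is inferred from the extension of `filename`, so changing `.png` to `.pdf` is enough to switch formats.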

### Many plots 

Basic ideas:

- Grouping: tell `ggplot2` about the structure of your data 
- Facetting: break up your data into pieces for a plot 

#### Grouping

- Can you guess what's wrong?

```{r}

p <- ggplot(gapminder, aes(x = year, y = gdpPercap))

p + geom_point()

p + geom_line()
```

Without grouping, `geom_line()` tries to join up all the observations for each particular year in the order they appear in the dataset. `ggplot2` does not know that the yearly observations in your data are grouped by country. 

You need the `group` aesthetic when the grouping information you want to tell `ggplot2` about is not already built into the mapped variables (unlike, say, `continent` mapped to `color`).

```{r}
gapminder
```

#### Facetting 

Facetting splits a plot into small multiples, one panel per subset of the data.

- `facet_wrap`: based on a single categorical variable like `facet_wrap(~single_categorical_variable)`. Your panels will be laid out in order and then wrapped into a grid.

- `facet_grid`: when you want to cross-classify some data by two categorical variables like `facet_grid(one_cat_variable ~ two_cat_variable)`. 

```{r}
p <- ggplot(gapminder, aes(x = year, y = gdpPercap))

p + geom_line(aes(group = country)) # one line per country; the outlier is Kuwait.

p + geom_line(aes(group = country)) + facet_wrap(~continent) # facetting

p + geom_line(aes(group = country), color = "gray70") +
  geom_smooth(size = 1.1, method = "loess", se = FALSE) +
  scale_y_log10(labels = scales::dollar) +
  facet_wrap(~continent, ncol = 5) + # for single categorical variable; for multiple categorical variables use facet_grid()
  labs(
    x = "Year",
    y = "GDP per capita",
    title = "GDP per capita on Five continents"
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

```{r}
p + geom_line(aes(group = country), color = "gray70") +
  geom_smooth(size = 1.1, method = "loess", se = FALSE) +
  scale_y_log10(labels = scales::dollar) +
  facet_grid(~continent) + # facet_grid() with a single variable lays the panels out in one row
  labs(
    x = "Year",
    y = "GDP per capita",
    title = "GDP per capita on Five continents"
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
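The examples above facet by a single variable. As a sketch of the two-variable form `facet_grid(rows ~ columns)`, the chunk below uses the `mpg` dataset that ships with `ggplot2`, chosen here only because it has two categorical variables with few levels (`drv` and `cyl`). The object name `p_grid` is made up for this illustration.

```{r}
library(ggplot2)

# Rows are split by drive train (drv), columns by number of cylinders (cyl)
p_grid <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5) +
  facet_grid(drv ~ cyl)

p_grid
```

Combinations that have no observations still get a panel, which makes gaps in the data easy to spot.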


### Transforming

- Transforming: perform some calculations on or summarize your data before producing the plot 

#### Use pipes to summarize data

We also experiment with bar charts here. By default, `geom_bar` [uses](https://www.rdocumentation.org/packages/ggplot2/versions/1.0.1/topics/geom_bar) `stat = "count"`, which makes the height of each bar equal to the number of cases in each group. If you have a y column, then you should use the `stat = "identity"` argument. Alternatively, you can use `geom_col()`.

```{r}

gapminder_formatted <- gapminder %>%
  group_by(continent, year) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  )

ggplot(data = gapminder_formatted, aes(x = year, y = lifeExp_mean, color = continent)) +
  geom_point() +
  labs(
    x = "Year",
    y = "Life expectancy",
    title = "Life expectancy on Five continents"
  )

gapminder %>%
  dplyr::filter(continent == "Europe") %>%
  group_by(country, year) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = year, y = lifeExp_mean, color = country)) +
  geom_point() +
  labs(
    x = "Year",
    y = "Life expectancy",
    title = "Life expectancy in Europe"
  )
```

```{r}
# geom point
gapminder %>%
  dplyr::filter(continent == "Europe") %>%
  group_by(country, year) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = year, y = lifeExp_mean)) +
  geom_point() +
  labs(
    x = "Year",
    y = "Life expectancy",
    title = "Life expectancy in Europe"
  ) +
  facet_wrap(~country)

# geom bar
gapminder %>%
  dplyr::filter(continent == "Europe") %>%
  group_by(country, year) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = year, y = lifeExp_mean)) +
  geom_bar(stat = "identity") +
  labs(
    x = "Year",
    y = "Life expectancy",
    title = "Life expectancy in Europe"
  ) +
  facet_wrap(~country)

# no facet
gapminder %>%
  dplyr::filter(continent == "Europe") %>%
  group_by(country, year) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = year, y = lifeExp_mean, fill = country)) +
  geom_bar(stat = "identity") + # bars are stacked by default, and the plot looks messy either way; geom_col() is equivalent to geom_bar(stat = "identity")
  labs(
    x = "Year",
    y = "Life expectancy",
    title = "Life expectancy in Europe"
  )
```
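The difference between counting rows and plotting precomputed heights can be seen side by side. This is a minimal sketch using the `mpg` dataset that ships with `ggplot2`; the names `bar_counts`, `class_means`, and `bar_means` are made up for this illustration.

```{r}
library(ggplot2)
library(dplyr)

# No y aesthetic: geom_bar() counts the rows in each class
bar_counts <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# Precomputed heights: summarize first, then let geom_col() draw them as-is
class_means <- mpg %>%
  group_by(class) %>%
  summarize(hwy_mean = mean(hwy))

bar_means <- ggplot(class_means, aes(x = class, y = hwy_mean)) +
  geom_col()

bar_counts
bar_means
```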

```{r}

gapminder %>%
  dplyr::filter(continent == "Europe") %>%
  group_by(country, year) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = country, y = lifeExp_mean)) +
  geom_boxplot() +
  labs(
    x = "Country",
    y = "Life expectancy",
    title = "Life expectancy in Europe"
  ) +
  coord_flip()
```

```{r}
# ordered by mean life expectancy (ascending)
gapminder %>%
  dplyr::filter(continent == "Europe") %>%
  group_by(country, year) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = reorder(country, lifeExp_mean), y = lifeExp_mean)) +
  geom_boxplot() +
  labs(
    x = "Country",
    y = "Life expectancy",
    title = "Life expectancy in Europe"
  ) +
  coord_flip()

# reversed order (descending), using the negated values
gapminder %>%
  dplyr::filter(continent == "Europe") %>%
  group_by(country, year) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = reorder(country, -lifeExp_mean), y = lifeExp_mean)) +
  geom_boxplot() +
  labs(
    x = "Country",
    y = "Life expectancy",
    title = "Life expectancy in Europe"
  ) +
  coord_flip()
```
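`reorder()` changes the order of a factor's levels, not the data itself. By default it sorts the levels by the mean of the second argument within each level. The toy vectors `f` and `v` below are made up for this illustration.

```{r}
f <- factor(c("a", "b", "c", "a", "b", "c"))
v <- c(3, 1, 2, 3, 1, 2)

# Level means are a = 3, b = 1, c = 2, so ascending order is b, c, a
levels(reorder(f, v))

# Negating the values reverses the order: a, c, b
levels(reorder(f, -v))
```

Since `ggplot2` draws discrete axes in level order, reordering the factor inside `aes()` is enough to sort the boxes.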

#### Plotting text

```{r}
gapminder %>%
  dplyr::filter(continent == "Asia" | continent == "Americas") %>%
  group_by(continent, country) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = gdp_mean, y = lifeExp_mean)) +
  geom_point() +
  geom_text(aes(label = country)) +
  scale_x_log10() +
  facet_grid(~continent)
```

```{r}
# with label
gapminder %>%
  dplyr::filter(continent == "Asia" | continent == "Americas") %>%
  group_by(continent, country) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = gdp_mean, y = lifeExp_mean)) +
  geom_point() +
  geom_label(aes(label = country)) +
  scale_x_log10() +
  facet_grid(~continent)
```

```{r}
# no overlaps
gapminder %>%
  dplyr::filter(continent == "Asia" | continent == "Americas") %>%
  group_by(continent, country) %>%
  summarize(
    gdp_mean = mean(gdpPercap),
    lifeExp_mean = mean(lifeExp)
  ) %>%
  ggplot(aes(x = gdp_mean, y = lifeExp_mean)) +
  geom_point() +
  geom_text_repel(aes(label = country)) + # there's also geom_label_repel
  scale_x_log10() +
  facet_grid(~continent)
```

### Plotting models 

In plotting models, we make extensive use of David Robinson's [broom package](https://cran.r-project.org/web/packages/broom/vignettes/broom.html). The idea is to transform model outputs (i.e., predictions and estimates) into tidy objects so that we can easily combine, separate, and visualize these elements. 

#### Plotting several fits at the same time

```{r}
model_colors <- RColorBrewer::brewer.pal(3, "Set1") # select three qualitatively different colors from a larger palette.

gapminder %>%
  ggplot(aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS")) +
  geom_smooth(
    method = "lm", formula = y ~ splines::bs(x, df = 3),
    aes(color = "Cubic Spline", fill = "Cubic Spline")
  ) +
  geom_smooth(method = "loess", aes(color = "LOESS", fill = "LOESS")) +
  theme(legend.position = "top") +
  scale_color_manual(name = "Models", values = model_colors) +
  scale_fill_manual(name = "Models", values = model_colors)
```

#### Extracting model outcomes 

```{r}

# regression model
out <- lm(
  formula = lifeExp ~ gdpPercap + pop + continent,
  data = gapminder
)
```

`tidy()` is a generic function in the `broom` package. It "constructs a dataframe that summarizes the model's statistical findings". As the description states, `tidy()` can be used for various models. For instance, `tidy()` can extract the following information from a regression model.

- `term`: a term being estimated 
- `p.value`
- `statistic`: a test statistic used to compute p-value
- `estimate` 
- `conf.low`: the low end of a confidence interval 
- `conf.high`: the high end of a confidence interval
- `df`: degrees of freedom
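
To check which of these columns a given model returns, you can inspect the column names of the tidied output. This is a small sketch on a throwaway model fitted to the built-in `mtcars` data; `toy_fit` and `toy_tidy` are names made up for this illustration, and the exact set of columns depends on the model class.

```{r}
library(broom)

# A throwaway linear model, used only to look at the shape of tidy()'s output
toy_fit <- lm(mpg ~ wt, data = mtcars)

# conf.int = TRUE adds the conf.low and conf.high columns
toy_tidy <- tidy(toy_fit, conf.int = TRUE)

names(toy_tidy)
toy_tidy
```

The result has one row per model term (here the intercept and `wt`), which is what makes it easy to pass on to `ggplot2`.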

**Challenge**

Try `glance(out)`; what did you get from this command? If you're curious, you can try `?glance`.

The following plots show the estimates and the uncertainty around them.

##### Coefficients

```{r}
# estimates
out_comp <- tidy(out)

p <- out_comp %>%
  ggplot(aes(x = term, y = estimate))

p + geom_point() +
  coord_flip() +
  theme_bw()
```

##### Confidence intervals

```{r}
# plus confidence intervals
out_conf <- tidy(out, conf.int = TRUE)

# plotting coefficients using ggplot2 (pointrange)
out_conf %>%
  ggplot(aes(x = reorder(term, estimate), y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange() +
  coord_flip() +
  labs(x = "", y = "OLS Estimate") +
  theme_bw()

# another way to do it (errorbar)
out_conf %>%
  ggplot(aes(x = estimate, y = reorder(term, estimate))) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
  labs(y = "", x = "OLS Estimate") +
  theme_bw()
```

You can also calculate marginal effects using the [`marginaleffects`](https://vincentarelbundock.github.io/marginaleffects/) package.