01-probability.Rmd

# Probability

## Load packages, load data, set theme

Let's load the packages that we need for this chapter. 

```{r, message=FALSE}
library("knitr")        # for rendering the RMarkdown file
library("kableExtra")   # for nicely formatted tables
library("arrangements") # fast generators and iterators for creating combinations
library("DiagrammeR")   # for drawing diagrams
library("tidyverse")    # for data wrangling 
```

Set the plotting theme.

```{r}
theme_set(theme_classic() + 
            theme(text = element_text(size = 20)))

opts_chunk$set(comment = "",
               fig.show = "hold")
```

## Learning goals

I like to think of statistics as "applied epistemology", the art of making precise statements about what we can and can't "know" from limited and noisy observations.
When we're trying to do science under these conditions, we're usually expressing our (lack of) knowledge as *uncertainty*.
We're often not sure whether there's really an effect in our experiments, or how big it might be.
We can measure a number, but we know that if we did the experiment again, we would get a slightly different number.
*Probability theory* gives us a language for talking about these concepts in a less hand-wavy way; it helps us *quantify* uncertainty.
Probabilities are the conceptual foundation of everything we're going to do in this course, and we're going to start out by learning to look at the phenomena from PSYCH 610 from a different perspective.

## Counting

Imagine that there are three balls in an urn. The balls are labeled 1, 2, and 3. Let's consider a few possible situations.

```{r}
balls = 1:3 # number of balls in urn 
ndraws = 2 # number of draws

# order matters, without replacement
permutations(balls, ndraws)

# order matters, with replacement
permutations(balls, ndraws, replace = T)

# order doesn't matter, with replacement 
combinations(balls, ndraws, replace = T)

# order doesn't matter, without replacement 
combinations(balls, ndraws)
```

I've generated the figures below using the `DiagrammeR` package. It's a powerful package for drawing diagrams in R. See information on how to use the DiagrammeR package [here](https://rich-iannone.github.io/DiagrammeR/). 

```{r, echo=FALSE, fig.cap="Drawing two marbles out of an urn __with__ replacement."}
grViz("
digraph dot{
  
  # general settings for all nodes
  node [
    shape = circle,
    style = filled,
    color = black,
    label = ''
    fontname = 'Helvetica',
    fontsize = 24,
    fillcolor = lightblue
    ]
  
  # edges between nodes
  edge [color = black]
  0 -> {1 2 3}
  1 -> {11 12 13}
  2 -> {21 22 23}
  3 -> {31 32 33}
  
  # labels for each node
  0 [fillcolor = 'black', width = 0.1]
  1 [label = '1']
  2 [label = '2']
  3 [label = '3']
  11 [label = '1']
  12 [label = '2']
  13 [label = '3']
  21 [label = '1']
  22 [label = '2']
  23 [label = '3']
  31 [label = '1']
  32 [label = '2']
  33 [label = '3']
    
  # direction in which arrows are drawn (from left to right)
  rankdir = LR
}
")

```

```{r, echo=FALSE, fig.cap="Drawing two marbles out of an urn __without__ replacement."}
grViz("
digraph dot{
  
  # general settings for all nodes
  node [
    shape = circle,
    style = filled,
    color = black,
    label = ''
    fontname = 'Helvetica',
    fontsize = 24,
    fillcolor = lightblue
    ]
  
  # edges between nodes
  edge [color = black]
  0 -> {1 2 3}
  1 -> {12 13}
  2 -> {21 23}
  3 -> {31 32}
  
  # labels for each node
  0 [fillcolor = 'black', width = 0.1]
  1 [label = '1']
  2 [label = '2']
  3 [label = '3']
  12 [label = '2']
  13 [label = '3']
  21 [label = '1']
  23 [label = '3']
  31 [label = '1']
  32 [label = '2']
  
  # direction in which arrows are drawn (from left to right)
  rankdir = LR
}
")
```

## Sampling

We can also draw *samples* from the urn:

```{r}
numbers = 1:3

numbers %>% 
  sample(size = 10,
         replace = T)
```

Use the `prob = ` argument to change the probability with which each number should be drawn. 

```{r}
numbers = 1:3

numbers %>% 
  sample(size = 10,
         replace = T,
         prob = c(0.8, 0.1, 0.1))
```

Make sure to set the seed in order to make your code reproducible. The code chunk below may give a different outcome each time is run. 


```{r no-seed}
numbers = 1:5

numbers %>% 
  sample(5)
```

The chunk below will produce the same outcome every time it's run. 

```{r with-seed}
set.seed(1)

numbers = 1:5

numbers %>% 
  sample(5)
```

### Drawing rows from a data frame

We can do this with data frames too, imagining our data frame is the urn and the rows are balls. 

```{r}
set.seed(1)
n = 10
df.data = tibble(trial = 1:n,
                 stimulus = sample(c("flower", "pet"), size = n, replace = T),
                 rating = sample(1:10, size = n, replace = T))
```

Sample a given number of rows. 

```{r}
set.seed(1)
df.data %>% 
  slice_sample(n = 6, 
               replace = T)
```

```{r}
set.seed(1)
df.data %>% 
  slice_sample(prop = 0.5)
```

## More complex counting / matching

Imagine a secretary types four letters to four people and addresses the four envelopes. If they insert the letters at random, each in a different envelope, what is the probability that exactly three letters will go into the right envelope?

```{r}
df.letters = permutations(x = 1:4, k = 4) %>% 
  as_tibble(.name_repair = ~ str_c("person_", 1:4)) %>%
  mutate(n_correct = (person_1 == 1) + 
           (person_2 == 2) + 
           (person_3 == 3) +
           (person_4 == 4))

df.letters %>% 
  summarize(prob_3_correct = sum(n_correct == 3) / n())
```

```{r}
ggplot(data = df.letters,
       mapping = aes(x = n_correct)) + 
  geom_bar(aes(y = stat(count)/sum(count)),
           color = "black",
           fill = "lightblue") +
  scale_y_continuous(labels = scales::percent,
                     expand = c(0, 0)) + 
  labs(x = "number correct",
       y = "probability")
```

## Conditional probability

```{r}
who = c("ms_scarlet", "col_mustard", "mrs_white",
        "mr_green", "mrs_peacock", "prof_plum")
what = c("candlestick", "knife", "lead_pipe",
         "revolver", "rope", "wrench")
where = c("study", "kitchen", "conservatory",
          "lounge", "billiard_room", "hall",
          "dining_room", "ballroom", "library")

df.clue = expand_grid(who = who,
                      what = what,
                      where = where)

df.suspects = df.clue %>% 
  distinct(who) %>% 
  mutate(gender = ifelse(test = who %in% c("ms_scarlet", "mrs_white", "mrs_peacock"), 
                         yes = "female", 
                         no = "male"))
```

```{r}
df.suspects %>% 
  arrange(desc(gender)) %>% 
  kable() %>% 
  kable_styling("striped", full_width = F)
```

```{r}
# conditional probability (via rules of probability)
df.suspects %>% 
  summarize(p_prof_plum_given_male = 
              sum(gender == "male" & who == "prof_plum") /
              sum(gender == "male"))
```

```{r}
# conditional probability (via rejection)
df.suspects %>% 
  filter(gender == "male") %>% 
  summarize(p_prof_plum_given_male = 
              sum(who == "prof_plum") /
              n())
```

## Law of total probability

```{r, echo=FALSE}
grViz("
digraph dot{
  
  # general settings for all nodes
  node [
    shape = circle,
    style = filled,
    color = black,
    label = ''
    fontname = 'Helvetica',
    fontsize = 9,
    fillcolor = lightblue,
    fixedsize=true,
    width = 0.8
    ]
  
  # edges between nodes
  edge [color = black,
        fontname = 'Helvetica',
        fontsize = 10]
  1 -> 2 [label = 'p(female)']
  1 -> 3 [label = 'p(male)']
  2 -> 4 [label = 'p(revolver | female)'] 
  3 -> 4 [label = 'p(revolver | male)']
  
  
  # labels for each node
  1 [label = 'Gender?']
  2 [label = 'If female\nuse revolver?']
  3 [label = 'If male\nuse revolver?']
  4 [label = 'Revolver\nused?']
  
  rankdir='LR'
  }"
)
```

```{r}
# Make a deck of cards 
df.cards = tibble(suit = rep(c("Clubs", "Spades", "Hearts", "Diamonds"), each = 8),
                  value = rep(c("7", "8", "9", "10", "Jack", "Queen", "King", "Ace"), 4)) 
```

```{r}
# conditional probability: p(Hearts | Queen) (via rules of probability)
df.cards %>% 
  summarize(p_hearts_given_queen = 
              sum(suit == "Hearts" & value == "Queen") / 
              sum(value == "Queen"))
```

```{r}
# conditional probability: p(Hearts | Queen) (via rejection)
df.cards %>% 
  filter(value == "Queen") %>%
  summarize(p_hearts_given_queen = sum(suit == "Hearts")/n())
```

## Additional resources

### Cheatsheets

- [Probability cheatsheet](figures/probability.pdf)

### Books and chapters

- [Probability and Statistics with examples using R](http://www.isibang.ac.in/~athreya/psweur/)
- [Learning statistics with R: Chapter 9 Introduction to probability](https://learningstatisticswithr-bookdown.netlify.com/probability.html#probstats)

### Misc

- [Statistics 110: Probability; course at Harvard](https://projects.iq.harvard.edu/stat110)

## Session info

Information about this R session including which version of R was used, and what packages were loaded. 

```{r, echo=F}
sessionInfo()
```