---
title: "Visualizing the Bayesian Workflow"
author: "Monica Alexander"
date: "February 15 2022"
output:
  html_document:
    toc: yes
    df_print: paged
  pdf_document:
    number_sections: yes
    toc: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
# Introduction
In this lab we will try to replicate some of the visualizations from the lecture notes, involving prior and posterior predictive checks and LOO model comparisons.
The dataset is a 0.1% sample of all births in the US in 2017. I've pulled out a few different variables, but as in the lecture, we'll just focus on birth weight and gestational age.
# The data
Read it in, along with all our packages.
```{r, message=FALSE, warning=FALSE}
library(tidyverse)
library(here)
# for bayes stuff
library(rstan)
library(bayesplot)
library(loo)
library(tidybayes)
ds <- read_rds(here("data","births_2017_sample.RDS"))
head(ds)
```
Brief overview of variables:
- `mager` mum's age
- `mracehisp` mum's race/ethnicity; see here for codes: https://data.nber.org/natality/2017/natl2017.pdf page 15
- `meduc` mum's education; see here for codes: https://data.nber.org/natality/2017/natl2017.pdf page 16
- `bmi` mum's BMI
- `sex` baby's sex
- `combgest` gestational age in weeks
- `dbwt` birth weight in kg
- `ilive` alive at time of report (y/n/unsure)
I'm going to rename some variables, remove any observations with missing gestational age or birth weight, restrict just to babies that were alive, and make a preterm variable.
```{r}
ds <- ds %>%
  rename(birthweight = dbwt, gest = combgest) %>%
  mutate(preterm = ifelse(gest < 32, 1, 0)) %>%
  filter(ilive == "Y", gest < 99, birthweight < 9.999)
```
## Question 1
```{r}
ggplot(ds, aes(x = gest, y = birthweight)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red", fill = "#69b3a2", se = TRUE)
```
The first graph (above) shows that birth weight increases with gestational age.
```{r}
boxplot(ds$birthweight ~ ds$sex, xlab = "sex", ylab = "weight", col = "#69b3a2", boxwex = 0.4)
```
The second graph shows that boys tend to weigh slightly more at birth than girls.
```{r}
ggplot(ds, aes(x = bmi, y = birthweight)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red", fill = "#69b3a2", se = TRUE)
```
```
The third graph above shows that the mother's BMI has very little influence on birth weight.
# The model
As in lecture, we will look at two candidate models
Model 1 has log birth weight as a function of log gestational age
$$
\log(y_i) \sim N(\beta_1 + \beta_2\log(x_i), \sigma^2)
$$
Model 2 has an interaction term between gestation and prematurity
$$
\log(y_i) \sim N(\beta_1 + \beta_2\log(x_i) + \beta_3 z_i + \beta_4\log(x_i) z_i, \sigma^2)
$$
- $y_i$ is weight in kg
- $x_i$ is gestational age in weeks, CENTERED AND STANDARDIZED
- $z_i$ is preterm (0 or 1, if gestational age is less than 32 weeks)
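Since both models use the centered and standardized log gestational age, it may help to spell out that transformation. A minimal sketch (the helper name `standardize` is my own; base R's `scale()` does the same thing by default):

```r
# Center (subtract the mean) and standardize (divide by the sd),
# so the transformed variable has mean 0 and sd 1.
standardize <- function(x) (x - mean(x)) / sd(x)
```

Applied to the data, `standardize(log(ds$gest))` gives the same values as `scale(log(ds$gest))`.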
# Prior predictive checks
Let's put some weakly informative priors on all parameters i.e. for the $\beta$s
$$
\beta \sim N(0, 1)
$$
and for $\sigma$
$$
\sigma \sim N^+(0,1)
$$
where the plus means positive values only, i.e. a half-normal distribution.
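R has no built-in half-normal sampler, but if $X \sim N(0,1)$ then $|X| \sim N^+(0,1)$, so draws can be obtained by taking absolute values of standard normal draws. A tiny helper (the name `rhalfnorm` is my own, for illustration):

```r
# Draw n values from the half-normal N+(0, 1) by taking
# absolute values of standard normal draws.
rhalfnorm <- function(n) abs(rnorm(n))
```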
Let's check what the resulting distribution of birth weights looks like given Model 1 and the priors specified above, assuming we had no data on birth weight (but observations of gestational age).
## Question 2
For Model 1, simulate values of the $\beta$s and $\sigma$ based on the priors above. Use these values to simulate (log) birth weights from the likelihood specified in Model 1, based on the set of observed gestational ages. Plot the resulting distribution of simulated (log) birth weights. Do 1000 simulations. **Remember the gestational ages should be centered and standardized**.
```{r}
set.seed(0)
nsim <- 1000
n <- nrow(ds)  # number of observed gestational ages
beta1 <- rnorm(nsim)
beta2 <- rnorm(nsim)
sigma <- abs(rnorm(nsim))  # half-normal draws
log_gest_std <- as.numeric(scale(log(ds$gest)))  # centered and standardized
w <- matrix(nrow = nsim, ncol = n)
for (i in 1:nsim) {
  w[i, ] <- rnorm(n, beta1[i] + beta2[i] * log_gest_std, sigma[i])
}
plot(density(w, from = -6, to = 6))
```
# Run the model
Now we're going to run Model 1 in Stan. The Stan code is in the `code/models` folder.
First, get our data into the right form for input into Stan.
```{r}
ds$log_weight <- log(ds$birthweight)
ds$log_gest_c <- (log(ds$gest) - mean(log(ds$gest)))/sd(log(ds$gest))
# put into a list
stan_data <- list(N = nrow(ds),
                  log_weight = ds$log_weight,
                  log_gest = ds$log_gest_c)
```
Now fit the model
```{r}
mod1 <- stan(data = stan_data,
             file = "models/simple_weight.stan",
             iter = 500,
             seed = 243)
```
```{r}
summary(mod1)$summary[c("beta[1]", "beta[2]", "sigma"),]
```
## Question 3
Write a stan model to run Model 2, and run it.
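The contents of `models/model2.stan` aren't shown in this document. One plausible sketch, inferred from the data list below (`log_gest`, `z`, `zl`) and the parameters summarized afterwards (`beta[1]`–`beta[4]`, `sigma`), is the following; the actual file used for the reference results may differ (e.g. in the ordering of the preterm and gestational-age coefficients, as noted in Question 4):

```stan
data {
  int<lower=1> N;
  vector[N] log_weight;
  vector[N] log_gest;  // centered and standardized log gestational age
  vector[N] z;         // preterm indicator
  vector[N] zl;        // preterm * log_gest interaction
}
parameters {
  vector[4] beta;
  real<lower=0> sigma;
}
model {
  // weakly informative priors, as specified above
  beta ~ normal(0, 1);
  sigma ~ normal(0, 1);
  log_weight ~ normal(beta[1] + beta[2] * log_gest + beta[3] * z + beta[4] * zl, sigma);
}
generated quantities {
  vector[N] log_weight_rep;  // posterior predictive draws
  vector[N] log_lik;         // pointwise log likelihood, for LOO
  for (i in 1:N) {
    real mu_i = beta[1] + beta[2] * log_gest[i] + beta[3] * z[i] + beta[4] * zl[i];
    log_weight_rep[i] = normal_rng(mu_i, sigma);
    log_lik[i] = normal_lpdf(log_weight[i] | mu_i, sigma);
  }
}
```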
```{r}
# put into a list
stan_data <- list(N = nrow(ds),
                  log_weight = ds$log_weight,
                  log_gest = ds$log_gest_c,
                  z = ds$preterm,
                  zl = ds$preterm * ds$log_gest_c)
```
Now fit the model
```{r}
mod2 <- stan(data = stan_data,
             file = "models/model2.stan",
             iter = 500,
             seed = 243)
```
```{r}
summary(mod2)$summary[c("beta[1]", "beta[2]", "beta[3]", "beta[4]","sigma"),]
```
## Question 4
For reference I have uploaded some model 2 results. Check your results are similar. ($\beta_2$ relates to gestational age, $\beta_3$ relates to preterm, $\beta_4$ is the interaction).
```{r}
load(here("output", "mod2.Rda"))
summary(mod2)$summary[c(paste0("beta[", 1:4, "]"), "sigma"),]
```
Yes, the results are similar (the order of $\beta_2$ and $\beta_3$ is switched).
# PPCs
Now we've run two candidate models let's do some posterior predictive checks. The `bayesplot` package has a lot of inbuilt graphing functions to do this. For example, let's plot the distribution of our data (y) against 100 different datasets drawn from the posterior predictive distribution:
```{r}
set.seed(1856)
y <- ds$log_weight
yrep1 <- extract(mod1)[["log_weight_rep"]]
yrep2 <- extract(mod2)[["log_weight_rep"]] # will need mod2 for later
samp100 <- sample(nrow(yrep1), 100)
ppc_dens_overlay(y, yrep1[samp100, ]) + ggtitle("distribution of observed versus predicted birthweights")
```
## Question 5
Make a similar plot to the one above but for model 2, and **not** using the bayes plot in built function (i.e. do it yourself just with `geom_density`)
```{r}
library(reshape2)
set.seed(22)
y <- ds$log_weight
yrep2 <- extract(mod2)[["log_weight_rep"]]
samp100 <- sample(nrow(yrep2), 100)
df <- melt(yrep2[samp100, ])  # long format: one row per (draw, observation)
ggplot(df) +
  geom_density(aes(x = value, group = iterations, color = "yrep")) +
  geom_density(data = data.frame(y), aes(x = y, color = "y"))
```
## Test statistics
We can also look at some summary statistics in the PPD versus the data, again either using `bayesplot` -- the function of interest is `ppc_stat` or `ppc_stat_grouped` -- or just doing it ourselves using ggplot.
E.g. medians by prematurity for Model 1
```{r}
ppc_stat_grouped(ds$log_weight, yrep1, group = ds$preterm, stat = 'median')
```
## Question 6
Use a test statistic of the proportion of births under 2.5kg. Calculate the test statistic for the data, and the posterior predictive samples for both models, and plot the comparison (one plot per model).
```{r}
ds$under25 <- ifelse(ds$birthweight<2.5,1,0)
```
```{r}
# indicator matrices: 1 if the replicated birth weight (back on the kg
# scale) is under 2.5 kg; multiplying by 1 keeps the matrix dimensions
y1t <- 1 * (exp(yrep1) < 2.5)
y2t <- 1 * (exp(yrep2) < 2.5)
```
```{r}
# proportion under 2.5 kg, by prematurity, for Model 1
ppc_stat_grouped(ds$under25, y1t, group = ds$preterm, stat = 'mean')
```
Plot for Model 1:
```{r}
ppc_stat(ds$under25, y1t, stat = 'mean')
```
Plot for Model 2:
```{r}
ppc_stat(ds$under25, y2t, stat = 'mean')
```
# LOO
Finally let's calculate the LOO elpd for each model and compare. The first step of this is to get the point-wise log likelihood estimates from each model:
```{r}
loglik1 <- extract(mod1)[["log_lik"]]
loglik2 <- extract(mod2)[["log_lik"]]
```
And then we can use these in the `loo` function to get estimates for the elpd. Note the `save_psis = TRUE` argument saves the calculation for each simulated draw, which is needed for the LOO-PIT calculation below.
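As a reminder, the quantity being estimated is the expected log pointwise predictive density under leave-one-out cross-validation,

$$
\widehat{\mbox{elpd}}_{\mathrm{LOO}} = \sum_{i=1}^N \log \hat{p}(y_i \mid y_{-i}),
$$

which `loo` approximates from the pointwise log-likelihood draws using Pareto-smoothed importance sampling (PSIS), rather than refitting the model $N$ times.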
```{r}
loo1 <- loo(loglik1, save_psis = TRUE)
loo2 <- loo(loglik2, save_psis = TRUE)
```
Look at the output:
```{r}
loo1
loo2
```
Comparing the two models tells us Model 2 is better:
```{r}
loo_compare(loo1, loo2)
```
We can also compare the LOO-PIT of each of the models to standard uniforms. They both do pretty well.
```{r}
ppc_loo_pit_overlay(yrep = yrep1, y = y, lw = weights(loo1$psis_object))
ppc_loo_pit_overlay(yrep = yrep2, y = y, lw = weights(loo2$psis_object))
```
## Bonus question
Create your own PIT histogram "from scratch" for Model 2.
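As a starting point, the plain (non-LOO) PIT can be computed directly: for each observation, the proportion of posterior predictive draws at or below the observed value. A minimal sketch (the helper name `pit_values` is my own; note this is not importance-weighted like the LOO-PIT above, so the histogram will differ somewhat):

```r
# Empirical PIT: share of predictive draws <= the observed value,
# computed per observation. For a well-calibrated model these
# values should look roughly uniform on [0, 1].
pit_values <- function(y, yrep) {
  sapply(seq_along(y), function(i) mean(yrep[, i] <= y[i]))
}
```

Applied to Model 2: `hist(pit_values(y, yrep2), breaks = 20)`.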