---
title: "Visualizing the Bayesian Workflow"
author: "Monica Alexander"
date: "February 15 2022"
output:
  html_document:
    toc: yes
    df_print: paged
  pdf_document:
    number_sections: yes
    toc: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
# Introduction
In this lab we will try to replicate some of the visualizations from the lecture notes, involving prior and posterior predictive checks and LOO model comparisons.
The dataset is a 0.1% sample of all births in the US in 2017. I've pulled out a few different variables, but as in the lecture, we'll just focus on birth weight and gestational age.
# The data
Read it in, along with all our packages.
```{r, message=FALSE, warning=FALSE}
library(tidyverse)
library(here)
# for bayes stuff
library(rstan)
library(bayesplot)
library(loo)
library(tidybayes)
ds <- read_rds(here("data","births_2017_sample.RDS"))
head(ds)
```
Brief overview of variables:
- `mager` mum's age
- `mracehisp` mum's race/ethnicity; see here for codes: https://data.nber.org/natality/2017/natl2017.pdf page 15
- `meduc` mum's education; see here for codes: https://data.nber.org/natality/2017/natl2017.pdf page 16
- `bmi` mum's BMI
- `sex` baby's sex
- `combgest` gestational age in weeks
- `dbwt` birth weight in kg
- `ilive` alive at time of report (y/n/unsure)
I'm going to rename some variables, remove any observations with missing gestational age or birth weight, restrict just to babies that were alive, and make a preterm variable.
```{r}
ds <- ds %>%
  rename(birthweight = dbwt, gest = combgest) %>%
  mutate(preterm = ifelse(gest < 32, 1, 0)) %>%
  filter(ilive == "Y", gest < 99, birthweight < 9.999)
```
## Question 1
```{r}
ggplot(ds, aes(x = gest, y = birthweight)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red", fill = "#69b3a2", se = TRUE)
```
The first graph (above) shows that birth weight increases with gestational age.
```{r}
boxplot(ds$birthweight ~ ds$sex, xlab = "sex", ylab = "weight", col = "#69b3a2", boxwex = 0.4)
```
The second graph shows that boys tend to weigh slightly more at birth than girls.
```{r}
ggplot(ds, aes(x = bmi, y = birthweight)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red", fill = "#69b3a2", se = TRUE)
```
```
The third graph above shows that the mother's BMI has very little influence on birth weight.
# The model
As in lecture, we will look at two candidate models
Model 1 has log birth weight as a function of log gestational age
$$
\log(y_i) \sim N(\beta_1 + \beta_2\log(x_i), \sigma^2)
$$
Model 2 has an interaction term between gestation and prematurity
$$
\log(y_i) \sim N(\beta_1 + \beta_2\log(x_i) + \beta_3 z_i + \beta_4\log(x_i) z_i, \sigma^2)
$$
- $y_i$ is weight in kg
- $x_i$ is gestational age in weeks, CENTERED AND STANDARDIZED
- $z_i$ is preterm (0 or 1, if gestational age is less than 32 weeks)
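Since both models use the centered and standardized log gestational age, it may help to spell out that transformation. A minimal sketch (the helper name `standardize` is my own; base R's `scale()` does the same thing by default):

```r
# Center (subtract the mean) and standardize (divide by the sd),
# so the transformed variable has mean 0 and sd 1.
standardize <- function(x) (x - mean(x)) / sd(x)
```

Applied to the data, `standardize(log(ds$gest))` gives the same values as `scale(log(ds$gest))`.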
# Prior predictive checks
Let's put some weakly informative priors on all parameters i.e. for the $\beta$s
$$
\beta \sim N(0, 1)
$$
and for $\sigma$
$$
\sigma \sim N^+(0,1)
$$
where the plus means positive values only, i.e. a half-normal distribution.
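R has no built-in half-normal sampler, but if $X \sim N(0,1)$ then $|X| \sim N^+(0,1)$, so draws can be obtained by taking absolute values of standard normal draws. A tiny helper (the name `rhalfnorm` is my own, for illustration):

```r
# Draw n values from the half-normal N+(0, 1) by taking
# absolute values of standard normal draws.
rhalfnorm <- function(n) abs(rnorm(n))
```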
Let's check what the resulting distribution of birth weights looks like given Model 1 and the priors specified above, assuming we had no data on birth weight (but observations of gestational age).
## Question 2
For Model 1, simulate values of the $\beta$s and $\sigma$ based on the priors above. Use these values to simulate (log) birth weights from the likelihood specified in Model 1, based on the set of observed gestational ages. Plot the resulting distribution of simulated (log) birth weights. Do 1000 simulations. **Remember the gestational ages should be centered and standardized**.
```{r}
set.seed(0)
nsim <- 1000
n <- nrow(ds)  # number of observed gestational ages
beta1 <- rnorm(nsim)
beta2 <- rnorm(nsim)
sigma <- abs(rnorm(nsim))  # half-normal draws
log_gest_std <- as.numeric(scale(log(ds$gest)))  # centered and standardized
w <- matrix(nrow = nsim, ncol = n)
for (i in 1:nsim) {
  w[i, ] <- rnorm(n, beta1[i] + beta2[i] * log_gest_std, sigma[i])
}
plot(density(w, from = -6, to = 6))
```
# Run the model
Now we're going to run Model 1 in Stan. The Stan code is in the `code/models` folder.
First, get our data into the right form for input into Stan.
```{r}
ds$log_weight <- log(ds$birthweight)
ds$log_gest_c <- (log(ds$gest) - mean(log(ds$gest)))/sd(log(ds$gest))
# put into a list
stan_data <- list(N = nrow(ds),
                  log_weight = ds$log_weight,
                  log_gest = ds$log_gest_c)
```
Now fit the model
```{r}
mod1 <- stan(data = stan_data,
             file = "models/simple_weight.stan",
             iter = 500,
             seed = 243)
```
```{r}
summary(mod1)$summary[c("beta[1]", "beta[2]", "sigma"),]
```
## Question 3
Write a stan model to run Model 2, and run it.
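The contents of `models/model2.stan` aren't shown in this document. One plausible sketch, inferred from the data list below (`log_gest`, `z`, `zl`) and the parameters summarized afterwards (`beta[1]`–`beta[4]`, `sigma`), is the following; the actual file used for the reference results may differ (e.g. in the ordering of the preterm and gestational-age coefficients, as noted in Question 4):

```stan
data {
  int<lower=1> N;
  vector[N] log_weight;
  vector[N] log_gest;  // centered and standardized log gestational age
  vector[N] z;         // preterm indicator
  vector[N] zl;        // preterm * log_gest interaction
}
parameters {
  vector[4] beta;
  real<lower=0> sigma;
}
model {
  // weakly informative priors, as specified above
  beta ~ normal(0, 1);
  sigma ~ normal(0, 1);
  log_weight ~ normal(beta[1] + beta[2] * log_gest + beta[3] * z + beta[4] * zl, sigma);
}
generated quantities {
  vector[N] log_weight_rep;  // posterior predictive draws
  vector[N] log_lik;         // pointwise log likelihood, for LOO
  for (i in 1:N) {
    real mu_i = beta[1] + beta[2] * log_gest[i] + beta[3] * z[i] + beta[4] * zl[i];
    log_weight_rep[i] = normal_rng(mu_i, sigma);
    log_lik[i] = normal_lpdf(log_weight[i] | mu_i, sigma);
  }
}
```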
```{r}
# put into a list
stan_data <- list(N = nrow(ds),
                  log_weight = ds$log_weight,
                  log_gest = ds$log_gest_c,
                  z = ds$preterm,
                  zl = ds$preterm * ds$log_gest_c)
```
Now fit the model
```{r}
mod2 <- stan(data = stan_data,
             file = "models/model2.stan",
             iter = 500,
             seed = 243)
```
```{r}
summary(mod2)$summary[c("beta[1]", "beta[2]", "beta[3]", "beta[4]","sigma"),]
```
## Question 4
For reference I have uploaded some model 2 results. Check your results are similar. ($\beta_2$ relates to gestational age, $\beta_3$ relates to preterm, $\beta_4$ is the interaction).
```{r}
load(here("output", "mod2.Rda"))
summary(mod2)$summary[c(paste0("beta[", 1:4, "]"), "sigma"),]
```
Yes, the results are similar (the order of $\beta_2$ and $\beta_3$ is switched).
# PPCs
Now we've run two candidate models let's do some posterior predictive checks. The `bayesplot` package has a lot of inbuilt graphing functions to do this. For example, let's plot the distribution of our data (y) against 100 different datasets drawn from the posterior predictive distribution:
```{r}
set.seed(1856)
y <- ds$log_weight
yrep1 <- extract(mod1)[["log_weight_rep"]]
yrep2 <- extract(mod2)[["log_weight_rep"]] # will need mod2 for later
samp100 <- sample(nrow(yrep1), 100)
ppc_dens_overlay(y, yrep1[samp100, ]) + ggtitle("distribution of observed versus predicted birthweights")
```
## Question 5
Make a similar plot to the one above but for model 2, and **not** using the bayes plot in built function (i.e. do it yourself just with `geom_density`)
```{r}
library(reshape2)
set.seed(22)
y <- ds$log_weight
yrep2 <- extract(mod2)[["log_weight_rep"]]
samp100 <- sample(nrow(yrep2), 100)
df <- melt(yrep2[samp100, ])  # long format: one row per (draw, observation)
ggplot(df) +
  geom_density(aes(x = value, group = iterations, color = "yrep")) +
  geom_density(data = data.frame(y), aes(x = y, color = "y"))
```
## Test statistics
We can also look at some summary statistics in the PPD versus the data, again either using `bayesplot` -- the function of interest is `ppc_stat` or `ppc_stat_grouped` -- or just doing it ourselves using ggplot.
E.g. medians by prematurity for Model 1
```{r}
ppc_stat_grouped(ds$log_weight, yrep1, group = ds$preterm, stat = 'median')
```
## Question 6
Use a test statistic of the proportion of births under 2.5kg. Calculate the test statistic for the data, and the posterior predictive samples for both models, and plot the comparison (one plot per model).
```{r}
ds$under25 <- ifelse(ds$birthweight<2.5,1,0)
```
```{r}
# indicator matrices: 1 if the replicated birth weight (back on the kg
# scale) is under 2.5 kg; multiplying by 1 keeps the matrix dimensions
y1t <- 1 * (exp(yrep1) < 2.5)
y2t <- 1 * (exp(yrep2) < 2.5)
```
```{r}
# proportion under 2.5 kg, by prematurity, for Model 1
ppc_stat_grouped(ds$under25, y1t, group = ds$preterm, stat = 'mean')
```
Plot for Model 1:
```{r}
ppc_stat(ds$under25, y1t, stat = 'mean')
```
Plot for Model 2:
```{r}
ppc_stat(ds$under25, y2t, stat = 'mean')
```
# LOO
Finally let's calculate the LOO elpd for each model and compare. The first step of this is to get the point-wise log likelihood estimates from each model:
```{r}
loglik1 <- extract(mod1)[["log_lik"]]
loglik2 <- extract(mod2)[["log_lik"]]
```
And then we can use these in the `loo` function to get estimates for the elpd. Note the `save_psis = TRUE` argument saves the calculation for each simulated draw, which is needed for the LOO-PIT calculation below.
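As a reminder, the quantity being estimated is the expected log pointwise predictive density under leave-one-out cross-validation,

$$
\widehat{\mbox{elpd}}_{\mathrm{LOO}} = \sum_{i=1}^N \log \hat{p}(y_i \mid y_{-i}),
$$

which `loo` approximates from the pointwise log-likelihood draws using Pareto-smoothed importance sampling (PSIS), rather than refitting the model $N$ times.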
```{r}
loo1 <- loo(loglik1, save_psis = TRUE)
loo2 <- loo(loglik2, save_psis = TRUE)
```
Look at the output:
```{r}
loo1
loo2
```
Comparing the two models tells us Model 2 is better:
```{r}
loo_compare(loo1, loo2)
```
We can also compare the LOO-PIT of each of the models to standard uniforms. They both do pretty well.
```{r}
ppc_loo_pit_overlay(yrep = yrep1, y = y, lw = weights(loo1$psis_object))
ppc_loo_pit_overlay(yrep = yrep2, y = y, lw = weights(loo2$psis_object))
```
## Bonus question
Create your own PIT histogram "from scratch" for Model 2.
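As a starting point, the plain (non-LOO) PIT can be computed directly: for each observation, the proportion of posterior predictive draws at or below the observed value. A minimal sketch (the helper name `pit_values` is my own; note this is not importance-weighted like the LOO-PIT above, so the histogram will differ somewhat):

```r
# Empirical PIT: share of predictive draws <= the observed value,
# computed per observation. For a well-calibrated model these
# values should look roughly uniform on [0, 1].
pit_values <- function(y, yrep) {
  sapply(seq_along(y), function(i) mean(yrep[, i] <= y[i]))
}
```

Applied to Model 2: `hist(pit_values(y, yrep2), breaks = 20)`.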