\providecommand{\E}{\operatorname{E}}
\providecommand{\V}{\operatorname{Var}}
\providecommand{\Cov}{\operatorname{Cov}}
\providecommand{\se}{\operatorname{se}}
\providecommand{\logit}{\operatorname{logit}}
\providecommand{\iid}{\; \stackrel{\text{iid}}{\sim}\;}
\providecommand{\asim}{\; \stackrel{.}{\sim}\;}
\providecommand{\xs}{x_1, x_2, \ldots, x_n}
\providecommand{\Xs}{X_1, X_2, \ldots, X_n}
\providecommand{\bB}{\boldsymbol{B}}
\providecommand{\bb}{\boldsymbol{\beta}}
\providecommand{\bx}{\boldsymbol{x}}
\providecommand{\bX}{\boldsymbol{X}}
\providecommand{\by}{\boldsymbol{y}}
\providecommand{\bY}{\boldsymbol{Y}}
\providecommand{\bz}{\boldsymbol{z}}
\providecommand{\bZ}{\boldsymbol{Z}}
\providecommand{\be}{\boldsymbol{e}}
\providecommand{\bE}{\boldsymbol{E}}
\providecommand{\bs}{\boldsymbol{s}}
\providecommand{\bS}{\boldsymbol{S}}
\providecommand{\bP}{\boldsymbol{P}}
\providecommand{\bI}{\boldsymbol{I}}
\providecommand{\bD}{\boldsymbol{D}}
\providecommand{\bd}{\boldsymbol{d}}
\providecommand{\bW}{\boldsymbol{W}}
\providecommand{\bw}{\boldsymbol{w}}
\providecommand{\bM}{\boldsymbol{M}}
\providecommand{\bPhi}{\boldsymbol{\Phi}}
\providecommand{\bphi}{\boldsymbol{\phi}}
\providecommand{\bN}{\boldsymbol{N}}
\providecommand{\bR}{\boldsymbol{R}}
\providecommand{\bu}{\boldsymbol{u}}
\providecommand{\bU}{\boldsymbol{U}}
\providecommand{\bv}{\boldsymbol{v}}
\providecommand{\bV}{\boldsymbol{V}}
\providecommand{\bO}{\boldsymbol{0}}
\providecommand{\bOmega}{\boldsymbol{\Omega}}
\providecommand{\bLambda}{\boldsymbol{\Lambda}}
\providecommand{\bSig}{\boldsymbol{\Sigma}}
\providecommand{\bSigma}{\boldsymbol{\Sigma}}
\providecommand{\bt}{\boldsymbol{\theta}}
\providecommand{\bT}{\boldsymbol{\Theta}}
\providecommand{\bpi}{\boldsymbol{\pi}}
\providecommand{\argmax}{\text{argmax}}
\providecommand{\KL}{\text{KL}}
\providecommand{\fdr}{{\rm FDR}}
\providecommand{\pfdr}{{\rm pFDR}}
\providecommand{\mfdr}{{\rm mFDR}}
\providecommand{\bh}{\hat}
\providecommand{\dd}{\lambda}
\providecommand{\q}{\operatorname{q}}
```{r, message=FALSE, echo=FALSE, cache=FALSE}
source("./customization/knitr_options.R")
```
# (PART) Bayesian Inference {-}
# Likelihood Function
## Same MLE, Different $L(\theta | \boldsymbol{x})$
```{r, echo=FALSE, fig.width=9, message=FALSE}
library(ggplot2)
# Two likelihood functions with the same maximizing value (theta = 1.2):
# a Normal(theta, 1) likelihood evaluated at y, and an Exponential(rate = theta)
# likelihood evaluated at 1/y.
y <- 1.2
x1 <- seq(-3, 12, 0.01)
x2 <- seq(0.001, 12, 0.001)
y1 <- dnorm(y, mean=x1)
y2 <- dexp(1/y, rate=x2)
distribution <- c(rep("Normal", length(x1)), rep("Exponential", length(x2)))
df <- data.frame(parameter = c(x1, x2), likelihood = c(y1, y2), distribution = distribution)
ggplot(df) +
  geom_line(aes(x=parameter, y=likelihood, color=distribution), size=1.2) +
  scale_color_manual(values = c("red", "blue"))
```
## Weighted Likelihood Estimate
Instead of employing the estimator $\hat{\theta}_{{\rm MLE}} = \operatorname{argmax}_\theta L(\theta ; \boldsymbol{x})$, consider an arbitrary weight function $g(\theta)$. We could take a weighted average of the likelihood function, assuming all of the integrals below exist.
$$
\tilde{\theta} = \frac{\int \theta g(\theta) L(\theta ; \boldsymbol{x}) d\theta}{\int g(\theta) L(\theta ; \boldsymbol{x}) d\theta}
$$
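Below is a minimal numerical sketch (not from the original notes): the simulated Bernoulli data and the flat weight function $g(\theta) = 1$ are assumed purely for illustration, and `integrate()` approximates both integrals.
```{r}
# Sketch: weighted likelihood estimate for assumed Bernoulli data with g(theta) = 1
set.seed(123)
x <- rbinom(20, size = 1, prob = 0.3)                  # simulated data (assumed example)
lik <- function(theta) dbinom(sum(x), size = length(x), prob = theta)  # proportional to L(theta; x)
g <- function(theta) rep(1, length(theta))             # arbitrary weight function
numerator   <- integrate(function(t) t * g(t) * lik(t), 0, 1)$value
denominator <- integrate(function(t) g(t) * lik(t), 0, 1)$value
c(mle = mean(x), weighted = numerator / denominator)
```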
## Conditional Expected Value
If we set
$$
h(\theta | \boldsymbol{x}) = \frac{g(\theta) L(\theta ; \boldsymbol{x})}{\int g(\theta^*) L(\theta^* ; \boldsymbol{x}) d\theta^*}
$$
then $h(\theta | \boldsymbol{x})$ is a probability density function and
$$
\tilde{\theta} = \E_{h(\theta | \boldsymbol{x})}[\theta].
$$
## Standard Error
Consider the model $X_1, X_2, \ldots, X_n \iid F_{\theta}$.
Since $\tilde{\theta} = \E_{h(\theta | \boldsymbol{x})}[\theta]$ is a function of the data $\boldsymbol{x}$, in most circumstances it is possible to approximate its standard error, $\sqrt{\V(\tilde{\theta})}$, and to estimate that standard error from the data.
This allows for frequentist inference on estimates based on a weighted integral of the likelihood function.
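One simple way to approximate this standard error is the nonparametric bootstrap; the sketch below reuses the assumed Bernoulli example from the previous chunk.
```{r}
# Bootstrap sketch of the standard error of the weighted likelihood estimate
# (reuses the simulated Bernoulli data x and flat weight from the chunk above)
weighted_est <- function(x) {
  lik <- function(theta) dbinom(sum(x), size = length(x), prob = theta)
  integrate(function(t) t * lik(t), 0, 1)$value / integrate(lik, 0, 1)$value
}
boot_est <- replicate(1000, weighted_est(sample(x, replace = TRUE)))
sd(boot_est)  # estimated standard error of theta-tilde
```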
# Bayesian Inference
## Frequentist Probability
The inference framework we have covered so far uses a **frequentist** interpretation of probability.
We made statements such as, "If we repeat this study over and over, the long run frequency is such that..."
## Bayesian Probability
Traditional **Bayesian inference** is based on a different interpretation of probability: probability is a measure of subjective belief.
We will call this "subjective Bayesian statistics."
## The Framework
A **prior probability distribution** is introduced for an unknown parameter: a probability distribution on the parameter that captures one's subjective belief about its possible values.
The **posterior probability distribution** of the parameter is then calculated using Bayes theorem once data are observed. Analogs of confidence intervals and hypothesis tests can then be obtained from the posterior distribution.
## An Example
Prior: $P \sim \mbox{Uniform}(0,1)$
Data generating distribution: $X|P=p \sim \mbox{Binomial}(n,p)$
Posterior pdf (via Bayes Theorem):
\begin{align*}
f(p | X=x) & = \frac{\Pr(X=x | P=p) f(p)}{\Pr(X=x)} \\
& = \frac{\Pr(X=x | P=p) f(p)}{\int \Pr(X=x | P=p^*) f(p^*) dp^*}
\end{align*}
## Calculations
In the previous example, it is possible to analytically calculate the posterior distribution. (In the example, it is a Beta distribution with parameters that involve $x$.) However, this is often impossible.
Bayesian inference often involves complicated and intensive calculations to numerically approximate the posterior probability distribution.
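For the tractable example above, a quick numerical sketch (the values of $n$ and $x$ below are assumed only for illustration) confirms that the posterior matches the Beta$(x+1, n-x+1)$ density:
```{r}
# Numerical check that the Uniform(0,1) prior with Binomial(n, p) data yields a
# Beta(x + 1, n - x + 1) posterior (n and x are assumed for illustration)
n <- 10; x <- 3
post_unnorm <- function(p) dbinom(x, size = n, prob = p) * dunif(p)
marginal <- integrate(post_unnorm, 0, 1)$value
p_grid <- seq(0.01, 0.99, by = 0.01)
max(abs(post_unnorm(p_grid) / marginal - dbeta(p_grid, x + 1, n - x + 1)))
```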
## In Practice
Although the Bayesian inference framework has its roots in the subjective view of probability, in modern times this philosophical aspect is often ignored or unimportant.
When subjectivism is ignored, is this really Bayesian statistics, or is it frequentist statistics that places a probability model on the unknown parameter(s) and employs Bayes theorem?
Bayesian inference is often used because it provides a flexible and sometimes superior model for real world problems. But the interpretation and evaluation are often tacitly frequentist.
There are very few pure subjective Bayesians working in the natural sciences or in technology industries.
## Goal
Suppose we model $(X_1, X_2, \ldots, X_n) | \theta \ \iid \ F_{\theta}$ with **prior distribution** $\theta \sim F_{\tau}$, where the prior distribution depends on (possibly unknown or subjective) parameter(s) $\tau$.
The ultimate goal is to determine the **posterior distribution** of $\theta | \boldsymbol{X}$ through Bayes theorem:
$$
f(\theta | \boldsymbol{X}) = \frac{f(\boldsymbol{X} | \theta) f(\theta)}{f(\boldsymbol{X})} = \frac{f(\boldsymbol{X} | \theta) f(\theta)}{\int f(\boldsymbol{X} | \theta^*) f(\theta^*) d\theta^*}.
$$
If there is a true fixed value of $\theta$, then a well-behaved model should be such that $f(\theta | \boldsymbol{X})$ concentrates around this fixed value as $n \rightarrow \infty$.
## Advantages
- Statements on measures of uncertainty and inference are easier to make
- Estimates often have superior numerical stability
- Data from multiple studies or samples are easier to combine (e.g., how would one combine frequentist p-values?)
- High-dimensional inference works especially well in a Bayesian framework
## Computation
Bayesian inference can be particularly computationally intensive. The challenge is usually in calculating the denominator of the right hand side of Bayes theorem, $f(\boldsymbol{X})$:
$$
f(\theta | \boldsymbol{X}) = \frac{f(\boldsymbol{X} | \theta) f(\theta)}{f(\boldsymbol{X})}
$$
Markov chain Monte Carlo methods and variational inference methods are particularly popular for dealing with the numerical challenges of obtaining good estimates of the posterior distribution.
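As a minimal sketch of the MCMC idea (not from the original notes; the Uniform-Binomial example, proposal scale, and burn-in below are assumed for illustration), a random-walk Metropolis sampler only requires the unnormalized posterior $L(\theta ; \boldsymbol{x}) f(\theta)$, so $f(\boldsymbol{X})$ never needs to be computed:
```{r}
# Random-walk Metropolis sketch for the Uniform(0,1) prior / Binomial(n, p) model
# (n, x, proposal sd, and burn-in are assumed for illustration)
set.seed(1)
n <- 10; x <- 3
log_post <- function(p) dbinom(x, n, p, log = TRUE) + dunif(p, log = TRUE)
draws <- numeric(5000)
p_cur <- 0.5
for (i in seq_along(draws)) {
  p_prop <- p_cur + rnorm(1, sd = 0.1)               # symmetric proposal
  if (p_prop > 0 && p_prop < 1 &&
      log(runif(1)) < log_post(p_prop) - log_post(p_cur)) {
    p_cur <- p_prop                                  # accept
  }
  draws[i] <- p_cur
}
c(mcmc_mean = mean(draws[-(1:1000)]), exact_mean = (x + 1) / (n + 2))
```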
# Estimation
## Assumptions
We will assume that $(X_1, X_2, \ldots, X_n) | \theta \iid F_{\theta}$ with prior distribution $\theta \sim F_{\tau}$ unless stated otherwise. Shorthand for the former is $\boldsymbol{X} | \theta \iid F_{\theta}$.
We will write the pdf or pmf of $X$ as $f(x | \theta)$ as opposed to $f(x ; \theta)$ because in the Bayesian framework this actually represents conditional probability.
We will write the pdf or pmf of $\theta$ as $f(\theta)$ or $f(\theta ; \tau)$ or $f(\theta | \tau)$. Always remember that prior distributions require parameter values, even if we don't explicitly write them.
## Posterior Distribution
The posterior distribution of $\theta | \boldsymbol{X}$ is obtained through Bayes theorem:
\begin{align*}
f(\theta | \boldsymbol{x}) & = \frac{f(\boldsymbol{x} | \theta) f(\theta)}{f(\boldsymbol{x})} = \frac{f(\boldsymbol{x} | \theta) f(\theta)}{\int f(\boldsymbol{x} | \theta^*) f(\theta^*) d\theta^*} \\
& \propto L(\theta ; \boldsymbol{x}) f(\theta)
\end{align*}
## Posterior Expectation
A very common point estimate of $\theta$ in Bayesian inference is the posterior expected value:
\begin{align*}
\operatorname{E}[\theta | \boldsymbol{x}] & = \int \theta f(\theta | \boldsymbol{x}) d\theta \\
& = \frac{\int \theta L(\theta ; \boldsymbol{x}) f(\theta) d\theta}{\int L(\theta ; \boldsymbol{x}) f(\theta) d\theta}
\end{align*}
## Posterior Interval
The Bayesian analog of the frequentist confidence interval is the $1-\alpha$ posterior interval, where $C_{\ell}$ and $C_{u}$ are determined so that:
$$
1-\alpha = \Pr(C_\ell \leq \theta \leq C_u | \boldsymbol{x})
$$
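For the running Uniform-Binomial example (with the same assumed $n$ and $x$ as above), the posterior is Beta$(x+1, n-x+1)$, so an equal-tail $95\%$ posterior interval is just a pair of Beta quantiles:
```{r}
# Posterior mean and equal-tail 95% posterior interval for the
# Beta(x + 1, n - x + 1) posterior (n and x assumed as before)
n <- 10; x <- 3
alpha <- 0.05
c(mean  = (x + 1) / (n + 2),
  lower = qbeta(alpha / 2, x + 1, n - x + 1),
  upper = qbeta(1 - alpha / 2, x + 1, n - x + 1))
```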
## Maximum *A Posteriori* Probability
The maximum *a posteriori* probability (MAP) estimate is the value (or values) of $\theta$ that maximizes the posterior pdf or pmf:
\begin{align*}
\hat{\theta}_{\text{MAP}} & = \operatorname{argmax}_\theta f(\theta | \boldsymbol{x}) \\
& = \operatorname{argmax}_\theta L(\theta ; \boldsymbol{x}) f(\theta)
\end{align*}
This is a frequentist-esque use of the Bayesian framework.
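A small sketch of computing the MAP numerically for the same assumed Uniform-Binomial example (the flat prior makes the MAP coincide with the MLE, $x/n$):
```{r}
# MAP estimate by maximizing log L(theta; x) + log f(theta)
# (same assumed Uniform-Binomial example; exact answer is x / n = 0.3)
n <- 10; x <- 3
log_post_unnorm <- function(p) dbinom(x, n, p, log = TRUE) + dunif(p, log = TRUE)
optimize(log_post_unnorm, interval = c(0, 1), maximum = TRUE)$maximum
```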
## Loss Functions
Let $\mathcal{L}\left(\theta, \tilde{\theta}\right)$ be a **loss function** for a given estimator $\tilde{\theta}$. Examples are
$$
\mathcal{L}\left(\theta, \tilde{\theta}\right) = \left(\theta - \tilde{\theta}\right)^2 \mbox{ or }
\mathcal{L}\left(\theta, \tilde{\theta}\right) = \left|\theta - \tilde{\theta}\right|.
$$
Note that, where the expected value is over $f(\boldsymbol{x}; \theta)$:
\begin{align*}
\operatorname{E}\left[\left(\theta - \tilde{\theta}\right)^2\right] & = \left(\operatorname{E}\left[\tilde{\theta}\right] - \theta\right)^2 + \operatorname{Var}\left(\tilde{\theta}\right) \\
& = \mbox{bias}^2 + \mbox{variance}
\end{align*}
## Bayes Risk
The **Bayes risk**, $R\left(\theta, \tilde{\theta}\right)$, is the expected loss with respect to the posterior:
$$
\E\left[ \left. \mathcal{L}\left(\theta, \tilde{\theta}\right) \right| \boldsymbol{x} \right]
= \int \mathcal{L}\left(\theta, \tilde{\theta}\right) f(\theta | \boldsymbol{x}) d\theta
$$
## Bayes Estimators
The **Bayes estimator** minimizes the Bayes risk.
The posterior expectation $\E\left[ \left. \theta \right| \boldsymbol{x} \right]$ minimizes the Bayes risk of $\mathcal{L}\left(\theta, \tilde{\theta}\right) = \left(\theta - \tilde{\theta}\right)^2$.
The median of $f(\theta | \boldsymbol{x})$, calculated by $F^{-1}_{\theta | \boldsymbol{x}}(1/2)$, minimizes the Bayes risk of $\mathcal{L}\left(\theta, \tilde{\theta}\right) = \left|\theta - \tilde{\theta}\right|$.
# Classification
## Assumptions
Let $(X_1, X_2, \ldots, X_n) | \theta \iid F_\theta$ where $\theta \in \Theta$ and $\theta \sim F_{\tau}$. Let $\Theta_0, \Theta_1 \subseteq \Theta$ so that $\Theta_0 \cap \Theta_1 = \varnothing$ and $\Theta_0 \cup \Theta_1 = \Theta$.
Given observed data $\boldsymbol{x}$, we wish to classify whether $\theta \in \Theta_0$ or $\theta \in \Theta_1$.
This is the Bayesian analog of hypothesis testing.
## Prior Probability on *H*
Let $H$ be a rv such that $H=0$ when $\theta \in \Theta_0$ and $H=1$ when $\theta \in \Theta_1$.
From the prior distribution on $\theta$, we can calculate
$$
\Pr(H=0) = \int_{\theta \in \Theta_0} f(\theta) d\theta
$$
and $\Pr(H=1) = 1-\Pr(H=0)$.
## Posterior Probability
Using Bayes theorem, we can also calculate
\begin{align*}
\Pr(H=0 | \boldsymbol{x})
& = \frac{f(\boldsymbol{x} | H=0) \Pr(H=0)}{f(\boldsymbol{x})} \\
& = \frac{\int_{\theta \in \Theta_0} f(\boldsymbol{x} | \theta) f(\theta) d\theta}{\int_{\theta \in \Theta} f(\boldsymbol{x} | \theta) f(\theta) d\theta}
\end{align*}
Note that $\Pr(H=1 | \boldsymbol{x}) = 1-\Pr(H=0 | \boldsymbol{x})$.
## Loss Function
Let $\mathcal{L}\left(\tilde{H}, H\right)$ be such that
\begin{align*}
\mathcal{L}\left(\tilde{H}=1, H=0 \right) & = c_{I}\\
\mathcal{L}\left(\tilde{H}=0, H=1 \right) & = c_{II}
\end{align*}
for some $c_{I}, c_{II} > 0$.
## Bayes Risk
The Bayes risk, $R\left(\tilde{H}, H\right)$, is
\begin{align*}
\operatorname{E}\left[ \mathcal{L}\left(\tilde{H}, H\right) \right]
& = c_{I} \Pr(\tilde{H}=1, H=0) + c_{II} \Pr(\tilde{H}=0, H=1) \\
& = c_{I} \Pr(\tilde{H}=1 | H=0) \Pr(H=0) \\
& \quad\quad + c_{II} \Pr(\tilde{H}=0 | H=1) \Pr(H=1)
\end{align*}
Notice how this balances what frequentists call Type I error and Type II error.
## Bayes Rule
The estimate $\tilde{H}$ that minimizes $R\left(\tilde{H}, H\right)$ is
$$\tilde{H}=1 \mbox{ when } \Pr(H=1 | \boldsymbol{x}) \geq \frac{c_{I}}{c_{I} + c_{II}}$$
and $\tilde{H}=0$ otherwise.
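A short sketch of applying this rule (the costs and the posterior probability below are assumed purely for illustration):
```{r}
# Bayes classification rule sketch with assumed costs and posterior probability
c_I  <- 1      # cost of classifying H = 1 when H = 0
c_II <- 4      # cost of classifying H = 0 when H = 1
pr_H1 <- 0.30  # assumed value of Pr(H = 1 | x)
threshold <- c_I / (c_I + c_II)
c(threshold = threshold, H_tilde = as.integer(pr_H1 >= threshold))
```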
# Priors
## Conjugate Priors
A **conjugate prior** is a prior distribution for a data generating distribution such that the posterior distribution is in the same family as the prior.
Conjugate priors are useful for obtaining straightforward calculations of the posterior.
There is a systematic method for calculating conjugate priors for exponential family distributions.
## Example: Beta-Bernoulli
Suppose $\boldsymbol{X} | p \iid \mbox{Bernoulli}(p)$ and suppose that $p \sim \mbox{Beta}(\alpha, \beta)$.
\begin{align*}
f(p | \boldsymbol{x}) & \propto L(p ; \boldsymbol{x}) f(p) \\
& = p^{\sum x_i} (1-p)^{\sum (1-x_i)} p^{\alpha - 1} (1-p)^{\beta-1} \\
& = p^{\alpha - 1 + \sum x_i} (1-p)^{\beta - 1 + \sum (1-x_i)} \\
& \propto \mbox{Beta}(\alpha + \sum x_i, \beta + \sum (1-x_i))
\end{align*}
Therefore,
$$
\E[p | \boldsymbol{x}] = \frac{\alpha + \sum x_i}{\alpha + \beta + n}.
$$
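A numerical sketch of this conjugacy (the hyperparameters and simulated data below are assumed only for illustration): the closed-form posterior mean agrees with direct numerical integration.
```{r}
# Beta-Bernoulli conjugacy: posterior is Beta(alpha + sum(x), beta + sum(1 - x))
# (hyperparameters and simulated data are assumed for illustration)
set.seed(42)
alpha <- 2; beta <- 2
x <- rbinom(30, size = 1, prob = 0.7)
a_post <- alpha + sum(x); b_post <- beta + sum(1 - x)
post_unnorm <- function(p) dbinom(sum(x), size = length(x), prob = p) * dbeta(p, alpha, beta)
num_mean <- integrate(function(p) p * post_unnorm(p), 0, 1)$value /
            integrate(post_unnorm, 0, 1)$value
c(conjugate_mean = a_post / (a_post + b_post), numerical_mean = num_mean)
```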
## Example: Normal-Normal
Suppose $\boldsymbol{X} | \mu \iid \mbox{Normal}(\mu, \sigma^2)$, where $\sigma^2$ is known, and suppose that $\mu \sim \mbox{Normal}(a, b^2)$.
Then it can be shown that $\mu | \boldsymbol{x} \sim \mbox{Normal}(\E[\mu | \boldsymbol{x}], \V(\mu | \boldsymbol{x}))$ where
$$
\E[\mu | \boldsymbol{x}] = \frac{b^2}{\frac{\sigma^2}{n} + b^2} \overline{x} + \frac{\frac{\sigma^2}{n}}{\frac{\sigma^2}{n} + b^2} a
$$
$$
\V(\mu | \boldsymbol{x}) = \frac{b^2 \frac{\sigma^2}{n}}{\frac{\sigma^2}{n} + b^2}
$$
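A sketch checking these formulas against direct numerical integration (the values of $\sigma$, $a$, $b$, and the simulated data are assumed only for illustration):
```{r}
# Normal-Normal sketch: closed-form posterior mean vs. numerical integration
# (sigma, a, b, and the simulated data are assumed for illustration)
set.seed(7)
sigma <- 2; a <- 0; b <- 1
x <- rnorm(25, mean = 1.5, sd = sigma)
n <- length(x); xbar <- mean(x)
closed_form <- b^2 / (sigma^2 / n + b^2) * xbar + (sigma^2 / n) / (sigma^2 / n + b^2) * a
log_post_unnorm <- function(mu) {
  sapply(mu, function(m) sum(dnorm(x, mean = m, sd = sigma, log = TRUE))) +
    dnorm(mu, mean = a, sd = b, log = TRUE)
}
shift <- log_post_unnorm(xbar)  # stabilize before exponentiating
post_unnorm <- function(mu) exp(log_post_unnorm(mu) - shift)
numerical <- integrate(function(m) m * post_unnorm(m), xbar - 3, xbar + 3)$value /
             integrate(post_unnorm, xbar - 3, xbar + 3)$value
c(closed_form = closed_form, numerical = numerical)
```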
## Example: Dirichlet-Multinomial
\
This is a problem on Homework 3!
## Example: Gamma-Poisson
\
This is a problem on Homework 3!
## Jeffreys Prior
If we do inference based on prior $\theta \sim F_{\tau}$ to obtain $f(\theta | \boldsymbol{x}) \propto L(\theta; \boldsymbol{x}) f(\theta)$, it follows that this inference may *not* be invariant to transformations of $\theta$, such as $\eta = g(\theta)$.
If we utilize a **Jeffreys prior**, which means it is such that
$$f(\theta) \propto \sqrt{I(\theta)}$$
then the resulting inference will be invariant to transformations of $\theta$: one can show that $f(\theta) \propto \sqrt{I(\theta)}$ implies $f(\eta) \propto \sqrt{I(\eta)}$.
## Examples: Jeffreys Priors
\
Normal$(\mu, \sigma^2)$, $\sigma^2$ known: $f(\mu) \propto 1$
Normal$(\mu, \sigma^2)$, $\mu$ known: $f(\sigma) \propto \frac{1}{\sigma}$
Poisson$(\lambda)$: $f(\lambda) \propto \frac{1}{\sqrt{\lambda}}$
Bernoulli$(p)$: $f(p) \propto \frac{1}{\sqrt{p(1-p)}}$
## Improper Prior
An **improper prior** is a prior such that $\int f(\theta) d\theta = \infty$. Nevertheless, sometimes it still may be the case that $f(\theta | \boldsymbol{x}) \propto L(\theta; \boldsymbol{x}) f(\theta)$ yields a probability distribution.
Take for example the case where $\boldsymbol{X} | \mu \iid \mbox{Normal}(\mu, \sigma^2)$, where $\sigma^2$ is known, and suppose that $f(\mu) \propto 1$. Then $\int f(\mu) d\mu = \infty$, but
$$ f(\mu | \boldsymbol{x}) \propto L(\mu; \boldsymbol{x}) f(\mu) \implies \mu | \boldsymbol{x} \sim \mbox{Normal}\left(\overline{x}, \sigma^2/n\right)$$
which is a proper probability distribution.
# Theory
## de Finetti's Theorem
Let $X_1, X_2, \ldots$ be an infinite exchangeable sequence of Bernoulli rv's. There exists a random variable $P \in [0, 1]$ such that:
- $X_1, X_2, \ldots$ are conditionally independent given $P$
- $X_1, X_2, \ldots | P=p \stackrel{{\rm iid}}{\sim} \mbox{Bernoulli}(p)$
This theorem is often used to justify the assumption of exchangeability, which is weaker than iid, with a prior distribution on the parameter(s).
## Admissibility
An estimator $\tilde{\theta}$ is **admissible** with respect to risk function $R(\cdot, \theta)$ if there exists no other estimator $\hat{\theta}$ such that $R(\hat{\theta}, \theta) \leq R(\tilde{\theta}, \theta)$ for all $\theta \in \Theta$, with strict inequality for at least one $\theta$.
There's a theoretical result that says *all* admissible estimators are Bayes estimators (or limits of Bayes estimators).
# Empirical Bayes
## Rationale
Under the scenario that $\boldsymbol{X} | \theta \iid F_{\theta}$ with prior distribution $\theta \sim F_{\tau}$, we have to determine values for $\tau$.
The **empirical Bayes** approach uses the observed data to estimate the prior parameter(s), $\tau$.
This is especially useful for high-dimensional data, where many parameters are simultaneously drawn from a common prior and one or more observations are drawn per parameter realization.
## Approach
The usual approach is to first integrate out the parameter to obtain
$$
f(\boldsymbol{x} ; \tau) = \int f(\boldsymbol{x} | \theta) f(\theta ; \tau) d\theta.
$$
An estimation method (such as MLE) is then applied to estimate $\tau$. Then inference proceeds as usual under the assumption that $\theta \sim f(\theta ; \hat{\tau})$.
## Example: Normal
Suppose that $X_i | \mu_i \sim \mbox{Normal}(\mu_i, 1)$ for $i=1, 2, \ldots, n$ where these rv's are independent. Also suppose that $\mu_i \iid \mbox{Normal}(a, b^2)$.
$$
f(x_i ; a, b) = \int f(x_i | \mu_i) f(\mu_i; a, b) d\mu_i \implies X_i \sim \mbox{Normal}(a, 1+b^2).
$$
$$
\implies \hat{a} = \overline{x}, \ 1+\hat{b}^2 = \frac{\sum_{k=1}^n (x_k - \overline{x})^2}{n}
$$
\begin{align*}
\operatorname{E}[\mu_i | x_i] & = \frac{1}{1+b^2}a + \frac{b^2}{1+b^2}x_i \implies \\
& \\
\hat{\operatorname{E}}[\mu_i | x_i] & = \frac{1}{1+\hat{b}^2}\hat{a} + \frac{\hat{b}^2}{1+\hat{b}^2}x_i \\
& = \frac{n}{\sum_{k=1}^n (x_k - \overline{x})^2} \overline{x} + \left(1-\frac{n}{\sum_{k=1}^n (x_k - \overline{x})^2}\right) x_i
\end{align*}
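A sketch of this empirical Bayes estimator on simulated data (the true values of $a$ and $b$ below are assumed only to generate the example); the shrunken estimates have smaller mean squared error than the raw $x_i$:
```{r}
# Empirical Bayes shrinkage sketch: X_i | mu_i ~ Normal(mu_i, 1), mu_i ~ Normal(a, b^2)
# (true a, b, and n are assumed only to simulate the example)
set.seed(101)
n <- 500
mu <- rnorm(n, mean = 2, sd = 1.5)    # true, unobserved means
x  <- rnorm(n, mean = mu, sd = 1)     # one observation per mu_i
a_hat  <- mean(x)
s2     <- sum((x - mean(x))^2) / n    # estimates 1 + b^2
shrink <- 1 / s2                      # estimated weight 1 / (1 + b_hat^2)
mu_hat <- shrink * a_hat + (1 - shrink) * x
c(mse_raw = mean((x - mu)^2), mse_eb = mean((mu_hat - mu)^2))
```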