\newcommand{\bx}{\boldsymbol{x}}
\newcommand{\btheta}{\boldsymbol{\theta}}
\newcommand{\bbeta}{\boldsymbol{\beta}}
\newcommand{\bvarepsilon}{\boldsymbol{\varepsilon}}
\newcommand{\by}{\boldsymbol{y}}
\newcommand{\rT}{\mathrm{T}}
\newcommand{\Var}{\operatorname{Var}}
\newcommand{\bY}{\boldsymbol{y}}
\newcommand{\btau}{\boldsymbol{\tau}}
# Motivation, introduction and revision {#intro}
::: {.definition #exp}
An **experiment** is the process through which data are collected to answer a scientific question (physical science, social science, actuarial science $\dots$) by **deliberately** varying some features of the process under study in order to understand the impact of these changes on measurable responses.
In this course we consider only *intervention* experiments, in which some aspects of the process are under the experimenters' control. We do not consider *surveys* or *observational* studies.
:::
::: {.definition #design}
**Design of experiments** is the topic in Statistics concerned with the selection of settings of controllable variables or factors in an experiment and their allocation to experimental units in order to maximise the effectiveness of the experiment at achieving its aim.
:::
People have been designing experiments for as long as they have been exploring the natural world. Collecting empirical evidence is key for scientific development, as described in terms of clinical trials by [xkcd](https://xkcd.com/2530/):
```{r xkcd-clinical-trials, echo = F, out.width = '75%', fig.align = 'center'}
if (knitr::is_html_output())
{
knitr::include_graphics("https://imgs.xkcd.com/comics/clinical_trials.png")
} else {
knitr::include_url("https://imgs.xkcd.com/comics/clinical_trials.png")
}
```
Some notable milestones in the history of the design of experiments include:
- prior to the 20th century:
- [Francis Bacon](https://en.wikipedia.org/wiki/Baconian_method) (17th century; pioneer of the experimental methods)
- [James Lind](https://en.wikipedia.org/wiki/James_Lind) (18th century; experiments to eliminate scurvy)
- [Charles Peirce](https://en.wikipedia.org/wiki/Charles_Sanders_Peirce#Probability_and_statistics) (19th century; advocated randomised experiments and randomisation-based inference)
- 1920s: agriculture (particularly at the [Rothamsted Agricultural Research Station](https://www.rothamsted.ac.uk/history-and-heritage))
- 1940s: clinical trials ([Austin Bradford Hill](https://en.wikipedia.org/wiki/Austin_Bradford_Hill))
- 1950s: (manufacturing) industry ([W. Edwards Deming](https://en.wikipedia.org/wiki/W._Edwards_Deming); [Genichi Taguchi](https://en.wikipedia.org/wiki/Genichi_Taguchi))
- 1960s: psychology and economics ([Vernon Smith](https://en.wikipedia.org/wiki/Vernon_L._Smith))
- 1980s: in-silico ([computer experiments](https://en.wikipedia.org/wiki/Computer_experiment))
- 2000s: online ([A/B testing](https://en.wikipedia.org/wiki/A/B_testing))
See @LB2020 for further history, anecdotes and examples, especially from psychology and technology.
Figure \@ref(fig:broadbalk) shows the [Broadbalk](http://www.era.rothamsted.ac.uk/Broadbalk) agricultural field experiment at Rothamsted, one of the longest continuously running experiments in the world, which is testing the impact of different manures and fertilisers on the growth of winter wheat.
```{r broadbalk, echo = F, out.width = '75%', fig.align = 'center', fig.cap = 'The Broadbalk experiment, Rothamsted (photograph taken 2016)'}
knitr::include_graphics("figures/broadbalk.jpg")
```
<!--
Before we can design an experiment, we need to know:
- what is being measured;
- what features (variables or factors) can be varied and controlled, and what values can they be set to;
- what is the aim of the experiment.
-->
## Motivation
::: {.example #motivation}
Consider an experiment to compare two treatments (e.g. drugs, diets, fertilisers, $\dots$). We have $n$ subjects (people, mice, plots of land, $\dots$), each of which can be assigned one of the two treatments. A response (protein measurement, weight, yield, $\dots$) is then measured.
:::
**Question:** How many subjects should be assigned to each treatment to gain the most precise^[Smallest variance.] inference about the difference in response from the two treatments?
Consider a linear statistical model^[In this course, we will almost always start with a statistical model which we wish to use to answer our scientific question.] for the response (see MATH2010 or MATH6174/STAT6123):
\begin{equation}
Y_j=\beta_{0}+\beta_{1}x_j+\varepsilon_j\,,\qquad j=1, \ldots, n\,,
(\#eq:slr)
\end{equation}
where $\varepsilon_j\sim N(0,\sigma^{2})$ are independent and identically distributed errors and $\beta_{0}, \beta_{1}$ are unknown constants (parameters).
Let^[Other choices of *coding* can be used: e.g. -1,1; it makes no difference for our current purpose.]
\begin{equation}
x_{j}=\left\{\begin{array}{cl}
0&\textrm{if treatment 1 is applied to the $j$th subject}\\
1&\textrm{if treatment 2 is applied to the $j$th subject}\nonumber ,
\end{array}
\right.
\end{equation}
for $j=1,\dots,n$.
The difference in expected response from treatments 1 and 2 is
\begin{equation}
\begin{split}
\textrm{E}[Y_j\, |\, x_j = 1] - \textrm{E}[Y_j\, |\, x_j = 0] & = \beta_{0}+\beta_{1}-\beta_{0} \\
& = \beta_{1}\,.
\end{split}
(\#eq:ex-ex-response)
\end{equation}
Therefore, we require the most precise estimator of $\beta_{1}$ possible. That is, we wish to make the variance of our estimator of $\beta_1$ as small as possible.
Parameters $\beta_{0}$ and $\beta_{1}$ can be estimated using least squares (see MATH2010 or MATH6174/STAT6123). For $Y_1,\dots,Y_n$, we can write the model down in matrix form:
\begin{equation*}
\left[ \begin{array}{c}
Y_1\\
\vdots\\
Y_n\end{array}\right]
=\left[ \begin{array}{cc}
1&x_{1}\\
\vdots&\vdots\\
1&x_{n}\end{array}\right]
\left[ \begin{array}{c}
\beta_{0}\\
\beta_{1}\end{array}\right]
+\left[ \begin{array}{c}
\varepsilon_{1}\\
\vdots\\
\varepsilon_{n}\end{array}\right]\,.
\end{equation*}
Or, by defining some notation:
\begin{equation}
\boldsymbol{Y}=X\boldsymbol{\beta}+\boldsymbol{\varepsilon}\,
(\#eq:matrix-model)
\end{equation}
where
- $\boldsymbol{Y}$ - $n\times 1$ vector of responses;
- $X$ - $n\times p$ model matrix;
- $\boldsymbol{\beta}$ - $p\times 1$ vector of parameters;
- $\boldsymbol{\varepsilon}$ - $n\times 1$ vector of errors.
The **least squares estimators**, $\hat{\boldsymbol{\beta}}$, are chosen such that the quadratic form
\begin{equation*}
(\boldsymbol{Y}-X\boldsymbol{\beta})^{\textrm{T}}(\boldsymbol{Y}-X\boldsymbol{\beta})
\end{equation*}
is minimised (recall that $\textrm{E}(\boldsymbol{Y})=X\boldsymbol{\beta}$). Therefore
\begin{equation*}
\hat{\boldsymbol{\beta}} = \textrm{argmin}_{\boldsymbol{\beta}}(\boldsymbol{Y}^{\textrm{T}}\boldsymbol{Y}+\boldsymbol{\beta}^{\textrm{T}}X^{\textrm{T}}X\boldsymbol{\beta}
-2\boldsymbol{\beta}^{\textrm{T}}X^{\textrm{T}}\boldsymbol{Y})\,.
\end{equation*}
If we differentiate with respect to $\boldsymbol{\beta}$^[Check the [Matrix Cookbook](https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf) for matrix calculus, amongst much else.],
\begin{equation*}
\frac{\partial}{\partial\boldsymbol{\beta}}=2X^{\textrm{T}}X\boldsymbol{\beta}-2X^{\textrm{T}}\boldsymbol{Y}\,,\nonumber
\end{equation*}
and equate to 0, we get the estimators
\begin{equation}
\hat{\boldsymbol{\beta}}=(X^{\textrm{T}}X)^{-1}X^{\textrm{T}}\boldsymbol{Y}\,.
(\#eq:lsestimators)
\end{equation}
These are the least squares estimators.
For Example \@ref(exm:motivation),
$$
X=\left[\begin{array}{cc}
1&x_{1}\\
\vdots&\vdots\\
1&x_{n}\end{array}\right]\,,
\qquad
X^{\textrm{T}}X=\left[\begin{array}{cc}
n&\sum x_j\\
\sum x_j&\sum x_j^{2}\end{array}\right]\,,
$$
$$
(X^{\textrm{T}}X)^{-1}=\frac{1}{n\sum x_j^{2}-(\sum x_j)^{2}}\left[\begin{array}{cc}
\sum x_j^{2}&-\sum x_j\\
-\sum x_j&n\end{array}\right]\,,
\qquad
X^{\textrm{T}}\boldsymbol{Y}=\left[\begin{array}{c}
\sum Y_j\\
\sum x_jY_j\end{array}\right]\,.
$$
Then,
\begin{align}
\hat{\boldsymbol{\beta}}=\left[\begin{array}{c}
\hat{\beta}_{0}\\
\hat{\beta}_{1}\end{array}\right]
& =\frac{1}{n\sum x_j^{2}-(\sum x_j)^{2}}
\left[\begin{array}{cc}
\sum x_j^{2}&-\sum x_j\\
-\sum x_j&n\end{array}\right]
\left[\begin{array}{c}
\sum Y_j\\
\sum x_jY_j\end{array}\right]\nonumber \\
&= \frac{1}{n\sum x_j^{2}-(\sum x_j)^{2}}\left[\begin{array}{c}
\sum Y_j\sum x_j^{2}-\sum x_j\sum x_jY_j\\
n\sum x_jY_j-\sum x_j\sum Y_j\end{array}\right]\,.
\end{align}
We don't usually work through the algebra in such detail; the matrix form is often sufficient for theoretical and numerical calculations, and software, e.g. `R`, can be used.
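As a quick numerical check (a minimal sketch: the data are simulated, and the parameter values $\beta_0 = 1$, $\beta_1 = 2$, $\sigma = 0.5$ and the seed are arbitrary), we can compute the least squares estimators \@ref(eq:lsestimators) directly in `R` and compare them with the output from `lm`:

```{r ls-by-hand}
set.seed(1)
n <- 10
x <- rep(c(0, 1), each = n / 2)            # five subjects per treatment
X <- cbind(1, x)                           # model matrix
y <- 1 + 2 * x + rnorm(n, sd = 0.5)        # simulated responses
solve(t(X) %*% X, t(X) %*% y)              # beta-hat = (X^T X)^{-1} X^T y
coef(lm(y ~ x))                            # agrees with lm
```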
The precision of $\hat{\boldsymbol{\beta}}$ is measured via the variance-covariance matrix, given by
\begin{align}
\textrm{Var}(\hat{\boldsymbol{\beta}}) & = \textrm{Var}\{(X^{\textrm{T}}X)^{-1}X^{\textrm{T}}\boldsymbol{Y}\}\\
& =(X^{\textrm{T}}X)^{-1}X^{\textrm{T}}\textrm{Var}(\boldsymbol{Y})X(X^{\textrm{T}}X)^{-1}\\
& = (X^{\textrm{T}}X)^{-1}\sigma^{2}\,,
\end{align}
where $\boldsymbol{Y}\sim N(X\boldsymbol{\beta},I_n\sigma^{2})$, with $I_n$ the $n\times n$ identity matrix.
Hence, in our example,
\begin{align*}
\textrm{Var}(\hat{\boldsymbol{\beta}}) & = \frac{1}{n\sum x_j^{2}-(\sum x_j)^{2}}\left[\begin{array}{cc}
\sum x_j^{2}&-\sum x_j\\
-\sum x_j&n\end{array}\right]\sigma^{2}\\
& = \left[\begin{array}{cc}
\textrm{Var}(\hat\beta_{0})&\textrm{Cov}(\hat\beta_{0},\hat\beta_{1})\\
\textrm{Cov}(\hat\beta_{0},\hat\beta_{1})&\textrm{Var}(\hat\beta_{1})\end{array}\right]\,.
\end{align*}
For estimating the difference between treatments, we are interested in
\begin{align*}
\textrm{Var}(\hat{\beta}_{1})& = \frac{n}{n\sum x_j^{2}-(\sum x_j)^{2}}\sigma^{2}\\
& = \frac{1}{\sum x_j^{2} - n\bar{x}^2}\sigma^{2}\,,
\end{align*}
where $\bar{x} = \sum x_j / n$.
Assuming constant $\sigma^2$, to achieve the most precise estimator we need to minimise $\textrm{Var}(\hat{\beta}_{1})$ or, equivalently, maximise $\sum x_j^{2} - n\bar{x}^2$. This goal can be achieved through the choice of $x_{1},\dots,x_{n}$:
- as each $x_j$ can only take one of two values, 0 or 1, this is equivalent to choosing the numbers of subjects assigned to treatment 1 and treatment 2;
- call these $n_{1}$ and $n_{2}$ respectively, with $n_{1}+n_{2}=n$.
We can find an upper bound for the quantity $\sum x_j^{2} - n\bar{x}^2$. As each $x_j\in\{0,1\}$, we have
\begin{align}
\sum x_j^2 & = \sum x_j \\
& = n\bar{x}\,.
\end{align}
Hence,
\begin{align*}
\sum x_j^{2} - n\bar{x}^2 & = n\bar{x} - n\bar{x}^2 \\
& = n\bar{x}(1-\bar{x}) \\
& \le n/4\,,
\end{align*}
as $n\bar{x}(1-\bar{x})$ is a quadratic function of $\bar{x}$ that is maximised at $\bar{x} = 1/2$.
If we can find a set of design points that satisfy $\bar{x} = 1/2$, we will have an **optimal design**. Assuming $n$ is even, one possibility is
- $n_{1}=\frac{n}{2}$ subjects assigned to treatment 1 ($x_j = 0$) and
- $n_{2}=\frac{n}{2}$ subjects assigned to treatment 2 ($x_j = 1$).
For $n$ odd, we choose $n_{1}=\frac{n+1}{2}$, $n_{2}=\frac{n-1}{2}$, or vice versa, to get as close as possible to the optimal design.
::: {.definition #simple-efficiency}
We can assess different designs using their **efficiency**:
\begin{equation}
\textrm{Eff}=\frac{\textrm{Var}(\hat{\beta}_{1}\, |\, d^{*})}{\textrm{Var}(\hat{\beta}_{1}\, |\, d_{1})}
(\#eq:simple-efficiency)
\end{equation}
where $d_{1}$ is a design we want to assess and $d^{*}$ is the optimal design with smallest variance. Note that $0\leq\textrm{Eff}\leq 1$.
:::
In Figure \@ref(fig:simple-efficiency) below, we plot this efficiency for Example \@ref(exm:motivation), using different choices of $n_1$. The total number of runs is fixed at $n = 100$, and the function `eff` calculates the efficiency (assuming $n$ is even) from Definition \@ref(def:simple-efficiency) for a design with $n_1$ subjects assigned to treatment 1. Clearly, an efficiency of 1 is achieved when $n_1 = n_2$ (equal allocation of treatments 1 and 2). If $n_1 = 0$ or $n_1 = n$, the efficiency is zero; we cannot estimate the difference between the two treatments if we allocate all subjects to only one of them.
(ref:foo) Efficiencies for designs for Example \@ref(exm:motivation) with different numbers, $n_1$, of subjects assigned to treatment 1 when the total number of subjects is $n=100$.
```{r simple-efficiency, fig.cap = '(ref:foo)'}
n <- 100
# efficiency of a design assigning n1 subjects to treatment 1 (and n - n1 to treatment 2)
eff <- function(n1) 4 * n1 * (n - n1) / n^2
curve(eff, from = 0, to = n, ylab = "Eff", xlab = expression(n[1]))
```
## Aims of experimentation and some examples
Some reasons experiments are performed:
1. Treatment comparison
- compare several treatments (and choose the best)
- e.g. clinical trial, agricultural field trial
2. Factor screening
- many complex systems may involve a large number of (discrete) factors (controllable features)
- which of these factors have a substantive impact?
- (relatively) small experiments
- e.g. industrial experiments on manufacturing processes
3. Response surface exploration
- detailed description of relationship between important (continuous) variables and response
- typically second order polynomial regression models
- larger experiments, often built up sequentially
- e.g. alcohol yields in a pharmaceutical experiment
4. Optimisation
- finding settings of variables that lead to maximum or minimum response
- typically use response surface methods and sequential "hill climbing" strategy
In this module, we will focus on **treatment comparison** (Chapters 2 and 3) and **factor screening** (Chapters 4, 5 and 6).
## Some definitions
::: {.definition #response}
The **response** $Y$ is the outcome measured in an experiment; e.g. yield from a chemical process. The responses from the $n$ observations are denoted $Y_{1},\dots,Y_{n}$.
:::
::: {.definition #factor-variable}
**Factors** (discrete) or **variables** (continuous) are features which can be set or controlled in an experiment; $m$ denotes the number of factors or variables under investigation. For discrete factors, we call the possible settings of the factor its **levels**. We denote by $x_{ij}$ the value taken by factor or variable $i$ in the $j$th run of the experiment ($i = 1, \ldots, m$; $j = 1, \ldots, n$).
:::
<!--
::: {.definition #designpoint}
A **design point** is an $m$-vector $\boldsymbol{x}_{j} = (x_{j1}, \ldots, x_{jm})^\top$ giving the values taken by the $m$ factors or variables in the $j$th run ($j=1,\dots,n$).
:::
-->
::: {.definition #treatment}
The **treatments** or **support points** are the *distinct* combinations of factor or variable values in the experiment.
:::
::: {.definition #unit}
An experimental **unit** is the basic element (material, animal, person, time unit, \ldots) to which a treatment can be applied to produce a response.
:::
In Example \@ref(exm:motivation) (comparing two treatments):
- Response $Y$: Measured outcome, e.g. protein level or pain score in clinical trial, yield in an agricultural field trial.
- Factor $x$: "treatment" applied
- Levels
$$
\begin{array}{ll}
\textrm{treatment 1}&x =0\\
\textrm{treatment 2}&x =1
\end{array}
$$
<!--- Design point: factor level applied to $j$th subject; $x_{j}=\pm 1$-->
- Treatment or support point: Two treatments or support points
- Experimental unit: Subject (person, animal, plot of land, $\ldots$).
## Principles of experimentation {#principles}
Three fundamental principles that need to be considered when designing an experiment are:
- replication
- randomisation
- stratification (blocking)
### Replication {#replication}
Each treatment is applied to a number of experimental units, with the $j$th treatment replicated $r_{j}$ times. This enables the estimation of the variances of treatment effect estimators; increasing the number of replications, or replicates, decreases the variance of estimators of treatment effects.
(Note: proper replication involves independent application of the treatment to different experimental units, not just taking several measurements from the same unit).
### Randomisation {#randomisation}
Randomisation should be applied to the allocation of treatments to units. Randomisation protects against **bias**: the effect of variables that are unknown and potentially uncontrolled, or subjectivity in applying treatments. It also provides a formal basis for inference and statistical testing.
For example, in a clinical trial to compare a new drug and a control, random allocation protects against
- "unmeasured and uncontrollable" features (e.g. age, sex, health)
- bias resulting from the clinician giving the new drug to patients who are sicker.
Clinical trials are usually also *double-blinded*, i.e. neither the healthcare professional nor the patient knows which treatment the patient is receiving.
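As a sketch of how such a random allocation might be generated in `R` (the treatment labels and number of patients here are hypothetical, and any seed could be used), `sample` randomly permutes the treatment labels across patients:

```{r random-allocation}
set.seed(2)
# treatment received by patients 1, ..., 8 under a completely randomised allocation
sample(rep(c("control", "new drug"), each = 4))
```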
### Stratification (or blocking) {#intro-blocking}
We would like to use a wide variety of experimental units (e.g. people or plots of land) to ensure **coverage** of our results, i.e. validity of our conclusions across the population of interest. However, if the sample of units from the population is too heterogeneous, this will induce too much random variability, i.e. increase $\sigma^{2}$ in $\varepsilon_{j}\sim N(0,\sigma^{2})$, and hence increase the variance of our parameter estimators.
We can reduce this extraneous variation by splitting our units into homogeneous sets, or **blocks**, and including a blocking term in the model. The simplest blocked experiment is a **randomised complete block design**, where each block contains enough units for all treatments to be applied. Comparisons can then be made *within* each block.
Basic principle: block what you can, randomise what you cannot.
Later we will look at blocking in more detail, and the principle of **incomplete blocks**.
## Revision on the linear model {#lin-model-rev}
Recall: $\boldsymbol{Y}=X\boldsymbol{\beta}+\boldsymbol{\varepsilon}$, with $\boldsymbol{\varepsilon}\sim N(\boldsymbol{0},I_n\sigma^{2})$. Let the $j$th row of $X$ be denoted $\boldsymbol{x}^\textrm{T}_j$, which holds the values of the predictors, or explanatory variables, for the $j$th observation. Then
\begin{equation*}
Y_j=\boldsymbol{x}_j^{\textrm{T}}\boldsymbol{\beta}+\varepsilon_j\,,\quad j=1,\ldots,n\,.
\end{equation*}
For example, quite commonly, for continuous variables
$$
\boldsymbol{x}_j=(1,x_{1j},x_{2j},\dots,x_{mj})^{\textrm{T}}\,,
$$
and so
$$
\boldsymbol{x}_j^{\textrm{T}}\boldsymbol{\beta}=\beta_{0}+\beta_{1}x_{1j}+\dots+\beta_{m}x_{mj}\,.
$$
The least squares estimators are given by
\begin{equation}
\hat{\boldsymbol{\beta}}=(X^{\textrm{T}}X)^{-1}X^{\textrm{T}}\boldsymbol{Y}\,,\nonumber
\end{equation}
with
\begin{equation}
\textrm{Var}(\hat{\boldsymbol{\beta}})=(X^{\textrm{T}}X)^{-1}\sigma^{2}\,.\nonumber
\end{equation}
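These formulae are also what underlies the standard `R` output: for a linear model fitted with `lm`, `vcov` returns $(X^{\textrm{T}}X)^{-1}\hat{\sigma}^{2}$. A minimal sketch with simulated data (the model and coefficient values are arbitrary):

```{r vcov-check}
set.seed(3)
n <- 20
x1 <- runif(n)
x2 <- runif(n)
y <- 1 + 2 * x1 - x2 + rnorm(n)            # simulated data from an arbitrary model
fit <- lm(y ~ x1 + x2)
X <- model.matrix(fit)
sigma2_hat <- summary(fit)$sigma^2         # estimate of sigma^2
max(abs(vcov(fit) - solve(t(X) %*% X) * sigma2_hat))  # essentially zero
```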
<!--
### Variance of a Prediction/Fitted Value
A prediction of the mean response at point $\boldsymbol{x}_0$ (which may or may not be in the design) is
$$
\hat{Y}_0 = \boldsymbol{x}_0^{\textrm{T}}\hat{\boldsymbol{\beta}}\,,
$$
with
\begin{align*}
\textrm{Var}(\hat{Y}_0) & = \textrm{Var}\left(\boldsymbol{x}_0^{\textrm{T}}\hat{\boldsymbol{\beta}}\right) \\
& = \boldsymbol{x}_0^{\textrm{T}}\textrm{Var}(\hat{\boldsymbol{\beta}})\boldsymbol{x}_0 \\
& = \boldsymbol{x}_0^{\textrm{T}}(X^{\textrm{T}}X)^{-1}\boldsymbol{x}_0\sigma^{2}\,.
\end{align*}
For a linear model, this variance depends only on the assumed regression model and the design (through $X$), the point at which prediction is to be made ($\boldsymbol{x}_0$) and the value of $\sigma^2$; it does not depend on data $\boldsymbol{Y}$ or parameters $\boldsymbol{\beta}$.
Similarly, we can find the variance-covariance matrix of the fitted values:
$$
\textrm{Var}(\hat{Y})=\textrm{Var}(X\hat{\boldsymbol{\beta}})=X(X^{\textrm{T}}X)^{-1}X^{\textrm{T}}\sigma^{2}\,.
$$
-->
### Analysis of Variance as Model Comparison {#anova-revision}
To assess the goodness-of-fit of a model, we can use the residual sum of squares
\begin{align*}
\textrm{RSS} & = (\boldsymbol{Y} - X\hat{\boldsymbol{\beta}})^{\textrm{T}} (\boldsymbol{Y} - X\hat{\boldsymbol{\beta}})\\
& = \sum^{n}_{j=1}\left\{Y_{j}-\boldsymbol{x}_{j}^{\textrm{T}}\hat{\boldsymbol{\beta}}\right\}^{2}\\
& = \sum^{n}_{j=1}r_{j}^{2}\,,
\end{align*}
where
$$
r_{j}=Y_{j}-\boldsymbol{x}_{j}^{\textrm{T}}\hat{\boldsymbol{\beta}}\,.
$$
Often, a comparison is made to the null model
$$
Y_{j}=\beta_{0}+\varepsilon_{j}\,,
$$
i.e. $Y_{j}\sim N(\beta_{0},\sigma^{2})$. The residual sum of squares for the null model is given by
$$
\textrm{RSS}(\textrm{null}) = \boldsymbol{Y}^{\textrm{T}}\boldsymbol{Y} - n\bar{Y}^{2}\,,
$$
as
$$
\hat{\beta}_{0} = \bar{Y} = \frac{1}{n}\sum_{j=1}^n Y_{j}\,.
$$
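Both residual sums of squares are easily obtained from a fitted `lm` object; a brief sketch with simulated data (the model is arbitrary):

```{r rss-sketch}
set.seed(4)
n <- 20
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
sum(residuals(fit)^2)                      # RSS for the fitted model
sum((y - mean(y))^2)                       # RSS(null) = Y^T Y - n * Ybar^2
```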
<!--How do we compare these models?
1. Ratio of residual sum of squares:
\begin{align*}
R^{2} & = 1 - \frac{\textrm{RSS}}{\textrm{RSS}(\textrm{null})} \\
& = 1 - \frac{(\boldsymbol{Y}-X\hat{\boldsymbol{\beta}})^{\textrm{T}}(\boldsymbol{Y}-X\hat{\boldsymbol{\beta}})}{\boldsymbol{Y}^{\textrm{T}}\boldsymbol{Y}-n\bar{Y}^{2}}\,.
\end{align*}
The quantity $0\leq R^{2}\leq 1$ is sometimes called the **coefficient of multiple determination**:
- high $R^{2}$ implies that the model describes much of the variation in the data;
- **but** note that $R^{2}$ will always increase as $p$ (the number of explanatory variables) increases, with $R^{2}=1$ when $p=n$;
- some software packages will report the adjusted $R^{2}$.
\begin{align*}
R^{2}_{a} & = 1-\frac{\textrm{RSS}/(n-p)}{\textrm{RSS}(\textrm{null})/(n-1)}\\
& = 1 - \frac{(\boldsymbol{Y} - X\hat{\boldsymbol{\beta}})^{\textrm{T}} (\boldsymbol{Y} - X\hat{\boldsymbol{\beta}})/(n-p)}{(\boldsymbol{Y}^{\textrm{T}}\boldsymbol{Y} - n\bar{Y}^{2})/(n-1)};
\end{align*}
- $R_a^2$ does not necessarily increase with $p$ (as we divide by degrees of freedom to adjust for complexity of the model).
-->
An analysis of variance (ANOVA) table is a compact way of presenting the results of (sequential) comparisons of nested models. You should be familiar with an ANOVA table of the following form.
Table: (\#tab:anova) A standard ANOVA table.
| Source | Degrees of Freedom | (Sequential) Sum of Squares | Mean Square |
| :--- | :--- | :--- | :--- |
| Regression | $p-1$ | By subtraction; see \@ref(eq:SSS) | Reg SS/$(p-1)$ |
| Residual | $n-p$ | $(\boldsymbol{Y}-X\hat{\boldsymbol{\beta}})^{\textrm{T}}(\boldsymbol{Y}-X\hat{\boldsymbol{\beta}})$^[Residual sum of squares for the full regression model.] | RSS/$(n-p)$ |
| Total | $n-1$ | $\boldsymbol{Y}^{\textrm{T}}\boldsymbol{Y}-n\bar{Y}^{2}$^[Residual sum of squares for the null model.] | |
In row 1 of Table \@ref(tab:anova) above,
\begin{align}
\textrm{Regression SS = Total SS $-$ RSS} & = \boldsymbol{Y}^{\textrm{T}}\boldsymbol{Y} - n\bar{Y}^{2} - (\boldsymbol{Y}-X\hat{\boldsymbol{\beta}})^{\textrm{T}}(\boldsymbol{Y}-X\hat{\boldsymbol{\beta}})\\
& = -n\bar{Y}^{2}-\hat{\boldsymbol{\beta}}^{\textrm{T}}(X^{\textrm{T}}X)\hat{\boldsymbol{\beta}}+2\hat{\boldsymbol{\beta}}^{\textrm{T}}X^{\textrm{T}}\boldsymbol{Y} \\
& = \hat{\boldsymbol{\beta}}^{\textrm{T}}(X^{\textrm{T}}X)\hat{\boldsymbol{\beta}}-n\bar{Y}^{2}\,,
(\#eq:SSS)
\end{align}
with the last line following from
\begin{align*}
\hat{\boldsymbol{\beta}}^{\textrm{T}}X^{\textrm{T}}\boldsymbol{Y} & =
\hat{\boldsymbol{\beta}}^{\textrm{T}}(X^{\textrm{T}}X)(X^{\textrm{T}}X)^{-1}X^{\textrm{T}}\boldsymbol{Y} \\
& = \hat{\boldsymbol{\beta}}^{\textrm{T}}(X^{\textrm{T}}X)\hat{\boldsymbol{\beta}}\,.
\end{align*}
This idea can be generalised to the comparison of a *sequence* of nested models - see Problem Sheet 1.
Hypothesis testing is performed using the mean square:
\begin{equation}
\frac{\textrm{Regression SS}}{p-1}=\frac{\hat{\boldsymbol{\beta}}^{\textrm{T}}(X^{\textrm{T}}X)\hat{\boldsymbol{\beta}}-n\bar{Y}^{2}}{p-1}\,.\nonumber
\end{equation}
Under $\textrm{H}_{0}: \beta_{1}=\dots=\beta_{p-1}=0$,
\begin{align*}
\frac{\textrm{Regression SS}/(p-1)}{\textrm{RSS}/(n-p)} & = \frac{(\hat{\boldsymbol{\beta}}^{\textrm{T}}(X^{\textrm{T}}X)\hat{\boldsymbol{\beta}} - n\bar{Y}^{2})/(p-1)}{(\boldsymbol{Y}-X\hat{\boldsymbol{\beta}})^{\textrm{T}}(\boldsymbol{Y}-X\hat{\boldsymbol{\beta}})/(n-p)}\nonumber\\
& \sim F_{p-1,n-p}\,,
\end{align*}
an $F$ distribution with $p-1$ and $n-p$ degrees of freedom, defined via the ratio of two independent $\chi^{2}$ random variables, each divided by its degrees of freedom.
Also,
\begin{equation*}
\frac{\textrm{RSS}}{n-p}=\frac{(\boldsymbol{Y}-X\hat{\boldsymbol{\beta}})^{\textrm{T}}(\boldsymbol{Y}-X\hat{\boldsymbol{\beta}})}{n-p}=\hat{\sigma}^{2}
\end{equation*}
is an unbiased estimator for $\sigma^{2}$, and
\begin{equation*}
\frac{(n-p)}{\sigma^{2}}\hat{\sigma}^{2}\sim\chi^{2}_{n-p}\,.
\end{equation*}
This is a Chi-squared distribution with $n-p$ degrees of freedom (see MATH2010 or MATH6174 notes).
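In `R`, the `anova` function carries out this comparison of nested models and reports the corresponding sums of squares, mean squares and $F$-statistic. A minimal sketch with simulated data (the model and coefficient values are arbitrary):

```{r anova-sketch}
set.seed(5)
n <- 20
x1 <- runif(n)
x2 <- runif(n)
y <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)
fit_null <- lm(y ~ 1)                      # null model
fit_full <- lm(y ~ x1 + x2)                # full regression model
anova(fit_null, fit_full)                  # F test of H0: beta_1 = beta_2 = 0
```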
## Exercises
1. [Adapted from @Morris2011] A classic and famous example of a simple hypothetical experiment was described by @Fisher1935:
>A lady declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was added first to the cup. We will consider the problem of designing an experiment by means of which this assertion can be tested. For this purpose let us first lay down a simple form of experiment with a view to studying its limitations and its characteristics, both those that seem essential to the experimental method, when well developed, and those that are not essential but auxiliary.
>
>Our experiment consists in mixing eight cups of tea, four in one way and four in the other, and presenting them to the subject for judgement in a random order. The subject has been told in advance of what the test will consist, namely that she will be asked to taste eight cups, that these shall be four of each kind, and that they shall be presented to her in a random order, that is an order not determined arbitrarily by human choice, but by the actual manipulation of the physical apparatus used in games of chance, cards, dice, roulettes, etc., or, more expeditiously, from a published collection of random sampling numbers purporting to give the actual results of such manipulation^[Now, we would use routines such as `sample` in `R`.]. Her task is to divide the 8 cups into two sets of 4, agreeing, if possible, with the treatments received.
a. Define the treatments in this experiment.
b. Identify the units in this experiment.
c. How might a "physical apparatus" from a "game of chance" be used to perform the randomisation? Explain one example.
d. Suppose eight tea cups are available for this experiment but they are not identical. Instead they come from two sets. Four are made from heavy, thick porcelain; four from much lighter china. If each cup can only be used once, how might this fact be incorporated into the design of the experiment?
<details>
<summary><b>Solution</b></summary>
a. There are two treatments in the experiment - the two orders of adding the ingredients, "milk first" and "tea first".
b. The experimental units are the "cups of tea", made up from the tea and milk used and also the cup itself.
c. The simplest method here might be to select four black playing cards and four red playing cards, assign one treatment to each colour, shuffle the cards, and then draw them in order. The colour drawn indicates the treatment that should be used to make the next cup of tea. This operation would give one possible randomisation.
We could of course also use `R`.
```{r tea-randomise}
sample(rep(c("Milk first", "Tea first"), c(4, 4)), size = 8, replace = F)
```
d. Type of cup could be considered as a blocking factor. One way of incorporating it would be to split the experiment into two (blocks), each with four cups (two milk first, two tea first). We would still wish to randomise allocation of treatments to units within blocks.
```{r tea-blocks}
# block 1
sample(rep(c("Milk first", "Tea first"), c(2, 2)), size = 4, replace = F)
# block 2
sample(rep(c("Milk first", "Tea first"), c(2, 2)), size = 4, replace = F)
```
</details>
2. Consider the linear model
$$\bY = X\bbeta + \bvarepsilon\,,$$
with $\bY$ an $n\times 1$ vector of responses, $X$ an $n\times p$ model matrix and $\bvarepsilon$ an $n\times 1$ vector of independent and identically distributed random variables with constant variance $\sigma^2$.
a. Derive the least squares estimator $\hat{\bbeta}$ for this multiple linear regression model, and show that this estimator is unbiased. Using the definition of (co)variance, show that
$$\mbox{Var}(\hat{\bbeta}) = \left(X^{\mathrm{T}}X\right)^{-1}\sigma^2\,.$$
b. If $\bvarepsilon\sim N (\boldsymbol{0},I_n\sigma^2)$, with $I_n$ being the $n\times n$ identity matrix, show that the maximum likelihood estimators for $\bbeta$ coincide with the least squares estimators.
<details>
<summary><b>Solution</b></summary>
a. The method of least squares minimises the sum of squared differences between the responses and the expected values, that is, minimises the expression
$$
(\bY-X\bbeta)^{\mathrm{T}}(\bY-X\bbeta) = \bY^{\mathrm{T}}\bY - 2\bbeta^{\mathrm{T}}X^{\mathrm{T}}\bY + \bbeta^{\mathrm{T}}X^{\mathrm{T}}X\bbeta\,.
$$
Differentiating with respect to the vector $\bbeta$, we obtain
$$
\frac{\partial}{\partial\bbeta} = -2X^{\mathrm{T}}\bY + 2X^{\mathrm{T}}X\bbeta\,.
$$
Set equal to $\boldsymbol{0}$ and solve:
$$
X^{\mathrm{T}}X\hat{\bbeta} = X^{\mathrm{T}}\bY \Rightarrow \hat{\bbeta} = \left(X^{\mathrm{T}}X\right)^{-1}X^{\mathrm{T}}\bY\,.
$$
The estimator $\hat{\bbeta}$ is unbiased:
$$
E(\hat{\bbeta}) = \left(X^{\mathrm{T}}X\right)^{-1}X^{\mathrm{T}}E(\bY) = \left(X^{\mathrm{T}}X\right)^{-1}X^{\mathrm{T}}X\bbeta = \bbeta\,,
$$
and has variance:
\begin{align*}
\mbox{Var}(\hat{\bbeta}) & =E\left\{ \left[\hat{\bbeta} - E(\hat{\bbeta})\right] \left[\hat{\bbeta} - E(\hat{\bbeta})\right]^{\mathrm{T}} \right\}\\
& = E\left\{ \left[\hat{\bbeta} - \bbeta\right] \left[\hat{\bbeta} - \bbeta\right]^{\mathrm{T}} \right\}\\
& = E\left\{ \hat{\bbeta}\hat{\bbeta}^{\mathrm{T}} - 2\bbeta\hat{\bbeta}^{\mathrm{T}} + \bbeta\bbeta^{\mathrm{T}} \right\}\\
& = E\left\{ \left(X^{\mathrm{T}}X\right)^{-1}X^{\mathrm{T}}\bY\bY^{\mathrm{T}}X\left(X^{\mathrm{T}}X\right)^{-1} - 2\bbeta \bY^{\mathrm{T}}X\left(X^{\mathrm{T}}X\right)^{-1} + \bbeta\bbeta^{\mathrm{T}}\right\}\\
& = \left(X^{\mathrm{T}}X\right)^{-1}X^{\mathrm{T}}E(\bY\bY^{\mathrm{T}})X\left(X^{\mathrm{T}}X\right)^{-1} - 2\bbeta E(\bY^{\mathrm{T}})X\left(X^{\mathrm{T}}X\right)^{-1} + \bbeta\bbeta^{\mathrm{T}}\\
& = \left(X^{\mathrm{T}}X\right)^{-1}X^{\mathrm{T}}\left[\mbox{Var}(\bY) + E(\bY)E(\bY^{\mathrm{T}})\right]X\left(X^{\mathrm{T}}X\right)^{-1} - 2\bbeta\bbeta^{\mathrm{T}}X^{\mathrm{T}}X\left(X^{\mathrm{T}}X\right)^{-1} + \bbeta\bbeta^{\mathrm{T}}\\
& = \left(X^{\mathrm{T}}X\right)^{-1}X^{\mathrm{T}}\left[I_n\sigma^2 + X\bbeta\bbeta^{\mathrm{T}}X^{\mathrm{T}}\right]X\left(X^{\mathrm{T}}X\right)^{-1} - \bbeta\bbeta^{\mathrm{T}}\\
& = \left(X^{\mathrm{T}}X\right)^{-1}\sigma^2\,.
\end{align*}
a. As $\bY\sim N\left(X\bbeta, I_n\sigma^2\right)$, the likelihood is given by
$$
L(\bbeta\,; \bY) = \left(2\pi\sigma^2\right)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}(\bY - X\bbeta)^{\mathrm{T}}(\bY - X\bbeta)\right\}\,.
$$
The log-likelihood is given by
$$
l(\bbeta\,;\bY) = -\frac{1}{2\sigma^2}(\bY - X\bbeta)^{\mathrm{T}}(\bY - X\bbeta) + \mbox{constant}\,.
$$
Up to additive and positive multiplicative constants not involving $\bbeta$, this expression is $-1\times$ the least squares objective function; hence maximising the log-likelihood is equivalent to minimising the least squares objective, and the two sets of estimators coincide.
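A quick numerical illustration of this equivalence (a sketch with simulated data; the model, starting values and seed are arbitrary): minimising the residual sum of squares numerically with `optim` recovers, to numerical accuracy, the estimates returned by `lm`.

```{r mle-ls-check}
set.seed(6)
n <- 30
x <- runif(n)
X <- cbind(1, x)
y <- 2 - x + rnorm(n)
rss <- function(beta) sum((y - X %*% beta)^2)   # least squares objective
optim(c(0, 0), rss)$par                         # numerical minimiser of the RSS
coef(lm(y ~ x))                                 # least squares (= ML) estimates
```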
</details>
3. Consider the two nested linear models
(i) $Y_j = \beta_0 + \beta_1x_{1j} + \beta_2x_{2j} + \ldots + \beta_{p_1}x_{p_1j} + \varepsilon_j$, or $\bY = \boldsymbol{1}_n\beta_0 + X_1\bbeta_1 + \bvarepsilon$,
(ii) $Y_j = \beta_0 + \beta_1x_{1j} + \beta_2x_{2j} + \ldots + \beta_{p_1}x_{p_1j} + \beta_{p_1+1}x_{(p_1+1)j} + \ldots + \beta_{p_1+p_2}x_{(p_1+p_2)j} + \varepsilon_j$, or $\bY = \boldsymbol{1}_n\beta_0 + X_1\bbeta_1 + X_2\bbeta_2+ \bvarepsilon$
with $\varepsilon_j\sim N(0, \sigma^2)$, and $\varepsilon_{j}$, $\varepsilon_{k}$ independent $(\bvarepsilon\sim N(\boldsymbol{0},I_n\sigma^2))$.
a. Construct an ANOVA table to compare model (ii) with the null model $Y_j=\beta_0 + \varepsilon_j$.
b. Extend this ANOVA table to compare models (i) and (ii) by further decomposing the regression sum of squares for model (ii).
**Hint:** which residual sum of squares are you interested in to compare models (i) and (ii)?
You should end up with an ANOVA table of the form
| Source | Degrees of freedom | Sums of squares | Mean square |
| :----- | :----------------: | :-------------: | :---------: |
| Model (i) | $p_1$ | ? | ? |
| Model (ii) | $p_2$ | ? | ? |
| Residual | $n-p_1-p_2-1$ | ? | ? |
| Total | $n-1$ | $\bY^{\mathrm{T}}\bY - n\bar{Y}^2$ | |
The second row of the table gives the **extra sums of squares** for the additional terms in fitting model (ii), over and above those in model (i).
c. Calculate the extra sum of squares for fitting the terms in model (i), over and above those terms only in model (ii), i.e. those held in $X_2\bbeta_2$. Construct an ANOVA table containing both the extra sum of squares for the terms only in model (i) and the extra sum of squares for the terms only in model (ii). Comment on the table.
<details>
<summary><b>Solution</b></summary>
a. From lectures
| Source | Degrees of freedom | Sums of squares | Mean square |
| :----- | :----------------: | :-------------: | :---------: |
| Regression | $p_1+p_2$ | $\hat{\bbeta}^{\mathrm{T}}X^{\mathrm{T}}X\hat{\bbeta} - n\bar{Y}^2$ | $\left(\hat{\bbeta}^{\mathrm{T}}X^{\mathrm{T}}X\hat{\bbeta} - n\bar{Y}^2\right)/(p_1+p_2)$ |
| Residual | $n-p_1-p_2-1$ | $(\bY - X\hat{\bbeta})^{\mathrm{T}}(\bY - X\hat{\bbeta})$ | RSS$/(n-p_1-p_2-1)$ |
| Total | $n-1$ | $\bY^{\mathrm{T}}\bY - n\bar{Y}^2$ | |
where $X = [\boldsymbol{1}_n\, X_1 \, X_2]$.
\begin{align*}
\mbox{RSS(null) - RSS(ii)} & = \bY^{\mathrm{T}}\bY - n\bar{Y}^2 - (\bY - X\hat{\bbeta})^{\mathrm{T}}(\bY - X\hat{\bbeta})\\
& = \bY^{\mathrm{T}}\bY - n\bar{Y}^2 - \bY^{\mathrm{T}}\bY + 2\bY^{\mathrm{T}}X\hat{\bbeta} - \hat{\bbeta}^{\mathrm{T}}X^{\mathrm{T}}X\hat{\bbeta}\\
& = 2\hat{\bbeta}^{\mathrm{T}}X^{\mathrm{T}}X\hat{\bbeta} - \hat{\bbeta}^{\mathrm{T}}X^{\mathrm{T}}X\hat{\bbeta} - n\bar{Y}^2\\
& = \hat{\bbeta}^{\mathrm{T}}X^{\mathrm{T}}X\hat{\bbeta} - n\bar{Y}^2\,.
\end{align*}
a. To compare model (i) with the null model, we compute
\begin{align*}
\mbox{RSS(null) - RSS(i)} & = \bY^{\mathrm{T}}\bY - n\bar{Y}^2 - (\bY - X_1^*\hat{\bbeta}_1^*)^{\mathrm{T}}(\bY - X_1^*\hat{\bbeta}_1^*)\\
& = (\hat{\bbeta}_1^*)^{\mathrm{T}}(X_1^*)^{\mathrm{T}}X_1^*\hat{\bbeta}_1^* - n\bar{Y}^2\,,
\end{align*}
where $X_1^* = [\boldsymbol{1}_n\, X_1]$ and $\boldsymbol{\beta}_1^* = (\beta_0, \boldsymbol{\beta}_1^{\mathrm{T}})^{\mathrm{T}}$.
To compare models (i) and (ii), we compare them both to the null model, and look at the difference between these comparisons:
\begin{align*}
\mbox{[RSS(null) - RSS(ii)] - [RSS(null) - RSS(i)]} & = \mbox{RSS(i) - RSS(ii)}\\
& = (\bY - X_1^*\hat{\bbeta}_1^*)^{\mathrm{T}}(\bY - X_1^*\hat{\bbeta}_1^*) - (\bY - X\hat{\bbeta})^{\mathrm{T}}(\bY - X\hat{\bbeta})\\
& = \hat{\bbeta}^{\mathrm{T}}X^{\mathrm{T}}X\hat{\bbeta} - (\hat{\bbeta}_1^*)^{\mathrm{T}}(X_1^*)^{\mathrm{T}}X_1^*\hat{\bbeta}_1^*\,.
\end{align*}
| Source | Degrees of freedom | Sums of squares | Mean square |
| :----- | :----------------: | :-------------: | :---------: |
| Regression | $p_1+p_2$ | $\hat{\bbeta}^{\mathrm{T}}X^{\mathrm{T}}X\hat{\bbeta} - n\bar{Y}^2$ | $\left(\hat{\bbeta}^{\mathrm{T}}X^{\mathrm{T}}X\hat{\bbeta} - n\bar{Y}^2\right)/(p_1+p_2)$ |
| Model (i) | $p_1$ | $(\hat{\bbeta}_1^*)^{\mathrm{T}}(X_1^*)^{\mathrm{T}}X_1^*\hat{\bbeta}_1^* - n\bar{Y}^2$ | $\left((\hat{\bbeta}_1^*)^{\mathrm{T}}(X_1^*)^{\mathrm{T}}X_1^*\hat{\bbeta}_1^* - n\bar{Y}^2\right)/p_1$ |
| Extra due to Model (ii) | $p_2$ | $\hat{\bbeta}^{\mathrm{T}}X^{\mathrm{T}}X\hat{\bbeta} - (\hat{\bbeta}_1^*)^{\mathrm{T}}(X_1^*)^{\mathrm{T}}X_1^*\hat{\bbeta}_1^*$ | $\left(\hat{\bbeta}^{\mathrm{T}}X^{\mathrm{T}}X\hat{\bbeta} - (\hat{\bbeta}_1^*)^{\mathrm{T}}(X_1^*)^{\mathrm{T}}X_1^*\hat{\bbeta}_1^*\right)/p_2$ |
| Residual | $n-p_1-p_2-1$ | $(\bY - X\hat{\bbeta})^{\mathrm{T}}(\bY - X\hat{\bbeta})$ | RSS$/(n-p_1-p_2-1)$ |
| Total | $n-1$ | $\bY^{\mathrm{T}}\bY - n\bar{Y}^2$ | |
By definition, the Model (i) SS and the Extra SS for Model (ii) sum to the Regression SS.
c. The extra sum of squares for the terms in model (i) over and above those in model (ii) are obtained through comparison of the models
ia. $\bY = X^\star_2\bbeta_2^\star + \bvarepsilon$,
iia. $\bY = \boldsymbol{1}_n\beta_0 + X_1\bbeta_1 + X_2\bbeta_2+ \bvarepsilon = X\bbeta + \bvarepsilon$
where $X^\star_2 = [\boldsymbol{1}_n \, X_2]$ and $\bbeta_2^\star = (\beta_0, \bbeta_2^{\mathrm{T}})^{\mathrm{T}}$ (so model (ia) also contains the intercept).
Extra sum of squares for model (iia):
\begin{align*}
\mbox{[RSS(null) - RSS(iia)] - [RSS(null) - RSS(ia)]} & = \mbox{RSS(ia) - RSS(iia)}\\
& = (\bY - X_2^\star\hat{\bbeta}^\star_2)^{\mathrm{T}}(\bY - X_2^\star\hat{\bbeta}^\star_2) - (\bY - X\hat{\bbeta})^{\mathrm{T}}(\bY - X\hat{\bbeta})\\
& = \hat{\bbeta}^{\mathrm{T}}X^{\mathrm{T}}X\hat{\bbeta} - (\hat{\bbeta}^\star_2)^{\mathrm{T}}(X_2^\star)^{\mathrm{T}}X_2^\star\hat{\bbeta}^\star_2\,.
\end{align*}
Hence, an ANOVA table for the extra sums of squares is given by
| Source | Degrees of freedom | Sums of squares | Mean square |
| :----: | :----------------: | :-------------: | :---------: |
| Regression | $p_1+p_2$ | $\hat{\bbeta}^{\mathrm{T}}X^{\mathrm{T}}X\hat{\bbeta} - n\bar{Y}^2$ | $\left(\hat{\bbeta}^{\mathrm{T}}X^{\mathrm{T}}X\hat{\bbeta} - n\bar{Y}^2\right)/(p_1+p_2)$ |
| Extra due to Model (i) only | $p_1$ | $\hat{\bbeta}^{\mathrm{T}}X^{\mathrm{T}}X\hat{\bbeta} - (\hat{\bbeta}^\star_2)^{\mathrm{T}}(X_2^\star)^{\mathrm{T}}X_2^\star\hat{\bbeta}^\star_2$ | $\left(\hat{\bbeta}^{\mathrm{T}}X^{\mathrm{T}}X\hat{\bbeta} - (\hat{\bbeta}^\star_2)^{\mathrm{T}}(X_2^\star)^{\mathrm{T}}X_2^\star\hat{\bbeta}^\star_2\right)/p_1$ |
| Extra due to Model (ii) only | $p_2$ | $\hat{\bbeta}^{\mathrm{T}}X^{\mathrm{T}}X\hat{\bbeta} - (\hat{\bbeta}_1^*)^{\mathrm{T}}(X_1^*)^{\mathrm{T}}X_1^*\hat{\bbeta}_1^*$ | $\left(\hat{\bbeta}^{\mathrm{T}}X^{\mathrm{T}}X\hat{\bbeta} - (\hat{\bbeta}_1^*)^{\mathrm{T}}(X_1^*)^{\mathrm{T}}X_1^*\hat{\bbeta}_1^*\right)/p_2$ |
| Residual | $n-p_1-p_2-1$ | $(\bY - X\hat{\bbeta})^{\mathrm{T}}(\bY - X\hat{\bbeta})$ | RSS$/(n-p_1-p_2-1)$ |
| Total | $n-1$ | $\bY^{\mathrm{T}}\bY - n\bar{Y}^2$ | |
Note that for these *adjusted* sums of squares, in general the extra sum of squares for model (i) and (ii) do not sum to the regression sum of squares. This will only be the case if the columns of $X_1$ and $X_2$ are mutually orthogonal, i.e. $X_1^{\mathrm{T}}X_2 = \boldsymbol{0}$, and if both $X_1$ and $X_2$ are also orthogonal to $\boldsymbol{1}_n$, i.e. $X_1^{\mathrm{T}}\boldsymbol{1}_n = \boldsymbol{0}$ and $X_2^{\mathrm{T}}\boldsymbol{1}_n = \boldsymbol{0}$.
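This can be seen numerically; a sketch with simulated, deliberately non-orthogonal predictors (all values arbitrary), where `anova` reports sequential sums of squares that change with the order in which the terms are fitted:

```{r extra-ss-sketch}
set.seed(7)
n <- 20
x1 <- runif(n)
x2 <- x1 + runif(n)                        # correlated with x1, so not orthogonal
y <- 1 + x1 + x2 + rnorm(n)
anova(lm(y ~ x1 + x2))                     # x1 first, then x2 adjusted for x1
anova(lm(y ~ x2 + x1))                     # reversed order: different sums of squares
```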
</details>