```{r include=FALSE}
source("R/common.R")
knitr::opts_chunk$set(fig.path = "figs/ch08/")
```
::: {.content-visible unless-format="pdf"}
{{< include latex/latex-commands.qmd >}}
:::
# Collinearity & Ridge Regression {#sec-collin}
In univariate multiple regression models, we usually hope to have high correlations between the outcome $y$ and each of the
predictors, $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \dots]$, but high correlations _among_ the predictors can cause problems
in estimating and testing their effects. Exactly the same problems can exist in multivariate response models,
because these problems involve only the relations among the predictor variables.
The problem of high correlations among the predictors in a model is called **collinearity**
(or multicollinearity), referring to the situation when two or more predictors
are very nearly linearly related to each other (collinear).
This chapter illustrates the nature of collinearity geometrically, using data and confidence
ellipsoids.
It describes diagnostic measures to assess these effects,
and presents some novel visual tools for these purposes using the `r pkg("VisCollin")` package.
One class of solutions for collinearity involves _regularization methods_ such as ridge regression.
Another collection of graphical methods, generalized ridge trace plots, implemented
in the `r pkg("genridge")` package, sheds further light on what is accomplished by this technique.
More generally, the methods of this chapter are further examples of how data and confidence
ellipsoids can be used to visualize bias **and** precision of regression estimates.
**Packages**
In this chapter we use the following packages. Load them now.
```{r}
library(car)
library(VisCollin)
library(genridge)
library(MASS)
library(dplyr)
library(factoextra)
library(ggrepel)
library(patchwork)
```
## What is collinearity?
Researchers who have studied standard treatments of linear models
(e.g., @Graybill1961; @Hocking2013)
are often confused about what collinearity is, how to find its sources, and what steps to take to resolve them.
There are a number of important diagnostic measures that can help, but these are usually presented
in a tabular display like @fig-collinearity-diagnostics-SPSS, which prompted this query on
an online forum:
> Some of my collinearity diagnostics have large values, or small values, or whatever they are _not_ supposed to be
>
> * What is bad?
> * If bad, what can I do about it?
```{r}
#| label: fig-collinearity-diagnostics-SPSS
#| out-width: "100%"
#| fig-cap: "Collinearity diagnostics for a multiple regression model from SPSS. _Source_: Arndt Regorz, How to interpret a Collinearity Diagnostics table in SPSS, https://bit.ly/3YRB82b"
#| echo: false
knitr::include_graphics("images/collinearity-diagnostics-SPSS.png")
```
The trouble with displays like @fig-collinearity-diagnostics-SPSS is that the important
information is hidden in a sea of numbers, some of which are bad when large, others
bad when they are small and a large bunch which are irrelevant.
In @FriendlyKwan:2009, we liken this problem to that of the reader of
Martin Handford's
successful series of books, _Where's Waldo_.
These consist of a series of full-page illustrations of hundreds of
people and things and a Waldo --- a character wearing a red and white striped
shirt and hat, glasses, and carrying a walking stick or other paraphernalia.
Waldo was never disguised, yet the complex arrangement of misleading visual cues
in the pictures made him very hard to find.
Collinearity diagnostics often provide a similar puzzle: where should you look
in traditional tabular displays?
<!-- This image based on: https://x.com/ErrorJustin/status/830205933598879744 -->
```{r}
#| label: fig-wheres-waldo
#| echo: false
#| out-width: "100%"
#| fig-cap: "A scene from one of the Where's Waldo books. Waldo wears a red-striped shirt, but far too many of the other figures in the scene have horizontal red stripes, making it very difficult to find him among all the distractors. This is often the problem with collinearity diagnostics. _Source_: Modified from https://bit.ly/48KPcOo"
knitr::include_graphics("images/wheres-waldo.png")
```
Recall the standard classical linear model for a response variable $y$ with a collection of predictors
in $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_p)$
\begin{align*}
\mathbf{y} & = \beta_0 + \beta_1 \mathbf{x}_1 + \beta_2 \mathbf{x}_2 + \cdots + \beta_p \mathbf{x}_p + \boldsymbol{\epsilon} \\
& = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \; ,
\end{align*}
for which the ordinary least squares solution is:
$$
\widehat{\mathbf{b}} = (\mathbf{X}^\mathsf{T} \mathbf{X})^{-1} \; \mathbf{X}^\mathsf{T} \mathbf{y} \; .
$$
The sampling variances and covariances of the estimated coefficients are given by
$\text{Var} (\widehat{\mathbf{b}}) = \sigma_\epsilon^2 \times (\mathbf{X}^\mathsf{T} \mathbf{X})^{-1}$, where $\sigma_\epsilon^2$ is the variance of the residuals $\boldsymbol{\epsilon}$, estimated by the
mean squared error (MSE).
In the limiting case, collinearity becomes particularly problematic when one $x_i$ is _perfectly_ predictable from the other $x$s, i.e., $R^2 (x_i | \text{other }x) = 1$. This is problematic because:
* there is no _unique_ solution for the regression coefficients
$\mathbf{b} = (\mathbf{X}^\mathsf{T} \mathbf{X})^{-1} \mathbf{X}^\mathsf{T} \mathbf{y}$;
* the standard errors $s (b_i)$ of the estimated coefficients are infinite and _t_ statistics $t_i = b_i / s (b_i)$
are 0.
This extreme case reflects a situation where one or more predictors are effectively redundant, for example
when you include two variables and their sum in a model: a dataset may include variables for income, expenses, and savings,
but income is the sum of expenses and savings, so not all three should be used as predictors.
A more subtle case is the use of _ipsatized_ scores, defined as
scores that sum to a constant, such as proportions of a total. You might have scores on
tests of reading, math, spelling and geography, expressed as proportions of a total score.
If reading is 0.5 and math and spelling are both 0.15, then geography must be 0.2.
Once three of the four scores are known, the last provides no new information.
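To see the extreme case concretely, here is a tiny simulated sketch based on the income example above (the outcome variable `happy` and the numbers are hypothetical, not from the original text): because the three predictors are exactly linearly dependent, `lm()` cannot estimate all three coefficients and reports the redundant one as `NA` (aliased).

```{r perfect-collinearity-demo}
# Perfect collinearity: income is exactly expenses + savings, so the
# coefficients are not all estimable -- lm() reports the last one as NA
set.seed(1)
expenses <- rnorm(100, mean = 40, sd = 10)
savings  <- rnorm(100, mean = 10, sd = 5)
income   <- expenses + savings
happy    <- 0.5 * expenses + 2 * savings + rnorm(100, sd = 5)
coef(lm(happy ~ income + expenses + savings))
```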
More generally, collinearity refers to the case when there are very high
**multiple correlations** among the predictors, such as $R^2 (x_i | \text{other }x) \ge 0.9$.
Note that you can't tell simply by looking at the simple correlations. A large correlation
$r_{ij}$ is _sufficient_ for collinearity, but not _necessary_---you can have variables
$x_1, x_2, x_3$ for which the pairwise correlations are low, but the multiple correlation is high.
The consequences are:
* The estimated coefficients have large standard errors, $s(\hat{b}_j)$. They are multiplied by
the square root of the variance inflation factor, $\sqrt{\text{VIF}}$, discussed below.
* The large standard errors deflate the $t$-statistics, $t = \hat{b}_j / s(\hat{b}_j)$, by the same factor,
so a coefficient that would be significant if the predictors were uncorrelated becomes insignificant
when collinearity is present.
* Thus, you may find a situation where the overall model is highly significant (large $F$-statistic), while
none (or few) of the individual predictors are. This is a puzzlement!
* Beyond this, the least squares solution may have poor numerical accuracy [@Longley:1967], because
the solution depends inversely on the determinant $|\,\mathbf{X}^\mathsf{T} \mathbf{X}\,|$, which approaches 0 as multiple correlations increase.
* There is an interpretive problem as well. Recall that the coefficients $\hat{b}$ are _partial coefficients_, meaning that they estimate
the change $\Delta y$ in $y$ when $x$ changes by one unit, holding **all other variables
constant**. With highly correlated predictors, the model may be trying to estimate something that does not occur in the data
(for example, predicting strength from the highly correlated height and weight).
### Visualizing collinearity
Collinearity can be illustrated in data space for two predictors in terms of the stability of the
regression plane for a linear model `Y = X1 + X2`. @fig-collin-demo (adapted from @Fox:2016:ARA, Fig. 13.2)
shows three cases as 3D plots of $(X_1, X_2, Y)$, where the correlation of predictors can be
observed in the $(X_1, X_2)$ plane.
(a) In panel (a), $X_1$ and $X_2$ are uncorrelated, as can be seen in their scatter in the horizontal plane (`+` symbols).
The gray regression plane is well-supported; a small change in $Y$ for one observation won't make much difference.
(b) In panel (b), $X_1$ and $X_2$ have a perfect correlation, $r (x_1, x_2) = 1.0$. The regression plane
is not unique; in fact, there are an infinite number of planes that fit the data equally well. Note that,
if all we care about is prediction (not the coefficients), we could use $X_1$ or $X_2$, or both, or
any weighted sum of them in a model and get the same predicted values.
(c) Panel (c) shows a typical case where there is a strong, but not perfect, correlation between $X_1$ and $X_2$. The regression plane
here is unique, but is not well determined. A small change in $Y$ **can** make quite a difference
in the fitted values or coefficients, depending on the values of $X_1$ and $X_2$.
Where $X_1$ and $X_2$ are far from their near-linear relation in the bottom plane,
you can imagine that it is easy to tilt the plane substantially by a small change in $Y$.
```{r}
#| label: fig-collin-demo
#| echo: false
#| out-width: 100%
#| fig.cap: "Effect of collinearity on the least squares regression plane.
#| (a) Small correlation between predictors;
#| (b) Perfect correlation ;
#| (c) Very strong correlation.
#| The black points show the data Y values, white points are the fitted values in the regression plane,
#| and + signs represent the values of X1 and X2.
#| _Source_: Adapted from @Fox:2016:ARA, Fig. 13.2"
knitr::include_graphics("images/collin-demo.png")
```
### Data space and $\beta$ space
It is also useful to visualize collinearity by comparing the representation in **data space** with the
analogous view of the confidence ellipses for coefficients in **beta space**. To do so in this example, I generate
data from a known model $y = 3 x_1 + 3 x_2 + \epsilon$ with $\epsilon \sim \mathcal{N} (0, 100)$
and various true correlations between $x_1$ and $x_2$, $\rho_{12} = (0, 0.8, 0.97)$ [^1].
[^1]: This example is adapted from one by John Fox (2022), [Collinearity Diagnostics](https://socialsciences.mcmaster.ca/jfox/Courses/SORA-TABA/slides-collinearity.pdf)
::: {.column-margin}
R file: `R/collin-data-beta.R`
:::
First, I use `MASS::mvrnorm()` to construct a list of three data frames `XY` with the same means
and standard deviations, but with different correlations. In each case, the variable $y$
is generated with true coefficients `beta` $=(3, 3)$, and the fitted model for that value of
`rho` is added to a corresponding list of models, `mods`.
```{r collin-data-beta-gen}
#| code-fold: show
library(MASS)
library(car)
set.seed(421) # reproducibility
N <- 200 # sample size
mu <- c(0, 0) # means
s <- c(1, 1) # standard deviations
rho <- c(0, 0.8, 0.97) # correlations
beta <- c(3, 3) # true coefficients
# Specify a covariance matrix, with standard deviations
# s[1], s[2] and correlation r
Cov <- function(s, r){
  matrix(c(s[1]^2,        r * s[1]*s[2],
           r * s[1]*s[2], s[2]^2), nrow = 2, ncol = 2)
}
# Generate a dataframe of X, y for each rho
# Fit the model for each
XY <- vector(mode ="list", length = length(rho))
mods <- vector(mode ="list", length = length(rho))
for (i in seq_along(rho)) {
r <- rho[i]
X <- mvrnorm(N, mu, Sigma = Cov(s, r))
colnames(X) <- c("x1", "x2")
y <- beta[1] * X[,1] + beta[2] * X[,2] + rnorm(N, 0, 10)
XY[[i]] <- data.frame(X, y=y)
mods[[i]] <- lm(y ~ x1 + x2, data=XY[[i]])
}
```
The estimated coefficients can then be extracted using `coef()` applied to each model:
```{r collin-data-beta-coefs}
coefs <- sapply(mods, coef)
colnames(coefs) <- paste0("mod", 1:3, " (rho=", rho, ")")
coefs
```
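The effect on precision can also be seen numerically (a small addition to the example): the estimated standard errors of the coefficients grow with $\rho$, even though each dataset was generated with the same error variance.

```{r collin-data-beta-se}
# Standard errors of the coefficients increase with the correlation of x1 and x2
ses <- sapply(mods, function(m) sqrt(diag(vcov(m))))
colnames(ses) <- paste0("mod", 1:3, " (rho=", rho, ")")
ses
```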
Then, I define a function to plot the data ellipse (`car::dataEllipse()`) for each data frame and the confidence ellipse
(`car::confidenceEllipse()`) for the coefficients in the corresponding fitted model. In the plots in @fig-collin-data-beta, I specify the same x, y limits for each plot
so that the relative sizes of these ellipses are comparable and variance inflation can be assessed visually.
```{r}
#| label: fig-collin-data-beta
#| code-fold: show
#| out-width: "100%"
#| fig-show: "hold"
#| fig.cap: '95% Data ellipses for x1, x2 and the corresponding 95% confidence ellipses for their coefficients in the model predicting y.
#| In the confidence ellipse plots, reference lines show the value (0,0) for the null hypothesis and "+" marks
#| the true values for the coefficients. This figure adapts an example by John Fox (2022).'
#|
do_plots <- function(XY, mod, r) {
X <- as.matrix(XY[, 1:2])
dataEllipse(X,
levels= 0.95,
col = "darkgreen",
fill = TRUE, fill.alpha = 0.05,
xlim = c(-3, 3),
ylim = c(-3, 3), asp = 1)
text(0, 3, bquote(rho == .(r)), cex = 2, pos = NULL)
confidenceEllipse(mod,
col = "red",
fill = TRUE, fill.alpha = 0.1,
xlab = expression(paste("x1 coefficient, ", beta[1])),
ylab = expression(paste("x2 coefficient, ", beta[2])),
xlim = c(-5, 10),
ylim = c(-5, 10),
asp = 1)
points(beta[1], beta[2], pch = "+", cex=2)
abline(v=0, h=0, lwd=2)
}
op <- par(mar = c(4,4,1,1)+0.1,
mfcol = c(2, 3),
cex.lab = 1.5)
for (i in seq_along(rho)) {
do_plots(XY[[i]], mods[[i]], rho[i])
}
par(op)
```
Recall (@sec-betaspace) that the confidence ellipse for $(\beta_1, \beta_2)$ is just a 90 degree rotation
(and rescaling) of the data ellipse for $(x_1, x_2)$: it is wide (more variance) in any direction where the
data ellipse is narrow.
The shadows of the confidence ellipses on the coordinate axes in @fig-collin-data-beta represent the
standard errors of the coefficients, and get larger with increasing $\rho$. This is the effect of variance
inflation, described in the following section.
## Measuring collinearity {#sec-measure-collin}
This section first describes the _variance inflation factor_ (VIF) used to measure the
effect of possible collinearity on each predictor and a collection of diagnostic measures
designed to help interpret these. Then I describe some novel graphical methods to make these
effects more readily understandable, to answer the "Where's Waldo" question posed at the outset.
### Variance inflation factors {#sec-vif}
How can we measure the effect of collinearity? The essential idea is to compare, for each predictor $x_j$, the variance
$s^2 (\widehat{b_j})$ that its coefficient would have if it were totally unrelated to the other
predictors with the actual variance it has in the given model.
For two predictors, such as those shown in @fig-collin-data-beta, the sampling variance of the coefficient $\widehat{b}_1$ can be expressed
as
$$
s^2 (\widehat{b_1}) = \frac{MSE}{(n-1) \; s^2(x_1)} \; \times \; \left[ \frac{1}{1-r^2_{12}} \right]
$$
The first term here is the variance of $b_1$ when the two predictors are uncorrelated.
The term in brackets represents the **variance inflation factor** [@Marquardt:1970], the amount by which the
variance of the coefficient is multiplied as a consequence of the correlation $r_{12}$ of
the predictors. As $r_{12} \rightarrow 1$, the variance approaches infinity.
More generally, with any number of predictors, this relation has a similar form, replacing
the simple correlation $r_{12}$ with the multiple correlation predicting $x_j$ from all others,
$$
s^2 (\widehat{b_j}) = \frac{MSE}{(n-1) \; s^2(x_j)} \; \times \; \left[ \frac{1}{1-R^2_{j | \text{others}}} \right]
$$
So, we have that the variance inflation factors are:
$$
\text{VIF}_j = \frac{1}{1-R^2_{j \,|\, \text{others}}}
$$
In practice, it is often easier to think in terms of the square root, $\sqrt{\text{VIF}_j}$, as the
multiplier of the standard errors. The denominator, $1-R^2_{j \,|\, \text{others}}$, is sometimes called the
**tolerance**, a term I don't find particularly useful, but it is just the proportion of the
variance of $x_j$ that is _not_ explainable from the others.
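To connect this formula to the simulated data above (a quick check of my own, not part of the original analysis), the VIF of $x_1$ in the $\rho = 0.8$ model can be computed directly from the auxiliary regression of $x_1$ on $x_2$; it should agree with the value reported by `car::vif()` in the next chunk.

```{r vif-by-hand}
# VIF for x1 by hand, for the rho = 0.8 dataset (XY[[2]] from above):
# regress x1 on the other predictor and invert 1 - R^2
r2 <- summary(lm(x1 ~ x2, data = XY[[2]]))$r.squared
1 / (1 - r2)
```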
For the cases shown in @fig-collin-data-beta the VIFs and their square roots are:
```{r collin-data-beta-vif}
vifs <- sapply(mods, car::vif)
colnames(vifs) <- paste("rho:", rho)
vifs
sqrt(vifs)
```
Note that when there are terms in the model with more than one degree of freedom, such as education with four levels
(and hence 3 df) or a polynomial term specified as `poly(age, 3)`, that variable (education or age)
is represented by three separate $x$s in the model matrix, and the standard VIF calculation
gives results that vary with how those terms are coded in the model.
To allow for these cases, @FoxMonette:92 define
_generalized_ VIFs (GVIFs) as the inflation in the squared area of the confidence ellipse for the coefficients
of such terms, relative to what would be obtained with uncorrelated data. Visually, this can be seen
by comparing the areas of the ellipses in the bottom row of @fig-collin-data-beta.
Because the magnitude of the GVIF increases with the number of degrees of freedom for the set of parameters,
Fox & Monette suggest $\text{GVIF}^{1/(2\,\text{df})}$, the analog of $\sqrt{\text{VIF}}$, as the measure of impact on standard
errors. This is what `car::vif()` calculates for a factor or other term with more than 1 df.
**Example**: This example uses the `cars` dataset in the `VisCollin` package,
containing various measures of size and performance on 406 automobile models from the 1970--1982 model years. Interest is focused on predicting gas mileage, `mpg`.
```{r cars}
data(cars, package = "VisCollin")
str(cars)
```
We fit a model predicting gas mileage (`mpg`) from the number of cylinders, engine displacement, horsepower, weight,
time to
accelerate from 0 -- 60 mph and model year (1970--1982). Perhaps surprisingly, only `weight` and `year` appear to
significantly predict gas mileage. What's going on here?
```{r cars-mod}
cars.mod <- lm (mpg ~ cylinder + engine + horse +
weight + accel + year,
data=cars)
Anova(cars.mod)
```
We check the variance inflation factors using `car::vif()`. We see that most predictors have very high
VIFs, indicating substantial collinearity.
```{r cars-vif}
vif(cars.mod)
sqrt(vif(cars.mod))
```
According to $\sqrt{\text{VIF}}$, the standard error of `cylinder` has been
multiplied by $\sqrt{10.63} = 3.26$ and its $t$-value is divided by this number,
compared with the case when all predictors are
uncorrelated. `engine`, `horse` and `weight` suffer a similar fate.
If we also included the factor `origin` in the model, we would get generalized VIFs (GVIFs):
```{r cars-mod2}
cars.mod2 <- lm (mpg ~ cylinder + engine + horse +
weight + accel + year + origin,
data=cars)
vif(cars.mod2)
```
::: {.callout-tip title="Connection with inverse of correlation matrix"}
In the linear regression model with standardized predictors,
the covariance matrix of the estimated intercept-excluding
parameter vector $\mathbf{b}^\star$ has the
simpler form,
$$
\mathcal{V} (\mathbf{b}^\star) = \frac{\sigma^2}{n-1} \mathbf{R}^{-1}_{X} \; .
$$
where
$\mathbf{R}_{X}$ is the correlation matrix among the predictors.
It
can then be seen that the VIF$_j$ are just the diagonal entries of
$\mathbf{R}^{-1}_{X}$.
More generally, the matrix $\mathbf{R}^{-1}_{X} = (r^{ij})$, when standardized to a correlation matrix
as $-r^{ij} / \sqrt{r^{ii} \; r^{jj}}$, gives the matrix of all partial correlations,
$r_{ij \,|\, \text{others}}$.
:::
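As a quick numerical check of this connection (my sketch, reusing `cars.mod` from above), the VIFs for the cars model can be recovered from the diagonal of the inverse of the correlation matrix of its predictors.

```{r vif-inverse-R}
# VIFs as the diagonal entries of the inverse predictor correlation matrix
X.cars <- model.matrix(cars.mod)[, -1]   # predictors used in cars.mod, minus the intercept
diag(solve(cor(X.cars)))                 # should match vif(cars.mod) above
```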
### Collinearity diagnostics {#sec-colldiag}
OK, we now know that large VIF$_j$ indicate predictor coefficients whose estimation
is degraded due to large $R^2_{j \,|\, \text{others}}$.
But for this to be useful, we need to determine:
* how many dimensions in the space of the predictors are associated with nearly collinear relations?
* which predictors are most strongly implicated in each of these?
Answers to these questions are provided using measures developed by Belsley and colleagues
[@Belsley-etal:80; @Belsley:91a].
These measures are based on the eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_p$
of the correlation matrix $\mathbf{R}_{X}$ of the predictors (preferably centered and scaled, and not including the constant term
for the intercept), and the corresponding eigenvectors in the columns of $\mathbf{V}_{p \times p}$, given by
the eigen decomposition
$$
\mathbf{R}_{X} = \mathbf{V} \mathbf{\Lambda} \mathbf{V}^\mathsf{T}
$$
By elementary matrix algebra, the eigen decomposition of $\mathbf{R}_{X}^{-1}$ is then
$$
\mathbf{R}_{X}^{-1} = \mathbf{V} \mathbf{\Lambda}^{-1} \mathbf{V}^\mathsf{T} \; ,
$$ {#eq-rxinv-eigen}
so $\mathbf{R}_{X}$ and $\mathbf{R}_{X}^{-1}$ have the same eigenvectors, and the eigenvalues
of $\mathbf{R}_{X}^{-1}$ are just $\lambda_i^{-1}$.
Using @eq-rxinv-eigen, the variance inflation factors may be expressed as
$$
\text{VIF}_j = \sum_{k=1}^p \frac{V^2_{jk}}{\lambda_k} \; ,
$$
which shows that it is the _small_ eigenvalues that contribute to variance inflation,
but only for those predictors that have large eigenvector coefficients on those small components.
These facts lead to the following diagnostic statistics for collinearity:
* **Condition indices**:
The smallest of the eigenvalues, those for which $\lambda_j \approx 0$,
indicate collinearity and the number of small values indicates the number of near collinear relations.
Because the sum of the eigenvalues, $\Sigma \lambda_i = p$ increases with the number
of predictors $p$, it is useful to scale them
all in relation to the largest. This leads to _condition indices_, defined as
$\kappa_j = \sqrt{ \lambda_1 / \lambda_j}$. These have the property that the resulting numbers
have common interpretations regardless of the number of predictors.
+ For completely uncorrelated predictors, all $\kappa_j = 1$.
+ $\kappa_j \rightarrow \infty$ as any $\lambda_k \rightarrow 0$.
* **Variance decomposition proportions**:
Large VIFs indicate variables that are involved in _some_ nearly collinear
relations, but they don't indicate _which_ other variable(s) each is involved with.
For this purpose, @Belsley-etal:80 and @Belsley:91a proposed calculation of
the proportions of variance of each variable associated with each principal component
as a decomposition of the coefficient variance for each dimension.
These measures can be calculated using `VisCollin::colldiag()`.
For the current model, the usual display contains both the condition indices and
variance proportions. However, even for a small example, it is often difficult to know
what numbers to pay attention to.
```{r colldiag1}
(cd <- colldiag(cars.mod, center=TRUE))
```
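To see where the condition indices in the first column come from (a small check of my own), they are the square roots of the ratios of the largest eigenvalue of the predictor correlation matrix to each of the others.

```{r condition-indices-by-hand}
# Condition indices computed directly from the eigenvalues of R_X
lambda <- eigen(cor(model.matrix(cars.mod)[, -1]))$values
sqrt(lambda[1] / lambda)   # compare with the condition indices shown above
```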
@Belsley:91a recommends that the sources of collinearity be diagnosed
(a) only for those components with large $\kappa_j$, and
(b) for those components for which the variance proportion is large (say, $\ge 0.5$) on _two_ or more predictors.
The print method for `"colldiag"` objects has a `fuzz` argument controlling this.
```{r colldiag2}
print(cd, fuzz = 0.5)
```
The mystery is solved, if you can read that table with these recommendations in mind. There are two nearly collinear relations among the predictors, corresponding to the two
smallest dimensions.
* Dimension 5 reflects the high correlation between horsepower and weight,
* Dimension 6 reflects the high correlation between number of cylinders and engine displacement.
Note that the high variance proportion for `year` (0.787) on the second component creates no problem and
should be ignored because (a) the condition index is low and (b) it shares nothing with other predictors.
### Tableplots
The default tabular display of condition indices and variance proportions from `colldiag()` is what triggered
the comparison to "Where's Waldo". It suffers from the fact that the important information ---
(a) how many Waldos? (b) where are they hiding? --- is disguised by being embedded in a sea of mostly irrelevant numbers. The simple option of using a principled `fuzz` factor helps considerably, but not entirely.
The simplified tabular display above can be improved to make the patterns of collinearity more
visually apparent and to signify warnings directly to the eyes.
A **tableplot** [@Kwan-etal:2009] is a semi-graphic display that presents numerical information in a table
using shapes proportional to the value in a cell and other visual attributes (shape type, color fill, and so forth)
to encode other information.
For collinearity diagnostics, these show:
* the condition indices,
using _squares_ whose background color is `r colorize("red")` for condition indices > 10,
`r colorize("brown")` for values > 5 and `r colorize("green")` otherwise, reflecting danger, warning and OK respectively.
The value of the condition index is encoded within this using a white square whose side is proportional to the value
(up to some maximum value, `cond.max` that fills the cell).
* Variance decomposition proportions are shown by filled _circles_ whose radius is proportional to those values
and are filled (by default) with shades ranging from white through pink to red. Rounded values of those diagnostics
are printed in the cells.
The tableplot below (@fig-cars-tableplot) encodes all the information from the values of `colldiag()` printed above. To aid perception, it uses `prop.col` color breaks such that variance proportions < 0.3 are shaded white.
The visual message is that one should attend to collinearities with large condition indices **and**
large variance proportions implicating two or more predictors.
::: {.column-margin}
R file: `R/cars-colldiag.R`
:::
<!-- fig.code: R/cars-colldiag.R -->
```{r}
#| label: fig-cars-tableplot
#| fig-keep: "last"
#| out-width: 100%
#| fig.cap: "Tableplot of condition indices and variance proportions for the `cars` data. In column 1, the square symbols are scaled relative to a maximum condition index of 30.
#| In the remaining columns, variance proportions (times 100) are shown as circles
#| scaled relative to a maximum of 100."
tableplot(cd, title = "Tableplot of cars data",
cond.max = 30 )
```
### Collinearity biplots
As we have seen, the collinearity diagnostics are all functions of the eigenvalues
and eigenvectors of the correlation matrix of the predictors in the regression model,
or alternatively, the SVD of the $\mathbf{X}$ matrix in the linear model (excluding the
constant).
The standard biplot [@Gabriel:71; @GowerHand:96] (see: @sec-biplot)
can be regarded as
a multivariate analog of a scatterplot,
obtained by projecting a
multivariate sample into a low-dimensional space
(typically of 2 or 3 dimensions)
accounting for the greatest variance in the data.
However the standard biplot is less useful for visualizing the relations among the
predictors that lead to nearly collinear relations. Instead, biplots of
the **smallest dimensions** show these relations directly, and can show other
features of the data as well, such as outliers and leverage points.
We use `prcomp(X, scale.=TRUE)` to obtain the PCA of the correlation matrix
of the predictors:
```{r cars-pca}
cars.X <- cars |>
select(where(is.numeric)) |>
select(-mpg) |>
tidyr::drop_na()
cars.pca <- prcomp(cars.X, scale. = TRUE)
cars.pca
```
The standard deviations above are the square roots $\sqrt{\lambda_j}$ of the eigenvalues of the
correlation matrix, and are returned in the `sdev` component of the `"prcomp"` object.
The eigenvectors are returned in the `rotation` component, whose directions are arbitrary.
Because we are interested in seeing the relative magnitude of variable vectors,
we are free to multiply them by any constant to make them more visible in relation to the
scores for the cars.
```{r}
#| label: fig-cars-collin-biplot
#| out-width: 100%
#| fig.cap: "Collinearity biplot of the Cars data, showing the last two dimensions.
#| The projections of the variable vectors on the coordinate axes are proportional to
#| their variance proportions. To reduce graphic clutter, only the most outlying observations in predictor
#| space are identified by case labels.
#| An extreme outlier (case 20) appears in the lower right corner."
cars.pca$rotation <- -2.5 * cars.pca$rotation # reflect & scale the variable vectors
ggp <- fviz_pca_biplot(
cars.pca,
axes = 6:5,
geom = "point",
col.var = "blue",
labelsize = 5,
pointsize = 1.5,
arrowsize = 1.5,
addEllipses = TRUE,
ggtheme = ggplot2::theme_bw(base_size = 14),
title = "Collinearity biplot for cars data")
# add point labels for outlying points
dsq <- heplots::Mahalanobis(cars.pca$x[, 6:5])
scores <- as.data.frame(cars.pca$x[, 6:5])
scores$name <- rownames(scores)
ggp + geom_text_repel(data = scores[dsq > qchisq(0.95, df = 6),],
aes(x = PC6,
y = PC5,
label = name),
vjust = -0.5,
size = 5)
```
As with the tabular display of variance proportions, Waldo is hiding
in the dimensions associated with the smallest eigenvalues
(largest condition indices).
As well, it turns out that outliers in the predictor space (also high
leverage observations) can often be seen as observations far
from the centroid in the space of the smallest principal components.
The projections of the variable vectors in @fig-cars-collin-biplot
on the Dimension 5 and Dimension 6 axes are proportional to
their variance proportions shown above.
The relative lengths of these variable vectors can be considered
to indicate the extent to which each variable contributes to collinearity
for these two near-singular dimensions.
Thus, we see again that Dimension 6
is largely determined by `engine` size, with a substantial (negative) relation
to `cylinder`. Dimension 5 has its strongest relations to `weight`
and `horse`.
Moreover, there is one observation, #20, that stands out as
an outlier in predictor space, far from the centroid.
It turns out that this vehicle, a Buick Estate wagon, is an early-year (1970) American behemoth,
with an 8-cylinder, 455 cu. in., 225 horsepower engine, able to go from 0 to 60 mph
in 10 sec.
(Its MPG is only slightly under-predicted from the regression model, however.)
With PCA and the biplot, we are used to looking at the dimensions that account for
the most variation, but
the answer to _Where's Waldo?_ is that he is hiding in the smallest data dimensions,
just as he does in @fig-wheres-waldo, where the weak signals of his striped shirt,
hat and glasses are embedded in a visual field of noise. As we just saw, outliers
hide there also, hoping to escape detection. These small dimensions are also
implicated in ridge regression as we will see shortly (@sec-ridge).
## Remedies for collinearity: What can I do? {#sec-remedies}
Collinearity is often a **data** problem, for which there is no magic cure. Nevertheless there are some
general guidelines and useful techniques to address this problem.
* **Pure prediction**: If we are interested only in predicting / explaining an outcome,
and not in the model coefficients or which predictors are "significant", collinearity can be largely ignored.
The fitted values are unaffected by collinearity, even in the case of perfect collinearity
as shown in @fig-collin-demo (b).
* **Structural collinearity**: Sometimes collinearity results from structural relations among the variables that relate to how they have been defined.
+ For example, polynomial terms, like $x, x^2, x^3$ or interaction terms like $x_1, x_2, x_1 * x_2$
are necessarily correlated. A simple cure is to _center_ the predictors at their means, using
$x - \bar{x}, (x - \bar{x})^2, (x - \bar{x})^3$ or
$(x_1 - \bar{x}_1), (x_2 - \bar{x}_2), (x_1 - \bar{x}_1) * (x_2 - \bar{x}_2)$. Centering removes the spurious ill-conditioning, thus reducing the VIFs.
Note that in polynomial models,
using `y ~ poly(x, 3)` to specify a cubic model generates _orthogonal_ (uncorrelated) regressors, whereas in
`y ~ x + I(x^2) + I(x^3)` the terms have built-in correlations.
+ When some predictors share a common cause, as in GNP or population in time-series or cross-national data,
you can reduce collinearity by re-defining predictors to reflect _per capita measures_. In a related example with
sports data, when you have cumulative totals (e.g., runs, hits, homeruns in baseball) for players over years,
expressing these measures as _per year_ will reduce the common effect of longevity on these measures.
* **Model re-specification**:
+ Drop one or more regressors that have a high VIF, if they are not deemed to be essential to understanding the model. Care must be taken here to not omit variables which should be controlled or accounted for in interpretation.
  + Replace highly correlated regressors with less correlated linear combination(s) of them. For example, two related variables,
$x_1$ and $x_2$, can be replaced without any loss of information by their sum and
difference, $z_1 = x_1 + x_2$ and $z_2 = x_1 - x_2$. For instance, in a dataset on fitness, we may have
correlated predictors of resting pulse rate and pulse rate while running. Transforming these to the
average pulse rate and their difference gives new variables which are interpretable and less correlated.
* **Statistical remedies**:
  + Transform the predictors $\mathbf{X}$ to uncorrelated principal component scores
$\mathbf{Z} = \mathbf{X} \mathbf{V}$,
and regress $\mathbf{y}$ on $\mathbf{Z}$. These give the identical overall model
fit, without loss of information (a brief sketch follows this list). A related technique is _incomplete_ principal components regression, where
some of the smallest dimensions (those causing collinearity) are omitted from the model.
The trade-off is that it may be more difficult to interpret what the model means, but this can be countered
with a biplot, showing the projections of the original variables into the reduced space of the principal
components.
  + Use **regularization methods** such as ridge regression and the lasso, which counter collinearity by
shrinking the coefficients towards 0, at the cost of introducing a small amount of bias. See the [genridge](https://CRAN.R-project.org/package=genridge) package and its [`pkgdown` documentation](https://friendly.github.io/genridge/) for visualization methods.
  + Use Bayesian regression; if multicollinearity prevents a regression coefficient from being estimated precisely, then a prior on that coefficient will help to reduce its posterior variance.
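As a sketch of the principal components idea mentioned in the list above (my illustration, reusing the `cars` data and the variables from the model fit earlier), regressing `mpg` on all of the principal component scores reproduces the fit of the original model exactly, while dropping the smallest components gives incomplete principal components regression.

```{r pcr-sketch}
# Principal components regression: same overall fit as the original predictors
cars.sub <- na.omit(cars[, c("mpg", "cylinder", "engine", "horse",
                             "weight", "accel", "year")])
Z <- prcomp(cars.sub[, -1], scale. = TRUE)$x             # uncorrelated PC scores
summary(lm(mpg ~ Z, data = cars.sub))$r.squared          # identical R^2 to mpg ~ all predictors
summary(lm(mpg ~ Z[, 1:4], data = cars.sub))$r.squared   # incomplete PCR: drop the 2 smallest
```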
**Example**: Centering
To illustrate the effect of centering a predictor in a polynomial model, we generate a perfect
quadratic relationship, $y = x^2$ and consider the correlations of $y$ with $x$ and with
$(x - \bar{x})^2$. The correlation of $y$ with $x$ is 0.97, while the correlation of
$y$ with $(x - \bar{x})^2$ is zero.
```{r centering}
x <- 1:20
y1 <- x^2
y2 <- (x - mean(x))^2
XY <- data.frame(x, y1, y2)
(R <- cor(XY))
```
The effect of centering here is to remove the linear association in what is a purely quadratic relationship,
as can be seen by plotting `y1` and `y2` against `x`.
```{r}
#| label: fig-collin-centering
#| out-width: "100%"
#| fig.cap: "Centering a predictor removes the nessessary correlation in a quadratic regression"
r1 <- R[1, 2]
r2 <- R[1, 3]
gg1 <-
ggplot(XY, aes(x = x, y = y1)) +
geom_point(size = 3) +
geom_smooth(method = "lm", formula = y~x, linewidth = 2, se = FALSE) +
labs(x = "X", y = "Y") +
theme_bw(base_size = 16) +
annotate("text", x = 5, y = 350, size = 6,
label = paste("X Uncentered\nr =", round(r1, 3)))
gg2 <-
ggplot(XY, aes(x = x, y = y2)) +
geom_point(size = 3) +
geom_smooth(method = "lm", formula = y~x, linewidth = 2, se = FALSE) +
labs(x = "X", y = "Y") +
theme_bw(base_size = 16) +
annotate("text", x = 5, y = 80, size = 6,
label = paste("X Centered\nr =", round(r2, 3)))
gg1 + gg2 # show plots side-by-side
```
**Example**: Interactions
The dataset `genridge::Acetylene` gives data from @MarquardtSnee1975 on the `yield` of a chemical
manufacturing process to produce acetylene in relation to reactor temperature (`temp`), the
`ratio` of two components and the contact `time` in the reactor. A naive response surface model
might suggest that yield is quadratic in time and there are potential interactions among all pairs of predictors.
<!-- fig.code: R/acetlyne-colldiag.R -->
```{r acetyl-mod0}
data(Acetylene, package = "genridge")
acetyl.mod0 <- lm(
yield ~ temp + ratio + time + I(time^2) +
temp:time + temp:ratio + time:ratio,
data=Acetylene)
(acetyl.vif0 <- vif(acetyl.mod0))
```
These results are horrible! How much does centering help? I first center all three predictors and then
use `update()` to re-fit the same model using the centered data.
```{r acetyl-mod1}
Acetylene.centered <-
Acetylene |>
mutate(temp = temp - mean(temp),
time = time - mean(time),
ratio = ratio - mean(ratio))
acetyl.mod1 <- update(acetyl.mod0, data=Acetylene.centered)
(acetyl.vif1 <- vif(acetyl.mod1))
```
This is far better, although
still not great in terms of VIF. But, how much have we improved the situation by the simple
act of centering the predictors? The square roots of the ratios of VIFs tell us the impact
of centering on the standard errors.
```{r acetyl-ratio}
sqrt(acetyl.vif0 / acetyl.vif1)
```
Finally, we use `poly(time, 2)` in the model for the centered data. Because the model now contains
terms with more than one degree of freedom, `car::vif()` calculates GVIFs here.
The final column gives $\text{GVIF}^{1/(2\,\text{df})}$, the remaining effect of
collinearity on the standard errors of terms in this model.
```{r acetyl-mod2}
acetyl.mod2 <- lm(yield ~ temp + ratio + poly(time, 2) +
temp:time + temp:ratio + time:ratio,
data=Acetylene.centered)
vif(acetyl.mod2, type = "term")
```
## Ridge regression {#sec-ridge}
Ridge regression is an instance of a class of techniques designed to obtain more favorable predictions at the expense of some increase in bias, compared to ordinary least squares (OLS) estimation.
These methods began as a way of solving collinearity problems in OLS regression with highly correlated predictors
[@HoerlKennard:1970a].
More recently, ridge regression has developed into a larger class of model selection and shrinkage methods, of which the
LASSO method of @Tibshriani:regr:1996 and the LAR method of @Efron-etal:leas:2004
are well-known instances. See, for example, the reviews in
@Vinod:1978 and @McDonald:2009 for details and context omitted here.
Ridge regression has also been extended to the case of
two or more response variables [@Brown-Zidek-1980; @Haitovsky1987].
An essential idea behind these methods is that the OLS estimates are constrained in some way, shrinking them, on average, toward zero, to achieve increased predictive accuracy
at the expense of some increase in bias.
Another common characteristic is that they involve some tuning parameter ($k$)
or criterion to quantify the tradeoff between bias and variance. In many cases,
analytical or computationally intensive methods have been developed to choose an optimal value
of the tuning parameter, for example using generalized cross-validation or bootstrap methods.
A common means to visualize the effects of shrinkage in these problems is to make
what are called _univariate ridge trace plots_ (@sec-ridge-univar), showing how the estimated coefficients
$\widehat{\boldsymbol{\beta}}_k$ change as the shrinkage criterion $k$ increases
(an example appears later in the chapter).
But this only provides a view of bias. It is the wrong graphic form for a multivariate
problem where we want to visualize bias in the coefficients $\widehat{\boldsymbol{\beta}}_k$
vs. their precision, as reflected in their estimated variances,
$\widehat{\textsf{Var}} (\widehat{\boldsymbol{\beta}}_k)$.
A more useful graphic plots the confidence ellipses for the coefficients,
showing both bias and precision (@sec-ridge-bivar).
Some of the material below borrows from @Friendly-2011-gentalk and @Friendly:genridge:2013.
### Properties of ridge regression
To provide some context, I summarize the properties of ridge regression below,
comparing the OLS estimates with their ridge counterparts.
To avoid unnecessary details related to the intercept, assume
the predictors have been centered at their means and the unit vector is
omitted from $\mathbf{X}$.
Further, to avoid scaling issues, we
standardize the columns of $\mathbf{X}$ to unit length, so that $\mathbf{X}\trans \mathbf{X}$ is a also correlation matrix.
The ordinary least squares estimates of the coefficients and their estimated
variance-covariance matrix take the (hopefully now) familiar form
\begin{align*}
\widehat{\boldsymbol{\beta}}^{\mathrm{OLS}} = &
(\mathbf{X}\trans \mathbf{X})^{-1} \mathbf{X}\trans \mathbf{y} \comma \\
\widehat{\Var} (\widehat{\boldsymbol{\beta}}^{\mathrm{OLS}}) = &
\widehat{\sigma}_{\epsilon}^2 (\mathbf{X}\trans \mathbf{X})^{-1}.
\end{align*} {#eq-OLS-beta-var}
As we saw earlier, one signal of the problem of collinearity is that the determinant
$\det{\mathbf{X}\trans \mathbf{X}}$ approaches zero as the predictors become more
collinear. The inverse $(\mathbf{X}\trans \mathbf{X})^{-1}$ becomes numerically unstable,
or does not exist if the determinant becomes zero in the case of exact dependency
of one variable on the others.
Ridge regression uses a trick to avoid this. It adds a constant, $k$, to the diagonal elements,
replacing $\mathbf{X}\trans \mathbf{X}$ with $\mathbf{X}\trans \mathbf{X} + k \mathbf{I}$
in @eq-OLS-beta-var. This drives the determinant away from zero as $k$ increases.
The ridge regression estimates then become,
\begin{align*}
\widehat{\boldsymbol{\beta}}^{\mathrm{RR}}_k = &
(\mathbf{X}\trans \mathbf{X} + k \mathbf{I})^{-1} \mathbf{X}\trans \mathbf{y} \\
= & \mathbf{G}_k \, \widehat{\boldsymbol{\beta}}^{\mathrm{OLS}} \comma \\
\widehat{\Var} (\widehat{\boldsymbol{\beta}}^{\mathrm{RR}}_k) = &
\widehat{\sigma}^2 \mathbf{G}_k (\mathbf{X}\trans \mathbf{X})^{-1} \mathbf{G}_k\trans \comma
\end{align*} {#eq-ridge-beta-var}
where $\mathbf{G}_k = \left[\mathbf{I} + k (\mathbf{X}\trans \mathbf{X})^{-1} \right] ^{-1}$ is the $(p \times p)$ _shrinkage_ matrix.
Thus, as $k$ increases, $\mathbf{G}_k$ decreases, driving $\widehat{\boldsymbol{\beta}}^{\mathrm{RR}}_k$ toward $\mathbf{0}$
[@HoerlKennard:1970a].
Another insight, from the shrinkage literature, is that ridge regression can be formulated as a penalized least squares problem, minimizing a residual sum of squares, $\text{RSS}(k)$, which adds a penalty for large coefficients,
$$
\text{RSS}(k) = (\mathbf{y}-\mathbf{X} \boldsymbol{\beta})\trans (\mathbf{y}-\mathbf{X} \boldsymbol{\beta}) + k \boldsymbol{\beta}\trans\boldsymbol{\beta} \quad\quad (k \ge 0)
\comma
$$ {#eq-ridgeRSS}
where the penalty restricts the coefficients to some squared length, $\boldsymbol{\beta}\trans \boldsymbol{\beta} = \Sigma \beta_i^2 \le t(k)$.
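As a small numerical check (my sketch, with simulated data and an arbitrary value of $k$), minimizing $\text{RSS}(k)$ directly reproduces the closed-form ridge solution in @eq-ridge-beta-var.

```{r ridge-rss-check}
# Minimizing the penalized RSS numerically vs. the closed-form ridge estimate
set.seed(42)
Xs <- scale(matrix(rnorm(60 * 2), 60, 2))    # two centered, scaled predictors
ys <- Xs %*% c(2, -1) + rnorm(60)
k1 <- 1
rss_k <- function(b) sum((ys - Xs %*% b)^2) + k1 * sum(b^2)
optim(c(0, 0), rss_k)$par                               # numerical minimizer
drop(solve(t(Xs) %*% Xs + k1 * diag(2), t(Xs) %*% ys))  # (X'X + kI)^{-1} X'y
```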
The geometry of ridge regression is illustrated in @fig-ridge-demo for two coefficients
$\boldsymbol{\beta} = (\beta_1, \beta_2)$. The `r colorize("blue circles", "blue")`
at the origin, having radii $\sqrt{t_k}$, show the constraint
that the squared length of the coefficient vector,
$\boldsymbol{\beta}\trans \boldsymbol{\beta} = \beta_1^2 + \beta_2^2$, be less than $t(k)$.
The `r colorize("red ellipses", "red")` show contours of the covariance ellipse of
$\widehat{\boldsymbol{\beta}}^{\mathrm{OLS}}$.
As the shrinkage constant $k$ increases, the ridge estimate $\widehat{\boldsymbol{\beta}}^{\mathrm{RR}}_k$ travels along
the path illustrated toward $\boldsymbol{\beta} = \mathbf{0}$.
This path is called the _locus of osculation_, the path along which
circles or ellipses first kiss as they expand, like the pattern of ripples from
rocks dropped into a pond [@Friendly-etal:ellipses:2013].
```{r}
#| label: fig-ridge-demo
#| out-width: "80%"
#| echo: false
#| fig-cap: "Geometric interpretation of ridge regression, using elliptical contours of the $\\text{RSS}(k)$ function. The blue circles at the origin show the constraint that the sum of squares of coefficients, $\\boldsymbol{\\beta}\\trans \\boldsymbol{\\beta}$ be less than $k$. The red ellipses show the covariance ellipse of two coefficients $\\boldsymbol{\\beta}$. Ridge regression finds the point $\\widehat{\\boldsymbol{\\beta}}^{\\mathrm{RR}}_k$ where the OLS contours just kiss the constraint region. _Source: @Friendly-etal:ellipses:2013."
knitr::include_graphics("images/ridge-demo.png")
```
<!-- The blue circles at the origin show the constraint that the sum of squares of coefficients, $\\boldsymbol{\\beta}\\trans \\boldsymbol{\\beta}$ be less than $k$. The red ellipses show the covariance ellipse of two coefficients $\\boldsymbol{\\beta}$
-->
@eq-ridge-beta-var is computationally expensive, potentially numerically unstable for small $k$, and it is conceptually opaque,
in that it sheds little light on the underlying geometry of the data in the column space of $\mathbf{X}$. An alternative
formulation can be given in terms of the singular value decomposition (SVD) of $\mathbf{X}$,
$$
\mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}\trans
$$
where $\mathbf{U}$ and $\mathbf{V}$ are respectively $n\times p$ and $p\times p$ orthonormal matrices, so that
$\mathbf{U}\trans \mathbf{U} = \mathbf{V}\trans \mathbf{V} = \mathbf{I}$,
and
$\mathbf{D} = \diag{(d_1, d_2, \dots d_p)}$
is the diagonal matrix of ordered singular values, with entries $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$.
Because $\mathbf{X}\trans \mathbf{X} = \mathbf{V} \mathbf{D}^2 \mathbf{V}\trans$, the eigenvalues of $\mathbf{X}\trans \mathbf{X}$
are given by $\mathbf{D}^2$ and therefore the eigenvalues of $\mathbf{G}_k$ can be shown [@HoerlKennard:1970a]
to be the diagonal elements of
$$
\mathbf{D}(\mathbf{D}^2 + k \mathbf{I} )^{-1} \mathbf{D} = \diag{ \left(\frac{d_i^2}{d_i^2 + k}\right) }\period
$$
Noting that the eigenvectors $\mathbf{V}$ are the principal component vectors, and that $\mathbf{X} \mathbf{V} = \mathbf{U} \mathbf{D}$,
the ridge estimates can be calculated more simply in terms of the SVD components as
$$
\widehat{\boldsymbol{\beta}}^{\mathrm{RR}}_k = \mathbf{V}\, (\mathbf{D}^2 + k \mathbf{I})^{-1} \mathbf{D}\, \mathbf{U}\trans \mathbf{y} \comma
$$
whose coordinates with respect to the principal component basis are $\left( \frac{d_i}{d_i^2 + k}\right) \mathbf{u}_i\trans \mathbf{y}, \; i=1, \dots, p \period$
The terms $d^2_i / (d_i^2 + k) \le 1$ are thus the factors by which the coordinates of $\mathbf{u}_i\trans \mathbf{y}$
are shrunk with respect to the orthonormal basis for the column space of $\mathbf{X}$. The small singular values
$d_i$ correspond to the directions which ridge regression shrinks the most. These are the directions which
contribute most to collinearity, discussed earlier.
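A quick verification of the SVD form (my sketch, continuing the simulated `Xs`, `ys` and `k1` from the earlier check): it gives the same coefficients as the direct solution.

```{r ridge-svd-check}
# SVD form of the ridge estimates, compared with the direct solution
sv <- svd(Xs)
with(sv, v %*% diag(d / (d^2 + k1)) %*% t(u) %*% ys)    # V (D^2 + kI)^{-1} D U'y
drop(solve(t(Xs) %*% Xs + k1 * diag(2), t(Xs) %*% ys))  # same values as before
```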
This analysis also provides an alternative and more intuitive characterization of the ridge tuning constant.
By analogy with OLS, where the hat matrix, $\mathbf{H} = \mathbf{X} (\mathbf{X}\trans \mathbf{X})^{-1} \mathbf{X}\trans$
reflects degrees of freedom $\text{df} = \trace{H} = p$
corresponding to the $p$ parameters, the effective degrees of freedom for ridge regression [@Hastie-etal-2009] is
\begin{align*}
\text{df}_k
= & \text{tr}[\mathbf{X} (\mathbf{X}\trans \mathbf{X} + k \mathbf{I})^{-1} \mathbf{X}\trans] \\
= & \sum_i^p \text{df}_k(i) = \sum_i^p \left( \frac{d_i^2}{d_i^2 + k} \right) \period
\end{align*} {#eq-dfk}
$\text{df}_k$ is a monotone decreasing function of $k$, and hence any set of ridge constants
can be specified in terms of equivalent values of $\text{df}_k$. Greater shrinkage corresponds
to a smaller effective number of estimated parameters.
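As a small illustration (my sketch, using the predictors of the classic `longley` data that ships with R; the scaling here is to unit variance rather than unit length, so the numbers are illustrative only), $\text{df}_k$ can be computed directly from the singular values and decreases from $p$ toward 0 as $k$ grows.

```{r ridge-dfk-sketch}
# Effective degrees of freedom df_k for a few values of k
d.sv <- svd(scale(data.matrix(longley[, -7])))$d   # singular values of the standardized predictors
df_k <- function(k) sum(d.sv^2 / (d.sv^2 + k))
sapply(c(0, 1, 10, 100, 1000), df_k)
```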
There is a close connection with principal components regression mentioned in @sec-remedies.
Ridge regression shrinks _all_ dimensions
in proportion to $\text{df}_k(i)$, so the low variance dimensions are shrunk more.
Principal components regression discards the low variance dimensions and leaves the high variance dimensions unchanged.
### The `genridge` package
Ridge regression and other shrinkage methods are available in several packages
including
`r pkg("MASS")` (the `lm.ridge()` function),
`r pkg("glmnet", cite=TRUE)`, and
`r pkg("penalized", cite=TRUE)`, but none of these provides insightful graphical displays.
`glmnet::glmnet()` also implements a method for multivariate responses with
`family = "mgaussian"`.
Here, I focus on the `r pkg("genridge", cite=TRUE)` package, where the `ridge()` function
is the workhorse and `pca.ridge()` transforms these results to PCA/SVD space.
`vif.ridge()` calculates VIFs for class `"ridge"` objects and `precision()` calculates
precision and shrinkage measures.
A variety of plotting functions is available for univariate, bivariate and 3D plots:
* `traceplot()` Traditional univariate ridge trace plots
* `plot.ridge()` Bivariate 2D ridge trace plots, showing the covariance ellipse of the estimated coefficients
* `pairs.ridge()` All pairwise bivariate ridge trace plots
* `plot3d.ridge()` 3D ridge trace plots with ellipsoids
* `biplot.ridge()` ridge trace plots in PCA/SVD space
In addition, the `pca()` method for `"ridge"` objects transforms the coefficients and covariance matrices of a ridge object from predictor space to the equivalent, but more interesting, space of the PCA of $\mathbf{X}\trans \mathbf{X}$ or the SVD of $\mathbf{X}$. `biplot.pcaridge()` adds variable vectors to the bivariate plots of coefficients in PCA space.
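As a minimal sketch of the basic workflow (my illustration, using the classic `longley` data that ships with R; the argument conventions follow the `genridge` documentation), we fit ridge regressions over a grid of shrinkage constants and look at the traditional univariate trace plot.

```{r genridge-sketch}
# Fit ridge regressions over a grid of shrinkage constants and plot the ridge trace
longley.y <- longley[, "Employed"]
longley.X <- data.matrix(longley[, c(2:6, 1)])
lambda <- c(0, 0.005, 0.01, 0.02, 0.04, 0.08)
lridge <- ridge(longley.y, longley.X, lambda = lambda)
traceplot(lridge)   # traditional univariate ridge trace plot
```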