-
Notifications
You must be signed in to change notification settings - Fork 9
/
Copy pathordination101.Rmd
2561 lines (1942 loc) · 76.1 KB
/
ordination101.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Basic Ordination Methods"
subtitle: "Ordination for Ecological Methods"
author: "Jari Oksanen"
date: "Nov 12 to 21, 2018"
output:
xaringan::moon_reader:
css: ['default', './resources/gavins.css']
lib_dir: libs
nature:
highlightStyle: github
highlightLines: true
countIncrementalSlides: false
ratio: '16:9'
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(comment=NA,
echo = FALSE, message = FALSE, warning = FALSE, cache = TRUE,
fig.align = 'center', fig.height = 5, fig.width = 5.5, dev = 'svg')
knitr::knit_hooks$set(crop.plot = knitr::hook_pdfcrop)
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(rgl.newwindow = TRUE)
```
```{r packages, include = FALSE, cache = FALSE}
require('vegan')
if (packageVersion('vegan') < '2.5.0') # in github
stop("needs vegan 2.5-0")
data('varespec', 'varechem', 'mite', 'mite.env', 'dune', 'dune.env', 'BCI', 'BCI.env', 'pyrifos')
data('dune.phylodis','dune.taxon')
require('natto') # in github, natto::raodist()
require('vegan3d')
if (packageVersion('vegan3d') < '1.1.1') # in github
stop("needs vegan3d 1.1-1")
# require('labdsv')
data('bryceveg','brycesite', package = "labdsv")
require(analogue)
data(abernethy)
library('mgcv')
require('knitr')
require('viridis')
```
class: bottom
Source files of these slides are [in github](https://github.com/jarioksa/ordination101).
<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.
© Jari Oksanen 2017—2018
This version was built on `r date()`
<!-- ********** INTRO ************ -->
---
# This Is Ordination
.center[
```{r whatisthis}
par(mar=c(4,4,0,0)+ .1)
palette(viridis(8))
foo <- cca(varespec ~ Al + P + K, varechem)
plot(foo, type="n")
points(foo, dis="si", pch=21, col=2, bg=6)
ordilabel(foo, dis="sp", col=4, cex=.8, priority=colSums(varespec))
text(foo, dis="cn", col=2, cex=1.6, lwd=2)
```
]
---
class: inverse center middle
# Ordination: Why and What?
---
# Multivariate Data
- Ordination methods are multivariate graphical tools to display
multivariate data
- *Multivariate data* have several variables and several sampling
units (SU), and such data are best analysed with
*multivariate methods*
- We do not have one obvious dependent variable in multivariate data,
but we would still want to understand the major features of the data
- Species abundances in sampling sites, OTUs of DNA sequence
analysis in sampling units, measurements of chemical and physical
conditions in sampling stations *etc*
- If we have a clear hypothesis with one dependent variable, it is
better to use univariate methods
- Sometimes we can summarize multiple variables into one indicator
(*e.g*, diversity), and then we can use univariate methods
- If all else fails, we *must* use multivariate methods
---
# Multivariate Data and Display
.pull-left[
.small[
```{r echo=FALSE}
varespec
```
]
]
.pull-right[
```{r mvexample, results=FALSE}
par(mar=c(4,4,0,1)+.1)
palette(viridis(8))
m <- metaMDS(varespec)
m <- MDSrotate(m, varechem$Humdepth)
ef <- envfit(m, varechem)
plot(m, type="n")
text(m, dis="si", col=1, cex=0.8)
text(m, dis="sp", col=5, cex=0.8)
plot(ef, col=3, p.=0.05)
```
]
---
# Reading Ordination Plots
.pull-left[
- The ordination graphs shows the main features of similarities and differences in the data
- If two SUs are close to each other, they have similar communities
- If two SUs are far away from each other, the communities differ
- If two species are close to each other, they have similar occurrence
patterns, and occur most abundantly in SUs that are close to them
- It can be assumed that environmental conditions are behind these
differences and we can identify the external variables that explain
the ordination structure
]
.pull-right[
```{r mvexample2}
par(mar=c(4,4,0,1)+.1)
palette(viridis(8))
# redraw the previous
plot(m, type="n")
text(m, dis="si", col=1, cex=0.8)
text(m, dis="sp", col=5, cex=0.8)
plot(ef, col=3, p.=0.05)
```
]
---
# Why Ordination?
- Ordination provides a graphical summary of main structure in the data
- Instead of looking at a huge number of variables with a huge number
of observations, we compress multidimensional data into two (or
three) dimensions that we can easily inspect and comprehend
- Ordination is **descriptive**, but it can help us to understand the
main features of the data
- Often ordination is *private:* we familiarize with the data so that
we can formulate research hypotheses, check the adequacy of our
hypothesis, and communicate the results to other people
- *Private:* we do it, but we do not tell anybody what we do in our privacy
- Ordination need not be published or shown to anyone to be useful for you
- Sometimes we can build rigorous tests on ordination results: these
summarize the major structure and allow using simpler test
procedures
- Some ordination methods allow testing within themselves
- Many people never need ordination, but some do — that all
depends on your research questions
---
# The Purpose of These Lectures
- Provide information on ordination for those who may need it
- Even if you do not need them yourself, it is useful to understand
how to interpret publications that use ordination
- These lectures are not about using ordination software, but they are
about the **methods**
- There are several alternative software implementations to perform
these analyses, although the lecture slides were written using
**R** statistical language and mainly with **vegan** package
- There are no **R** commands in these slides, but if you *want* to
see them, look at the
[source code](https://github.com/jarioksa/ordination101)
of these lectures
- I avoid mathematics, but also over-simplification and
superficiality: I try to *explain* how these methods work in terms
of common statistical tools, for instance, regression analysis (and
that is about how complicated it becomes)
- I would be really pleased if you understand what I say: Please
interrupt me and ask as soon as you get lost
---
# The Structure of These Lectures I
In the first part of lectures I describe four main basic ordination
methods:
- Principal Components Analysis (PCA) is just an extension of
regression analysis, and a simple summary of the data. I also
explain why it fails often with ecological community data.
- Correspondence Analysis (CA) is a variant of PCA, but that small
difference makes it much better for community data.
- In Principal Coordinates Analysis (PCoA) the main focus is what we
actually mean with things being similar, and how we can use
ecological judgement in defining similarity.
- Nonmetric Multidimensional Scaling (NMDS) is presented as a flexible
and robust tool that often is most reliable in finding interpretable
results.
- I also explain how to interpet ordination results with the help of
external information, such as measurements of chemical or physical
parameters and other environmental conditions
---
# The Structure of These Lectures II
In the latter part of lectures I describe how to use external
environmental variables to *constrain* ordination. These methods are
useful if we have a controlled experiment and we want to
analyse the statistical significance of multivariate responses. Even
with survey data, the use of external variables can make
interpretation much easier. Some of the main themes are:
- How constrained ordination is defined as an extension of regression
analysis, and how it can be applied in the context of PCA, CA and
PCoA.
- How we can test the statistical significance of external variables
in multivariate setting.
- How we can condition the analysis for some background variables and
remove their effects before main ordination.
- How we can partition multivariate response to different sources
variation.
<!-- ********** PCA ************** -->
---
class: inverse center middle
# PCA: Principal Components Analysis
PCA is the basic ordination method and the one that Statisticians
know. It is poor for most uses, but we need to know PCA to understand
how other methods improve upon it. The main problem of PCA is that it
is linear, but community data are nonlinear.
---
# From Regression to Ordination
- What is the best possible predictor variable for a species? --- It
is the species itself!
- If you predict values of $y$ with $y$, you get a perfect fit and
explain $y$ completely (but what about other species?)
- Some species can also be powerful in predicting abundances of
other species
- The best possible predictor is such a (linear) combination of
species that explains abundances of all species as well as possible
- That linear combination is the first PCA axis
---
# 3 Cryptogams in Reindeer Pastures
```{r bottpredict, fig.width=8, fig.height=5}
bott <- varespec[, c("Pleuschr","Cladrang","Cladstel")]
bott0 <- scale(bott, scale = FALSE)
par(mar = c(4,4,0,1)+.1)
palette(viridis(8))
panel.lm <- function(x, y, col.smooth = 1, tcex=1.5, ...)
{
usr <- par("usr"); on.exit(par(usr))
points(x, y, ...)
mod <- lm(y ~ x - 1)
txt <- paste0("Resid.Var = ", round(summary(mod)$sigma^2, 1))
abline(mod, col = col.smooth, lwd=2)
text(mean(usr[1:2]), 0.9*usr[4], txt, cex=tcex)
}
pairs(bott0, pch=16, col = 4, ylim = range(bott0), panel = panel.lm)
```
---
# Best Predictor
```{r bottpc1, fig.width=9.5, fig.height=4}
mod <- rda(bott0)
koef <- mod$CA$v[,1]
xlab = paste(formatC(koef,digits=2, flag="+"), names(koef), sep="*",collapse="")
PC1 <- (bott0 %*% mod$CA$v)[,1]
df <- data.frame(bott0)
palette(viridis(8))
par(mar = c(4,4,0,1)+.1, mfrow=c(1,3))
ylim <- range(bott0)
plot(Pleuschr ~ PC1, df, type="n", ylim=ylim, xlab=xlab, cex.lab=1.2)
panel.lm(PC1, df$Pleuschr, pch=16, col=2, col.smooth=4)
plot(Cladrang ~ PC1, df, type="n", ylim=ylim, xlab=xlab, cex.lab=1.2)
panel.lm(PC1, df$Cladrang, pch=16, col=2, col.smooth=4)
plot(Cladstel ~ PC1, df, type="n", ylim=ylim, xlab=xlab, cex.lab=1.2)
panel.lm(PC1, df$Cladstel, pch=16, col=2, col.smooth=4)
```
---
# What was Explained?
- PCA explains the **variance** of the data: the squared
differences between observed values $y$ and species means $\bar y$:
$\mathrm{Var}(y_j) = \sum_{i=1}^N(y_{ij} - \bar y_j)^2/(N-1)$
- Ordination always centres data so that species means $\bar y = 0$
- The total variance (sum of species variances) is called **inertia**
- In the example, the variances of three species were
`r round(apply(bott0, 2, var), 1)`, with sum
`r round(mod$tot.chi, 1)`, and the sum of residual variances of the PC
regression was `r round(mod$tot.chi - mod$CA$eig[1], 1)`
- The variance explained by the first PC is the **eigenvalue** of the
axis, and it is the difference of total variance and residual
variance $\lambda_1 =$ `r round(mod$CA$eig[1], 1)` which is
`r round(100*mod$CA$eig[1]/mod$tot.chi, 1)`% of total variance
- The coefficients (`r round(mod$CA$v[,1], 3)`) of three species are
the **species scores** used in ordination diagrams, and they are also
regression coefficients that project centred species abundances to
the PC
- Ordination always centres data so that the intercept
is zero
- The result of this projection are the sampling unit **(SU) scores**
that are used in ordination diagrams -- this is also called as the
**principal component**
---
# Second and Further Principal Components
- Second PC has similar components, but is **orthogonal** to the
previous one:
- Species scores and SU scores are orthogonal to (uncorrelated with)
corresponding scores in the previous axes
- The second PC explains the largest possible amount of remaining
variance
- In the example, its eigenvalue is $\lambda_2 =$ `r round(mod$CA$eig[2], 1)`
which is `r round(100*mod$CA$eig[2]/mod$tot.chi, 1)`% of total variance
- The sum of all $K$ eigenvalues is equal to the variances of all
species (inertia) $\sum_{k=1}^K \lambda_k = \sum_{j=1}^S \mathrm{Var}(y_j)$:
The axes decompose variance into independent ordered components
- In modern software, we do not find second and further PCs one-by-one
with orthogonalization, but the interpretation of the results is
still similar
---
# Is This Circular?
- We explain species with species — and this sounds badly circular
- We perform no statistical testing, and what looks regression
actually is **rotation** of species data
- We only like to find a way of looking at the data so that some first axes
(say PC1 and PC2) show as much of the variance of the data as possible
- Then we can ignore the later axes and say that a **major part** of
variation is displayed in the first dimensions that we
can plot
- PCA is **not** a statistical method, but it is only a rotation of
the data
---
# Species Scores in Multidimensions
.pull-left[
- We had species scores for a single axis, but in 2 and more dimensions
we must look at all dimensions simultaneously
- The coefficients together define the **direction** to which the
species abundance **increases** most steeply — and it
**decreases** to the opposite direction
- The response is linear in 2D: the contours of linear trend surface are
perpendicular to the arrow
- The estimated abundances of all species on each SU can be read by
projecting the SU point to the arrow: PCA **approximates** data
- Species were centred: The zero contour goes through the origin
]
.pull-right[
```{r plot3sp}
par(mar=c(4,4,0,1)+.1)
palette(viridis(8))
biplot(mod, type = c("text","points"))
tmp <- ordisurf(mod ~ Cladstel, df, add = TRUE, col=4, knots=1)
```
]
---
# Distance and Direction From the Origin
.pull-left[
- In the origin, all species occur at their average abundances
- The further away from the origin SU is situated, the more strongly
its species composition differs from the average
- For species the angles between arrows indicate the similarity of
change, and the length of the arrow the steepness of the change:
right angle means no relation, zero angle identical response, and
direct angle opposite responses
- Try yourself: where these species are most abundant and scarcest?
Which species attain highest abundances and where? Which species
replace each other and which are independent from each other?
]
.pull-right[
```{r plot3sp2}
par(mar=c(4,4,0,1)+.1)
palette(viridis(8))
biplot(mod, col=c(4,1))
```
]
---
# Rotation in Species Space
.pull-left[
- In **species space** every species is an *axis* (or a *dimension*)
- Each axis is at right angle against all other axes
- The SUs are points, and the abundance of each species gives its location
in species axes
- There can be a huge number of species axes, and we cannot see them all
- We **rotate** the species space so that we see as much as possible of the
locations of the points: This is the **PCA!**
- Originally we had a *species*-dimensional space, and we *reduce* it
to two or three dimensional space
]
.pull-right[
```{r specspace}
palette(viridis(8))
ordiplot3d(bott0, mar=c(4,4,0,2), angle=20, pch=16, col=2, ax.col=4)
```
[Please Click Me!](http://cc.oulu.fi/~jarioksa/opetus/metodi/3D/PCArotation/index.html)
]
---
# Complete Reindeer Pasture Data
.pull-left[
- Analysis of complete data is pretty similar to the analysis of these
three species
- PCA analyses variances and shows absolute differences from the species mean
$y - \bar y$: minor species change little in absolute units
- Actually only some few species influence results, and the data are
not very multivariate
]
.pull-right[
```{r pcavare}
par(mar=c(4,4,0,1)+.1)
palette(viridis(8))
modf <- rda(varespec)
biplot(modf, col=c(4,1))
```
]
---
# Explaining Variance
.pull-left[
- Abundant species have high variance, and PCA explains variance
- It may be very easy to explain variance when only a few species
account for the most, because then data are not very
multidimensional
- Minor species may be ecological indicators, but they have little
effect on community ordination with PCA
- Three abundant cryptogams accounted for
`r round(100*mod$tot.chi/modf$tot.chi, 1)`% of total variance in the complete
data set, and are dominant in PCA
]
.pull-right[
```{r varabu}
par(mar=c(4,4,0,1)+.1)
v <- apply(varespec, 2, var)
plot(colMeans(varespec), v, xlab="Mean Abundance", ylab="Variance", type="n")
ordilabel(cbind(colMeans(varespec), v), priority = v, fill = "yellow", cex=1)
```
]
---
# Equal Weights to All Species
.pull-left[
- Use correlation instead of variance and minimize residuals $1-R^2$ in
PC regression: This gives equal weights to all species
- Equalized data are more multidimensional, and scarce species will
also influence the result, and the ordination changes from the previous
- PCA implies that species should be shown by arrows, but this
makes graphs very messy, and often it makes sense to drop arrows and
just show the species scores at arrowheads
- They still should be interpreted like arrows: **Distance** and
**direction** from the origin counts
]
.pull-right[
```{r pcacorr}
mod1 <- rda(varespec, scale=TRUE)
palette(viridis(8))
par(mar=c(4,4,0,1)+.1)
plot(mod1, scaling="sites", type="n")
text(mod1, scaling="sites", dis="site", col = 1, cex=0.8)
text(mod1, scaling="sites", dis="species", col=4, cex=0.8)
```
]
---
class: inverse center middle
## PCA Trouble: It is Linear!
---
# Linear Method
- PCA builds upon the idea of *linear regression*
- PC axes are sometimes called as latent variables: unknown and hidden
real variables that we dig up from data
- Community response to environmental variables is non-linear: PC axes
cannot find such variables
```{r mtfgrad, fig.width=9, fig.height=3}
source('data/mtfit.R')
palette(viridis(6))
par(mar=c(4,4,0.4,1)+.1)
matplot(altfit, mtfit, type="l", lty=1, xlab="Altitude (m)", ylab="Species Response")
```
Mt Field (Tasmania): Fitted Responses
---
# Horseshoe: Beware!
.pull-left[
- If species have nonlinear, unimodal responses to gradients, the
gradient appears as a curve: **horseshoe artefact**
- People often want to interpret axes as latent variables, but even a
single gradient is not an axis: **Beware**
- Even if you are aware of the risk of horseshoe, it can hide other
shorter gradients: Using two dimensions for one gradient is wasteful
and confusing
- In general, PCA **should not be used** for community data
- PCA is useful for linear data, or for deriving new combined
variables for other analyses when PCA explains a large proportion of
variance
]
.pull-right[
```{r horseshoe}
mod <- rda(mtfit)
palette(viridis(8))
par(mar=c(4,4,0,1)+.1)
plot(mod, dis="si", scaling="si", type="n")
points(mod, dis="si", scaling="si", pch=16, col=2)
```
PCA of the expected species responses of previous slide
]
---
# Horseshoe is Common
```{r morehorses, fig.width=11}
palette(viridis(8))
par(mfrow=c(1,3), mar=c(4,4,2,1)+.1)
m1 <- rda(varespec)
m2 <- rda(bryceveg)
m3 <- rda(mite)
plot(m1, dis="si", type="n", main="Reindeer Pastures")
points(m1, dis="si", pch=16)
plot(m2, dis="si", type="n", main="Bryce Canyon")
points(m2, dis="si", pch=16)
plot(m3, dis="si", type="n", main="Oribatid Mites")
points(m3, dis="si", pch=16)
```
---
# When Would We Expect a Horseshoe?
- **Always** with community data
- Even when we cannot see the horseshoe, it may be hidden below the
crumble of noisy analysis
- Horseshoe can hide underlying gradients and disperse their effect
over several PCs so that they cannot be found and interpreted
- Horseshoe can be expected also in homogeneous data with short
gradients
---
# Niche for PCA
.pull-left[
- New synthetic variables from correlated measurements to avoid
problems of collinearity in statistical analysis
- Using PCs as explanatory variables in regression
- For this PCA should explain a large amount of variance
- Use only the subset of correlated variables that may be assumed to
describe the same underlying (latent) variable: **measurement
model** in **Factor Analysis**
- Use correlations when variables are not measured in the same units
- Do not use axes directly: they may not be parallel to interpretable
variables
]
.pull-right[
```{r latent}
mod <- rda(varechem, scale=TRUE)
palette(viridis(8))
par(mar=c(4,4,1,1)+.1)
biplot(mod, display="sp", col=1)
```
.small[
Soil data for reindeer pastures
]
]
<!-- ********* CA ************ -->
---
class: inverse center middle
# CA: Correspondence Analysis
Correspondence Analysis is an eigenvector method just like PCA, and it
can be seen as its variant. However, this small difference makes it
handle nonlinear species responses better than PCA. For community
data, CA is usually a much better alternative than PCA.
---
# Correspondence Analysis
- In PCA we explain the differences of species from their averages
$y_{ij} - \bar y_j$ using their mean squares (variance) as a measure
of goodness of fit
- In CA the expected value is derived both from SUs and species,
and the goodness of fit is $\chi^2$
- We first scale the observed data so that their sum is one
$\left( \sum_{i=1}^N \sum_{j=1}^S y_{ij} = 1 \right)$
— that is, we divide data with their total
- If all communities have identical species composition, and all
species occur in equal proportion in all SUs, then the expected
abundance is $r_i c_j$ where $r_i$ is row (SU) sum and $c_j$ is the
column (species) sum: This takes the place of $\bar y_j$ of PCA
- We look at proportional difference
$(y_{ij} - r_i c_j)/\sqrt{r_i c_j}$, and its square is
$\chi^2 = \sum_i \sum_j (y_{ij} - r_i c_j)^2/(r_i c_j)$:
This takes the place of variance of PCA
- Then we proceed with regression to find the axes, just like in PCA
(except that we need to use weights $r_i$, $c_j$)
---
# Small But Important Difference
.pull-left[
- CA is based on similar regression as PCA, but with
$\chi$-standardized data and weights
- Regression is linear, but *observed* values of data rather pack
together so that high values are close to each other along the axis
- The response of species can be approximately unimodal: This matches
the gradient model of species responses
- The solution can be better related to environmental variables than in PCA
- CA is still **not** a completely unimodal ordination and it still
shows a curve artefact, but not as garbled as PCA
]
.pull-right[
```{r cagam}
palette(viridis(3))
par(mar=c(4,4,1,1)+.1)
mod <- cca(bott) # from PCA lecture
CA1 <- mod$CA$u[,1]
matplot(CA1, bott, pch=16, xlab = "CA axis 1", ylab="Species Abundance")
legend("top", colnames(bott), pch=16, lty=1, col=1:3, bty="n")
i <- order(CA1)
for(k in 1:3) {
m <- gam(bott[,k] ~ s(CA1, k=4), family=quasipoisson)
lines(CA1[i], fitted(m)[i], lwd=2, col = k)
}
```
]
---
# PCA Orders Linearly, CA Packs
.pull-left[
```{r pcatabasco}
tabasco(varespec, rda(varespec), scale="log", col=viridis(10))
```
**PCA:** Rows and columns ordered by PC1
]
.pull-right[
```{r catabasco, }
tabasco(varespec, cca(varespec), scale="log", col=viridis(10))
```
**CA:** Rows and columns ordered by CA1
]
---
# More Clearly in Sparse Data
.pull-left[
```{r pcatabadune}
tabasco(dune, rda(dune), col=viridis(5))
```
**PCA:** Dutch Dune Meadows
]
.pull-right[
```{r catabadune, }
tabasco(dune, cca(dune), col=viridis(5))
```
**CA:** Diagonally structured table
]
---
# CA Preserves Order
- CA can also have a curve artefact with dominant long gradient, but
it is not garbled and involuted like the PCA horseshoe
- First CA axis preserves the correct ordering: CA is a **seriation method**
```{r abernethy, fig.width=7.5, fig.height=3.5}
par(mar=c(4,4,0.2,1)+.1, mfrow=c(1,2))
palette(viridis(8))
m0 <- rda(abernethy[,1:36])
m1 <- cca(abernethy[,1:36])
plot(m0, dis="si", scaling="si", type="n")
ordiarrows(m0, rep("a", nrow(abernethy)), scaling="si", col=7)
points(m0, dis="si", scaling="si", pch=16, col=1)
plot(m1, dis="si", scaling="si", type="n")
ordiarrows(m1, rep("a", nrow(abernethy)), scaling="si", col=7)
points(m1, dis="si", scaling="si", pch=16, col=1)
```
.pull-right[
Abernethy Forest pollen data (5515 to 12145 BP)
]
---
# Unimodal Responses in 2D
.pull-left[
- CA packs species also in multidimensional space
- Instead of an arrow of increase, the species score may be regarded
as a *centre of abundance*
- It is said that CA approximates the unimodal model
- We may think that the species scores gives the species maximum and
the abundance decreases to every direction from the centroid given
by the species score
- In PCA species close to the origin changed little and was poorly
presented by the ordination, but in CA it really may have its
optimum there
]
.pull-right[
```{r cascores}
par(mar=c(4,4,0.9,0.5)+.1, mfrow=c(2,2))
palette(viridis(8))
mod <- cca(dune)
with(dune, tmp <- ordisurf(mod ~ Poaprat, bubble=2, family=quasipoisson, knots=2, col=6, scaling="si", main="Poa pratensis"))
abline(h=0, v=0, lty=3)
with(dune, tmp <- ordisurf(mod ~ Trifrepe, bubble=2, family=quasipoisson, knots=2, col=6, scaling="si", main="Trifolium repens"))
abline(h=0, v=0, lty=3)
with(dune, tmp <- ordisurf(mod ~ Lolipere, bubble=2, family=quasipoisson, knots=2, col=6, scaling="si",main="Lolium perenne"))
abline(h=0, v=0, lty=3)
with(dune, tmp <- ordisurf(mod ~ Juncarti, bubble=2, family=quasipoisson, knots=2, col=6, scaling="si", main="Juncus articulatus"))
abline(h=0, v=0, lty=3)
```
]
---
# Numbers and Interpretation
.pull-left[
- Sum of all eigenvalues (inertia) $\sum_k \lambda_k = \chi^2$, and
$\lambda_k / \chi^2$ gives the proportion explained by the axis
- This inertia is **not** variance but $\chi^2$, and eigenvalues or percentages
cannot be compared with PCA
- The eigenvalue is bound to be $\lambda \le 1$ (and $\lambda \ge 0$)
- High eigenvalue is good, but too high (say, $\lambda > 0.7$) are
suspicious: disjunct or partly disjunct data where a group of SUs
shares hardly anything (or nothing) with others
- Species scores can be seen as points indicating species optima
instead of arrows of increase in PCA
- Both species and SU scores are **weighted averages** of each other
— but exact relationship depends on scaling in graphs
]
.pull-right[
```{r caplot}
palette(viridis(8))
par(mar=c(4,4,0,1)+.1)
plot(mod, scaling="sites", type="n")
text(mod, scaling="sites", dis="site", col = 1, cex=0.8)
text(mod, scaling="sites", dis="species", col=4, cex=0.8)
```
]
---
# Weighted Averages and Scaling
.pull-left[
```{r caspecwa}
par(mar=c(4,4,1,1)+.1)
palette(viridis(8))
with(dune, tmp <- ordisurf(mod ~ Lolipere, bubble=2.5, family=quasipoisson, knots=2, main="scaling='species'", pch=21, col=1, bg=4))
text(mod, dis="sp", cex=0.8, col=3)
with(dune, ordispider(mod, Lolipere>0, w=Lolipere, label=FALSE, show=TRUE, col=4))
with(dune, plot(envfit(mod ~ rep("A",nrow(dune)), w=Lolipere), labels="Lolipere", bg=8))
```
**Dots:** SUs, size: *Lolium perenne*, **Text:** species
]
.pull-right[
```{r casitewa}
par(mar=c(4,4,1,1)+.1)
palette(viridis(8))
plot(mod, scaling="si", type="n", main="scaling='sites'")
ordispider(mod, unlist(dune[18,]>0), scaling="si", w = unlist(dune[18,]), show=T, display="sp", col=4)
cl <- rep("white", nrow(dune))
cl[18] <- palette()[8]
ordilabel(mod, dis="si", cex=0.8, scaling="si", fill=cl)
points(mod, scaling="si", display="sp", cex=unlist(dune[18,]/2+0.4), pch=21, bg=4, col=1)
```
**Text:** SUs, **Dots:** species, size: abundance in SU 18
]
---
# Rare Species
.pull-left[
- Rare species are often extreme in CA
- CA explains deviation from expected abundance, and it expects every
species occur at every SU — rare species occur only in one or
two
- CA assesses deviation as proportional to expected — rare
species have low expected values and look extreme
- CA is weighted, and rare species have low weights — they do
not influence ordination of SUs or other species very much
- Some people tell you to routinely remove rare species from the data,
but I do not recommed doing so
]
.pull-right[
```{r carare}
par(mar=c(4,4,0,1)+.1)
cl <- rev(viridis(5))
fr <- colSums(bryceveg > 0)
fr <- cut(fr, breaks=c(0,1,3,7,15,100))
bmod <- cca(bryceveg)
plot(bmod, dis="sp", type="n")
points(bmod, dis="sp", pch=21, bg=cl[fr])
legend("topleft", c(" 1"," 2 - 3"," 4 - 7"," 8 - 15","16 - 87"), pch=21, pt.bg=cl, title="Species Frequency")
```
Bryce Canyon, species scores
]
---
class: inverse center middle
# Interpretation: Ordination and External Variables
Gradient model is the deep idea of ordination: If sampling units are
close to each other in ordination, they occur in similar environments,
and if they are far away in ordination, the environments differ. If our
ordination model is such that it can be assumed to be consistent with
gradient model (*i.e.*, it is able to handle unimodal species
responses), we can try to identify the underlying environmental
variables that cause the ordination structure.
---
## Fitted Vectors
.pull-left[
- Most popular method of displaying continuous variables
- Can show many variables in one graph
- Implies a **linear trend** surface, similarly as species arrows in PCA
- **Direction** shows the steepest change in variable: the direction
of the **gradient**
- Gradients are **not** parallel to axes: do **not** interpret axes, but gradients
- **Length** shows the relative importance or the strength of the variable
- The arrow is drawn from the origin and shows the direction of
increase: there is equally strong **decrease to opposite** direction
]
.pull-right[
```{r envfit}
par(mar=c(4,4,0,1)+.1)
palette(viridis(8))
vmod <- cca(varespec)
plot(vmod, dis="si", type="n")
points(vmod, pch=21, col=1, bg=4)
ef <- envfit(vmod ~ ., varechem)