-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathdifferential-abundance.Rmd
143 lines (117 loc) · 14.4 KB
/
differential-abundance.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
# How bias affects DA results {#differential-abundance}
<!-- Previous title: Differential-abundance methods vary in their robustness to taxonomic bias -->
We now turn to how the errors in MGS-based abundance measurements affect DA results.
Our focus is on _multiplicative DA analyses_ that estimate the fold differences (FD) in the relative or absolute abundance of species across time points or between experimental conditions.
Multiplicative DA has a direct ecological interpretation via the multiplicative processes of exponential growth and decay.
Many DA methods operate on additive differences in log-transformed abundances and are therefore multiplicative; these include popular methods such as DESeq2 (@love2014mode), ALDEx2 (@fernandes2014unif), corncob (@martin2020mode), and time series methods based on the Lotka-Volterra model (e.g., @stein2013ecol).
We also consider non-parametric rank-based methods; these methods are popular in case-control studies for which ecological interpretability is often of secondary importance to discovering stable associations between species and the condition of interest.
## Fold change between a pair of samples
The building blocks of a multiplicative DA analysis are the measured FDs in a species' abundance between pairs of individual samples.
<!-- The FE in the measured FD between two samples is given by the ratio of the FEs for the individual-sample measurements. -->
If the fold error (FE) in a species' abundance measurement is constant across samples, then it will not affect the measured FDs between samples.
<!-- Ratios -->
In Section \@ref(abundance-measurement), we showed that consistent taxonomic bias creates a constant FE in the ratio between two species $i$ and $j$, equal to the ratio in their efficiencies (Equation \@ref(eq:ratio-error)).
This error completely cancels when calculating the FD of this ratio from sample $a$ to $b$,
\begin{align}
(\#eq:ratio-fc-error)
\underbrace{\frac{\tilde R_{i/j}^{(b)}}{\tilde R_{i/j}^{(a)}}}_\text{measured FD}
= \frac
{R_{i/j}^{(b)} \cdot \cancel{B_i / B_j}}
{R_{i/j}^{(a)} \cdot \cancel{B_i / B_j}}
=
\underbrace{\frac{R_{i/j}^{(b)}}{R_{i/j}^{(a)}}}_\text{actual FD}
.
\end{align}
<!-- Proportions -->
In contrast, bias creates a FE in the proportion of a species (Equation \@ref(eq:prop-error)) that varies inversely with the mean efficiency of the sample.
Therefore, this error does not cancel when calculating the FD in the proportion,
\begin{align}
(\#eq:prop-fc-error)
\underbrace{\frac{\tilde P_i^{(b)}}{\tilde P_i^{(a)}}}_\text{measured FD}
= \frac
{P_i^{(b)} \cdot \cancel{B_i} / \bar B^{(b)}}
{P_i^{(a)} \cdot \cancel{B_i} / \bar B^{(a)}}
=
\underbrace{\frac{P_i^{(b)}}{P_i^{(a)}}}_\text{actual FD}
\cdot
\underbrace{\left[\frac{\bar B^{(b)}}{\bar B^{(a)}}\right]^{-1}}_\text{FE}
.
\end{align}
The varying mean efficiency does not cancel, leaving an FE in the FD of the species' proportion that is equal to the inverse FD in the mean efficiency.
Notably, this FE is the same for every species.
The different effects of taxonomic bias on the measured FDs in proportions versus ratios are illustrated in a hypothetical example in Figure \@ref(fig:error-proportions) (bottom row).
In this example, the mean efficiency decreases 2.6-fold from Sample 1 to Sample 2 (FD of 0.4), causing the FD in the proportion of each species to appear 2.6-fold larger than its true value.
Though the FE in the FD is the same for each species, the biological implications of the error vary.
In particular, there are three distinct types of error: an increase in the magnitude, a decrease in the magnitude, and a reversal in the direction of the measured FD.
We can see each type of error in Figure \@ref(fig:error-proportions).
For Species 1, which increases in abundance and thus moves in the opposite direction of the mean efficiency, we see an increase in magnitude of the measured FD (actual FD: 2.3, measured FD: 6.5).
For Species 2, which decreases and thus moves in the same direction as the mean efficiency but by a larger factor, we see a decrease in magnitude of the measured FD (actual FD: 0.15, measured FD: 0.44).
For Species 3, which decreases but by a smaller factor than the mean efficiency, the species actually appears to increase---a reversal in direction of the measured FD (actual FD: 0.6, measured FD: 1.8).
In contrast, the FDs in the ratios among species are identical in the actual and measured taxonomic profiles.
For example, the ratio of Species 2 to Species 3 shows the same 4-fold decrease in the actual profiles ($1$ to $1/4$) and in the measured profiles ($3$ to $3/4$).
<!-- The ratios in sample 1 are Actual: 1:1:1, Observed: 1:18:6 -->
<!-- The ratios in sample 2 are Actual: 15:1:4, Observed: 5:6:8 -->
Taxonomic bias affects measured FDs in a species' absolute abundance in a similar fashion.
First, consider a species' abundance measurement $\tilde A_i^{(a)}$ derived from a non-MGS measurement of total-community abundance using Equation \@ref(eq:density-prop-meas).
Equation \@ref(eq:density-prop-error) indicates that the FE in this measurement equals the (constant) species efficiency multiplied by ${\bar B^{\mtot (a)}}/{\bar B^{(a)}}$, the ratio of the mean efficiency of the total-abundance measurement to that of the MGS measurement.
This ratio can vary, creating error in the measured FD between samples.
A notable special case is if the FDs in the mean efficiency of the total-abundance measurement mirrors that of the MGS measurement; then, the two will offset each other and lead to more stable FEs (and hence more accurate FDs) in $\tilde A_i^{(a)}$.
We discuss how this possibility might be exploited in real experimental workflows in Section \@ref(solutions).
Now consider a species' abundance derived from a reference species using Equation \@ref(eq:density-ratio-meas).
Equation \@ref(eq:density-ratio-error) indicates that the FE equals a constant ratio in species' efficiencies multiplied by the FE in the measurement of the reference species' abundance.
If the abundance of the reference species can be determined up to constant FE across samples, then the FE in $\tilde A_i^{(a)}$ will also be constant, leading to accurate FDs.
## Regression analysis of many samples
In most cases, a DA analysis can be understood as regression of microbial abundance variables against one or more covariates describing predictive properties of the sample.
The simplest such regression relates a microbial abundance variable $y$ to a covariate $x$ and a _residual (or unexplained) error_ $\varepsilon$ via the simple linear regression formula
\begin{align}
(\#eq:regression)
y^{(a)} = \alpha + \beta x^{(a)} + \varepsilon^{(a)}.
\end{align}
A multiplicative DA analysis can be conducted by setting $y$ equal to the logarithm of the untransformed abundance measurement ($\tilde P_i$, $\tilde R_{i/j}$, or $\tilde A_i$).
For example, $y$ may be the log absolute abundance of species $i$ ($\log A_i^{(a)}$) and $x$ may be a continuous variable (such as pH) or a binary variable (such as $x=1$ for treated patients and $x=0$ for untreated controls).
The regression coefficients $\alpha$ and $\beta$ are parameters to be estimated from the data.
A DA regression analysis is primarily interested in the slope coefficient $\beta$, which determines how the average or expected value of $y$ increases with $x$.
For a binary covariate, $\beta$ equals the average difference in $y$ between samples with $x=1$ versus those with $x=0$.
When $y$ equals the logarithm of a species' relative or absolute abundance, the coefficient $\beta$ equals the average log fold difference (LFD) in abundance between conditions corresponding to a unit increase in $x$.
Although researchers often use more sophisticated models, the simple linear regression model in Equation \@ref(eq:regression) provides an intuitive basis for understanding the effect of taxonomic bias on regression-based DA analyses.
Appendix \@ref(appendix-regression) describes a general framework for finding the error in the estimated slope coefficient $\hat \beta$ caused by taxonomic bias; here, we summarize and illustrate these results for the various abundance measures described in Section \@ref(abundance-measurement).
The effect that taxonomic bias has on the estimate $\hat \beta$ mirrors its effect on the measured FDs between a pair of individual samples:
For abundance measures with a FE that is constant across samples, bias does not affect $\hat \beta$.
For abundance measures where the FE varies, bias can affect $\hat \beta$, with its exact effect depending on how the mean efficiency varies across samples.
We illustrate the constant-FE case by considering a regression analysis of the log ratio between species $i$ and $j$, $\log \tilde R_{i/j}$.
Taxonomic bias creates a constant FE in $\tilde R_{i/j}$ that translates to a constant additive error in $\log \tilde R_{i/j}$.
A constant (additive) shift in a response variable only affects the estimated intercept $\hat \alpha$ and not the estimated slope $\hat \beta$.
Thus, in this case, taxonomic bias does not affect $\hat \beta$.
We illustrate the varying-FE case by considering the absolute abundance of a species $i$ obtained by multiplying the species' MGS proportion by a non-MGS measure of total-abundance, using Equation \@ref(eq:density-prop-meas).
For simplicity, suppose that total-community abundance has been measured completely accurately.
In this case, the measured abundance $\tilde A_i$ has a FE that varies inversely with the mean efficiency $\bar B$.
If $\bar B$ is stable across samples, then the additive error in $\log \tilde A_i$ will be approximately constant and $\hat \beta$ is again unaffected---bias has no effect on the DA result.
If, on the other hand, $\log \bar B$ tends to linearly vary with the covariate $x$, then the additive error in $\log \tilde A_i$ will linearly vary in the opposite direction and cause a systematic error in $\hat \beta$ that is equal in magnitude but opposite in sign to the slope of $\log \bar B$ with $x$.
A third possibility is that $\bar B$ varies in an essentially random manner that is independent of the covariate $x$.
In this scenario, the error from taxonomic bias in the measurement $\log \tilde A_{i}$ acts to increase the noise in $\hat \beta$ (i.e., increase its standard error), but does not cause a systematic error.
Figure \@ref(fig:regression-example) shows a simulated example of a regression analysis of the absolute abundances of 10 species across two conditions in which the mean efficiency is systematically greater in the condition $x=1$ than in the condition $x=0$.
The increase in the log mean efficiency from $x=0$ to $x=1$ (Figure \@ref(fig:regression-example) A and B) corresponds to an artificial decrease in the estimated LFD between conditions for each species (Figure \@ref(fig:regression-example) C and D).
The absolute error created by bias on the LFD estimate is the same for each species; however, its scientific impact varies.
For the three species with large positive LFDs (Species 9, 10, and 1), bias decreases the magnitude of the LFD estimate.
In contrast, for the three species with large negative LFDs (Species 8, 2, and 3), bias increases the magnitude of the estimate.
For one species with a small positive LFD (Species 5), the effect of bias results in a negative LFD estimate (direction error).
The three remaining species (Species 7, 6, and 4) have LFDs near zero, but bias causes them to have large negative LFDs.
<!-- Figure: Regression example -->
```{r regression-example, fig.cap = '(ref:regression-example)', out.width = '90%'}
fs::path(
"notebook/_posts/2021-08-03-simulate-regression-example/",
"simulate-regression-example_files/figure-html5",
"main-figure-2-1.svg"
) %>%
include_svg
```
(ref:regression-example) **Taxonomic bias distorts multi-sample differential abundance analysis when the mean efficiency of samples is associated with the covariate of interest.** This figure shows the results of a regression analysis of simulated microbiomes consisting of 10 species and 50 samples from two environmental conditions, which are indexed by $x=0$ and $x=1$. Panel A shows the (log) measurement efficiencies of the 10 species against their differential-abundance parameter values (expected LFD in absolute abundance between a sample from condition $x=1$ to a sample from condition $x=0$). The simulation was created so that the species with the largest efficiency (Sp. 9) also has the largest positive expected LFD. The increased relative abundance of Sp. 9 in condition $x=1$ drives an increase in the (log) mean efficiency (Panel B), which induces a systematic and consistent negative shift in the estimates of the mean LFD for each species (Panels C and D). Panel D shows the estimated mean LFD with 95% confidence interval for each species, when either estimated the actual abundances or the inaccurate measured abundances. The error (estimate from measured abundances minus estimate from actual abundances; red arrow in Panel D) equals the negative mean LFD the mean efficiency (red arrow in Panel B).
<!-- end Figure -->
## Rank-based analyses
Rank-based DA methods work by first applying a cross-sample rank-transformation to the abundance variable $y$: The sample in which $y$ has the smallest value receives a rank of 1, the sample with the second smallest value receives a rank of 2, etc.
These methods then analyze how the average rank of $y$ varies across sample conditions (@conover2012ther).
Rank-based methods commonly used in microbiome DA analysis include the Wilcoxon-Mann-Whitney and Kruskal-Wallis tests for a discrete covariate (e.g. case vs control) and Spearman's rank correlation for a continuous covariate (e.g. pH); these tests have collectively been applied to species proportions (@callahan2017repl,@fettweis2019thev), ratios (@fernandes2014unif), and absolute abundances (@vieirasilva2019quan).
Multiplying a variable by a constant factor does not change its cross-sample ranks, making rank-based methods unaffected by a constant FE in the abundance variable $y$.
Thus in our MGS measurement model, taxonomic bias does not affect rank-based DA analyses of species ratios or absolute abundances derived from a reference species.
But for species proportions and absolute abundances derived from a total-community abundance measurement, variation in the mean efficiency $\bar B$ can affect the cross-sample ranks and cause error in DA results if $\bar B$ is associated with the covariate.
Further work is needed to determine whether rank-based DA methods are significantly more robust to variation in $\bar B$ than multiplicative DA methods.