-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathabundance-measurement.Rmd
202 lines (176 loc) · 17.9 KB
/
abundance-measurement.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
# How bias affects abundance measurements {#abundance-measurement}
This section extends the theoretical results of @mclaren2019cons to describe how taxonomic bias in MGS experiments leads to errors in the relative and absolute abundances measured for various microbial species.
All approaches to abundance quantification have systematic errors driven by taxonomic bias; however, some yield constant fold errors (FEs), while others yield FEs that depend on overall community composition and thus can vary across samples.
## A model of MGS measurements
Our primary tool for understanding the impact of taxonomic bias on MGS measurement is the theoretical model of MGS measurement developed and empirically validated by @mclaren2019cons.
This model describes the mathematical relationship between the read counts obtained by MGS and the (actual) abundances of the various species in a sample.
Here we extend the model as first described @mclaren2019cons, which considers only relative abundances, to also consider absolute abundances.
For concreteness, we will consider _absolute abundance_, or simply _abundance_, to refer to the number of cells per unit volume in a sample (cell concentration).
That said, our results equally apply to other definitions of absolute abundance, such as the total number of cells in a sample or ecosystem and other abundance units such as biomass or genome copy number.
This model is the simplest that respects the multiplicative nature of taxonomic bias and the *compositional* nature of MGS measurements.
The actual abundance of a species in a given sample, multiplied by its _measurement efficiency_---its rate of conversion from cells to taxonomically assigned sequencing reads---determines the species' read count in that sample.
Taxonomic bias presents as variation in the measurement efficiencies among species within an MGS experiment.
The read counts further depend on sample-specific experimental factors that are typically unknown, such that they are best interpreted as only providing relative abundances (such data is said to be _compositional_; @gloor2017micr).
We consider a set of microbiome samples measured by a specific MGS protocol that extracts, sequences, and taxonomically assigns reads to a set of microbial species $S$.
We make several simplifying assumptions to facilitate our analysis and presentation.
First, we consider only species-level assignment, and suppose that reads that cannot be uniquely assigned to a single species in $S$ are discarded.
Second, we ignore the possibility that reads are misassigned to the wrong species or the wrong sample.
Third, we suppose that taxonomic bias acts consistently across samples at the species level---that is, a given species is always measured more efficiently than another to the same degree.
Finally, unless otherwise stated, we treat sequencing measurements as deterministic, ignoring the 'random' variation in read counts that arise from the sampling of sequencing reads and other aspects of the MGS process.
These assumptions, though unrealistic descriptions of most MGS experiments, serve the purpose of clearly demonstrating when and why consistent taxonomic bias creates errors in DA analysis.
Our model stipulates that the taxonomically-assigned read count for a species $i$ in a sample $a$ equals its abundance $A_{i}^{(a)}$ multiplied by a species-specific factor $B_{i}$ and a sample-specific factor $F^{(a)}$,
\begin{align}
(\#eq:mgs-model)
M_i^{(a)} = A_i^{(a)} B_i F^{(a)}.
\end{align}
The species-specific factor, $B_{i}$, is the *relative measurement efficiency* (or simply *efficiency*) of the species relative to an arbitrary baseline species (@mclaren2019cons).
The variation in efficiency among species corresponds to the taxonomic bias of the MGS protocol.
The sample-specific factor, $F^{(a)}$, describes the effective sequencing effort for that sample; it equals the number of reads per unit abundance that would be obtained for a species with an efficiency of 1.
We can write the total number of assigned reads for the sample as
\begin{align}
(\#eq:total-reads)
M_\tot^{(a)}
= A_\tot^{(a)} \bar B^{(a)} F^{(a)},
\end{align}
where
$M_{\tot}^{(a)} \equiv \sum_{j\in S} M_j^{(a)}$ is the total read count and $A_{\tot}^{(a)} \equiv \sum_{j\in S}A_j^{(a)}$ is the total abundance of all species in $S$, and
\begin{align}
(\#eq:mean-efficiency)
\bar B^{(a)} \equiv \frac{\sum_{i\in S} A_i^{(a)} B_i}{\sum_{i\in S} A_i^{(a)}}
\end{align}
is the _sample mean efficiency_, defined as the mean efficiency of all species weighted by their abundance.
## Relative abundance {#relative-abundance}
We distinguish between two types of species-level *relative abundances* within a sample.
The *proportion* $P_{i}^{(a)}$ of species $i$ in sample $a$ equals its abundance divided by the total abundance of all species in $S$,
\begin{align}
(\#eq:prop)
P_{i}^{(a)} &\equiv \frac{A_i^{(a)}}{A_\tot^{(a)}}.
\end{align}
The *ratio* $R_{i/j}^{(a)}$ between two species $i$ and $j$ equals the abundance of $i$ divided by that of $j$,
\begin{align}
(\#eq:ratio)
R_{i/j}^{(a)} \equiv \frac{A_i^{(a)}}{A_j^{(a)}}.
\end{align}
The measured proportion of a species is given by its proportion of all the assigned reads in a sample,
\begin{align}
(\#eq:prop-meas)
\tilde P_{i}^{(a)} &\equiv \frac{M_i^{(a)}}{M_\tot^{(a)}}.
\end{align}
We use the tilde to distinguish the measurement from the actual quantity being measured.
From Equations \@ref(eq:mgs-model), \@ref(eq:total-reads), and \@ref(eq:prop-meas), it follows that the measured and actual proportion are related by
\begin{align}
(\#eq:prop-error)
\tilde P_{i}^{(a)} &= P_{i}^{(a)} \cdot \frac{B_i}{\bar B^{(a)}}.
\end{align}
Taxonomic bias creates a fold-error (FE) in the measured proportion of a species that is equal to its efficiency divided by the mean efficiency in the sample.
Since the mean efficiency varies across samples, so does the FE.
This phenomenon can be seen for Species 3 in the two hypothetical communities in Figure \@ref(fig:error-proportions).
Species 3, which has an efficiency of 6, is under-measured (FE < 1) in Sample 1 but over-measured (FE > 1) in Sample 2.
This difference occurs because the even distribution of species in Sample 1 yields a mean efficiency of 8.33; in contrast, the lopsided distribution in Sample 2, which is dominated by the low-efficiency Species 1, has a mean efficiency of just 3.15.
A demonstration of this phenomenon in bacterial mock communities is shown in [Figure 3C](https://doi.org/10.7554/eLife.46923.004) of @mclaren2019cons.
The measured ratio $\tilde R_{i/j}^{(a)}$ between species $i$ and $j$ is given by the ratio of their read counts,
\begin{align}
(\#eq:ratio-meas)
\tilde R_{i/j}^{(a)} \equiv \frac{M_{i}^{(a)}}{M_{j}^{(a)}}.
\end{align}
From Equations \@ref(eq:mgs-model) and \@ref(eq:ratio-meas), it follows that the measured and actual ratio are related by
\begin{align}
(\#eq:ratio-error)
\tilde R_{i/j}^{(a)} = R_{i/j}^{(a)} \cdot \frac{B_i}{B_j}.
\end{align}
Taxonomic bias creates a FE in the measured ratio that is equal to the ratio in the species' efficiencies; the FE is therefore constant across samples.
For instance, in Figure \@ref(fig:error-proportions), the ratio of Species 3 (with an efficiency of 6) to Species 1 (with an efficiency of 1) is over-measured by a factor of 6 in both communities despite their varying compositions.
A demonstration in bacterial mock communities of the constant FE in species' ratios is shown in [Figure 3D](https://doi.org/10.7554/eLife.46923.004) of @mclaren2019cons.
<!-- begin figure -->
```{r error-proportions, out.width = '100%', fig.cap = '(ref:cap-error-proportions)', cache = TRUE}
fs::path(
"figures/illustrations/error-proportions.svg"
) %>%
include_svg
```
(ref:cap-error-proportions) **Taxonomic bias creates fold errors in species proportions that vary across samples and lead to inaccurate fold differences between samples.** Top row: Error in proportions measured by MGS in two hypothetical microbiome samples that contain different relative abundances of three species. Bottom row: Error in the measured fold difference in the third species that is derived from these measurements. Species' proportions may be measured as too high or too low depending on sample composition. For instance, Species 3 has an efficiency of 6 and is under-measured in Sample 1 (which has a mean efficiency of 8.33) but over-measured in Sample 2 (which has a mean efficiency of 3.15).
<!-- end figure -->
**Higher-order taxa:**
We can consider a higher-order taxon $I$, such as a genus or phylum, as a set of species, $\{i \in I\}$.
The abundance of taxon $I$ in sample $a$ is the sum of the abundances of its constituent species, $A_{I}^{(a)} \equiv \sum_{i \in I} A_{i}^{(a)}$.
Similarly, the read count of taxon $I$ is the sum $M_{I}^{(a)} \equiv \sum_{i \in I} M_{i}^{(a)}$.
We further define the efficiency of taxon $I$ as the abundance-weighted average of the efficiencies of its constituent species,
\begin{align}
(\#eq:efficiency-general)
B_I^{(a)} \equiv \frac{\sum_{i\in I} A_{i}^{(a)} B_{i}}{\sum_{i\in I} A_{i}^{(a)}}.
\end{align}
With these definitions, the read count for higher-order taxon $I$ can be expressed as
$M_{I}^{(a)} = A_{I}^{(a)} B_I^{(a)} F^{(a)}$.
Thus $B_I^{(a)}$ plays a role analogous to the efficiency of an individual species, but differs in that it is not constant across samples:
If the constituent species have different efficiencies, then the efficiency of the higher-order taxon $I$ depends on the relative abundances of its constituents and so will vary across samples (@mclaren2019cons).
As an example, suppose that Species 1 and Species 2 in Figure \@ref(fig:error-proportions) were in the same phylum.
The efficiency of the phylum would then be $\tfrac{1}{2} \cdot 1 + \tfrac{1}{2} \cdot 18 = 9.5$ in Sample 1 and $\tfrac{15}{16} \cdot 1 + \tfrac{1}{16} \cdot 18 \approx 2.1$ in Sample 2.
Equations \@ref(eq:prop-error) and \@ref(eq:ratio-error) continue to describe the measurement error in proportions and ratios involving higher-order taxa, so long as the sample-dependent, higher-order taxa efficiencies $B_I^{(a)}$ and $B_J^{(a)}$ are used.
In this way, we see that both proportions and ratios among higher-order taxa may have inconsistent FEs.
## Absolute abundance {#absolute-abundance}
Several extensions of the standard MGS experiment make it possible to measure absolute species abundances.
These extensions fall into two general approaches.
The first approach leverages information about the abundance of the total community; for example, @vandeputte2017quan measured total-community abundance using flow cytometry and multiplied this number by genus proportions measured by MGS to quantify the absolute abundances of individual genera (@vandeputte2017quan).
A second approach leverages information about the abundance of one or more reference species; for example, a researcher might 'spike in' a known, fixed amount of an extraneous species to all samples prior to MGS, and normalize the read counts of all species to the spike-in species (@harrison2021theq).
We consider each approach in detail to determine how taxonomic bias affects the resulting absolute-abundance measurements.
### Leveraging information about total-community abundance
Suppose that the total abundance of all species in the sample, $A_{\tot}^{(a)}$, has been measured by a non-MGS method, yielding a measurement $\tilde A_\tot^{(a)}$.
The absolute abundance of an individual species can be quantified by multiplying the species' proportion from MGS by this total-abundance measurement,
\begin{align}
(\#eq:density-prop-meas)
\tilde A_i^{(a)} &= \tilde P_i^{(a)} \tilde A_\tot^{(a)}.
\end{align}
Total-abundance measurements recently used for this purpose include counting cells with microscopy (@lloyd2020evid) or flow cytometry (@props2017abso, @vandeputte2017quan, @galazzo2020howt), measuring the concentration of a marker-gene with qPCR or ddPCR (@zhang2017soil, @barlow2020aqau, @galazzo2020howt, @tettamantiboshier2020comp), and measuring bulk DNA concentration with a florescence-based DNA quantification method (@contijoch2019gutm).
Importantly, these methods of measuring total abundance are themselves subject to taxonomic bias that is analogous to, but quantitatively different from, the MGS relative abundance measurements.
Flow cytometry may yield lower cell counts for species whose cells tend to clump together or are prone to lysis during steps involved in sample collection, storage, and preparation.
Marker-gene concentrations measured by qPCR are affected by variation among species in extraction efficiency, marker-gene copy number, and PCR binding and amplification efficiency (@lloyd2013meta).
We can easily understand the impact of taxonomic bias on total-abundance measurement under simplifying assumptions analogous to those in our MGS model.
Suppose that each species $i$ has an _absolute efficiency_ $B_{i}^{\mtot}$ for the total-abundance measurement that is constant across samples.
Further, let $\bar B^{\mtot (a)}$ be the abundance-weighted average of these efficiencies in sample $a$---that is, the mean efficiency of the total-abundance measurement.
Neglecting other error sources, the total-abundance measurement equals
\begin{align}
(\#eq:total-density-error)
\tilde A_\tot^{(a)}
&= \sum_{i\in S} A_i^{(a)} B_{i}^{\mtot}
\\&= A_\tot^{(a)} \bar B^{\mtot (a)}.
\end{align}
<!-- Note: We have assumed that only species in S contribute to the total abundance measurement. -->
Species abundance measurements derived by this method (Equation \@ref(eq:density-prop-meas)) are affected by taxonomic bias in both the MGS and total-abundance measurement.
We can determine the resulting fold error (FE) in the estimate $\tilde A_i^{(a)}$ by substituting Equations \@ref(eq:prop-error) and \@ref(eq:total-density-error) into Equation \@ref(eq:density-prop-meas), yielding
\begin{align}
(\#eq:density-prop-error)
\tilde A_\tot^{(a)}
= A_\tot^{(a)} \cdot \frac{B_i \bar B^{\mtot (a)}}{\bar B^{(a)}}.
\end{align}
Equation \@ref(eq:density-prop-error) indicates that the FE in the measured absolute abundance of a species equals its MGS efficiency relative to the mean MGS efficiency in the sample, multiplied by the mean efficiency of the total measurement.
As in the case of proportions (Equation \@ref(eq:prop-error)), the FE depends on sample composition through the two mean efficiency terms and so will, in general, vary across samples.
### Leveraging information about a reference species
Suppose that the absolute abundance of a _reference species_ $r$ has been fixed by the experimenter or been measured by independent means.
This known or measured abundance $\tilde A_{r}^{(a)}$ can be used in conjunction with the MGS read counts to obtain absolute abundances for all species.
In the absence of taxonomic bias, the ratio of a species' absolute abundance to its MGS read count is the same for all species in a given sample (Equation \@ref(eq:mgs-model)).
Hence the known ratio for the reference species can serve as conversion factor for obtaining the absolute abundance of a species $i$ from its read count,
\begin{align}
(\#eq:density-ratio-meas)
\tilde A_i^{(a)} &= M_i^{(a)} \cdot \frac{\tilde A_r^{(a)}}{M_r^{(a)}}.
\end{align}
Let $\FE[\tilde A_r^{(a)}] \equiv {\tilde A_r^{(a)}}/{A_r^{(a)}}$ be the FE in the reference measurement.
The effect on $\tilde A_i^{(a)}$ of taxonomic bias in the MGS measurement can be determined by substituting Equation \@ref(eq:ratio-error) into Equation \@ref(eq:density-ratio-meas), yielding
\begin{align}
(\#eq:density-ratio-error)
\tilde A_i^{(a)} = A_i^{(a)} \cdot \frac{B_i}{B_r} \cdot
% \frac{\tilde A_r^{(a)}}{A_r^{(a)}}.
\FE \left[\tilde A_r^{(a)}\right].
\end{align}
The FE in $\tilde A_i^{(a)}$ consists of two terms: the relative efficiency of species $i$ to species $r$ in the MGS measurement (${B_i}/{B_r}$) and the FE in the reference species' abundance (${\tilde A_r^{(a)}}/{A_r^{(a)}}$).
A common application of this approach involves adding a 'spike-in' (as described above) in a known (and typically constant) abundance across samples (@stammler2016adju, @ji2019quan, @tkacz2018abso, @harrison2021theq, @rao2021mult).
In this case, the reference abundance $\tilde A_r^{(a)}$ is determined from the concentration of the spike-in stock multiplied by the ratio of the spike-in to sample volumes.
Others have instead sought to determine naturally-occurring species that are thought to be constant across samples; we refer to such species as _housekeeping species_ by analogy with the housekeeping genes used for absolute-abundance conversion in gene-expression studies (@silver2006sele).
Housekeeping species can sometimes be identified using prior scientific knowledge; for example, in shotgun sequencing experiments, researchers have used sequencing reads from the plant or animal host as a reference (@karasov2020ther, @regalado2020comb, @wallace2021thed).
A related approach involves computationally identifying species that are constant between pairs of samples (@david2014host) or between sample conditions (@mandal2015anal, @kumar2018anal).
The abundance of a housekeeping species is typically unknown; therefore, to estimate the abundances of other species, we simply set $\tilde A_r^{(a)}$ to 1 in Equation \@ref(eq:density-ratio-meas).
The resulting abundance measurements have unknown but fixed units, which is sufficient for measuring fold changes across samples.
We suggest an additional way of using the reference-species strategy even in the absence of a spike-in or constant species:
Performing targeted measurements of the absolute abundance of one or more naturally occurring species.
These species can then be used as reference species in Equation \@ref(eq:density-ratio-meas) to measure the absolute abundances of all species.
The most common form of targeted measurement involves using qPCR or ddPCR to measure the concentration of a marker-gene in the extracted DNA.
It is also possible to directly measure cell concentration by performing ddPCR prior to DNA extraction (@morella2018rapi), flow cytometry with species-specific florescent probes, or CFU counting on selective media.