case-studies.Rmd

# Case studies  {#case-studies}

To better understand the potential impact of taxonomic bias on DA analysis in practice, we conducted several case studies spanning a range of biological scenarios and sequencing technologies.

## Foliar fungi experiment

The taxonomic bias of species in 'mock communities' of known composition can be directly measured, and the measured bias can then be used to correct downstream DA analysis (@mclaren2019cons).
In practice it is difficult to construct control communities that span the many species present in complex natural communities.
However, gnotobiotic community experiments are well suited to this form of _calibration via community controls_ since it is possible to assemble mock communities containing all species in known relative abundances.

In a study of host-commensal-pathogen interactions, @leopold2020host inoculated plants with 8 commensal fungal species and subsequently exposed plants to a fungal pathogen.
The authors used ITS amplicon sequencing to measure communities before and after pathogen infection.
Motivated by the substantial ribosomal copy-number variation (CNV) in fungi (@lofgren2019geno), the authors also performed control measurements of mock communities that they constructed from quantified genomic DNA of the 9 species in the experiment; these controls were used to measure taxonomic bias with the method of @mclaren2019cons.
The authors found a 13-fold difference between the most and least efficiently measured commensal, while the pathogen was measured 40-fold more efficiently than the least efficiently measured commensal.

@leopold2020host performed two related DA analyses on the pre-infection communities: the first characterized the relative importance of host genetics and species arrival order on species relative abundances in the fully-established community, and the second quantified the strength of 'priority effects'---the advantage gained by a species from being allowed to colonize first.
Both analyses were based on fold differences (FDs) in species proportions and so in principle were sensitive to taxonomic bias.
To improve accuracy, the authors incorporated the bias measured from the control samples into analysis-specific calibration procedures.

We repeated the two DA analyses of @leopold2020host with and without calibration and found that the results did not meaningfully differ
([SI Analysis of host genetics and arrival order](https://mikemc.github.io/differential-abundance-theory/notebook/posts/2022-01-06-leopold2020host-original-regression-analysis/),
[SI Analysis of priority effects](https://mikemc.github.io/differential-abundance-theory/notebook/posts/2022-01-05-leopold2020host-priority-effects/)).
To understand why, we examined the variation in species proportions and the mean efficiency across the pre-infection communities 
([SI Analysis](https://mikemc.github.io/differential-abundance-theory/notebook/posts/2022-01-08-leopold2020host-case-study/), 
SI Figure \@ref(fig:leopold2020host-variation)).
Despite the 13-fold variation in the efficiencies among species, the mean efficiency hardly varied across samples (SI Figure \@ref(fig:leopold2020host-variation)C), having a geometric range of 1.62 and a geometric standard deviation of 1.05.
This consistency in the mean efficiency was despite the fact that each species showed substantial multiplicative variation (SI Figure \@ref(fig:leopold2020host-variation)A).
But the pre-infection samples were always dominated by the three species with the highest efficiencies, which varied by 3-fold and by just 1.5-fold between the two most dominant.
The mean efficiency, equal to the proportion-weighted arithmetic average of species efficiencies, is insensitive to species present at low relative abundance and so remained relatively constant across samples.
Because the multiplicative variation in the mean efficiency was much smaller than that in the proportions of individual species, it had a negligible impact on the inferred FDs and the DA analyses based on them.

We performed an additional DA analysis on the data from @leopold2020host to investigate whether any commensals increased in absolute concentration in response to infection ([SI Analysis](https://mikemc.github.io/differential-abundance-theory/notebook/posts/2022-01-08-leopold2020host-case-study/#change-in-commensals-due-to-infection)).
The pathogen is absent in pre-infection samples but tends to dominate the community post-infection, resulting in a substantially higher mean efficiency in post-infection samples (SI Figure \@ref(fig:leopold2020host-infection-mean-efficiency-dist)).
Across different host genotypes, the average post-infection increase in mean efficiency ranged from 2.5-fold to 5.2-fold.
Using gamma-Poisson regression of the observed read counts, we estimated the average LFD in commensal species proportions following infection with and without calibration (SI Figure \@ref(fig:leopold2020host-infection-lfc)).
The commensals' proportions typically decreased post-infection, which is expected given the pathogen's growth to most abundant community member and the sum-to-one constraint faced by proportions.
However, failure to account for bias caused the decrease to be overestimated, by an amount corresponding to the inverse change in log mean efficiency.
For commensal-host pairs with relatively small observed decreases, bias correction greatly reduced the magnitude of the negative LFDs or, in several cases, resulted in LFDs that were near 0 or slightly positive.

Although @leopold2020host did not include absolute-abundance measurements, we can consider the impact taxonomic bias would have on an absolute DA analysis in a simple scenario in which total genome concentration of each pre- and post-infection sample is perfectly known.
We consider the approach of multiplying the total genomic concentrations of each sample with the species proportions measured by MGS as described by Equation \@ref(eq:density-prop-meas).
In this case, the bias in the MGS measurements will create absolute errors in the estimated LFDs of genome concentration of equal magnitude to the errors in the LFD estimates for proportions.
If total abundances also differed after pathogen growth, however, we may make directional errors in determining the FD.
Suppose that total abundance increased by 2-fold due to pathogen growth; in this case, species whose proportions remained approximately constant would have increased in absolute abundance by around 2-fold.
Bias shifts FD estimates downwards by 2.5-fold to 5.2-fold (depending on host genotype).
Hence, without bias correction, we would instead conclude that these species decreased.

## Vaginal microbiomes of pregnant women

A growing number of MGS studies have used DA analysis to describe associations between members of the vaginal microbiome and health conditions including urinary tract infections, sexually-transmitted infections, bacterial vaginosis, and preterm birth.
However, these associations often vary between studies.
DA analyses of the vaginal microbiome are commonly based on proportions, creating an opportunity for taxonomic bias to impact results.
Substantial taxonomic bias has been experimentally demonstrated in MGS protocols applied to vaginal samples or in vitro samples of vaginally-associated species (@yuan2012eval, @brooks2015thet, @gill2016eval, @graspeuntner2018sele).
The different biases of different MGS protocols have been proposed as one explanation for discrepancies in DA results across studies (@callahan2017repl).

As part of the Multi-Omic Microbiome Study: Pregnancy Initiative (MOMS-PI) study, @fettweis2019thev collected longitudinal samples from over 1500 pregnant women, including nearly 600 that were taxonomically profiled by amplicon sequencing of the 16S V1-V3 hypervariable region.
The taxonomic bias of the MOMS-PI MGS protocol was previously quantified by @mclaren2019cons using control measurements by @brooks2015thet of cellular mock communities of seven common, clinically-relevant vaginal bacterial species.
Of these, _Lactobacillus iners_ had the highest efficiency, which was nearly 30-fold larger than that of the species with the lowest efficiency, _Gardnerella vaginalis_.
A second _Lactobacillus_ species, _L. crispatus_, had an efficiency that was approximately 2-fold less than _L. iners_ and 15-fold greater than _G. vaginalis_.
These species, along with the unculturable _Lachnospiraceae BVAB1_, are most frequently the most abundant species in a sample and commonly reach high proportions of over 70% 
([SI Analysis](https://mikemc.github.io/differential-abundance-theory/notebook/posts/2021-11-01-momspi-summary/)).
Therefore, shifts between these dominant species could drive large changes in the sample mean efficiency, which might in turn distort DA analyses.

We sought to understand the potential impact of bias on DA analysis of vaginal microbiome profiles from the MOMS-PI study.
We first examined variation in the mean efficiency across samples under the assumption that bias in the MOMS-PI study was accurately represented by the @brooks2015thet control measurements.
Using taxonomic relatedness to impute the efficiencies of species not in the controls, we were able to calibrate the MOMS-PI profiles and estimate the mean efficiency of each sample
([SI Analysis](https://mikemc.github.io/differential-abundance-theory/notebook/posts/2021-11-01-momspi-summary/)).
The mean efficiency varies substantially across vaginal samples (Figure \@ref(fig:momspi-mean-efficiency-dist)), with samples in which a _Lactobacillus_ species is most abundant typically having a mean efficiency that is 3-fold to 20-fold greater than samples in which _G. vaginalis_ is most abundant.
The vaginal microbiome sometimes shifted between _Lactobacillus_-dominance and _Gardnerella_-dominance between consecutive visits in individual women (SI Figure \@ref(fig:momspi-mean-efficiency-fcs)), which caused magnitude and direction errors in the trajectories of lower-abundance species (SI Figure \@ref(fig:momspi-trajectory)).

<!-- begin figure -->

```{r momspi-mean-efficiency-dist, fig.cap = '(ref:cap-momspi-mean-efficiency-dist)'}
fs::path(
  "notebook/_posts/2021-11-01-momspi-summary/momspi-summary_files", 
  "figure-html5/momspi-mean-efficiency-dist-1.svg"
  ) %>%
  include_svg
```

(ref:cap-momspi-mean-efficiency-dist) **The mean efficiency in vaginal samples from the MOMS-PI study varies with the most abundant species.**

<!-- end figure -->

These observations suggest that a proportion-based DA analysis of a health condition that is associated with _Lactobacillus_ and/or _Gardnerella_ would be prone to spurious results.
To demonstrate this possibility, we performed a DA analysis of the MOMS-PI samples with respect to a simulated health condition
([SI Analysis](https://mikemc.github.io/differential-abundance-theory/notebook/posts/2021-12-15-momspi-regression-diversity)).
Bacterial vaginosis (BV) has been repeatedly found to be associated with a dramatically reduced proportion of _Lactobacillus spp._, an increased proportion of _G. vaginalis_, and higher within-community (alpha) diversity (@srinivasan2012bact, @cartwright2018mult).
As a proxy for clinical BV, we split samples into high diversity (BV-like), intermediate diversity, and low diversity (non-BV-like) groups based on Shannon diversity in observed (uncalibrated) microbiome profiles.
The mean efficiency was lower in the high-diversity BV-like samples (mean LFD of -1.6) due to a lower abundance of _Lactobacillus spp._.
We estimated the LFD in proportion from non-BV-like to BV-like samples for 30 prevalent bacterial species, with and without bias correction, using gamma-Poisson (or negative-binomial) regression of the observed read counts to fit a linear model to log species proportions.
We accounted for bias by including an offset term in the linear model equal to the log ratio of the species' efficiency to the sample mean efficiency.
As expected, bias correction reduced the estimated LFD from non-BV-like to BV-like groups for every species tested, though the importance of this effect varied among species (SI Figure \@ref(fig:momspi-regression)).

## Human gut microbiomes

We wondered whether DA studies of the human gut microbiome might be less affected by bias than studies of the vaginal microbiome, due to the greater alpha diversity observed within gut microbiome samples.
Recall that the mean efficiency is an abundance-weighted average over the species in a sample.
Thus, all else equal, we expect the mean efficiency to vary less across samples from ecosystems where samples have higher alpha (within-sample) diversity, as is the case for stool samples relative to vaginal samples (@huttenhower2012stru).
On the other hand, the proportion of individual species may also vary less less across samples when alpha diversity is higher, in which case the effect of bias on DA results need not diminish.

```{r gut-stats, message = FALSE}
hmp_div_stats <- fs::path(
  "notebook/_posts/2022-01-30-hmp-stool-vagina-comparison/_output/",
  "div_q2_stats.csv"
) %>%
  read_csv(col_types = 'cddd') %>%
  metacal::as_matrix(rownames = body_site)

hmp_me_stats <- fs::path(
  "notebook/_posts/2022-01-30-hmp-stool-vagina-comparison/_output/",
  "reps_me_summ_diff.csv"
) %>%
  read_csv(col_types = 'cddd') %>%
  metacal::as_matrix(rownames = sim_type)

hmp_species_stats <- fs::path(
  "notebook/_posts/2022-01-30-hmp-stool-vagina-comparison/_output/",
  'species_gsd_stats.csv'
) %>%
  read_csv(col_types = 'dddd')
hmp_species_diff <- hmp_species_stats %>%
  summarize(diff = metacal::gm_mean(vagina_stool)) %>%
  pull(1)
```

To investigate this question, we analyzed bacterial species profiles of vaginal and stool samples derived from shotgun sequencing in the Human Microbiome Project 
(@huttenhower2012stru, [SI Analysis](https://mikemc.github.io/differential-abundance-theory/notebook/posts/2022-01-30-hmp-stool-vagina-comparison/)).
On average, stool profiles had substantially greater order-2 alpha diversity (Inverse Simpson index) than vaginal profiles (SI Figure \@ref(fig:gut)A; geometric mean (GM) $\md$ geometric standard deviation (GSD) of `r hmp_div_stats['stool', 'gm'] %>% round(1)` $\md$ `r hmp_div_stats['stool', 'gsd'] %>% round(1)` in stool samples and `r hmp_div_stats['vagina', 'gm'] %>% round(1)` $\md$ `r hmp_div_stats['vagina', 'gsd'] %>% round(1)` in vaginal samples).
To assess the potential importance of bias for proportion-based DA analyses in the two ecosystems, we quantified the multiplicative variation in the mean efficiency across samples for a large number of possible taxonomic biases under the assumption that the measured profiles reflected the truth.
Across simulation replicates, the GSD in the mean efficiency was typically lower in stool than in vaginal samples (SI Figure \@ref(fig:gut)B; ratio of GSD in vaginal samples to GSD in stool samples of `r hmp_me_stats['iid', 'gm'] %>% round(1)` $\md$ `r hmp_me_stats['iid', 'gsd'] %>% round(1)` across 1000 simulations).
Notably, however, the multiplicative variation in species also tended to be lower (SI Figure \@ref(fig:gut)C); the GSD in the proportion of a gut species across gut samples was an average of `r hmp_species_diff %>% round(2)`-fold lower than that of a vaginal species across vaginal samples.
Recall that the importance of bias estimation of the FD in a species' proportion depends on GSD in the mean efficiency relative the GSD in the species' proportion (Section \@ref(differential-abundance)).
Thus these results suggest that although the mean efficiency likely varies less across stool than vaginal samples, bias may be just as problematic for proportion-based DA analyses.

## Microbial growth in marine sediments

The surface layers of marine sediments harbor diverse and abundant microbiomes.
Total cell density and species richness decrease with depth as resources are consumed in older, deeper sediment layers; however, some taxa are able to grow and increase in density as they are slowly buried.
@lloyd2020evid performed a systematic assessment of growth rates of bacterial and archaeal taxa over a depth of 10 cm (corresponding to ~40 years of burial time) in sediment of the White Oak River estuary.
Taxa proportions were measured with 16S amplicon sequencing and total community densities were measured by counting cells using epifluorescence microscopy.
The authors then calculated cell concentrations for each taxon by multiplying the taxon’s proportion by the total concentration, a method they referred to as FRAxC measurements for 'fraction of 16S reads times total cell counts'.
Taxon-specific growth rates were inferred from the slope of a simple linear regression of the log of the calculated concentration of each taxon against burial time over the first 3 cm below the bioirrigation layer (corresponding to ~8 years of burial).

The FRAxC concentration measurements made by @lloyd2020evid correspond to the total-abundance approach to inferring absolute abundance (Equation \@ref(eq:density-prop-meas)).
Accordingly, taxonomic bias could lead to systematic error in FRAxC-derived growth rates if sample mean efficiency tends to systematically vary with burial time.
Such systematic variation could occur if microbes with tougher cell walls both persist longer in the sediment and are more difficult to extract DNA from than microbes with weaker cell walls.
In this scenario, the relative abundance of tougher species will increase with burial time, as the cells of weaker species degrade.
This shifting composition will cause the sample mean efficiency to decrease with time, leading to inflated growth-rate estimates for all taxa.
Taxa that decay with depth sufficiently slowly would mistakenly be inferred to have positive growth from the FRAxC-calculated abundances.

Importantly, @lloyd2020evid also used qPCR to measure the absolute abundance of two taxa from these same samples.
By comparing qPCR to FRAxC growth rates for these taxa, we can estimate the systematic error in FRAxC growth rates ([SI Analysis](https://mikemc.github.io/differential-abundance-theory/notebook/posts/2022-01-08-lloyd2020evi-case-study/))
Because the systematic error that bias causes in the regression slope is the same for each taxon (Section \@ref(differential-abundance)), these comparisons allow us to draw conclusions about the accuracy of FRAxC growth rates for all taxa.

```{r}
x <- fs::path(
  'notebook/_posts/2022-01-08-lloyd2020evi-case-study/_output/table1-tidy.rds'
) %>% readRDS
bathy_core30_fraxc <- x %>%
  filter(exp_name == 'Core 30', str_detect(clade, 'Bathyarchaeota'), 
    meas_type == 'FRAxC') %>%
  pull(doubling_rate)
bathy_core30_qpcr <- x %>%
  filter(exp_name == 'Core 30', str_detect(clade, 'Bathyarchaeota'), 
    meas_type == 'qPCR') %>%
  pull(doubling_rate)
bathy_core32_fraxc <- x %>%
  filter(exp_name == 'Core 32', str_detect(clade, 'Bathyarchaeota'), 
    meas_type == 'FRAxC') %>%
  pull(doubling_rate)
bathy_core32_qpcr <- x %>%
  filter(exp_name == 'Core 32', str_detect(clade, 'Bathyarchaeota'), 
    meas_type == 'qPCR') %>%
  pull(doubling_rate)
mbgd_core32_fraxc <- x %>%
  filter(exp_name == 'Core 32', str_detect(clade, 'MBG-D'), 
    meas_type == 'FRAxC') %>%
  pull(doubling_rate)
mbgd_core32_qpcr <- x %>%
  filter(exp_name == 'Core 32', str_detect(clade, 'MBG-D'), 
    meas_type == 'qPCR') %>%
  pull(doubling_rate)
```

The first soil core included qPCR measurements of a single archaeal taxon, _Bathyarchaeota_, for which growth rates by qPCR and FRAxC were nearly identical (doubling rates of `r round(bathy_core30_fraxc, 3)`/yr by FRAxC and `r round(bathy_core30_qpcr, 3)`/yr by qPCR).
The second soil core included qPCR measurements of _Bathyarchaeota_ and a second taxon, _Thermoprofundales_/MBG-D.
In this core, FRAxC and qPCR growth rates differed more substantially, with growth rates from FRAxC being larger by `r round(bathy_core32_fraxc - bathy_core32_qpcr, 3)`/yr for _Bathyarchaeota_ (`r round(bathy_core32_fraxc, 3)`/yr by FRAxC and `r round(bathy_core32_qpcr, 3)`/yr by qPCR) and by `r round(mbgd_core32_fraxc - mbgd_core32_qpcr, 3)`/yr for _Thermoprofundales_/MBG-D (`r round(mbgd_core32_fraxc, 3)`/yr by FRAxC and `r round(mbgd_core32_qpcr, 3)`/yr by qPCR).
Uncertainty in the FRAxC- and qPCR-derived growth rates is not reported and is likely substantial; however, the fact that the FRAxC-derived rates are larger than qPCR-derived rates in all three cases is consistent with our hypothesis that mean efficiency decreased with depth in a manner that systematically biased FRAxC-derived rates to higher values.
The differences in growth rate are small in absolute terms; however, the maximum observed difference of 0.086/yr suggests an error large enough to impact results for some taxa classified as positive growers, whose FRAxC growth rates ranged between 0.04/yr and 0.5/yr.
In summary, our comparison between FRAxC and qPCR measurements supports the overall study conclusion that many taxa did grow following sediment burial; however, we should remain uncertain as to whether species with small, positive FRAxC-derived growth rates were in fact growing or rather were slowly declining in abundance.

## Summary and discussion

The effect that consistent taxonomic bias has on proportion-based DA analyses depends on the MGS protocol, the biological system, and the sample comparisons under investigation.
Although our case studies explore a limited range of possibilities, some general patterns stand out. 

In several cases, the mean efficiency remained stable across samples, so that LFD esimates were unaffected by bias.
In the pre-infection fungal microbiome samples of @leopold2020host, we observed that the mean efficiency remained stable despite substantial bias and large multiplicative variation in species proportions among samples.
Because the variation in the mean efficiency was much less than that of the individual species, the impact of bias on DA analysis was negligible.
In the vaginal case study, we also observed that the mean efficiency was relative stable across vaginal microbiomes that were dominated by the same species, despite substantial bias and large LFDs in non-dominant species.
In both cases, the stability of the mean efficiency can be understood by the fact that it is an additive average over species and so is primarily determined by the most abundant species in the sample.

Yet we also observed cases where the mean efficiency varied substantially.
In several cases, the (log) mean efficiency co-varied sufficiently strongly with the condition of interest to cause substantial systematic errors in LFD estimates.
Examples include the comparison of foliar fungal microbiomes pre- and post-infection, for which the mean efficiency substantially increased post-infection, and the comparison of vaginal microbiomes with low versus high diversity in the MOMS-PI study, for which the mean efficiency was typically lower in high-diversity samples.
Our analysis of marine sediment communities is consistent with a systematic decline of the mean efficiency with burial time that is sufficient to substantially inflate the estimated growth rates of slowly changing species.
The FD in the mean efficiency is bounded by the largest FD of any species' proportion.
Therefore, the error in the estimated LFDs tend to be practically significant only for species whose LFDs are substantially smaller in magnitude than the largest LFD.

Variation in the mean efficiency may also be substantial, yet unassociated with the condition of interest.
Although we did not directly observe this scenario directly, we have reason to think that it may be common in real microbiome studies.
Our simulation analysis of gut microbiomes suggest that even in diverse ecosystems, bias will often cause multiplicative variation in the mean efficiency that is comparable to that of individual species.
This variation is much less problematic for DA results when it not associated with the condition under study, since any loss in precision can in principle be offset by an increase in sample size.
For certain applications, however, it will be important to remember that the LFDs between individual samples remain unreliable.

Under what conditions should we expect the problematic third scenario?
Our case studies suggest one prominent mechanism for causing systematic variation in the mean efficiency: the existence of one or more species with unusually high or low efficiencies that tend to dominate the community more often in one sample condition than another.
This mechanism was responsible for the negative effect of bias in our DA analyses of foliar fungal microbiomes following infection and of vaginal microbiomes with low versus high diversity.
In experimental systems where no one species frequently forms a large fraction of the community, systematic variation of the mean efficiency can still arise through the collective change of many species that are associated with the condition of interest.
We described a plausible instance in the marine-sediment case study, where increased burial time selects for lysis-resistant (and so lower efficiency) species.
As another example, treatment with an antibiotic might selectively kill Gram negative species, which also tend to be easier to lyse, thereby decreasing the mean efficiency in fecal samples collected after treatment.
A generic mechanism which might spawn such condition-efficiency associations stems from the evolutionary relationships among species.
Species with recent common ancestry are expected to share a number of traits that determine measurement efficiency, including cell wall structure, genome size, ribosomal copy number, and PCR binding sequence.
They are also expected to share traits related to the condition of interest.
If related species tend to have similar efficiencies and similar associations with the condition, then they may drive an association of the mean efficiency with the condition.

These observations provide reasons to both worry and hope.
It seems likely that in many studies, the mean efficiency is consistent (first scenario) or is at least not associated with the condition (second scenario), so LFD estimates remain accurate (or at least not overconfident).
Yet it is not obvious which scenario any study falls into.
There are plausible mechanisms leading to systematic variation of the mean efficiency even in ecosystems with high species diversity.
Moreover, our explorations suggest that random variation in the mean efficiency is common and distorts comparisons between individual samples.
Thus while we should not discount the large set of existing DA results, we should seek ways to better assess the robustness of results from previous studies and to measure and correct the error caused by bias in future ones.