Skip to content

Commit

Permalink
Add ability to control taxon order in merge functions
Browse files Browse the repository at this point in the history
Also:
- Added testing of output taxon order in various circumstances
- Removed call to `exists()` in `merge_taxa_vec()` to fix #31
- Add phyloseq's tax_glom tests
- Added Imports and Suggests for r package check
  • Loading branch information
mikemc committed May 29, 2020
1 parent 0a6f8de commit 1859b10
Show file tree
Hide file tree
Showing 10 changed files with 510 additions and 172 deletions.
4 changes: 3 additions & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
Package: speedyseq
Title: Faster implementations of phyloseq functions
Version: 0.1.2.9007
Version: 0.1.2.9008
Authors@R:
person(given = "Michael",
family = "McLaren",
Expand All @@ -16,6 +16,7 @@ Depends:
phyloseq
Imports:
ape,
Biostrings,
castor,
data.table,
dplyr,
Expand All @@ -30,6 +31,7 @@ RoxygenNote: 7.1.0
Suggests:
cowplot,
DECIPHER,
forcats,
ggtree,
phangorn,
testthat
43 changes: 37 additions & 6 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,19 +1,32 @@
# speedyseq (development version)

* Fixed bug that applied to taxonomic merge functions when an object named
`new_tax_mat` exists outside the function environment; described in [Issue
#31](https://github.com/mikemc/speedyseq/issues/31)

* Changed default ordering of new taxa output by taxonomic merging functions
and added `reorder` parameter to control this behavior. (Only applies to
phylogenetic objects without trees.)

* New `merge_taxa_vec()` function provides a vectorized version of
`phyloseq::merge_taxa()` for `phyloseq` and `otu_table` objects.
`phyloseq::merge_taxa()`

* New `tip_glom()` function provides a speedy version of
`phyloseq::tip_glom()` for indirect phylogenetic merging of taxa.

* New `tree_glom()` function performs direct phylogenetic merging of taxa. This
function is much faster and arguably more intuitive than `tip_glom()`.

* Merging / glom functions now work on relevant phyloseq components as well as
phyloseq objects

* Adds dependencies
[castor](https://cran.r-project.org/web/packages/castor/index.html) and
[purrr](https://purrr.tidyverse.org/)

## New general-purpose vectorized merging function
## New features

### New general-purpose vectorized merging function

Phyloseq's `merge_taxa()` takes a phyloseq object or component object `x` and a
set of taxa `eqtaxa` and merges them into a single taxon. In place of the
Expand Down Expand Up @@ -68,10 +81,28 @@ ps <- prune_taxa(sample(taxa_names(GlobalPatterns), 2e2), GlobalPatterns)
ps1 <- tip_glom(ps, 0.1, tax_adjust = 1)
ps2 <- tip_glom(ps, 0.1, tax_adjust = 2)
tax_table(ps1)[c(108, 136, 45),]
#> Taxonomy Table: [3 taxa by 7 taxonomic ranks]:
#> Kingdom Phylum Class Order
#> 578831 "Bacteria" "Bacteroidetes" "Sphingobacteria" "Sphingobacteriales"
#> 2801 "Bacteria" "Planctomycetes" "Planctomycea" "Pirellulales"
#> 185581 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Oceanospirillales"
#> Family Genus Species
#> 578831 NA "Niabella" NA
#> 2801 NA "Rhodopirellula" NA
#> 185581 "OM60" NA "marinegammaproteobacteriumHTCC2080"
tax_table(ps2)[c(108, 136, 45),]
#> Taxonomy Table: [3 taxa by 7 taxonomic ranks]:
#> Kingdom Phylum Class Order
#> 578831 "Bacteria" "Bacteroidetes" "Sphingobacteria" "Sphingobacteriales"
#> 2801 "Bacteria" "Planctomycetes" "Planctomycea" "Pirellulales"
#> 185581 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Oceanospirillales"
#> Family Genus Species
#> 578831 NA NA NA
#> 2801 NA NA NA
#> 185581 "OM60" NA NA
```

## Speedy `tip_glom()` for indirect phylogenetic merging
### Speedy `tip_glom()` for indirect phylogenetic merging

Phyloseq provides `tip_glom()` to perform a form of indirect phylogenetic
merging using the phylogenetic tree in `phy_tree(physeq)`. This function uses
Expand All @@ -81,12 +112,12 @@ dendrogram produced by the clustering at a user defined height. Phyloseq's
version can be slow and memory intensive when the number of taxa is large.

Speedyseq's new `tip_glom()` function provides a faster and less
memory-intensive alternative to `phyloseq::tip_glom() through the use of
memory-intensive alternative to `phyloseq::tip_glom()` through the use of
vectorized merging (via `merge_taxa_vec()`) and faster and lower-memory
phylogenetic-distance computation (via `get_all_pairwise_distances()` from the
[castor](https://cran.r-project.org/web/packages/castor/index.html) package).

Speedyseq's `tax_glom()` also has the new `tax_adjust` argument, which is
Speedyseq's `tip_glom()` also has the new `tax_adjust` argument, which is
passed on to `merge_taxa_vec()`. It is set to `1` by default for phyloseq
compatibility and should give identical results to phyloseq in this case.

Expand All @@ -97,7 +128,7 @@ using the `hclust` function from base R with the `method == "average"` option.
Speedyseq's `tip_glom()` currently only works on phyloseq objects and will give
an error if used on a phylo (tree) object.

## Direct phylogenetic merging with `tree_glom()`
### Direct phylogenetic merging with `tree_glom()`

It might be desirable in many cases to perform phylogenetic merging based
directly on the phylogenetic tree rather than (as in `tip_glom()`) a dendrogram
Expand Down
31 changes: 21 additions & 10 deletions R/agglomeration.R
Original file line number Diff line number Diff line change
Expand Up @@ -13,14 +13,19 @@
#' be replaced with \code{NA}, because they should be meaningless following
#' agglomeration.
#'
#' This is the speedyseq reimplementation of phyloseq's tax_glom function. It
#' is designed to produce identical results but this has not been thoroughly
#' tested. Please report any discrepancies.
#' This is the speedyseq reimplementation of `phyloseq::tax_glom()`. It should
#' produce results that are identical to phyloseq up to taxon order.
#'
#' Documentation and general strategy derived from `phyloseq::tax_glom()`.
#' If `x` is a phyloseq object with a phylogenetic tree, then the new taxa will
#' be ordered as they are in the tree. Otherwise, the taxa order can be
#' controlled by the `reorder` argument, which behaves like the `reorder`
#' argument in \code{\link[base]{rowsum}}. `reorder = FALSE` will keep taxa in
#' the original order determined by when the member of each group first appears
#' in `taxa_names(x)`; `reorder = TRUE` will order new taxa alphabetically
#' according to taxonomy (string of concatenated rank values).
#'
#' @usage tax_glom(physeq, taxrank=rank_names(physeq)[1], NArm=TRUE,
#' bad_empty=c(NA, "", " ", "\t"))
#' Acknowledgements: Documentation and general strategy derived from
#' `phyloseq::tax_glom()`.
#'
#' @param physeq (Required). \code{\link{phyloseq-class}} or
#' \code{\link{tax_table}}.
Expand Down Expand Up @@ -48,6 +53,9 @@
#' agglomeration will not combine taxa according to the presence of these
#' values in \code{tax}. Furthermore, the corresponding taxa can be
#' optionally pruned from the output if \code{NArm} is set to \code{TRUE}.
#' @param reorder Logical specifying whether to reorder the taxa by taxonomy
#' strings or keep initial order. Ignored if `physeq` has a phylogenetic
#' tree.
#'
#' @return A taxonomically-agglomerated, optionally-pruned, object with class
#' matching the class of \code{physeq}.
Expand All @@ -74,7 +82,8 @@
tax_glom <- function(physeq,
taxrank = rank_names(physeq)[1],
NArm = TRUE,
bad_empty = c(NA, "", " ", "\t")) {
bad_empty = c(NA, "", " ", "\t"),
reorder = FALSE) {
if (is.null(access(physeq, "tax_table"))) {
stop("`tax_glom()` requires that `physeq` contain a taxonomy table")
}
Expand All @@ -95,7 +104,8 @@ tax_glom <- function(physeq,
function(x) {paste(x, collapse=";")}
)
# Merge taxa with speedyseq's vectorized merging function
physeq <- merge_taxa_vec(physeq, tax_strings, tax_adjust = 0L)
physeq <- merge_taxa_vec(physeq, tax_strings, tax_adjust = 0L,
reorder = reorder)
# "Empty" the taxonomy values to the right of the rank, using NA_character_.
if (rank_idx < length(rank_names(physeq))) {
bad_ranks <- seq(rank_idx + 1, length(rank_names(physeq)))
Expand All @@ -121,11 +131,12 @@ tax_glom <- function(physeq,
#' most abundant taxon (having the largest value of `taxa_sums(physeq)`. The
#' tree and refseq objects are pruned to the archetype taxa.
#'
#' Documentation and general strategy derived from `phyloseq::tip_glom()`.
#'
#' Speedyseq note: \code{\link[stats]{hclust}} is faster than the default
#' `hcfun`; set `method = "average"` to get equivalent clustering.
#'
#' Acknowledgements: Documentation and general strategy derived from
#' `phyloseq::tip_glom()`.
#'
#' @param physeq (Required). A \code{\link{phyloseq-class}}, containing a
#' phylogenetic tree. Alternatively, a phylogenetic tree
#' \code{\link[ape]{phylo}} will also work.
Expand Down
88 changes: 58 additions & 30 deletions R/merge_taxa_vec.R
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,14 @@
#' `otu_table` objects) or the first taxon in each group (all other phyloseq
#' component objects).
#'
#' If `x` is a phyloseq object with a phylogenetic tree, then the new taxa will
#' be ordered as they are in the tree. Otherwise, the taxa order can be
#' controlled by the `reorder` argument, which behaves like the `reorder`
#' argument in \code{\link[base]{rowsum}}. `reorder = FALSE` will keep taxa in
#' the original order determined by when the member of each group first appears
#' in `taxa_names(x)`; `reorder = TRUE` will order new taxa according to their
#' corresponding value in `group`.
#'
#' The `tax_adjust` argument controls the handling of taxonomic disagreements
#' within groups. Setting `tax_adjust == 0` causes no adjustment; the taxonomy
#' of the new group is set to the archetype taxon (see below). Otherwise,
Expand All @@ -21,14 +29,18 @@
#' @param x A phyloseq object or component object
#' @param group a vector with one element for each taxon in `physeq` that
#' defines the new groups. see `base::rowsum()`.
#' @param reorder Logical specifying whether to reorder the taxa by their
#' `group` values. Ignored if `x` has (or is) a phylogenetic tree.
#' @param tax_adjust 0: no adjustment; 1: phyloseq-compatible adjustment; 2:
#' conservative adjustment (see Details)
#' conservative adjustment
#' @export
#'
#' @seealso
#' \code{\link[phyloseq]{merge_taxa}}
#'
#' \code{\link{tax_glom}}
#' \code{\link{tip_glom}}
#' \code{\link{tree_glom}}
#' \code{\link[base]{rowsum}}
#'
#' @rdname merge_taxa_vec-methods
#' @examples
Expand All @@ -48,49 +60,57 @@
#' ps0 <- merge_taxa_vec(
#' ps,
#' group = clusters$cluster,
#' tax_adjust = 2
#' )
#' }
setGeneric("merge_taxa_vec",
function(x,
group,
reorder = FALSE,
tax_adjust = 1L)
standardGeneric("merge_taxa_vec")
)

#' @rdname merge_taxa_vec-methods
setMethod("merge_taxa_vec", "phyloseq",
function(x, group, tax_adjust = 1L) {
function(x, group, reorder = FALSE, tax_adjust = 1L) {
stopifnot(ntaxa(x) == length(group))
stopifnot(tax_adjust %in% c(0L, 1L, 2L))
# Warn the user if an impossible reordering is requested
if (!is.null(x@phy_tree) & reorder) {
warning("Can't reorder taxa if `x` has a `phy_tree`")
reorder <- FALSE
}
# drop taxa with `is.na(group)`
if (anyNA(group)) {
warning("`group` has missing values; corresponding taxa will be dropped")
x <- prune_taxa(!is.na(group), x)
group <- group[!is.na(group)]
}
# Get the merged otu table with new taxa named by most abundant
otu <- merge_taxa_vec(otu_table(x), group)
otu <- merge_taxa_vec(otu_table(x), group, reorder = reorder)
# Adjust taxonomy if necessary
if (!is.null(x@tax_table) & tax_adjust != 0) {
tax <- merge_taxa_vec(tax_table(x), group, tax_adjust = tax_adjust)
tax <- merge_taxa_vec(tax_table(x), group, tax_adjust = tax_adjust,
reorder = reorder)
# Taxa in `tax` are in same order as in `otu` but are named by first in
# group and so need to be renamed
# group instead of max and so need to be renamed
taxa_names(tax) <- taxa_names(otu)
} else {
tax <- NULL
}
# Create the new phyloseq object. Replacing the original otu_table with
# the new, smaller table will automatically prune the taxonomy, tree, and
# refseq to the smaller set of archetypal taxa.
otu_table(x) <- otu
if (exists("tax"))
if (!is.null(tax))
tax_table(x) <- tax
x
}
)

#' @rdname merge_taxa_vec-methods
setMethod("merge_taxa_vec", "otu_table",
function(x, group) {
function(x, group, reorder = FALSE) {
stopifnot(ntaxa(x) == length(group))
# Work with taxa as rows
if (!taxa_are_rows(x)) {
Expand All @@ -113,22 +133,22 @@ setMethod("merge_taxa_vec", "otu_table",
group = group
) %>%
.[, by = group, .(archetype = taxon[which.max(sum)])]
if (reorder)
data.table::setorder(new_names, group)
# Compute new table with base::rowsum(). The call to rowsum() makes the
# rownames the group names; reorder = FALSE keeps the taxa in the same
# order as `new_names`.
otu <- otu_table(rowsum(x, group, reorder = FALSE), taxa_are_rows = TRUE)
if (needs_flip) {
otu <- t(otu)
}
# rownames the group names.
otu <- otu_table(rowsum(x, group, reorder = reorder), taxa_are_rows = TRUE)
stopifnot(all.equal(as.character(new_names$group), taxa_names(otu)))
taxa_names(otu) <- new_names$archetype
if (needs_flip)
otu <- t(otu)
otu
}
)

#' @rdname merge_taxa_vec-methods
setMethod("merge_taxa_vec", "taxonomyTable",
function(x, group, tax_adjust = 1L) {
function(x, group, reorder = FALSE, tax_adjust = 1L) {
stopifnot(ntaxa(x) == length(group))
# drop taxa with `is.na(group)`
if (anyNA(group)) {
Expand All @@ -137,15 +157,17 @@ setMethod("merge_taxa_vec", "taxonomyTable",
group <- group[!is.na(group)]
}
if (tax_adjust == 0L)
return(merge_taxa_vec_pseudo(x, group))
return(merge_taxa_vec_pseudo(x, group, reorder = reorder))
else if (tax_adjust == 1L)
na_bad <- FALSE
else if (tax_adjust == 2L)
na_bad <- TRUE
k <- length(rank_names(x))
# bad_string is used to temporarily mark bad values in the tax table
bad_string <- paste0("BAD", Sys.time())
tax <- x %>%
# Reduce each group to one row; sort if needed; then finish flushing bad
# ranks and making new tax table
reduced <- x %>%
as("matrix") %>%
data.table::as.data.table(keep.rownames = "taxon") %>%
.[, group := group] %>%
Expand All @@ -157,14 +179,16 @@ setMethod("merge_taxa_vec", "taxonomyTable",
.(taxon = taxon[1]),
lapply(.SD, bad_or_unique, bad = bad_string)
)
] %>%
]
if (reorder)
data.table::setorder(reduced, group)
reduced %>%
.[, !"group"] %>%
# Propagate bad ranks downwards and convert to NAs
tibble::column_to_rownames("taxon") %>%
apply(1, bad_flush_right, bad = bad_string, na_bad = na_bad, k = k) %>%
t %>%
tax_table
return(tax)
}
)

Expand All @@ -175,18 +199,20 @@ setMethod("merge_taxa_vec", "phylo",

#' @rdname merge_taxa_vec-methods
setMethod("merge_taxa_vec", "XStringSet",
function(x, group) {merge_taxa_vec_pseudo(x, group)}
function(x, group, reorder = FALSE) {
merge_taxa_vec_pseudo(x, group, reorder = reorder)
}
)

#' Pseudo taxa merging for phylo and XStringSet objects
#' Pseudo-merge taxa in groups
#'
#' Returns `x` pruned to the first taxon of each group defined in `group`.
#'
#' @param x a phylo or XStringSet object
#' @param group a vector with one element for each taxon in `physeq` that
#' defines the new groups
#' @param x a phyloseq component-class object
#' @param group a vector with one element for each taxon in `x` that defines
#' the new groups
#' @keywords internal
merge_taxa_vec_pseudo <- function(x, group) {
merge_taxa_vec_pseudo <- function(x, group, reorder = FALSE) {
stopifnot(ntaxa(x) == length(group))
# drop taxa with `is.na(group)`
if (anyNA(group)) {
Expand All @@ -195,10 +221,12 @@ merge_taxa_vec_pseudo <- function(x, group) {
group <- group[!is.na(group)]
}
# Archetypes are the first taxon in each group
archetypes <- data.table::data.table(taxon = taxa_names(x), group = group) %>%
.[, by = group, .(taxon = taxon[1])] %>%
.$taxon
prune_taxa(archetypes, x)
archetypes <- data.table::data.table(taxon = taxa_names(x),
group = group) %>%
.[, by = group, .(taxon = taxon[1])]
if (reorder)
data.table::setorder(archetypes, group)
select_taxa(x, archetypes$taxon, reorder = TRUE)
}

# helper functions ------------------------------------------------------------
Expand Down
Loading

0 comments on commit 1859b10

Please sign in to comment.