Add ability to control taxon order in merge functions

Also: - Added testing of output taxon order in various circumstances - Removed call to `exists()` in `merge_taxa_vec()` to fix #31 - Add phyloseq's tax_glom tests - Added Imports and Suggests for r package check
mikemc · May 29, 2020 · 1859b10 · 1859b10
1 parent 0a6f8de
commit 1859b10
Show file tree

Hide file tree

Showing 10 changed files with 510 additions and 172 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: speedyseq
 Title: Faster implementations of phyloseq functions
-Version: 0.1.2.9007
+Version: 0.1.2.9008
 Authors@R: 
     person(given = "Michael",
            family = "McLaren",
@@ -16,6 +16,7 @@ Depends:
     phyloseq
 Imports: 
     ape,
+    Biostrings,
     castor,
     data.table,
     dplyr,
@@ -30,6 +31,7 @@ RoxygenNote: 7.1.0
 Suggests: 
     cowplot,
     DECIPHER,
+    forcats,
     ggtree,
     phangorn,
     testthat
diff --git a/NEWS.md b/NEWS.md
@@ -1,19 +1,32 @@
 # speedyseq (development version)
 
+* Fixed bug that applied to taxonomic merge functions when an object named
+  `new_tax_mat` exists outside the function environment; described in [Issue
+  #31](https://github.com/mikemc/speedyseq/issues/31)
+
+* Changed default ordering of new taxa output by taxonomic merging functions
+  and added `reorder` parameter to control this behavior. (Only applies to
+  phylogenetic objects without trees.)
+
 * New `merge_taxa_vec()` function provides a vectorized version of
-  `phyloseq::merge_taxa()` for `phyloseq` and `otu_table` objects.
+  `phyloseq::merge_taxa()`
 
 * New `tip_glom()` function provides a speedy version of
   `phyloseq::tip_glom()` for indirect phylogenetic merging of taxa.
 
 * New `tree_glom()` function performs direct phylogenetic merging of taxa. This
   function is much faster and arguably more intuitive than `tip_glom()`.
 
+* Merging / glom functions now work on relevant phyloseq components as well as
+  phyloseq objects
+
 * Adds dependencies
   [castor](https://cran.r-project.org/web/packages/castor/index.html) and
   [purrr](https://purrr.tidyverse.org/)
 
-## New general-purpose vectorized merging function
+## New features
+
+### New general-purpose vectorized merging function
 
 Phyloseq's `merge_taxa()` takes a phyloseq object or component object `x` and a
 set of taxa `eqtaxa` and merges them into a single taxon. In place of the
@@ -68,10 +81,28 @@ ps <- prune_taxa(sample(taxa_names(GlobalPatterns), 2e2), GlobalPatterns)
 ps1 <- tip_glom(ps, 0.1, tax_adjust = 1)
 ps2 <- tip_glom(ps, 0.1, tax_adjust = 2)
 tax_table(ps1)[c(108, 136, 45),]
+#> Taxonomy Table:     [3 taxa by 7 taxonomic ranks]:
+#>        Kingdom    Phylum           Class                 Order               
+#> 578831 "Bacteria" "Bacteroidetes"  "Sphingobacteria"     "Sphingobacteriales"
+#> 2801   "Bacteria" "Planctomycetes" "Planctomycea"        "Pirellulales"      
+#> 185581 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Oceanospirillales" 
+#>        Family Genus            Species                             
+#> 578831 NA     "Niabella"       NA                                  
+#> 2801   NA     "Rhodopirellula" NA                                  
+#> 185581 "OM60" NA               "marinegammaproteobacteriumHTCC2080"
 tax_table(ps2)[c(108, 136, 45),]
+#> Taxonomy Table:     [3 taxa by 7 taxonomic ranks]:
+#>        Kingdom    Phylum           Class                 Order               
+#> 578831 "Bacteria" "Bacteroidetes"  "Sphingobacteria"     "Sphingobacteriales"
+#> 2801   "Bacteria" "Planctomycetes" "Planctomycea"        "Pirellulales"      
+#> 185581 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Oceanospirillales" 
+#>        Family Genus Species
+#> 578831 NA     NA    NA     
+#> 2801   NA     NA    NA     
+#> 185581 "OM60" NA    NA     
 ```
 
-## Speedy `tip_glom()` for indirect phylogenetic merging
+### Speedy `tip_glom()` for indirect phylogenetic merging
 
 Phyloseq provides `tip_glom()` to perform a form of indirect phylogenetic
 merging using the phylogenetic tree in `phy_tree(physeq)`. This function uses
@@ -81,12 +112,12 @@ dendrogram produced by the clustering at a user defined height. Phyloseq's
 version can be slow and memory intensive when the number of taxa is large.
 
 Speedyseq's new `tip_glom()` function provides a faster and less
-memory-intensive alternative to `phyloseq::tip_glom() through the use of
+memory-intensive alternative to `phyloseq::tip_glom()` through the use of
 vectorized merging (via `merge_taxa_vec()`) and faster and lower-memory
 phylogenetic-distance computation (via `get_all_pairwise_distances()` from the
 [castor](https://cran.r-project.org/web/packages/castor/index.html) package).
 
-Speedyseq's `tax_glom()` also has the new `tax_adjust` argument, which is
+Speedyseq's `tip_glom()` also has the new `tax_adjust` argument, which is
 passed on to `merge_taxa_vec()`. It is set to `1` by default for phyloseq
 compatibility and should give identical results to phyloseq in this case.
 
@@ -97,7 +128,7 @@ using the `hclust` function from base R with the `method == "average"` option.
 Speedyseq's `tip_glom()` currently only works on phyloseq objects and will give
 an error if used on a phylo (tree) object.
 
-## Direct phylogenetic merging with `tree_glom()`
+### Direct phylogenetic merging with `tree_glom()`
 
 It might be desirable in many cases to perform phylogenetic merging based
 directly on the phylogenetic tree rather than (as in `tip_glom()`) a dendrogram

diff --git a/R/agglomeration.R b/R/agglomeration.R
@@ -13,14 +13,19 @@
 #' be replaced with \code{NA}, because they should be meaningless following
 #' agglomeration.
 #'
-#' This is the speedyseq reimplementation of phyloseq's tax_glom function. It
-#' is designed to produce identical results but this has not been thoroughly
-#' tested. Please report any discrepancies.
+#' This is the speedyseq reimplementation of `phyloseq::tax_glom()`. It should
+#' produce results that are identical to phyloseq up to taxon order. 
 #'
-#' Documentation and general strategy derived from `phyloseq::tax_glom()`.
+#' If `x` is a phyloseq object with a phylogenetic tree, then the new taxa will
+#' be ordered as they are in the tree. Otherwise, the taxa order can be
+#' controlled by the `reorder` argument, which behaves like the `reorder`
+#' argument in \code{\link[base]{rowsum}}. `reorder = FALSE` will keep taxa in
+#' the original order determined by when the member of each group first appears
+#' in `taxa_names(x)`; `reorder = TRUE` will order new taxa alphabetically
+#' according to taxonomy (string of concatenated rank values).
 #'
-#' @usage tax_glom(physeq, taxrank=rank_names(physeq)[1], NArm=TRUE,
-#'   bad_empty=c(NA, "", " ", "\t"))
+#' Acknowledgements: Documentation and general strategy derived from
+#' `phyloseq::tax_glom()`.
 #'
 #' @param physeq (Required). \code{\link{phyloseq-class}} or
 #'   \code{\link{tax_table}}.
@@ -48,6 +53,9 @@
 #'   agglomeration will not combine taxa according to the presence of these
 #'   values in \code{tax}. Furthermore, the corresponding taxa can be
 #'   optionally pruned from the output if \code{NArm} is set to \code{TRUE}.
+#' @param reorder Logical specifying whether to reorder the taxa by taxonomy
+#'   strings or keep initial order. Ignored if `physeq` has a phylogenetic
+#'   tree.
 #' 
 #' @return A taxonomically-agglomerated, optionally-pruned, object with class
 #' matching the class of \code{physeq}.
@@ -74,7 +82,8 @@
 tax_glom <- function(physeq, 
                      taxrank = rank_names(physeq)[1],
                      NArm = TRUE, 
-                     bad_empty = c(NA, "", " ", "\t")) {
+                     bad_empty = c(NA, "", " ", "\t"),
+                     reorder = FALSE) {
   if (is.null(access(physeq, "tax_table"))) {
     stop("`tax_glom()` requires that `physeq` contain a taxonomy table")
   }
@@ -95,7 +104,8 @@ tax_glom <- function(physeq,
     function(x) {paste(x, collapse=";")}
   )
   # Merge taxa with speedyseq's vectorized merging function
-  physeq <- merge_taxa_vec(physeq, tax_strings, tax_adjust = 0L)
+  physeq <- merge_taxa_vec(physeq, tax_strings, tax_adjust = 0L, 
+    reorder = reorder)
   # "Empty" the taxonomy values to the right of the rank, using NA_character_.
   if (rank_idx < length(rank_names(physeq))) {
     bad_ranks <- seq(rank_idx + 1, length(rank_names(physeq)))
@@ -121,11 +131,12 @@ tax_glom <- function(physeq,
 #' most abundant taxon (having the largest value of `taxa_sums(physeq)`. The
 #' tree and refseq objects are pruned to the archetype taxa.
 #'
-#' Documentation and general strategy derived from `phyloseq::tip_glom()`.
-#'
 #' Speedyseq note: \code{\link[stats]{hclust}} is faster than the default
 #' `hcfun`; set `method = "average"` to get equivalent clustering.
 #'
+#' Acknowledgements: Documentation and general strategy derived from
+#' `phyloseq::tip_glom()`.
+#'
 #' @param physeq (Required). A \code{\link{phyloseq-class}}, containing a
 #'   phylogenetic tree. Alternatively, a phylogenetic tree
 #'   \code{\link[ape]{phylo}} will also work.

diff --git a/R/merge_taxa_vec.R b/R/merge_taxa_vec.R
@@ -8,6 +8,14 @@
 #' `otu_table` objects) or the first taxon in each group (all other phyloseq
 #' component objects).
 #'
+#' If `x` is a phyloseq object with a phylogenetic tree, then the new taxa will
+#' be ordered as they are in the tree. Otherwise, the taxa order can be
+#' controlled by the `reorder` argument, which behaves like the `reorder`
+#' argument in \code{\link[base]{rowsum}}. `reorder = FALSE` will keep taxa in
+#' the original order determined by when the member of each group first appears
+#' in `taxa_names(x)`; `reorder = TRUE` will order new taxa according to their
+#' corresponding value in `group`.
+#'
 #' The `tax_adjust` argument controls the handling of taxonomic disagreements
 #' within groups. Setting `tax_adjust == 0` causes no adjustment; the taxonomy
 #' of the new group is set to the archetype taxon (see below). Otherwise,
@@ -21,14 +29,18 @@
 #' @param x A phyloseq object or component object 
 #' @param group a vector with one element for each taxon in `physeq` that
 #'   defines the new groups. see `base::rowsum()`.
+#' @param reorder Logical specifying whether to reorder the taxa by their
+#' `group` values. Ignored if `x` has (or is) a phylogenetic tree.
 #' @param tax_adjust 0: no adjustment; 1: phyloseq-compatible adjustment; 2:
-#'   conservative adjustment (see Details)
+#'   conservative adjustment
 #' @export
 #'
 #' @seealso
 #' \code{\link[phyloseq]{merge_taxa}}
-#'
 #' \code{\link{tax_glom}}
+#' \code{\link{tip_glom}}
+#' \code{\link{tree_glom}}
+#' \code{\link[base]{rowsum}}
 #'
 #' @rdname merge_taxa_vec-methods
 #' @examples
@@ -48,49 +60,57 @@
 #' ps0 <- merge_taxa_vec(
 #'   ps, 
 #'   group = clusters$cluster,
-#'   tax_adjust = 2
 #' )
 #' }
 setGeneric("merge_taxa_vec", 
   function(x, 
            group, 
+           reorder = FALSE,
            tax_adjust = 1L)
     standardGeneric("merge_taxa_vec")
 )
 
 #' @rdname merge_taxa_vec-methods
 setMethod("merge_taxa_vec", "phyloseq",
-  function(x, group, tax_adjust = 1L) {
+  function(x, group, reorder = FALSE, tax_adjust = 1L) {
     stopifnot(ntaxa(x) == length(group))
     stopifnot(tax_adjust %in% c(0L, 1L, 2L))
+    # Warn the user if an impossible reordering is requested
+    if (!is.null(x@phy_tree) & reorder) {
+      warning("Can't reorder taxa if `x` has a `phy_tree`")
+      reorder <- FALSE
+    }
     # drop taxa with `is.na(group)`
     if (anyNA(group)) {
       warning("`group` has missing values; corresponding taxa will be dropped")
       x <- prune_taxa(!is.na(group), x)
       group <- group[!is.na(group)]
     }
     # Get the merged otu table with new taxa named by most abundant
-    otu <- merge_taxa_vec(otu_table(x), group)
+    otu <- merge_taxa_vec(otu_table(x), group, reorder = reorder)
     # Adjust taxonomy if necessary
     if (!is.null(x@tax_table) & tax_adjust != 0) {
-      tax <- merge_taxa_vec(tax_table(x), group, tax_adjust = tax_adjust)
+      tax <- merge_taxa_vec(tax_table(x), group, tax_adjust = tax_adjust, 
+        reorder = reorder)
       # Taxa in `tax` are in same order as in `otu` but are named by first in
-      # group and so need to be renamed
+      # group instead of max and so need to be renamed
       taxa_names(tax) <- taxa_names(otu)
+    } else {
+      tax <- NULL
     }
     # Create the new phyloseq object. Replacing the original otu_table with
     # the new, smaller table will automatically prune the taxonomy, tree, and
     # refseq to the smaller set of archetypal taxa.
     otu_table(x) <- otu
-    if (exists("tax"))
+    if (!is.null(tax))
       tax_table(x) <- tax
     x
   }
 )
 
 #' @rdname merge_taxa_vec-methods
 setMethod("merge_taxa_vec", "otu_table",
-  function(x, group) {
+  function(x, group, reorder = FALSE) {
     stopifnot(ntaxa(x) == length(group))
     # Work with taxa as rows
     if (!taxa_are_rows(x)) {
@@ -113,22 +133,22 @@ setMethod("merge_taxa_vec", "otu_table",
       group = group
     ) %>%
       .[, by = group, .(archetype = taxon[which.max(sum)])]
+    if (reorder)
+      data.table::setorder(new_names, group)
     # Compute new table with base::rowsum(). The call to rowsum() makes the
-    # rownames the group names; reorder = FALSE keeps the taxa in the same
-    # order as `new_names`.
-    otu <- otu_table(rowsum(x, group, reorder = FALSE), taxa_are_rows = TRUE)
-    if (needs_flip) {
-      otu <- t(otu)
-    }
+    # rownames the group names.
+    otu <- otu_table(rowsum(x, group, reorder = reorder), taxa_are_rows = TRUE)
     stopifnot(all.equal(as.character(new_names$group), taxa_names(otu)))
     taxa_names(otu) <- new_names$archetype
+    if (needs_flip)
+      otu <- t(otu)
     otu
   }
 )
 
 #' @rdname merge_taxa_vec-methods
 setMethod("merge_taxa_vec", "taxonomyTable",
-  function(x, group, tax_adjust = 1L) {
+  function(x, group, reorder = FALSE, tax_adjust = 1L) {
     stopifnot(ntaxa(x) == length(group))
     # drop taxa with `is.na(group)`
     if (anyNA(group)) {
@@ -137,15 +157,17 @@ setMethod("merge_taxa_vec", "taxonomyTable",
       group <- group[!is.na(group)]
     }
     if (tax_adjust == 0L)
-      return(merge_taxa_vec_pseudo(x, group))
+      return(merge_taxa_vec_pseudo(x, group, reorder = reorder))
     else if (tax_adjust == 1L)
       na_bad <- FALSE
     else if (tax_adjust == 2L)
       na_bad <- TRUE
     k <- length(rank_names(x))
     # bad_string is used to temporarily mark bad values in the tax table
     bad_string <- paste0("BAD", Sys.time())
-    tax <- x %>% 
+    # Reduce each group to one row; sort if needed; then finish flushing bad
+    # ranks and making new tax table
+    reduced <- x %>% 
       as("matrix") %>%
       data.table::as.data.table(keep.rownames = "taxon") %>%
       .[, group := group] %>%
@@ -157,14 +179,16 @@ setMethod("merge_taxa_vec", "taxonomyTable",
           .(taxon = taxon[1]), 
           lapply(.SD, bad_or_unique, bad = bad_string)
         )
-      ] %>%
+      ]
+    if (reorder)
+      data.table::setorder(reduced, group)
+    reduced %>%
       .[, !"group"] %>%
       # Propagate bad ranks downwards and convert to NAs
       tibble::column_to_rownames("taxon") %>%
       apply(1, bad_flush_right, bad = bad_string, na_bad = na_bad, k = k) %>% 
       t %>%
       tax_table
-    return(tax)
   }
 )
 
@@ -175,18 +199,20 @@ setMethod("merge_taxa_vec", "phylo",
 
 #' @rdname merge_taxa_vec-methods
 setMethod("merge_taxa_vec", "XStringSet",
-  function(x, group) {merge_taxa_vec_pseudo(x, group)}
+  function(x, group, reorder = FALSE) {
+    merge_taxa_vec_pseudo(x, group, reorder = reorder)
+  }
 )
 
-#' Pseudo taxa merging for phylo and XStringSet objects
+#' Pseudo-merge taxa in groups
 #'
 #' Returns `x` pruned to the first taxon of each group defined in `group`.
 #'
-#' @param x a phylo or XStringSet object
-#' @param group a vector with one element for each taxon in `physeq` that
-#'   defines the new groups
+#' @param x a phyloseq component-class object
+#' @param group a vector with one element for each taxon in `x` that defines
+#'   the new groups
 #' @keywords internal
-merge_taxa_vec_pseudo <- function(x, group) {
+merge_taxa_vec_pseudo <- function(x, group, reorder = FALSE) {
   stopifnot(ntaxa(x) == length(group))
   # drop taxa with `is.na(group)`
   if (anyNA(group)) {
@@ -195,10 +221,12 @@ merge_taxa_vec_pseudo <- function(x, group) {
     group <- group[!is.na(group)]
   }
   # Archetypes are the first taxon in each group
-  archetypes <- data.table::data.table(taxon = taxa_names(x), group = group) %>%
-    .[, by = group, .(taxon = taxon[1])] %>%
-    .$taxon
-  prune_taxa(archetypes, x)
+  archetypes <- data.table::data.table(taxon = taxa_names(x), 
+    group = group) %>%
+    .[, by = group, .(taxon = taxon[1])]
+  if (reorder)
+    data.table::setorder(archetypes, group)
+  select_taxa(x, archetypes$taxon, reorder = TRUE)
 }
 
 # helper functions ------------------------------------------------------------