phyloseq_3_Merge Data1.Rmd

---
title: "phyloseq_3_Merge Data"
author: "wentao"
date: "2019/8/23"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# phyloseq基础：合并数据

## 写在前面


- phyloseq对我们意义很大程度上来讲是其独特的封装结构，将我们做微生物群落分析需要的文件封装到phyloseq对象中。面对这五个部分的封装内容，我们需要随时进行分离，重新组合。这该怎么办？

- 面对一个分组多个重复的测序样品，我们实际上关注的是组一与组之间的OTU丰度关系，在phyloseq中该如何处理呢？

- 微生物群落通常而言，指的是细菌和真菌群落。我们通过多种方式对原始测序数据进行处理，但这仅仅只是一个开始。无论是细菌还是真菌，我们测得的OTU数量基本在100-100000之间，面对如此庞大的数据，我们有七级分类用来注释OTU。这仅仅只是一个参考，如果我们想按照自己的需求合并OTU呢？



### 本教程载入包和数据

Phyloseq提供两种完全不同的合并数据的方式

><font size=2>The phyloseq project includes support for two completely different categories of merging data objects</font>

在Phyloseq中可以基于样品和物种进行OTU表格的合并：基于样品的合并和物种的合并分别使用merge_samples（）和merge_taxa（）函数。

><font size=2>Merging the OTUs or samples in a phyloseq object, based upon a taxonomic or sample variable: merge_samples() merge_taxa()</font>

使用分租文件中的分组变量对OTU和样品进行合并，可以有效减少在下游分析和出图噪音（由于个体差异导致分组规律不明显）。合并分组可以为来自同一个环境采样，合并的OTUs可能来自同一门类。合并中默认将NA的去除。在本示例中，分别使用这两列分别使用sample id或taxa id对数据进行合并。

><font size=2>Merging OTU or sample indices based on variables in the data can be a useful means of reducing noise or excess features in an analysis or graphic. Some examples might be to merge the samples in a dataset that are from the same environment, or orthogonally, to merge OTUs that are from the same taxonomic genera. Often this takes the form of a table-join, where non-matching keys are omitted in the result. In this case, keys are Sample IDs or Taxa IDs, respectively.</font>

phyloseq封装了常见的微生物群落分析的五个文件（查看先前介绍），对于同一个测序文件的不同分析部分进行合并使用： merge_phyloseq()

><font size=2>Merging two or more data objects that come from the same experiment, so that their data becomes part of the same phyloseq object: merge_phyloseq()</font>

这一操作对phyloseq中添加不足的对象尤其有用 merge_Phyloseq是一个便捷的工具，可以帮助将数据转换为正确的phyloseq格式并添加到封装体系中。

><font size=2>Merging separate data objects is especially useful for manually-imported data objects, especially when one of the data objects already has more than one component and so is a phyloseq-class. While the first category of merging functions is useful for direct manipulations of the data for analytical purposes, merge_phyloseq is a convenience/support tool to help get your data into the right format.</font>

如果想要消除单个样本对于整个分组的影响，更方便的总结规律，可以使用merge_samples函数操作。使用merge_samples函数默认对对样本的之间的count数量求均值，所以如果每个样本的权重不一样，普通求和会对数据造成影响，需要提前对数据进行抽平。

><font size=2>merge_samples can be very useful if you’d like to see what happens to an analysis if you remove the indivual effects between replicates or between samples from a particular explanatory variable.With the merge_samples function, the abundance values of merged samples are summed, so make sure to do any preprocessing to account for differences in sequencing effort before merging or you will achieve a sequencing-effort-weighted average (which you may want, but keep in mind).</font>

首先移除count数量为0的otu（所有样本的总和为0），然后在分组文件中添加一列与人类相关的变量，以便在稍后的绘图中展示数据。

><font size=2>Let’s first remove unobserved OTUs (sum 0 across all samples), and add a human-associated variable with which to organize data later in the plots.</font>

```{R}
library("phyloseq"); packageVersion("phyloseq")
data(GlobalPatterns)
 GlobalPatterns
```

### 过滤otu
prune_taxa：函数用来根据丰度对otu进行筛选

加载数据，删除空样本，向数据集中添加新的样本数据变量。

><font size=2>Load data, remove empty samples, add a new sample_data variable to the dataset.</font>

```{R}
?prune_taxa
GP = GlobalPatterns
GP
GP = prune_taxa(taxa_sums(GlobalPatterns) > 10, GlobalPatterns)
GP
humantypes = c("Feces", "Mock", "Skin", "Tongue")
sample_data(GP)$human <- get_variable(GP, "SampleType") %in% humantypes
```
### merge_samples函数根据分组合并样品和mapping文件
merge_samples:用于根据样品分组，将同一分组内的otu合并，默认为均值。这里我们使用sum来做

```{R}
# ?merge_samples

mergedGP = phyloseq::merge_samples(GP, "SampleType",fun = sum)
print(mergedGP)
##单独合并mapping文件
sample_names(GP)
SD = phyloseq::merge_samples(sample_data(GP), "SampleType")
(SD[, "SampleType"])

```



#### identical函数用于两个phyloseq对象是否相同

这里我们鉴定单独合并的mapping文件和整个phyloseq合并的文件是否相同
```{R}
identical(SD, sample_data(mergedGP))
```
### 提取丰度排名前十的OTU,提取仅含有这些otu的phyloseq对象

使用otu总和为otu排序，得到丰度最高的前十个otu的名称，使用prune_taxa函数提取含有这些otu的phyloseq子对象。

><font size=2>As emphasized earlier, the OTU abundances of merged samples are summed. Let’s investigate this ourselves looking at just the top10 most abundance OTUs</font>

```{R}
OTUnames10 = names(sort(taxa_sums(GP), TRUE)[1:10])
GP10  = prune_taxa(OTUnames10,  GP)
mGP10 = prune_taxa(OTUnames10, mergedGP)
#提取指定分组的样品名称
ocean_samples = sample_names(subset(sample_data(GP), SampleType=="Ocean"))
print(ocean_samples)
```
#### 单独从otu表格中筛选需要多份样品名称

```{R}
otu_table(GP10)[, ocean_samples]
```

#### 计算基于某个分组提取出来的otu表格计算OTU序列总和

```{R}
rowSums(otu_table(GP10)[, ocean_samples])

```

从合并完成的phyloseq对象中提取子otu表。
```{R}
otu_table(mGP10)["Ocean", ]
```
### plot_richness函数用于可视化微生物群落alpha多样性

可以通过设定分组文件展示不同分组的alpha多样性指标，这里一共有6个指标: "Observed", "Chao1", "ACE", "Shannon", "Simpson", "InvSimpson", "Fisher"；
默认全部出图；如果需要指定出图，添加参数：measures = "Observed"

基于合并后的数据求取丰富度指标并出图。请注意，这仅仅又来演示，因为我们在合并之前并没有对数据进行预处理。

><font size=2>Let’s look at the merge graphically between two richness estimate summary plots. Note that we are ignoring our own advice and not-performing any preprocessing before this merge. Since these are  richness estimates we are probably safe.</font>

```{R}
# ?plot_richness
plot_richness(GP, "human", "SampleType", title="unmerged")
```

使用我们之前合并后的phyloseq对象进行alpha出图；首先添加一列分组信息，这里直接使用合并后的样品名称。
humantypes：为之前我们挑选出来的样品名，这里使用其取子集。

><font size=2>The merge can do some weird things to sample variables. Let’s re-add these variables to the sample_data before we plot.</font>

```{R}
sample_data(mergedGP)$SampleType = sample_names(mergedGP)

sample_data(mergedGP)$human = sample_names(mergedGP) %in% humantypes
plot_richness(mergedGP, "human", "SampleType", title="merged",measures = "Observed")
```

将来自同一环境的非重复样本的丰度结合起来时，每个环境的绝对丰度估计值都会增加。新的合并图更容易阅读和解释，这也是使用merge-samples函数的原因之一

><font size=2>Perhaps not surprisingly, when we combine the abundances of non-replicate samples from the same environment, the estimates of absolute richness increase for each environment. More to the example, however, the new merged plot is easier to read and interpret, which is one of the reasons one might use the merge_samples function</font>

### merge_taxa 合并注释文件
phyloseq中的进化树可视化方案还是十分丰富的，作者使用了power这个词，我们在之后进化树专题会展示强大的进化树可视化方案。

遗憾的是example-data.RData在phyloseq包中我没有找到;我更换为另外一组示例数据进行演示（GP10）

对OTU表来说，微生物数据中的“噪音”可能会被我们误认为具有高的分类等级，因此，基于系统发育或分类阈值对相似OTUs进行合并，最好使用tip_glom tax_glom函数。然而，要使这两个函数都能工作，它必须能够合并两个或多个被视为“相同”的otu。

><font size=2>One of the sources of “noise” in a microbial census dataset is a fine-scaled definition for OTU that blurs a pattern that might be otherwise evident were we to consider a higher taxonomic rank. The best way to deal with this is to use the agglomeration functions tip_glom or tax_glom that merge similar OTUs based on a phylogenetic or taxonomic threshold, respectively. However, for either function to work, it must be capable of merging two or more OTUs that have been deemed “equivalent”</font>

下面是merge_taxa的一个实例，使用一个系统发育树来显示合并前后的情况，并且还表明合并不仅影响otu表，而且还影响otu索引的所有数据

><font size=2>The following is an example of merge_taxa in action, using a tree graphic to display the before and after of the merge, and also show how the merge affects not just the OTU table, but all data components with OTU indices.</font>

```{R}

# load("example-data.RData")

plot_tree(GP10, color="SampleType", size="abundance", sizebase=2, label.tips="taxa_names")

```
#### 合并进化树

下面演示合并第三个到第五个门类。

这里涉及到了merge_tax函数，用于合并一些otu。这里注意archetype参数用于指定合并后的OTU的名称，可以自己指定名名称，或者设定为数字，使用合并OTU中的第几个OTU名称代替。
默认archetype参数设定为s使用这个合并的OTU集合中数量最多的OTU名称。

现在让我们使用merge-taxa将前8个otu合并为一个新的otu，通过选择两个OTUs作为merge_taxa的可选第三个参数，我们将这8个otus合并到第二个otu的索引中。默认情况是使用第二个参数的第一个otu

><font size=2>Now let’s use merge_taxa to merge the first 8 OTUS of closedps into one new OTU. By choosing 2 as the optional third argument to merge_taxa , we are combining the counts of these 8 OTUs into the index for the second OTU. The default is to use the first OTU of the second argument</font>

```{R}
  ?merge_taxa
x1 = merge_taxa(GP10, taxa_names(GP10)[3:5], 2)
x1
plot_tree(x1, color="SampleType", size="abundance", sizebase=2, label.tips="taxa_names")
```
### merge_phyloseq：phyloseq的分离和合并

前面我们提到两个函数，分别用来合并样品和部分otu。

接下来的命令我们进行phyloseq对象的合并。要知道phyloseq封装了我们常见的微生物分析的五个文件，分别是：otu表格，tax注释文件，mapping分组文件，rep代表序列文件，tree进化树文件。
对这些文件的合并使用merge_phyloseq命令。

首先导入数据，并对这个phyloseq对象的内容进行分离,然后合并otu文件和注释文件。

如前所述，Merge_Phyloseq可以帮助将数据转换为正确的格式。下面是一个示例，我们从示例数据集中提取内容，然后使用Merge_Phyloseq将它们构建回原始形式

><font size=2>As said earlier, merge_phyloseq is a convenience/support tool to help get your data into the right format. Here is an example in which we extract components from an example dataset, and then build them back up to the original form using merge_phyloseq along the way</font>

让我们将“Global Patterns”示例数据集拆分

><font size=2>Let’s split apart the “Global Patterns” example dataset into some components</font>

```{R}
library(phyloseq)
data(GlobalPatterns)
tree = phyloseq::phy_tree(GlobalPatterns)
tax  = tax_table(GlobalPatterns)
otu  = otu_table(GlobalPatterns)
sam  = sample_data(GlobalPatterns)
otutax = phyloseq(otu, tax)
otutax

```

在现有phyloseq对象汇总加入样品分组文件和进化树

新otutax对象只有otu表和分类表。使用merge_Phyloseq来构建原始的globalPatterns对象，并进行比较以确保它们是相同的。注意合并Phyloseq的参数是多项（otutax）和单项对象的混合

><font size=2>As you can see, our new otutax object has just the OTU table and taxonomy table. Now let’s use merge_phyloseq to build up the original GlobalPatterns object, and compare to make sure they are identical. Note how the arguments to merge_phyloseq are a mixture of multi-component ( otutax ) and single component objects</font>

```{R}
GP2 = merge_phyloseq(otutax, sam, tree)
identical(GP2, GlobalPatterns)
GP2

```

在现有phyloseq文件中加入otu表格和注释文件

Merge_Phyloseq函数也可以用于多对象

><font size=2>The merge_phyloseq function will also work even with more than one multiple-component object.</font>

```{R}

otusamtree = phyloseq(otu, sam, tree)
GP3 = merge_phyloseq(otusamtree, otutax)
GP3
```
新对象GP3与GlobalPatterns相似，但并不完全相同。Merge_Phyloseq的假设是，当尝试合并不同的丰度数据，两个Phyloseq对象中具有相同OTU表的任何部分都会被汇总在一起，与使用merge_taxa结果相同。

><font size=2>So this merge appears to have worked. The new object, GP3 , looks similar to GlobalPatterns but is not identical. Why? Well, the assumption by merge_phyloseq is that you are atttempting to merge separate sources of abundance data, and so any portion of the OTU tables in the two phyloseq objects that have the same OTU indices are summed together, just like with merge_taxa earlier.</font>



tax_table函数提取出来的注释文件可以直接加入 merge_phyloseq函数中进行合并

><font size=2>This example scenario was illustrative, but hopefully rare in practice. Nevertheless, just in case, an easy fix would be to extract the unique component of otutax and provide it to merge_phyloseq ,instead of the entire phyloseq object. This amounts to a small intuitive modification to the previous merge_phyloseq command:</font>
 
```{R}
GP4 = merge_phyloseq(otusamtree, tax_table(otutax))
GP4
```
鉴定我们使用不同方式合并的文件是否相同
```{R}
identical(GP4, GlobalPatterns)

```

### reference
https://joey711.github.io/phyloseq/merge.html