-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
157 lines (125 loc) · 8.95 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
---
title: ggtree tutorial
author: Andrey Rozenberg
---
## Setup
We'll be using the following packages:
```{r setup, warning = F}
library(dplyr)
library(tidyr)
library(castor)
library(phangorn)
library(phytools)
library(ggtree)
library(treeio)
```
All are available from conda as well, so smth like the following environment `.yaml` file would work:
```yaml
channels:
- conda-forge
- bioconda
dependencies:
- r=4.1
- r-dplyr
- r-tidyr
- r-svglite
- r-ape=5.5
- r-phangorn=2.8.0
- r-phytools=0.7_90
- r-castor=1.7.2
- bioconductor-biostrings=2.62.0
- bioconductor-ggtree=3.2.0
- bioconductor-treeio=1.18.0
```
# Reading trees
Different software generates trees in different text formats. For most of them there are `read.*` import functions. The most popular formats are as follows:
| Format | Example R functions | Example software |
| :------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------- | :----------------------------------------------------------------------------------------------------------------- |
| [newick](https://evolution.genetics.washington.edu/phylip/newicktree.html) | `ape::read.tree` (`phylo`), `treeio::read.astral` (`treedata`), `treeio::read.iqtree` (`treedata`) | very widespread, e.g. [iqtree](http://www.iqtree.org/) (`.treefile`), [astral](https://github.com/smirarab/ASTRAL) |
| [nexus](http://wiki.christophchamp.com/index.php?title=NEXUS_file_format) | `ape::read.nexus` (`phylo`), `treeio::read.beast` (`treedata`) | [PAUP](https://paup.phylosolutions.com/), [BEAST](https://beast.community/) |
| [nhx](https://www.cs.cmu.edu/~aiton/split/Manual-2.6.master014.html) | `treeio::read.nhx` (`treedata`) | [Notung](http://www.cs.cmu.edu/~durand/Notung/) and other reconciliation software |
The tree data are represented in R in the objects of three main R
classes:
* `phylo` which is a basic format and a common denominator across multiple packages;
* `treedata` that is very useful when it comes to storing annotations (it includes a slot `@phylo` that stores the topology) and
* `tbl_tree` which is essentially a `tibble` containing node data in tabular format.
`phylo` and `treedata` objects are what the read functions produce (as indicated). `treedata` is supported by `ape` and `ggtree` and is useful for plotting trees with annotations, `tbl_tree` is not a strict tree format and can be modified as any other tibble as long as no rows are added or removed. Inter-conversion is mostly straightforward: there are functions `as.phylo`, `as.treedata`, `as_tibble`, but notice that conversion to `phylo` is not lossless as it preserves only topology, branch lengths and node labels. Also, table operations on `tbl_tree` often lead to losing the correct class and obscure errors with `as.treedata`, so in case this happens enforce classes with `class(c("tbl_tree", "tbl_df", "tbl", "data.frame"))` (see below).
As an example, we'll take an example RAxML tree of XXXXXXX:
```{r}
tree_file <- system.file("extdata/RAxML", "RAxML_bipartitions.H3", package = "treeio")
big_tree <- read.tree(tree_file)
```
To make a cursory look at the tree will use the `ggtree` function:
```{r}
ggtree(big_tree)
```
The tree is quite big, so let's pick every 4th tip using `ape::keep.tip`:
```{r}
tips_to_keep <- with(big_tree, tip.label[seq(to = length(tip.label), by = 4)])
tree <- keep.tip(big_tree, tips_to_keep)
ggtree(tree) + geom_tiplab()
```
This looks more manageable for us (of course in real life you remove tips only if you have good reasons).
Now let's add some metadata to the tree using its tabular representation. Now, in most of the cases these annotations would probably come from a separate annotation file, but in this case the tip label are quite verbose, so we can them for our toy example -- we'll extract the year and the “type” (it's a mock – just the first letter of the label):
```{r}
tree_data <- as_tibble(tree) %>%
extract(label, into = c("type", "year"), regex = "(.).+_([0-9]+)", remove = F, convert = T) %>%
mutate(isInner = node %in% parent, isRoot = node == parent) %>%
mutate(support = ifelse(isInner | isRoot, label, NA), support = as.numeric(support)) %>%
`class<-`(c("tbl_tree", "tbl_df", "tbl", "data.frame")) %>%
as.treedata
ggtree(tree_data) + geom_tiplab()
```
# Tree rooting
To display trees in rooted layouts and for analyses that rely on rooted trees, trees must be rooted. In some cases, the trees are already rooted to begin with (e.g. obtained by phylogenetic placement with an already rooted reference tree). In general case though, common phylogenetic programs produce either explicitly unrooted trees or trees that should be treated as unrooted. In such cases, the trees should be (re-)rooted using one of the following strategies:
* outgroup rooting – OTUs a priori known not to belong to the group of interest, i.e. the ingroup, the recommended strategy
* midpoint rooting – this procedure places the root in the geometric middle of the tree, should be used with caution and mostly only for visualization
* minimal ancestor deviation and minimum variance rooting
* nonreversible models (see [root\_digger](https://github.com/computations/root_digger) and [rootstrap](http://www.iqtree.org/doc/Rootstrap))
* molecular clock (e.g. UPGMA in the extreme unrealistic case)
One important point about tree rooting concerns inner node and edge labels. Most of the formats and classes for storing phylogenetic trees code both, iternal node and edge attributes as inner nodes labels. This becomes relevant when an edge reverses its orientation due to (re-)rooting and is particularly important for support values that are edge attributes. It should be kept in mind that different rooting programs and functions eiter simply treat all inner node labels as node attributes and thus produce incorrect output most of the time or have parameters/arguments that them to treat the labels as either node or edge attributes.
Depending on the circumstances, it is sometimes desirable to root the trees before importing them in R, e.g. with nw\_reroot from [newick utilities](https://github.com/tjunier/newick_utils) – it has a switch for treating inner node as edge attributes). For MAD rooting there are [mad](https://www.mikrobio.uni-kiel.de/de/ag-dagan/ressourcen) and [madRoot](https://github.com/davidjamesbryant/MADroot), although both have issues with edge labels. For nonreversible models there are [root\_digger](https://github.com/computations/root_digger) and iqtree's implementation of [rootstrap](http://www.iqtree.org/doc/Rootstrap), although note that iqtree will only assign rootstrap support values and not actually re-root the input tree.
Among R packages, the following functions can correctly treat edge labels:
* `ape::root(edgelabel = T)`
* `phangorn::midpoint(node.labels = "support")`
Some code for future:
```{r, eval = FALSE}
scaleClades <- function(p, df) {
with(df, Reduce(function(.p, .node) {
offs <- offspring(.p$data, .node)
scale <- 0.5 / (nrow(offs) - 1)
scaleClade(.p, .node, scale)
}, node, p))
}
collapseClades <- function(p, df) {
with(df, Reduce(function(.p, .node) {
fill <- unlist(colors[Host[node == .node]])
.p$data[.p$data$node == .node, "label.show"] <- label.show[node == .node]
collapse(.p, .node, "mixed", fill = fill)
}, node, p))
}
multi_species <- allDescendants(tree_data@phylo) %>%
lapply(function(x) filter(tree, node %in% x)) %>%
bind_rows(.id = "ancestor") %>%
group_by(ancestor) %>%
filter(n_distinct(Collapse, na.rm = T) == 1, sum(!isInternal) > 1) %>%
ungroup %>%
mutate(ancestor = as.numeric(ancestor)) %>%
filter(! ancestor %in% node) %>%
filter(!is.na(Collapse)) %>%
group_by(ancestor, Collapse) %>%
summarize(num_tips = sum(!isInternal), Host = first(na.omit(Host))) %>%
mutate(label.show = sprintf("%s (%d)", Collapse, num_tips)) %>%
rename(node = ancestor)
p <- ggtree(tree_data) +
geom_nodepoint(aes(x = branch, subset = !is.na(UFboot) & UFboot >= 90, size = UFboot)) +
geom_tiplab(aes(label = label.show), size = 4, align = T, linesize = 0) +
geom_text2(aes(subset = node %in% multi_species$node, x = max(x, na.rm = T), label = label.show), nudge_x = 0.01, size = 4, hjust = 0) +
geom_tippoint(aes(color = Host), size = 3) +
geom_treescale(width = 0.5) +
scale_size_continuous(limits = c(90, 100), range = c(1, 3)) +
scale_shape_manual(values = seq(0,15)) +
scale_color_manual(values = colors)
p <- scaleClades(p, multi_species)
p <- collapseClades(p, multi_species)
```