-
Notifications
You must be signed in to change notification settings - Fork 50
/
Copy path02_tidytree.Rmd
660 lines (465 loc) · 30.8 KB
/
02_tidytree.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
\newpage
# Manipulating Tree with Data {#chapter2}
```{r include=FALSE}
library(knitr)
opts_chunk$set(message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE, cache=TRUE)
```
```{r echo=FALSE, results="hide", message=FALSE}
library("ape")
library("tidytree")
library(treeio)
```
```{r echo=FALSE}
options(show_data_for_treedata = FALSE)
```
## Manipulating Tree Data Using Tidy Interface {#tidytree}
All the tree data parsed/merged
by `r Biocpkg("treeio")`\index{treeio} [@wang_treeio_2020] can be converted to a tidy
data frame using the `r CRANpkg("tidytree")`
package. The `r CRANpkg("tidytree")` package
provides tidy interfaces to manipulate trees with associated data. For instance,
external data can be linked to phylogeny or evolutionary data obtained from
different sources can be merged using tidyverse verbs. After the tree data was
processed, it can be converted back to a `treedata` object and exported to [a single tree file](#chapter3), further analyzed in R or visualized using `r Biocpkg("ggtree")` [@yu_ggtree:_2017] and `r Biocpkg("ggtreeExtra")` [@ggtreeExtra_2021].
### The `phylo` object
The `phylo` class defined in the `r CRANpkg("ape")` package [@paradis_ape_2004] is
fundamental for phylogenetic analysis in R. Most of the R packages
in [this field](https://CRAN.R-project.org/view=Phylogenetics) rely
extensively on the `phylo` object. The `r CRANpkg("tidytree")` package provides `as_tibble`
method to convert the `phylo` object to a tidy data frame, a `tbl_tree` object\index{ape}.
```{r}
library(ape)
set.seed(2017)
tree <- rtree(4)
tree
x <- as_tibble(tree)
x
```
The `tbl_tree` object can be converted back to a `phylo` object using the `as.phylo()` method.
```{r}
as.phylo(x)
```
Using `tbl_tree` object makes tree and data manipulation more effective and
easier (see also the example in [FAQ](#bind-tip)). For example, we can link evolutionary trait to phylogeny using the `r CRANpkg("dplyr")` verbs `full_join()`:
```{r}
d <- tibble(label = paste0('t', 1:4),
trait = rnorm(4))
y <- full_join(x, d, by = 'label')
y
```
### The `treedata` object
The `r CRANpkg("tidytree")` package defines `treedata` class to store a phylogenetic tree with
associated data. After mapping external data to the tree structure, the
`tbl_tree` object can be converted to a `treedata` object.
```{r}
as.treedata(y)
```
The `treedata` class is used
in the [treeio](https://bioconductor.org/packages/treeio/) package [@wang_treeio_2020] to store evolutionary evidence inferred by commonly used software (`r pkg_beast`, `r pkg_epa`, `r pkg_hyphy`, `r pkg_mrbayes`, `r pkg_paml`, `r pkg_phyldog`, `r pkg_pplacer`, `r pkg_r8s`, `r pkg_raxml`, and `r pkg_revbayes`, etc.) (see details in [Chapter 1](#chapter1)).
The `r CRANpkg("tidytree")` package also provides the `as_tibble()` method to convert a `treedata` object to a tidy data frame. The phylogenetic tree structure and the evolutionary
inferences were stored in the `tbl_tree` object, making it consistent and easier
for manipulating evolutionary statistics inferred by different software as well
as linking external data to the same tree structure.
```{r}
y %>% as.treedata %>% as_tibble
```
### Access related nodes {#accesor-tidytree}
The `r CRANpkg("dplyr")` verbs can be applied to `tbl_tree` directly to manipulate tree data. In addition, `r CRANpkg("tidytree")` provides several verbs to filter related nodes, including
`child()`, `parent()`, `offspring()`, `ancestor()`, `sibling()` and `MRCA()`.
These verbs accept a `tbl_tree` object and a selected node which can be node number or label.
```{r}
child(y, 5)
parent(y, 2)
offspring(y, 5)
ancestor(y, 2)
sibling(y, 2)
MRCA(y, 2, 3)
```
All these methods are also implemented in `r Biocpkg("treeio")` for working with `phylo` and `treedata` objects. You can try accessing related nodes using the tree object. For instance, the following command will output child nodes of the selected internal node `5`:
```{r}
child(tree, 5)
```
Beware that the methods for tree objects output relevant node numbers, while the methods for `tbl_tree` object output a `tibble` object that contains related information.
## Data Integration
### Combining tree data {#merge-tree}
The `r Biocpkg("treeio")` package [@wang_treeio_2020] serves as an
infrastructure that enables various types of phylogenetic data inferred from
common analysis programs to be imported and used in R. For instance, *d~N~/d~S~*
or ancestral sequences estimated by `r pkg_codeml`,
and clade support values (posterior) inferred by `r pkg_beast`/`r pkg_mrbayes`.
In addition, `r Biocpkg("treeio")` supports linking external data to phylogeny. It brings these external phylogenetic data (either from software output or external sources) to the R
community and makes it available for further analysis in R.
Furthermore, `r Biocpkg("treeio")` can combine
multiple phylogenetic trees into one with their node/branch-specific
attribute data. Essentially, as a result, one such attribute (*e.g.*,
substitution rate) can be mapped to another attribute (*e.g.*, *d~N~/d~S~*) of
the same node/branch for comparison and further computations [@yu_ggtree:_2017; @yu_two_2018].
A previously published dataset, seventy-six H3 hemagglutinin gene sequences of
a lineage containing swine and human influenza A viruses
[@liang_expansion_2014], was used here to demonstrate the utilities of comparing
evolutionary statistics inferred by different software. The dataset was
re-analyzed by `r pkg_beast` for timescale estimation
and `r pkg_codeml` for synonymous and
non-synonymous substitution estimation. In this example, we first parsed the
outputs from `r pkg_beast` using the `read.beast()` function and
from `r pkg_codeml` using
the `read.codeml()` function into two `treedata` objects. Then these two objects containing separate sets of node/branch-specific data were merged via the `merge_tree()` function.
```{r}
beast_file <- system.file("examples/MCC_FluA_H3.tree", package="ggtree")
rst_file <- system.file("examples/rst", package="ggtree")
mlc_file <- system.file("examples/mlc", package="ggtree")
beast_tree <- read.beast(beast_file)
codeml_tree <- read.codeml(rst_file, mlc_file)
merged_tree <- merge_tree(beast_tree, codeml_tree)
merged_tree
```
After merging the `beast_tree` and `codeml_tree` objects, all
node/branch-specific data imported from `r pkg_beast`
and `r pkg_codeml` output files are
all available in the `merged_tree` object. The tree object was converted to a
tidy data frame using the [tidytree](https://cran.r-project.org/package=tidytree)
package and visualized as hexbin scatterplots of *d~N~/d~S~*, *d~N~*, and *d~S~* inferred
by `r pkg_codeml` vs. *rate*
(substitution rate in a unit of substitutions/site/year) inferred
by `r pkg_beast` on the same branches.
(ref:correlationscap) Correlation of *d~N~/d~S~*, *d~N~*, and *d~S~* versus substitution rate.
(ref:correlationcap) **Correlation of *d~N~/d~S~*, *d~N~*, and *d~S~* vs. substitution rate.** After merging the *BEAST* and *CodeML* outputs, the branch-specific estimates (substitution rate, *d~N~/d~S~* , *d~N~*, and *d~S~*) from the two analysis programs are compared on the same branch basis. The associations of *d~N~/d~S~*, *d~N~*, and *d~S~* vs. *rate* are visualized in hexbin scatter plots.
```{r correlations, fig.width=9, fig.height=3, warning=FALSE, fig.cap="(ref:correlationcap)", fig.scap="(ref:correlationscap)", out.extra='', out.width="100%"}
library(dplyr)
df <- merged_tree %>%
as_tibble() %>%
select(dN_vs_dS, dN, dS, rate) %>%
subset(dN_vs_dS >=0 & dN_vs_dS <= 1.5) %>%
tidyr::gather(type, value, dN_vs_dS:dS)
df$type[df$type == 'dN_vs_dS'] <- 'dN/dS'
df$type <- factor(df$type, levels=c("dN/dS", "dN", "dS"))
ggplot(df, aes(rate, value)) + geom_hex() +
facet_wrap(~type, scale='free_y')
```
The output is illustrated in Figure \@ref(fig:correlations). We can then test the association of these node/branch-specific data using Pearson correlation, which in this case showed that *d~N~* and *d~S~*, but not *d~N~/d~S~*\index{d\textsubscript{N}/d\textsubscript{S}} are significantly (*p*-values) associated with *rate*.
Using the `merge_tree()` function, we are able to compare analysis results using an identical
model from different software packages or different models using different or
identical software. It also allows users to integrate different analysis findings
from different software packages. Merging tree data is not restricted to
software findings, associating external data to analysis findings is also
granted. The `merge_tree()` function is chainable and allows several tree objects
to be merged into one.
```{r}
phylo <- as.phylo(beast_tree)
N <- Nnode2(phylo)
d <- tibble(node = 1:N, fake_trait = rnorm(N), another_trait = runif(N))
fake_tree <- treedata(phylo = phylo, data = d)
triple_tree <- merge_tree(merged_tree, fake_tree)
triple_tree
```
The `triple_tree` object shown above contains analysis results obtained from `r pkg_beast`
and `r pkg_codeml`, and evolutionary
traits from external sources. All these pieces of information can be used to annotate the
tree using `r Biocpkg("ggtree")` [@yu_ggtree:_2017] and `r Biocpkg("ggtreeExtra")` [@ggtreeExtra_2021].
### Linking external data to phylogeny {#link-external-data}
In addition to analysis findings that are associated with the tree as demonstrated
above, there is a wide range of heterogeneous data, including phenotypic data,
experimental data, and clinical data, *etc.*, that need to be integrated and
linked to phylogeny. For example, in the study of viral evolution, tree nodes may be
associated with epidemiological information, such as location, age, and subtype.
Functional annotations may need to be mapped onto gene trees for comparative
genomics studies. To facilitate data
integration, `r Biocpkg("treeio")` provides
`full_join()` methods to link external data to phylogeny and store it in either `phylo` or `treedata` object. Beware that linking external data to a `phylo` object will produce a `treedata` object to store the input `phylo` with associated data. The `full_join` methods can also be used at tidy data frame level (*i.e.*, `tbl_tree` object described previously) and at `ggtree` level (described in [Chapter 7](#attach-operator)) [@yu_two_2018].
The following example calculated bootstrap values and merged those values with the tree (a `phylo` object) by matching their node numbers.
```{r apeBoot, message=FALSE}
library(ape)
data(woodmouse)
d <- dist.dna(woodmouse)
tr <- nj(d)
bp <- boot.phylo(tr, woodmouse, function(x) nj(dist.dna(x)))
bp2 <- tibble(node=1:Nnode(tr) + Ntip(tr), bootstrap = bp)
full_join(tr, bp2, by="node")
```
Another example demonstrates merging evolutionary traits with the tree (a `treedata` object) by matching their tip labels.
```{r}
file <- system.file("extdata/BEAST", "beast_mcc.tree", package="treeio")
beast <- read.beast(file)
x <- tibble(label = as.phylo(beast)$tip.label, trait = rnorm(Ntip(beast)))
full_join(beast, x, by="label")
```
Manipulating tree objects is frustrated with the fragmented functions available
for working with `phylo` objects, not to mention linking external data to the
phylogeny structure. With the `r Biocpkg("treeio")` package [@wang_treeio_2020], it is easy to combine tree data from various sources.
In addition, with the `r CRANpkg("tidytree")` package, manipulating trees is easier using the [tidy data principles](https://www.jstatsoft.org/article/view/v059i10) and
consistent with tools already in wide use, including
`r CRANpkg("dplyr")`,
`r CRANpkg("tidyr")`,
`r CRANpkg("ggplot2")`,
and `r Biocpkg("ggtree")` [@yu_ggtree:_2017].
### Grouping taxa
The `groupOTU()`\index{groupOTU} and `groupClade()`\index{groupClade} methods are designed for adding taxa grouping
information to the input tree object. The methods were implemented in `r CRANpkg("tidytree")`,
`r Biocpkg("treeio")`, and `r Biocpkg("ggtree")` respectively to support adding grouping information for the
`tbl_tree`, `phylo` and `treedata`, and `ggtree` objects. This grouping information can be
used directly in tree visualization (*e.g.*, [coloring a tree based on grouping information](#group-taxa-vis))
with `r Biocpkg("ggtree")` (Figure \@ref(fig:groupOTU)).
#### groupClade
The `groupClade()` method accepts an internal node or a vector of internal nodes
to add grouping information of selected clade/clades.
```{r}
nwk <- '(((((((A:4,B:4):6,C:5):8,D:6):3,E:21):10,((F:4,G:12):14,H:8):13):
13,((I:5,J:2):30,(K:11,L:11):2):17):4,M:56);'
tree <- read.tree(text=nwk)
groupClade(as_tibble(tree), c(17, 21))
```
#### groupOTU
```{r}
set.seed(2017)
tr <- rtree(4)
x <- as_tibble(tr)
## the input nodes can be node ID or label
groupOTU(x, c('t1', 't4'), group_name = "fake_group")
```
Both `groupClade()` and `groupOTU()` work with the `tbl_tree`, `phylo` and `treedata`, and `ggtree` objects. Here is an example of using `groupOTU()` with a `phylo` tree object\index{groupOTU}.
```{r}
groupOTU(tr, c('t2', 't4'), group_name = "fake_group") %>%
as_tibble
```
Another example of working with the `ggtree` object can be found in [session 6.4](#group-taxa-vis).
The `groupOTU` will trace back from input nodes to most recent common ancestor.
In this example, nodes 1, 4, 5 and 6 are
grouping together (`4 (t2) -> 6 -> 5` and `1 (t4) -> 5`).
Related operational taxonomic units (OTUs) are grouping and they are not necessarily within a clade.
They can be monophyletic (clade), polyphyletic or paraphyletic.
```{r}
cls <- list(c1=c("A", "B", "C", "D", "E"),
c2=c("F", "G", "H"),
c3=c("L", "K", "I", "J"),
c4="M")
as_tibble(tree) %>% groupOTU(cls)
```
If there are conflicts when tracing back to the most recent common ancestor, users can set `overlap`
parameter to "origin" (the first one counts), "overwrite" (default, the last one
counts), or "abandon" (un-selected for grouping)^[<https://groups.google.com/forum/#!msg/bioc-ggtree/Q4LnwoTf1DM/uqYdYB_VBAAJ>].
## Rerooting tree {#reroot-treeio}
A phylogenetic tree can be rerooted with a specified `outgroup`. The `r CRANpkg("ape")` package implements a `root()` method to reroot a tree stored in a `phylo` object, while the `r Biocpkg("treeio")` package provides the `root()` method for `treedata` object. This method is designed to re-root a phylogenetic tree with associated data concerning the specified `outgroup` or at the specified `node` based on the `root()` implemented in the `r CRANpkg("ape")` package.
We first linked external data to a tree using `left_join()` and stored all the information in a `treedata` object, `trda`.
```{r reroot_build_td}
library(ggtree)
library(treeio)
library(tidytree)
library(TDbook)
# load `tree_boots`, `df_tip_data`, and `df_inode_data` from 'TDbook'
trda <- tree_boots %>%
left_join(df_tip_data, by=c("label" = "Newick_label")) %>%
left_join(df_inode_data, by=c("label" = "newick_label"))
trda
```
Then we can reroot the tree with the associated data mapping to the branches and nodes correctly as demonstrated in Figure \@ref(fig:reroot). The figure was visualized using `r Biocpkg("ggtree")` (see also Chapters [4](#chapter4) and [5](#chapter5)).
(ref:rerootscap) Reroot a phylogenetic tree with associated data.
(ref:rerootcap) **Reroot a phylogenetic tree with associated data.** Original tree (A) and re-rooted tree (B) with associated data mapped to the branches or nodes of the tree correctly. (A) and (B) present before and after rooting on the branch leading to the tip node 'Suricata_suricatta', respectively.
```{r reroot, fig.width=14, fig.height=5, fig.cap="(ref:rerootcap)", fig.scap="(ref:rerootscap)"}
# reroot
trda2 <- root(trda, outgroup = "Suricata_suricatta", edgelabel = TRUE)
# The original tree
p1 <- trda %>%
ggtree() +
geom_nodelab(
mapping = aes(
x = branch,
label = bootstrap
),
nudge_y = 0.36
) +
xlim(-.1, 4.5) +
geom_tippoint(
mapping = aes(
shape = trophic_habit,
color = trophic_habit,
size = mass_in_kg
)
) +
scale_size_continuous(range = c(3, 10)) +
geom_tiplab(
offset = .14,
) +
geom_nodelab(
mapping = aes(
label = vernacularName.y,
fill = posterior
),
geom = "label"
) +
scale_fill_gradientn(colors = RColorBrewer::brewer.pal(3, "YlGnBu")) +
theme(legend.position = "right")
# after reroot
p2 <- trda2 %>%
ggtree() +
geom_nodelab(
mapping = aes(
x = branch,
label = bootstrap
),
nudge_y = 0.36
) +
xlim(-.1, 5.5) +
geom_tippoint(
mapping = aes(
shape = trophic_habit,
color = trophic_habit,
size = mass_in_kg
)
) +
scale_size_continuous(range = c(3, 10)) +
geom_tiplab(
offset = .14,
) +
geom_nodelab(
mapping = aes(
label = vernacularName.y,
fill = posterior
),
geom = "label"
) +
scale_fill_gradientn(colors = RColorBrewer::brewer.pal(3, "YlGnBu")) +
theme(legend.position = "right")
plot_list(p1, p2, tag_levels='A', ncol=2)
```
The `outgroup` parameter represents the specific new `outgroup`, it can be a node label (character) or node number. If it is a "single one" value, meaning using the node below this tip as the new root, if it has multiple values, meaning the most recent common of the values will be used as the new root. Note that, if the node labels should be treated as edge labels, the `edgelabel` should be set to `TRUE` to return the correct relationship between the `node` and `associated data`. For more details about re-root, including precautions and pitfalls, please refer to the review article [@reroot_review].
## Rescaling Tree Branches {#rescale-treeio}
Phylogenetic data can be merged for joint analysis (Figure \@ref(fig:correlations)). They can be displayed on the same tree structure as a more complex annotation to help visually inspection of their evolutionary patterns. All the numerical data stored in a `treedata` object can be used to re-scale tree branches. For example, CodeML infers d~N~/d~S~, d~N~, and d~S~, all these statistics can be used as branch lengths (Figure \@ref(fig:rescale)). All these values can also be used to color the tree (session [4.3.4](#color-tree)) and can be projected to a vertical dimension to create a two-dimensional tree or phenogram (session [4.2.2](#layouts-of-phylogenetic-tree) and Figures \@ref(fig:2d) and \@ref(fig:continuousColor)).
(ref:rescalescap) Rescaling tree branches.
(ref:rescalecap) **Rescaling tree branches.** The tree with branches scaled in time (year from the root) (A). The tree was rescaled using *d~N~* as branch lengths (B). The tree was rescaled using substitution rates (C).
```{r rescale, fig.width=12, fig.height=4.5, message=F, echo=T, fig.cap="(ref:rescalecap)", fig.scap="(ref:rescalescap)", out.extra='', out.width="100%"}
p1 <- ggtree(merged_tree) + theme_tree2()
p2 <- ggtree(rescale_tree(merged_tree, 'dN')) + theme_tree2()
p3 <- ggtree(rescale_tree(merged_tree, 'rate')) + theme_tree2()
plot_list(p1, p2, p3, ncol=3, tag_levels='A')
```
Modifying branch lengths in the tree object in addtion to using the `rescale_tree()` function, users can directly specify a variable as branch length in `ggtree()` as demonstrated in [session 4.3.6](#rescale-tree).
## Subsetting Tree with Data
### Removing tips in a phylogenetic tree {#remove-tip}
Sometimes we want to remove selected tips from a phylogenetic tree. This is due to several reasons, including low sequence quality, errors in sequence assembly, an alignment error in part of the sequence, an error in phylogenetic inference, *etc*.
Let's say that we want to remove three tips (colored red) from the tree (Figure \@ref(fig:removeTip)A), the `drop.tip()` method removes specified tips and updates the tree (Figure \@ref(fig:removeTip)B). All associated data will be maintained in the updated tree.
(ref:removeTipscap) Removing tips from a tree.
(ref:removeTipcap) **Removing tips from a tree.** Original tree with three tips (colored red) to remove (A). The updated tree removed selected tips (B).
```{r removeTip, fig.width=12, fig.height=6, fig.cap="(ref:removeTipcap)", fig.scap="(ref:removeTipscap)", out.width="100%"}
f <- system.file("extdata/NHX", "phyldog.nhx", package="treeio")
nhx <- read.nhx(f)
to_drop <- c("Physonect_sp_@2066767",
"Lychnagalma_utricularia@2253871",
"Kephyes_ovata@2606431")
p1 <- ggtree(nhx) + geom_tiplab(aes(color = label %in% to_drop)) +
scale_color_manual(values=c("black", "red")) + xlim(0, 0.8)
nhx_reduced <- drop.tip(nhx, to_drop)
p2 <- ggtree(nhx_reduced) + geom_tiplab() + xlim(0, 0.8)
plot_list(p1, p2, ncol=2, tag_levels = "A")
```
### Subsetting tree by tip label {#subset-tip}
Sometimes a tree can be large and difficult to look at only the portions of interest. The `tree_subset()` function was created in the `r Biocpkg("treeio")` package [@wang_treeio_2020] to extract a subset of the tree portion while still maintaining the structure of the tree portion. The `beast_tree` in Figure \@ref(fig:subsetTip)A is slightly crowded. Obviously, we can make the figure taller to allow more space for the labels (similar to using the "Expansion" slider in `FigTree`) or we can make the text smaller. However, these solutions are not always applicable when you have a lot of tips (*e.g.*, hundreds or thousands of tips). In particular, when you are only interested in the portion of the tree around a particular tip, you certainly don't want to explore a large tree to find the certain species you are interested in.
Let's say you are interested in tip *A/Swine/HK/168/2012* from the tree (Figure \@ref(fig:subsetTip)A) and you want to look at the immediate relatives of this tip.
The `tree_subset()` function allows you to look at the portions of the tree that are of interest. By default, the `tree_subset()` function will internally call the [`groupOTU()`](#groupotu) to assign the group specified tip from the rest of the other tips (Figure \@ref(fig:subsetTip)B). Additionally, the branch lengths and related associated data are maintained after subsetting (Figure \@ref(fig:subsetTip)C). The root of the tree is always anchored at zero for the subset tree by default and all the distances are relative to this root. If you want all the distances to be relative to the original root, you can specify the root position (by `root.position` parameter) to the root edge of the subset tree, which is the sum of branch lengths from the original root to the root of the subset tree (Figures \@ref(fig:subsetTip)D and E).
(ref:subsetTipscap) Subsetting tree for a specific tip.
(ref:subsetTipcap) **Subsetting tree for a specific tip.** Original tree (A). Subset tree (B). Subset tree with data (C). Visualize the subset tree relative to the original position, without rootedge (D) and with rootedge (E).
```{r subsetTip, fig.width=12, fig.height=7, echo=T, fig.cap="(ref:subsetTipcap)", fig.scap="(ref:subsetTipscap)", out.extra='', out.width="100%"}
beast_file <- system.file("examples/MCC_FluA_H3.tree", package="ggtree")
beast_tree <- read.beast(beast_file)
p1 = ggtree(beast_tree) +
geom_tiplab(offset=.05) + xlim(0, 40) + theme_tree2()
tree2 = tree_subset(beast_tree, "A/Swine/HK/168/2012", levels_back=4)
p2 <- ggtree(tree2, aes(color=group)) +
scale_color_manual(values = c("black", "red"), guide = 'none') +
geom_tiplab(offset=.2) + xlim(0, 4.5) + theme_tree2()
p3 <- p2 +
geom_point(aes(fill = rate), shape = 21, size = 4) +
scale_fill_continuous(low = 'blue', high = 'red') +
xlim(0,5) + theme(legend.position = 'right')
p4 <- ggtree(tree2, aes(color=group),
root.position = as.phylo(tree2)$root.edge) +
geom_tiplab() + xlim(18, 24) +
scale_color_manual(values = c("black", "red"), guide = 'none') +
theme_tree2()
p5 <- p4 +
geom_rootedge() + xlim(0, 50)
plot_list(p1, p2, p3, p4, p5,
design="AABBCC\nAADDEE", tag_levels='A')
```
### Subsetting tree by internal node number {#subset-node}
If you are interested in a certain clade, you can specify the input node as an internal node number. The `tree_subset()` function will take the clade as a whole and also trace it back to particular levels to look at the immediate relatives of the clade (Figures \@ref(fig:subsetNode)A and B). We can use the `tree_subset()` function to zoom in selected portions and plot a whole tree with the portion of it, which is similar to the `ape::zoom()` function to explore a very large tree (Figures \@ref(fig:subsetNode)C and D). Users can also use `viewClade()` function to restrict tree visualization at specific clade as demonstrated in [session 6.1](#viewing-selected-clade).
(ref:subsetNodescap) Subsetting tree for a specific clade.
(ref:ssubsetNodecap) **Subsetting tree for the specific clade.** Extracting a clade (A). Extracting a clade and tracing it back to look at its immediate relatives (B). Viewing a very large tree (C) and a selected portion of it (D).
```{r subsetNode, fig.width=10, fig.height=10, echo=T, fig.cap="(ref:ssubsetNodecap)", fig.scap="(ref:subsetNodescap)", out.extra='', out.width="100%"}
clade <- tree_subset(beast_tree, node=121, levels_back=0)
clade2 <- tree_subset(beast_tree, node=121, levels_back=2)
p1 <- ggtree(clade) + geom_tiplab() + xlim(0, 5)
p2 <- ggtree(clade2, aes(color=group)) + geom_tiplab() +
xlim(0, 9) + scale_color_manual(values=c("black", "red"))
library(ape)
library(tidytree)
library(treeio)
data(chiroptera)
nodes <- grep("Plecotus", chiroptera$tip.label)
chiroptera <- groupOTU(chiroptera, nodes)
clade <- MRCA(chiroptera, nodes)
x <- tree_subset(chiroptera, clade, levels_back = 0)
p3 <- ggtree(chiroptera, aes(colour = group)) +
scale_color_manual(values=c("black", "red")) +
theme(legend.position = "none")
p4 <- ggtree(x) + geom_tiplab() + xlim(0, 6)
plot_list(p1, p2, p3, p4,
ncol=2, tag_levels = 'A')
```
## Manipulating Tree Data for Visualization {#ggtree-fortify}
Tree visualization is supported by `r Biocpkg("ggtree")` [@yu_ggtree:_2017]. Although `r Biocpkg("ggtree")`\index{ggtree} implemented several methods for [visual exploration of trees with data](#chapter6), you may want to do something that is not supported directly. In this case, you need to manipulate tree data with node coordination positions that are used for visualization. This is quite easy with `r Biocpkg("ggtree")`. Users can use the `fortify()` method which internally calls `tidytree::as_tibble()` to convert the tree to a tidy data frame and add columns of coordination positions (*i.e.*, x, y, branch, and angle) that are used to plot the tree. You can also access the data via `ggtree(tree)$data`.
Here is an example to plot two trees face-to-face that is similar to a graph produced by the `ape::cophyloplot()` function\index{ape} (Figure \@ref(fig:cophylo))\index{geom\textunderscore tree}.
(ref:cophyloscap) Plot two phylogenetic trees face to face.
(ref:cophylocap) **Plot two phylogenetic trees face-to-face.** Plotting a tree using `ggtree()` (left-hand side) and subsequently adding another layer of a tree by `geom_tree()` (right-hand side). The relative positions of the plotted trees can be manually adjusted and adding layers to each of the trees (*e.g.*, tip labels and highlighting clades) is independent.
```{r cophylo, fig.width=8, fig.height=6, message=F, echo=T, fig.cap="(ref:cophylocap)", fig.scap="(ref:cophyloscap)", out.extra='', out.width="100%"}
library(dplyr)
library(ggtree)
set.seed(1024)
x <- rtree(30)
y <- rtree(30)
p1 <- ggtree(x, layout='roundrect') +
geom_hilight(
mapping=aes(subset = node %in% c(38, 48, 58, 36),
node = node,
fill = as.factor(node)
)
) +
labs(fill = "clades for tree in left" )
p2 <- ggtree(y)
d1 <- p1$data
d2 <- p2$data
## reverse x-axis and
## set offset to make the tree on the right-hand side of the first tree
d2$x <- max(d2$x) - d2$x + max(d1$x) + 1
pp <- p1 + geom_tree(data=d2, layout='ellipse') +
ggnewscale::new_scale_fill() +
geom_hilight(
data = d2,
mapping = aes(
subset = node %in% c(38, 48, 58),
node=node,
fill=as.factor(node))
) +
labs(fill = "clades for tree in right" )
dd <- bind_rows(d1, d2) %>%
filter(!is.na(label))
pp + geom_line(aes(x, y, group=label), data=dd, color='grey') +
geom_tiplab(geom = 'shadowtext', bg.colour = alpha('firebrick', .5)) +
geom_tiplab(data = d2, hjust=1, geom = 'shadowtext',
bg.colour = alpha('firebrick', .5))
```
It is quite easy to plot multiple trees and connect taxa in one figure; for instance, plotting trees constructed from all internal gene segments of influenza virus and connecting equivalent strains across the trees [@venkatesh_avian_2018]. Figure \@ref(fig:cophylo3) demonstrates the usage of plotting multiple trees by combining multiple layers of `geom_tree()`.
(ref:cophylo3scap) Plot multiple phylogenetic trees side-by-side.
(ref:cophylo3cap) **Plot multiple phylogenetic trees side-by-side.** Plotting a tree using `ggtree()` and subsequently adding multiple layers of trees by `geom_tree()`.
```{r cophylo3, fig.width=8, fig.height=6, message=F, echo=T, fig.cap="(ref:cophylo3cap)", fig.scap="(ref:cophylo3scap)", out.extra='', out.width="100%"}
z <- rtree(30)
d2 <- fortify(y)
d3 <- fortify(z)
d2$x <- d2$x + max(d1$x) + 1
d3$x <- d3$x + max(d2$x) + 1
dd = bind_rows(d1, d2, d3) %>%
filter(!is.na(label))
p1 + geom_tree(data = d2) + geom_tree(data = d3) + geom_tiplab(data=d3) +
geom_line(aes(x, y, group=label, color=node < 15), data=dd, alpha=.3)
```
## Summary {#summary2}
The `r Biocpkg("treeio")` package allows us to import diverse phylogeny associated data into R. However, a phylogenetic tree is stored in a way to facilitate computational processing which is not human friendly and needs the expertise to manipulate and explore tree data. The `r CRANpkg("tidytree")` package provides a tidy interface for exploring tree data, while `r Biocpkg("ggtree")` provides a set of utilities to visualize and explore tree data using the grammar of graphics. This full suite of packages makes it easy for ordinary users to interact with tree data and allows us to integrate phylogeny associated data from different sources (*e.g.*, experimental results or analysis findings), which creates the possibilities of integrative and comparative study. Moreover, this package suite brings phylogenetic analysis into the tidyverse and certainly takes us to the next level of processing phylogenetic data.