# (PART\*) TREE THINKING {-}
```{r, include = FALSE}
ottrpal::set_knitr_image_path()
```
# Phylogenetics basics
## How to read a phylogenetic tree
A **phylogeny**, or **phylogenetic tree**, is a diagram that shows the evolutionary history and relationships among or within groups of organisms. Phylogenetics was traditionally a somewhat obscure field in which systematists (biologists concerned with arranging organisms into a tree that shows their ancestral relatedness) placed related living organisms at the **tips** (or “leaves”) of the tree and drew **branches** to connect those organisms back to putative ancestral organisms.
Here's a phylogeny of the family Ursidae (the bears).
<img src="resources/images/03-bear_phylogeny.png" title="Major point!! example image" alt="Major point!! example image" style="display: block; margin: auto;" />
In this tree, all the extant (currently living) species are at the tips on the far right side of the phylogeny. Inferences about how the bear species are related become apparent as you move away from the tips and down the branches. When two branches meet at a **node** (as they do at point A), you can assume the species at the tips of those branches share a common ancestor. For example, this phylogeny of the Ursidae indicates that American black bears and Asian black bears share a common ancestor (indicated by the node at point A). However, we don’t _know_ for certain what the common ancestor was; we are inferring it from similarities between the species that exist today.
Nodes that are closer to the tips indicate species that are more closely related (and thus indicate a more recent common ancestor than nodes farther away from the tips). American black bears are more closely related to Asian black bears than they are to giant pandas, because the American black bear branch connects to a node shared with the Asian black bear branch (point A) before it connects to a node shared with the giant panda branch (point B).
Another unusual thing about phylogenies is that we can change the order of the taxa at the tips without changing the topology of the tree.
<img src="resources/images/03-bear_phylogeny_v1.png" title="Major point!! example image" alt="Major point!! example image" style="display: block; margin: auto;" />
These two trees are the same, even though we have changed the position of the labels of American black bear and Asian black bear. **In phylogenetic trees, relatedness is expressed by the distance to a common node between two species, NOT by whether the labels are near each other.** Branches can rotate freely around nodes without changing the tree.
## Outgroups
Although this is a phylogeny of the Ursidae, you might have noticed there are two branches belonging to the gray wolf and the spotted seal, neither of which is a bear. These two species are included as outgroups. Outgroups are taxa that are only distantly related to the group of interest and serve as reference points for determining evolutionary changes.
## Branch lengths
Branch lengths (the distance between two nodes, or between a node and a tip) may or may not indicate the passage of a particular amount of time. It depends on how the tree was inferred (we infer phylogenetic trees; we don't make them). If the tree was inferred by parsimony or neighbor-joining methods, the branches simply indicate that there was at least one change from the ancestor to the descendant. If the tree was inferred using maximum likelihood methods, the branch lengths represent how many genetic changes are estimated to have occurred along each branch.
Regardless of how the trees are inferred, they are estimates of what we think happened historically. Each estimate carries implicit assumptions about the rate at which mutations accumulate, the relative likelihood of different types of changes (transitions vs. transversions, for example), and so on. The tree is our best hypothesis as to the history of the organisms on it, but it is only a hypothesis.
::: {.dictionary}
At one time, only morphological data could be used to make these trees. Thus, phylogenetic trees might have been based on similarities of bone structures, fur types, or other gross anatomical features. Even though the trees were called “phylogenetic” trees, they were not based on genetic data.
Now, phylogenetic trees are generally based on DNA sequences (for closely related species) or amino acid sequences (for more distantly related species). Furthermore, the trees are generally based on several genetic loci rather than on the whole genome. This is changing with next-generation sequencing and advances in computing power. Nevertheless, at present most phylogenetic trees are “gene trees” rather than “species trees,” and it is important to remember that selection or drift on a particular locus can influence a tree so that it reflects the history of the gene, but NOT the history of the species.
:::
# Visualizing trees in R
## Creating a Newick object
Many phylogenetics programs use the Newick tree format to store phylogenetic trees. This format uses a series of parentheses, commas, and colons to store information about evolutionary relationships.
* (A,B) indicates a pair of taxa that form their own group, or _clade_
* ((A,B),C) indicates the next most closely related taxon to the A-B clade is taxon C
* (A:5,B:7) tells the program (and us!) the length of the branch connecting each taxon to the node. In this case, the branch length between the node and A is 5 and the branch length for B is 7. The total distance between A and B is 5+7, or 12.
* ((A,B),C); ends with a semicolon, which tells the program the tree is complete. If the semicolon is missing at the end, the program will keep looking for information on another taxon.
For this exercise, we are going to create an R object in Newick format that illustrates the relationships among several species of mammals.
```{r, echo=FALSE, warning = FALSE, message = FALSE}
# install each package only if it is not already available
for (pkg in c('ape', 'phytools', 'nlme')) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}
```
```{r, warning = FALSE, message = FALSE}
#install.packages('ape') #this installs the ape package
#install.packages('nlme') #this installs the nlme package
library(ape) #this opens the ape package
library(nlme) #this opens the nlme package
#we first create an object that stores the tree information
mammal.1 <- read.tree(text = "((((raccoon:19.19959,bear:6.80041):0.84600,
weasel:18.87953):2.09460):3.87382,dog:25.46154);")
#typing the name of the object means R will tell us about it
mammal.1
```
We now have a phylogenetic tree loaded into R.
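The branch-length arithmetic from the Newick rules above can be checked directly in R. The sketch below is a minimal illustration and not part of the original exercise: the three-taxon tree `toy`, the length of its internal branch, and the length of the branch to C are hypothetical values chosen for the example. It relies on `cophenetic()` from `ape`, which returns the distance between every pair of tips measured along the branches.

```{r}
library(ape)

# hypothetical three-taxon tree; A and B reuse the lengths from the (A:5,B:7) example
toy <- read.tree(text = "((A:5,B:7):2,C:10);")

# pairwise distances between tips, measured along the branches
toy.dist <- cophenetic(toy)

# distance between A and B: 5 + 7
toy.dist["A", "B"]

# distance between A and C also crosses the internal branch: 5 + 2 + 10
toy.dist["A", "C"]
```

The same approach works for any pair of tips in a larger tree, which saves adding up branch lengths by hand.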
::: {.dictionary}
Why is it called Newick format?
This is what Joe Felsenstein, one of the giants of the phylogenetic field, says:
"The Newick Standard was adopted 26 June 1986 by an informal committee meeting convened by me during the Society for the Study of Evolution meetings in Durham, New Hampshire and consisting of James Archie, William H.E. Day, Wayne Maddison, Christopher Meacham, F. James Rohlf, David Swofford, and myself. (The committee was not an activity of the SSE nor endorsed by it). The reason for the name is that the second and final session of the committee met at Newick's restaurant in Dover, New Hampshire, and we enjoyed the meal of lobsters. The tree representation was a generalization of one developed by Christopher Meacham in 1984 for the tree plotting programs that he wrote for the PHYLIP package while visiting Seattle. His visit was a sabbatical leave from the University of Georgia, which thus indirectly partly funded that work."
:::
## Drawing trees
It is quite difficult for humans to quickly interpret the relationships and branch lengths in the Newick format. Luckily, R (and other phylogenetics programs) can convert Newick formats into a more understandable form.
```{r}
#plot is the command we use to create trees with the ape package
#one of the options is the type of tree the command draws
#this can also be written as plot(mammal.1, "u")
plot(mammal.1, type="unrooted")
```
You've drawn an unrooted tree. It probably looks a bit different from trees you've seen before (including the one in the previous section); most trees are displayed in a rooted form. We can do that by specifying that we want to draw a phylogram. If you don't declare an outgroup first, R draws the phylogram with the root wherever the outermost parentheses of the Newick string place it, which is not necessarily where the true root lies.
```{r}
#here we draw a phylogram
#alternatively, you can use the command:
#plot(mammal.1), as phylogram is the default type
plot(mammal.1, type="phylogram")
```
Now the tree looks more like the Ursidae tree we examined earlier. The order of the tips is partly determined by the order in which we wrote the taxa in our Newick format. We can change the order of the tips and still have the same tree.
```{r}
mammal.2 <- read.tree(text = "((((bear:6.80041,raccoon:19.19959):0.84600,
weasel:18.87953):2.09460):3.87382,dog:25.46154);")
#this bit of code tells R to put the trees side by side in
#a single row (1 row, 2 columns)
par(mfrow=c(1,2))
plot(mammal.1)
plot(mammal.2)
```
Clades can rotate freely around nodes without changing the relationships among the tips. Although the "weasel" label is closer to "bear" in our first tree than it is in the second tree, the evolutionary distance between the two is the same in both trees, because we trace through the same nodes to find their common ancestor. **Both of these trees are exactly the same, in a phylogenetic sense.**
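One way to convince yourself that rotating a clade changes nothing is to compare the tip-to-tip distances of the two versions. The sketch below is an illustration under stated assumptions rather than part of the original exercise: the Newick strings are simplified copies of the four-taxon trees above (the two nested branch lengths 2.09460 and 3.87382 are merged into a single branch of 5.96842), and the object names `tree.a` and `tree.b` are hypothetical.

```{r}
library(ape)

# same relationships, with raccoon and bear written in opposite orders
tree.a <- read.tree(text = "(((raccoon:19.19959,bear:6.80041):0.84600,weasel:18.87953):5.96842,dog:25.46154);")
tree.b <- read.tree(text = "(((bear:6.80041,raccoon:19.19959):0.84600,weasel:18.87953):5.96842,dog:25.46154);")

# pairwise distances between tips, measured along the branches
dist.a <- cophenetic(tree.a)
dist.b <- cophenetic(tree.b)

# put rows and columns in the same (alphabetical) order before comparing
tips <- sort(tree.a$tip.label)
same.distances <- isTRUE(all.equal(dist.a[tips, tips], dist.b[tips, tips]))
same.distances

# the weasel-to-bear distance traces through the same nodes in both trees
dist.a["weasel", "bear"]
dist.b["weasel", "bear"]
```

If the two trees differed in anything other than the order of the labels, at least one pairwise distance would change and the comparison would no longer hold.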
## Adding outgroups
Let's add some more taxa to our tree!
```{r}
mammal.3 <- read.tree(text = "((raccoon:19.19959,bear:6.80041):0.84600,((sea_lion:11.99700,
seal:12.00300):7.52973,((monkey:100.85930,cat:47.14069):20.59201,
weasel:18.87953):2.09460):3.87382,dog:25.46154);")
mammal.3
```
We've now added 4 more taxa to our tree of mammalian species, for a total of 8. Let's first take a look at the unrooted tree.
```{r}
plot(mammal.3, type="u")#"u" is short for "unrooted"
```
Even with the unrooted tree, we can see that some species are definitely more closely related than others. In fact, it looks like both "cat" and "monkey" are pretty distantly related to the others, since the branches connecting these taxa are much longer than any other branch. Given this information, we will define these two taxa as our outgroup and redraw our tree, this time as a rooted phylogram.
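The visual impression that "monkey" and "cat" sit on the longest branches can be checked numerically. The sketch below is one possible way to do it, not part of the original exercise: it rebuilds the eight-taxon tree under the hypothetical name `mammal.copy` and pulls out the length of the branch leading to each tip. It assumes the usual `ape` convention that tips are numbered 1 to the number of tips in the second column of the `edge` matrix.

```{r}
library(ape)

# the eight-taxon tree from above, split across lines with paste0 for readability
mammal.copy <- read.tree(text = paste0(
  "((raccoon:19.19959,bear:6.80041):0.84600,((sea_lion:11.99700,",
  "seal:12.00300):7.52973,((monkey:100.85930,cat:47.14069):20.59201,",
  "weasel:18.87953):2.09460):3.87382,dog:25.46154);"))

# match() finds the row of the edge matrix (the branch) that ends at each tip
tip.rows <- match(seq_along(mammal.copy$tip.label), mammal.copy$edge[, 2])

# label each terminal branch length with its taxon and sort, longest first
tip.lengths <- setNames(mammal.copy$edge.length[tip.rows], mammal.copy$tip.label)
sort(tip.lengths, decreasing = TRUE)
```

Long terminal branches alone do not prove that a taxon is an outgroup, but they are a reasonable starting point for choosing one.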
```{r}
#this command tells R that monkey and cat are outgroups
mammal.3.root <- root(mammal.3, outgroup = c('monkey','cat'))
plot(mammal.3.root, type="p")#"p" is short for "phylogram"
```
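It can be useful to ask R whether it considers a tree rooted. The sketch below is a hedged illustration, not part of the original exercise: it rebuilds the eight-taxon tree under the hypothetical name `mammal.copy` and uses `resolve.root = TRUE`, an optional argument of `root()` that the code above does not use. As we understand `ape`'s convention, a tree whose deepest node has three descendants is treated as unrooted, and `resolve.root = TRUE` asks for a root with exactly two descendants. The exact behavior may vary with the version of `ape`.

```{r}
library(ape)

mammal.copy <- read.tree(text = paste0(
  "((raccoon:19.19959,bear:6.80041):0.84600,((sea_lion:11.99700,",
  "seal:12.00300):7.52973,((monkey:100.85930,cat:47.14069):20.59201,",
  "weasel:18.87953):2.09460):3.87382,dog:25.46154);"))

# as read from this Newick string, the deepest node has three descendants
is.rooted(mammal.copy)

# root on the outgroup and ask for a root with two descendants
mammal.copy.root <- root(mammal.copy, outgroup = c("monkey", "cat"),
                         resolve.root = TRUE)
is.rooted(mammal.copy.root)

# the outgroup taxa should still form a single clade after rooting
is.monophyletic(mammal.copy.root, c("monkey", "cat"))
```

Whether the root is drawn with two or three descendants does not change the relationships among the ingroup taxa; it only changes how the base of the tree is represented.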
## Drawing trees multiple ways
So far you've drawn trees in two ways: unrooted and as a phylogram. For both of these tree types, the branch lengths are scaled to indicate evolutionary distance (or how many changes have occurred). As a result, the tips aren't all even with each other.
There are two other common ways of drawing trees. The radial tree (sometimes called the fan tree) arranges all the branches in a circle. This is a popular way to draw a phylogeny with many tips that would otherwise take up a lot of space.
```{r}
#now we're drawing the same tree three different ways, next to each other
#basically, figures are in 1 row and 3 columns
par(mfrow=c(1,3))
plot(mammal.3, type="u")
plot(mammal.3.root, type="p")
plot(mammal.3.root, type="f")#f is short for "fan"
```
All three of these trees show exactly the same information.
The last common way to draw trees is as a cladogram. Cladograms are a little different than the others, because the branches are not scaled to evolutionary distance. Instead, the tree is drawn so that all the tips (taxa) are lined up. It is often easier to see relationships in a cladogram, particularly if the internode distances (the distance between two internal nodes of a tree) are small.
To properly draw a cladogram, we will rewrite our tree in Newick format so that it doesn't include branch lengths.
```{r}
mammal.4 <- read.tree(text = "(dog,(raccoon,bear),((seal,sea_lion),
((monkey,cat), weasel)));")
mammal.4.root <- root(mammal.4, outgroup = c('monkey','cat'))
par(mfrow=c(1,2))
plot(mammal.3.root, type="p")
plot(mammal.4.root, type="c")#c is short for "cladogram"
```
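Retyping the Newick string is not the only way to drop branch lengths. The sketch below shows two alternatives and is offered as an illustration, not as the approach this exercise uses: the object names `mammal.copy` and `mammal.nolen` are hypothetical, `use.edge.length` is an argument of `ape`'s plotting function, and setting `edge.length` to `NULL` removes the lengths from a copy of the tree.

```{r}
library(ape)

mammal.copy <- read.tree(text = paste0(
  "((raccoon:19.19959,bear:6.80041):0.84600,((sea_lion:11.99700,",
  "seal:12.00300):7.52973,((monkey:100.85930,cat:47.14069):20.59201,",
  "weasel:18.87953):2.09460):3.87382,dog:25.46154);"))
mammal.copy <- root(mammal.copy, outgroup = c("monkey", "cat"))

# option 1: keep the branch lengths in the object but ignore them when drawing
plot(mammal.copy, type = "c", use.edge.length = FALSE)

# option 2: work on a copy of the tree with the branch lengths removed
mammal.nolen <- mammal.copy
mammal.nolen$edge.length <- NULL
plot(mammal.nolen, type = "c")
```

Both options leave the original tree untouched, so the branch lengths remain available for later plots.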
::: {.reflection}
QUESTIONS
1. What is the total branch length between "bear" and "raccoon"? (You will need to look at the tree in Newick format.)
2. Does "weasel" share a more recent common ancestor with "seal" or with "sea lion"?
3. Why does it look like "weasel" is more closely related to "bear" in the tree with four taxa, but it looks like "dog" is more closely related to "bear" in the tree with eight taxa? (HINT: Think about the purpose of an outgroup, and whether we specified one for the four-taxa tree.)
:::
## The `phylo` class
When we use the `ape` package, R converts a tree in Newick format to an object of the `phylo` class. This is basically a list with four elements.
```{r}
str(mammal.3.root)
```
Each element holds information about some part of the tree.

* _edge_: a two-column matrix with one row per branch. Each row gives the number of the node where the branch starts and the number of the node or tip where it ends. It's easiest to think of each branch as an edge.
* _edge.length_: the length of each corresponding edge, or branch
* _Nnode_: the number of internal nodes in the tree
* _tip.label_: the tip names (the taxa)
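The parts of a `phylo` object are easier to read on a very small tree. The sketch below is a minimal illustration that is not part of the original exercise: the three-taxon tree `toy` and its branch lengths are hypothetical values chosen for the example. Each element is extracted with `$`, as with any R list.

```{r}
library(ape)

# hypothetical three-taxon tree with two internal nodes and four branches
toy <- read.tree(text = "((A:5,B:7):2,C:10);")

class(toy)

# one row per branch: the node where it starts, then the node or tip where it ends
toy$edge

# one length per row of the edge matrix
toy$edge.length

# number of internal nodes
toy$Nnode

# tip names, in the order the tips are numbered
toy$tip.label
```

Tips are numbered first, so in the `edge` matrix the numbers 1 to 3 refer to A, B, and C, and larger numbers refer to internal nodes.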
```{r}
sessionInfo()
```