8/23/24
- References and External Resources
- Libraries
- Query GenBank
- Retrieve Sequences
- Alignment
- Substitution Model
- Maximum Likelihood Tree
- Bootstrap Values
- Tree Plots
- BEAST
- Tracer Type Stats
- Maximum Clade Credability Tree
Osmondi et al 2020(2020) The role of African buffalo in the epidemiology of foot-and-mouth disease in sympatric cattle and buffalo populations in Kenya. https://doi.org/10.1111/tbed.13573
GenBank [PopSet: 1685824549] (https://www.ncbi.nlm.nih.gov/popset?LinkName=nuccore_popset&from_uid=1685824549)
A few packages are needed. First, these help withdirectory management, visualization, and data wrangling.
Hide code
library(here) # directory management
library(tidyverse) #ggplot, lubridate, and the like
options(dplyr.summarise.inform = FALSE) # don't render data default summaries
library(ggmap) # maps
library(ggspatial) # spatial plots
library(pals) # color pallets
library(gt) # pretty tables
library(coda) # mcmc summaries/tools
Next, several genetics specific packages are recommended. The BioManager packages may take a few minutes to compile.
Hide code
library(ape) #Analyses of Phylogenetics and Evolution (APE)
library(phangorn) # phylogenetic trees and networks
library(rentrez) # R interface to the NCBI - GenBank
# these next 4 pieces of code check each package, then installs them if not already installed.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
if (!requireNamespace("Biostrings", quietly = TRUE)) {
BiocManager::install("Biostrings")
}
if (!requireNamespace("msa", quietly = TRUE)) {
BiocManager::install("msa")
}
if (!requireNamespace("ggtree", quietly = TRUE)) {
BiocManager::install("ggtree")
}
# if now installed, load the packages
library(Biostrings) # sequence wrangling
library(msa) # Multiple Sequence Alignment (MSA) algorithms
library(ggtree) # tree visualization and annotation
Custom functions created for this demo.
Hide code
source(here("R/utilities.R"))
source_dir(here("R"))
The Osmondi paper provided a listing of accession numbers as a
supplemental. This was attached as a Word Doc, which I copy-and-pasted
into osmondi_2020_supplemental.csv
.
Hide code
osmondi_seqs <- read_csv(here("assets/osmondi_2020_supplemental.csv"))
Rows: 98 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Serotype, Accession, Strain Name, Species
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Hide code
head(osmondi_seqs)
# A tibble: 6 × 4
Serotype Accession `Strain Name` Species
<chr> <chr> <chr> <chr>
1 SAT1 MH882580 K29 Cattle
2 SAT1 MH882590 PRB3 Buffalo
3 SAT1 MH882596 PRB4 Buffalo
4 SAT1 MH882603 PRB6 Buffalo
5 SAT1 MH882582 PRB10 Buffalo
6 SAT1 MH882598 PRB5 Buffalo
Hide code
search_term <- paste(osmondi_seqs$Accession, collapse = " OR ")
genbank_return <- entrez_search(
db = "nucleotide",
term = search_term,
retmax = 100 # the default is 20 records, our list is 95
)
Alternatively, search the database for the popset of interest and return all metadata. This is simpler, but not always available and GenBank may stop using PopSet numbers altogether next year.
Hide code
popset_id <- "1685824549" # "population set" from GenBank page
genbank_return <-
entrez_search(db = "nucleotide",
term = paste0("POPSET:", popset_id), # search by popset number
retmax = 100)
No actual sequences yet, only metadata. Even if you can easily access the sequences, the expanded retrieval process in this script is often needed to pull additional info, like collection dates, isolate names/labels, or geographic coordinates.
Hide code
genbank_return # check results
Entrez search result with 97 hits (object contains 97 IDs and no web_history object)
Search term (as translated): POPSET[All Fields] AND 1685824549[All Fields]
Hide code
class(genbank_return)
[1] "esearch" "list"
Hide code
str(genbank_return)
List of 5
$ ids : chr [1:97] "1685824741" "1685824739" "1685824737" "1685824735" ...
$ count : int 97
$ retmax : int 97
$ QueryTranslation: chr "POPSET[All Fields] AND 1685824549[All Fields]"
$ file :Classes 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
- attr(*, "class")= chr [1:2] "esearch" "list"
Using the metadata for each record, the desired data is pulled one at a time, then organized as a data frame. Lot’s of character string wrangling, yuk!
First, create an empty data frame to hold results
Hide code
seq_meta <- data.frame(Accession=character(),
Collection=character(),
Host=character(),
Isolate=character(),
stringsAsFactors=FALSE)
Then, loop through each record and pull desired data. In this case, searching the metadata for accession numbers, collection dates, host type, and the more detailed isolate names.
Hide code
for (id in genbank_return$ids) {
try({
record <- entrez_fetch(db="nucleotide", id=id, rettype="gb", retmode="text")
accession <- sub("^.*?ACCESSION\\s+([^\n]+).*", "\\1", record)
Collection <- ifelse(grepl("/collection_date=", record),
sub("^.*?/collection_date=\"([^\"]+)\".*", "\\1", record), NA)
host <- ifelse(grepl("/host=", record),
sub("^.*?/host=\"([^\"]+)\".*", "\\1", record), NA)
isolate <- ifelse(grepl("/isolate=", record),
sub("^.*?/isolate=\"([^\"]+)\".*", "\\1", record), NA)
# add to data frame
seq_meta <- rbind(seq_meta, data.frame(Accession=accession,
Collection=Collection,
Host=host,
Isolate=isolate,
stringsAsFactors=FALSE)) %>%
as.data.frame()
# delay to prevent overwhelming the API server
Sys.sleep(0.5) # this gives a 0.5 second gap
}, silent = TRUE) # continue if an error
}
Examine what was retrived.
Hide code
dim(seq_meta)
[1] 97 4
Hide code
head(seq_meta)
Accession Collection Host Isolate
1 MH882663 2016-01 Syncerus_caffer SAT2/KEN/PRB88/2016_pro
2 MH882662 2016-01 Syncerus_caffer SAT2/KEN/PRB87/2016_pro
3 MH882661 2016-01 Syncerus_caffer SAT2/KEN/PRB86/2016_pro
4 MH882660 2016-01 Syncerus_caffer SAT2/KEN/PRB85/2016_pro
5 MH882659 2016-01 Syncerus_caffer SAT2/KEN/PRB83/2016_pro
6 MH882658 2016-01 Syncerus_caffer SAT2/KEN/PRB81/2016_pro
Hide code
# using trimws due to an extra space in the numbers
seq_meta$Accession <- trimws(seq_meta$Accession)
# add a couple more columns
seq_meta$Serotype <- sub("/.*", "", seq_meta$Isolate)
seq_meta$Animal <- sub("^.*/([^/]+)/[^/]+$", "\\1", seq_meta$Isolate)
seq_meta %>%
gt() %>%
tab_header(
title = md("Kenya Sequences Metadata")) %>%
cols_width(starts_with("Accession") ~ px(90),
starts_with("Collection") ~ px(80),
starts_with("Host") ~ px(100),
starts_with("Isolate") ~ px(180),
starts_with("Serotype") ~ px(80),
starts_with("Animal") ~ px(80),
everything() ~ px(95)) %>%
tab_options(table.font.size = "small",
row_group.font.size = "small",
stub.font.size = "small",
column_labels.font.size = "medium",
heading.title.font.size = "large",
data_row.padding = px(2),
heading.title.font.weight = "bold",
column_labels.font.weight = "bold") %>%
opt_stylize(style = 6, color = 'gray')
<tr class="gt_col_headings" style="border-style: none; border-top-style: solid; border-top-width: 2px; border-top-color: #5F5F5F; border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #5F5F5F; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3;">
<th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1" scope="col" id="Accession" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: left;" bgcolor="#5F5F5F" valign="bottom" align="left">Accession</th>
<th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="Collection" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: right; font-variant-numeric: tabular-nums;" bgcolor="#5F5F5F" valign="bottom" align="right">Collection</th>
<th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1" scope="col" id="Host" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: left;" bgcolor="#5F5F5F" valign="bottom" align="left">Host</th>
<th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1" scope="col" id="Isolate" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: left;" bgcolor="#5F5F5F" valign="bottom" align="left">Isolate</th>
<th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1" scope="col" id="Serotype" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: left;" bgcolor="#5F5F5F" valign="bottom" align="left">Serotype</th>
<th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1" scope="col" id="Animal" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: left;" bgcolor="#5F5F5F" valign="bottom" align="left">Animal</th>
</tr>
Kenya Sequences Metadata |
|||||
MH882663 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB88/2016_pro | SAT2 | PRB88 |
MH882662 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB87/2016_pro | SAT2 | PRB87 |
MH882661 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB86/2016_pro | SAT2 | PRB86 |
MH882660 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB85/2016_pro | SAT2 | PRB85 |
MH882659 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB83/2016_pro | SAT2 | PRB83 |
MH882658 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB81/2016_pro | SAT2 | PRB81 |
MH882657 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB80/2016_pro | SAT2 | PRB80 |
MH882656 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB76/2016_pro | SAT2 | PRB76 |
MH882655 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB75/2016_pro | SAT2 | PRB75 |
MH882654 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB74/2016_pro | SAT2 | PRB74 |
MH882653 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB71/2016_pro | SAT2 | PRB71 |
MH882652 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB70/2016_pro | SAT2 | PRB70 |
MH882651 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB69/2016_pro | SAT2 | PRB69 |
MH882650 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB68/2016_pro | SAT2 | PRB68 |
MH882649 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB67/2016_pro | SAT2 | PRB67 |
MH882648 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB66/2016_pro | SAT2 | PRB66 |
MH882647 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB65/2016_pro | SAT2 | PRB65 |
MH882646 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB63/2016_pro | SAT2 | PRB63 |
MH882645 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB62/2016_pro | SAT2 | PRB62 |
MH882644 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB61/2016_pro | SAT2 | PRB61 |
MH882643 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB60/2016_pro | SAT2 | PRB60 |
MH882642 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB59/2016_pro | SAT2 | PRB59 |
MH882641 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB58/2016_pro | SAT2 | PRB58 |
MH882640 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB56/2016_pro | SAT2 | PRB56 |
MH882639 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB54/2016_pro | SAT2 | PRB54 |
MH882638 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB52/2016_pro | SAT2 | PRB52 |
MH882637 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB51/2016_pro | SAT2 | PRB51 |
MH882636 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB50/2016_pro | SAT2 | PRB50 |
MH882635 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB47/2016_pro | SAT2 | PRB47 |
MH882634 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB46/2016_pro | SAT2 | PRB46 |
MH882633 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB44/2016_pro | SAT2 | PRB44 |
MH882632 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB43/2016_pro | SAT2 | PRB43 |
MH882631 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB42/2016_pro | SAT2 | PRB42 |
MH882630 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB41/2016_pro | SAT2 | PRB41 |
MH882629 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB40/2016_pro | SAT2 | PRB40 |
MH882628 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB39/2016_pro | SAT2 | PRB39 |
MH882627 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB36/2016_pro | SAT2 | PRB36 |
MH882626 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB35/2016_pro | SAT2 | PRB35 |
MH882625 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB34/2016_pro | SAT2 | PRB34 |
MH882624 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB23/2016_pro | SAT2 | PRB23 |
MH882623 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB22/2016_pro | SAT2 | PRB22 |
MH882622 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB20/2016_pro | SAT2 | PRB20 |
MH882621 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB19/2016_pro | SAT2 | PRB19 |
MH882620 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB18/2016_pro | SAT2 | PRB18 |
MH882619 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB15/2016_pro | SAT2 | PRB15 |
MH882618 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB14/2016_pro | SAT2 | PRB14 |
MH882617 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB13/2016_pro | SAT2 | PRB13 |
MH882616 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB12/2016_pro | SAT2 | PRB12 |
MH882615 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB6/2016_pro | SAT2 | PRB6 |
MH882614 | 2016-01 | Syncerus_caffer | SAT2/KEN/PRB1/2016_pro | SAT2 | PRB1 |
MH882613 | 2015-03 | Bos_taurus | SAT2/KEN/K39/2015 | SAT2 | K39 |
MH882612 | 2014-09 | Bos_taurus | SAT2/KEN/K137/2014 | SAT2 | K137 |
MH882611 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB9/2016_pro | SAT1 | PRB9 |
MH882610 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB84/2016_pro | SAT1 | PRB84 |
MH882609 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB8/2016_pro | SAT1 | PRB8 |
MH882608 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB73/2016_pro | SAT1 | PRB73 |
MH882607 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB72/2016_pro | SAT1 | PRB72 |
MH882606 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB7/2016_pro | SAT1 | PRB7 |
MH882605 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB64/2016_pro | SAT1 | PRB64 |
MH882604 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB61/2016_pro | SAT1 | PRB61 |
MH882603 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB6/2016_pro | SAT1 | PRB6 |
MH882602 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB59/2016_pro | SAT1 | PRB59 |
MH882601 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB57/2016_pro | SAT1 | PRB57 |
MH882600 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB55/2016_pro | SAT1 | PRB55 |
MH882599 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB51/2016_pro | SAT1 | PRB51 |
MH882598 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB5/2016_pro | SAT1 | PRB5 |
MH882597 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB48/2016_pro | SAT1 | PRB48 |
MH882596 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB4/2016_pro | SAT1 | PRB4 |
MH882595 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB38/2016_pro | SAT1 | PRB38 |
MH882594 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB37/2016_pro | SAT1 | PRB37 |
MH882593 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB36/2016_pro | SAT1 | PRB36 |
MH882592 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB32/2016_pro | SAT1 | PRB32 |
MH882591 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB31/2016_pro | SAT1 | PRB31 |
MH882590 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB3/2016_pro | SAT1 | PRB3 |
MH882589 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB29/2016_pro | SAT1 | PRB29 |
MH882588 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB28/2016_pro | SAT1 | PRB28 |
MH882587 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB27/2016_pro | SAT1 | PRB27 |
MH882586 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB26/2016_pro | SAT1 | PRB26 |
MH882585 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB25/2016_pro | SAT1 | PRB25 |
MH882584 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB17/2016_pro | SAT1 | PRB17 |
MH882583 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB11/2016_pro | SAT1 | PRB11 |
MH882582 | 2016-01 | Syncerus_caffer | SAT1/KEN/PRB10/2016_pro | SAT1 | PRB10 |
MH882581 | 2016-10 | Bos_taurus | SAT1/KEN/K75/2016 | SAT1 | K75 |
MH882580 | 2014-01 | Bos_taurus | SAT1/KEN/K29/2014 | SAT1 | K29 |
MH882579 | 2014-01 | Bos_taurus | SAT1/KEN/K19/2014 | SAT1 | K19 |
MH882578 | 2014-01 | Bos_taurus | SAT1/KEN/K14/2014 | SAT1 | K14 |
MH882577 | 2016 | Syncerus_caffer | O/KEN/PRC60/2016_pro | O | PRC60 |
MH882576 | 2016 | Syncerus_caffer | O/KEN/PRC49/2016_pro | O | PRC49 |
MH882575 | 2016 | Syncerus_caffer | O/KEN/PRC45/2016_pro | O | PRC45 |
MH882574 | 2014-03 | Bos_taurus | O/KEN/K60/2014 | O | K60 |
MH882573 | 2016-04 | Bos_taurus | O/KEN/K19/2016 | O | K19 |
MH882572 | 2015-10 | Bos_taurus | A/KEN/K103/2015 | A | K103 |
MH882571 | 2014-07 | Bos_taurus | A/KEN/K103/2014 | A | K103 |
MH882570 | 2015-03 | Bos_taurus | A/KEN/K39/2015 | A | K39 |
MH882569 | 2014-09 | Bos_taurus | A/KEN/K145/2014 | A | K145 |
MH882568 | 2015-09 | Bos_taurus | A/KEN/K91/2015 | A | K91 |
MH882567 | 2016-10 | Bos_taurus | A/KEN/K74/2016 | A | K74 |
Hide code
seq_meta %>%
group_by(Host, Serotype) %>%
summarise(Count = length(Accession))
# A tibble: 7 × 3
# Groups: Host [2]
Host Serotype Count
<chr> <chr> <int>
1 Bos_taurus A 6
2 Bos_taurus O 2
3 Bos_taurus SAT1 4
4 Bos_taurus SAT2 2
5 Syncerus_caffer O 3
6 Syncerus_caffer SAT1 30
7 Syncerus_caffer SAT2 50
Now, query GenBank for the actual sequences. Example here using SAT1 as an example.
Hide code
sat1_df <- seq_meta %>%
filter(Serotype == "SAT1")
# function to get sequences
get_sequences <- function(accessions) {
sequences <- sapply(accessions, function(acc) {
entrez_fetch(db = "nuccore", id = acc, rettype = "fasta")
})
return(sequences)
}
# run function
sat1_sequences <- get_sequences(sat1_df$Accession)
# remove special characters
sat1_sequences <- gsub("[^ATCG]", "", sat1_sequences)
# save to text file - fasta format
writeLines(sat1_sequences, here("assets/sat1_sequences.fasta"))
Hide code
# ensure all is named correctly
unique_names <- make.unique(names(sat1_sequences))
names(sat1_sequences) <- unique_names
# convert to a DNAStringSet object, needed for the msa package
dna_sequences <- DNAStringSet(sat1_sequences)
# MUSCLE alignment
alignment <- msa(dna_sequences, method = "Muscle")
alignment <- as(alignment, "DNAStringSet")
# save the aligned sequences to a text file
writeXStringSet(alignment, filepath = here("assets/aligned_SAT1.fasta"))
A plot to view the alignment. These are very clean, hardly any breaks or missingness.
Hide code
plot_alignment(alignment)
Read in the saved alignment. This rather than using the version already in the environment, becuase the classes are different.
Hide code
alignment <- read.dna(here("assets/aligned_SAT1.fasta"),
format="fasta", as.matrix=TRUE)
# convert again for modelTest (phangorn pkg)
aligned_phyDat <- as.phyDat(alignment)
# run the test, compare the models
mt <- modelTest(aligned_phyDat)
Model df logLik AIC BIC
JC 65 -4252.262 8634.525 8926.619
JC+I 66 -4028.036 8188.072 8484.66
JC+G(4) 66 -4013.369 8158.738 8455.325
JC+G(4)+I 67 -4010.842 8155.684 8456.766
F81 68 -4247.613 8631.226 8936.801
F81+I 69 -4021.402 8180.804 8490.873
F81+G(4) 69 -4007.382 8152.764 8462.833
F81+G(4)+I 70 -4005.165 8150.33 8464.893
K80 66 -3966.996 8065.992 8362.58
K80+I 67 -3727.478 7588.957 7890.038
K80+G(4) 67 -3707.184 7548.369 7849.45
K80+G(4)+I 68 -3701.669 7539.338 7844.913
HKY 69 -3954.414 8046.827 8356.896
HKY+I 70 -3705.849 7551.699 7866.262
HKY+G(4) 70 -3687.519 7515.039 7829.602
HKY+G(4)+I 71 -3683.581 7509.162 7828.218
TrNe 67 -3964.784 8063.569 8364.65
TrNe+I 68 -3726.731 7589.462 7895.037
TrNe+G(4) 68 -3706.238 7548.476 7854.051
TrNe+G(4)+I 69 -3700.723 7539.445 7849.514
TrN 70 -3950.892 8041.783 8356.346
TrN+I 71 -3702.937 7547.874 7866.931
TrN+G(4) 71 -3685.39 7512.781 7831.837
TrN+G(4)+I 72 -3681.133 7506.266 7829.816
TPM1 67 -3965.579 8065.157 8366.239
TPM1+I 68 -3725.95 7587.901 7893.476
TPM1+G(4) 68 -3705.385 7546.769 7852.345
TPM1+G(4)+I 69 -3699.61 7537.22 7847.289
K81 67 -3965.579 8065.157 8366.239
K81+I 68 -3725.95 7587.901 7893.476
K81+G(4) 68 -3705.385 7546.769 7852.345
K81+G(4)+I 69 -3699.61 7537.22 7847.289
TPM1u 70 -3952.836 8045.672 8360.234
TPM1u+I 71 -3704.189 7550.379 7869.436
TPM1u+G(4) 71 -3685.901 7513.803 7832.859
TPM1u+G(4)+I 72 -3681.689 7507.377 7830.928
TPM2 67 -3962.221 8058.442 8359.523
TPM2+I 68 -3723.348 7582.697 7888.272
TPM2+G(4) 68 -3703.442 7542.885 7848.46
TPM2+G(4)+I 69 -3697.652 7533.304 7843.373
TPM2u 70 -3948.829 8037.658 8352.221
TPM2u+I 71 -3701.332 7544.664 7863.72
TPM2u+G(4) 71 -3683.739 7509.477 7828.534
TPM2u+G(4)+I 72 -3679.416 7502.832 7826.383
TPM3 67 -3965.493 8064.985 8366.067
TPM3+I 68 -3726.879 7589.757 7895.333
TPM3+G(4) 68 -3706.169 7548.339 7853.914
TPM3+G(4)+I 69 -3700.393 7538.786 7848.855
TPM3u 70 -3954.267 8048.534 8363.097
TPM3u+I 71 -3705.651 7553.301 7872.358
TPM3u+G(4) 71 -3687.362 7516.723 7835.78
TPM3u+G(4)+I 72 -3683.487 7510.973 7834.523
TIM1e 68 -3963.369 8062.737 8368.313
TIM1e+I 69 -3725.202 7588.403 7898.472
TIM1e+G(4) 69 -3704.415 7546.83 7856.899
TIM1e+G(4)+I 70 -3698.637 7537.275 7851.837
TIM1 71 -3949.324 8040.648 8359.704
TIM1+I 72 -3701.231 7546.462 7870.013
TIM1+G(4) 72 -3683.711 7511.423 7834.973
TIM1+G(4)+I 73 -3679.196 7504.391 7832.435
TIM2e 68 -3960.031 8056.063 8361.638
TIM2e+I 69 -3722.604 7583.209 7893.278
TIM2e+G(4) 69 -3702.506 7543.011 7853.08
TIM2e+G(4)+I 70 -3696.73 7533.46 7848.023
TIM2 71 -3945.352 8032.704 8351.76
TIM2+I 72 -3698.384 7540.769 7864.319
TIM2+G(4) 72 -3681.549 7507.099 7830.649
TIM2+G(4)+I 73 -3676.939 7499.877 7827.921
TIM3e 68 -3963.278 8062.555 8368.131
TIM3e+I 69 -3726.132 7590.264 7900.333
TIM3e+G(4) 69 -3705.088 7548.177 7858.246
TIM3e+G(4)+I 70 -3699.312 7538.624 7853.186
TIM3 71 -3950.787 8043.575 8362.631
TIM3+I 72 -3702.673 7549.345 7872.896
TIM3+G(4) 72 -3685.209 7514.417 7837.967
TIM3+G(4)+I 73 -3680.952 7507.905 7835.949
TVMe 69 -3960.333 8058.665 8368.734
TVMe+I 70 -3722.258 7584.517 7899.079
TVMe+G(4) 70 -3701.721 7543.442 7858.005
TVMe+G(4)+I 71 -3695.488 7532.976 7852.032
TVM 72 -3948.254 8040.508 8364.059
TVM+I 73 -3700.559 7547.119 7875.163
TVM+G(4) 73 -3682.849 7511.698 7839.742
TVM+G(4)+I 74 -3678.413 7504.826 7837.364
SYM 70 -3958.146 8056.291 8370.854
SYM+I 71 -3721.518 7585.035 7904.092
SYM+G(4) 71 -3700.642 7543.283 7862.34
SYM+G(4)+I 72 -3694.43 7532.861 7856.411
GTR 73 -3944.84 8035.68 8363.724
GTR+I 74 -3697.533 7543.066 7875.604
GTR+G(4) 74 -3680.622 7509.244 7841.781
GTR+G(4)+I 75 -3675.83 7501.66 7838.691
Hide code
mt %>%
arrange(AIC) %>%
slice_head(n=5) %>%
gt()
Model | df | logLik | AIC | AICw | AICc | AICcw | BIC |
---|---|---|---|---|---|---|---|
TIM2+G(4)+I | 73 | -3676.939 | 7499.877 | 0.50564699 | 7518.283 | 0.52686441 | 7827.921 |
GTR+G(4)+I | 75 | -3675.830 | 7501.660 | 0.20739499 | 7521.147 | 0.12582167 | 7838.691 |
TPM2u+G(4)+I | 72 | -3679.416 | 7502.832 | 0.11539088 | 7520.710 | 0.15655071 | 7826.383 |
TIM1+G(4)+I | 73 | -3679.196 | 7504.391 | 0.05292586 | 7522.797 | 0.05514667 | 7832.435 |
TVM+G(4)+I | 74 | -3678.413 | 7504.826 | 0.04258996 | 7523.768 | 0.03393541 | 7837.364 |
Hide code
# use the next to best
env <- attr(mt, "env")
best_mod <- eval(get("GTR+G(4)+I", env), env)
best_mod
model: GTR+G(4)+I
loglikelihood: -3675.83
unconstrained loglikelihood: -2240.76
Proportion of invariant sites: 0.5208753
Discrete gamma model
Number of rate categories: 4
Shape parameter: 1.148496
Rate matrix:
a c g t
a 0.000000 1.4780899 10.3864230 1.332913
c 1.478090 0.0000000 0.4662525 15.398463
g 10.386423 0.4662525 0.0000000 1.000000
t 1.332913 15.3984631 1.0000000 0.000000
Base frequencies:
a c g t
0.2473969 0.3145267 0.2642101 0.1738664
Quick tree to see if there’s any craziness happening. Also an opportunity to check out ggtree
This optimization process is rather specific to this algorithm; it’s OK for quick checks, but you’ll want to use other methods for publishable results.
Hide code
# optimize model parameters without fitting edges
fit1 <- optim.pml(best_mod, # best model
optNni = FALSE, optBf = TRUE,
optQ = TRUE, optInv = TRUE,
optGamma = TRUE, optEdge = FALSE,
optRate = TRUE,
control = pml.control(epsilon = 1e-08,
maxit = 10, trace = 0))
#Fix substitution model and fit tree
fit2 <- optim.pml(fit1,
optNni = TRUE, optBf = FALSE,
optQ = FALSE, optInv = FALSE,
optGamma = FALSE, optEdge = TRUE,
control = pml.control(epsilon = 1e-08,
maxit = 10, trace = 0))
#Fine tune
fit3 <- optim.pml(fit2,
optNni = TRUE, optBf = TRUE,
optQ = TRUE, optInv = TRUE,
optGamma = TRUE, optEdge = TRUE,
optRate = FALSE,
control = pml.control(epsilon = 1e-08,
maxit = 10, trace = 0))
Only running 100 trees as an example, although a small number, this might take a few minutes…
Hide code
set.seed(1976)
boots <- bootstrap.pml(fit3,
bs = 100,
optNni = TRUE,
control = pml.control(trace = 0))
Phangorn plotting functions as an example.
Hide code
# get the best tree from optimization
ml_tree <- fit3$tree
# Or use a consensus tree
# consensus_tree <- consensus(boots, p = 0.5)
# phangorn specific plots
plotBS(midpoint(ml_tree), boots,
type="p", cex=0.4,
bs.adj = c(1.25, 1.25),
bs.col = "black")
add.scale.bar()
title("Maximum Likelihood")
Could also extract the bootstrap values for use in other plotting tools.
Hide code
bootstrap_matrix <- sapply(boots, function(tree) tree$node.label)
bootstrap_matrix <- apply(bootstrap_matrix, 2, as.numeric)
bootstrap_summarized <- apply(bootstrap_matrix, 1, mean, na.rm = TRUE)
ml_tree$node.label <- round(bootstrap_summarized, 2)
# root on oldest sequences, Jan 2014
ml_tree = root(ml_tree, c("MH882578", "MH882579", "MH882580"),
resolve.root = TRUE)
Using ggtree. This basic plot isn’t much prettier than the above, but offers more flexibility for customization.
Hide code
ggtree(ml_tree, size=0.5, col = "gray30",
ladderize=TRUE) +
geom_text2(aes(subset = !isTip, label=label),
hjust=-0.25,
size=3,
color="black") +
geom_tiplab(col="gray40", size=3,
align=FALSE, offset = 0.025, hjust = 1) +
geom_treescale(width=0.02)
Here’s a more fancy-fied version:
Hide code
# get labels
tree_df <- as.data.frame(
ml_tree$tip.label
)
names(tree_df) <- "label"
# match to isolate names, more info
tree_df$isolate <- with(seq_meta,
Isolate[match(
tree_df$label,
Accession)])
# match to host type
tree_df$host <- with(seq_meta,
Host[match(
tree_df$label,
Accession)])
# add data to tree
tmp_tree <- full_join(ml_tree, tree_df, by = 'label')
ggtree(tmp_tree, size=0.5, col = "gray30",
ladderize=TRUE) +
geom_tiplab(aes(label = isolate), col="gray40", size=3,
align=FALSE, offset = 0.025, hjust = 0.6) +
geom_text2(aes(subset = !isTip, label=label),
hjust=-0.4,
size=3,
color="darkred") +
geom_tippoint(aes(colour=host, shape=host), size = 4) +
scale_color_manual(values = c("Syncerus_caffer" = "red", "Bos_taurus" = "blue")) +
scale_shape_manual(values = c("Syncerus_caffer" = 16, "Bos_taurus" = 17)) +
theme(plot.margin = unit(c(1,8,1,0.1), "mm"),
axis.title.x = element_text(size=24, face="bold"),
axis.title.y = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks=element_blank(),
axis.line=element_blank(),
legend.direction = "vertical",
legend.position= "inside",
legend.position.inside = c(0.2, 0.8),
strip.text = element_blank(),
strip.background = element_blank(),
legend.key.size = unit(2,"line"),
legend.key.width = unit(3,"line"),
legend.text = element_text(size=16, face="bold"),
legend.title = element_text(size=16, face="bold"),
plot.title = element_text(size=28, face="bold")) +
geom_treescale(width=0.02) +
labs(colour = "Host", shape = "Host") +
guides(colour = guide_legend(override.aes = list(size=4)))
- Beauti walk through
- Write shell script (.sh)
- Login to HPC (Ceres has Beast, Atlas has Beast2)
- Upload data (Beauti .xml and .sh)
- Navigate to working folder
Hide code
sat1_dates <- sat1_df %>%
select(Accession, Collection)
unique(sat1_dates$Collection)
[1] "2016-01" "2016-10" "2014-01"
Hide code
sat1_dates <- sat1_dates %>%
mutate(samp_date = case_when(
Collection == "2016-01" ~ as_date("2016-01-01", format = "%Y-%m-%d"),
Collection == "2016-10" ~ as_date("2016-10-01", format = "%Y-%m-%d"),
Collection == "2014-01" ~ as_date("2014-01-01", format = "%Y-%m-%d"),
)) %>%
select(-Collection)
write.table(sat1_dates, file = here("beast/sat1_dates.tsv"),
sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
Command Line
$ cd /project/fadru_fmd/phylo_demo
$ ls
$ sbatch kenya_sat1.sh
$ squeue -u john.humphreys
Hide code
sat_stats <- get_tracer_stats(here("beast/old_beauti_sat1.log"))
sat_stats %>%
gt() %>%
tab_header(
title = md("Simple SAT1 Tree")) %>%
cols_width(everything() ~ px(95)) %>%
tab_options(table.font.size = "small",
row_group.font.size = "small",
stub.font.size = "small",
column_labels.font.size = "medium",
heading.title.font.size = "large",
data_row.padding = px(2),
heading.title.font.weight = "bold",
column_labels.font.weight = "bold") %>%
opt_stylize(style = 6, color = 'gray')
<tr class="gt_col_headings" style="border-style: none; border-top-style: solid; border-top-width: 2px; border-top-color: #5F5F5F; border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #5F5F5F; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3;">
<th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1" scope="col" id="Parameter" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: left;" bgcolor="#5F5F5F" valign="bottom" align="left">Parameter</th>
<th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="Mean" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: right; font-variant-numeric: tabular-nums;" bgcolor="#5F5F5F" valign="bottom" align="right">Mean</th>
<th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="Median" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: right; font-variant-numeric: tabular-nums;" bgcolor="#5F5F5F" valign="bottom" align="right">Median</th>
<th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="Q_0.025" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: right; font-variant-numeric: tabular-nums;" bgcolor="#5F5F5F" valign="bottom" align="right">Q_0.025</th>
<th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="Q_0.975" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: right; font-variant-numeric: tabular-nums;" bgcolor="#5F5F5F" valign="bottom" align="right">Q_0.975</th>
<th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="ESS" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: right; font-variant-numeric: tabular-nums;" bgcolor="#5F5F5F" valign="bottom" align="right">ESS</th>
</tr>
Simple SAT1 Tree |
|||||
joint | -3864.868 | -3865.034 | -3882.570 | -3847.616 | 9 |
prior | -133.460 | -137.145 | -147.452 | -110.589 | 3 |
likelihood | -3731.408 | -3730.939 | -3744.137 | -3721.577 | 90 |
treeModel.rootHeight | 10.067 | 10.388 | 5.442 | 14.819 | 5 |
age.root. | 2006.682 | 2006.361 | 2001.930 | 2011.307 | 5 |
treeLength | 87.645 | 92.488 | 44.085 | 127.009 | 3 |
tmrca.untitled0. | 10.067 | 10.388 | 5.442 | 14.819 | 5 |
age.untitled0. | 2006.682 | 2006.361 | 2001.930 | 2011.307 | 5 |
constant.popSize | 19.311 | 19.100 | 8.246 | 31.802 | 4 |
CP1.kappa | 14.719 | 14.473 | 11.998 | 18.154 | 93 |
CP2.kappa | 3.294 | 3.175 | 2.073 | 5.144 | 100 |
CP3.kappa | 8.288 | 7.572 | 4.441 | 14.233 | 72 |
CP1.frequencies1 | 0.172 | 0.175 | 0.154 | 0.193 | 43 |
CP1.frequencies2 | 0.384 | 0.383 | 0.349 | 0.414 | 26 |
CP1.frequencies3 | 0.250 | 0.248 | 0.220 | 0.287 | 23 |
CP1.frequencies4 | 0.194 | 0.195 | 0.163 | 0.222 | 44 |
CP2.frequencies1 | 0.273 | 0.273 | 0.240 | 0.344 | 30 |
CP2.frequencies2 | 0.240 | 0.242 | 0.180 | 0.294 | 21 |
CP2.frequencies3 | 0.339 | 0.336 | 0.290 | 0.385 | 33 |
CP2.frequencies4 | 0.148 | 0.150 | 0.116 | 0.182 | 48 |
CP3.frequencies1 | 0.324 | 0.325 | 0.274 | 0.379 | 21 |
CP3.frequencies2 | 0.254 | 0.253 | 0.220 | 0.297 | 46 |
CP3.frequencies3 | 0.239 | 0.236 | 0.199 | 0.280 | 32 |
CP3.frequencies4 | 0.183 | 0.184 | 0.148 | 0.217 | 35 |
CP1.nu | 0.766 | 0.768 | 0.731 | 0.799 | 112 |
CP2.nu | 0.123 | 0.122 | 0.093 | 0.152 | 158 |
CP3.nu | 0.111 | 0.111 | 0.089 | 0.135 | 163 |
CP1.mu | 2.292 | 2.298 | 2.187 | 2.391 | 112 |
CP2.mu | 0.369 | 0.367 | 0.279 | 0.457 | 158 |
CP3.mu | 0.333 | 0.335 | 0.268 | 0.407 | 163 |
clock.rate | 0.013 | 0.011 | 0.008 | 0.023 | 4 |
meanRate | 0.013 | 0.011 | 0.008 | 0.023 | 4 |
treeLikelihood | -3731.408 | -3730.939 | -3744.137 | -3721.577 | 90 |
branchRates | 0.000 | 0.000 | 0.000 | 0.000 | 0 |
coalescent | -128.800 | -132.364 | -142.661 | -106.559 | 3 |
Command Line
$ module load beast
$ beast
$ treeannotator -burnin 10000 -heights median aligned_SAT1.trees.txt sat1_mcc.tree
Hide code
options(ignore.negative.edge=TRUE)
sat_mcc.tree <- read.nexus(here("beast/sat1_mcc.tree"))
ggtree(sat_mcc.tree, mrsd = "2016-10-01",
size=0.5, col = "gray30", ladderize=TRUE) +
geom_tiplab(col="gray40", size=3,
align=FALSE, offset = 0.025, hjust = 0.001) +
theme_tree2(axis.title.x = element_text(size = 24, face = "bold"),
axis.title.y = element_blank(),
axis.text.x = element_text(face = "bold", size = 15, vjust = 1,
hjust = 1, angle = 45),
axis.text.y = element_blank(),
plot.title = element_text(size = 28, face = "bold"))