Skip to content

Latest commit

 

History

History
1838 lines (1621 loc) · 492 KB

overview.md

File metadata and controls

1838 lines (1621 loc) · 492 KB

FMD Sympatric Water Buffalo and Cattle

8/23/24

References and External Resources

Osmondi et al 2020(2020) The role of African buffalo in the epidemiology of foot-and-mouth disease in sympatric cattle and buffalo populations in Kenya. https://doi.org/10.1111/tbed.13573

GenBank [PopSet: 1685824549] (https://www.ncbi.nlm.nih.gov/popset?LinkName=nuccore_popset&from_uid=1685824549)

Libraries

A few packages are needed. First, these help withdirectory management, visualization, and data wrangling.

Hide code
library(here) # directory management
library(tidyverse) #ggplot, lubridate, and the like 
options(dplyr.summarise.inform = FALSE) # don't render data default summaries  
library(ggmap) # maps
library(ggspatial) # spatial plots
library(pals) # color pallets
library(gt) # pretty tables
library(coda) # mcmc summaries/tools

Next, several genetics specific packages are recommended. The BioManager packages may take a few minutes to compile.

Hide code
library(ape) #Analyses of Phylogenetics and Evolution (APE)
library(phangorn) # phylogenetic trees and networks
library(rentrez) # R interface to the NCBI - GenBank

# these next 4 pieces of code check each package, then installs them if not already installed. 
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

if (!requireNamespace("Biostrings", quietly = TRUE)) {
    BiocManager::install("Biostrings")
}

if (!requireNamespace("msa", quietly = TRUE)) {
    BiocManager::install("msa")
}

if (!requireNamespace("ggtree", quietly = TRUE)) {
    BiocManager::install("ggtree")
}

# if now installed, load the packages  
library(Biostrings) # sequence wrangling
library(msa) # Multiple Sequence Alignment (MSA) algorithms  
library(ggtree) # tree visualization and annotation

Custom functions created for this demo.

Hide code
source(here("R/utilities.R"))
source_dir(here("R"))

Query GenBank

The Osmondi paper provided a listing of accession numbers as a supplemental. This was attached as a Word Doc, which I copy-and-pasted into osmondi_2020_supplemental.csv.

Hide code
osmondi_seqs <- read_csv(here("assets/osmondi_2020_supplemental.csv"))
Rows: 98 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Serotype, Accession, Strain Name, Species

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Hide code
head(osmondi_seqs)
# A tibble: 6 × 4
  Serotype Accession `Strain Name` Species
  <chr>    <chr>     <chr>         <chr>  
1 SAT1     MH882580  K29           Cattle 
2 SAT1     MH882590  PRB3          Buffalo
3 SAT1     MH882596  PRB4          Buffalo
4 SAT1     MH882603  PRB6          Buffalo
5 SAT1     MH882582  PRB10         Buffalo
6 SAT1     MH882598  PRB5          Buffalo
Hide code
search_term <- paste(osmondi_seqs$Accession, collapse = " OR ")

genbank_return <- entrez_search(
  db = "nucleotide",
  term = search_term,
  retmax = 100 # the default is 20 records, our list is 95
)

Alternatively, search the database for the popset of interest and return all metadata. This is simpler, but not always available and GenBank may stop using PopSet numbers altogether next year.

Hide code
popset_id <- "1685824549" # "population set" from GenBank page

genbank_return <- 
  entrez_search(db = "nucleotide", 
                term = paste0("POPSET:", popset_id), # search by popset number
                retmax = 100) 

Check Contents

No actual sequences yet, only metadata. Even if you can easily access the sequences, the expanded retrieval process in this script is often needed to pull additional info, like collection dates, isolate names/labels, or geographic coordinates.

Hide code
genbank_return # check results
Entrez search result with 97 hits (object contains 97 IDs and no web_history object)
 Search term (as translated):  POPSET[All Fields] AND 1685824549[All Fields] 
Hide code
class(genbank_return)
[1] "esearch" "list"   
Hide code
str(genbank_return)
List of 5
 $ ids             : chr [1:97] "1685824741" "1685824739" "1685824737" "1685824735" ...
 $ count           : int 97
 $ retmax          : int 97
 $ QueryTranslation: chr "POPSET[All Fields] AND 1685824549[All Fields]"
 $ file            :Classes 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr> 
 - attr(*, "class")= chr [1:2] "esearch" "list"

Samples Table

Using the metadata for each record, the desired data is pulled one at a time, then organized as a data frame. Lot’s of character string wrangling, yuk!

First, create an empty data frame to hold results

Hide code
seq_meta <- data.frame(Accession=character(),
                       Collection=character(),
                       Host=character(),
                       Isolate=character(),
                       stringsAsFactors=FALSE)

Then, loop through each record and pull desired data. In this case, searching the metadata for accession numbers, collection dates, host type, and the more detailed isolate names.

Hide code
for (id in genbank_return$ids) {

  try({
    record <- entrez_fetch(db="nucleotide", id=id, rettype="gb", retmode="text")
    
    accession <- sub("^.*?ACCESSION\\s+([^\n]+).*", "\\1", record)
    
    Collection <- ifelse(grepl("/collection_date=", record), 
                         sub("^.*?/collection_date=\"([^\"]+)\".*", "\\1", record), NA)
    
    host <- ifelse(grepl("/host=", record), 
                   sub("^.*?/host=\"([^\"]+)\".*", "\\1", record), NA)
    
    isolate <- ifelse(grepl("/isolate=", record), 
                      sub("^.*?/isolate=\"([^\"]+)\".*", "\\1", record), NA)
    
    # add to data frame
    seq_meta <- rbind(seq_meta, data.frame(Accession=accession,
                                           Collection=Collection,
                                           Host=host,
                                           Isolate=isolate,
                                           stringsAsFactors=FALSE)) %>%
                as.data.frame()
    
    # delay to prevent overwhelming the API server
    Sys.sleep(0.5)  # this gives a 0.5 second gap

  }, silent = TRUE)  # continue if an error
}

Data Table

Examine what was retrived.

Hide code
dim(seq_meta)
[1] 97  4
Hide code
head(seq_meta)
  Accession Collection            Host                 Isolate
1  MH882663    2016-01 Syncerus_caffer SAT2/KEN/PRB88/2016_pro
2  MH882662    2016-01 Syncerus_caffer SAT2/KEN/PRB87/2016_pro
3  MH882661    2016-01 Syncerus_caffer SAT2/KEN/PRB86/2016_pro
4  MH882660    2016-01 Syncerus_caffer SAT2/KEN/PRB85/2016_pro
5  MH882659    2016-01 Syncerus_caffer SAT2/KEN/PRB83/2016_pro
6  MH882658    2016-01 Syncerus_caffer SAT2/KEN/PRB81/2016_pro
Hide code
# using trimws due to an extra space in the numbers
seq_meta$Accession <- trimws(seq_meta$Accession)

# add a couple more columns
seq_meta$Serotype <- sub("/.*", "", seq_meta$Isolate)
seq_meta$Animal <- sub("^.*/([^/]+)/[^/]+$", "\\1", seq_meta$Isolate)

seq_meta  %>%
  gt() %>%
  tab_header(
    title = md("Kenya Sequences Metadata")) %>%
  cols_width(starts_with("Accession") ~ px(90),
             starts_with("Collection") ~ px(80),
             starts_with("Host") ~ px(100),
             starts_with("Isolate") ~ px(180),
             starts_with("Serotype") ~ px(80),
             starts_with("Animal") ~ px(80),
             everything() ~ px(95)) %>%
  tab_options(table.font.size = "small",
              row_group.font.size = "small",
              stub.font.size = "small",
              column_labels.font.size = "medium",
              heading.title.font.size = "large",
              data_row.padding = px(2),
              heading.title.font.weight = "bold",
              column_labels.font.weight = "bold") %>%
  opt_stylize(style = 6, color = 'gray')
<tr class="gt_col_headings" style="border-style: none; border-top-style: solid; border-top-width: 2px; border-top-color: #5F5F5F; border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #5F5F5F; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3;">
  <th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1" scope="col" id="Accession" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: left;" bgcolor="#5F5F5F" valign="bottom" align="left">Accession</th>
  <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="Collection" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: right; font-variant-numeric: tabular-nums;" bgcolor="#5F5F5F" valign="bottom" align="right">Collection</th>
  <th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1" scope="col" id="Host" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: left;" bgcolor="#5F5F5F" valign="bottom" align="left">Host</th>
  <th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1" scope="col" id="Isolate" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: left;" bgcolor="#5F5F5F" valign="bottom" align="left">Isolate</th>
  <th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1" scope="col" id="Serotype" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: left;" bgcolor="#5F5F5F" valign="bottom" align="left">Serotype</th>
  <th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1" scope="col" id="Animal" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: left;" bgcolor="#5F5F5F" valign="bottom" align="left">Animal</th>
</tr>

Kenya Sequences Metadata

MH882663 2016-01 Syncerus_caffer SAT2/KEN/PRB88/2016_pro SAT2 PRB88
MH882662 2016-01 Syncerus_caffer SAT2/KEN/PRB87/2016_pro SAT2 PRB87
MH882661 2016-01 Syncerus_caffer SAT2/KEN/PRB86/2016_pro SAT2 PRB86
MH882660 2016-01 Syncerus_caffer SAT2/KEN/PRB85/2016_pro SAT2 PRB85
MH882659 2016-01 Syncerus_caffer SAT2/KEN/PRB83/2016_pro SAT2 PRB83
MH882658 2016-01 Syncerus_caffer SAT2/KEN/PRB81/2016_pro SAT2 PRB81
MH882657 2016-01 Syncerus_caffer SAT2/KEN/PRB80/2016_pro SAT2 PRB80
MH882656 2016-01 Syncerus_caffer SAT2/KEN/PRB76/2016_pro SAT2 PRB76
MH882655 2016-01 Syncerus_caffer SAT2/KEN/PRB75/2016_pro SAT2 PRB75
MH882654 2016-01 Syncerus_caffer SAT2/KEN/PRB74/2016_pro SAT2 PRB74
MH882653 2016-01 Syncerus_caffer SAT2/KEN/PRB71/2016_pro SAT2 PRB71
MH882652 2016-01 Syncerus_caffer SAT2/KEN/PRB70/2016_pro SAT2 PRB70
MH882651 2016-01 Syncerus_caffer SAT2/KEN/PRB69/2016_pro SAT2 PRB69
MH882650 2016-01 Syncerus_caffer SAT2/KEN/PRB68/2016_pro SAT2 PRB68
MH882649 2016-01 Syncerus_caffer SAT2/KEN/PRB67/2016_pro SAT2 PRB67
MH882648 2016-01 Syncerus_caffer SAT2/KEN/PRB66/2016_pro SAT2 PRB66
MH882647 2016-01 Syncerus_caffer SAT2/KEN/PRB65/2016_pro SAT2 PRB65
MH882646 2016-01 Syncerus_caffer SAT2/KEN/PRB63/2016_pro SAT2 PRB63
MH882645 2016-01 Syncerus_caffer SAT2/KEN/PRB62/2016_pro SAT2 PRB62
MH882644 2016-01 Syncerus_caffer SAT2/KEN/PRB61/2016_pro SAT2 PRB61
MH882643 2016-01 Syncerus_caffer SAT2/KEN/PRB60/2016_pro SAT2 PRB60
MH882642 2016-01 Syncerus_caffer SAT2/KEN/PRB59/2016_pro SAT2 PRB59
MH882641 2016-01 Syncerus_caffer SAT2/KEN/PRB58/2016_pro SAT2 PRB58
MH882640 2016-01 Syncerus_caffer SAT2/KEN/PRB56/2016_pro SAT2 PRB56
MH882639 2016-01 Syncerus_caffer SAT2/KEN/PRB54/2016_pro SAT2 PRB54
MH882638 2016-01 Syncerus_caffer SAT2/KEN/PRB52/2016_pro SAT2 PRB52
MH882637 2016-01 Syncerus_caffer SAT2/KEN/PRB51/2016_pro SAT2 PRB51
MH882636 2016-01 Syncerus_caffer SAT2/KEN/PRB50/2016_pro SAT2 PRB50
MH882635 2016-01 Syncerus_caffer SAT2/KEN/PRB47/2016_pro SAT2 PRB47
MH882634 2016-01 Syncerus_caffer SAT2/KEN/PRB46/2016_pro SAT2 PRB46
MH882633 2016-01 Syncerus_caffer SAT2/KEN/PRB44/2016_pro SAT2 PRB44
MH882632 2016-01 Syncerus_caffer SAT2/KEN/PRB43/2016_pro SAT2 PRB43
MH882631 2016-01 Syncerus_caffer SAT2/KEN/PRB42/2016_pro SAT2 PRB42
MH882630 2016-01 Syncerus_caffer SAT2/KEN/PRB41/2016_pro SAT2 PRB41
MH882629 2016-01 Syncerus_caffer SAT2/KEN/PRB40/2016_pro SAT2 PRB40
MH882628 2016-01 Syncerus_caffer SAT2/KEN/PRB39/2016_pro SAT2 PRB39
MH882627 2016-01 Syncerus_caffer SAT2/KEN/PRB36/2016_pro SAT2 PRB36
MH882626 2016-01 Syncerus_caffer SAT2/KEN/PRB35/2016_pro SAT2 PRB35
MH882625 2016-01 Syncerus_caffer SAT2/KEN/PRB34/2016_pro SAT2 PRB34
MH882624 2016-01 Syncerus_caffer SAT2/KEN/PRB23/2016_pro SAT2 PRB23
MH882623 2016-01 Syncerus_caffer SAT2/KEN/PRB22/2016_pro SAT2 PRB22
MH882622 2016-01 Syncerus_caffer SAT2/KEN/PRB20/2016_pro SAT2 PRB20
MH882621 2016-01 Syncerus_caffer SAT2/KEN/PRB19/2016_pro SAT2 PRB19
MH882620 2016-01 Syncerus_caffer SAT2/KEN/PRB18/2016_pro SAT2 PRB18
MH882619 2016-01 Syncerus_caffer SAT2/KEN/PRB15/2016_pro SAT2 PRB15
MH882618 2016-01 Syncerus_caffer SAT2/KEN/PRB14/2016_pro SAT2 PRB14
MH882617 2016-01 Syncerus_caffer SAT2/KEN/PRB13/2016_pro SAT2 PRB13
MH882616 2016-01 Syncerus_caffer SAT2/KEN/PRB12/2016_pro SAT2 PRB12
MH882615 2016-01 Syncerus_caffer SAT2/KEN/PRB6/2016_pro SAT2 PRB6
MH882614 2016-01 Syncerus_caffer SAT2/KEN/PRB1/2016_pro SAT2 PRB1
MH882613 2015-03 Bos_taurus SAT2/KEN/K39/2015 SAT2 K39
MH882612 2014-09 Bos_taurus SAT2/KEN/K137/2014 SAT2 K137
MH882611 2016-01 Syncerus_caffer SAT1/KEN/PRB9/2016_pro SAT1 PRB9
MH882610 2016-01 Syncerus_caffer SAT1/KEN/PRB84/2016_pro SAT1 PRB84
MH882609 2016-01 Syncerus_caffer SAT1/KEN/PRB8/2016_pro SAT1 PRB8
MH882608 2016-01 Syncerus_caffer SAT1/KEN/PRB73/2016_pro SAT1 PRB73
MH882607 2016-01 Syncerus_caffer SAT1/KEN/PRB72/2016_pro SAT1 PRB72
MH882606 2016-01 Syncerus_caffer SAT1/KEN/PRB7/2016_pro SAT1 PRB7
MH882605 2016-01 Syncerus_caffer SAT1/KEN/PRB64/2016_pro SAT1 PRB64
MH882604 2016-01 Syncerus_caffer SAT1/KEN/PRB61/2016_pro SAT1 PRB61
MH882603 2016-01 Syncerus_caffer SAT1/KEN/PRB6/2016_pro SAT1 PRB6
MH882602 2016-01 Syncerus_caffer SAT1/KEN/PRB59/2016_pro SAT1 PRB59
MH882601 2016-01 Syncerus_caffer SAT1/KEN/PRB57/2016_pro SAT1 PRB57
MH882600 2016-01 Syncerus_caffer SAT1/KEN/PRB55/2016_pro SAT1 PRB55
MH882599 2016-01 Syncerus_caffer SAT1/KEN/PRB51/2016_pro SAT1 PRB51
MH882598 2016-01 Syncerus_caffer SAT1/KEN/PRB5/2016_pro SAT1 PRB5
MH882597 2016-01 Syncerus_caffer SAT1/KEN/PRB48/2016_pro SAT1 PRB48
MH882596 2016-01 Syncerus_caffer SAT1/KEN/PRB4/2016_pro SAT1 PRB4
MH882595 2016-01 Syncerus_caffer SAT1/KEN/PRB38/2016_pro SAT1 PRB38
MH882594 2016-01 Syncerus_caffer SAT1/KEN/PRB37/2016_pro SAT1 PRB37
MH882593 2016-01 Syncerus_caffer SAT1/KEN/PRB36/2016_pro SAT1 PRB36
MH882592 2016-01 Syncerus_caffer SAT1/KEN/PRB32/2016_pro SAT1 PRB32
MH882591 2016-01 Syncerus_caffer SAT1/KEN/PRB31/2016_pro SAT1 PRB31
MH882590 2016-01 Syncerus_caffer SAT1/KEN/PRB3/2016_pro SAT1 PRB3
MH882589 2016-01 Syncerus_caffer SAT1/KEN/PRB29/2016_pro SAT1 PRB29
MH882588 2016-01 Syncerus_caffer SAT1/KEN/PRB28/2016_pro SAT1 PRB28
MH882587 2016-01 Syncerus_caffer SAT1/KEN/PRB27/2016_pro SAT1 PRB27
MH882586 2016-01 Syncerus_caffer SAT1/KEN/PRB26/2016_pro SAT1 PRB26
MH882585 2016-01 Syncerus_caffer SAT1/KEN/PRB25/2016_pro SAT1 PRB25
MH882584 2016-01 Syncerus_caffer SAT1/KEN/PRB17/2016_pro SAT1 PRB17
MH882583 2016-01 Syncerus_caffer SAT1/KEN/PRB11/2016_pro SAT1 PRB11
MH882582 2016-01 Syncerus_caffer SAT1/KEN/PRB10/2016_pro SAT1 PRB10
MH882581 2016-10 Bos_taurus SAT1/KEN/K75/2016 SAT1 K75
MH882580 2014-01 Bos_taurus SAT1/KEN/K29/2014 SAT1 K29
MH882579 2014-01 Bos_taurus SAT1/KEN/K19/2014 SAT1 K19
MH882578 2014-01 Bos_taurus SAT1/KEN/K14/2014 SAT1 K14
MH882577 2016 Syncerus_caffer O/KEN/PRC60/2016_pro O PRC60
MH882576 2016 Syncerus_caffer O/KEN/PRC49/2016_pro O PRC49
MH882575 2016 Syncerus_caffer O/KEN/PRC45/2016_pro O PRC45
MH882574 2014-03 Bos_taurus O/KEN/K60/2014 O K60
MH882573 2016-04 Bos_taurus O/KEN/K19/2016 O K19
MH882572 2015-10 Bos_taurus A/KEN/K103/2015 A K103
MH882571 2014-07 Bos_taurus A/KEN/K103/2014 A K103
MH882570 2015-03 Bos_taurus A/KEN/K39/2015 A K39
MH882569 2014-09 Bos_taurus A/KEN/K145/2014 A K145
MH882568 2015-09 Bos_taurus A/KEN/K91/2015 A K91
MH882567 2016-10 Bos_taurus A/KEN/K74/2016 A K74

Samples by Group

Hide code
seq_meta %>%
  group_by(Host, Serotype) %>%
  summarise(Count = length(Accession))
# A tibble: 7 × 3
# Groups:   Host [2]
  Host            Serotype Count
  <chr>           <chr>    <int>
1 Bos_taurus      A            6
2 Bos_taurus      O            2
3 Bos_taurus      SAT1         4
4 Bos_taurus      SAT2         2
5 Syncerus_caffer O            3
6 Syncerus_caffer SAT1        30
7 Syncerus_caffer SAT2        50

Retrieve Sequences

Now, query GenBank for the actual sequences. Example here using SAT1 as an example.

Hide code
sat1_df <- seq_meta %>%
  filter(Serotype == "SAT1")

# function to get sequences
get_sequences <- function(accessions) {
    sequences <- sapply(accessions, function(acc) {
        entrez_fetch(db = "nuccore", id = acc, rettype = "fasta")
    })
    return(sequences)
}

# run function
sat1_sequences <- get_sequences(sat1_df$Accession)

# remove special characters 
sat1_sequences <- gsub("[^ATCG]", "", sat1_sequences)

# save to text file - fasta format  
writeLines(sat1_sequences, here("assets/sat1_sequences.fasta"))

Alignment

Hide code
# ensure all is named correctly
unique_names <- make.unique(names(sat1_sequences))
names(sat1_sequences) <- unique_names

# convert to a DNAStringSet object, needed for the msa package
dna_sequences <- DNAStringSet(sat1_sequences)

# MUSCLE alignment
alignment <- msa(dna_sequences, method = "Muscle")

alignment <- as(alignment, "DNAStringSet") 

# save the aligned sequences to a text file  
writeXStringSet(alignment, filepath = here("assets/aligned_SAT1.fasta"))

View Alignment

A plot to view the alignment. These are very clean, hardly any breaks or missingness.

Hide code
plot_alignment(alignment)

Substitution Model

Read in the saved alignment. This rather than using the version already in the environment, becuase the classes are different.

Hide code
alignment <- read.dna(here("assets/aligned_SAT1.fasta"),
                      format="fasta", as.matrix=TRUE)

# convert again for modelTest (phangorn pkg)
aligned_phyDat <- as.phyDat(alignment)
  
# run the test, compare the models
mt <- modelTest(aligned_phyDat)
Model        df  logLik   AIC      BIC
          JC 65 -4252.262 8634.525 8926.619 
        JC+I 66 -4028.036 8188.072 8484.66 
     JC+G(4) 66 -4013.369 8158.738 8455.325 
   JC+G(4)+I 67 -4010.842 8155.684 8456.766 
         F81 68 -4247.613 8631.226 8936.801 
       F81+I 69 -4021.402 8180.804 8490.873 
    F81+G(4) 69 -4007.382 8152.764 8462.833 
  F81+G(4)+I 70 -4005.165 8150.33 8464.893 
         K80 66 -3966.996 8065.992 8362.58 
       K80+I 67 -3727.478 7588.957 7890.038 
    K80+G(4) 67 -3707.184 7548.369 7849.45 
  K80+G(4)+I 68 -3701.669 7539.338 7844.913 
         HKY 69 -3954.414 8046.827 8356.896 
       HKY+I 70 -3705.849 7551.699 7866.262 
    HKY+G(4) 70 -3687.519 7515.039 7829.602 
  HKY+G(4)+I 71 -3683.581 7509.162 7828.218 
        TrNe 67 -3964.784 8063.569 8364.65 
      TrNe+I 68 -3726.731 7589.462 7895.037 
   TrNe+G(4) 68 -3706.238 7548.476 7854.051 
 TrNe+G(4)+I 69 -3700.723 7539.445 7849.514 
         TrN 70 -3950.892 8041.783 8356.346 
       TrN+I 71 -3702.937 7547.874 7866.931 
    TrN+G(4) 71 -3685.39 7512.781 7831.837 
  TrN+G(4)+I 72 -3681.133 7506.266 7829.816 
        TPM1 67 -3965.579 8065.157 8366.239 
      TPM1+I 68 -3725.95 7587.901 7893.476 
   TPM1+G(4) 68 -3705.385 7546.769 7852.345 
 TPM1+G(4)+I 69 -3699.61 7537.22 7847.289 
         K81 67 -3965.579 8065.157 8366.239 
       K81+I 68 -3725.95 7587.901 7893.476 
    K81+G(4) 68 -3705.385 7546.769 7852.345 
  K81+G(4)+I 69 -3699.61 7537.22 7847.289 
       TPM1u 70 -3952.836 8045.672 8360.234 
     TPM1u+I 71 -3704.189 7550.379 7869.436 
  TPM1u+G(4) 71 -3685.901 7513.803 7832.859 
TPM1u+G(4)+I 72 -3681.689 7507.377 7830.928 
        TPM2 67 -3962.221 8058.442 8359.523 
      TPM2+I 68 -3723.348 7582.697 7888.272 
   TPM2+G(4) 68 -3703.442 7542.885 7848.46 
 TPM2+G(4)+I 69 -3697.652 7533.304 7843.373 
       TPM2u 70 -3948.829 8037.658 8352.221 
     TPM2u+I 71 -3701.332 7544.664 7863.72 
  TPM2u+G(4) 71 -3683.739 7509.477 7828.534 
TPM2u+G(4)+I 72 -3679.416 7502.832 7826.383 
        TPM3 67 -3965.493 8064.985 8366.067 
      TPM3+I 68 -3726.879 7589.757 7895.333 
   TPM3+G(4) 68 -3706.169 7548.339 7853.914 
 TPM3+G(4)+I 69 -3700.393 7538.786 7848.855 
       TPM3u 70 -3954.267 8048.534 8363.097 
     TPM3u+I 71 -3705.651 7553.301 7872.358 
  TPM3u+G(4) 71 -3687.362 7516.723 7835.78 
TPM3u+G(4)+I 72 -3683.487 7510.973 7834.523 
       TIM1e 68 -3963.369 8062.737 8368.313 
     TIM1e+I 69 -3725.202 7588.403 7898.472 
  TIM1e+G(4) 69 -3704.415 7546.83 7856.899 
TIM1e+G(4)+I 70 -3698.637 7537.275 7851.837 
        TIM1 71 -3949.324 8040.648 8359.704 
      TIM1+I 72 -3701.231 7546.462 7870.013 
   TIM1+G(4) 72 -3683.711 7511.423 7834.973 
 TIM1+G(4)+I 73 -3679.196 7504.391 7832.435 
       TIM2e 68 -3960.031 8056.063 8361.638 
     TIM2e+I 69 -3722.604 7583.209 7893.278 
  TIM2e+G(4) 69 -3702.506 7543.011 7853.08 
TIM2e+G(4)+I 70 -3696.73 7533.46 7848.023 
        TIM2 71 -3945.352 8032.704 8351.76 
      TIM2+I 72 -3698.384 7540.769 7864.319 
   TIM2+G(4) 72 -3681.549 7507.099 7830.649 
 TIM2+G(4)+I 73 -3676.939 7499.877 7827.921 
       TIM3e 68 -3963.278 8062.555 8368.131 
     TIM3e+I 69 -3726.132 7590.264 7900.333 
  TIM3e+G(4) 69 -3705.088 7548.177 7858.246 
TIM3e+G(4)+I 70 -3699.312 7538.624 7853.186 
        TIM3 71 -3950.787 8043.575 8362.631 
      TIM3+I 72 -3702.673 7549.345 7872.896 
   TIM3+G(4) 72 -3685.209 7514.417 7837.967 
 TIM3+G(4)+I 73 -3680.952 7507.905 7835.949 
        TVMe 69 -3960.333 8058.665 8368.734 
      TVMe+I 70 -3722.258 7584.517 7899.079 
   TVMe+G(4) 70 -3701.721 7543.442 7858.005 
 TVMe+G(4)+I 71 -3695.488 7532.976 7852.032 
         TVM 72 -3948.254 8040.508 8364.059 
       TVM+I 73 -3700.559 7547.119 7875.163 
    TVM+G(4) 73 -3682.849 7511.698 7839.742 
  TVM+G(4)+I 74 -3678.413 7504.826 7837.364 
         SYM 70 -3958.146 8056.291 8370.854 
       SYM+I 71 -3721.518 7585.035 7904.092 
    SYM+G(4) 71 -3700.642 7543.283 7862.34 
  SYM+G(4)+I 72 -3694.43 7532.861 7856.411 
         GTR 73 -3944.84 8035.68 8363.724 
       GTR+I 74 -3697.533 7543.066 7875.604 
    GTR+G(4) 74 -3680.622 7509.244 7841.781 
  GTR+G(4)+I 75 -3675.83 7501.66 7838.691 
Hide code
mt %>% 
  arrange(AIC) %>%
  slice_head(n=5) %>%
  gt()
Model df logLik AIC AICw AICc AICcw BIC
TIM2+G(4)+I 73 -3676.939 7499.877 0.50564699 7518.283 0.52686441 7827.921
GTR+G(4)+I 75 -3675.830 7501.660 0.20739499 7521.147 0.12582167 7838.691
TPM2u+G(4)+I 72 -3679.416 7502.832 0.11539088 7520.710 0.15655071 7826.383
TIM1+G(4)+I 73 -3679.196 7504.391 0.05292586 7522.797 0.05514667 7832.435
TVM+G(4)+I 74 -3678.413 7504.826 0.04258996 7523.768 0.03393541 7837.364
Hide code
# use the next to best
env <- attr(mt, "env")
best_mod <- eval(get("GTR+G(4)+I", env), env) 

best_mod
model: GTR+G(4)+I 
loglikelihood: -3675.83 
unconstrained loglikelihood: -2240.76 
Proportion of invariant sites: 0.5208753 
Discrete gamma model
Number of rate categories: 4 
Shape parameter: 1.148496 

Rate matrix:
          a          c          g         t
a  0.000000  1.4780899 10.3864230  1.332913
c  1.478090  0.0000000  0.4662525 15.398463
g 10.386423  0.4662525  0.0000000  1.000000
t  1.332913 15.3984631  1.0000000  0.000000

Base frequencies:  
        a         c         g         t 
0.2473969 0.3145267 0.2642101 0.1738664 

Maximum Likelihood Tree

Quick tree to see if there’s any craziness happening. Also an opportunity to check out ggtree

Optimize model

This optimization process is rather specific to this algorithm; it’s OK for quick checks, but you’ll want to use other methods for publishable results.

Hide code
# optimize model parameters without fitting edges
fit1 <- optim.pml(best_mod, # best model 
                 optNni = FALSE, optBf = TRUE, 
                 optQ = TRUE, optInv = TRUE, 
                 optGamma = TRUE, optEdge = FALSE, 
                 optRate = TRUE, 
                 control = pml.control(epsilon = 1e-08,
                                       maxit = 10, trace = 0))

#Fix substitution model and fit tree
fit2 <- optim.pml(fit1, 
                 optNni = TRUE, optBf = FALSE,
                 optQ = FALSE, optInv = FALSE, 
                 optGamma = FALSE, optEdge = TRUE,
                 control = pml.control(epsilon = 1e-08, 
                                       maxit = 10, trace = 0))

#Fine tune
fit3 <- optim.pml(fit2, 
                 optNni = TRUE, optBf = TRUE,
                 optQ = TRUE, optInv = TRUE, 
                 optGamma = TRUE, optEdge = TRUE, 
                 optRate = FALSE,
                 control = pml.control(epsilon = 1e-08, 
                                       maxit = 10, trace = 0))

Bootstrap Values

Only running 100 trees as an example, although a small number, this might take a few minutes…

Hide code
set.seed(1976)
boots <- bootstrap.pml(fit3,
                       bs = 100,
                       optNni = TRUE,
                       control = pml.control(trace = 0))

Phangorn plotting functions as an example.

Hide code
# get the best tree from optimization
ml_tree <- fit3$tree

# Or use a consensus tree
# consensus_tree <- consensus(boots, p = 0.5)
  
# phangorn specific plots
plotBS(midpoint(ml_tree), boots, 
       type="p", cex=0.4,
       bs.adj = c(1.25, 1.25),
       bs.col = "black")
add.scale.bar()
title("Maximum Likelihood")

Could also extract the bootstrap values for use in other plotting tools.

Hide code
bootstrap_matrix <- sapply(boots, function(tree) tree$node.label)

bootstrap_matrix <- apply(bootstrap_matrix, 2, as.numeric)

bootstrap_summarized <- apply(bootstrap_matrix, 1, mean, na.rm = TRUE)

ml_tree$node.label <- round(bootstrap_summarized, 2)

# root on oldest sequences, Jan 2014
ml_tree = root(ml_tree, c("MH882578", "MH882579", "MH882580"),
               resolve.root = TRUE)

Tree Plots

Using ggtree. This basic plot isn’t much prettier than the above, but offers more flexibility for customization.

Hide code
ggtree(ml_tree, size=0.5, col = "gray30", 
              ladderize=TRUE) + 
    geom_text2(aes(subset = !isTip, label=label), 
                hjust=-0.25, 
                size=3, 
                color="black") +
    geom_tiplab(col="gray40", size=3, 
                align=FALSE, offset = 0.025, hjust = 1) +
    geom_treescale(width=0.02) 

Here’s a more fancy-fied version:

Hide code
# get labels
tree_df <- as.data.frame(
    ml_tree$tip.label
  )

names(tree_df) <- "label"
  
# match to isolate names, more info
tree_df$isolate <- with(seq_meta,
                        Isolate[match(
                            tree_df$label,
                            Accession)])

# match to host type
tree_df$host <- with(seq_meta,
                       Host[match(
                            tree_df$label,
                            Accession)])

# add data to tree
tmp_tree <- full_join(ml_tree, tree_df, by = 'label')
  
ggtree(tmp_tree, size=0.5, col = "gray30", 
              ladderize=TRUE) + 
    geom_tiplab(aes(label = isolate), col="gray40", size=3, 
                align=FALSE, offset = 0.025, hjust = 0.6) +
    geom_text2(aes(subset = !isTip, label=label), 
               hjust=-0.4, 
               size=3, 
               color="darkred") +
    geom_tippoint(aes(colour=host, shape=host), size = 4) +
    scale_color_manual(values = c("Syncerus_caffer" = "red", "Bos_taurus" = "blue")) +  
    scale_shape_manual(values = c("Syncerus_caffer" = 16, "Bos_taurus" = 17)) + 
    theme(plot.margin = unit(c(1,8,1,0.1), "mm"),
          axis.title.x = element_text(size=24, face="bold"),
          axis.title.y = element_blank(),
          axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks=element_blank(),
          axis.line=element_blank(), 
          legend.direction = "vertical",
          legend.position= "inside",
          legend.position.inside = c(0.2, 0.8),
          strip.text = element_blank(), 
          strip.background = element_blank(),
          legend.key.size = unit(2,"line"),
          legend.key.width = unit(3,"line"),
          legend.text = element_text(size=16, face="bold"),
          legend.title = element_text(size=16, face="bold"),
          plot.title = element_text(size=28, face="bold")) +
    geom_treescale(width=0.02) +
    labs(colour = "Host", shape = "Host") +
    guides(colour = guide_legend(override.aes = list(size=4))) 

BEAST

  1. Beauti walk through
  2. Write shell script (.sh)
  3. Login to HPC (Ceres has Beast, Atlas has Beast2)
  4. Upload data (Beauti .xml and .sh)
  5. Navigate to working folder

Create a date file

Hide code
sat1_dates <- sat1_df %>%
  select(Accession, Collection)

unique(sat1_dates$Collection)
[1] "2016-01" "2016-10" "2014-01"
Hide code
sat1_dates <- sat1_dates %>%
  mutate(samp_date = case_when(
    Collection == "2016-01" ~ as_date("2016-01-01", format = "%Y-%m-%d"),
    Collection == "2016-10" ~ as_date("2016-10-01", format = "%Y-%m-%d"),
    Collection == "2014-01" ~ as_date("2014-01-01", format = "%Y-%m-%d"),
  )) %>%
  select(-Collection)

write.table(sat1_dates, file = here("beast/sat1_dates.tsv"), 
            sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)

Command Line

$ cd /project/fadru_fmd/phylo_demo
$ ls
$ sbatch kenya_sat1.sh
$ squeue -u john.humphreys

Tracer Type Stats

Hide code
sat_stats <- get_tracer_stats(here("beast/old_beauti_sat1.log")) 

sat_stats %>%
  gt() %>%
  tab_header(
    title = md("Simple SAT1 Tree")) %>%
  cols_width(everything() ~ px(95)) %>%
  tab_options(table.font.size = "small",
              row_group.font.size = "small",
              stub.font.size = "small",
              column_labels.font.size = "medium",
              heading.title.font.size = "large",
              data_row.padding = px(2),
              heading.title.font.weight = "bold",
              column_labels.font.weight = "bold") %>%
  opt_stylize(style = 6, color = 'gray')
<tr class="gt_col_headings" style="border-style: none; border-top-style: solid; border-top-width: 2px; border-top-color: #5F5F5F; border-bottom-style: solid; border-bottom-width: 2px; border-bottom-color: #5F5F5F; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3;">
  <th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="1" colspan="1" scope="col" id="Parameter" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: left;" bgcolor="#5F5F5F" valign="bottom" align="left">Parameter</th>
  <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="Mean" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: right; font-variant-numeric: tabular-nums;" bgcolor="#5F5F5F" valign="bottom" align="right">Mean</th>
  <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="Median" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: right; font-variant-numeric: tabular-nums;" bgcolor="#5F5F5F" valign="bottom" align="right">Median</th>
  <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="Q_0.025" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: right; font-variant-numeric: tabular-nums;" bgcolor="#5F5F5F" valign="bottom" align="right">Q_0.025</th>
  <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="Q_0.975" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: right; font-variant-numeric: tabular-nums;" bgcolor="#5F5F5F" valign="bottom" align="right">Q_0.975</th>
  <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="ESS" style="border-style: none; color: #FFFFFF; background-color: #5F5F5F; font-size: medium; font-weight: bold; text-transform: inherit; border-left-style: none; border-left-width: 1px; border-left-color: #D3D3D3; border-right-style: none; border-right-width: 1px; border-right-color: #D3D3D3; vertical-align: bottom; padding-top: 5px; padding-bottom: 6px; padding-left: 5px; padding-right: 5px; overflow-x: hidden; text-align: right; font-variant-numeric: tabular-nums;" bgcolor="#5F5F5F" valign="bottom" align="right">ESS</th>
</tr>

Simple SAT1 Tree

joint -3864.868 -3865.034 -3882.570 -3847.616 9
prior -133.460 -137.145 -147.452 -110.589 3
likelihood -3731.408 -3730.939 -3744.137 -3721.577 90
treeModel.rootHeight 10.067 10.388 5.442 14.819 5
age.root. 2006.682 2006.361 2001.930 2011.307 5
treeLength 87.645 92.488 44.085 127.009 3
tmrca.untitled0. 10.067 10.388 5.442 14.819 5
age.untitled0. 2006.682 2006.361 2001.930 2011.307 5
constant.popSize 19.311 19.100 8.246 31.802 4
CP1.kappa 14.719 14.473 11.998 18.154 93
CP2.kappa 3.294 3.175 2.073 5.144 100
CP3.kappa 8.288 7.572 4.441 14.233 72
CP1.frequencies1 0.172 0.175 0.154 0.193 43
CP1.frequencies2 0.384 0.383 0.349 0.414 26
CP1.frequencies3 0.250 0.248 0.220 0.287 23
CP1.frequencies4 0.194 0.195 0.163 0.222 44
CP2.frequencies1 0.273 0.273 0.240 0.344 30
CP2.frequencies2 0.240 0.242 0.180 0.294 21
CP2.frequencies3 0.339 0.336 0.290 0.385 33
CP2.frequencies4 0.148 0.150 0.116 0.182 48
CP3.frequencies1 0.324 0.325 0.274 0.379 21
CP3.frequencies2 0.254 0.253 0.220 0.297 46
CP3.frequencies3 0.239 0.236 0.199 0.280 32
CP3.frequencies4 0.183 0.184 0.148 0.217 35
CP1.nu 0.766 0.768 0.731 0.799 112
CP2.nu 0.123 0.122 0.093 0.152 158
CP3.nu 0.111 0.111 0.089 0.135 163
CP1.mu 2.292 2.298 2.187 2.391 112
CP2.mu 0.369 0.367 0.279 0.457 158
CP3.mu 0.333 0.335 0.268 0.407 163
clock.rate 0.013 0.011 0.008 0.023 4
meanRate 0.013 0.011 0.008 0.023 4
treeLikelihood -3731.408 -3730.939 -3744.137 -3721.577 90
branchRates 0.000 0.000 0.000 0.000 0
coalescent -128.800 -132.364 -142.661 -106.559 3

Maximum Clade Credability Tree

Command Line

$ module load beast
$ beast
$ treeannotator -burnin 10000 -heights median aligned_SAT1.trees.txt sat1_mcc.tree
Hide code
options(ignore.negative.edge=TRUE)
sat_mcc.tree <- read.nexus(here("beast/sat1_mcc.tree"))

ggtree(sat_mcc.tree, mrsd = "2016-10-01", 
       size=0.5, col = "gray30", ladderize=TRUE) + 
    geom_tiplab(col="gray40", size=3, 
                align=FALSE, offset = 0.025, hjust = 0.001) +
  theme_tree2(axis.title.x = element_text(size = 24, face = "bold"),
              axis.title.y = element_blank(),
              axis.text.x = element_text(face = "bold", size = 15, vjust = 1, 
                                         hjust = 1, angle = 45),
              axis.text.y = element_blank(),
              plot.title = element_text(size = 28, face = "bold"))