Marker gene phylogenetic tree

Note: this code was not used to create the Caudovirales phylogeny shown in figure 4. For that, please follow the description of methods from https://doi.org/10.1038/s41564-019-0448-z. This code is useful to create a phylogeny for an order or family level viral clade. But it is not sensitive enough to find sufficient prevalent genes across an entire viral class like the Caudoviricites

Requirements

HMMER v3.1b2
FAMSA v1.2.5
trimAL v1.4.rev22
FastTreeMP v2.1.10
MCL v14-137

You should be able to run the following from your command line:
hmmsearch
famsa
trimal
FastTreeMP
mcl
prodigal

Locate viral proteins of genomes from an individual clade

proteins should have headers with the format: [GENOME_ID]_[PROTEIN_NUM]
this is the format that Prodigal will produce

Run pipeline
python marker_tree.py --in_faa proteins.faa --out_dir out

Pipeline overview

Make diamond db of proteins
Perform all-vs-all alignment of proteins using diamond (E-value<1e-5 )
Run MCL to cluster proteins (1.4 inflation factor)
Select protein clusters (copy number <1.1, prevalence >10%)
Write seqs for each PC
Build multiple seq alignment (MSA) for each PC using FAMSA
Build hmm from each MSA using HMMER
Search hmms versus original proteins (E-value<1e-5 )
Extract proteins for top hmm hits
Build multiple seq alignment (MSA) for each PC using expanded set of homologs using FAMSA
Trim multiple seq alignments with trimAl (discard columns with >50% gaps)
Concatenate alignments and fill missing genes with gaps
Count gaps per genome
Write concatenated alignment for genomes with <90% gaps
Built phylogeny with FastTreeMP using default options

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Marker gene phylogenetic tree

Files

README.md

Latest commit

History

README.md

File metadata and controls

Marker gene phylogenetic tree