Note: this code was not used to create the Caudovirales phylogeny shown in figure 4. For that, please follow the description of methods from https://doi.org/10.1038/s41564-019-0448-z. This code is useful to create a phylogeny for an order or family level viral clade. But it is not sensitive enough to find sufficient prevalent genes across an entire viral class like the Caudoviricites
Requirements
- HMMER v3.1b2
- FAMSA v1.2.5
- trimAL v1.4.rev22
- FastTreeMP v2.1.10
- MCL v14-137
You should be able to run the following from your command line:
hmmsearch
famsa
trimal
FastTreeMP
mcl
prodigal
Locate viral proteins of genomes from an individual clade
- proteins should have headers with the format:
[GENOME_ID]_[PROTEIN_NUM]
- this is the format that Prodigal will produce
Run pipeline
python marker_tree.py --in_faa proteins.faa --out_dir out
Pipeline overview
- Make diamond db of proteins
- Perform all-vs-all alignment of proteins using diamond (E-value<1e-5 )
- Run MCL to cluster proteins (1.4 inflation factor)
- Select protein clusters (copy number <1.1, prevalence >10%)
- Write seqs for each PC
- Build multiple seq alignment (MSA) for each PC using FAMSA
- Build hmm from each MSA using HMMER
- Search hmms versus original proteins (E-value<1e-5 )
- Extract proteins for top hmm hits
- Build multiple seq alignment (MSA) for each PC using expanded set of homologs using FAMSA
- Trim multiple seq alignments with trimAl (discard columns with >50% gaps)
- Concatenate alignments and fill missing genes with gaps
- Count gaps per genome
- Write concatenated alignment for genomes with <90% gaps
- Built phylogeny with FastTreeMP using default options