Core protocol of this project. Phylogentic reconstruction of evolutionary relationships of individual gene families and whole genomes.
This protocol is fully programmed, without the need for any manual curation. Although in the actual analysis one needs to manually kick-start individual computational tasks, since they are so heavy with this amount of data and have to be scheduled with care on a super cluster (in our case, SDSC Comet).
- UPP 2.0
- FastTree 2.1.9
- RAxML 8.2.10
- IQ-TREE 1.6.1
- PhyloPhlAn 2c0e61a
- TreeShrink 1.1
- ASTRAL 5.12.6a
Note: the commands listed below have ignored multi-threading parameters and filename-specific parameters.
Align amino acid sequences using UPP.
run_upp.py -s seqfile.fa -B 100000 -M -1 -T 0.66 -m amino
- UPP was run with all sequences but fragments are used for the backbone. Sequences with length L < 0.34*M or L > 1.33*M, where M is the median length of all the sequences, were detected as fragments and not included in the backbone.
After aligned the sequences by UPP, we filtered out sites with more than 95% gaps, then filtered out sequences with more than 66% gaps.
Drop marker genes with more than 75% gaps in the alignment matrix.
This left 381 out of 400 marker genes, and left 10,575 out of 11,079 genomes.
We used ProteinModelSelection.pl
as bundled in RAxML:
perl ProteinModelSelection.pl align.fa > best_model.txt
Build a starting tree using FastTree:
FastTree -lg -gamma -seed 12345 align.fa > fasttree.nwk
Remove outliers (low quality sequences, contaminations, etc.) presented as unproportionally long branches in the FastTree trees using TreeShrink:
run_treeshrink.py -i input_dir -t fasttree.nwk -a align.fa -o output_dir
Infer gene tree topology using CAT in RAxML. Three runs were performed, one with the FastTree tree as the starting tree; the other two with random seeds:
raxmlHPC -m PROTCATLG -F -f D -D -s align.fa -t fasttree.nwk
raxmlHPC -m PROTCATLG -F -f D -D -s align.fa -p 12345
raxmlHPC -m PROTCATLG -F -f D -D -s align.fa -p 23456
- The amino acid substitution matrix
LG
can be replaced with other gene-specific matrices.
Optimize branch lengths and compute likelihood score using Gamma in RAxML:
raxmlHPC -m PROTGAMMALG -f e -s align.fa -t cat.nwk
If one or more of the three runs fail due to computational limitation, use IQ-TREE instead for all three:
iqtree -m LG+G4 -s align.fa -te cat.nwk
Among the three trees, keep one with the highest likelihood score.
The 381 gene alignments were concatenated into a supermatrix. To reduce computational cost, and to improve alignment reliability, we performed two types of site sampling:
- Select up to k most conserved sites per gene, using the Trident scoring function implemented in PhyloPhlAn.
import phylophlan as ppa
ppa.subsample('path/to/input/folder',
'path/to/output/folder',
ppa.onehundred,
ppa.trident,
'substitution_matrices/pfasum60.pkl',
unknown_fraction=0.3)
- Randomly select k sites per gene, from sites with less than 50% gaps.
import phylophlan as ppa
ppa.subsample('path/to/input/folder',
'path/to/output/folder',
ppa.onehundred,
ppa.random_score,
'substitution_matrices/pfasum60.pkl',
unknown_fraction=0.3)
In the current release, k = 100.
The procedures are overall similar to those used for generating individual gene trees.
Generate a starting tree using FastTree:
FastTree -lg -gamma -seed 12345 concat.phy > fasttree.nwk
Infer topology using CAT in RAxML:
raxmlHPC -m PROTCATLG -F -f D -s concat.phy -t fasttree.nwk
raxmlHPC -m PROTCATLG -F -f D -s concat.phy -p 12345
raxmlHPC -m PROTCATLG -F -f D -s concat.phy -p 23456
Optimize branch lengths and compute likelihood score using Gamma in IQ-TREE:
iqtree -m LG+G4 -s concat.phy -te raxml.nwk
Keep the highest-score tree of the three.
Branch support values were provided by 100 rapid bootstraps using RAxML:
raxmlHPC -m PROTCATLG -s concat.phy -p 12345 -x 12345 -N 100
raxmlHPC -m PROTCATLG -f b -z xboot.nwk -t iqtree.nwk
We use ASTRAL, which infers the optimal species tree topolgy by summarizing multiple gene trees.
java -jar astral.jar -i gene.trees -o astral.tre
The output file astral.tre
reports the following branch support values:
- EN, QC, f1, f2, f3, pp1, pp2, pp3, q1, q2, q3
In which three are cared:
- EN: **Effective number (en): number of gene trees that provided information.
- q1: Quartet score (qts): proportion of gene trees that support this branch.
- pp1: Local posterior probability (lpp): computed based on the quartet score.
The branch lengths in astral.tre
are in units of coalescence. To obtain "conventional" branch lengths, i.e., number of mutations per site, we used the concatenated alignments:
iqtree -s align.fa -te tree.nwk -keep-ident -m LG+G4 -pre [] -nt 24
Please also see tree manipulation.