Things to consider

Mike Lee edited this page Jan 17, 2019 · 24 revisions

When to use GToTree and when not?

GToTree is useful in many situations, but not all. For example, it is useful if you want to make a large-scale Tree of Life spanning all 3 domains and including a lot of genomes (as demonstrated here), or if you want to figure out where some new genomes fit in with references (like the Alteromonas example here). But even if you use a specific marker-gene set to make a tree of all the organisms of interest (like the Gammaproteobacteria set in the Alteromonas example), the tree is only useful at the level of resolution those marker genes provide. Often that may be enough for your purposes, but sometimes you might need or want to go deeper. In cases like that, you may want to use GToTree to figure out where your new genomes fit in with, say, 500 reference genomes, and then use that tree to identify which reference genomes you actually want to include in a pangenomic analysis with your new genomes. Then, using something like anvi'o's pangenomic workflow, you can identify many single-copy genes that are specific to the subset of genomes you are focusing on (with some excellent filtering metrics), and eventually generate a phylogenomic tree that is highly specific to what you are working on.

Filtering hits by gene-length

The default setting for this value (set with "-c") is 0.2. This means that if the median length of all genes selected as best hits to marker-gene A is 100, hits to marker-gene A that are longer than 120 or shorter than 80 will be removed from the analysis. This seems to work well in my experience, but only when there are enough genes in the gene set to give something like a normal distribution. At the extreme end, if you only had 3 genes to consider, and their lengths were 100, 100, and 121, the 121 gene would be filtered out, but maybe it shouldn't be. If running GToTree with very few genomes, you might consider increasing this threshold and/or visually inspecting some of the alignments.
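
To make the arithmetic above concrete, here is a minimal sketch of a median-based length filter in Python. It is not GToTree's actual code: the function name `filter_by_length` and its arguments are hypothetical, and it only assumes the rule as described here (drop hits whose length differs from the median by more than the cutoff fraction).

```python
from statistics import median


def filter_by_length(lengths, cutoff=0.2):
    """Split hit lengths for one marker gene into (kept, removed).

    Hypothetical illustration only: a hit is removed when its length
    differs from the median length by more than cutoff * median.
    """
    mid = median(lengths)
    allowed = cutoff * mid  # e.g. 0.2 * 100 = 20, so the window is 80 to 120
    kept = [n for n in lengths if abs(n - mid) <= allowed]
    removed = [n for n in lengths if abs(n - mid) > allowed]
    return kept, removed


# The small-sample case from the text: the 121-long gene is removed.
kept, removed = filter_by_length([100, 100, 121])

# Raising the cutoff to 0.25 widens the window to 75 to 125 and keeps it.
kept_relaxed, removed_relaxed = filter_by_length([100, 100, 121], cutoff=0.25)
```

With only three lengths, the median is set by the two identical values, so a single slightly longer gene falls just outside the window. That is the situation in which relaxing the cutoff may be worth considering.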

Filtering genomes by fraction of hits to targets

The default setting for this value (set with "-G") is 0.5, meaning that if you are searching for 100 genes, genomes with hits to fewer than 50 of them will be dropped from the analysis. This seems reasonable to me when creating a tree that spans a lot of diversity, like all 3 domains, but you may want to increase this threshold when working with a more closely related set of organisms.
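
The same rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not GToTree's implementation: `filter_genomes`, the genome names, and the hit counts are all made up for the example.

```python
def filter_genomes(hit_counts, n_targets, min_fraction=0.5):
    """Split genomes into (kept, dropped) by fraction of target genes hit.

    Hypothetical illustration only. hit_counts maps a genome name to the
    number of target genes it had a hit to; n_targets is the total number
    of target genes searched for.
    """
    kept, dropped = [], []
    for genome, n_hits in hit_counts.items():
        if n_hits / n_targets >= min_fraction:
            kept.append(genome)
        else:
            dropped.append(genome)
    return kept, dropped


# Made-up counts for a search against 100 target genes.
counts = {"genome_A": 80, "genome_B": 50, "genome_C": 49}

# At 0.5, only genome_C (49 of 100) is dropped.
kept, dropped = filter_genomes(counts, n_targets=100)

# A stricter threshold of 0.75 also drops genome_B.
kept_strict, dropped_strict = filter_genomes(counts, n_targets=100, min_fraction=0.75)
```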

Conservative mode

By default, if a given genome has more than one hit to a specific HMM profile (target gene), GToTree will take the best hit and use that in the concatenated alignment. This is because the single-copy gene-sets were generated by taking those Pfam-HMMs that have exactly 1 hit in greater than 90% of ~11,500 searched genomes. For any given genome, it is not all that uncommon to have 1 or so of these target genes present in more than 1 copy (at least as determined by the HMM profile and its cutoffs). My thinking is that in these instances the best hit should be from the protein that is under similar evolutionary pressures to those that built the HMM. So the default behavior still lets that genome contribute to that part of the final alignment. But if you specify the "-C" flag with no arguments, GToTree will run in "conservative mode", and if a genome has more than one hit to a particular HMM, that genome won't contribute a gene for that HMM to the alignment.
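
The difference between the two modes can be sketched as follows. This is a hypothetical illustration, not GToTree's code: the function `select_hits`, the HMM and gene identifiers, and the scores are invented, and treating the highest score as the "best hit" is an assumption made for the example.

```python
def select_hits(hits, conservative=False):
    """Choose at most one gene per target HMM for a single genome.

    Hypothetical illustration only. hits maps an HMM name to a list of
    (gene_id, score) tuples; "best hit" is assumed to mean highest score.
    """
    selected = {}
    for hmm, candidates in hits.items():
        if not candidates:
            continue  # no hit: nothing to contribute for this HMM
        if conservative and len(candidates) > 1:
            continue  # conservative mode: skip HMMs with multiple hits
        best_gene, _best_score = max(candidates, key=lambda c: c[1])
        selected[hmm] = best_gene
    return selected


# Invented hits for one genome: hmm_2 has two candidate genes.
genome_hits = {
    "hmm_1": [("gene_10", 250.0)],
    "hmm_2": [("gene_42", 180.5), ("gene_77", 95.2)],
    "hmm_3": [],
}

default_choice = select_hits(genome_hits)
conservative_choice = select_hits(genome_hits, conservative=True)
```

In the default call, `hmm_2` is represented by `gene_42`, the higher-scoring candidate. In the conservative call, `hmm_2` is left out for this genome, so it would contribute nothing for that HMM to the alignment.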