Things to consider
Phylogenetics is an incredibly complicated and well-researched field, and things become even more complicated when working with many concatenated genes, as is the case with phylogenomics. GToTree is meant to be a relatively high-throughput, user-friendly, and reproducible workflow, something I believe is useful given the high volumes of sequencing data and genomes we are often working with these days. But anything designed this way inherently sacrifices something in terms of flexibility, options, precision, etc. It is important that users new to this arena understand that many things impact the outcome of a phylogenetic/genomic analysis, particularly the alignment algorithm used and the model and program used for tree construction. Currently, GToTree employs only one alignment tool and two options for tree construction. Users can also take the concatenated alignment output by GToTree (and the partitions file if they'd like) and use it with many other tree-construction tools. But please keep in mind that phylogenetic analysis is complicated, and no one program or tool is an "absolute answer" or the "truth". Another way to think of this is with the old adage "all models are wrong".
GToTree is very useful in many situations, but not all. For example, it is useful if you want to make a large-scale Tree of Life spanning all 3 domains and including a lot of genomes (as demonstrated here). And it is useful if you want to infer evolutionary relationships between some newly recovered genomes and references on a smaller scale (like the Alteromonas example here). But even if you use a specific marker-gene set to make a tree of all the organisms of interest (like the Gammaproteobacteria set in the Alteromonas example), the tree is only useful at the level of resolution those marker genes provide. Often that may be enough for your purposes, but sometimes you might need or want to go deeper. In cases like this, you may want to use GToTree to figure out where your new genomes fit in with, say, 500 reference genomes, and then use that tree to identify which reference genomes you actually want to include in a pangenomic analysis with your new genomes. Then, using something like anvi'o's pangenomic workflow, you can identify many single-copy genes that are specific to the subset of genomes you are focusing on (with some excellent filtering metrics), and eventually generate a phylogenomic tree that is highly specific to what you are working on.
No, GToTree is not a tool for taxonomic assignment. For assigning taxonomy to genomes, there are dedicated programs that work very well, like the Genome Taxonomy Database Toolkit (GTDB-Tk). I would typically assign taxonomy to my new genomes with GTDB-Tk, and then use that information to figure out which reference genomes I'd want to include in a de novo phylogenomic tree I'd build with GToTree. You can, however, make a de novo tree, look at where your new genomes fall on that tree, and see which references they are most closely related to based on that tree. For example, with the Alteromonas example here, we start there "knowing" the new genome is an Alteromonas, and we build a de novo tree with all RefSeq reference Alteromonas genomes and our new one. Ahead of where that example starts, I may have figured out that my new genome was an Alteromonas by using GTDB-Tk.
The default setting for the gene-length filtering cutoff (set with -c) is 0.2. This means if the median length of all genes selected as best hits to marker-gene A is 100, genes that were hits to marker-gene A that are longer than 120 or shorter than 80 will be removed from the analysis. This seems to work well in my experience, but only when there are enough genes in the gene set to give a somewhat representative distribution of the lengths of genes that exist within that target gene set. Meaning, at the extreme end, if we only had 3 genes to consider, and their lengths were 100, 100, and 121, the 121-length gene would be filtered out, but maybe it shouldn't be. If running GToTree with very few genomes, you might consider increasing this threshold and/or visually inspecting some of the alignments.
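The arithmetic behind the length filter can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not GToTree's actual code; the function name `filter_by_length` and its `cutoff` parameter are made up for this example.

```python
from statistics import median

def filter_by_length(lengths, cutoff=0.2):
    """Split gene lengths into (kept, removed) by deviation from the median.

    Hypothetical helper illustrating the rule described in the text:
    genes longer than median * (1 + cutoff) or shorter than
    median * (1 - cutoff) are removed.
    """
    med = median(lengths)
    lower = med * (1 - cutoff)
    upper = med * (1 + cutoff)
    kept = [n for n in lengths if lower <= n <= upper]
    removed = [n for n in lengths if n < lower or n > upper]
    return kept, removed

# The small-sample case from the text: median is 100, so the window is 80 to 120.
kept, removed = filter_by_length([100, 100, 121])
print(kept, removed)  # [100, 100] [121]
```

With a looser cutoff such as 0.25, the window widens to 75 to 125 and the 121-length gene would be retained, which is the kind of adjustment to consider when only a few genomes are involved.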
The default setting for the minimum fraction of target genes a genome must have hits to (set with -G) is 0.5, meaning that if you are searching for 100 genes, genomes with hits to fewer than 50 will be dropped from the analysis. This seems reasonable to me when creating a tree that spans a lot of diversity, like all 3 domains, but you may want to increase this threshold when working with a more closely related set of organisms.
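As a rough illustration of that threshold, the sketch below drops genomes whose fraction of target genes found falls below the minimum. It is an assumption-laden example rather than GToTree's implementation; `genomes_to_keep`, `min_fraction`, and the genome names are hypothetical.

```python
def genomes_to_keep(hits_per_genome, num_targets, min_fraction=0.5):
    """Return the genomes with hits to at least min_fraction of the target genes.

    hits_per_genome maps a genome name to the number of target genes it had
    hits to. Hypothetical helper mirroring the rule described in the text.
    """
    kept = {}
    for genome, num_hits in hits_per_genome.items():
        if num_hits / num_targets >= min_fraction:
            kept[genome] = num_hits
    return kept

# Searching for 100 target genes: a genome with 49 hits is dropped, 50 is kept.
hits = {"genome_A": 49, "genome_B": 50, "genome_C": 90}
print(genomes_to_keep(hits, num_targets=100))
# {'genome_B': 50, 'genome_C': 90}
```

Raising `min_fraction` to, say, 0.9 in this sketch would keep only `genome_C`, which mirrors tightening the threshold for a closely related set of organisms.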
By default, if a given genome has more than one hit to a specific HMM profile (target gene), GToTree won't include a sequence for that target gene from that genome in the final concatenated alignment (it will insert a gap-sequence just as would be the case if that genome had 0 hits to the target). This is a conservative way to go, because if there are multiple copies of a target SCG present within a genome, the copies may not all be under the same evolutionary pressures, and which one we choose may impact the alignment and tree in ways we do not want it to. So I figure in general, being conservative is better for default settings. But if you'd like, you can specify the -B flag with no arguments to tell GToTree to run in "best-hit" mode. In this case, when a given genome has more than one hit to a specific target gene, GToTree will take the best hit and add it to the alignment.
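The difference between the default behavior and "best-hit" mode can be sketched as a small decision function. This is a simplified illustration, not how GToTree is implemented: the function name, the `(sequence, score)` tuples, and the idea of ranking hits by a numeric score are assumptions made for the example.

```python
def select_sequence(hits, best_hit_mode=False):
    """Choose the sequence one genome contributes for one target gene.

    hits is a list of (sequence, score) tuples (a hypothetical representation).
    Returns the chosen sequence, or None to signal that a gap-sequence
    should be inserted for this genome and target.
    """
    if not hits:
        return None  # no hits: gap-sequence
    if len(hits) == 1:
        return hits[0][0]  # exactly one hit: use it
    if best_hit_mode:
        # Multiple hits in best-hit mode: take the highest-scoring one.
        return max(hits, key=lambda hit: hit[1])[0]
    return None  # multiple hits, default mode: gap-sequence

multi = [("MKVLAT", 210.5), ("MKILST", 180.2)]
print(select_sequence(multi))                      # None
print(select_sequence(multi, best_hit_mode=True))  # MKVLAT
```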
If only using highly conserved ribosomal proteins (like those in the Tree of Life example using the Hug et al. 2016 SCG-set), and/or if all genes are already identified (e.g. the input source is an NCBI accession with gene calls or a GenBank file with gene calls), then GToTree is suitable for working with Eukaryotes in addition to Bacteria and Archaea. If no gene calls are available, then GToTree is likely not suitable for eukaryotic genomes, as the only gene-caller currently implemented is Prodigal, which is designed for prokaryotic genomes.