-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Universal ToL and Eukaryotes #98
Comments
Hey there, Molly :) Thanks for writing in! So the main thing you hit on (regarding how GToTree deals with multi-hits) has changed since i put the original paper out, so that is likely the primary cause of the discrepancy you saw. Looking through the release notes, it seems it was Jan of 2021 with v1.1.1 when i implemented this change. Since then, if there are multiple hits to a target, GToTree takes the more conservative route by default and excludes that target-gene overall since it wasn't found in exactly 1 copy (therefore potentially leading to more dropped genomes for not hitting the minimum proportion of target genes, On to the rest of your message:
YES! I would love to implement your work to improve things on this front :) The first easier step I can imagine is to just add the HMMs as a dedicated eukarya set if folks are making a tree with just eukarya. And then i think a second step, which would take a bit more work, would be handling when working across domains and having some upfront way of determining which inputs are euks and should be treated differently like you're saying. I don't think i'd want to employ it always when there are multiple hits to something, because that can be the case in non-euks too. But maybe i'm misunderstanding or mis-thinking something at the moment. But yea, i'd love to talk more! Maybe let's email and then can coordinate a time to hop on a call? You can reach me at [email protected] And thanks again for writing in about this! |
Hi Mike, Thanks for your quick response, and the note about best-hit mode, that clears things up a lot! Yes, I'd be very happy to chat about this topic, will reach out shortly :) |
Hi both, I have been playing around with GToTree and have encountered similar issues to those described by @molly-kholodova. Reading your discussion has been informative - thank you!
This sounds like a great solution - would you be willing to share your approach to building these HMMs, and/or the HMMs themselves? Thanks! |
Hi Ezra, Sure thing! I've uploaded them to a repository [https://github.com/molly-kholodova/HMMs]. In terms of constructing them:
An important note for ribosomal genes: databases are often not clear on the true orthologs between prokaryotes/eukaryotes. Gene names differ between groups. InterPro has attempted to map them to common entries but there are still discrepancies. In some cases you have to use a separate IPR entry to get the euk ortholog otherwise it will include mitochondria/chloroplast genes. If it helps, I've attached a spreadsheet with the relevant info: These have been validated in downstream analyses (e.g., MSAs). Hope this helps! Molly PS: I'm currently doing a semester abroad in the UK. Interested in learning about Esox and potential bioinformatics job opportunities. Would you be free to connect sometime? |
Awesome, thank you @molly-kholodova! Good to know regarding the IPR entries and great that you labelled them within the spreadsheet. Have you tried using these HMMs with GToTree? The data I was working with today only included a few eukaryotes, however the tool found no genes for them using the HMMs (incl. a Saccharomyces cerevisiae reference genome). Yes, it'd be great to have a chat! Please reach out via https://www.esoxbiologics.com/contact and mention my name, then we can use email to arrange a time. |
No problem @ezherman! Appreciate the contact info. Interesting that you got no hits with the default HMM set, usually the problem is getting too many. S cerivisae has a bunch of ribosomal gene duplications I think. I have not tried the euk specific ones with GToTree since I was testing another tree software, but I would expect it to work. If hmmsearch is giving you hits for this set but not the other, maybe it's an issue with cutoffs. |
Heya, @ezherman :) Do you still get no hits if you run it in "best-hit mode" by adding the -B flag? Or just looking at the count summary table produced from earlier runs, is it saying 0 for everything? |
Hi @AstrobioMike! Yep, I had no hist with |
Ah, i see the deal now, @ezherman. GToTree relies on "gathering thresholds" (GA) to be included in the HMM files like this: And Molly was probably using a different method of filtering out spurious hits for these HMMs, as they don't have a GA included. GAs are a manually curated cutoff, as noted in this article in the "Thresholds and compatibility" section:
A GA cutoff can just be manually added to any HMM file, it just needs to be empirically derived with some testing to feel confident about the chosen cutoff score. Sorry for the confusion! |
Hi Mike,
I'm a PhD student in Laura Hug's lab currently working on a side project related to constructing universal ToLs. I was following your ToL example using the universal marker gene set and noticed a big discrepancy when I try to run the same code.
I downloaded your code and original files (https://figshare.com/articles/dataset/GToTree_ToL_example_data/19372322) and ran it myself, so it should be using the exact same accessions that you fetched in 2018 (1698 genomes). However, the resulting tree has more organisms that were excluded on the basis of having too few hits (Genomes_removed_for_too_few_hits.tsv). Your run had 7, mine had 17.
ToL_test_MC_Aug2024.zip
Taking a closer look at the NCBI_genomes_summary_info.tsv file, if I look up a eukaryotic accession from the Genomes_removed_for_too_few_hits.tsv file, it actually has numbers >1 in many of the columns. From what I can tell, it treats multiple hits for a gene as no hits. When I ran it again using
-G 0
instead of-G 0.4
(I think this is the minimum genes for inclusion?) it was able to include the genomes again in the tree.I understand you have a disclaimer about working with eukaryotes in GToTree in terms of completion and redundancy, but currently even reference eukaryotic genomes are unable to be placed, which is problematic for creating universal trees.
As a side note, an additional issue to consider when working with eukaryotic genomes is the possibility that mitochondrial or chloroplast genes could be mistaken for marker genes, potentially leading to misplacement or conflicting data. In cases of multiple hits, how does GToTree decide which one to use? If it is based on best match/lowest E value, then it would prioritize mitochondrial and chloroplast genes over eukaryotic versions. I ran into this a few times in my own project where it would place a big group of plants and photosynthetic eukaryotes next to the cyanobacteria.
I am implementing a solution right now by building a separate HMM for eukaryotic ribosomal genes and running that separately for eukaryotic genomes. This does require sorting the genomes beforehand, but perhaps you could incorporate it into GToTree as an option for dealing with genomes with multiple SCG hits.
Let me know if you want more info on anything I've mentioned here or would like to discuss further.
The text was updated successfully, but these errors were encountered: