Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Universal ToL and Eukaryotes #98

Open
molly-kholodova opened this issue Sep 17, 2024 · 9 comments
Open

Universal ToL and Eukaryotes #98

molly-kholodova opened this issue Sep 17, 2024 · 9 comments

Comments

@molly-kholodova
Copy link

Hi Mike,

I'm a PhD student in Laura Hug's lab currently working on a side project related to constructing universal ToLs. I was following your ToL example using the universal marker gene set and noticed a big discrepancy when I try to run the same code.

I downloaded your code and original files (https://figshare.com/articles/dataset/GToTree_ToL_example_data/19372322) and ran it myself, so it should be using the exact same accessions that you fetched in 2018 (1698 genomes). However, the resulting tree has more organisms that were excluded on the basis of having too few hits (Genomes_removed_for_too_few_hits.tsv). Your run had 7, mine had 17.
ToL_test_MC_Aug2024.zip

Taking a closer look at the NCBI_genomes_summary_info.tsv file, if I look up a eukaryotic accession from the Genomes_removed_for_too_few_hits.tsv file, it actually has numbers >1 in many of the columns. From what I can tell, it treats multiple hits for a gene as no hits. When I ran it again using -G 0 instead of -G 0.4 (I think this is the minimum genes for inclusion?) it was able to include the genomes again in the tree.

I understand you have a disclaimer about working with eukaryotes in GToTree in terms of completion and redundancy, but currently even reference eukaryotic genomes are unable to be placed, which is problematic for creating universal trees.

As a side note, an additional issue to consider when working with eukaryotic genomes is the possibility that mitochondrial or chloroplast genes could be mistaken for marker genes, potentially leading to misplacement or conflicting data. In cases of multiple hits, how does GToTree decide which one to use? If it is based on best match/lowest E value, then it would prioritize mitochondrial and chloroplast genes over eukaryotic versions. I ran into this a few times in my own project where it would place a big group of plants and photosynthetic eukaryotes next to the cyanobacteria.

I am implementing a solution right now by building a separate HMM for eukaryotic ribosomal genes and running that separately for eukaryotic genomes. This does require sorting the genomes beforehand, but perhaps you could incorporate it into GToTree as an option for dealing with genomes with multiple SCG hits.

Let me know if you want more info on anything I've mentioned here or would like to discuss further.

@AstrobioMike
Copy link
Owner

AstrobioMike commented Sep 17, 2024

Hey there, Molly :)

Thanks for writing in!

So the main thing you hit on (regarding how GToTree deals with multi-hits) has changed since i put the original paper out, so that is likely the primary cause of the discrepancy you saw.

Looking through the release notes, it seems it was Jan of 2021 with v1.1.1 when i implemented this change. Since then, if there are multiple hits to a target, GToTree takes the more conservative route by default and excludes that target-gene overall since it wasn't found in exactly 1 copy (therefore potentially leading to more dropped genomes for not hitting the minimum proportion of target genes, -G, that you adjusted). And i changed this default behavior for the very reasons you mention 👍 The option to run it the old way exists by turning on "best-hit mode" by adding the -B flag.

On to the rest of your message:

I am implementing a solution right now by building a separate HMM for eukaryotic ribosomal genes and running that separately for eukaryotic genomes. This does require sorting the genomes beforehand, but perhaps you could incorporate it into GToTree as an option for dealing with genomes with multiple SCG hits.

YES! I would love to implement your work to improve things on this front :)

The first easier step I can imagine is to just add the HMMs as a dedicated eukarya set if folks are making a tree with just eukarya.

And then i think a second step, which would take a bit more work, would be handling when working across domains and having some upfront way of determining which inputs are euks and should be treated differently like you're saying.

I don't think i'd want to employ it always when there are multiple hits to something, because that can be the case in non-euks too. But maybe i'm misunderstanding or mis-thinking something at the moment. But yea, i'd love to talk more! Maybe let's email and then can coordinate a time to hop on a call? You can reach me at [email protected]

And thanks again for writing in about this!

@molly-kholodova
Copy link
Author

Hi Mike,

Thanks for your quick response, and the note about best-hit mode, that clears things up a lot!

Yes, I'd be very happy to chat about this topic, will reach out shortly :)

@ezherman
Copy link

Hi both,

I have been playing around with GToTree and have encountered similar issues to those described by @molly-kholodova. Reading your discussion has been informative - thank you!

I am implementing a solution right now by building a separate HMM for eukaryotic ribosomal genes and running that separately for eukaryotic genomes. This does require sorting the genomes beforehand, but perhaps you could incorporate it into GToTree as an option for dealing with genomes with multiple SCG hits.

This sounds like a great solution - would you be willing to share your approach to building these HMMs, and/or the HMMs themselves?

Thanks!

@molly-kholodova
Copy link
Author

Hi Ezra,

Sure thing! I've uploaded them to a repository [https://github.com/molly-kholodova/HMMs].

In terms of constructing them:

  • I went on the InterPro site and looked up (Search -> by Text) each gene by the PFAM ID (PFAM got merged into InterPro with other databases).
  • Then found the corresponding InterPro entry (starts with IPR___)
  • Clicked on 'proteins' for that entry to get a list of all protein seqs in the database. You can filter by 'reviewed' to get better quality ones.
  • Unfortunately I couldn't figure out how to filter by taxonomy, but you could probably do it after downloading. For my purposes, because these genes are well conserved I didn't need a ton of representative sequences. So I just searched for some model organisms: homo sapiens, arabidopsis, aspergillus, c elegans, dictyostelium, paramecium.
  • Copied the sequences into a textfile and ran hmm-build on them.

An important note for ribosomal genes: databases are often not clear on the true orthologs between prokaryotes/eukaryotes. Gene names differ between groups. InterPro has attempted to map them to common entries but there are still discrepancies. In some cases you have to use a separate IPR entry to get the euk ortholog otherwise it will include mitochondria/chloroplast genes.

If it helps, I've attached a spreadsheet with the relevant info:
universal_ribosomal_genes.xlsx

These have been validated in downstream analyses (e.g., MSAs).

Hope this helps!

Molly

PS: I'm currently doing a semester abroad in the UK. Interested in learning about Esox and potential bioinformatics job opportunities. Would you be free to connect sometime?

@ezherman
Copy link

Awesome, thank you @molly-kholodova! Good to know regarding the IPR entries and great that you labelled them within the spreadsheet.

Have you tried using these HMMs with GToTree? The data I was working with today only included a few eukaryotes, however the tool found no genes for them using the HMMs (incl. a Saccharomyces cerevisiae reference genome).

Yes, it'd be great to have a chat! Please reach out via https://www.esoxbiologics.com/contact and mention my name, then we can use email to arrange a time.

@molly-kholodova
Copy link
Author

No problem @ezherman! Appreciate the contact info.

Interesting that you got no hits with the default HMM set, usually the problem is getting too many. S cerivisae has a bunch of ribosomal gene duplications I think.

I have not tried the euk specific ones with GToTree since I was testing another tree software, but I would expect it to work. If hmmsearch is giving you hits for this set but not the other, maybe it's an issue with cutoffs.

@AstrobioMike
Copy link
Owner

Heya, @ezherman :)

Do you still get no hits if you run it in "best-hit mode" by adding the -B flag? Or just looking at the count summary table produced from earlier runs, is it saying 0 for everything?

@ezherman
Copy link

Hi @AstrobioMike!

Yep, I had no hist with -B: all 0s in SCG_hit_counts.tsv

@AstrobioMike
Copy link
Owner

AstrobioMike commented Nov 26, 2024

Ah, i see the deal now, @ezherman. GToTree relies on "gathering thresholds" (GA) to be included in the HMM files like this:
image

And Molly was probably using a different method of filtering out spurious hits for these HMMs, as they don't have a GA included.

GAs are a manually curated cutoff, as noted in this article in the "Thresholds and compatibility" section:

The gathering threshold is the bit score threshold that a sequence must match or exceed in order for it to be deemed significant and therefore belonging to that family. In Pfam, we manually define our thresholds such that no known false positives are permitted in any family.

A GA cutoff can just be manually added to any HMM file, it just needs to be empirically derived with some testing to feel confident about the chosen cutoff score.

Sorry for the confusion!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants