Sylph unable to profile low coverage SAGs #40

Open
lingrongjin opened this issue Jan 2, 2025 · 5 comments

Comments

@lingrongjin

I'm trying to use sylph to profile single-cell amplified genomes (SAGs), but many of my SAG samples do not pass the profiling threshold. I tried profiling against both the pre-built GTDB database and a sylph database built from MAGs and SAGs assembled from the same samples, but only 2,700 of ~17,000 SAGs with >10,000 clean reads could be profiled. What could cause such a low classification rate? From my understanding, most SAGs represent single-species bacterial genomes, so with ~10,000 reads sylph should be able to classify them if they are represented in the database.
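For anyone wanting to reproduce the classification rate quoted above (2,700 of ~17,000 is roughly 16%), here is a minimal sketch that counts how many input samples appear in a sylph profile table. It assumes the output is a tab-separated file whose sample column is named `Sample_file`; that column name and all file names below are assumptions for illustration, so check them against your own output header.

```python
import csv
import io


def profiled_fraction(tsv_text, all_samples, sample_column="Sample_file"):
    """Return (count, fraction) of input samples with at least one reported hit.

    `sample_column` is an assumed header name; adjust it to match your table.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    profiled = {row[sample_column] for row in reader}
    hits = profiled & set(all_samples)
    return len(hits), len(hits) / len(all_samples)


# Hypothetical table: two of three SAGs received a hit.
example = (
    "Sample_file\tGenome_file\tAdjusted_ANI\n"
    "sag_001.fq\tgenome_A.fa\t97.2\n"
    "sag_003.fq\tgenome_B.fa\t95.8\n"
)
samples = ["sag_001.fq", "sag_002.fq", "sag_003.fq"]
print(profiled_fraction(example, samples))
```

Samples with no hit above the ANI threshold simply have no rows in the table, which is why the full list of input samples has to be passed in separately.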

@bluenote-1577
Owner

@lingrongjin there are a few things that come to mind

  • I've never profiled single-cell sequencing data, but from my understanding the coverage distribution is very skewed compared to metagenomics. sylph assumes a metagenomics-like read coverage distribution across the genome.
  • Are you sure the SAGs have a species-level representative in GTDB? sylph can only do species-level profiling well, so if your SAG is a new species, sylph won't work.
  • You can try -m 85; this checks whether the database contains genomes with > 85% ANI to your SAG (very approximate/rough).
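The `-m 85` suggestion could be scripted roughly as follows. This is a sketch, not an official wrapper: it assumes `sylph` is on the PATH, that the samples have already been sketched, and that the profile table is written to standard output. The database and sketch file names are hypothetical, and the `-t` thread flag should be checked against `sylph profile --help` for your version.

```python
import subprocess


def sylph_profile_cmd(database, sketches, min_ani=85, threads=4):
    """Build the argument list for `sylph profile` with a lowered ANI cutoff."""
    return [
        "sylph", "profile", database, *sketches,
        "-m", str(min_ani),   # minimum ANI, as suggested in this thread
        "-t", str(threads),   # thread count (assumed flag; verify with --help)
    ]


def run_profile(database, sketches, out_path, **kwargs):
    """Run sylph and save its standard output (the profile table) to a file."""
    with open(out_path, "w") as out:
        subprocess.run(sylph_profile_cmd(database, sketches, **kwargs),
                       stdout=out, check=True)


# Hypothetical file names; print the command instead of running it.
print(" ".join(sylph_profile_cmd("my_database.syldb", ["sag_001.sylsp"])))
```

Building the argument list separately from running it makes it easy to inspect the exact command before launching it on thousands of SAGs.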

@lingrongjin
Author

Hi Jim, thanks for the suggestions. I tried setting -m 85 and it significantly increased the number of SAGs classified. With -m 85, can we assume that the resulting classification is roughly accurate at the genus level?

@bluenote-1577
Owner

@lingrongjin

If ANI is > 85%, the match is almost certainly within the same genus. I think ANI can dip below 75% between organisms of the same genus, though, so -m 85 may still miss some genus-level matches.
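As a rough way to apply this rule of thumb when post-processing hits, the sketch below labels a hit by its ANI. The 85% cutoff comes from this thread; the 95% species cutoff is the commonly used convention and is an assumption here, as are the function name and labels. Hits below 85% are left unresolved rather than called a different genus, since within-genus ANI can be lower than that.

```python
def interpret_ani(ani_percent, species_cutoff=95.0, genus_cutoff=85.0):
    """Map an ANI value (0-100) to a coarse, heuristic taxonomic label."""
    if ani_percent >= species_cutoff:
        return "species-level match"
    if ani_percent >= genus_cutoff:
        return "likely same genus"
    # Low ANI does not rule out the same genus; it is simply uninformative.
    return "unresolved"


for ani in (97.3, 88.0, 76.5):
    print(ani, interpret_ani(ani))
```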

@lingrongjin
Author

That makes sense. When you say that sylph assumes a "metagenomics-like read coverage distribution across the genome", do you mean that genome completeness may affect detection by sylph? Lowering the ANI threshold increases the number of SAGs classified with both my custom genome database and the GTDB database, but a significant portion still remains unprofiled at 85% ANI. Besides the database possibly not being comprehensive enough, could the low genome completeness of many SAGs (i.e. <25%) affect their detection as well?

@bluenote-1577
Owner

bluenote-1577 commented Jan 7, 2025

I don't necessarily mean completeness: sylph should work even with < 10% completeness. However, this depends on how the reads are sequenced. I believe single-cell amplified genomes have uneven coverage across the genome compared to whole-genome sequencing, and this may negatively affect sylph's results.
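To illustrate what uneven coverage means here, the toy simulation below distributes the same number of reads over genome bins either evenly or with strongly skewed per-bin weights, then reports the fraction of bins hit at least once. This is not sylph's statistical model and not measured SAG data: the log-normal weights, bin counts, and function names are all assumptions chosen only to show the effect.

```python
import math
import random


def covered_fraction(weights, n_reads, rng):
    """Place n_reads into bins in proportion to weights; return fraction of bins hit."""
    bins = rng.choices(range(len(weights)), weights=weights, k=n_reads)
    return len(set(bins)) / len(weights)


def simulate(n_bins=2000, n_reads=2000, sigma=2.0, seed=0):
    """Compare even coverage with log-normally skewed coverage (toy assumption)."""
    rng = random.Random(seed)
    uniform = [1.0] * n_bins
    skewed = [math.exp(rng.gauss(0.0, sigma)) for _ in range(n_bins)]
    return (covered_fraction(uniform, n_reads, rng),
            covered_fraction(skewed, n_reads, rng))


even, uneven = simulate()
print(f"even coverage hits {even:.2f} of bins; skewed coverage hits {uneven:.2f}")
```

With the same read count, the skewed case piles reads onto a few regions and leaves much of the genome unsampled, so statistics that expect evenly spread reads can be thrown off.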

If your SAGs are from an environment that is not well characterized, database incompleteness is the likely issue. Perhaps check out SingleM: https://github.com/wwood/singlem
