Sylph unable to profile low coverage SAGs #40

lingrongjin · 2025-01-02T07:28:05Z

I'm trying to use sylph to profile single-cell amplified genomes (SAGs); however, many of my SAG samples do not pass the profiling threshold. I tried profiling against both the pre-built GTDB database and a sylph database built from MAGs and SAGs assembled from the same samples, but only 2700 of the ~17000 SAGs with >10000 clean reads can be profiled by sylph. What could be causing the low classification rate? From my understanding, most SAGs represent single-species bacterial genomes, and with ~10000 reads sylph should be able to classify them if they are represented in the database?

bluenote-1577 · 2025-01-02T08:35:28Z

@lingrongjin there are a few things that come to mind

- I've never profiled with single cell sequencing, but the coverage distribution is -- from my understanding -- very skewed compared to metagenomics. sylph assumes a metagenomics-like read coverage distribution across the genome.
- Are you sure the SAGs have a species-level representative in GTDB? sylph can only do species-level profiling well, so if your SAG is a new species, sylph won't work.
- You can try -m 85; this will check whether the database contains genomes with > 85% ANI to your SAG (very approximate/rough).
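To see how much a lower -m threshold changes the classification rate, one option is to tabulate the profile output per sample. The following is a minimal sketch, not part of sylph itself: it assumes the output is a tab-separated table with a `Sample_file` column and an `Adjusted_ANI` column (check the header of your own output), and the function names and the tiny example table are hypothetical.

```python
import csv
import io


def best_ani_per_sample(tsv_text, sample_col="Sample_file", ani_col="Adjusted_ANI"):
    """Return {sample: highest ANI among the genomes reported for it}.

    Column names are assumptions about the profile table; adjust as needed.
    """
    best = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        sample = row[sample_col]
        ani = float(row[ani_col])
        if sample not in best or ani > best[sample]:
            best[sample] = ani
    return best


def profiled_fraction(best, all_samples):
    """Fraction of input samples that have at least one reported genome."""
    hit = sum(1 for s in all_samples if s in best)
    return hit / len(all_samples)


# Hypothetical three-row table standing in for a real profile output.
example = (
    "Sample_file\tGenome_file\tAdjusted_ANI\n"
    "sag_001.fq\tgenome_A.fa\t96.2\n"
    "sag_001.fq\tgenome_B.fa\t88.1\n"
    "sag_002.fq\tgenome_C.fa\t86.4\n"
)
best = best_ani_per_sample(example)
samples = ["sag_001.fq", "sag_002.fq", "sag_003.fq"]
print(best)
print(profiled_fraction(best, samples))
```

Running this once on the default-threshold output and once on the -m 85 output gives a direct comparison of how many SAGs gain a hit.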

lingrongjin · 2025-01-07T01:57:08Z

Hi Jim, thanks for the suggestions. I tried setting -m 85 and it significantly increased the number of SAGs classified. With -m 85, can we assume that the resulting classification is roughly accurate at the genus level?

bluenote-1577 · 2025-01-07T03:57:03Z

@lingrongjin

If ANI is > 85%, the match is almost certainly within the same genus. I think ANI can dip below 75% between organisms of the same genus, though, so you may still miss some detections.
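As a rough illustration of that rule of thumb, the sketch below labels a hit by its ANI. The 85% genus cutoff comes from the comment above; the 95% species cutoff is a commonly used convention and an assumption here, and the function name and labels are hypothetical.

```python
def rough_rank(ani_percent, species_cutoff=95.0, genus_cutoff=85.0):
    """Map an ANI value (0-100) to a coarse, approximate rank label.

    Hits at or below the genus cutoff are left undetermined: organisms in
    the same genus can have lower ANI, so a low value does not rule it out.
    """
    if ani_percent >= species_cutoff:
        return "species-level match"
    if ani_percent > genus_cutoff:
        return "likely same genus"
    return "undetermined"


for ani in (96.2, 88.0, 80.0):
    print(ani, rough_rank(ani))
```

The labels are only as reliable as the ANI estimate, which is itself very approximate at this divergence.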

lingrongjin · 2025-01-07T06:45:48Z

That makes sense. When you say that sylph assumes a "metagenomics-like read coverage distribution across the genome", do you mean that genome completeness may affect detection by sylph? I found that lowering the ANI threshold increases the number of SAGs classified with both my custom genome database and the GTDB database, but a significant portion still remains unprofiled at 85% ANI. Besides the database possibly not being comprehensive enough, could the low genome completeness of many SAGs (i.e. <25%) affect their detection as well?

bluenote-1577 · 2025-01-07T22:38:11Z

I don't mean completeness necessarily -- sylph should work even with < 10% completeness -- however, this depends on how the reads are sequenced. I believe single-cell amplified genomes have uneven coverage across the genome compared to whole genome sequencing. This may negatively affect sylph's results.

If your SAGs are from an environment that is not well characterized, database incompleteness is the likely issue. Perhaps check out SingleM: https://github.com/wwood/singlem
