Sylph unable to profile low coverage SAGs #40
Comments
@lingrongjin there are a few things that come to mind
Hi Jim, thanks for the suggestions. I tried setting -m 85 and it significantly increased the number of SAGs classified. With -m 85, can we assume that the resulting classifications are roughly accurate at the genus level?
If ANI is > 85%, the match is almost certainly within the same genus. I think ANI can dip below 75% for organisms within the same genus, so you may still miss some detections.
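As a rough illustration of the setting discussed above, here is a minimal Python sketch that assembles a `sylph profile` command line with the minimum ANI lowered to 85 via `-m`. The helper name `build_sylph_profile_cmd` and the file names `gtdb.syldb` and `sag_0001.sylsp` are hypothetical placeholders, not taken from the thread; the sketch only builds and prints the command, it does not run sylph.

```python
import shlex


def build_sylph_profile_cmd(database, samples, min_ani=85):
    """Build an argument list for `sylph profile` with a custom -m value.

    `database` and `samples` are paths chosen by the caller (hypothetical
    here); `min_ani` is the minimum ANI percentage passed to -m.
    """
    if not 0 < min_ani <= 100:
        raise ValueError("min_ani must be a percentage in (0, 100]")
    if not samples:
        raise ValueError("at least one sample is required")
    return ["sylph", "profile", database, *samples, "-m", str(min_ani)]


# Hypothetical file names, for illustration only.
cmd = build_sylph_profile_cmd("gtdb.syldb", ["sag_0001.sylsp"], min_ani=85)
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run` once the paths point at real files. As noted above, matches accepted at this threshold are best read as genus-level rather than species-level assignments.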
That makes sense. When you say that sylph assumes a "metagenomics-like read coverage distribution across the genome", do you mean that genome completeness may affect detection by sylph? Lowering the ANI threshold increases the number of SAGs classified with both my custom genome database and the GTDB database, but a significant portion still remains unprofiled at 85% ANI. Besides the database possibly not being comprehensive enough, could the low genome completeness of many SAGs (i.e. <25%) affect their detection as well?
I don't mean completeness necessarily (sylph should work even at < 10% completeness); however, this depends on how the reads are sequenced. I believe single-cell amplified genomes have uneven coverage across the genome compared to whole genome sequencing. This may negatively affect sylph's results. If your SAGs are from an environment that is not well characterized, database incompleteness is the likely issue. Perhaps check out SingleM: https://github.com/wwood/singlem
I'm trying to use sylph to profile single-cell amplified genomes (SAGs); however, many of my SAG sample files do not pass the profiling threshold. I tried profiling against the pre-built GTDB database and a sylph database built from MAGs and SAGs assembled from the same samples, but only 2700 out of ~17000 SAGs with >10000 clean reads can be profiled by sylph. What could be causing the low classification rate? From my understanding, most SAGs represent single-species bacterial genomes, and with ~10000 reads, sylph should be able to classify them if they are represented in the database.
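For scale, the counts quoted above work out to roughly 16% of the SAGs being profiled. A minimal Python sketch of that arithmetic follows; the helper name `profiled_fraction` is hypothetical, and the total is the approximate figure from the post.

```python
def profiled_fraction(n_profiled, n_total):
    """Return the fraction of samples that produced a profile."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_profiled / n_total


# Counts as reported in the post (the total is approximate).
rate = profiled_fraction(2700, 17000)
print(f"{rate:.1%}")
```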