Understanding how kmers are counted #24

Open
priyanka-surana opened this issue Oct 26, 2022 · 0 comments

I want to understand how kmers are counted in FastK and how that affects the totals in MerquryFK calculations.

Why do the Total values in the completeness.stats and qv files differ so much? What do they represent, and how do they relate to each other? I ran MerquryFK on a single genome assembled from PacBio HiFi and Hi-C data, against an Illumina kmer dataset.

# mMelMel1_T1.qv 
Assembly	No Support	Total	Error %	QV
GCA_922984935.2.subset	6005	7999890	0.0024	46.2

# mMelMel1_T1.completeness.stats 
Assembly	Region	Found	Total	% Covered
GCA_922984935.2.subset	all	2268391877	2268397787	100.00
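
For reference, here is a minimal sketch of how the derived columns in the two files seem to follow from the raw counts. It assumes MerquryFK uses the same QV formula as Merqury and that the run used k = 31; neither is stated above, and `merqury_qv` is just a name I made up for illustration.

```python
import math

def merqury_qv(no_support, total, k):
    """Merqury-style QV: per-base error derived from the fraction of unsupported kmers."""
    # Probability that a single base is correct, given the kmer survival rate
    p_correct = (1 - no_support / total) ** (1 / k)
    error = 1 - p_correct
    return error, -10 * math.log10(error)

# Counts from mMelMel1_T1.qv; k = 31 is an assumption, not taken from the run
error, qv = merqury_qv(6005, 7999890, k=31)
print(f"Error % = {error * 100:.4f}, QV = {qv:.1f}")

# Counts from mMelMel1_T1.completeness.stats
covered = 100 * 2268391877 / 2268397787
print(f"% Covered = {covered:.2f}")
```

Under those assumptions the printed values match the Error %, QV and % Covered columns after rounding, so the derived columns look consistent with the counts. What I do not understand is the two Total columns themselves.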

From Merqury (marbl/merqury#84):

The Total in QV are kmers that are 'present' in the assembly. So if there is one specific kmer found 3 times in the assembly, but never in the reads, it is counted as 3 error kmers (no support). The 3 error kmers are part of the Total.

The Total in completeness are distinct solid kmers in the reads. In other words, a kmer that is present over a certain frequency in the reads is counted as one kmer. I forgot how exactly the Total is computed in MerquryFK completeness. It's likely that it is only filtering out kmers with frequency of 1, which is the default in FastK? Might be a good question for Gene.
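
To make the two definitions above concrete, here is a toy sketch of both counts on made-up sequences. It is not how FastK or MerquryFK are implemented: the function names and the `min_count` threshold are hypothetical, and reverse complements are ignored for simplicity.

```python
from collections import Counter

def kmers(seq, k):
    """All kmers of seq, in order, repeats included."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def qv_totals(assembly, read_counts, k):
    """QV view: every kmer occurrence in the assembly counts, repeats included."""
    asm = kmers(assembly, k)
    no_support = sum(1 for km in asm if km not in read_counts)
    return no_support, len(asm)

def completeness_totals(assembly, read_counts, k, min_count=2):
    """Completeness view: distinct read kmers seen at least min_count times."""
    solid = {km for km, c in read_counts.items() if c >= min_count}
    found = len(solid & set(kmers(assembly, k)))
    return found, len(solid)

k = 3
reads = ["ACGTAC", "ACGTAC", "GTACGG"]   # toy reads
assembly = "ACGTTTTT"                     # toy assembly with a repeated error kmer

read_counts = Counter(km for r in reads for km in kmers(r, k))

qv_no_support, qv_total = qv_totals(assembly, read_counts, k)
comp_found, comp_total = completeness_totals(assembly, read_counts, k)

print("QV:           no support =", qv_no_support, " total =", qv_total)
print("Completeness: found      =", comp_found, " total =", comp_total)
```

Here TTT occurs three times in the assembly and never in the reads, so it adds three to both No Support and the QV Total, while the completeness Total only counts distinct read kmers above the threshold.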

Based on that explanation we expected the opposite of what we see: the Total for QV is ~8M, whereas the Total for completeness is ~2.2B.
