Outputting homozygous reference #213

Open
ghost opened this issue Mar 12, 2020 · 6 comments

Comments

ghost commented Mar 12, 2020

Hello,

Is there any way to output the homozygous reference bases in the pVCF? Can I have a pGVCF, with one line per base in my reference genome?

thanks

Contributor

mlin commented Mar 12, 2020

One line per base would lead to impractically large output files for GLnexus' main use cases. There is a proposal under discussion in GA4GH about standardizing a multi-sample GVCF format, which would summarize reference coverage in between variant sites. We are monitoring developments there but it will take some time yet to work its way through that process.
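To illustrate the difference between one line per base and summarizing reference coverage between variant sites, here is a minimal sketch that collapses per-base depths into blocks, in the general style of the reference blocks found in single-sample gVCFs. It is not the GA4GH proposal mentioned above and not anything GLnexus does; the function name `reference_blocks`, the choice of depth as the banding metric, and the band width are all assumptions made for illustration.

```python
def reference_blocks(depths, start=1, band=10):
    """Collapse per-base depths into (start, end, min_depth) blocks.

    Consecutive positions are merged while their depth falls in the same
    band (depth // band). Banding by depth is an illustrative assumption;
    real gVCF writers typically band on a quality metric instead.
    """
    blocks = []  # each entry: (block_start, block_end, min_depth, band_key)
    for offset, depth in enumerate(depths):
        pos = start + offset
        key = depth // band
        if blocks and blocks[-1][3] == key:
            # Same band as the previous position: extend the open block.
            block_start, _, min_depth, _ = blocks[-1]
            blocks[-1] = (block_start, pos, min(min_depth, depth), key)
        else:
            # Band changed (or first position): open a new block.
            blocks.append((pos, pos, depth, key))
    return [(s, e, m) for s, e, m, _ in blocks]


# Six per-base records collapse into three block records.
example_blocks = reference_blocks([30, 31, 35, 12, 14, 33])
```

Six positions become three records here; over a whole genome the same idea is what keeps the output size manageable compared with one line per base.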

Author

ghost commented Mar 12, 2020

I was asking because it seems the GATK joint caller has an "all sites" option.
I understand, however, that GLnexus places a strong emphasis on computational efficiency.

Contributor

mlin commented Mar 12, 2020

Yeah, it's not something GLnexus' main users have requested or would seem likely to use. If you were really dedicated, you could synthesize a GVCF exhibiting a fake variant with good quality metrics at every position and feed that in, causing GLnexus to generate a pVCF site for every position. (To be clear, I'm not recommending this -- I think it would work in principle, but there are always unforeseen problems.)

Author

ghost commented Mar 13, 2020

Okay,
I mention it in case the option would be useful for people calculating mutation rates. In principle we divide by the genome size, but in practice we divide by the number of "callable bases" in the genome, i.e. sites that are homozygous reference but would not have been filtered out had they been variant.
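The callable-bases denominator described above could be sketched roughly as follows. This assumes per-site depth and genotype quality are available for every position (which is exactly what an all-sites output would provide). The `Site` type, the function names, the thresholds, and the `copies` parameter are hypothetical; real filters depend on the pipeline.

```python
from dataclasses import dataclass


@dataclass
class Site:
    depth: int  # read depth at this position
    gq: int     # genotype quality at this position


def is_callable(site, min_depth=10, min_gq=20):
    """True if the site passes the filters a variant call would have to pass.

    The thresholds are placeholders, not recommended values.
    """
    return site.depth >= min_depth and site.gq >= min_gq


def mutation_rate(sites, n_mutations, copies=1):
    """Return (rate, callable_bases), with rate = n_mutations / (copies * callable_bases).

    `copies` is an assumed knob: set it to 2 to count both haploid copies
    of a diploid genome, or leave it at 1 to divide by callable bases only.
    """
    callable_bases = sum(1 for s in sites if is_callable(s))
    if callable_bases == 0:
        raise ValueError("no callable bases")
    return n_mutations / (copies * callable_bases), callable_bases


# Four sites, two of which pass both thresholds.
sites = [Site(30, 50), Site(5, 50), Site(30, 10), Site(12, 20)]
rate, callable_bases = mutation_rate(sites, n_mutations=1)
```

Dividing by all four sites instead of the two callable ones would halve the estimate, which is why the denominator matters.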

Contributor

mlin commented Mar 13, 2020

Thanks -- happy to leave this ticket open for others to +1 or comment

@ZexuanZhao

This feature would be important for calculating population genomic statistics that are sensitive to the total number of base pairs mapped, such as nucleotide diversity (π).

See here: https://pixy.readthedocs.io/en/latest/generating_invar/generating_invar.html
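As a rough illustration of why invariant sites matter for π, the sketch below computes nucleotide diversity as summed pairwise differences over summed pairwise comparisons across sites. This is one common formulation, written here under the assumption that each site is a list of allele codes with `None` for missing calls; the function names are hypothetical and this is not claimed to match any particular tool's implementation.

```python
from itertools import combinations


def site_counts(alleles):
    """Return (differences, comparisons) among called alleles at one site."""
    called = [a for a in alleles if a is not None]
    diffs = sum(1 for a, b in combinations(called, 2) if a != b)
    comps = len(called) * (len(called) - 1) // 2
    return diffs, comps


def nucleotide_diversity(sites):
    """Summed differences divided by summed comparisons over all sites."""
    total_diffs = 0
    total_comps = 0
    for alleles in sites:
        diffs, comps = site_counts(alleles)
        total_diffs += diffs
        total_comps += comps
    return total_diffs / total_comps if total_comps else float("nan")


variant = [0, 0, 1, 1]    # one segregating site, four haploid samples
invariant = [0, 0, 0, 0]  # homozygous reference in every sample

pi_variants_only = nucleotide_diversity([variant])
pi_all_sites = nucleotide_diversity([variant] + [invariant] * 3)
```

With only the variant site, π comes out four times higher than with the three invariant sites included, because the invariant sites add comparisons but no differences. Without all-sites output there is no way to tell an invariant site from a missing one.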
