-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes to VCFv4.3 to codify local-allele parlance for sparse VCF #420
Conversation
Hi @yfarjoun, all,
WDYT, did I miss some close entanglement of these two ideas? |
It is implicit that REF is part of any ``local" context, and it always has index 0, even if the genotype is compound HET. | ||
LAA is required in order to interpret LAD and LPL. | ||
\item RBS(Integer): An integer describing the size of this genotype's reference block. | ||
The size is the difference between the last position (inclusive) of the reference block and POS. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section should probably mention how it interacts with the checkpointing scheme.
@@ -503,6 +517,18 @@ \subsubsection{Genotype fields} | |||
All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set. | |||
If the genotype in the GT field is unphased, the corresponding PS field is ignored. | |||
The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required). | |||
\item Local Alleles (*): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a footnote or a totally separate section? It's kind of weird to have it inline with a list of fields. Maybe there should be a small section separated out for describing local alleles and checkpointing in one place?
@mlin I have no objection to splitting this, or discussing both here. There's nothing strongly connecting them other than history. I believe you need to implement both in order to see linear file growth though, neither one is sufficient on its own. There hasn't been a lot of engagement with either yet though, so maybe discussing both together won't be too complicated. |
@lbergelson Naturally, I suggested splitting them because I have in mind a substantive critique of one and not the other 😅 Here it is, but if it triggers a discussion then I think the two topics can/should be forked in order to avoid suppressing potential comments on the other one. tl;dr the RBS attribute is "overfit" to GVCF merging as the approach for producing the joint VCF. The algorithm generating the joint VCF is generally progressing steadily along the reference genome. Providing the RBS attribute necessitates it have information about the "future" in so doing -- i.e. how many (tens, hundreds) of bases ahead the reference depth remains within some band. GVCF of course serves up exactly that information precomputed, but it is not so readily available for any other approach. It's computable of course, but potentially burdensome. An alternative might be to omit RBS and say one should write |
I agree with @mlin about the split. I can see local alleles working perfectly fine as they are described right now, but would require some more thinking about the reference blocks. |
@mlin I agree that it's tightly fit to gvcf merging, but it's not clear to me that that's a problem. It's essentially a proposal for embedding the entire gvcf in a format column without any repetition. This is a major use case which we do not currently have a good solution for. Do you have pipelines which generate something which is not gvcf-like but which does include reference confidence information? I'm not sure I really understand the objection about "future" information. You can always break a band arbitrarily at any location if you want to limit the window you have to hold in memory. It does introduce a pain point about reading random access files by forcing you to look back to a checkpoint to safely identify reference blocks. Is that what you mean? I don't understand how adding writing a missing row without an block solves the problem of how do I efficiently encode reference confidence data for many samples. How do you distinguish the case of "this is the same confidence as somewhere above" vs "this is no data" ? It seems like we would still have the problem of non-local information because of having to do a reverse lookup in order to find the actual data site. |
This PR defines teresm required to enable writing less data into large callsets.
The LAA, LPL, and LAD Format tags enable the users to write PL and AD values only for the alleles that are implicated for the sample in question, while the RBS and the REFERENCE_BLOCK meta line enable the reinterpretation of the missing genotype as part of an upstream reference block.
By using these tags and conventions users can drastically reduce the size of VCFs generated from callsets with large number of samples (>20000)