
Changes to VCFv4.3 to codify local-allele parlance for sparse VCF #420

Closed
wants to merge 1 commit

Conversation

yfarjoun
Contributor

This PR defines the terms required to enable writing less data into large callsets.

The LAA, LPL, and LAD FORMAT tags allow users to write PL and AD values only for the alleles that are relevant to the sample in question, while the RBS tag and the REFERENCE_BLOCK meta-information line enable a missing genotype to be reinterpreted as part of an upstream reference block.

By using these tags and conventions, users can drastically reduce the size of VCFs generated from callsets with a large number of samples (>20,000).
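To make the local-allele encoding concrete, here is a non-normative Python sketch (the function names are illustrative, not part of the proposal) of expanding an LPL array, written only over REF plus the alleles listed in LAA, back into a full PL array over all genotypes, using VCF's diploid genotype ordering index(j, k) = k(k+1)/2 + j:

```python
def genotype_index(j, k):
    """VCF ordering of the unphased diploid genotype with allele indices j <= k."""
    if j > k:
        j, k = k, j
    return k * (k + 1) // 2 + j

def expand_local_pl(laa, lpl, n_alt):
    """Expand an LPL array (over REF + the LAA alleles) into a full PL array.

    laa   -- 1-based ALT indices listed in the LAA tag, e.g. [2] means only ALT 2
    lpl   -- PL values over the local alleles, in VCF genotype order
    n_alt -- total number of ALT alleles at the site
    Genotypes not covered by the local alleles come back as None ('.').
    """
    local = [0] + list(laa)  # REF is implicitly part of the local context, index 0
    n_gt = (n_alt + 1) * (n_alt + 2) // 2
    full = [None] * n_gt
    for lk, gk in enumerate(local):
        for lj, gj in enumerate(local[: lk + 1]):
            full[genotype_index(gj, gk)] = lpl[genotype_index(lj, lk)]
    return full
```

For example, at a site with two ALT alleles, a sample with LAA=[2] and LPL=[10,0,20] expands to PLs for 0/0, 0/2, and 2/2, with the genotypes involving ALT 1 left missing.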

@yfarjoun yfarjoun requested review from cyenyxe and lbergelson June 14, 2019 17:07
@hts-specs-bot

Changed PDFs as of 4b60bc8: VCFv4.3 (diff).

@jkbonfield jkbonfield added the vcf label Jun 18, 2019
@mlin
Member

mlin commented Jun 21, 2019

Hi @yfarjoun, all,
I'd like to suggest this ought to be split into two PRs that can be discussed ~independently:

  1. Local Alleles: reformulation of the FORMAT fields for multi-allelic VCF sites to prevent quadratic blowup of PL (and less acutely, linear expansion of AD and other fields) in the number of alleles
  2. RBS and the sparse convention for reducing the excessive entropy spent on reference-identical entries
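The quadratic blowup in point 1 follows from VCF's genotype ordering: a site with A ALT alleles carries C(A + ploidy, ploidy) genotype likelihoods per sample, i.e. (A+1)(A+2)/2 for diploids. A quick illustrative sketch (not from the PR itself):

```python
from math import comb

def pl_length(n_alt, ploidy=2):
    # Number of PL entries at a site: C(n_alt + ploidy, ploidy),
    # which is (A+1)(A+2)/2 for diploid samples.
    return comb(n_alt + ploidy, ploidy)

for n_alt in (1, 10, 100):
    print(n_alt, pl_length(n_alt))  # 3, 66, and 5151 entries respectively
```

With local alleles, a diploid sample only needs PLs over REF plus its (at most two) called ALT alleles, so LPL stays at six entries or fewer regardless of A.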

WDYT, did I miss some close entanglement of these two ideas?

It is implicit that REF is part of any ``local'' context, and it always has index 0, even if the genotype is a compound HET.
LAA is required in order to interpret LAD and LPL.
\item RBS(Integer): An integer describing the size of this genotype's reference block.
The size is the difference between the last position (inclusive) of the reference block and POS.
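Under this definition the arithmetic reduces to end = POS + RBS; a minimal sketch (the function name is illustrative):

```python
def ref_block_end(pos, rbs):
    """Last position (inclusive) covered by a reference block, given the
    record's POS and its RBS value (RBS = end - POS per the draft text)."""
    return pos + rbs
```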
Member


This section should probably mention how it interacts with the checkpointing scheme.

@@ -503,6 +517,18 @@ \subsubsection{Genotype fields}
All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set.
If the genotype in the GT field is unphased, the corresponding PS field is ignored.
The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required).
\item Local Alleles (*):
Member


Should this be a footnote or a totally separate section? It's kind of weird to have it inline with a list of fields. Maybe there should be a small section separated out for describing local alleles and checkpointing in one place?

@lbergelson
Member

@mlin I have no objection to splitting this, or to discussing both here. There's nothing strongly connecting them other than history. I believe you need to implement both in order to see linear file growth, though; neither one is sufficient on its own.

There hasn't been a lot of engagement with either yet though, so maybe discussing both together won't be too complicated.

@mlin
Member

mlin commented Jul 12, 2019

@lbergelson Naturally, I suggested splitting them because I have in mind a substantive critique of one and not the other 😅 Here it is, but if it triggers a discussion then I think the two topics can/should be forked in order to avoid suppressing potential comments on the other one.

tl;dr the RBS attribute is "overfit" to GVCF merging as the approach for producing the joint VCF.

The algorithm generating the joint VCF generally progresses steadily along the reference genome. Providing the RBS attribute requires it to have information about the "future" while doing so -- i.e. how many (tens or hundreds of) bases ahead the reference depth remains within some band. GVCF of course serves up exactly that information precomputed, but it is not so readily available for any other approach. It's computable, of course, but potentially burdensome.

An alternative might be to omit RBS and say one should write GT:DP=./.:0 or GT:DP=./.:. in the first row where there's a coverage gap or other lack of information. This would make emitting the format more straightforward, by requiring information only from the past/present rather than the "future."
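As an illustration of this alternative (a sketch, not normative syntax), the first record of a coverage gap would carry an explicit missing genotype with zero depth; reference-identical rows before the gap need no forward-looking block size:

```
#CHROM  POS    ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  SAMPLE1
chr1    10100  .   A    .    .     .       .     GT:DP   ./.:0
```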

@cyenyxe
Member

cyenyxe commented Jul 24, 2019

I agree with @mlin about the split. I can see local alleles working perfectly fine as they are described right now, but would require some more thinking about the reference blocks.

@lbergelson
Member

@mlin I agree that it's tightly fit to gvcf merging, but it's not clear to me that that's a problem. It's essentially a proposal for embedding the entire gvcf in a FORMAT column without any repetition. This is a major use case for which we do not currently have a good solution. Do you have pipelines which generate something that is not gvcf-like but does include reference confidence information?

I'm not sure I really understand the objection about "future" information. You can always break a band arbitrarily at any location if you want to limit the window you have to hold in memory. It does introduce a pain point about reading random access files by forcing you to look back to a checkpoint to safely identify reference blocks. Is that what you mean?

I don't understand how writing a missing row without a block solves the problem of efficiently encoding reference confidence data for many samples. How do you distinguish the case of "this is the same confidence as somewhere above" vs. "this is no data"? It seems like we would still have the problem of non-local information, because of having to do a reverse lookup in order to find the actual data site.

@yfarjoun
Contributor Author

closing in favor of #435 and #434

@yfarjoun yfarjoun closed this Jul 24, 2019