
Changes to VCFv4.3 to codify local-allele parlance for sparse VCF #420

Closed
wants to merge 1 commit

Conversation

yfarjoun
Contributor

This PR defines the terms required to enable writing less data into large callsets.

The LAA, LPL, and LAD FORMAT tags allow users to write PL and AD values only for the alleles that are relevant to the sample in question, while the RBS tag and the REFERENCE_BLOCK meta-information line enable a missing genotype to be reinterpreted as part of an upstream reference block.

By using these tags and conventions, users can drastically reduce the size of VCFs generated from callsets with a large number of samples (>20,000).
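To make the local-allele encoding concrete, here is a non-normative Python sketch (the function names are illustrative, not part of the proposal) of expanding an LPL array, written only over REF plus the alleles listed in LAA, back into a full PL array over all genotypes, using VCF's diploid genotype ordering index(j, k) = k(k+1)/2 + j:

```python
def genotype_index(j, k):
    """VCF ordering of the unphased diploid genotype with allele indices j <= k."""
    if j > k:
        j, k = k, j
    return k * (k + 1) // 2 + j

def expand_local_pl(laa, lpl, n_alt):
    """Expand an LPL array (over REF + the LAA alleles) into a full PL array.

    laa   -- 1-based ALT indices listed in the LAA tag, e.g. [2] means only ALT 2
    lpl   -- PL values over the local alleles, in VCF genotype order
    n_alt -- total number of ALT alleles at the site
    Genotypes not covered by the local alleles come back as None ('.').
    """
    local = [0] + list(laa)  # REF is implicitly part of the local context, index 0
    n_gt = (n_alt + 1) * (n_alt + 2) // 2
    full = [None] * n_gt
    for lk, gk in enumerate(local):
        for lj, gj in enumerate(local[: lk + 1]):
            full[genotype_index(gj, gk)] = lpl[genotype_index(lj, lk)]
    return full
```

For example, at a site with two ALT alleles, a sample with LAA=[2] and LPL=[10,0,20] expands to PLs for 0/0, 0/2, and 2/2, with the genotypes involving ALT 1 left missing.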

@yfarjoun yfarjoun requested review from cyenyxe and lbergelson June 14, 2019 17:07
@hts-specs-bot

Changed PDFs as of 4b60bc8: VCFv4.3 (diff).

@jkbonfield jkbonfield added the vcf label Jun 18, 2019
@mlin
Member

mlin commented Jun 21, 2019

Hi @yfarjoun, all,
I'd like to suggest this ought to be split into two PRs that can be discussed ~independently:

  1. Local Alleles: reformulation of the FORMAT fields for multi-allelic VCF sites to prevent quadratic blowup of PL (and less acutely, linear expansion of AD and other fields) in the number of alleles
  2. RBS and the sparse convention for reducing the excessive entropy spent on reference-identical entries
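The quadratic blowup in point 1 follows from VCF's genotype ordering: a site with A ALT alleles carries C(A + ploidy, ploidy) genotype likelihoods per sample, i.e. (A+1)(A+2)/2 for diploids. A quick illustrative sketch (not from the PR itself):

```python
from math import comb

def pl_length(n_alt, ploidy=2):
    # Number of PL entries at a site: C(n_alt + ploidy, ploidy),
    # which is (A+1)(A+2)/2 for diploid samples.
    return comb(n_alt + ploidy, ploidy)

for n_alt in (1, 10, 100):
    print(n_alt, pl_length(n_alt))  # 3, 66, and 5151 entries respectively
```

With local alleles, a diploid sample only needs PLs over REF plus its (at most two) called ALT alleles, so LPL stays at six entries or fewer regardless of A.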

WDYT, did I miss some close entanglement of these two ideas?

It is implicit that REF is part of any ``local'' context, and it always has index 0, even if the genotype is a compound HET.
LAA is required in order to interpret LAD and LPL.
\item RBS(Integer): An integer describing the size of this genotype's reference block.
The size is the difference between the last position (inclusive) of the reference block and POS.
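Under this definition the arithmetic reduces to end = POS + RBS; a minimal sketch (the function name is illustrative):

```python
def ref_block_end(pos, rbs):
    """Last position (inclusive) covered by a reference block, given the
    record's POS and its RBS value (RBS = end - POS per the draft text)."""
    return pos + rbs
```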
Member


This section should probably mention how it interacts with the checkpointing scheme.

@@ -503,6 +517,18 @@ \subsubsection{Genotype fields}
All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set.
If the genotype in the GT field is unphased, the corresponding PS field is ignored.
The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required).
\item Local Alleles (*):
Member


Should this be a footnote or a totally separate section? It's kind of weird to have it inline with a list of fields. Maybe there should be a small section separated out for describing local alleles and checkpointing in one place?

@lbergelson
Member

@mlin I have no objection to splitting this, or to discussing both here. There's nothing strongly connecting them other than history. I believe you need to implement both in order to see linear file growth, though; neither one is sufficient on its own.

There hasn't been a lot of engagement with either yet though, so maybe discussing both together won't be too complicated.

@mlin
Member

mlin commented Jul 12, 2019

@lbergelson Naturally, I suggested splitting them because I have in mind a substantive critique of one and not the other 😅 Here it is, but if it triggers a discussion then I think the two topics can/should be forked in order to avoid suppressing potential comments on the other one.

tl;dr the RBS attribute is "overfit" to GVCF merging as the approach for producing the joint VCF.

The algorithm generating the joint VCF generally progresses steadily along the reference genome. Providing the RBS attribute requires it to have information about the "future" while doing so -- i.e. how many (tens or hundreds of) bases ahead the reference depth remains within some band. GVCF of course serves up exactly that information precomputed, but it is not so readily available for any other approach. It's computable, of course, but potentially burdensome.

An alternative might be to omit RBS and say one should write GT:DP=./.:0 or GT:DP=./.:. in the first row where there's a coverage gap or other lack of information. This would make emitting the format more straightforward, by requiring information only from the past/present rather than the "future."
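As an illustration of this alternative (a sketch, not normative syntax), the first record of a coverage gap would carry an explicit missing genotype with zero depth; reference-identical rows before the gap need no forward-looking block size:

```
#CHROM  POS    ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  SAMPLE1
chr1    10100  .   A    .    .     .       .     GT:DP   ./.:0
```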

@cyenyxe
Member

cyenyxe commented Jul 24, 2019

I agree with @mlin about the split. I can see local alleles working perfectly fine as they are described right now, but would require some more thinking about the reference blocks.

@lbergelson
Member

@mlin I agree that it's tightly fit to gvcf merging, but it's not clear to me that that's a problem. It's essentially a proposal for embedding the entire gvcf in a FORMAT column without any repetition. This is a major use case for which we do not currently have a good solution. Do you have pipelines which generate something that is not gvcf-like but does include reference confidence information?

I'm not sure I really understand the objection about "future" information. You can always break a band arbitrarily at any location if you want to limit the window you have to hold in memory. It does introduce a pain point about reading random access files by forcing you to look back to a checkpoint to safely identify reference blocks. Is that what you mean?

I don't understand how writing a missing row without a block solves the problem of efficiently encoding reference confidence data for many samples. How do you distinguish the case of "this is the same confidence as somewhere above" vs. "this is no data"? It seems like we would still have the problem of non-local information, because of having to do a reverse lookup in order to find the actual data site.

@yfarjoun
Contributor Author

closing in favor of #435 and #434

@yfarjoun yfarjoun closed this Jul 24, 2019