-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VCF 4.5 Future of VCF scaling improvements #758
Conversation
Changed PDFs as of 4ebc5aa: VCFv4.5.draft (diff). |
Options:
@danking thoughts? (1) is the most backwards compatible, (2) has an elegant simplicity to it, (3) makes local alleles explicit and allows custom fields to be defined. I personally favour (2) as non-local to local is a lossless transition but the consensus for LAD et al seem to be that having explicitly different fields was better because causeing errors/crashes in non-local-aware VCF consumers was preferable to the possibility of (probably silent) data misinterpretation. (3) has the advantage of having a generalising principle for all R A G fields that we could encode in the specs: if you have a R/A/G field, the local allele equivalent has an |
Co-authored-by: John Marshall <[email protected]>
Changed PDFs as of 099fe19: VCFv4.5.draft (diff). |
Following on from discussions on the call yesterday, here is the Hail description of SVCR: https://hail.is/docs/0.2/vds/index.html#the-scalable-variant-call-representation-svcr Specifically they have this example
They're using LA instead of LAA as Locale Allele vs Local Alternate Allele. The key difference being it includes REF in the list, and LGT 0/1 is 0th element of LA and 1st element of LA. I believe this is in keeping with the changes you made here vs #434, which I believe most are in agreement with (given the inherent impossibilities of storing zero-length arrays in BCF). However, as discussed, we have tools already implementing LAA using #434 semantics, and so we need a clear separation so people that get LAA in a file know what the data means. Adopting Hail's "LA" name seems like the ideal approach. (I know you're already in agreement Daniel, but this is here for others who weren't on the call and as a formal place to note this proposed edit.) Pinging @pd3 for comment as you're one of the existing implementers and can far better understand the nuances than I. |
Yes, given that it's now 0-based and we need to explicitly include REF, |
VCFv4.5.draft.tex
Outdated
\subsubsection{Multi-sample REF-only blocks} | ||
When handling VCFs with multiple samples, the length of the $<$*$>$ reference blocks can differ. | ||
To account for this, a sample-specific END can be specified via the FORMAT END field. | ||
If any FORMAT END value exists, the INFO END must be present and equal the largest FORMAT END value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose this ensures that a single sample VCF using ref blocks is a valid GVCF?
I'm having a bit of trouble imagining any other reason to want to know the max of the ENDs (though the min does seem perhaps useful: that's the next place some ref block will start).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reviewing everything a second time, may I propose this instead?
If the VCF contains exactly one sample column, the INFO END field must be present and equal to the FORMAT END value at every locus. Otherwise (zero or two or more samples), the INFO END must be missing.
The INFO END, as a single value, will always be misinterpreted by tools written for GVCF. They will either treat some reference genotypes as missing (if INFO END is the min over all samples) or treat some missing or variant genotypes as reference genotypes (if INFO END is the max over all samples). Given this, I prefer that INFO END is required to be missing unless there is exactly one sample.
VCFv4.5.draft.tex
Outdated
\begin{flushleft} | ||
\begin{tabular}{ l l l l l l l l } | ||
POS & REF & ALT & INFO & FORMAT & SampleA & SampleB \\ | ||
4370 & G & $<$*$>$ & END=4416 & LGT:LAA:END & 0/0:0,1:4388 & 0/0:0,1:4416 \\ | ||
4389 & T & TC & . & LGT:LAA:END & 0/1:0,1:. & . \\ | ||
4390 & C & $<$*$>$ & END=4416 & LGT:LAA:END & 0/0:0,1:4416 & . \\ | ||
\end{tabular} | ||
\end{flushleft} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should include all the rows from above to make absolutely clear that this just adds
another sample.
\begin{flushleft}
\begin{tabular}{ l l l l l l l l }
POS & REF & ALT & INFO & FORMAT & SampleA & SampleB \\
4370 & G & $<$*$>$ & END=4416 & LGT:LAA:END & 0/0:0,1:4383 & 0/0:0,1:4416 \\
4384 & C & $<$*$>$ & END=4388 & LGT:LAA:END & 0/0:0,1:4388 & . \\
4389 & T & TC & . & LGT:LAA:END & 0/1:0,1:. & . \\
4390 & C & $<$*$>$ & END=4390 & LGT:LAA:END & 0/0:0,1:4390 & . \\
4391 & C & $<$*$>$ & END=4395 & LGT:LAA:END & 0/0:0,1:4395 & . \\
4396 & C & $<$*$>$ & . & LGT:LAA:END & 0/0:0,1:. & . \\
4397 & C & $<$*$>$ & END=4416 & LGT:LAA:END & 0/0:0,1:4416 & . \\
\end{tabular}
\end{flushleft}
Three more potentially more controversial changes:
-
Use a different END for SampleB to make completely clear that SampleB's END has nothing to do
SampleA's END. -
Do not use LGT or LAA. Two reasons: (a) make the multi-sample ref-block changes most prominent and (b) clarify that ref-blocking and local alleles are orthogonal compression techniques [1].
-
Preserve the star allele on line 4389 to make clear that removing it is not necessary.
\begin{flushleft}
\begin{tabular}{ l l l l l l l l }
POS & REF & ALT & INFO & FORMAT & SampleA & SampleB \\
4370 & G & $<$*$>$ & END=4500 & GT:END & 0/0:4383 & 0/0:4500 \\
4384 & C & $<$*$>$ & END=4388 & GT:END & 0/0:4388 & . \\
4389 & T & TC,$<$*$>$ & . & GT:END & 0/1:. & . \\
4390 & C & $<$*$>$ & END=4390 & GT:END & 0/0:4390 & . \\
4391 & C & $<$*$>$ & END=4395 & GT:END & 0/0:4395 & . \\
4396 & C & $<$*$>$ & . & GT:END & 0/0:. & . \\
4397 & C & $<$*$>$ & END=4416 & GT:END & 0/0:4416 & . \\
\end{tabular}
\end{flushleft}
[1] It's true that O(samples) scaling is only achieved with both local alleles and ref blocking;
however, they're independent ideas and, I think, easier to first understand in isolation.
Numbers
(1) is fine but feels like a cop out for a standard. I oppose (2) for the reason you mentioned (I prefer errors/crashes to bad data). (3) feels somehow too clever, but I prefer it to (1) and (2). I prefer the verbose "LOCAL-A", etc. to (3), but I would grumpily implement (3) and then never think about it again. LAA vs LA
Agreed. LEN vs ENDI only have anecdotes from Tim Poterba, not hard data, but: we saw ~10% reduction in size of the Hail VDS when we switched from END to LEN. I assume all the benefit comes from feeding Zstd a lower entropy bytestream. I'd leave LEN vs END as a possible future enhancement. A VCF with local alleles and ref blocks scales linearly in samples with or without LEN. Etc.No objections to your other choices. Checkpointing or an END-aware index is key for analysis but not for interchange. I view VCF as primarily an interchange format. cc'ing @chrisvittal and @patrick-schultz as I'm no longer Broad/Hail affiliated. FWIW, I think the SVCR paper is a little more detailed than the Hail docs currently. https://www.biorxiv.org/content/10.1101/2024.01.09.574205v1.full.pdf |
The reason for defining INFO END as I did was to avoid breaking VCF indexing. VCF indexing already uses INFO END to define the end of SV and gVCF records so reusing this means that checkpointing is unnecessary as the information needed for determining maximal record overlap is already contained within the index.
If we're erring on the side of crashing instead of data corruption then deprecating INFO END for gVCF would be the way to go. Unfortunately, taking that approach will silently break VCF indexing (since it relies on INFO END). There's no good option here as keeping INFO END results in silent gVCF misinterpretation and dropping it results in silent record loss on indexed lookups.
How much value is there in being able to define caller-specific local allele fields? (3) generalises local alleles not only for scaling, but it allows lossless merging. What do implementation do with the
That seems significant enough to prefer LEN. We've already made SVLEN preferable to END for v4.4 SVs so doing it for symbolic star alleles for 4.5 is consistent with that. On a related note, what is everyone's thoughts about recommending tools writing VCFv4.5 use local allele notation for all VCFs? Useful? Unnecessary? |
Mmm. Good points. Defining INFO END as the max seems fine then.
I don't have experience with variant callers so I can't comment on the first question. Maybe I'm misreading but I think any of the three options for encoding local allele indexed fields facilitates lossless merging. I agree that (global allele indexed) GL, PL, and AD are hard to sensibly define when merging two VCFs. In particular, I'm unclear on when DP does not equal sum(AD) and the implications thereof when merging two VCF rows with global allele indexed fields.
Would we change the GVCF section as well? So we'd have INFO LEN, FORMAT LEN, and (no longer preferred? deprecated?) INFO END?
Soft against. It's more complex and not necessary for plenty of useful cases. For example, seems silly to local allele index fields in a VCF that has been fully split into biallelic variants. |
I don't have enough experience either, particularly with multi-sample calling and gVCF, but just as some background I do understand why AD and DP can be very different as I recently fixed over counting of AD in bcftools. Imagine this scenario:
The depth at that point is 8 sequences. It's an AT STR, with 6 of them spanning it so the copy number has been confirmed. Of those we have 4 with 4 copies and 2 with 3 copies, so AD 4,2 and DP 8. The missing two have been aligned against the reference and exhibit "reference bias" as they naively confirm REF but there is no evidence of that confirmation being true and they shouldn't be added to the REF AD score. (There is an option however to distribute the uncontributed alignments to the observered REF/ALT in proportion to their observed frequencies, in order to boost AD, but it's not enabled by default as it can bring other issues and potentially over confidence in results.) How you marry this together when merging I've no idea. Sorry! :) Edit: you could make an argument for DP=6, but that's misleading as we do have 8 alignments spanning this variant and if we were attempting to do depth analysis it's wrong to down-play DP this way as that may lead to other incorrect copy-number assessments. |
A VCF with custom R/A/G FORMAT fields cannot be losslessly converted to local allele notation unless there is the ability to translate arbitrary R/A/G fields to an equivalent r/a/g version. (3) achieves this by generalisating the local allele convention to all fields - even those not reserved by the specs. |
FORMAT LEN and retain INFO END as a purely computed field to prevent breaking VCF indexing. We already have INFO SVLEN causing recalculation of INFO END so this simplifies the spec as INFO END becomes an entirely computed field that a validator can verify than INFO END=POS + max(FORMAT LEN & INFO SVLEN). Similar to the SV changes in v4.4, just having INFO END would be deprecated but a 4.5 compliant implementation should infer the value of a single missing FORMAT LEN from INFO END. |
Hi, sorry I am a bit late to this thread, or maybe I was on it too early, depending on how you look at it The ideas proposed in the original pull request (#434) in July 2019 were implemented in bcftools in May 2020 (samtools/bcftools@e645749). Currently there are big files out there that follow the original proposal, see for example UK Biobank which generated VCFs with 490,541 samples using DRAGEN. (Note that DRAGEN implementation has some bugs, please contact me directly if you are interested to learn more.) I have several comments related to the current proposal:
Missing genotype can be encoded simply as
What is the benefit of introducing LA other than using more space and adding uninformative "0"?? The localized alleles are meant only to narrow the context of the alleles so that Number=G,R,A tags can be expressed in a more parsimonious way. They were not intended to make statements about sequencing or biological observations made for a sample. |
Hi @pd3 ! I try to explain the decisions below but see also my final paragraph. I agree with all your comments on 1 and, in fact, this was our perspective when we first built tools using local alleles and ref blocks. The issue is not of complexity or speed; it’s of practical user experience. IIRC (we should just ask them) gnomAD team had a mistake arise from neglecting to update a GT field when an allele was filtered. They reasonably asked: why is GT special? There’s less cognitive overhead for them as method developers if all the fields use local indices. AFAIU, this change just allows users who prefer to work with LGT to do so. I suppose the practical concern is that people will start generating LGT containing VCFs so tools will need to support that? IIRC, on your second issue, folks wanted to distinguish “only the reference allele is local” from “LAA is missing”. I suggested LA as a way to appease this concern. — All that said, the only thing I seek (and I think Hail team, with which I’m no longer affiliated, also seeks) is an interchange format that scales linearly in samples. Sparse columns (ref blocks) and sparse allele-indexed arrays (local alleles) achieve this. Whatever details make it into the standard we’ll just deal with on import/export. |
Minor nit:
Hail always used LA, so it’s not really this PR that created the problem. |
Agreed. We were chatting about this a few days ago. Basically there are already two competing implementations: LA and LAA. Both have existed for a considerable length of time. If I had to weigh up the usage I'd say it's Hail and it's users (Broad and AllOfUs being the biggest maybe) vs Illumina DRAGEN and Bcftools with BioBank being the biggest user there. There are many more users of each ecosystem too, including probably very large ones I'm not aware of. So making a decision purely on usage doesn't really help, and it needs to be evaluated on complexity and fitting to the data modelling. My initial thought was that we needed LA because we can't have an empty (non I do have confusion over LGT vs GT though. If we're using LAA instead of LA, then LGT doesn't make sense because a GT |
VCFv4.5.draft.tex
Outdated
LAA & . & Integer & Strictly increasing indices into REF and ALT, indicating which alternate alleles are relevant (local) for the current sample \\ | ||
LAD & . & Integer & Read depth for each of the local alternate alleles listed in LAA \\ | ||
LGT & . & String & Genotype against the local alleles \\ | ||
LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the local alternative alleles listed in LAA \\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these all number .
?
We had explicit dimensions before: A, R, G, 1 (GT), etc. We now have local equivalents for these, but it feels like there should be some better language to explain that rather than just doing a shrug.
Particularly LGT which surely is the same dimension as GT: Type=String, Number=1. (Although personally I think it was a mistake to shoehorn in multiple data types to the one GT field with an entirely new set of language syntax for one field only)
If we don't want to define them as terms in the grammar, then at least explicitly specify the dimensions within the text. LPL gives no clue at all as to how many values there are. I understand what's going on, but only because I already understand what's happening...
VCFv4.5.draft.tex
Outdated
LAA is the strictly increasing index into REF and ALT, pointing out the alleles that are actually in-play for that sample. | ||
0 indicates the REF allele and should always be included with the subsequent values being 1-based indexes into ALT. | ||
LAD is the depth of the local alleles, | ||
LPL is subset of the PL array that pertains to the alleles that are referred to by LAA, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly here. Eg in a classic world we may have:
GT:PL:AD 0/1:187,0,28:2,7
In local world, I'm struggling and the above explanations seem to be stating something wrong. With LAA 0
(one ALT), LPL would refer to both REF and ALT#0, not ALT#0 only. Hence "pertains to the alleles that are referred to by LAA" should be "pertains to the alleles that are referred to by LAA and REF".
I assume similarly true for LAD which claims to be "depth of the local alleles". Isn't it "depth of REF and the local alleles"?
Basically, the only way I can see this working is if all the local versions are REF + whatever is listed in LA and we drop LAA. So I would argue that perhaps it is the right thing to switch to LA instead as then all of the above statements start becoming true once more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The text of this proposal is incorrect. It says LAA but describes LA instead.
LAA is a list of increasing, 1-based indices into ALT alleles. The original pull request states this correctly #434.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see. Thanks. That explains my confusion over numbering then. I hadn't realised it had changed LAA this way. It's a bad idea. EIther we should keep LAA as it was or switch to LA, but using LA semantics in LAA will just break a whole bunch of existing software and make it extremely tricky to parse existing files.
The definition of LGT in the context of LAA is consistent. LAA lists locally relevant alternate alleles, their indexing is 1-based, the same as in GT. Heterozygous genotypes would be expressed as LAA=1 GT=0/1 LGT=0/1. Note that despite defending its logic here, I am strongly against codifying LGT for the reasons I explained above (#758 (comment)). |
@pd3, I’d like to make a final nudge on LGT, but after this I’ll be quiet about it. I agree with your points one and two: LGT is not relevant to VCF size or analysis speed. I also agree with point three: a VCF with only LGT is meaningless to tools only aware of GTs. My counter-point is about users—not engineering constraints. Mixing GT with LPL and LAD introduces cognitive overhead and, in our practical experience, leads to mistakes. I argue in favor of allowing LGT. If two scientific groups want to exchange data in this format we should let them and let them deal with the ensuing incompatibilities with legacy tools. Any group publishing ”final” data for an extremely wide audience (e.g. AoU, UKB) will likely choose to maximize compatibility with existing tools. Groups with other constraints can choose what’s best for them. |
I don't find cognitive overhead a good argument. Few users look manually inside VCFs with hundreds of thousands of samples. Backward compatibility is a real concern, what matters is that programs continue to work with such files. With PLs there is no other way, sites with many indel alleles have no choice but to use LPL, and for most analyses it does not matter if programs skip such sites when they don't find PL, they are relatively rare. But there is no reason to confuse programs with GT/LGT. Having said that, VCF format is open and nothing stops you from using arbitrary tags. But the use of LGT and LA should not be encouraged by including them in the specification, for reasons given above. |
I like and support this proposal. @a9ill Unrelated to this, but for the lack of a better channel, there are two bugs in DRAGEN's implementation of local alleles (found in UKB VCFs)
|
I’m glad we’ve arrived at what seems a key difference in perspective. The gnomAD team & AoU team don’t look at textual VCFs but they indeed directly interact with the data model. They aren’t the majority. |
Co-authored-by: Dan King <[email protected]>
Changed PDFs as of 9b472b4: VCFv4.5.draft (diff). |
Changed PDFs as of d769847: VCFv4.5.draft (diff). |
Changed PDFs as of abf2531: VCFv4.5.draft (diff). |
Changed PDFs as of 26a9326: VCFv4.5.draft (diff). |
Changed PDFs as of 3e73e10: VCFv4.5.draft (diff). |
@d-cameron a bit delayed but: thank you for your tenacious commitment to this change. This is critical for the future of the community. I appreciate everyone's efforts at finding consensus! |
Please review and provide feedback.
Logic behind going with FORMAT END:
RBS
PR and existing<*>
functionality<*>
via FORMAT END is a minimal change to the specs.END
used instead ofRBS
for naming convention consistency.LEN
if someone produces data showing a non-trival difference in file size##REFERENCE_BLOCK_CHECKPOINT
adds a new form of header parsing. We want to keep it simple##REFERENCE_BLOCK_CHECKPOINT
style spliting everyPOS % N
bases. The spec gets to stay agnostic on these details.