-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add reference blocksize and checkpointing to VCF #435
Conversation
@yfarjoun thanks for making the split! In those infrequent cases where a reference block ends and no downstream reference block (or variant) abuts it, leave an explicit indication of "no data" on the next line, rather than having RBS subtly imply it. The current proposal might write,
indicating Alice is confidently 0/0 at positions 200 and 300, with no information at the remaining sites. I'd have us write instead,
Reading this is simpler and less error-prone: the interpretation of an empty cell is to look up to the last filled-in cell, with no position check needed. Usually, another reference block (or variant site) picks up where the last one left off, so these "no data" markers would be rare and wouldn't materially inflate the representation. Actually I think it'd be smaller without the RBS values, but we needn't forbid them, of course. Here are two non-GVCF-merging use cases I think we shouldn't disenfranchise:
RBS is less convenient for the first application because of the need to see well downtream of the current position being processed. The same is true in the second application and in that one, the RBS information may not even be completely there in the input. I believe the suggestion above simplifies both of these cases as well as the reader/decoder, without really hurting the GVCF merging use case. In the case of GVCF merging, the RBS information could be useful for incremental/hierarchical merging, so I'm not saying it should be banished, merely not required in all spec-compliant implementations. As a nitpick, GVCF provides an END position for each reference block, so why switch to block size at this late stage? |
First, regarding the nit: END is an info field, not format, so this isn't a "change". In particular, RBS will be much smaller numbers and will thus compress much better than END unless we define a special delta encoding for it so instead I thought it would be better (space wise) to use delta directly. Regarding your proposal, I don't understand how to interpret the '.' in position 600, can you elaborate? how do I know if it's a missing, or a ref block? |
This PR looks like it is adding an interesting feature. However, nowhere is the term "reference block" clearly defined. RBS is defined, but only in terms of a reference block. I'll speculate and try to put the puzzle this out from the text...
This raises a few questions since I'm not grasping the intent here:
Apologies again for being dense, but what is the difference between "missing" and "truly missing"? |
@yfarjoun position 600 adopts the contents of the first nonempty cell found above it, which is the This is just how spVCF uses the double-quote marker (@kevinjacobs-progenity that link may help provide historical context, albeit slightly different as we're now reconciling parallel efforts). I like |
@mlin: Thanks for the background reading. I see the intent now. I also understand the aesthetic appeal of using '.' as a minimal indicator of sparse content that doesn't require adding any new encoding methods. I still think it is a bad choice because '.' already means Here is what your example above would look like based on my suggestion:
I don't have a strong opinion on whether the explicit RBS tag is sufficiently valuable to include. It should be easy enough to implement a coherent random-access scheme either way. |
The compatibility opportunity of @kevinjacobs-progenity idea is difficult to dispute!
The compressed overhead would be slightly more than 'negligible' I think, because of relatively increased entropy across the row (flipping between ./. and 0/0 GT). However I do think the low-hanging fruit would remain picked. |
I'm not convinced.
EDIT: Having thought about it more, I think that using 0/0 has enough upsides that it should be used. thanks @kevinjacobs-progenity for the suggestion and patience. EDIT of the EDIT: the 0/0 suggestion causes problems in the implementation since it requires a GT FORMAT field, but it isn't clear if we will end up having a GT or a LGT (Local GenoType) and thus this will create a mess in terms of dependencies between records. |
|
|
|
|
@lbergelson pressed me on a good corner case of a haploid, GT-only VCF: then it's unclear how to write the end-of-reference-block marker. Maybe it needs a dedicated FORMAT field? |
To explicate the last idea that arose from the discussion with @lbergelson last month, The 'reference block no longer applies here' marker could be a FORMAT flag, perhaps "EOR" for "end of run":
The interpretation of a And let me reiterate I think it's a very good idea for GVCF pipelines to include the RBS or equivalent information; but with the above, I suggest it shouldn't otherwise be required to write or read it. The lively discussion on #437 should highlight a bit of awkwardness in standardizing a form of combined GVCF, when the basic GVCF way of supplying reference identity isn't really standardized itself. |
note: filtering the EOR will make the reference "block" longer. |
As a general principle, as a specifications we should err on the side of data loss, rather than data corruption. In this particular instance, EOR will result in data corruption when the VCF is filtered thus I strongly prefer the representation that encodes the start and end in the same record. |
To illustrate how RBS overfits GVCF pipelines, I'd like to focus the use case of making a sparse version of an existing project VCF file, such as this one: Given this input, we do not know the GVCF reference blocks and it will be awkward to write RBS without implications that weren't present originally. That's because the project VCF does not, at face value, supply the reference coverage "in between" its sites. What we do have to work with, are repetitive runs of reference-homozygous calls down the column for each sample. I've approached our current juncture from the perspective of succinctly encoding these runs as they appear in the text of an existing multi-sample VCF file. I believe RBS approaches from a different perspective, merging a multitude of GVCF files into a new kind of "combined" GVCF. My proposed encoding is quite awkward when viewed from the latter perspective, and the reverse is also true. (Note: the degree of repetition in the public file with "only" N=2.5K samples is much less than fonud in the N>100K datasets many of us work behind the scenes. Nonetheless we can find many examples of entries exactly equal to the one above, including all FORMAT values. Furthermore we can ask what FORMAT value fluctuations are worth keeping at huge scale.) The distinction is further illustrated by the action item stated above to "Add paragraph describing reference block," needed because that concept does not exist in the VCF spec (though the defined fields accommodate it). In this PR we'd essentially standardize (the main concept of) GVCF and a new kind of combined GVCF — with the curiosity that one de facto uses END and the other uses RBS. I think my suggested perspective carries less baggage, as it's just based on repetition found plainly in the existing multi-sample VCF format, such as the above-linked file. But I'm prepared to agree to disagree on which perspective the group should move ahead with, so long as whomever calls the play has thoughtfully considered both. And if we're to proceed with standardizing GVCF concepts, then I'd think that would involve a wider interest group. @d-cameron It's true that filtering the run-encoded form is trickier. It's not surprising that deleting stuff from a run-encoded format, without trying to fix up the encoding, leads to corruption. This would be a fine reason to strongly prefer a length encoding all other things being equal — but I hope the above clarifies that there are other key differences right now, especially the concept of "reference block." In fact, from my perspective we could look at ways to use a length encoding for "next L lines in this file" as opposed to "next D bp of reference genome." |
@@ -410,8 +421,8 @@ \subsubsection{Genotype fields} | |||
PL & G & Integer & Phred-scaled genotype likelihoods rounded to the closest integer \\ | |||
PQ & 1 & Integer & Phasing quality \\ | |||
PS & 1 & Integer & Phase set \\ | |||
RBS & 1 & Integer & Reference Block Size\\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RBS & 1 & Integer & Reference Block Size\\ | |
RBS & 1 & Integer & Reference block size\\ |
@@ -253,6 +253,17 @@ \subsubsection{Pedigree field format} | |||
##pedigreeDB=URL | |||
\end{verbatim} | |||
|
|||
|
|||
\subsubsection{Reference block checkpoint} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be placed after the final sentence (“See [PedigreeInDetail] for details”) of the preceding section.
Given the ability to interpret missing genotypes as either truly missing or as part of a reference block (see RBS in Genotype tags below), it can be useful to limit the genomic distance required to scan in order to find the "top" of the reference block. | ||
To enable this, one may specify a Reference Block Checkpoint scheme: | ||
\begin{verbatim} | ||
##REFERENCE_BLOCK=<CHECKPOINT=1000> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is invalid as a structured meta-information line, because it does not have an ID=…
subfield.
Are there plans to expand this in future with additional subfields within the <…>
? If not, I'd suggest reworking this as a plain unstructured meta-information line, e.g.,
##REFERENCE_BLOCK_CHECKPOINT=1000
Notes for Future of VCF review meeting:
|
Given the ability to interpret missing genotypes as either truly missing or as part of a reference block (see RBS in Genotype tags below), it can be useful to limit the genomic distance required to scan in order to find the "top" of the reference block. | ||
To enable this, one may specify a Reference Block Checkpoint scheme: | ||
\begin{verbatim} | ||
##REFERENCE_BLOCK=<CHECKPOINT=1000> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
##REFERENCE_BLOCK=<CHECKPOINT=1000> | |
##REFERENCE_BLOCK_CHECKPOINT=1000 |
just reifying this as a one-click GitHub suggestion.
|
||
For example (with CHROM, ID, REF, ALT, QUAL, FILTER, INFO fields/columns removed for brevity \& clarity): | ||
|
||
\#\#REFERENCE\_BLOCK=\textless CHECKPOINT=1000\textgreater\\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\#\#REFERENCE\_BLOCK=\textless CHECKPOINT=1000\textgreater\\ | |
\#\#REFERENCE\_BLOCK\_CHECKPOINT=1000\\ |
See above comment by jmarshall
To disambiguate a `.' between being truly missing and part of a reference block, one would therefore need to "look up" and find the previous RBS FORMAT value in that sample. | ||
In addition, any non-missing value (including `.:.' or `./.') would effectively break a reference block, and should be treated as a violation of the specification if RBS is specified, or an implicit end of the block if RBS is unknown. | ||
When reading the file from top to bottom, an implementation can simply remember what the RBS is for each sample, however when using the index to ``seek" to a particular point of the reference, one may need to seek to an unknown location in the file. | ||
To assist in seeking, the \verb!##REFERENCE_BLOCK! header line may define the \verb!CHECKPOINT! multiple at which a reference block will be included for all samples. In the presence of a checkpoint value, an implementation can read back from the last checkpoint and on and be assured that it will find a reference block that overlaps the current position, if it exists. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To assist in seeking, the \verb!##REFERENCE_BLOCK! header line may define the \verb!CHECKPOINT! multiple at which a reference block will be included for all samples. In the presence of a checkpoint value, an implementation can read back from the last checkpoint and on and be assured that it will find a reference block that overlaps the current position, if it exists. | |
To assist in seeking, the \verb!##REFERENCE_BLOCK_CHECKPOINT! header line defines a multiple at which a reference block will be included for all samples. In the presence of a checkpoint value, an implementation can read back from the last checkpoint and on and be assured that it will find a reference block that overlaps the current position, if it exists. |
This is the reference-block part of #420