Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VCF 4.5 Future of VCF scaling improvements #758

Merged
merged 14 commits into from
Apr 20, 2024
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Binary file added VCFv4.5.draft.pdf
Binary file not shown.
65 changes: 63 additions & 2 deletions VCFv4.5.draft.tex
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@
\renewcommand{\thefootnote}{\fnsymbol{footnote}}

\begin{document}
\input{VCFv4.5.ver}
\input{VCFv4.5.draft.ver}
\title{\huge \color{red} DRAFT SPEC SUBJECT TO CHANGE \\ The Variant Call Format Specification \\ \vspace{0.5em} \large VCFv4.5 and BCFv2.2}
\date{\headdate}
\maketitle
Expand Down Expand Up @@ -441,6 +441,7 @@ \subsubsection{Genotype fields}
First a FORMAT field is given specifying the data types and order (colon-separated FORMAT keys matching the regular expression \texttt{\^{}[A-Za-z\_][0-9A-Za-z\_.]*\$}, duplicate keys are not allowed).
This is followed by one data block per sample, with the colon-separated data corresponding to the types specified in the format.
The first key must always be the genotype (GT) if it is present.
If LGT key is present, it must be after GT (if also present) and before all others.
There are no required keys.
Additional Genotype keys can be defined in the meta-information, however, software support for them is not guaranteed.

Expand Down Expand Up @@ -476,12 +477,17 @@ \subsubsection{Genotype fields}
ADR & R & Integer & Read depth for each allele on the reverse strand \\
DP & 1 & Integer & Read depth \\
EC & A & Integer & Expected alternate allele counts \\
END & 1 & Integer & End position on CHROM (used with multi-sample $<$*$>$ alleles) \\
FT & 1 & String & Filter indicating if this genotype was ``called'' \\
GL & G & Float & Genotype likelihoods \\
GP & G & Float & Genotype posterior probabilities \\
GQ & 1 & Integer & Conditional genotype quality \\
GT & 1 & String & Genotype \\
HQ & 2 & Integer & Haplotype quality \\
LAA & . & Integer & Strictly increasing indices into REF and ALT, indicating which alternate alleles are relevant (local) for the current sample \\
LAD & . & Integer & Read depth for each of the local alternate alleles listed in LAA \\
LGT & . & String & Genotype against the local alleles \\
LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the local alternative alleles listed in LAA \\
Copy link
Contributor

@jkbonfield jkbonfield Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these all number .?

We had explicit dimensions before: A, R, G, 1 (GT), etc. We now have local equivalents for these, but it feels like there should be some better language to explain that rather than just doing a shrug.

Particularly LGT which surely is the same dimension as GT: Type=String, Number=1. (Although personally I think it was a mistake to shoehorn in multiple data types to the one GT field with an entirely new set of language syntax for one field only)

If we don't want to define them as terms in the grammar, then at least explicitly specify the dimensions within the text. LPL gives no clue at all as to how many values there are. I understand what's going on, but only because I already understand what's happening...

MQ & 1 & Integer & RMS mapping quality \\
PL & G & Integer & Phred-scaled genotype likelihoods rounded to the closest integer \\
PP & G & Integer & Phred-scaled genotype posterior probabilities rounded to the closest integer \\
Expand All @@ -499,6 +505,7 @@ \subsubsection{Genotype fields}
\item DP (Integer): Read depth at this position for this sample.
\item EC (Integer): Comma separated list of expected alternate allele counts for each alternate allele in the same order as listed in the ALT field.
Typically used in association analyses.
\item END (Integer): end position of the $<$*$>$ reference block for this sample.
\item FT (String): Sample genotype filter indicating if this genotype was ``called'' (similar in concept to the FILTER field).
Again, use PASS to indicate that all filters have been passed, a semicolon-separated list of codes for filters that fail, or `.' to indicate that filters have not been applied.
These values should be described in the meta-information in the same way as FILTERs.
Expand Down Expand Up @@ -578,6 +585,38 @@ \subsubsection{Genotype fields}
\end{itemize}

\item HQ (Integer): Haplotype qualities, two comma separated phred qualities.
\item LAA is a sorted list of $n$ distinct integers, where $0 \le n \le \left|\mathrm{ALT}\right|$, giving the indices of the alleles that are observed in the sample.
In callsets with many samples, sites may grow to include numerous alternate alleles at the same POS.
Usually, few of these alleles are actually observed in any one sample, but each genotype must supply fields like PL and AD for all of the alleles---a very inefficient representation as PL's size is quadratic in the allele count.
Similarly, in rare sites, which can be the bulk of the sites, the vast majority of the samples are reference.
To prevent this growth in VCF size, one can choose to specify the genotype, allele depth and the genotype likelihood against a subset of ``Local Alleles''.
LAA is the strictly increasing index into REF and ALT, pointing out the alleles that are actually in-play for that sample.
0 indicates the REF allele and should always be included with the subsequence values being 1-based indexes into ALT.
d-cameron marked this conversation as resolved.
Show resolved Hide resolved
LAD is the depth of the local alleles,
LPL is subset of the PL array that pertains to the alleles that are referred to by LAA,
Copy link
Contributor

@jkbonfield jkbonfield Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here. Eg in a classic world we may have:

GT:PL:AD        0/1:187,0,28:2,7

In local world, I'm struggling and the above explanations seem to be stating something wrong. With LAA 0 (one ALT), LPL would refer to both REF and ALT#0, not ALT#0 only. Hence "pertains to the alleles that are referred to by LAA" should be "pertains to the alleles that are referred to by LAA and REF".

I assume similarly true for LAD which claims to be "depth of the local alleles". Isn't it "depth of REF and the local alleles"?

Basically, the only way I can see this working is if all the local versions are REF + whatever is listed in LA and we drop LAA. So I would argue that perhaps it is the right thing to switch to LA instead as then all of the above statements start becoming true once more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The text of this proposal is incorrect. It says LAA but describes LA instead.

LAA is a list of increasing, 1-based indices into ALT alleles. The original pull request states this correctly #434.

Copy link
Contributor

@jkbonfield jkbonfield Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. Thanks. That explains my confusion over numbering then. I hadn't realised it had changed LAA this way. It's a bad idea. EIther we should keep LAA as it was or switch to LA, but using LA semantics in LAA will just break a whole bunch of existing software and make it extremely tricky to parse existing files.

LGT is the genotype but referencing the local alleles rather than the global ones.
For example, if REF is G, ALT is A,C,T,\verb!<*>! and a genotype only has information about G, C, and \verb!<*>!, one can have LAA=[0,2,4] and thus LPL will be interpreted as pertaining to the alleles [G, C, \verb!<*>!] and not contain likelihood values for genotypes that involve A or T.
In this case LGT=0/1 means that the sample is G/C.
GQ is still the genotype quality, even when the genotype is given against the local alleles.
Note that reordering might be required and care need to be taken to reorder LAD and LPL appropriately.
LAA is required in order to interpret LAD, LPL, and LGT.
In the following example, the records with the same POS encode the same information (some columns removed for clarity):
\begin{tabular}[l]{llllll}
POS &REF& ALT&FORMAT&sample\\
1&G&A,C,T,\textless*\textgreater& LAA:LGT:LAD:LPL& 2,4:1/1:20,30,10:90,80,0,100,110,120\\
1&G&A,C,T,\textless*\textgreater& GT:AD:PL& 2/2:20,.,30,.,10:90,.,.,80,.,0,.,.,.,.,100,.,110,.,120\\
d-cameron marked this conversation as resolved.
Show resolved Hide resolved
2&A&C,G,T,\textless*\textgreater& LAA:LGT:LAD:LPL& 0,3:0/1:15,25:40,0,80\\
2&A&C,G,T,\textless*\textgreater& GT:AD:PL&0/3:15,.,.,25,.:40,.,.,.,.,.,0,.,.,80,.,.,.,.\\
d-cameron marked this conversation as resolved.
Show resolved Hide resolved
3&C&G,T,\textless*\textgreater& LAA:LGT:LAD:LPL& 0,4:0/0:30,1:0,30,80\\
3&C&G,T,\textless*\textgreater& GT:AD:PL& 0/0:30,.,.1:0,.,.,.,.,.,.,.,.,.,30,.,.,.,80\\
d-cameron marked this conversation as resolved.
Show resolved Hide resolved
4&G&A,T,\textless*\textgreater& LAA:LGT:LAD:LPL& 0:0/0:30:0\\
4&G&A,T,\textless*\textgreater& GT:AD:PL& 0/0:30,.,..:0,.,.,.,.,.,.,.,.,.,.,.,.,.,.\\
d-cameron marked this conversation as resolved.
Show resolved Hide resolved
\end{tabular}
\item LAD: is a list of $n$ integers giving read depths (as per AD) for each of the local alleles as listed in LAA.
\item LGT: is the genotype, encoded as allele indexes separated by either of $/$ or $\mid$, as with GT, however, the indexes are into the alleles referenced by LAA.
So that in the case that LAA is 0,2,3, LGT=0/2 is equivalent to GT=0/3 and LGT=1/2 is equivalent to GT=2/3 (see example above).
\item LPL: is a list of $n \choose \mathrm{Ploidy}$ integers giving phred-scaled genotype likelihoods (rounded to the closest integer; as per PL) for all possible genotypes given the set of alleles defined in the LAA local alleles.
The precise ordering is defined in the GL paragraph.
\item MQ (Integer): RMS mapping quality, similar to the version in the INFO field.
\item PL (Integer): The phred-scaled genotype likelihoods rounded to the closest integer, and otherwise defined in the same way as the GL field.
\item PP (Integer): The phred-scaled genotype posterior probabilities rounded to the closest integer, and otherwise defined in the same way as the GP field.
Expand All @@ -589,7 +628,7 @@ \subsubsection{Genotype fields}
All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set.
If the genotype in the GT field is unphased, the corresponding PS field is ignored.
The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required).
\item PSL (List of Strings): The list of phase sets, one for each allele specified in the {\tt GT}.
\item PSL (List of Strings): The list of phase sets, one for each allele specified in the {\tt GT} or {\tt LGT}.
Unphased alleles (without a $\mid$ separator before them) must have the value '$.$' in their corresponding position in the list.
Unlike {\tt PS} (which is defined per CHROM), records with different CHROM but the same phase-set name are considered part of the same phase set.
If an implementation cannot guarantee uniqueness of phase-set names across the VCF (for example, phasing a streaming VCF or each CHROM is processed independently in parallel), new phase-set names should be of the format CHROM*POS*ALLELE-NUMBER of the ``first'' allele which is included in this set, with ALLELE-NUMBER being the index of the allele in the {\tt GT} field, since multiple distinct phase-sets could start at the same position. \footnote{The `*' character is used as a separator since `:' is not reserved in the CHROM column.}
Expand Down Expand Up @@ -1702,6 +1741,26 @@ \subsection{Representing unspecified alleles and REF-only blocks (gVCF)}
\normalsize


\subsubsection{Multi-sample REF-only blocks}
When handling VCFs with multiple samples, the length of the $<$*$>$ reference blocks can differ.
To account for this, a sample-specific END can be specified via the FORMAT END field.
If any FORMAT END value exists, the INFO END must be present and equal the largest FORMAT END value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this ensures that a single sample VCF using ref blocks is a valid GVCF?

I'm having a bit of trouble imagining any other reason to want to know the max of the ENDs (though the min does seem perhaps useful: that's the next place some ref block will start).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reviewing everything a second time, may I propose this instead?

If the VCF contains exactly one sample column, the INFO END field must be present and equal to the FORMAT END value at every locus. Otherwise (zero or two or more samples), the INFO END must be missing.

The INFO END, as a single value, will always be misinterpreted by tools written for GVCF. They will either treat some reference genotypes as missing (if INFO END is the min over all samples) or treat some missing or variant genotypes as reference genotypes (if INFO END is the max over all samples). Given this, I prefer that INFO END is required to be missing unless there is exactly one sample.

Positions implicitly called by a preceding $<$*$>$ for a sample must have $GT$/$LGT$ set to the missing value (`.') and have no other FORMAT fields present.
If $LAA$ is present and a reference block is defined for a given sample, the $<$*$>$ allele must be included as an $LAA$ allele for that sample even though the $LGT$ is $0/0$.

For example, the genotype-only version of the above example with a second sample with no variants:
\scriptsize
\begin{flushleft}
\begin{tabular}{ l l l l l l l l }
POS & REF & ALT & INFO & FORMAT & SampleA & SampleB \\
4370 & G & $<$*$>$ & END=4416 & LGT:LAA:END & 0/0:0,1:4388 & 0/0:0,1:4416 \\
4389 & T & TC & . & LGT:LAA:END & 0/1:0,1:. & . \\
4390 & C & $<$*$>$ & END=4416 & LGT:LAA:END & 0/0:0,1:4416 & . \\
\end{tabular}
\end{flushleft}
Copy link
Contributor

@danking danking Mar 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should include all the rows from above to make absolutely clear that this just adds
another sample.

\begin{flushleft}
\begin{tabular}{ l l l l l l l l }
POS & REF & ALT & INFO & FORMAT & SampleA & SampleB \\
4370 & G & $<$*$>$ & END=4416 & LGT:LAA:END & 0/0:0,1:4383 & 0/0:0,1:4416 \\
4384 & C & $<$*$>$ & END=4388 & LGT:LAA:END & 0/0:0,1:4388 & . \\
4389 & T & TC & . & LGT:LAA:END & 0/1:0,1:. & . \\
4390 & C & $<$*$>$ & END=4390 & LGT:LAA:END & 0/0:0,1:4390 & . \\
4391 & C & $<$*$>$ & END=4395 & LGT:LAA:END & 0/0:0,1:4395 & . \\
4396 & C & $<$*$>$ & . & LGT:LAA:END & 0/0:0,1:. & . \\
4397 & C & $<$*$>$ & END=4416 & LGT:LAA:END & 0/0:0,1:4416 & . \\
\end{tabular}
\end{flushleft}

Three more potentially more controversial changes:

  1. Use a different END for SampleB to make completely clear that SampleB's END has nothing to do
    SampleA's END.

  2. Do not use LGT or LAA. Two reasons: (a) make the multi-sample ref-block changes most prominent and (b) clarify that ref-blocking and local alleles are orthogonal compression techniques [1].

  3. Preserve the star allele on line 4389 to make clear that removing it is not necessary.

\begin{flushleft}
\begin{tabular}{ l l l l l l l l }
POS & REF & ALT & INFO & FORMAT & SampleA & SampleB \\
4370 & G & $<$*$>$ & END=4500 & GT:END & 0/0:4383 & 0/0:4500 \\
4384 & C & $<$*$>$ & END=4388 & GT:END & 0/0:4388 & . \\
4389 & T & TC,$<$*$>$ & . & GT:END & 0/1:. & . \\
4390 & C & $<$*$>$ & END=4390 & GT:END & 0/0:4390 & . \\
4391 & C & $<$*$>$ & END=4395 & GT:END & 0/0:4395 & . \\
4396 & C & $<$*$>$ & . & GT:END & 0/0:. & . \\
4397 & C & $<$*$>$ & END=4416 & GT:END & 0/0:4416 & . \\
\end{tabular}
\end{flushleft}
Screenshot 2024-03-04 at 3 45 21 PM

[1] It's true that O(samples) scaling is only achieved with both local alleles and ref blocking;
however, they're independent ideas and, I think, easier to first understand in isolation.

\normalsize


\pagebreak
\subsection{Representing copy number variation}
\label{cnv}
Expand Down Expand Up @@ -2552,6 +2611,8 @@ \section{List of changes}
\subsection{Changes between VCFv4.5 and VCFv4.4}

\begin{itemize}
\item Added local allele support (FORMAT LAA, LGT, LAD, LPL) to reduce the size of multi-sample VCFs and enable lossless merging.
\item Added FORMAT END to support sample-specific $<$*$>$ alleles.
\end{itemize}

\subsection{Changes between VCFv4.4 and VCFv4.3}
Expand Down
Loading