diff --git a/VCFv4.5.draft.pdf b/VCFv4.5.draft.pdf new file mode 100644 index 000000000..8818a2ae8 Binary files /dev/null and b/VCFv4.5.draft.pdf differ diff --git a/VCFv4.5.draft.tex b/VCFv4.5.draft.tex index a4fd4c519..6af0254fb 100644 --- a/VCFv4.5.draft.tex +++ b/VCFv4.5.draft.tex @@ -17,7 +17,7 @@ \renewcommand{\thefootnote}{\fnsymbol{footnote}} \begin{document} -\input{VCFv4.5.ver} +\input{VCFv4.5.draft.ver} \title{\huge \color{red} DRAFT SPEC SUBJECT TO CHANGE \\ The Variant Call Format Specification \\ \vspace{0.5em} \large VCFv4.5 and BCFv2.2} \date{\headdate} \maketitle @@ -189,7 +189,14 @@ \subsubsection{Individual format field format} \end{verbatim} Possible Types for FORMAT fields are: Integer, Float, Character, and String (this field is otherwise defined precisely as the INFO field). -The Number field is defined as per the INFO Number field. +The Number field is defined as per the INFO Number field with the following additional possibilities: + +\begin{itemize} + \item LA: Identical to A except that only the alternate alleles listed in the $LAA$ field are considered present. + \item LR: Identical to R except that only the alternate alleles listed in the $LAA$ field are considered present. + \item LG: Identical to G except that only the alternate alleles listed in the $LAA$ field are considered present. + \item P: The field has one value for each allele value defined in $GT$/$LGT$. 
+\end{itemize} \subsubsection{Alternative allele field format} \label{altfield} ALT meta-information lines are structured lines with require fields of ID and Description that describe the possible symbolic alternate alleles in the ALT column of the VCF records: @@ -413,7 +420,7 @@ \subsubsection{Fixed fields} CIGAR & A & String & Cigar string describing how to align an alternate allele to the reference allele \\ DB & 0 & Flag & dbSNP membership \\ DP & 1 & Integer & Combined depth across samples \\ - END & 1 & Integer & End position on CHROM (used with symbolic alleles; see below) \\ + END & 1 & Integer & Deprecated. Present for backwards compatibility with earlier versions of VCF. \\ H2 & 0 & Flag & HapMap2 membership \\ H3 & 0 & Flag & HapMap3 membership \\ MQ & 1 & Float & RMS mapping quality \\ @@ -427,12 +434,15 @@ \subsubsection{Fixed fields} \begin{itemize} \renewcommand{\labelitemii}{$\circ$} -\item END: End reference position (1-based), indicating the variant spans positions POS--END on reference/contig CHROM. -Normally this is the position of the last base in the REF allele, so it can be derived from POS and the length of REF, and no END INFO field is needed. -However when symbolic alleles are used, e.g.\ in gVCF or structural variants, an explicit END INFO field provides variant span information that is otherwise unknown. -If a record containing a symbolic structural variant allele does not have an END field, it must be computed from the SVLEN field as per Section \ref{sv-info-keys}. +\item END: Deprecated. +Retained for backwards compatibility with earlier versions of VCF and older VCF indexing software which rely on this field being present. + +This is a computed field that, when present, must be set to the maximum end reference position (1-based) of: +the position of the final base of the REF allele, +the end position corresponding to the SVLEN of a symbolic SV allele, +and the end positions calculated from FORMAT LEN for the $<$*$>$ symbolic allele. 
-This field is used to compute BCF's {\tt rlen} field (see~\ref{BcfSiteEncoding}) and is important when indexing VCF/BCF files to enable random access and querying by position. +The value of this computed field is used to derive BCF's {\tt rlen} field (see~\ref{BcfSiteEncoding}) and is important when indexing VCF/BCF files to enable random access and querying by position. \end{itemize} @@ -441,6 +451,8 @@ \subsubsection{Genotype fields} First a FORMAT field is given specifying the data types and order (colon-separated FORMAT keys matching the regular expression \texttt{\^{}[A-Za-z\_][0-9A-Za-z\_.]*\$}, duplicate keys are not allowed). This is followed by one data block per sample, with the colon-separated data corresponding to the types specified in the format. The first key must always be the genotype (GT) if it is present. +If the LGT key is present, it must precede all fields other than GT. +If any local-allele field is present, LAA must also be present and precede all fields other than GT and LGT. There are no required keys. Additional Genotype keys can be defined in the meta-information, however, software support for them is not guaranteed. @@ -448,6 +460,7 @@ \subsubsection{Genotype fields} For example if the FORMAT is GT:GQ:DP:HQ then $0\mid0:.:23:23,34$ indicates that GQ is missing. If a field contains a list of missing values, it can be represented either as a single MISSING value (`.') or as a list of missing values (e.g.\ `.,.,.' if the field was Number=3). Trailing fields can be dropped, with the exception of the GT field, which should always be present if specified in the FORMAT field. +If a field and its local-allele equivalent (including GT/LGT) are both defined, they must encode identical information, or one of them must be ignored by either containing the MISSING value or being omitted. As with the INFO field, there are several common, reserved keywords that are standards across the community. 
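As a rough illustration of the computed END value that the deprecated END entry above describes, the following Python sketch takes the maximum over the three candidate end positions. The function `computed_end` and its parameter names are hypothetical and not part of the specification. It assumes that a reference-spanning symbolic SV allele ends at POS + SVLEN (as in the END definition this change removes for DEL, DUP, INV and CNV alleles) and that a reference block of length LEN starting at POS ends at POS + LEN - 1.

```python
from typing import Iterable, Optional


def computed_end(
    pos: int,
    ref: str,
    spanning_svlens: Iterable[Optional[int]] = (),
    sample_lens: Iterable[Optional[int]] = (),
) -> int:
    """Return the 1-based END of a record as the maximum of its candidate ends.

    pos             -- 1-based POS of the record
    ref             -- REF allele string
    spanning_svlens -- SVLEN values of symbolic SV alleles that span the
                       reference (assumption: DEL, DUP, INV, CNV; not INS)
    sample_lens     -- per-sample FORMAT LEN values for the <*> allele;
                       None stands for a missing or omitted value
    """
    # Final base of the REF allele.
    candidates = [pos + len(ref) - 1]

    # Assumption: a spanning symbolic SV allele ends at POS + |SVLEN|.
    for svlen in spanning_svlens:
        if svlen is not None:
            candidates.append(pos + abs(svlen))

    # Assumption: a reference block of length LEN ends at POS + LEN - 1.
    for length in sample_lens:
        if length is not None:
            candidates.append(pos + length - 1)

    return max(candidates)
```

With these assumptions, `computed_end(4370, "G", sample_lens=[14])` gives 4383, and `computed_end(2, "T", spanning_svlens=[2])` gives 4, which agree with the END values shown elsewhere in this change.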
@@ -471,25 +484,37 @@ \subsubsection{Genotype fields} \caption{Reserved genotype keys} \label{table:reserved-genotypes} \endlastfoot - AD & R & Integer & Read depth for each allele \\ - ADF & R & Integer & Read depth for each allele on the forward strand \\ - ADR & R & Integer & Read depth for each allele on the reverse strand \\ - DP & 1 & Integer & Read depth \\ - EC & A & Integer & Expected alternate allele counts \\ - FT & 1 & String & Filter indicating if this genotype was ``called'' \\ - GL & G & Float & Genotype likelihoods \\ - GP & G & Float & Genotype posterior probabilities \\ - GQ & 1 & Integer & Conditional genotype quality \\ - GT & 1 & String & Genotype \\ - HQ & 2 & Integer & Haplotype quality \\ - MQ & 1 & Integer & RMS mapping quality \\ - PL & G & Integer & Phred-scaled genotype likelihoods rounded to the closest integer \\ - PP & G & Integer & Phred-scaled genotype posterior probabilities rounded to the closest integer \\ - PQ & 1 & Integer & Phasing quality \\ - PS & 1 & Integer & Phase set \\ - PSL & P & String & Phase set list \\ - PSO & P & Integer & Phase set list ordinal \\ - PSQ & P & Integer & Phase set list quality \\ + AD & R & Integer & Read depth for each allele \\ + ADF & R & Integer & Read depth for each allele on the forward strand \\ + ADR & R & Integer & Read depth for each allele on the reverse strand \\ + DP & 1 & Integer & Read depth \\ + EC & A & Integer & Expected alternate allele counts \\ + LEN & 1 & Integer & Length of $<$*$>$ reference block \\ + FT & 1 & String & Filter indicating if this genotype was ``called'' \\ + GL & G & Float & Genotype likelihoods \\ + GP & G & Float & Genotype posterior probabilities \\ + GQ & 1 & Integer & Conditional genotype quality \\ + GT & 1 & String & Genotype \\ + HQ & 2 & Integer & Haplotype quality \\ + LA & . & Integer & Reserved \\ + LAA & . 
& Integer & 1-based indices into ALT, indicating which alleles are relevant (local) for the current sample \\ + LAD & LR & Integer & Local-allele representation of AD \\ + LADF & LR & Integer & Local-allele representation of ADF \\ + LADR & LR & Integer & Local-allele representation of ADR \\ + LEC & LA & Integer & Local-allele representation of EC \\ + LGL & LG & Float & Local-allele representation of GL \\ + LGP & LG & Float & Local-allele representation of GP \\ + LGT & 1 & String & Local-allele representation of GT \\ + LPL & LG & Integer & Local-allele representation of PL \\ + LPP & LG & Integer & Local-allele representation of PP \\ + MQ & 1 & Integer & RMS mapping quality \\ + PL & G & Integer & Phred-scaled genotype likelihoods rounded to the closest integer \\ + PP & G & Integer & Phred-scaled genotype posterior probabilities rounded to the closest integer \\ + PQ & 1 & Integer & Phasing quality \\ + PS & 1 & Integer & Phase set \\ + PSL & P & String & Phase set list \\ + PSO & P & Integer & Phase set list ordinal \\ + PSQ & P & Integer & Phase set list quality \\ \end{longtable} @@ -499,6 +524,7 @@ \subsubsection{Genotype fields} \item DP (Integer): Read depth at this position for this sample. \item EC (Integer): Comma separated list of expected alternate allele counts for each alternate allele in the same order as listed in the ALT field. Typically used in association analyses. + \item LEN (Integer): Length of the $<$*$>$ reference block for this sample. \item FT (String): Sample genotype filter indicating if this genotype was ``called'' (similar in concept to the FILTER field). Again, use PASS to indicate that all filters have been passed, a semicolon-separated list of codes for filters that fail, or `.' to indicate that filters have not been applied. These values should be described in the meta-information in the same way as FILTERs. 
@@ -578,6 +604,35 @@ \subsubsection{Genotype fields} \end{itemize} \item HQ (Integer): Haplotype qualities, two comma separated phred qualities. + \item LAA is a list of $n$ distinct integers, giving the 1-based indices of the ALT alleles that are observed in the sample. + In callsets with many samples, sites may grow to include numerous alternate alleles at the same POS. + Usually, few of these alleles are actually observed in any one sample, but each genotype must supply fields like PL and AD for all of the alleles---a very inefficient representation as PL's size is quadratic in the allele count. + Similarly, in rare sites, which can be the bulk of the sites, the vast majority of the samples are reference. + To prevent this growth in VCF size, one can choose to specify the genotype, allele depth and the genotype likelihood against a subset of ``Local Alleles''. + LAA gives the 1-based indices into ALT, defining the alleles that are actually in play for that sample and the order in which they are interpreted. + LAA is required when interpreting local-allele fields and must be present if any local-allele fields are neither omitted nor MISSING. + Since BCF encodes zero-length vectors as MISSING, an LAA containing the MISSING value should be treated as the empty vector (i.e.\ a REF-only site) if any local-allele fields are neither omitted nor MISSING. + All specification-defined A, R and G FORMAT fields have a local-allele equivalent that should be interpreted in the same manner as its matching field except for the ALT alleles considered present and the order in which they are interpreted. + For example, if REF is G, ALT is A,C,T,\verb!<*>! and a genotype only has information about G, C, and \verb!<*>!, one can have LAA=[2,4] and thus LPL will be interpreted as pertaining to the alleles [G, C, \verb!<*>!] and not contain likelihood values for genotypes that involve A or T. + In this case LGT=0/1 means that the sample is G/C. 
+ GQ is still the genotype quality, even when the genotype is given against the local alleles. + In the following example, the records with the same POS encode the same information (some columns removed for clarity): + \begin{tabular}[l]{llllll} + POS &REF& ALT&FORMAT&sample\\ + 1&G&A,C,T,\textless*\textgreater& LGT:LAA:LAD:LPL& 1/1:2,4:20,30,10:90,80,0,100,110,120\\ + 1&G&A,C,T,\textless*\textgreater& GT:AD:PL& 2/2:20,.,30,.,10:90,.,.,80,.,0,.,.,.,.,100,.,110,.,120\\ + 2&A&C,G,T,\textless*\textgreater& GT:LAA:LAD:LPL& 0/3:3:15,25:40,0,80\\ + 2&A&C,G,T,\textless*\textgreater& GT:AD:PL&0/3:15,.,.,25,.:40,.,.,.,.,.,0,.,.,80,.,.,.,.,.\\ + 3&C&G,T,\textless*\textgreater& LGT:LAA:LAD:LPL& 0/0:3:30,1:0,30,80\\ + 3&C&G,T,\textless*\textgreater& GT:AD:PL& 0/0:30,.,.,1:0,.,.,.,.,.,30,.,.,80\\ + 4&G&A,T,\textless*\textgreater& LGT:LAA:LAD:LPL& 0/0::30:0\\ + 4&G&A,T,\textless*\textgreater& GT:AD:PL& 0/0:30,.,.,.:0,.,.,.,.,.,.,.,.,.\\ + \end{tabular} + Due to BCF encoding empty vectors as missing, implementation-defined Number=LA local-allele fields should not be used if distinguishing between zero-length data and missing data is required at REF-only sites. + \item LGT: The genotype, encoded as allele indexes separated by either $/$ or $\mid$ as with GT, except that the indexes refer to the local alleles (REF followed by the alleles listed in LAA). + For example, if LAA is 2,3 then LGT=0/2 is equivalent to GT=0/3 and LGT=1/2 is equivalent to GT=2/3 (see also the example above). + \item LPL: A list of ${n + \mathrm{Ploidy} \choose \mathrm{Ploidy}}$ integers, where $n$ is the number of alleles listed in LAA, giving phred-scaled genotype likelihoods (rounded to the closest integer; as per PL) for all possible genotypes given the set of alleles defined in the LAA local alleles. + The precise ordering is defined in the GL paragraph. \item MQ (Integer): RMS mapping quality, similar to the version in the INFO field. \item PL (Integer): The phred-scaled genotype likelihoods rounded to the closest integer, and otherwise defined in the same way as the GL field. 
\item PP (Integer): The phred-scaled genotype posterior probabilities rounded to the closest integer, and otherwise defined in the same way as the GP field. @@ -589,7 +644,7 @@ \subsubsection{Genotype fields} All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set. If the genotype in the GT field is unphased, the corresponding PS field is ignored. The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required). - \item PSL (List of Strings): The list of phase sets, one for each allele specified in the {\tt GT}. + \item PSL (List of Strings): The list of phase sets, one for each allele value specified in the {\tt GT} or {\tt LGT}. Unphased alleles (without a $\mid$ separator before them) must have the value '$.$' in their corresponding position in the list. Unlike {\tt PS} (which is defined per CHROM), records with different CHROM but the same phase-set name are considered part of the same phase set. If an implementation cannot guarantee uniqueness of phase-set names across the VCF (for example, phasing a streaming VCF or each CHROM is processed independently in parallel), new phase-set names should be of the format CHROM*POS*ALLELE-NUMBER of the ``first'' allele which is included in this set, with ALLELE-NUMBER being the index of the allele in the {\tt GT} field, since multiple distinct phase-sets could start at the same position. \footnote{The `*' character is used as a separator since `:' is not reserved in the CHROM column.} @@ -643,7 +698,7 @@ \section{Understanding the VCF format and the haplotype representation} In essence, the VCF record specifies a-REF-t and the alternative haplotypes are a-ALT-t for each alternative allele. 
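The local-allele fields introduced in the Genotype fields hunks above (LAA, LGT, LAD, LPL) can be mapped back to their site-wide equivalents. The Python sketch below is one possible way to do this, not a reference implementation: the helper names `genotype_index`, `lgt_to_gt` and `expand_local` are hypothetical, `None` stands in for the MISSING value, and the genotype ordering is assumed to be the one the specification's GL paragraph defines for unphased genotypes, i.e. the sum over the sorted alleles $k_1 \le \dots \le k_P$ of ${k_m + m - 1 \choose m}$.

```python
import math
import re
from itertools import combinations_with_replacement
from typing import List, Optional, Sequence, Tuple


def genotype_index(alleles: Sequence[int]) -> int:
    """Position of a genotype in GL/PL order (assumed ordering, any ploidy)."""
    return sum(math.comb(a + k, k + 1) for k, a in enumerate(sorted(alleles)))


def lgt_to_gt(lgt: str, laa: Sequence[int]) -> str:
    """Rewrite local allele indexes in LGT as site-wide indexes, keeping phasing."""
    local_to_global = [0] + list(laa)  # local index 0 is always REF
    parts = re.split(r"([/|])", lgt)
    return "".join(
        p if p in ("/", "|", ".", "") else str(local_to_global[int(p)])
        for p in parts
    )


def expand_local(
    laa: Sequence[int],
    n_alt: int,
    lad: Sequence[int],
    lpl: Sequence[int],
    ploidy: int = 2,
) -> Tuple[List[Optional[int]], List[Optional[int]]]:
    """Expand LAD and LPL into full-length AD and PL lists (None = MISSING)."""
    local_to_global = [0] + list(laa)
    n_alleles = n_alt + 1

    # AD has one value per allele (Number=R).
    ad: List[Optional[int]] = [None] * n_alleles
    for local_idx, depth in enumerate(lad):
        ad[local_to_global[local_idx]] = depth

    # PL has one value per possible genotype (Number=G).
    pl: List[Optional[int]] = [None] * math.comb(n_alleles + ploidy - 1, ploidy)
    local_genotypes = sorted(
        combinations_with_replacement(range(len(local_to_global)), ploidy),
        key=genotype_index,
    )
    for value, local_gt in zip(lpl, local_genotypes):
        global_gt = [local_to_global[a] for a in local_gt]
        pl[genotype_index(global_gt)] = value
    return ad, pl
```

Applied to the POS 1 record of the example table (`LAA=2,4`, four ALT alleles), `lgt_to_gt("1/1", [2, 4])` returns `"2/2"`, and `expand_local` places the three LAD values and six LPL values at the positions shown in the corresponding GT:AD:PL row.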
\subsection{VCF tag naming conventions} -Several tag names follow conventions indicating how their values are represented numerically: +Several tag names follow conventions indicating how their values are represented; implementation-defined tags should follow the same conventions: \begin{itemize} \item The `L' suffix means \emph{likelihood} as log-likelihood in the sampling distribution, $\log_{10} \Pr(\mathrm{Data}|\mathrm{Model})$. Likelihoods are represented as $\log_{10}$ scale, thus they are negative numbers (e.g.\ GL, CNL). @@ -654,6 +709,8 @@ \subsection{VCF tag naming conventions} \item The `Q' suffix means \emph{quality} as log-complementary-phred-scale posterior probability, $-10 \log_{10} \Pr(\mathrm{Data}|\mathrm{Model})$, where the model is the most likely genotype that appears in the GT field. Examples are GQ, CNQ. The fixed site-level QUAL field follows the same convention (represented as a phred-scaled number). + + \item The `L' prefix indicates the local-allele equivalent of a Number=A, R or G field. \end{itemize} @@ -679,25 +736,11 @@ \section{INFO keys used for structural variants} \footnotesize \begin{verbatim} ##INFO= -##INFO= +##INFO= \end{verbatim} \normalsize -$END$ position of the longest variant described in this record. -The END of each allele is defined as: - -Non-symbolic alleles: $\mbox{POS} + \mbox{length of REF allele} - 1$. - -$<$INS$>$ symbolic structural variant alleles: $\mbox{POS} + \mbox{length of REF allele} - 1$. - -$<$DEL$>$, $<$DUP$>$, $<$INV$>$, and $<$CNV$>$ symbolic structural variant alleles:, $\mbox{POS} + \mbox{SVLEN}$. - -$<$*$>$ symbolic allele: the last reference call position. - -END must be present for all records containing the $<$*$>$ symbolic allele and, for backwards compatibility, should be present for records containing any symbolic structural variant alleles. - -To prevent loss of information, any VCF record containing the $<$*$>$ symbolic allele must have END set to the last reference call position of the $<$*$>$ symbolic allele. 
-When a record contains both the $<$*$>$ symbolic allele, the END position of the longest allele should be used as the record end position for indexing purposes. +$END$ has been deprecated in favour of INFO SVLEN and FORMAT LEN. \footnotesize \begin{verbatim} @@ -722,7 +765,7 @@ \section{INFO keys used for structural variants} SVLEN is defined for $CNV$ symbolic alleles as the length of the segment over which the copy number variant is defined. The missing value $.$ should be used for all other ALT alleles, including ALT alleles using breakend notation. -For backwards compatibility, a missing SVLEN should be inferred from the $END$ field of VCF records whose $ALT$ field contains a single symbolic allele. +For backwards compatibility, a missing SVLEN should be inferred from the $END$ field. For backwards compatibility, the absolute value of SVLEN should be taken and a negative SVLEN should be treated as positive values. @@ -746,7 +789,7 @@ \section{INFO keys used for structural variants} \footnotesize \begin{verbatim} -##INFO= +##INFO= \end{verbatim} \normalsize @@ -1199,7 +1242,6 @@ \subsection{Encoding Structural Variants} ##ALT= ##ALT= ##INFO= -##INFO= ##INFO= ##INFO= ##INFO= @@ -1212,13 +1254,13 @@ \subsection{Encoding Structural Variants} ##FORMAT= #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample chrA 2 . TGC T . . EVENT=DEL_seq GT 0/1 -chrA 2 . T . . SVLEN=2;SVCLAIM=DJ;EVENT=DEL_symbolic;END=4 GT 0/1 +chrA 2 . T . . SVLEN=2;SVCLAIM=DJ;EVENT=DEL_symbolic GT 0/1 chrA 2 delbp1 T T[chrA:5[ . . MATEID=delbp2;EVENT=DEL_split_bp_cn GT 0/1 chrA 2 delbp2 A ]chrA:2]A . . MATEID=delbp1;EVENT=DEL_split_bp_cn GT 0/1 -chrA 2 . T . . SVLEN=2;SVCLAIM=D;EVENT=DEL_split_bp_cn;END=4 GT 0/1 +chrA 2 . T . . SVLEN=2;SVCLAIM=D;EVENT=DEL_split_bp_cn GT 0/1 chrA 5 . G GAAA . . EVENT=homology_seq GT 1/1 chrA 5 . G . . SVLEN=3;CIPOS=0,5;EVENT=homology_dup GT 0/1 -chrA 14 . T . . IMPRECISE;SVLEN=100;CILEN=-50,50;CIPOS=-10,10;END=14 GT 0/1 +chrA 14 . T . . 
IMPRECISE;SVLEN=100;CILEN=-50,50;CIPOS=-10,10 GT 0/1 chrA 14 . G .CCCCCCG . . EVENT=single_breakend GT 0/1 \end{verbatim} \end{landscape} @@ -1461,7 +1503,7 @@ \subsubsection{Inversions} \small \begin{tabular}{ l l l l l l l l } \#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\ -2 & 321681 & INV0 & T & $<$INV$>$ & 6 & PASS & END=421681 \\ +2 & 321681 & INV0 & T & $<$INV$>$ & 6 & PASS & SVLEN=100000 \\ \end{tabular} \normalsize \vspace{0.3cm} @@ -1544,7 +1586,7 @@ \subsubsection{Single breakends} \begin{tabular}{ l l l l l l l l } \#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\ 3 & 12665 & bnd\_X & A & .A & 6 & PASS & CIPOS=-50,50 \\ -3 & 12665 & . & A & $<$DUP$>$ & 14 & PASS & END=13686;CIPOS=-50,50;CIEND=-50,50 \\ +3 & 12665 & . & A & $<$DUP$>$ & 14 & PASS & SVCLAIM=D;SVLEN=1021;CIPOS=-50,50;CIEND=-50,50 \\ 3 & 13686 & bnd\_Y & T & T. & 6 & PASS & CIPOS=-50,50 \\ \end{tabular} \normalsize @@ -1557,7 +1599,7 @@ \subsubsection{Single breakends} \begin{tabular}{ l l l l l l l l } \#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\ 3 & 12665 & bnd\_X & A & .TGCA & 6 & PASS & CIPOS=-50,50 \\ -3 & 12665 & . & A & $<$DUP$>$ & 14 & PASS & END=13686;CIPOS=-50,50;CIEND=-50,50 \\ +3 & 12665 & . & A & $<$DUP$>$ & 14 & PASS & SVCLAIM=D;SVLEN=1021;CIPOS=-50,50;CIEND=-50,50 \\ 3 & 13686 & bnd\_Y & T & TCC. & 6 & PASS & CIPOS=-50,50 \\ \end{tabular} \normalsize @@ -1679,29 +1721,36 @@ \subsubsection{Clonal derivation relationships} \pagebreak \subsection{Representing unspecified alleles and REF-only blocks (gVCF)} \label{unspecified-allele} -In order to report sequencing data evidence for both variant and non-variant positions in the genome, the VCF specification allows to represent blocks of reference-only calls in a single record using the END INFO tag, an idea originally introduced by the gVCF file format\footnote{\url{https://help.basespace.illumina.com/articles/descriptive/gvcf-files/}}. 
- +In order to report sequencing data evidence for both variant and non-variant positions in the genome, the VCF specification allows blocks of reference-only calls to be represented in a single record using the $<$*$>$ allele and the FORMAT LEN field. The convention adopted here is to represent reference evidence as likelihoods against an unknown alternate allele represented as $<$*$>$. Think of this as the likelihood for reference as compared to any other possible alternate allele (both SNP, indel, or otherwise). -The $<$*$>$ representation is preferred over the symbolic allele $<$NON\_REF$>$. -Example records are given below: +Positions implicitly called by a preceding $<$*$>$ for a sample must have $GT$/$LGT$ set to the missing value (`.') and have no FORMAT fields other than $LAA$ present. +If $LAA$ is present and a reference block start is being defined for a given sample, the $<$*$>$ allele must be included as an $LAA$ allele for that sample even though the $GT$/$LGT$ is $0/0$. + +Reference blocks were originally introduced by the gVCF file format\footnote{\url{https://help.basespace.illumina.com/articles/descriptive/gvcf-files/}}. +Unfortunately, gVCF has issues scaling to many samples as the use of INFO END to encode the reference block length requires the reference block length to be the same for all samples. + +To retain backwards compatibility with gVCF, +the symbolic allele $<$NON\_REF$>$ should be treated as an alias of $<$*$>$ +and a missing FORMAT LEN field should be inferred from the INFO END tag if present. + +An example with both FORMAT LEN and a redundant INFO END is given below: \scriptsize \begin{flushleft} \begin{tabular}{ l l l l l l l l l l } \#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO & FORMAT & Sample \\ -1 & 4370 & . & G & $<$*$>$ & . & . & END=4383 & GT:DP:GQ:MIN\_DP:PL & 0/0:25:60:23:0,60,900 \\ -1 & 4384 & . & C & $<$*$>$ & . & . & END=4388 & GT:DP:GQ:MIN\_DP:PL & 0/0:25:45:25:0,42,630 \\ -1 & 4389 & . 
& T & TC,$<$*$>$ & 213.73 & . & . & GT:DP:GQ:PL & 0/1:23:99:51,0,36,93,92,86 \\ -1 & 4390 & . & C & $<$*$>$ & . & . & END=4390 & GT:DP:GQ:MIN\_DP:PL & 0/0:26:0:26:0,0,315 \\ -1 & 4391 & . & C & $<$*$>$ & . & . & END=4395 & GT:DP:GQ:MIN\_DP:PL & 0/0:27:63:27:0,63,945 \\ -1 & 4396 & . & G & C,$<$*$>$ & 0 & . & . & GT:DP:GQ:P & 0/0:24:52:0,52,95,66,95,97 \\ -1 & 4397 & . & T & $<$*$>$ & . & . & END=4416 & GT:DP:GQ:MIN\_DP:PL & 0/0:22:14:22:0,15,593 \\ +1 & 4370 & . & G & $<$*$>$ & . & . & END=4383 & GT:DP:GQ:MIN\_DP:PL:LEN & 0/0:25:60:23:0,60,900:14 \\ +1 & 4384 & . & C & $<$*$>$ & . & . & END=4388 & GT:DP:GQ:MIN\_DP:PL:LEN & 0/0:25:45:25:0,42,630:5 \\ +1 & 4389 & . & T & TC,$<$*$>$ & 213.73 & . & . & GT:DP:GQ:PL:LEN & 0/1:23:99:51,0,36,93,92,86 \\ +1 & 4390 & . & C & $<$*$>$ & . & . & END=4390 & GT:DP:GQ:MIN\_DP:PL:LEN & 0/0:26:0:26:0,0,315:1 \\ +1 & 4391 & . & C & $<$*$>$ & . & . & END=4395 & GT:DP:GQ:MIN\_DP:PL:LEN & 0/0:27:63:27:0,63,945:5 \\ +1 & 4396 & . & G & C,$<$*$>$ & 0 & . & . & GT:DP:GQ:PL:LEN & 0/0:24:52:0,52,95,66,95,97 \\ +1 & 4397 & . & T & $<$*$>$ & . & . & END=4416 & GT:DP:GQ:MIN\_DP:PL:LEN & 0/0:22:14:22:0,15,593:20 \\ \end{tabular} \end{flushleft} \normalsize - \pagebreak \subsection{Representing copy number variation} \label{cnv} @@ -1717,7 +1766,7 @@ \subsection{Representing copy number variation} \footnotesize \begin{verbatim} - chr1 100 . T , . . END=130;SVLEN=30,30;CN=1,2 GT:CN 1/2:3 + chr1 100 . T , . . SVLEN=30,30;CN=1,2 GT:CN 1/2:3 \end{verbatim} \normalsize @@ -1730,7 +1779,7 @@ \subsection{Representing copy number variation} \footnotesize \begin{verbatim} - chr1 100 . T . . END=130;SVLEN=30 GT:CN .:3 + chr1 100 . T . . 
SVLEN=30 GT:CN .:3 \end{verbatim} \normalsize @@ -1776,7 +1825,6 @@ \subsection{Representing tandem repeats} \begin{landscape} \begin{verbatim} ##fileformat=VCFv4.5 -##INFO= ##INFO= ##INFO= ##INFO= @@ -1792,7 +1840,7 @@ \subsection{Representing tandem repeats} ##FORMAT= ##ALT= #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample -chr1 100 cnv_notation T , . . END=130;SVLEN=30,30;CN=3,0.9666;RUS=CAG,CAG,CA,CAG;RN=1,3;RB=90,15,2,12 GT:PS:CN 1|2:100:3.9666 +chr1 100 cnv_notation T , . . SVLEN=30,30;CN=3,0.9666;RUS=CAG,CAG,CA,CAG;RN=1,3;RB=90,15,2,12 GT:PS:CN 1|2:100:3.9666 chr1 117 precise_alt2 AG A . . GT:PS 0|1:100 chr1 130 precise_alt1 G GCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG . . GT:PS 1|0:100 \end{verbatim} @@ -1821,7 +1869,7 @@ \subsection{Representing tandem repeats} \item RUL should be omitted when RUS is present (as it is redundant when RS is present). \item RUS or RUL must be specified for each $<$CNV:TR$>$. \item Support for multiple levels of repeat nesting (such as STRs within VNTRs) is limited to the RUL repeat unit length field which allows the overall length of each top-level repeat unit to be encoded. - \item The POS and END of $<$CNV:TR$>$ records should match the STR/VNTR reference catalog sizes for catalog-based callers. + \item The POS and SVLEN of $<$CNV:TR$>$ records should match the STR/VNTR reference catalog sizes for catalog-based callers. \item Variant normalisation has limited utility in regions of low complexity as almost identical haplotypes can have very different normalised representations. \end{itemize} @@ -1844,7 +1892,7 @@ \subsection{Representing tandem repeats} \footnotesize \begin{verbatim} -chr1 100 . T . . END=130;SVLEN=30;CN=6.5;RUS=CAG;RUC=65;CIRUC=-15,. GT ./. +chr1 100 . T . . SVLEN=30;CN=6.5;RUS=CAG;RUC=65;CIRUC=-15,. GT ./. \end{verbatim} \normalsize @@ -1863,7 +1911,7 @@ \subsection{Representing tandem repeats} \footnotesize \begin{verbatim} -chr1 1000000 . T . . 
END=20000;SVLEN=20000;CN=1.25;RUL=10000;RUC=5;RUB=10000,10500,11000,11500,12000 GT ./. +chr1 1000000 . T . . SVLEN=20000;CN=1.25;RUL=10000;RUC=5;RUB=10000,10500,11000,11500,12000 GT ./. \end{verbatim} \normalsize @@ -2012,7 +2060,7 @@ \subsubsection{Site encoding} POS & int32\_t & 0-based leftmost coordinate \\ \hline rlen & int32\_t & Length of the record as projected onto the reference sequence. Must be the maximum of the length of the REF allele and the lengths - inferred from the SVLEN/END of any symbolic alleles \\ \hline + inferred from the SVLEN/LEN of any symbolic alleles \\ \hline QUAL & float & Variant quality; 0x7F800001 for a missing value \\ \hline n\_info & uint16\_t & The number of INFO fields in this record \\ \hline n\_allele & uint16\_t & The number of REF+ALT alleles in this record \\ \hline @@ -2552,6 +2600,10 @@ \section{List of changes} \subsection{Changes between VCFv4.5 and VCFv4.4} \begin{itemize} + \item Added Number=P support for fields with cardinality matching sample ploidy/local copy number. + \item Added local allele support (Number=LA, LG, LR; FORMAT LAA, LAD, LADF, LADR, LEC, LGL, LGP, LGT, LPL, LPP) to reduce the size of multi-sample VCFs and enable lossless merging. + \item Deprecated INFO END. It is now a computed field written only for backwards compatibility with older versions of VCF. + \item Added FORMAT LEN to support sample-specific $<$*$>$ alleles. \end{itemize} \subsection{Changes between VCFv4.4 and VCFv4.3} diff --git a/test/vcf/4.5/passed/zero_length_LAA.vcf b/test/vcf/4.5/passed/zero_length_LAA.vcf new file mode 100644 index 000000000..fd951420f --- /dev/null +++ b/test/vcf/4.5/passed/zero_length_LAA.vcf @@ -0,0 +1,10 @@ +##fileformat=VCFv4.5 +##FORMAT= +##FORMAT= +#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT homref het +1 100 zero_length_EC C T . . . LAA:LEC : 1:1 +1 200 missing_EC C T . . . LAA:LEC :. 1:1 +1 400 omitted_EC C T . . . LAA:LEC . 1:1 +1 300 missing_LAA C T . . . LAA:LEC .:. 
1:1 +1 500 omitted_or_zero_LAA C T . . . LAA:LEC 1:1 +1 600 inferred_LAA C T . . . LAA:LEC .: 1:1 \ No newline at end of file
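The `zero_length_LAA.vcf` test cases above exercise the rule that a MISSING LAA is read as the empty vector whenever another local-allele field carries data. A minimal Python sketch of that rule follows. The function `effective_laa` is hypothetical; it takes the raw text of each sample subfield, uses `None` for an omitted subfield and `""` for a zero-length one, and assumes that an omitted LAA is handled the same way as a MISSING one, which the specification text does not state explicitly.

```python
from typing import List, Optional, Sequence

MISSING = "."


def effective_laa(
    laa: Optional[str],
    other_local_fields: Sequence[Optional[str]],
) -> Optional[List[int]]:
    """Resolve the LAA to use for one sample.

    laa                -- raw LAA text: None if omitted, "." if MISSING,
                          "" if zero-length, otherwise comma-separated ints
    other_local_fields -- raw text of the other local-allele fields
                          (e.g. LEC, LAD, LPL), encoded the same way

    Returns a list of 1-based ALT indices ([] for a REF-only sample), or
    None when the sample carries no local-allele information at all.
    """
    # An explicit LAA, including a zero-length one, is used as written.
    if laa is not None and laa != MISSING:
        return [int(x) for x in laa.split(",")] if laa else []

    # LAA is MISSING (or, by assumption, omitted): treat it as the empty
    # vector only if some local-allele field is neither omitted nor MISSING.
    has_local_data = any(
        field is not None and field != MISSING for field in other_local_fields
    )
    return [] if has_local_data else None
```

Under these assumptions the `homref` column of the `inferred_LAA` record (`.:`) resolves to an empty LAA, whereas that of the `missing_LAA` record (`.:.`) resolves to no local-allele information.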