-
Notifications
You must be signed in to change notification settings - Fork 41
Reading GLnexus pVCFs
One of GLnexus' main functions is to generate a population-wide "project" VCF (pVCF) based on the input gVCFs for each individual sample. To that end, a key challenge is to derive a unified representation of the sites and alleles discovered across the population, navigating the numerous "degrees of freedom" VCF admits in expressing multi-allelic sites (further reading). Here we'll explain how GLnexus approaches this, important information for interpreting generated pVCFs.
The pVCF can be viewed as a matrix with variant sites down the rows, each having two or more alleles, and samples across the columns, with the matrix filled in with all the individual genotypes. Ideally, this matrix should provide a partitioned representation of all the discovered genetic variation, where all the sites' genome positions are mutually non-overlapping, and all the alleles at each site are mutually exclusive of the others. Such partitioning facilitates downstream analysis, e.g. population genetic statistics, because they can simply tally the independent events reported in the matrix. If, on the other hand, sites are overlapping and/or their alleles entangled, then the event space is trickier to reason about downstream (or such reasoning is skipped, with undefined results).
The nicely partitioned representation is easy for SNPs, but complications arise for indels and MNPs, which can overlap other alleles found in the population at slightly different positions. As a result, it isn't strictly possible to present a partitioned pVCF without omitting high-quality variants. (This is actually true even for the gVCF representation of the two chromosomes in one individual, but it's more acute in the pVCF aggregating many alleles discovered across a population.)
GLnexus first unifies as many alleles as possible into non-overlapping multiallelic sites, calling each sample's diploid genotype as usual. The unification can complicate the representation of variants; for example, if the alternate alleles at a site include both a SNP and an MNP, the SNP is "padded" with reference bases so that it covers the same reference genome range as the MNP. This can be inconvenient, but it follows from VCF's edit representation and the desire to keep sites partitioned wherever possible. And, it's much easier to break apart the multiallelic sites afterwards than to reassemble the diploid genotypes once broken apart.
For the remaining alleles that don't unify nicely, usually because they overlap multiple other sites, we also generate a second tier of "monoallelic" sites to represent each one. These monoallelic sites are marked in the FILTER field for easy recognition, and they have a slightly different format and interpretation. Their GT (genotype) fields indicate the called copy number of the ALT allele only (0, 1, or 2). They don't assert presence/absence of the REF allele or any other, which avoids duplicating information conveyed in the overlapping multiallelic sites.
In this hypothetical example, Alice, Bob, and Carol carry alleles that unify into two non-overlapping, multiallelic sites, some with reference-padded representations. Dave carries a deletion allele that overlaps both of those sites, and thus doesn't unify with them cleanly. Instead, there's a separate monoallelic record indicating its copy number in each sample. Also note Dave's genotype entries for the two simpler sites, which we must be careful not to call 0/0. (The "half-calls" like ./0 are not too often seen with other variant calling methods, but they're useful to convey the most precise assertions possible.)
Empirically, monoallelic sites tend to comprise <1% of the pVCF and most of them arise in homopolymer and tandem repeat sequences, which can be suspect anyway. But they'll typically also include some high-quality variants, especially deletions.
Let's briefly mention possible alternatives to the approach we selected. Instead of generating monoallelic sites only from alleles that don't cleanly unify, we could represent all deletions and MNPs in monoallelic sites. This would treat them more uniformly, and avoid complicating the simpler alleles, but it would reduce the completeness of the conveniently partitioned sites and genotypes. Second, we could take this to the extreme and simply represent all alleles, including the reference, in the monoallelic style (further reading). This more revolutionary approach wouldn't be palatable to our downstream users for the time being.
All entries in the GLnexus pVCF include a Reason for No Call (RNC) field, which is filled in if either GT entry is not called (.). The possible values of RNC include:
Code | Reason | Explanation |
---|---|---|
. | N/A | corresponding allele is called |
M | Missing data | input (gVCF) had no data at this genome position |
P | Partial data | input only partially covered this genome position |
D | Depth | read depth too low to call |
– | deletion | sample carries a deletion allele that couldn't be unified into this site; there may be more information in overlapping monoallelic site(s). |
L | Lost allele | ^ but other than deletion allele |
U | Unphased variants | sample carries multiple non-overlapping variants at this position, whose phase is not known, so the diploid genotype cannot be called safely. There may be more information in overlapping monoallelic site(s). |
O | Overlapping variants | sample carries multiple overlapping variants at this position, so the diploid genotype cannot be called safely. There may be more information in overlapping monoallelic site(s). |
1 | monoallelic | this is a monoallelic site; no assertion about presence/absence of any allele here |