You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We're considering a significant rework of GLnexus' output pVCF representation. This issue is here for early show-and-tell and feedback. We have not yet decided to execute on it.
To review briefly, given a set of overlapping alleles discovered in the population, GLnexus currently unifies as many of them as possible into non-overlapping multiallelic sites of mutually exclusive alleles; and for the few remaining alleles (typically <1%) that don't unify cleanly, emits a secondary tier of "monoallelic" records to indicate their copy number, without duplicating reference and other calls in the overlapping sites.
Contrived example:
Alice Bob Carol
chr17 1000 A T,C ./. 1/2 0/0
chr17 1000 ACG A 1/1 ./. ./. *MONOALLELIC
chr17 1001 CGA C,GGA ./. 0/1 1/2
Pro:
overlapping records are minimized
pVCF records are ~independent with no duplication between them
multi-ALT records model all possible "het-ALT" genotypes (heterozygotes carrying two different non-reference alleles) at a site/locus & their full joint PL
Con:
allele padding (to unify with lengthier, overlapping alleles) is common & inconvenient
PL field grows quadratically in # of alleles at a site
large expenditure of representation effort/space to accommodate all possible het-ALT genotypes, few if any of which are usually observed. (In 1KGP chr17 data, there are only 196K distinct het-ALT genotypes observed with 3.6M ALT alleles.)
situations that give rise to "monoallelic" records seem arbitrary to users
downstream programs may badly overestimate AF=AC/AN in monoallelic records
Proposed alternative
A pending VCF spec proposal endorses the splitting of het-ALT genotypes across two overlapping VCF records, thereby relieving pressure to unify all overlapping alleles into a single, multi-ALT pVCF site. GLnexus historically avoided such split/overlapping genotypes, but the field has generally grown more comfortable with this kind of representation since we started in 2015. The key is the use of the "star allele" to symbolize some allele not specified in the current record, but that should be found in another overlapping record.
Several variant callers now use the star allele or something like it, albeit -- of course -- without consensus on important details of its interpretation (especially how to describe reference bases in the vicinity of het-ALT genotypes). We'll therefore still have to make some "artistic" choices about when to use multi-ALT sites versus a series of overlapping records involving star alleles. The most uniform & predictable approach, following bgt, seems to move toward one pVCF record per ALT allele:
Generate one pVCF record for every discovered alternate allele passing quality thresholds.
The record's alleles are 0=REF, 1=ALT, 2=* and are never padded.
het-ALT genotypes are always split across two overlapping records with GT=1/2.
The concept of "site" or "locus" is demoted. The main drawback is making it harder to analyze het-ALT genotypes; even if the reader knows to reconstruct the genotype by examining overlapping records, they're only shown marginals of the joint PL distribution.
Applied to our example:
Alice Bob Carol
chr17 1000 A T,* 2/2 1/2 0/0
chr17 1000 A C,* 2/2 1/2 0/0
chr17 1000 ACG A,* 1/1 2/2 0/0
chr17 1001 C G,* 2/2 0/2 1/2
chr17 1001 CGA C,* 2/2 0/1 1/2
Optional: idea to furthermore keep het-ALT genotypes & PLs
Furthermore generate a multi-ALT pVCF record for each distinct het-ALT genotype actually observed in the cohort above a quality threshold: 0=REF, 1=ALT1, 2=ALT2, 3=*.
Alice Bob Carol
chr17 1000 A T,* 2/2 1/2 0/0
chr17 1000 A C,* 2/2 1/2 0/0
chr17 1000 A T,C,* 3/3 1/2 0/0 *HET-ALT
chr17 1000 ACG A,* 1/1 2/2 0/0
chr17 1001 C G,* 2/2 0/2 1/2
chr17 1001 CGA C,* 2/2 0/1 1/2
chr17 1001 CGA C,GGA,* 3/3 0/1 1/2 *HET-ALT
We can thus include the joint PL for het-ALTs, while avoiding PL quadratic blowup since no one record has more than three ALTs. Duplication of information between the regular and het-ALT record is a concern, as it invites naive double-counting of copy numbers. That's further exacerbated by potential padding of the alleles in het-ALT records, making the redundancy non-trivial to recognize. Possibly we could segregate the het-ALT record in a separate file?
At least the cause of the second class of sites is more intuitive and predictable compared to the monoallelic sites...
The text was updated successfully, but these errors were encountered:
I'd add to the "cons" list that MONOALLELIC records make it impossible to distinguish "does not have this deletion" from "no data".
This leads to many downstream tools, e.g. plink --vcf-half-call m interpreting a MONOALLELIC variant as "missing for all (or almost all) samples" and leaving no usable data--when in fact the correct interpretation is actually "confidently absent in most samples, confidently present in a few samples"
We're considering a significant rework of GLnexus' output pVCF representation. This issue is here for early show-and-tell and feedback. We have not yet decided to execute on it.
Pros and cons of current representation
To review briefly, given a set of overlapping alleles discovered in the population, GLnexus currently unifies as many of them as possible into non-overlapping multiallelic sites of mutually exclusive alleles; and for the few remaining alleles (typically <1%) that don't unify cleanly, emits a secondary tier of "monoallelic" records to indicate their copy number, without duplicating reference and other calls in the overlapping sites.
Contrived example:
Pro:
Con:
Proposed alternative
A pending VCF spec proposal endorses the splitting of het-ALT genotypes across two overlapping VCF records, thereby relieving pressure to unify all overlapping alleles into a single, multi-ALT pVCF site. GLnexus historically avoided such split/overlapping genotypes, but the field has generally grown more comfortable with this kind of representation since we started in 2015. The key is the use of the "star allele" to symbolize some allele not specified in the current record, but that should be found in another overlapping record.
Several variant callers now use the star allele or something like it, albeit -- of course -- without consensus on important details of its interpretation (especially how to describe reference bases in the vicinity of het-ALT genotypes). We'll therefore still have to make some "artistic" choices about when to use multi-ALT sites versus a series of overlapping records involving star alleles. The most uniform & predictable approach, following bgt, seems to move toward one pVCF record per ALT allele:
The concept of "site" or "locus" is demoted. The main drawback is making it harder to analyze het-ALT genotypes; even if the reader knows to reconstruct the genotype by examining overlapping records, they're only shown marginals of the joint PL distribution.
Applied to our example:
Optional: idea to furthermore keep het-ALT genotypes & PLs
We can thus include the joint PL for het-ALTs, while avoiding PL quadratic blowup since no one record has more than three ALTs. Duplication of information between the regular and het-ALT record is a concern, as it invites naive double-counting of copy numbers. That's further exacerbated by potential padding of the alleles in het-ALT records, making the redundancy non-trivial to recognize. Possibly we could segregate the het-ALT record in a separate file?
At least the cause of the second class of sites is more intuitive and predictable compared to the monoallelic sites...
The text was updated successfully, but these errors were encountered: