-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Phased GT encoding #1872
Comments
Your code looks incorrect. See bcf_format_gt() to see how to read GT values (up to VCF4.3). Compared with your code, note that only one value is accessed for each GT. It is also not valid C to cast the pointer into the GT data to a wider type, as it may not be correctly aligned. That's why |
@daviesrob , please see this bcf_all_phased. I feel I used the same approach as what it does. However, |
Ah, I see your code was a bit specialised to having two genotypes per sample. If you look at the code for for (isample=0; isample<line->n_sample; isample++) and then all the genotypes for that sample in an inner loop: for (i=0; i<fmt_ptr->n; i++) \ as VCF supports samples with ploidy greater than two. Apart from assuming a diploid record, your code is actually working as expected. The first genotype always appearing unphased comes from the VCF encoding, where the phasing marker always appears between the allele numbers: |
I see your point now. As in the current version, the first position will always be unphased despite the phased info for the rest. Looking forward to the new version. |
I am working a project using <htslib/vcf.h>. In particular, I am trying to access gene type (GT) info. I've shared my findings below.
https://github.com/samtools/htslib/blob/c705bec2af9caedacc6ba80dfba7fe625eb5ba5c/htslib/vcf.h#L1019C1-L1028C1
Later, one can also decode its
val
back toidx
and phase. However, when I am parsing an example phased bcf file as attached, I found the followingThe first record
chr1:54490, GT=1|0; ./.
, according to the above encode algorithm, my projected values should be(((1) + 1) << 1 | 1) = 2 * 2 + 1 = 5,
(((0) + 1) << 1 | 1) = 1 * 2 + 1 = 3.
However, the observed values are
4
and3
from(int8_t *)fmt->p
.The second record
chr1:634553, GT=./.;0:1
. Similarly, the projected values are3
and5
. However, the observed values are2
and5
.This leads to my conjecture that
(int8_t *)fmt->p [0]
is not been considered when encoding somewhere.Then, I checked for the unphased same bcf file, the encoded values are
(4, 2)
and(2,4)
, which is consistent with the above code snippets.Albeit, the current coding scheme doesn’t affect
bcf_gt_allele
at all, and has some effect when evaluatingbcf_gt_is_phased(val)
.I’m not sure if there is any special consideration for position 0. Otherwise, I would be more than happy to work with you to look into the details. I have attached the code I used to read.
test_read_bcf.txt
The text was updated successfully, but these errors were encountered: