LPL no smaller than PL #277

Closed
jeromekelleher opened this issue Jul 23, 2024 · 16 comments
Labels
bug Something isn't working

Comments

@jeromekelleher
Contributor

After running conversion on 1000 Genomes chr2 data, I'm not seeing any reduction in the size of PL fields. Here's the ICF:

$ vcf2zarr inspect 1kg_chr2_lpl.icf/ | grep PL
FORMAT/PL                 Integer      6906  430.58 GiB  60.6 GiB           28  0          3.2e+05
FORMAT/LPL                Integer      6906  430.58 GiB  60.57 GiB          28  -1         3.2e+05
$ vcf2zarr inspect 1kg_chr2_lpl.icf/ | grep 'LAA'
FORMAT/LAA                Integer      4534  282.07 GiB  315.18 MiB          6  -2         6

and on the VCZ (for a small number of variants, using --max-variant-chunks):

$ vcf2zarr inspect 1kg_chr2_lpl.vcz/ | grep PL
/call_LPL                     int32    8.27 MiB    342.01 MiB   41             40  8.55 MiB      211.78 KiB          (1000, 3202, 28)  (100, 1000, 28)  Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/call_PL                      int32    8.27 MiB    342.01 MiB   41             40  8.55 MiB      211.78 KiB          (1000, 3202, 28)  (100, 1000, 28)  Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
$ vcf2zarr inspect 1kg_chr2_lpl.vcz/ | grep LAA
/call_LAA                     int8     102.69 KiB  18.32 MiB   180             40  469.04 KiB    2.57 KiB            (1000, 3202, 6)   (100, 1000, 6)   Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None

So, there's no reduction in the maximum dimension and the storage sizes are essentially identical.

Have you any ideas what might be going on here @Will-Tyler?

I think it would be really worthwhile getting some truth data for LPL that we could compare with. It does seem that running Hail is the only way to do this, so it's probably worth the effort.

@Will-Tyler
Contributor

Thanks for sharing your findings. My suspicion is that a relatively small subset of entries uses a large number of alternate alleles, forcing all of the entries to have a large number of values, most of which would be fill values. I would be curious to know which samples are using a lot of alternate alleles and why. If those entries have a lot of positive AD values but not a lot of positive PL values, then we could consider excluding the AD field when creating the LAA field. If instead those entries have a lot of positive PL values, then we may want to consider a lossy approach, such as only considering the alleles listed in the genotype field when creating the LAA field.

Is the dataset that you are using public? I was having trouble finding it on the IGSR website. If you can provide a link to it, I would like to see if I can download it and evaluate it. In the meantime, I will try to find another small VCF dataset, experiment with it, and try using Hail to create some local-allele fields.

@Will-Tyler
Contributor

Will-Tyler commented Jul 23, 2024

I was able to download chrY from this 1000 Genomes folder and filter for variants with many alleles. I was then able to run explode and encode, and to load the dataset using sgkit.

I found that there are many entries where the PL value is full of positive values. Thus, vcf2zarr explode's current implementation counts all the alternate alleles as being observed in those entries, and then puts all of the PL values in the LPL field for those entries. My guess is that something similar is happening in your dataset.

Because the PL values are full of positive values, I think vcf2zarr needs to use a lossy approach to shrink the size of the PL field. One option is to just include the PL values corresponding to the alleles referenced in the GT field. This is easy to implement, and I verified that doing this results in an LPL field with shape (1164, 2504, 6) compared with the PL field's shape of (1164, 2504, 28) in my example dataset. Please let me know what you think. I would be happy to implement this change.
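To make that option concrete, here is a rough sketch of restricting PL to genotypes built from the GT-called alleles (illustrative only, using numpy; the function names and the -2 fill value are my assumptions, not the actual vcf2zarr code):

import numpy as np

def pl_index(a, b):
    # Position of genotype a/b (a <= b) in the standard VCF PL ordering.
    return b * (b + 1) // 2 + a

def localise_pl(genotype, pl, fill=-2):
    # Local alleles: the distinct alleles called in GT, always keeping REF.
    local = sorted(set(genotype) | {0})
    n = len(local)
    lpl = np.full(n * (n + 1) // 2, fill, dtype=pl.dtype)
    for j in range(n):
        for i in range(j + 1):
            lpl[pl_index(i, j)] = pl[pl_index(local[i], local[j])]
    return local, lpl

# 7 alleles -> 28 PL values; a 0/2 call keeps only 3 of them.
print(localise_pl([0, 2], np.arange(28)))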


I also loaded the chrY VCF into Hail and converted it to a Variant Dataset using transform_gvcf. The resulting VDS appears to have all the LPL values for most of the entries. The entries that I am seeing with values have all 28 values. For example,

+------------------------------------------------------------------------------+
| 'HG00096'.LPL                                                                |
+------------------------------------------------------------------------------+
| array<int32>                                                                 |
+------------------------------------------------------------------------------+
| [0,36,540,36,540,540,36,540,540,540,36,540,540,540,540,36,540,540,540,540... |
| [0,18,270,18,270,270,18,270,270,270,18,270,270,270,270,18,270,270,270,270... |
| [349,349,349,24,24,0,349,349,24,349,349,349,24,349,349,349,349,24,349,349... |
| [0,33,382,33,382,382,33,382,382,382,33,382,382,382,382,33,382,382,382,382... |
| [0,0,242,0,242,242,0,242,242,242,0,242,242,242,242,0,242,242,242,242,242,... |
| [0,0,119,0,119,119,0,119,119,119,0,119,119,119,119,0,119,119,119,119,119,... |
| [0,28,610,28,610,610,28,610,610,610,28,610,610,610,610,28,610,610,610,610... |
| [0,0,20,0,20,20,0,20,20,20,0,20,20,20,20,0,20,20,20,20,20,0,20,20,20,20,2... |
| [0,0,117,0,117,117,0,117,117,117,0,117,117,117,117,0,117,117,117,117,117,... |
| [330,330,330,330,330,330,330,330,330,330,30,30,30,30,0,330,330,330,330,30... |
+------------------------------------------------------------------------------+

The Hail code that creates the local-allele fields does not appear to be doing anything sophisticated. For example, for the LA field, it just creates a range of numbers based on the allele count instead of determining which alleles are observed in the entry.
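In other words, the effect appears to be roughly this (my paraphrase of the behaviour described above, not Hail's actual source; whether the range starts at the REF or the first ALT allele is a detail I'm glossing over):

def hail_like_la(n_alleles):
    # Simply enumerate all allele indices, rather than restricting to
    # the alleles actually observed in the entry.
    return list(range(n_alleles))  # e.g. 5 alleles -> [0, 1, 2, 3, 4]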

@jeromekelleher
Contributor Author

That's excellent @Will-Tyler, well done! If all 28 values are non-zero, then there's not a great deal we can do, as you say. I don't know how useful these values are, or whether anyone really uses them, but that's a different story.

I'm surprised that this approach is so ineffective, when the SVCR paper pushes it so hard (and people have put so much effort into getting it into the VCF standard).

I've been looking at data from the same pipeline as you (1000 Genomes, 2020), so maybe it's a property of the variant caller used there? Maybe different programs that output PLs are more discriminating? I've had a quick look and it seems that 1000 Genomes phase 1 and phase 3 didn't use PL values.

So, maybe we need to figure out which variant callers emit PLs and see if we can track down a dataset that has them?

@Will-Tyler
Contributor

Will-Tyler commented Jul 24, 2024

From the VCF file, the variant caller appears to be GATK. I thought more about this and did some reading on variant calling today. First, after reading this GATK article on PL calculation, it is no longer surprising to me that the PL values are mostly positive with some zeroes. PL is -10 * log10(P(genotype | sequencing data)), normalized by subtracting the smallest PL value from all of the PL values so that the smallest PL value is always zero. Genotypes with a PL of zero are the most likely genotypes in the entry, and the other, positive PL values correspond to less likely genotypes (the higher the PL, the less likely the corresponding genotype). Thus, the called genotype usually has a PL value of zero, while the other, less likely genotypes have positive PL values.
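As a concrete (made-up) illustration of that normalisation:

import numpy as np

# Hypothetical unnormalised Phred-scaled values, -10 * log10(P(genotype | data)),
# for the three genotypes 0/0, 0/1 and 1/1.
raw = np.array([40.0, 10.0, 300.0])
pl = np.round(raw - raw.min()).astype(int)
print(pl)  # [ 30   0 290] -> 0/1 is the most likely genotype (PL == 0)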

Perhaps vcf2zarr could only consider PL values below a certain threshold (such as 60, which corresponds to a probability of 1 in a million), ignoring values above it (less likely than the threshold), when creating the LAA and LPL fields. Alternatively, vcf2zarr could rely on other fields such as AD to determine which alleles should be considered local. It seems reasonable to me to ignore PL values for genotypes that have an allele with zero statistically-significant reads (according to the AD field).


@jeromekelleher
Contributor Author

Thanks for doing the background reading @Will-Tyler, that's very helpful.

It seems reasonable to me to ignore PL values for genotypes that have an allele with zero statistically-significant reads (according to the AD field).

I agree. I guess a first pass would be to only count alleles with AD>0, and see if this gets us any gains?
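Something like this, as a sketch of the AD>0 rule (numpy; the names are hypothetical, not the vcf2zarr implementation):

import numpy as np

def laa_from_ad(ad):
    # Keep only the alternate alleles (index >= 1) with at least one supporting read.
    ad = np.asarray(ad)
    return np.nonzero(ad[1:] > 0)[0] + 1

print(laa_from_ad([446, 0, 0, 0, 0]))      # [] -> no local alternate alleles
print(laa_from_ad([1, 5, 4, 0, 0, 9, 0]))  # [1 2 5]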

@tomwhite
Contributor

Is it worth making --no-local-alleles the default for the moment?

@Will-Tyler
Contributor

I guess a first pass would be to only count alleles with AD>0, and see if this gets us any gains?

Good idea—let me try this. There are 346 entries where all of the AD values are positive, so the LAA field and the LPL field have the same shape as before ((1164, 2504, 6) and (1164, 2504, 28), respectively). But the storage size is smaller in ICF:

vcf2zarr inspect data/1kg_chrY_limited.icf | grep PL
FORMAT/PL                 Integer         7  311.46 MiB  20.11 MiB          28  0          4.6e+04
FORMAT/LPL                Integer         4  75.81 MiB   10.33 MiB          28  -1         4.6e+04

And smaller in VCF Zarr as well:

vcf2zarr inspect data/1kg_chrY_limited.vcz | grep PL
/call_PL                      int32    21.22 MiB   311.32 MiB   15              3  103.77 MiB    7.07 MiB            (1164, 2504, 28)  (10000, 1000, 28)  Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/call_LPL                     int32    13.17 MiB   311.32 MiB   24              3  103.77 MiB    4.39 MiB            (1164, 2504, 28)  (10000, 1000, 28)  Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None

Presumably, fewer alleles are being counted as local, resulting in fewer of the PL values being copied over and the fill value being used instead.

@jeromekelleher
Contributor Author

That is a reasonable saving all right. Hardly seems worth all the effort though, if I'm honest, as it's the dimensions of the PL field that cause problems more than the overall storage size.

Truncating to a max of 127 seems entirely reasonable anyway, so that'll reduce our storage 4-fold here. Spending all that space just to encode precisely how unlikely the variant caller thinks a combination of alleles that have no reads is seems pretty silly.
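For example, the truncation itself is trivial (a sketch; the 4-fold saving comes from storing int8 rather than the current int32):

import numpy as np

pl = np.array([0, 120, 1800, 120, 1800, 1800], dtype=np.int32)
# Clamp to the int8 range so each value takes 1 byte instead of 4.
pl8 = np.clip(pl, 0, 127).astype(np.int8)
print(pl8)  # [  0 120 127 120 127 127]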

What happens if we truncate to just keep zero and the next-most likely value?

I must do some reading to figure out how people use PLs...

@tomwhite
Contributor

Not sure if you are aware of this, but I found this page that mentions that bcftools has some support for local alleles, namely in the merge command and the tag2tag plugin.

@Will-Tyler
Contributor

Thank you for sharing this @tomwhite! The tag2tag plugin does not support localization yet (see here), but the merge command has some interesting functionality that I had not considered before. I played around with the merge command using the --force-single and --local-alleles options.

The user tells the merge command to use a certain number of local alleles. The merge command then decides which alleles to include by calculating, for each alternate allele, the sum of the probabilities of the genotypes that reference the allele. The merge command uses the alleles with the highest genotype probability sums as the local alleles. (Source: init_local_alleles.)
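A rough sketch of that scoring, as I understand it (my reading of the description above, not the actual bcftools code; the helper names are made up):

import numpy as np

def rank_alt_alleles(pl, n_alleles, n_local):
    # Convert Phred-scaled PLs back to probabilities and, for each allele,
    # sum the probabilities of the genotypes that reference it.
    prob = 10.0 ** (-np.asarray(pl, dtype=float) / 10.0)
    score = np.zeros(n_alleles)
    idx = 0
    for b in range(n_alleles):
        for a in range(b + 1):
            score[a] += prob[idx]
            score[b] += prob[idx]  # homozygotes count twice in this sketch
            idx += 1
    # Keep the n_local alternate alleles with the highest scores.
    top = np.argsort(score[1:])[::-1][:n_local] + 1
    return sorted(int(i) for i in top)

# 3 alleles -> 6 PL values; allele 2 dominates, allele 1 is barely supported.
print(rank_alt_alleles([60, 60, 120, 0, 30, 10], n_alleles=3, n_local=1))  # [2]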

The merge command's approach only uses the PL field and the user's input to determine which alleles to make local. This approach also controls the shape of the output data. If we decide to implement the same approach, we can test against the merge command's implementation. If we want to adopt this approach in bio2zarr, I would certainly be happy to work on it.

@jeromekelleher
Contributor Author

jeromekelleher commented Oct 23, 2024

After doing some digging, I'm pretty sure this isn't working as intended.

If we look at one of our examples with lots of alleles, tests/data/vcf/1kg_2020_chrM.vcf.gz we have the following bit of VCF:

chrM    55      .       TA      TAA,T,CA,AA     2.31138e+06     VQSRTrancheINDEL99.00to100.00   AC=0,0,0,0;AF=0.0003123,0.00406,0.001874,0.0006246;AN=6;BaseQRankSum=0.733;ClippingRankSum=0.019;DP=1505275;ExcessHet=-0;FS=0;InbreedingCoeff=0.9998;MLEAC=2,26,12,4;MLEAF=0.0003123,0.00406,0.001874,0.0006246;MQ=58.79;MQ0=0;MQRankSum=1.1;QD=29.67;ReadPosRankSum=0.906;SOR=0.497;VQSLOD=-426;culprit=DP   GT:AD:DP:GQ:PGT:PID:PL  0/0:446,0,0,0,0:446:99:.:.:0,120,1800,120,1800,1800,120,1800,1800,1800,120,1800,1800,1800,1800        0/0:393,0,0,0,0:393:99:.:.:0,120,1800,120,1800,1800,120,1800,1800,1800,120,1800,1800,1800,1800  0/0:486,0,0,0,0:486:99:.:.:0,120,1800,120,1800,1800,120,1800,1800,1800,120,1800,1800,1800,1800

Cutting out the interesting bits, we have three samples, all of which have 0/0 genotypes, and the AD values are e.g. 486,0,0,0,0. That is, there's an allele depth of 486 for the reference allele, and nothing at all for the other alleles. So, based on this, our LAA values should be [] for each of these samples, but it's coming out as

array([[1, 2, 3, 4],
       [1, 2, 3, 4],
       [1, 2, 3, 4]], dtype=int8)

The problem is that we're treating non-zero PL values as meaning the corresponding alleles need to be kept, but actually larger PL values mean less likely. Our PL values here are

[   0,  120, 1800,  120, 1800, 1800,  120, 1800, 1800, 1800,  120,
        1800, 1800, 1800, 1800],

This says that the probability of the 0/0 call being correct is 1 (PL=0), and other genotypes have a probability of 10^-12 or 10^-180 of being correct. There's no information in the various combinations of alleles, it's just some arbitrary number assigned based on the allele depths.
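For reference, decoding which genotype each PL position corresponds to (the standard VCF ordering, index(a, b) = b*(b+1)/2 + a for a <= b) makes the pattern obvious:

def genotype_from_pl_index(idx):
    # Invert the VCF ordering index(a, b) = b * (b + 1) // 2 + a, a <= b.
    b = 0
    while (b + 1) * (b + 2) // 2 <= idx:
        b += 1
    return idx - b * (b + 1) // 2, b

pl = [0, 120, 1800, 120, 1800, 1800, 120, 1800, 1800, 1800,
      120, 1800, 1800, 1800, 1800]
for i, v in enumerate(pl):
    print(genotype_from_pl_index(i), v)
# (0, 0) -> 0, every 0/x het -> 120, every genotype with two ALT alleles -> 1800.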

So, I think the correct output here for us is

GT=0/0 LAA=[] LAD=[446] LPL=[0,120,1800]

I think as a first pass we should take the local alleles to be those that have non-zero AD values, and see how that goes.

What do you think @Will-Tyler - am I missing something?

@jeromekelleher added the bug (Something isn't working) label on Oct 23, 2024
@jeromekelleher
Contributor Author

See also #185

@Will-Tyler
Contributor

I commented on your PR—#285 (comment). But overall, I think your understanding is correct. We tried only localizing the alleles with positive AD values, but that approach doesn't improve the shape. Only using the alleles that are called in the genotypes seems reasonable.

Happy to finish the fix you started.

@Will-Tyler
Contributor

I'm interested to see how much actual information we lose here with respect to PL values.

vcf2zarr explode ../vcztools/performance/data/chr22.vcf.gz data/chr22.icf --local-alleles
vcf2zarr inspect data/chr22.icf | grep PL
FORMAT/PL                 Integer        15  835.24 MiB  109.54 MiB         28  0          5.2e+04
FORMAT/LPL                Integer        12  636.61 MiB  21.85 MiB           6  -2         4.4e+04
vcf2zarr encode data/chr22.icf data/chr22.vcz
vcf2zarr inspect data/chr22.vcz | grep PL
/call_PL                      int32    146.51 MiB  5.62 GiB      39              9  639.34 MiB    16.28 MiB           (21514, 2504, 28)  (10000, 1000, 28)  Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None
/call_LPL                     int32    22.58 MiB   1.2 GiB       55              9  137 MiB       2.51 MiB            (21514, 2504, 6)   (10000, 1000, 6)   Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0)   None

I used the performance testing dataset from vcztools. The LPL field has the correct shape and is a fraction of the PL field size.

@jeromekelleher
Contributor Author

Digging into this again, I think the only reasonable approach to doing this for us is to localise based on the observed genotypes (that is, we have a max of 2 local alleles). It's important to observe that this is a lossy operation, as it will result in dropping allele depth information and non-recoverable PL values in corner cases. Consider this example from 1000G data:

alleles = ['G', 'A', 'GTGTA', 'GTGTATATATA', 'GTA', 'GTGTATATATATA',  'GTGTATATATATATATATA']
GT = 1/5
AD = [1, 5, 4, 0, 0, 9, 0]

so the actual call is A/GTGTATATATATA, with allele depths of 5 and 9. But there's also an allele depth of 4 for GTA, so the call is really quite marginal, and throwing away the AD data for the non-called alleles is quite lossy if one were ever going to go back and query this call.
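To make the loss concrete for this example (a sketch; whether LAD keeps the REF depth is an assumption on my part):

import numpy as np

alleles = ['G', 'A', 'GTGTA', 'GTGTATATATA', 'GTA', 'GTGTATATATATA',
           'GTGTATATATATATATATA']
gt = [1, 5]
ad = np.array([1, 5, 4, 0, 0, 9, 0])

laa = sorted(set(gt) - {0})              # [1, 5]: localise to the called alleles
lad = ad[[0] + laa]                      # [1, 5, 9]
# The depth of 4 supporting 'GTA' (allele 2) is silently discarded.
dropped = {alleles[i]: int(ad[i])
           for i in range(1, len(ad)) if ad[i] > 0 and i not in laa}
print(laa, lad, dropped)                 # [1, 5] [1 5 9] {'GTA': 4}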

What's not clear to me here is what Hail does in this case. There's no indication anywhere in the documentation or writeup that localising is lossy, but there's also no mention of any awkward cases like this. I think I'll email Konrad about it and ask him.

@jeromekelleher
Contributor Author

Closing this as done in #299
