LPL no smaller than PL #277
Thanks for sharing your findings. My suspicion is that a relatively small subset of entries uses a large number of alternate alleles, forcing all of the entries to have a large number of values, most of which would be fill values. I would be curious to know which samples are using a lot of alternate alleles and why. If those entries have a lot of positive AD values but not a lot of positive PL values, then we could consider ignoring the AD field when creating the LAA field. If instead those entries have a lot of positive PL values, then we may want to consider a lossy approach, such as only considering the alleles listed in the genotype field when creating the LAA field. Is the dataset that you are using public? I was having trouble finding it on the IGSR website. If you can provide a link to it, I would like to see if I can download it and evaluate it. In the meantime, I will try to find another small VCF dataset, experiment with it, and try using Hail to create some local-allele fields.
I was able to download chrY from this 1000 Genomes folder and filter for variants with many alleles. I was then able to run explode and encode and load the dataset using sgkit. I found that there are many entries where the PL value is full of positive values. Thus, vcf2zarr explode's current implementation counts all the alternate alleles as being observed in those entries, and then puts all of the PL values in the LPL field for those entries. My guess is that something similar is happening in your dataset.

Because the PL values are full of positive values, I think vcf2zarr needs to use a lossy approach to shrink the size of the PL field. One option is to just include the PL values corresponding to the alleles referenced in the GT field. This is easy to implement, and I verified that doing this results in an LPL field with a much smaller shape (there is a rough sketch of the idea below). I also loaded the chrY VCF into Hail and converted it to a Variant Dataset.
The Hail code that creates the local-allele fields does not appear to be doing anything sophisticated. For example, for the LA field, it just creates a range of numbers based on the allele count instead of determining which alleles are observed in the entry.
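To make the two ideas above concrete, here is a rough Python sketch (helper names are mine, not vcf2zarr or Hail code): `observed_local_alleles` localizes on the alleles referenced in GT and pulls out the matching PL entries using the standard VCF diploid ordering, where genotype (j, k) with j ≤ k sits at index k(k+1)/2 + j, while `hail_style_la` mimics the simple range-based LA described above.

```python
def observed_local_alleles(genotype, pl):
    """LAA/LPL for one diploid entry, keeping only alleles referenced in GT."""
    local = sorted({0} | {a for a in genotype if a > 0})  # REF plus called ALTs
    lpl = [
        pl[k * (k + 1) // 2 + j]
        for bk, k in enumerate(local)
        for j in local[: bk + 1]
    ]
    return local[1:], lpl  # LAA holds alternate-allele indices only


def hail_style_la(n_alleles):
    """Schematic version of the range-based LA behaviour described above."""
    return list(range(1, n_alleles))


# A 0/2 call at a site with 3 ALT alleles (the full PL has 10 entries):
print(observed_local_alleles((0, 2), list(range(10))))  # ([2], [0, 3, 5])
print(hail_style_la(4))                                 # [1, 2, 3]
```

Only genotypes built from the called alleles survive in the GT-based version, which is exactly what makes it lossy for the uncalled alleles.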
That's excellent @Will-Tyler, well done! If all 28 values are non-zero, then there's not a great deal we can do, as you say. I don't know how useful these values are, or whether anyone really uses them, but that's a different story. I'm surprised that this approach is so ineffective, when the SVCR paper pushes it so hard (and people have gone to so much trouble to get it into the VCF standard). I've been looking at data from the same pipeline as you (1000 Genomes, 2020), so maybe it's a property of the variant caller used there? Maybe different programs that output PLs are more discriminating? I've had a quick look, and it seems that 1000 Genomes phase 1 and phase 3 didn't use PL values. So, maybe we need to figure out which variant callers emit PLs and see if we can track down a dataset that has them?
From the VCF file, the variant caller appears to be GATK. I thought more about this and did some reading on variant calling today. First, after reading this GATK article on PL calculation, it is no longer surprising to me that the PL values are mostly positive with some zeroes. PL is the phred-scaled likelihood of each candidate genotype, normalized so that the most likely genotype gets a PL of 0, which means every other genotype gets a positive value. Perhaps vcf2zarr could only consider sufficiently small PL values (i.e. sufficiently likely genotypes) when deciding which alleles to localize.
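For concreteness, here is a small sketch of the standard phred-scaling relationship as I understand it from the GATK documentation; the helper names are mine and this is not GATK code:

```python
import math

def pl_from_likelihoods(likelihoods):
    """Phred-scale genotype likelihoods, normalized so the best genotype is 0."""
    raw = [-10 * math.log10(l) for l in likelihoods]
    best = min(raw)
    return [round(p - best) for p in raw]

def probability_from_pl(pl):
    """Relative likelihood implied by a normalized PL value."""
    return 10 ** (-pl / 10)

print(pl_from_likelihoods([1e-3, 1e-15, 1e-21]))  # [0, 120, 180]
print(probability_from_pl(120))                   # 1e-12
```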
Thanks for doing the background reading @Will-Tyler, that's very helpful.
I agree. I guess a first pass would be to only count alleles with AD>0, and see if this gets us any gains?
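A minimal numpy-flavoured sketch of that first pass might look like the following (illustrative only, with a hypothetical helper name rather than bio2zarr internals):

```python
import numpy as np

def laa_from_ad(ad):
    """Local alternate alleles for one entry: ALT alleles with any read support.

    `ad` is the per-allele depth vector (REF first); fill/negative values
    count as no support.
    """
    ad = np.asarray(ad)
    alt_depths = ad[1:]  # skip the REF depth
    return np.flatnonzero(alt_depths > 0) + 1  # 1-based ALT allele indices

print(laa_from_ad([486, 0, 0, 0, 0]))  # []    -> no local alternate alleles
print(laa_from_ad([5, 9, 0, 4]))       # [1 3] -> two local alternate alleles
```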
Is it worth making
Good idea, let me try this. There are 346 entries where all of the AD values are positive, so the LAA field and the LPL field have the same shape as before.
And smaller in VCF Zarr as well:
Presumably, fewer alleles are being counted as local, resulting in fewer of the PL values being copied over and the fill value being used instead.
That is a reasonable saving all right. Hardly seems worth all the effort though, if I'm honest, as it's the dimensions of the PL field that cause problems more than the overall storage size. Truncating to a max of 127 seems entirely reasonable anyway, so that'll reduce our storage 4-fold here. Spending all that space just to encode precisely how unlikely the variant caller thinks combinations of alleles with no reads are seems pretty silly. What happens if we truncate to just keep zero and the next-most-likely value? I must do some reading to figure out how people use PLs...
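A hedged sketch of the truncation idea (my own helper, not an existing bio2zarr option): clip non-negative PL values at 127 so the field fits in int8, keeping any negative fill values intact.

```python
import numpy as np

def truncate_pl(pl, cap=127):
    """Clip PL values so they fit in int8, leaving negative fill values alone."""
    pl = np.asarray(pl)
    out = np.where(pl >= 0, np.minimum(pl, cap), pl)
    return out.astype(np.int8)

print(truncate_pl([0, 120, 1800, -2]))  # [  0 120 127  -2]
```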
Thank you for sharing this @tomwhite! The tag2tag plugin does not support localization yet (see here), but the merge command has some interesting functionality that I had not considered before, so I played around with it.

The user tells the merge command to use a certain number of local alleles. The merge command then decides which alleles to include by calculating, for each alternate allele, the sum of the probabilities of the genotypes that reference the allele. The merge command uses the alleles with the highest genotype-probability sums as the local alleles.

The merge command's approach only uses the PL field and the user's input to determine which alleles to make local. This approach also controls the shape of the output data. If we decide to implement the same approach, we can test against the merge command's implementation. If we want to adopt this approach in bio2zarr, I would certainly be happy to work on it.
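My reading of that scoring scheme, sketched in Python rather than bcftools' actual C implementation (the function name and example PLs are mine):

```python
def top_local_alleles(pl, n_alt, n_local):
    """Score each ALT allele by summing the probabilities of the genotypes
    that reference it, then keep the n_local best-scoring alleles.

    `pl` is the full diploid PL vector in VCF ordering, where genotype
    (j, k), j <= k, lives at index k * (k + 1) // 2 + j.
    """
    scores = [0.0] * (n_alt + 1)  # index 0 is REF, not used in the ranking
    for k in range(n_alt + 1):
        for j in range(k + 1):
            prob = 10 ** (-pl[k * (k + 1) // 2 + j] / 10)
            for allele in {j, k}:
                scores[allele] += prob
    ranked = sorted(range(1, n_alt + 1), key=lambda a: scores[a], reverse=True)
    return sorted(ranked[:n_local])

# Two ALT alleles; the PLs favour genotypes involving allele 2.
print(top_local_alleles([60, 30, 20, 0, 10, 25], n_alt=2, n_local=1))  # [2]
```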
After doing some digging, I'm pretty sure this isn't working as intended. If we look at one of our examples with lots of alleles:
Cutting out the interesting bits, we have three samples, all of which have 0/0 genotypes, and the AD values are e.g. 486,0,0,0,0. That is, there's an allele depth of 486 for the reference allele, and nothing at all for the other alleles. So, based on this, our LAA values should be [] for each of these samples, but instead every alternate allele is coming out as local.
The problem is that we're treating any non-zero PL value as meaning the corresponding alleles need to be kept, but actually larger PL values mean the genotype is more unlikely. Our PL values here are:
This says that the probability of the 0/0 call being correct is 1 (PL=0), and the other genotypes have a probability of 10^-12 or 10^-180 of being correct. There's no real information in the values for the various combinations of alleles; they're just arbitrary numbers assigned based on the allele depths. So, I think the correct output here for us is an empty LAA for these samples.
I think as a first pass we should take the local alleles to be those that have non-zero AD values, and see how that goes. What do you think @Will-Tyler - am I missing something?
See also #185
I commented on your PR (#285). But overall, I think your understanding is correct. We tried only localizing the alleles with positive AD values, but that approach doesn't improve the shape. Only using the alleles that are called in the genotypes seems reasonable. Happy to finish the fix you started.
I used the performance testing dataset from vcztools. The LPL field has the correct shape and is a fraction of the PL field size.
Digging into this again, I think the only reasonable approach for us is to localise based on the observed genotypes (that is, we have a maximum of 2 local alleles). It's important to observe that this is a lossy operation, as it will result in dropping allele depth information and non-recoverable PL values in corner cases. Consider this example from 1000G data:
So, the actual call is A/GTGTATATATATA, and these have allele depths of 5/9. But there's also an allele depth of 4 for GTA, so the call is really quite marginal, and throwing away the AD data for the non-called alleles is quite lossy if one were ever going to go back and query this call. What's not clear to me here is what Hail does in this case. There's no indication anywhere in the documentation or write-up that localising is lossy, but there's also no mention of any awkward cases like this. I think I'll email Konrad about it and ask him.
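To spell out the lossiness with a sketch (the allele indices here are hypothetical stand-ins for the example above, not taken from the actual record): if ALT alleles 1 and 2 are called with depths 5 and 9, the 4 reads supporting the uncalled third ALT allele are simply discarded when localising on GT.

```python
def localize_ad(ad, genotype):
    """Keep only the depths for REF and the called alleles; report what is lost."""
    local = sorted({0} | {a for a in genotype if a > 0})
    kept = [ad[i] for i in local]
    dropped = {i: d for i, d in enumerate(ad) if i not in local and d > 0}
    return kept, dropped

# Hypothetical indices: ALT alleles 1 and 2 are called (depths 5 and 9),
# while uncalled ALT allele 3 still has 4 supporting reads that get dropped.
print(localize_ad([0, 5, 9, 4], (1, 2)))  # ([0, 5, 9], {3: 4})
```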
Closing this as done in #299 |
After running conversion on 1000 Genomes chr2 data, I'm not seeing any reduction in the size of PL fields. Here's the ICF:
and on the VCZ (for a small number of variants, using --max-variant-chunks):
So, there's no reduction in the maximum dimension and the storage sizes are essentially identical.
Have you any ideas what might be going on here @Will-Tyler?
I think it would be really worthwhile getting some truth data for LPL that we could compare with. It does seem that running Hail is the only way to do this, so it's probably worth the effort.