Adapt hl.de_novo function #760 (base: main)
Conversation
thanks for adding these functions! I have some questions (many because I'm not that familiar with this work) and suggestions
gnomad/sample_qc/relatedness.py (outdated)

        de_novo_prior=de_novo_prior,
    )

    # Determine genomic context
rather than running this in this function and in calculate_de_novo_post_prob, why not switch the order and run get_genomic_context first, then pass the three return values to calculate_de_novo_post_prob as function arguments?
I also forgot to ask in my initial review -- could you add tests for the new functions?
Co-authored-by: Katherine Chao <[email protected]>
I just saw your comment on adding tests, I will work on that now.
Back to you!
thanks for adding tests! I have some more questions and suggestions
        - `fail`: Boolean indicating if the variant fails any checks.
        - `fail_reason`: Set of strings with reasons for failure.
    """
    p_de_novo = calculate_de_novo_post_prob(
makes sense to me to keep the actual code in gnomad_qc, but can you add this documentation back to the function? I took this from Julia's function:
.. warning::

    This method assumes that the PL and AD fields are present in the genotype fields
    of the child and parents. If they are missing, this method will not work. Many of
    our larger datasets have the PL and AD fields intentionally removed to save
    storage space. If this is the reason that the PL and AD fields are missing, the
    only way to use this method is to set them to their approximate values:

    .. code-block:: python

        PL=hl.or_else(PL, [0, GQ, 2 * GQ])
        AD=hl.or_else(AD, [DP, 0])
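As a sketch of what those fallbacks compute, here is the same approximation in plain Python (illustrative helper names and example GQ/DP values; the real code applies hl.or_else to Hail entry fields):

```python
# Sketch of the PL/AD fallback values from the warning above, written as
# plain Python for illustration. In Hail these are hl.or_else expressions
# applied when the fields are missing.

def approximate_pl(gq: int) -> list:
    """Approximate PL for a hom-ref call from GQ: [0, GQ, 2 * GQ]."""
    return [0, gq, 2 * gq]

def approximate_ad(dp: int) -> list:
    """Approximate AD for a hom-ref call from DP: all reads assigned to ref."""
    return [dp, 0]

print(approximate_pl(30))  # [0, 30, 60]
print(approximate_ad(25))  # [25, 0]
```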
this is very clear documentation for our users, both as a warning that this function needs AD and PL and as a guide to approximating them if they need to do so
gnomad/sample_qc/relatedness.py (outdated)

    using the likelihoods of the proband's and parents' genotypes and the population
    frequency prior for the variant.

    Based on Kaitlin Samocha's [de novo caller](
can you also add a version of this documentation back to the docstring?
The simplified version is the same as Hail's methods when using the
`ignore_in_sample_allele_frequency` parameter. The main difference is that this
mode should be used when families larger than a single trio are in the dataset, in
which an allele might be de novo in a parent and transmitted to a child in the
dataset. This mode will not consider the allele count (AC) in the dataset, and will
only consider the Phred-scaled likelihoods (PL) of the child and parents,
allele balance (AB) of the child and parents, the genotype quality (GQ) of the
child, the depth (DP) of the child and parents, and the population frequency prior.
the explanation that this function behaves the same as Hail's + ignore_in_sample_allele_frequency, and which exact fields the function needs, is clear and helpful to have
I rewrote the test to be simpler. I could parametrize more, but I want you to have a look first.
gnomad/sample_qc/relatedness.py (outdated)

        .or_missing()
    )

    parent_sum_ad_0_expr = (
Now that I'm looking at the code and data, isn't hl.sum(AD) == DP, especially since the data is split? Can we just use the parent DP here?
this is a good question. I thought I remembered hearing that sum(AD) isn't exactly equal to DP and found this note:
The AD calculation as performed by HaplotypeCaller may not yield exact results because only reads that statistically favor one allele over the other are counted. Due to this fact, the sum of AD may be different than the individual sample depth, especially when there are many non-informative reads.
https://gatk.broadinstitute.org/hc/en-us/articles/360037592411-DepthPerAlleleBySample
could you test to see if sum(AD) == DP here?
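In Hail this check would be a single entry aggregation; the comparison itself can be sketched in plain Python over hypothetical (AD, DP) records (the values below are made up for illustration):

```python
# Fraction of genotypes where sum(AD) == DP. On a MatrixTable this would
# be an entry aggregation over the AD and DP fields; here hypothetical
# (AD, DP) pairs illustrate the comparison.
records = [
    ([12, 8], 20),   # sum(AD) == DP
    ([5, 0], 7),     # DP > sum(AD): non-informative reads are not counted in AD
    ([10, 10], 20),  # sum(AD) == DP
]
frac_equal = sum(sum(ad) == dp for ad, dp in records) / len(records)
print(frac_equal)
```

Per the GATK note above, a fraction below 1 is expected when many reads are non-informative.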
I think the problem is that we don't have AD for hom_ref stored, and we're approximating it back by:
AD=hl.or_else(mt.AD, [mt.DP, 0])
The sum(AD) would equal DP for those hom ref parent(s) for sure. Do you know where to find the trail (or documentation) showing that we intentionally removed AD and PL?
oh that makes sense. given this function depends on the presence of AD/PL, let's stick with AD anyway, even if DP would be equivalent for our dataset. other people using this function might still have AD and PL in their datasets.
finding the documentation or trail on this topic might take a while -- my instincts are checking slack (both Broad and ATGU), project meeting notes, and joint calling meeting notes
thanks for doing the suggested restructuring! the current structure and formatting look nice. I have some more minor suggestions and some suggestions for the tests
.. math::

    P_{dn} = \frac{P(DN \mid \text{data})}{P(DN \mid \text{data}) + P(\text{missed het in parent(s)} \mid \text{data})}
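As a numeric sketch of this posterior ratio (plain Python; the two probability values are hypothetical stand-ins, not taken from the caller):

```python
# P_dn = P(DN | data) / (P(DN | data) + P(missed het in parent(s) | data))
# Hypothetical unnormalized probabilities, for illustration only.
p_dn_given_data = 4.2e-6
p_missed_het_given_data = 1.0e-7

p_dn = p_dn_given_data / (p_dn_given_data + p_missed_het_given_data)
print(round(p_dn, 3))  # 0.977
```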
I don't think you need this formula since you've added this comment:
Please refer to these sources for more information on the de novo model.
alternatively, if you feel strongly that these equations should stay in this docstring, particularly given your hard work on the formatting (which looks great!), can you move this
Neither Kaitlin's de novo caller nor Hail's de novo method provide a clear
description on how to calculate for de novo calls for hemizygous genotypes in XY
individuals. These equations are included below:
to after this: "Probability of a de novo mutation given the data for hemizygous calls in XY individuals"?
I revised this to be more logical.
gnomad/sample_qc/relatedness.py (outdated)

.. math::

    P(\text{data} \mid DN) = P(\text{hom_ref in mother}) \, P(\text{hom_alt in proband})
I don't see that issue on my big screen, but I do see it on my laptop; it resolves if you zoom the webpage to 90%. (This got better with \cdot.) I saw that for multiplication we can use either \times or \cdot (as Hail does) or just a small space \,; * is not recommended. Hail also used both \, and \cdot. To keep it consistent, I changed it to \cdot everywhere except for the DN prior numbers.
    "proband_pl, father_pl, mother_pl, diploid, hemi_x, hemi_y, freq_prior, min_pop_prior, expected_p_dn",
    [
        (
            [73, 0, 161],
rather than having so many examples, could you have just a few with the following conditions:
- two examples where calculate_de_novo_post_prob should return the expected P(dn) values
- one example where calculate_de_novo_post_prob will throw an error because freq_prior_expr is out of the expected range?
I chatted with Mike, and he pointed me to Kristen's recently written tests as a good example of what our tests should do: https://github.com/broadinstitute/gnomad_methods/blob/main/tests/assessment/test_validity_checks.py.
with my comment above: we should test for valid uses of the function (making sure valid inputs return expected outputs) and incorrect uses of the function (checking that the function returns errors as expected). you've added these suggestions, but it would also be a good idea to add tests for any edge cases you can think of (if possible).
Mike also mentioned that pytest has a built-in caplog functionality to capture logger information, which made me realize these functions don't have any loggers. I'm not sure they're necessary, but I wanted to note it since I thought of it.
tests/sample_qc/test_de_novo.py (outdated)

    # Assert with floating-point tolerance
    assert round(hl.eval(p_dn_expr), 3) == expected_p_dn

    def test_default_get_de_novo_expr_fail_conditions(self):
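An alternative to rounding is an explicit tolerance; a sketch using only the standard library (the computed value is a hypothetical stand-in for hl.eval(p_dn_expr); in a pytest suite, pytest.approx would be the idiomatic equivalent):

```python
import math

# Compare with an explicit absolute tolerance instead of rounding to 3 places.
computed_p_dn = 0.9770001  # hypothetical stand-in for hl.eval(p_dn_expr)
expected_p_dn = 0.977

assert math.isclose(computed_p_dn, expected_p_dn, abs_tol=5e-4)
```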
Suggested change:
- def test_default_get_de_novo_expr_fail_conditions(self):
+ def test_default_get_de_novo_expr(self):
since this is a test function, I don't think it needs to include "fail_conditions" explicitly in the name. also, similar to the comment above, I think this should test a few scenarios:
- a case like the one you've included, where multiple fail conditions apply
- a case where only one fail condition is true
- a case where the variant has False for is_de_novo
- a passing case
Co-authored-by: Katherine Chao <[email protected]>
I'll leave you to resolve the conversations in case I didn't address them as you suggested.
    freq_prior_expr = _get_freq_prior(freq_prior_expr, min_pop_prior)
    prior_one_parent_het = 1 - (1 - freq_prior_expr) ** 4

    # Convert PL to probabilities
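A sketch of the two steps in this snippet in plain Python (the 10 ** (-PL / 10) conversion is the standard Phred-to-probability formula; the PL and frequency prior values are hypothetical):

```python
# Phred-scaled likelihoods to normalized probabilities:
# p_i = 10 ** (-PL_i / 10), then divide by the sum.
pl = [0, 30, 60]  # hypothetical hom-ref-like PLs
unnorm = [10 ** (-x / 10) for x in pl]
probs = [p / sum(unnorm) for p in unnorm]

# Prior that at least one of the four parental alleles carries the variant,
# from the population frequency prior (hypothetical value).
freq_prior = 1e-4
prior_one_parent_het = 1 - (1 - freq_prior) ** 4

print(round(probs[0], 3))  # 0.999
```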
I changed this back because your suggestion will need more changes below.
a few more minor suggestions
gnomad/sample_qc/relatedness.py (outdated)

    The method is adapted from Kaitlin Samocha's `de novo caller <https://github.com/ksamocha/de_novo_scripts>`_
    and Hail's `de_novo <https://hail.is/docs/0.2/methods/genetics.html#hail.methods.de_novo>`_ function.

    However, neither approach explicitly defines how to compute *de novo*
    probabilities for hemizygous genotypes in XY individuals. To address this,
    we provide the full set of equations in this docstring.
    proband_ab = proband_expr.AD[1] / hl.sum(proband_expr.AD)
    is_snp = hl.is_snp(alleles_expr[0], alleles_expr[1])

    is_de_novo = (
ah yes, sorry, that wasn't clear. this suggestion was relevant when you wanted to filter de novos from the input dataset upfront. in that scheme, you would pass an argument to this function (e.g., filter_to_candidate_dnms or something similar), and the default for that argument would be True, which means the function would first filter the input dataset to candidate de novo variants before calculating p-values and confidence levels.
however, in your current function design, where you pass in and return only expressions, I don't think this behavior would make sense; I more wanted to suggest this as an alternative option to your previous structure (filtering the HT to variants with GTs that indicate possibility of being a de novo variant).
Co-authored-by: Katherine Chao <[email protected]>
These are smaller functions modified from Julia's work on combining hl.de_novo and Kaitlin’s code.