Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adapt hl.de_novo function #760

Open
wants to merge 45 commits into
base: main
Choose a base branch
from
Open

Adapt hl.de_novo function #760

wants to merge 45 commits into from

Conversation

KoalaQin
Copy link
Contributor

@KoalaQin KoalaQin commented Jan 28, 2025

These are smaller functions modified from Julia's work on combining hl.de_novo and Kaitlin’s code.

@KoalaQin KoalaQin self-assigned this Jan 28, 2025
@KoalaQin KoalaQin requested a review from ch-kr January 28, 2025 20:29
Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding these functions! I have some questions (many because I'm not that familiar with this work) and suggestions

gnomad/sample_qc/relatedness.py Outdated Show resolved Hide resolved
gnomad/sample_qc/relatedness.py Outdated Show resolved Hide resolved
gnomad/sample_qc/relatedness.py Outdated Show resolved Hide resolved
gnomad/sample_qc/relatedness.py Outdated Show resolved Hide resolved
gnomad/sample_qc/relatedness.py Outdated Show resolved Hide resolved
gnomad/sample_qc/relatedness.py Outdated Show resolved Hide resolved
de_novo_prior=de_novo_prior,
)

# Determine genomic context
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than running this in this function and in calculate_de_novo_post_prob, why not switch the order and run get_genomic_context first and pass in the three return values to calculate_de_novo_post_prob as function arguments?

gnomad/sample_qc/relatedness.py Outdated Show resolved Hide resolved
gnomad/sample_qc/relatedness.py Show resolved Hide resolved
gnomad/sample_qc/relatedness.py Outdated Show resolved Hide resolved
@ch-kr
Copy link
Contributor

ch-kr commented Jan 29, 2025

I also forgot to ask in my initial review -- could you add tests for the new functions?

@KoalaQin KoalaQin requested a review from ch-kr January 31, 2025 17:19
@KoalaQin
Copy link
Contributor Author

I just saw your comment on adding tests, I will work on that now.

@KoalaQin
Copy link
Contributor Author

KoalaQin commented Feb 3, 2025

Back to you!

Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding tests! I have some more questions and suggestions

gnomad/utils/annotations.py Outdated Show resolved Hide resolved
gnomad/sample_qc/relatedness.py Outdated Show resolved Hide resolved
gnomad/sample_qc/relatedness.py Outdated Show resolved Hide resolved
- `fail`: Boolean indicating if the variant fails any checks.
- `fail_reason`: Set of strings with reasons for failure.
"""
p_de_novo = calculate_de_novo_post_prob(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense to me to keep the actual code in gnomad_qc, but can you add this documentation back to the function? I took this from Julia's function:

    .. warning::

        This method assumes that the PL and AD fields are present in the genotype fields
        of the child and parents. If they are missing, this method will not work.
        Many of our larger datasets have the PL and AD fields intentionally removed to
        save storage space. If this is the reason that the PL and AD fields are
        missing, the only way to use this method is to set them to their approximate
        values:

        .. code-block:: python

                PL=hl.or_else(PL, [0, GQ, 2 * GQ])
                AD=hl.or_else(AD, [DP, 0])

this is very clear documentation for our users both as a warning that this function needs AD and PL and how to approximate them if they also need to do so

using the likelihoods of the proband's and parents' genotypes and the population
frequency prior for the variant.

Based on Kaitlin Samocha's [de novo caller](
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also add a version of this documentation back to the docstring?

    The simplified version is the same as Hail's methods when using the
    `ignore_in_sample_allele_frequency` parameter. The main difference is that this
    mode should be used when families larger than a single trio are in the dataset, in
    which an allele might be de novo in a parent and transmitted to a child in the
    dataset. This mode will not consider the allele count (AC) in the dataset, and will
    only consider the Phred-scaled likelihoods (PL) of the child and parents,
    allele balance (AB) of the child and parents, the genotype quality (GQ) of the
    child, the depth (DP) of the child and parents, and the population frequency prior.

the explanation that this function behaves the same as Hail's + ignore_in_sample_allele_frequency and which exact fields the function needs is clear and helpful to have

tests/sample_qc/test_de_novo.py Outdated Show resolved Hide resolved
tests/sample_qc/test_de_novo.py Outdated Show resolved Hide resolved
tests/sample_qc/test_de_novo.py Outdated Show resolved Hide resolved
tests/sample_qc/test_de_novo.py Outdated Show resolved Hide resolved
tests/sample_qc/test_de_novo.py Outdated Show resolved Hide resolved
@KoalaQin KoalaQin requested a review from ch-kr February 6, 2025 16:28
@KoalaQin
Copy link
Contributor Author

KoalaQin commented Feb 6, 2025

I rewrote the test simpler, I could parametrize more, but I want you to have a look first.

.or_missing()
)

parent_sum_ad_0_expr = (
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I'm looking at the code and data, isn't hl.sum(AD)=DP? Especially it's split. Can we just use parent DP here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a good question. I thought I remembered hearing that sum(AD) isn't exactly equal to DP and found this note:

The AD calculation as performed by HaplotypeCaller may not yield exact results because only reads that statistically favor one allele over the other are counted. Due to this fact, the sum of AD may be different than the individual sample depth, especially when there are many non-informative reads.

https://gatk.broadinstitute.org/hc/en-us/articles/360037592411-DepthPerAlleleBySample

could you test to see if sum(AD) == DP here?

Copy link
Contributor Author

@KoalaQin KoalaQin Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem is that we don't have AD for hom_ref stored, and we're approximating it back by:

AD=hl.or_else(mt.AD, [mt.DP, 0])

The sum(AD) would be DP for those hom ref parent(s) for sure. Do you know where to find the trace (or documentation) that we intentionally removed AD and PL?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh that makes sense. given this function depends on the presence of AD/PL, let's stick with AD anyway, even if DP would be equivalent for our dataset. other people using this function might still have AD and PL in their datasets.

finding the documentation or trail on this topic might take a while -- my instincts are checking slack (both Broad and ATGU), project meeting notes, and joint calling meeting notes

Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for doing the suggested restructuring! the current structure and formatting looks nice. I have some more minor suggestions and some suggestions for the tests

gnomad/sample_qc/relatedness.py Outdated Show resolved Hide resolved

.. math::

P_{dn} = \frac{P(DN \mid \text{data})}{P(DN \mid \text{data}) + P(\text{missed het in parent(s)} \mid \text{data})}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need this formula since you've added this comment:

Please refer to these sources for more information on the de novo model.

alternatively, if you feel strongly that these equations should stay in this docstring, particularly given your hard work on the formatting (which looks great!), can you move this

Neither Kaitlin's de novo caller nor Hail's de novo method provide a clear
description on how to calculate for de novo calls for hemizygous genotypes in XY
individuals. These equations are included below:

to after this?

Probability of a de novo mutation given the data for hemizygous calls in XY individuals

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I revised this to be more logic.

gnomad/sample_qc/relatedness.py Outdated Show resolved Hide resolved
gnomad/sample_qc/relatedness.py Outdated Show resolved Hide resolved

.. math::

P(\text{data} \mid DN) = P(\text{hom_ref in mother}) \, P(\text{hom_alt in proband})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this equation extends past the end of the screen for me, and I don't see an option to scroll horizontally -- do you see this as well when you render the docs?
image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, in these formulas below, you use x to indicate multiplication, but you don't above, e.g.:
image

can you add the x above or remove them here? also, if you decide to keep a symbol, is there a way to use a dot instead (as in the Hail docs) to save horizontal space?
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see that issue on my big screen but I do see it on my laptop but it will resolve if you zoom in the webpage to 90%. (This got better with \cdot.
I saw for the multiplication, we can use either \times or \cdot (as Hail) or just a small space \, , * is not recommended. Hail also used both \, and \cdot. To keep it consistent, I change it to \cdot everywhere except for the DN prior numbers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the \cdot helped a lot, thanks for making the change! I agree with keeping the \times in the de novo prior. an alternative for prior is to display that formula in the same format as the Hail docs:
image
(not a necessary edit, just something to consider -- happy with whichever you prefer)

gnomad/sample_qc/relatedness.py Show resolved Hide resolved
gnomad/sample_qc/relatedness.py Show resolved Hide resolved
"proband_pl, father_pl, mother_pl, diploid, hemi_x, hemi_y, freq_prior, min_pop_prior, expected_p_dn",
[
(
[73, 0, 161],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than having so many examples, could you have just a few with the following conditions:

  • two examples where calculate_de_novo_post_prob should return the expected P(dn) values
  • one example where calculate_de_novo_post_prob will throw an error because freq_prior_expr is out of the expected range?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chatted with Mike, and he pointed me to Kristen's recently written tests as a good example of what our tests should do: https://github.com/broadinstitute/gnomad_methods/blob/main/tests/assessment/test_validity_checks.py.

with my comment above: we should test for valid uses of the function (making sure valid inputs return expected outputs) and incorrect uses of the function (checking that the function returns errors as expected). you've added these suggestions, but it would also be a good idea to add tests for any edge cases you can think of (if possible).

Mike also mentioned that pytest has a built in caplog functionality to capture logger information, which made me realize these functions don't have any loggers. I'm not sure they're necessary, but I wanted to note it since I thought of it

# Assert with floating-point tolerance
assert round(hl.eval(p_dn_expr), 3) == expected_p_dn

def test_default_get_de_novo_expr_fail_conditions(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_default_get_de_novo_expr_fail_conditions(self):
def test_default_get_de_novo_expr(self):

since this is a test function, I don't think it needs to include "fail_conditions" explicitly in the name. also, similar to the comment above, I think this should test a few scenarios:

  • a case like the one you've included, where multiple fail conditions apply
  • a case where only one fail condition is true
  • a case where the variant has False for is_de_novo
  • a passing case

tests/sample_qc/test_de_novo.py Outdated Show resolved Hide resolved
@KoalaQin KoalaQin requested a review from ch-kr February 7, 2025 17:42
@KoalaQin
Copy link
Contributor Author

KoalaQin commented Feb 7, 2025

I'll leave you to resolve the conversations in case I didn't address them as you suggested.

freq_prior_expr = _get_freq_prior(freq_prior_expr, min_pop_prior)
prior_one_parent_het = 1 - (1 - freq_prior_expr) ** 4

# Convert PL to probabilities
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this back because your suggestion will need more changes below.

Copy link
Contributor

@ch-kr ch-kr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few more minor suggestions

Comment on lines 1323 to 1328
The method is adapted from Kaitlin Samocha's `de novo caller <https://github.com/ksamocha/de_novo_scripts>`_
and Hail's `de_novo <https://hail.is/docs/0.2/methods/genetics.html#hail.methods.de_novo>`_ function.

However, neither approach explicitly defines how to compute *de novo*
probabilities for hemizygous genotypes in XY individuals. To address this,
we provide the full set of equations in this docstring.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this explanation is great -- it's really clear!

I have a nitpick here: can you combine these into the same paragraph/text block? these sentences logically belong together, so they don't need the extra newline in between
image

gnomad/sample_qc/relatedness.py Show resolved Hide resolved
gnomad/sample_qc/relatedness.py Outdated Show resolved Hide resolved

.. math::

P(\text{data} \mid DN) = P(\text{hom_ref in mother}) \, P(\text{hom_alt in proband})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the \cdot helped a lot, thanks for making the change! I agree with keeping the \times in the de novo prior. an alternative for prior is to display that formula in the same format as the Hail docs:
image
(not a necessary edit, just something to consider -- happy with whichever you prefer)

gnomad/sample_qc/relatedness.py Outdated Show resolved Hide resolved
tests/sample_qc/test_de_novo.py Outdated Show resolved Hide resolved
"proband_pl, father_pl, mother_pl, diploid, hemi_x, hemi_y, freq_prior, min_pop_prior, expected_p_dn",
[
(
[73, 0, 161],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chatted with Mike, and he pointed me to Kristen's recently written tests as a good example of what our tests should do: https://github.com/broadinstitute/gnomad_methods/blob/main/tests/assessment/test_validity_checks.py.

with my comment above: we should test for valid uses of the function (making sure valid inputs return expected outputs) and incorrect uses of the function (checking that the function returns errors as expected). you've added these suggestions, but it would also be a good idea to add tests for any edge cases you can think of (if possible).

Mike also mentioned that pytest has a built in caplog functionality to capture logger information, which made me realize these functions don't have any loggers. I'm not sure they're necessary, but I wanted to note it since I thought of it

proband_ab = proband_expr.AD[1] / hl.sum(proband_expr.AD)
is_snp = hl.is_snp(alleles_expr[0], alleles_expr[1])

is_de_novo = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes, sorry, that wasn't clear. this suggestion was relevant when you wanted to filter de novos from the input dataset upfront. in that scheme, you would pass an argument to this function (e.g., filter_to_candidate_dnms or something similar), and the default for that argument would be True, which means the function would first filter the input dataset to candidate de novo variants before calculating p-values and confidence levels.

however, in your current function design, where you pass in and return only expressions, I don't think this behavior would make sense; I more wanted to suggest this as an alternative option to your previous structure (filtering the HT to variants with GTs that indicate possibility of being a de novo variant).

tests/sample_qc/test_de_novo.py Outdated Show resolved Hide resolved
tests/sample_qc/test_de_novo.py Outdated Show resolved Hide resolved
@broadinstitute broadinstitute deleted a comment from ch-kr Feb 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants