Adapt hl.de_novo function #760 (base: main)
Conversation
thanks for adding these functions! I have some questions (many because I'm not that familiar with this work) and suggestions
gnomad/sample_qc/relatedness.py (outdated)

        de_novo_prior=de_novo_prior,
    )

    # Determine genomic context
rather than running this in this function and in calculate_de_novo_post_prob, why not switch the order and run get_genomic_context first, then pass the three return values to calculate_de_novo_post_prob as function arguments?
I also forgot to ask in my initial review -- could you add tests for the new functions?
Co-authored-by: Katherine Chao <[email protected]>
I just saw your comment on adding tests, I will work on that now.
Back to you!
thanks for adding tests! I have some more questions and suggestions
        - `fail`: Boolean indicating if the variant fails any checks.
        - `fail_reason`: Set of strings with reasons for failure.
    """
    p_de_novo = calculate_de_novo_post_prob(
makes sense to me to keep the actual code in gnomad_qc, but can you add this documentation back to the function? I took this from Julia's function:
.. warning::

    This method assumes that the PL and AD fields are present in the genotype fields
    of the child and parents. If they are missing, this method will not work. Many of
    our larger datasets have the PL and AD fields intentionally removed to save
    storage space. If this is the reason that the PL and AD fields are missing, the
    only way to use this method is to set them to their approximate values:

    .. code-block:: python

        PL=hl.or_else(PL, [0, GQ, 2 * GQ])
        AD=hl.or_else(AD, [DP, 0])
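As a sketch of what those fallbacks compute, here is the same approximation in plain Python (illustrative helper names and example GQ/DP values; the real code applies hl.or_else to Hail entry fields):

```python
# Sketch of the PL/AD fallback values from the warning above, written as
# plain Python for illustration. In Hail these are hl.or_else expressions
# applied when the fields are missing.

def approximate_pl(gq: int) -> list:
    """Approximate PL for a hom-ref call from GQ: [0, GQ, 2 * GQ]."""
    return [0, gq, 2 * gq]

def approximate_ad(dp: int) -> list:
    """Approximate AD for a hom-ref call from DP: all reads assigned to ref."""
    return [dp, 0]

print(approximate_pl(30))  # [0, 30, 60]
print(approximate_ad(25))  # [25, 0]
```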
this is very clear documentation for our users, both as a warning that this function needs AD and PL and as a guide to approximating them if they need to do so
gnomad/sample_qc/relatedness.py (outdated)

    using the likelihoods of the proband's and parents' genotypes and the population
    frequency prior for the variant.

    Based on Kaitlin Samocha's [de novo caller](
can you also add a version of this documentation back to the docstring?
The simplified version is the same as Hail's methods when using the
`ignore_in_sample_allele_frequency` parameter. The main difference is that this
mode should be used when families larger than a single trio are in the dataset, in
which an allele might be de novo in a parent and transmitted to a child in the
dataset. This mode will not consider the allele count (AC) in the dataset, and will
only consider the Phred-scaled likelihoods (PL) of the child and parents,
allele balance (AB) of the child and parents, the genotype quality (GQ) of the
child, the depth (DP) of the child and parents, and the population frequency prior.
the explanation that this function behaves the same as Hail's + ignore_in_sample_allele_frequency, and which exact fields the function needs, is clear and helpful to have
I rewrote the test to be simpler. I could parametrize more, but I want you to have a look first.
gnomad/sample_qc/relatedness.py (outdated)

        .or_missing()
    )

    parent_sum_ad_0_expr = (
Now that I'm looking at the code and data, isn't hl.sum(AD) == DP, especially since the data is split? Can we just use the parent DP here?
this is a good question. I thought I remembered hearing that sum(AD) isn't exactly equal to DP and found this note:
The AD calculation as performed by HaplotypeCaller may not yield exact results because only reads that statistically favor one allele over the other are counted. Due to this fact, the sum of AD may be different than the individual sample depth, especially when there are many non-informative reads.
https://gatk.broadinstitute.org/hc/en-us/articles/360037592411-DepthPerAlleleBySample
could you test to see if sum(AD) == DP here?
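In Hail this check would be a single entry aggregation; the comparison itself can be sketched in plain Python over hypothetical (AD, DP) records (the values below are made up for illustration):

```python
# Fraction of genotypes where sum(AD) == DP. On a MatrixTable this would
# be an entry aggregation over the AD and DP fields; here hypothetical
# (AD, DP) pairs illustrate the comparison.
records = [
    ([12, 8], 20),   # sum(AD) == DP
    ([5, 0], 7),     # DP > sum(AD): non-informative reads are not counted in AD
    ([10, 10], 20),  # sum(AD) == DP
]
frac_equal = sum(sum(ad) == dp for ad, dp in records) / len(records)
print(frac_equal)
```

Per the GATK note above, a fraction below 1 is expected when many reads are non-informative.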
I think the problem is that we don't have AD for hom_ref stored, and we're approximating it back by:
AD=hl.or_else(mt.AD, [mt.DP, 0])
The sum(AD) would equal DP for those hom ref parent(s) for sure. Do you know where to find the trail (or documentation) showing that we intentionally removed AD and PL?
oh that makes sense. given this function depends on the presence of AD/PL, let's stick with AD anyway, even if DP would be equivalent for our dataset. other people using this function might still have AD and PL in their datasets.
finding the documentation or trail on this topic might take a while -- my instincts are checking slack (both Broad and ATGU), project meeting notes, and joint calling meeting notes
thanks for doing the suggested restructuring! the current structure and formatting look nice. I have some more minor suggestions and some suggestions for the tests
.. math::

    P_{dn} = \frac{P(DN \mid \text{data})}{P(DN \mid \text{data}) + P(\text{missed het in parent(s)} \mid \text{data})}
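As a numeric sketch of this posterior ratio (plain Python; the two probability values are hypothetical stand-ins, not taken from the caller):

```python
# P_dn = P(DN | data) / (P(DN | data) + P(missed het in parent(s) | data))
# Hypothetical unnormalized probabilities, for illustration only.
p_dn_given_data = 4.2e-6
p_missed_het_given_data = 1.0e-7

p_dn = p_dn_given_data / (p_dn_given_data + p_missed_het_given_data)
print(round(p_dn, 3))  # 0.977
```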
I don't think you need this formula since you've added this comment:
Please refer to these sources for more information on the de novo model.
alternatively, if you feel strongly that these equations should stay in this docstring, particularly given your hard work on the formatting (which looks great!), can you move this
Neither Kaitlin's de novo caller nor Hail's de novo method provide a clear
description on how to calculate for de novo calls for hemizygous genotypes in XY
individuals. These equations are included below:
to after this: "Probability of a de novo mutation given the data for hemizygous calls in XY individuals"?
I revised this to be more logical.
gnomad/sample_qc/relatedness.py (outdated)

.. math::

    P(\text{data} \mid DN) = P(\text{hom_ref in mother}) \, P(\text{hom_alt in proband})
I don't see that issue on my big screen, but I do see it on my laptop; it resolves if you zoom the webpage to 90%. (This got better with \cdot.) I saw that for multiplication we can use either \times or \cdot (as Hail does) or just a small space \,; * is not recommended. Hail also used both \, and \cdot. To keep it consistent, I changed it to \cdot everywhere except for the DN prior numbers.
    "proband_pl, father_pl, mother_pl, diploid, hemi_x, hemi_y, freq_prior, min_pop_prior, expected_p_dn",
    [
        (
            [73, 0, 161],
rather than having so many examples, could you have just a few with the following conditions:
- two examples where calculate_de_novo_post_prob should return the expected P(dn) values
- one example where calculate_de_novo_post_prob will throw an error because freq_prior_expr is out of the expected range?
I chatted with Mike, and he pointed me to Kristen's recently written tests as a good example of what our tests should do: https://github.com/broadinstitute/gnomad_methods/blob/main/tests/assessment/test_validity_checks.py.
with my comment above: we should test for valid uses of the function (making sure valid inputs return expected outputs) and incorrect uses of the function (checking that the function returns errors as expected). you've added these suggestions, but it would also be a good idea to add tests for any edge cases you can think of (if possible).
Mike also mentioned that pytest has a built-in caplog functionality to capture logger information, which made me realize these functions don't have any loggers. I'm not sure they're necessary, but I wanted to note it since I thought of it.
tests/sample_qc/test_de_novo.py (outdated)

    # Assert with floating-point tolerance
    assert round(hl.eval(p_dn_expr), 3) == expected_p_dn

    def test_default_get_de_novo_expr_fail_conditions(self):
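An alternative to rounding is an explicit tolerance; a sketch using only the standard library (the computed value is a hypothetical stand-in for hl.eval(p_dn_expr); in a pytest suite, pytest.approx would be the idiomatic equivalent):

```python
import math

# Compare with an explicit absolute tolerance instead of rounding to 3 places.
computed_p_dn = 0.9770001  # hypothetical stand-in for hl.eval(p_dn_expr)
expected_p_dn = 0.977

assert math.isclose(computed_p_dn, expected_p_dn, abs_tol=5e-4)
```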
Suggested change:
- def test_default_get_de_novo_expr_fail_conditions(self):
+ def test_default_get_de_novo_expr(self):
since this is a test function, I don't think it needs to include "fail_conditions" explicitly in the name. also, similar to the comment above, I think this should test a few scenarios:
- a case like the one you've included, where multiple fail conditions apply
- a case where only one fail condition is true
- a case where the variant has False for is_de_novo
- a passing case
Co-authored-by: Katherine Chao <[email protected]>
I'll leave you to resolve the conversations in case I didn't address them as you suggested.
    freq_prior_expr = _get_freq_prior(freq_prior_expr, min_pop_prior)
    prior_one_parent_het = 1 - (1 - freq_prior_expr) ** 4

    # Convert PL to probabilities
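A sketch of the two steps in this snippet in plain Python (the 10 ** (-PL / 10) conversion is the standard Phred-to-probability formula; the PL and frequency prior values are hypothetical):

```python
# Phred-scaled likelihoods to normalized probabilities:
# p_i = 10 ** (-PL_i / 10), then divide by the sum.
pl = [0, 30, 60]  # hypothetical hom-ref-like PLs
unnorm = [10 ** (-x / 10) for x in pl]
probs = [p / sum(unnorm) for p in unnorm]

# Prior that at least one of the four parental alleles carries the variant,
# from the population frequency prior (hypothetical value).
freq_prior = 1e-4
prior_one_parent_het = 1 - (1 - freq_prior) ** 4

print(round(probs[0], 3))  # 0.999
```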
I changed this back because your suggestion will need more changes below.
a few more minor suggestions
gnomad/sample_qc/relatedness.py (outdated)

    The method is adapted from Kaitlin Samocha's `de novo caller <https://github.com/ksamocha/de_novo_scripts>`_
    and Hail's `de_novo <https://hail.is/docs/0.2/methods/genetics.html#hail.methods.de_novo>`_ function.

    However, neither approach explicitly defines how to compute *de novo*
    probabilities for hemizygous genotypes in XY individuals. To address this,
    we provide the full set of equations in this docstring.
    proband_ab = proband_expr.AD[1] / hl.sum(proband_expr.AD)
    is_snp = hl.is_snp(alleles_expr[0], alleles_expr[1])

    is_de_novo = (
ah yes, sorry, that wasn't clear. this suggestion was relevant when you wanted to filter de novos from the input dataset upfront. in that scheme, you would pass an argument to this function (e.g., filter_to_candidate_dnms or something similar), and the default for that argument would be True, which means the function would first filter the input dataset to candidate de novo variants before calculating p-values and confidence levels.
however, in your current function design, where you pass in and return only expressions, I don't think this behavior would make sense; I more wanted to suggest this as an alternative option to your previous structure (filtering the HT to variants with GTs that indicate possibility of being a de novo variant).
Co-authored-by: Katherine Chao <[email protected]>
These are smaller functions modified from Julia's work on combining hl.de_novo and Kaitlin’s code.