Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Code to run logistic regression on v4 exomes and genomes with ancesty pcs #616

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

KoalaQin
Copy link
Contributor

No description provided.

ht = get_test_intervals(ht)
ht = ht.checkpoint(hl.utils.new_temp_file("test_intervals", "ht"))
exomes_vds = hl.vds.filter_intervals(
exomes_vds, ht, split_reference_blocks=True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? Did the Hail team say this is faster than just filtering the variant matrix table and doing a densify like we do in this script? https://github.com/broadinstitute/gnomad_qc/blob/main/gnomad_qc/v4/sample_qc/generate_qc_mt.py

Copy link
Contributor Author

@KoalaQin KoalaQin May 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't talk to Hail team, I tried filter_variants, I found it took very long to finish that step, then I remembered that once the vds has this store_max_ref_length, filter_intervals is much faster.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, sounds good, if it's faster then go for it

logger.info("Densifying exomes...")
exomes_mt = hl.vds.to_dense_mt(exomes_vds)
exomes_mt = exomes_mt.annotate_cols(is_genome=False)
exomes_mt = exomes_mt.select_entries("GT").select_rows().select_cols("is_genome")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the only entry you need then you should filter to it before the densify. I think for this you probably also want to filter to only adj genotypes though, so you probably need more

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, adj seems to make more sense. I will get that.

Copy link
Contributor Author

@KoalaQin KoalaQin May 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the entries are not the same as what we used in getting freq for exomes or genomes, could you check the new steps for me?

return ht


def densify_union_exomes_genomes(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would split this up. I would run the exomes and genomes filter and densify in parallel, checkpoint each, and then union after those are done and checkpoint

:param joint_ht: Joint HT of v4 exomes and genomes.
:return: Test Table
"""
# Filter to chr22
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before running the chr22 test, make an actual test that is only the first few partitions of chr22

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a few partitions were slower when I tested, when I get the set of intervals on chr22, they are more partitioned and they were densified faster.

"firth",
y=mt.is_genome,
x=mt.GT.n_alt_alleles(),
covariates=[1] + [mt.pc[i] for i in range(10)],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember how many PCs we used for ancestry assignment off the top of my head, but I would use that number

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mike told me it was 10, but I do remember you're exploring until 18,19, I will double check the code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if I wasnt clear, I wasnt certain it was 10, just thought it may be. Its 20: https://app.zenhub.com/workspaces/gnomad-5f4d127ea61afc001d6be50b/issues/gh/broadinstitute/gnomad_production/496

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, I put that 10 temporarily because it was just 10 in Julia's code. I changed it in my test.

return mt


def main(args):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a try catch for copying the logger in case it fails with an error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this.

hl.utils.new_temp_file(f"temp_{data_type}_vds_filtered", "vds")
)
mt = hl.vds.to_dense_mt(vds)
mt = annotate_adj(mt)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking at your code, I think I could use filter_to_adj directly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants