Efficiently summing coverage tracks from multiple d4 files #82

Open
percyfal opened this issue Jun 24, 2024 · 3 comments

Comments

@percyfal

Hi,

I'm running variant calling on non-model organisms, and some of the downstream analyses (e.g., nucleotide diversity calculations) require (possibly boolean) accessibility masks that classify each site as accessible or inaccessible for analysis at single-base-pair resolution. Accessibility masks can be generated by summing coverage over all samples and masking out sites whose total coverage is too low or too high. In addition, one could mask sites based on the number or fraction of individuals with sufficient coverage, i.e., presence/absence calls (cf. https://onlinelibrary.wiley.com/doi/full/10.1111/mec.16077, Table 3). The genomes in question are so large that it is not possible to generate variant files that include monomorphic sites and filter on those.
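As a rough illustration of the masking rule described above, here is a minimal pure-Python sketch. It is not part of d4 or its Python API: the function name `accessibility_mask` and the parameters `min_total`, `max_total`, `min_cov` and `min_fraction` are hypothetical, and per-sample coverage is assumed to be already loaded as equal-length sequences of integer depths.

```python
from typing import List, Sequence


def accessibility_mask(
    coverages: Sequence[Sequence[int]],
    min_total: int,
    max_total: int,
    min_cov: int = 1,
    min_fraction: float = 0.0,
) -> List[bool]:
    """Return one boolean per site; True means the site is accessible.

    A site is accessible when the depth summed over samples lies in
    [min_total, max_total] AND at least `min_fraction` of the samples
    have depth >= `min_cov` (all thresholds are illustrative).
    """
    n_samples = len(coverages)
    if n_samples == 0:
        return []
    n_sites = len(coverages[0])
    if any(len(c) != n_sites for c in coverages):
        raise ValueError("all samples must cover the same number of sites")

    mask = []
    for site in zip(*coverages):  # one tuple of depths per site
        total = sum(site)
        present = sum(1 for depth in site if depth >= min_cov)
        mask.append(
            min_total <= total <= max_total
            and present / n_samples >= min_fraction
        )
    return mask


# Three samples, four sites (made-up depths).
example = [
    [0, 5, 12, 40],
    [1, 6, 10, 35],
    [0, 4, 11, 50],
]
print(accessibility_mask(example, min_total=10, max_total=60,
                         min_cov=3, min_fraction=0.5))
```

In this toy input the first site fails the lower bound on total coverage and the last site exceeds the upper bound, so only the two middle sites are kept.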

Until now I have been using the Python API to sum coverages and to count, for each site, the number of individuals whose coverage falls within a threshold range. This is somewhat slow, so I was wondering whether this functionality could be added directly to the d4 Rust library. I gave it a try based on the merge function, but my Rust knowledge is somewhat limited.

I'm thinking of commands along the lines of

d4tools sum file1.d4 file2.d4 ... fileN.d4 outfile.d4

and

d4tools count file1.d4 file2.d4 ... fileN.d4 outfile.d4 --min-coverage 3
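To make the intended per-site semantics of the two proposed commands concrete, the following is a small pure-Python sketch. Neither `d4tools sum` nor `d4tools count` exists at the time of writing; `sum_tracks`, `count_tracks` and the optional `max_coverage` bound are hypothetical names chosen for illustration, and the input is assumed to be an iterable that yields one sample's depths at a time. Only one sample and the accumulator are held in memory at once.

```python
from typing import Callable, Iterable, List, Optional, Sequence


def _accumulate(
    tracks: Iterable[Sequence[int]],
    per_site: Callable[[int], int],
) -> List[int]:
    """Add per_site(depth) into a running per-site accumulator."""
    acc: Optional[List[int]] = None
    for track in tracks:  # one sample at a time
        if acc is None:
            acc = [0] * len(track)
        elif len(track) != len(acc):
            raise ValueError("tracks differ in length")
        for i, depth in enumerate(track):
            acc[i] += per_site(depth)
    return acc if acc is not None else []


def sum_tracks(tracks: Iterable[Sequence[int]]) -> List[int]:
    """Total depth per site, as the proposed `sum` command would report."""
    return _accumulate(tracks, lambda depth: depth)


def count_tracks(
    tracks: Iterable[Sequence[int]],
    min_coverage: int,
    max_coverage: Optional[int] = None,  # assumed optional upper bound
) -> List[int]:
    """Number of samples per site with depth inside the threshold range."""
    def in_range(depth: int) -> int:
        ok = depth >= min_coverage and (
            max_coverage is None or depth <= max_coverage
        )
        return int(ok)

    return _accumulate(tracks, in_range)


samples = [[0, 5, 12], [1, 6, 10], [0, 4, 11]]  # made-up depths
print(sum_tracks(samples))                    # per-site totals
print(count_tracks(samples, min_coverage=3))  # per-site sample counts
```

A Rust implementation would presumably follow the same shape, streaming the inputs in lockstep and writing one value per site to the output file.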

I'd be happy to submit a pull request if I could get pointers on where to start. What are your thoughts on this - do you prefer cases like these to be handled by external APIs (e.g., Python) or is it amenable to implementation in Rust?

Cheers,

Per

@cademirch
Contributor

Hi @percyfal, this is something I've been working on for our workflow, snpArcher. I'm glad to see there is interest in a function/tool for generating accessibility masks from coverage. I don't have a repo for this yet, but I will soon and can let you know when it's available.

@percyfal
Author

percyfal commented Jul 24, 2024

Thanks for the heads-up, @cademirch. FYI, I ended up drafting a Python package to perform the tasks detailed above. You can find the code at https://github.com/percyfal/d4utils. BTW, say hi to Erik, with whom I have previously collaborated.

@cademirch
Contributor

Awesome - just took a quick peek and it looks great! I will definitely share it with my colleagues, and perhaps we can integrate it into our workflow. I'd also be happy to contribute if you are open to it - we can discuss in your repo.

I'll let Erik know! Small world :)
