Suggestions about methylpy #28
Comments
Thanks for the suggestions. Here are my thoughts.
I also have some thoughts of my own that I would like your comments on.
Hi Yupeng, sorry I missed this for a while...
In addition, I've implemented quite a few ALLC-related functions in another repo, https://github.com/lhqing/ALLCools, which implements essentially everything related to ALLC handling using tabix/bgzip, and I added some code for single-cell analysis. I am willing to incorporate that into methylpy. I think I will have time for this after Feb 2020.
Hi, Yupeng,
Here are some suggestions for methylpy based on my experience. I can work on these improvements (I have already implemented them here: https://github.com/lhqing/cemba_data/blob/master/cemba_data/tools/), but I'd like to know your opinion first:
chromosome names: I suggest always keeping the chromosome names the same as in the genome FASTA, i.e. keep "chr" by default for genomes like mm10 and hg19, or never change chromosome names. I found this reduces conflicts in a lot of cases.
chromosome orders: Right now the order of chromosomes in an ALLC file is based on the BAM file, and there is no particular "sorted or not" check for ALLC files. What's worse, in the merge_allc_files_worker function the order of chromosomes may be random, because dict.keys() is not ordered.
https://github.com/yupenghe/methylpy/blob/methylpy/methylpy/utilities.py#L543
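To make the ordering point concrete, here is a minimal sketch of one way to get a deterministic chromosome order and a basic "sorted or not" check. It is not methylpy code: `chrom_sort_key`, `ordered_chroms`, and `allc_is_sorted` are hypothetical names, and the check assumes tab-separated ALLC lines with the chromosome in column 1 and the position in column 2. Dict iteration order is only guaranteed from Python 3.7 on, and even then it follows insertion order, so an explicit sort avoids depending on it.

```python
def chrom_sort_key(name):
    """Natural sort key: 'chr2' sorts before 'chr10'; chrX, chrM, etc. come last."""
    core = name[3:] if name.startswith("chr") else name
    if core.isdigit():
        return (0, int(core), "")
    return (1, 0, core)


def ordered_chroms(index_by_chrom):
    """Return the chromosome names of a dict in a deterministic order."""
    return sorted(index_by_chrom, key=chrom_sort_key)


def allc_is_sorted(lines):
    """Check that each chromosome forms one contiguous block and that
    positions never decrease within a block (assumed ALLC column layout)."""
    seen = set()
    prev_chrom, prev_pos = None, -1
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        if chrom != prev_chrom:
            if chrom in seen:
                return False  # chromosome reappears after another one
            seen.add(chrom)
            prev_chrom, prev_pos = chrom, pos
            continue
        if pos < prev_pos:
            return False  # position went backwards within a chromosome
        prev_pos = pos
    return True
```

Iterating with `for chrom in ordered_chroms(index)` instead of `index.keys()` would then give the same order on every run, whatever order the input files were read in.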
Use bgzip/tabix:
I know this needs a lot of changes in all functions, but I found it worthwhile. I implemented the open_allc and merge_allc functions based on bgzip/tabix, which can be several times faster than the gzip + f.seek strategy using the old index. I lean toward fully dropping the old index strategy.
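For readers unfamiliar with the bgzip/tabix approach, a rough sketch using the third-party pysam package could look like the following. This is an illustration only, not the open_allc or merge_allc implementation mentioned above: the helper names `region_string`, `index_allc`, and `fetch_allc` are hypothetical, pysam is assumed to be installed, and the ALLC file is assumed to be sorted with the chromosome in column 1 and a 1-based position in column 2.

```python
def region_string(chrom, start, end):
    """Format a 1-based inclusive region for tabix, e.g. 'chr1:100-200'."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return f"{chrom}:{start}-{end}"


def index_allc(path):
    """bgzip-compress (if needed) and tabix-index an ALLC file.

    Returns the path of the compressed file. Column numbers passed to
    pysam are 0-based: chromosome is column 0, position is column 1.
    """
    import pysam  # third-party dependency, assumed available

    return pysam.tabix_index(path, seq_col=0, start_col=1, end_col=1, force=True)


def fetch_allc(path, chrom, start, end):
    """Yield the ALLC rows that fall in chrom:start-end as lists of fields."""
    import pysam  # third-party dependency, assumed available

    tbx = pysam.TabixFile(path)
    try:
        for line in tbx.fetch(region=region_string(chrom, start, end)):
            yield line.split("\t")
    finally:
        tbx.close()
```

With an index in place, `fetch_allc("sample.allc.tsv.gz", "chr1", 1, 100000)` (a placeholder file name) would read only the compressed blocks overlapping that region rather than scanning the whole file.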
More universal map_to_region function:
This is very important for single-cell data; my implementation is here: map_to_region
Given an ALLC file, this function counts mC and cov for a list of regions, e.g. chromosome 100kb bins or gene bodies. I use bedtools map internally, which is very fast.
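As a small illustration of what counting mC and cov per region means, here is a plain-Python stand-in for the fixed-size bin case. It is not the bedtools map-based implementation described above and would be much slower on real data. `count_in_bins` is a hypothetical name, and the column layout (chromosome, 1-based position, strand, context, mc, cov) is an assumption about the ALLC format.

```python
from collections import defaultdict


def count_in_bins(allc_lines, bin_size=100_000):
    """Sum mc and cov per (chrom, bin_start) over fixed-size bins.

    bin_start is 0-based; a 1-based position p falls into the bin
    starting at ((p - 1) // bin_size) * bin_size.
    """
    counts = defaultdict(lambda: [0, 0])
    for line in allc_lines:
        chrom, pos, _strand, _context, mc, cov = line.rstrip("\n").split("\t")[:6]
        bin_start = (int(pos) - 1) // bin_size * bin_size
        entry = counts[(chrom, bin_start)]
        entry[0] += int(mc)
        entry[1] += int(cov)
    return {key: tuple(value) for key, value in counts.items()}
```

The per-bin methylation level is then `mc / cov` for each key. Arbitrary regions such as gene bodies need an interval overlap step instead of integer division, which is the part a tool like bedtools handles.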