-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature request: extend mpileup API #1453
Comments
I'm trying to think of how practically it would be implemented to track how many soft-clipped reads overlap. For the right hand soft-clip, we could record when we're discarding a BAM from the in-flight list that while it's ending at the current pos P, it has N bases of soft-clips so for the next N bases we add +1 to a soft-clip depth stat. That would need some sorted list or tree into which we can keep inserting, so we're sorting based on right-hand end+SClen. It's also possible to add the BAM object to a secondary "just gone by" list I guess, and only free it once the soft-clip margin has passed. That would then give us access to the data, but it's cumbersome and high memory. However the right hand end is only half the problem. We can't really do this for the left hand end as we don't "see" the data until it's too late. The only solution would be to be doing read-ahead and have a maximum window of soft-clips. We may want that anyway to compensate for the extreme long reads that may have Kbps of soft-clipped data. We only need a small window for purposes of variant calling information. So a read-ahead may work too in that case. It's sounding like a total rewrite from the ground up though, both in implementation and API. It's not necessarily a bad idea as I've needed to hack my own pileup alternative anyway for dealing with iterating over bases within an insertion. (Sometimes it's simply not sufficient to be given a bunch of insertion sequence and have to parse the CIGARs manually to look for P operators.) Plus read-ahead of at least 1bp is necessary to handle reads that post-realignment start in an insertion, as they too don't "appear" until the event has already happened. That isn't so much "extend" as "write mpileup2". (See https://github.com/samtools/samtools/blob/develop/consensus_pileup.c for a start at that) It's possible we could just have a buffered-pileup wrapper around the existing mpileup. It may not be the most efficient, but it could be implementable. Eg something that introduces a delay. So while pileup is on pos P, the delayed_pileup is at pos P-D. That allows you to cope with the reads which appear some time in the future with soft-clip at the start. It's doable, but I think it'd be inefficient as it's likely it'd have to be duplicating memory and copying data around to avoid it being freed up by mpileup. (Or we add reference counting to the various structures involved so they can be considered to be in use by more than one piece of code.) |
Thanks for the analysis! My thoughts about this are:
|
Mixture of ideas from our samtools meeting (and some others).
|
The current mpileup API (
bam_mplp_auto()
) gives access to an array of per-read structures but does not give the possibility to store per-position data.One possible use case would be to store the number of overlapping soft-clipped reads at each position, this would allow to annotate candidate variants in difficult regions as suspicious. Currently the removed soft-clipped reads are invisible to the mpileup caller.
The text was updated successfully, but these errors were encountered: