Bug Fix - SamRecordClipper.numBasesExtendingPastMate #937
* add flag
* reorder logic
testing for clipPastFirst - read1 len == read2 len
…mate end, mate start. depending on MC tag when doing operations on the mate allows for data inconsistency.
Bumping this. Since it concerns a stealthy bug that can produce incorrect output, rather than a new feature, I hope it can be prioritized.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #937 +/- ##
==========================================
+ Coverage 95.60% 95.63% +0.02%
==========================================
Files 126 126
Lines 7307 7308 +1
Branches 507 479 -28
==========================================
+ Hits 6986 6989 +3
+ Misses 321 319 -2
@nh13 @tfenne
When several clipping operations are applied to SamRecords in series, clipping past the mate end relies on the mate CIGAR information in the MC tag, but that tag is not kept in sync with the corresponding mate SamRecord's data between operations.
This results in an inconsistent data model for clipping.
With the previous code this test case would fail.
It would erroneously calculate r2.start == 100 and r1.start != r2.start, because the mate.start data from the MC tag was used, and that data had not been updated with the 5' clipping from the first operation.
fgbio/src/main/scala/com/fulcrumgenomics/bam/SamRecordClipper.scala
Line 358 in 2af51ac
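To make the failure mode concrete, here is a minimal, self-contained sketch of how cached mate information goes stale. `Rec`, `referenceLength`, and `clipStart` are hypothetical names invented for this illustration; they are not fgbio's `SamRecord` or `SamRecordClipper` API, and `clipStart` assumes a forward-strand read whose CIGAR is a single match block.

```scala
object StaleMateCigarSketch {
  // Hypothetical, simplified stand-in for a SAM record; this is NOT fgbio's SamRecord.
  // mateStart and mateCigar play the role of the cached mate fields (e.g. the MC tag).
  final case class Rec(start: Int, cigar: String, mateStart: Int, mateCigar: Option[String])

  private val CigarElem = """(\d+)([MIDNSHP=X])""".r

  /** Reference bases covered by a CIGAR string (only M, D, N, = and X consume reference). */
  def referenceLength(cigar: String): Int =
    CigarElem.findAllMatchIn(cigar)
      .filter(m => "MDN=X".contains(m.group(2)))
      .map(_.group(1).toInt)
      .sum

  /** Soft-clips `n` bases from the start of a record. Assumes the CIGAR is a single
    * match block, to keep the sketch short. The mate's cached fields are deliberately
    * left untouched, which is the situation described above. */
  def clipStart(rec: Rec, n: Int): Rec = {
    val len = referenceLength(rec.cigar)
    rec.copy(start = rec.start + n, cigar = s"${n}S${len - n}M")
  }

  def main(args: Array[String]): Unit = {
    val r1 = Rec(start = 100, cigar = "50M", mateStart = 100, mateCigar = Some("50M"))
    val r2 = Rec(start = 100, cigar = "50M", mateStart = 100, mateCigar = Some("50M"))

    // First operation: clip the 5' end of r2. Nothing updates r1's view of its mate.
    val r2Clipped = clipStart(r2, 10)

    println(s"r1's cached view of its mate: start=${r1.mateStart} cigar=${r1.mateCigar}")
    println(s"actual mate after clipping:   start=${r2Clipped.start} cigar=${r2Clipped.cigar}")
  }
}
```

After the first clip, `r1` still reports a mate start of 100 and a mate CIGAR of `50M`, while the in-memory mate is now at 110 with `10S40M`. Any second operation on `r1` that trusts the cached fields works from the wrong coordinates.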
This reliance on a possibly dirty MC tag is the cause of the inconsistent behavior reported in #878. I believe this fix makes that issue irrelevant, and it may also fix some other oddities that people have pointed out.
I've removed the convenience functions that allowed this to happen, and now enforce explicit passing of either the mate or its start / end.
I do this rather than updating the MC tag, because that extra step between each operation doesn't seem worthwhile when we already have the object loaded in memory; the tags should be updated a single time, after all operations are complete.
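The shape of that change can be sketched roughly as follows. This is an illustration only: `Rec`, `refBasesPastMate`, and `syncMateInfo` are hypothetical names, the calculation works on reference coordinates alone (ignoring soft clips and indels), and none of it is the actual `SamRecordClipper` signature.

```scala
object ExplicitMateSketch {
  // Hypothetical, simplified record; NOT fgbio's SamRecord.
  final case class Rec(start: Int, end: Int, positiveStrand: Boolean, mateStart: Int, mateEnd: Int)

  /** Reference bases of `rec` that extend past its mate, given the mate's coordinates
    * explicitly. There is deliberately no overload that reads rec.mateStart / rec.mateEnd. */
  def refBasesPastMate(rec: Rec, mateStart: Int, mateEnd: Int): Int =
    if (rec.positiveStrand) math.max(0, rec.end - mateEnd)
    else math.max(0, mateStart - rec.start)

  /** Same calculation, taking the in-memory mate so its current coordinates are used. */
  def refBasesPastMate(rec: Rec, mate: Rec): Int =
    refBasesPastMate(rec, mate.start, mate.end)

  /** Refreshes the cached mate fields once, after all clipping operations are complete. */
  def syncMateInfo(rec: Rec, mate: Rec): Rec =
    rec.copy(mateStart = mate.start, mateEnd = mate.end)

  def main(args: Array[String]): Unit = {
    val r1 = Rec(start = 100, end = 159, positiveStrand = true,  mateStart = 100, mateEnd = 149)
    val r2 = Rec(start = 100, end = 149, positiveStrand = false, mateStart = 100, mateEnd = 159)

    // Ask the in-memory mate, not the cached fields, how far r1 overhangs.
    val overhang  = refBasesPastMate(r1, r2)
    val r1Clipped = r1.copy(end = r1.end - overhang)

    // Only now bring r2's cached view of its mate up to date.
    val r2Synced = syncMateInfo(r2, r1Clipped)
    println(s"overhang=$overhang r1.end=${r1Clipped.end} r2.mateEnd=${r2Synced.mateEnd}")
  }
}
```

Because every call site must supply the mate or its coordinates, a stale cached value cannot silently feed into a later operation, and the cached fields are written exactly once at the end.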
The exception is a single method used in consensus calling, where the mate isn't easily available. That pattern still requires reading from the mate CIGAR, so this bug might still occur there: each mate could have clipping applied, but again the data isn't synchronized to the MC tag.
I'll work on that bug as well. I believe it should be handled by operating on mates together, so that all relevant data is loaded in memory explicitly rather than read from the tag.
As a side note: I don't understand the whitespace formatting rules for the project. Do you happen to have an IntelliJ profile I could use to keep my formatting consistent with yours?