-
-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GroupReadsByUmi may fail when marking duplicates including secondary/supplementary reads #964
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #964 +/- ##
==========================================
+ Coverage 95.62% 95.64% +0.01%
==========================================
Files 126 126
Lines 7360 7392 +32
Branches 495 531 +36
==========================================
+ Hits 7038 7070 +32
Misses 322 322
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
@@ -41,6 +42,36 @@ import htsjdk.samtools.util.{CloserUtil, CoordMath, Murmur3, SequenceUtil} | |||
import java.io.Closeable | |||
import scala.math.{max, min} | |||
|
|||
|
|||
|
|||
case class Supplementary(refName: String, start: Int, positiveStrand: Boolean, cigar: Cigar, mapq: Int, nm: Int) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scaladocs needed later
|
||
case class Supplementary(refName: String, start: Int, positiveStrand: Boolean, cigar: Cigar, mapq: Int, nm: Int) { | ||
def negativeStrand: Boolean = !positiveStrand | ||
def refIndex(header: SAMFileHeader): Int = header.getSequence(refName).getSequenceIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may invert this, and store refIndex
instead, since when we have the SAM header when want refName
(in toString
)
|
||
def end: Int = start + cigar.lengthOnTarget - 1 | ||
def unclippedStart: Int = { | ||
SAMUtils.getUnclippedStart(start, cigar.toHtsjdkCigar) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we probably could just compute these directly without having to route to htsjdk
|
||
|
||
def apply(sa: String): Supplementary = { | ||
val parts = sa.split(",") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably good to check we get 6 parts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Is validation of the type/value of each part also necessary?
for (primary <- r1; nonPrimary <- r2NonPrimary) { | ||
SamPairUtil.setMateInformationOnSupplementalAlignment(nonPrimary.asSam, primary.asSam, true) | ||
nonPrimary(SAMTag.MQ.name()) = primary.mapq | ||
nonPrimary("mp") = Supplementary.toString(primary) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: store these tag definitions somewhere else
import scala.reflect.runtime.universe.Template | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import scala.reflect.runtime.universe.Template |
} | ||
else { | ||
TemplateCoordinateKey(mateChrom, readChrom, matePos, readPos, mateNeg, readNeg, lib, mid, rec.name, true) | ||
// For non-secondary/non-supplementary alignments, use the info in the record. For secondary and supplementary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: how can we simplify these two branches, since they're very similar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you additionally set mp
on the primary alignments, not just the supplementaries, (and also take my suggestion to define an apply
for SamRecord
🙂 ) you could do the following:
val primary = if (!rec.secondary && !rec.supplementary) Supplementary(rec) else Supplementary(rec[String]("rp"))
val mate = Supplementary(rec[String]("mp"))
// Just the second branch, using the info from `Supplementary` instead of `SamRecord`
...
val primary = Supplementary(rec[String]("rp")) | ||
val mate = Supplementary(rec[String]("mp")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Todo, better error message or fallback
…nd supplementary reads Secondary and supplementary reads must use the coordinates of the primary alignments within the template, otherwise they will not guaranteed to be next the primary alignments in the file. Therefore, we've added the "rp" and "mp" tags to store the SA-tag equivalent information for the primary alignment. This keeps information about the primary alignments with the secondary and supplementary alignments.
cd1460b
to
1b5753c
Compare
def negativeStrand: Boolean = !positiveStrand | ||
def refIndex(header: SAMFileHeader): Int = header.getSequence(refName).getSequenceIndex | ||
|
||
def end: Int = start + cigar.lengthOnTarget - 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question Is end inclusive or exclusive?
(And maybe add scaladoc to clarify)
I recently added an equivalent property to fgpyo
and made it exclusive; if this is the same I think we don't want to subtract 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fgbio
is generally 1-based inclusive, but I don't think we say that anywhere.
object Supplementary { | ||
/** Returns a formatted alignment as per the SA tag: `(rname ,pos ,strand ,CIGAR ,mapQ ,NM ;)+` */ | ||
def toString(rec: SamRecord): String = { | ||
val strand = if (rec.positiveStrand) '+' else '-' | ||
f"${rec.refName},${rec.start},${strand},${rec.cigar},${rec.mapq},${rec.getOrElse(SAMTag.NM.name(),0)}" | ||
} | ||
|
||
|
||
def apply(sa: String): Supplementary = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would prefer to have two apply
methods, one for SamRecord
and one for String
, and a class toString
method that converts an instance of Supplementary
to String
|
||
|
||
def apply(sa: String): Supplementary = { | ||
val parts = sa.split(",") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Is validation of the type/value of each part also necessary?
@@ -107,11 +138,21 @@ case class Template(r1: Option[SamRecord], | |||
|
|||
/** Fixes mate information and sets mate cigar on all primary and supplementary (but not secondary) records. */ | |||
def fixMateInfo(): Unit = { | |||
for (primary <- r1; supp <- r2Supplementals) { | |||
SamPairUtil.setMateInformationOnSupplementalAlignment(supp.asSam, primary.asSam, true) | |||
// Set all mate info on BOTH secondary and supplementary records, not just supplementary records. We also need to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment on line 139 should be updated (or removed) to reflect this
|
||
def apply(sa: String): Supplementary = { | ||
val parts = sa.split(",") | ||
Supplementary(parts(0), parts(1).toInt, parts(2) == "+", Cigar(parts(3)), parts(4).toInt, parts(5).toInt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Supplementary(parts(0), parts(1).toInt, parts(2) == "+", Cigar(parts(3)), parts(4).toInt, parts(5).toInt) | |
Supplementary(parts(0), parts(1).toInt - 1, parts(2) == "+", Cigar(parts(3)), parts(4).toInt, parts(5).toInt) |
suggestion We need to subtract 1 if start
is zero-based.
Without scaladoc I'm not sure, but I'm assuming it's zero based; and the SA
tag is 1-based.
Pos is a 1-based coordinate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto: fgbio
is generally 1-based inclusive
for (primary <- r1; nonPrimary <- r2NonPrimary) { | ||
SamPairUtil.setMateInformationOnSupplementalAlignment(nonPrimary.asSam, primary.asSam, true) | ||
nonPrimary(SAMTag.MQ.name()) = primary.mapq | ||
nonPrimary("mp") = Supplementary.toString(primary) | ||
r2.foreach(r => nonPrimary("rp") = Supplementary.toString(r)) | ||
} | ||
for (primary <- r2; supp <- r1Supplementals) { | ||
SamPairUtil.setMateInformationOnSupplementalAlignment(supp.asSam, primary.asSam, true) | ||
for (primary <- r2; nonPrimary <- r1NonPrimary) { | ||
SamPairUtil.setMateInformationOnSupplementalAlignment(nonPrimary.asSam, primary.asSam, true) | ||
nonPrimary(SAMTag.MQ.name()) = primary.mapq | ||
nonPrimary("mp") = Supplementary.toString(primary) | ||
r1.foreach(r => nonPrimary("rp") = Supplementary.toString(r)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question Would you find it more legible to extract these for loops into a helper so we don't repeat it twice?
} | ||
else { | ||
TemplateCoordinateKey(mateChrom, readChrom, matePos, readPos, mateNeg, readNeg, lib, mid, rec.name, true) | ||
// For non-secondary/non-supplementary alignments, use the info in the record. For secondary and supplementary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you additionally set mp
on the primary alignments, not just the supplementaries, (and also take my suggestion to define an apply
for SamRecord
🙂 ) you could do the following:
val primary = if (!rec.secondary && !rec.supplementary) Supplementary(rec) else Supplementary(rec[String]("rp"))
val mate = Supplementary(rec[String]("mp"))
// Just the second branch, using the info from `Supplementary` instead of `SamRecord`
...
@@ -719,7 +719,7 @@ class GroupReadsByUmi | |||
|
|||
// Then output the records in the right order (assigned tag, read name, r1, r2) | |||
templatesByMi.keys.toSeq.sortBy(id => (id.length, id)).foreach(tag => { | |||
templatesByMi(tag).sortBy(t => t.name).flatMap(t => t.primaryReads).foreach(rec => { | |||
templatesByMi(tag).sortBy(t => t.name).flatMap(t => t.allReads).foreach(rec => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question Where are --include-supplementary
and --include-secondary
taken into consideration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the initial filter of the BAM file (see the _includeSecondaryReads
variabel)
There's an open issue in hts-specs about how we want to handle getting the primary alignment information when looking at a secondary or supplementary read: samtools/hts-specs#755
This PR adds the read primary "rp" tag to store the primary alignment for end of the current secondary/supplementary alignment, in the same format as the "SA" tag. The mate's primary alignment is stored in the "mp" tag. Both are currently lowercase as they are not reserved tags.
I have tested that
ZipperBams
will now add these, thatSortBam
will correctly sort in template-coordinate, and finally thatGroupReadsByUmi
passes. I added tests forGroupReadsByUmi
andSamOrder
.Also, in my hands, secondary and supplementary records will never be output by
GroupReadsByUmi
as currently only primary alignments are output.