Defaulted to excluding duplicate and QC failing reads from pileup. #42

Open · wants to merge 2 commits into main
12 changes: 6 additions & 6 deletions docs/04_Metrics.md
@@ -29,12 +29,12 @@ Aggregated cluster of breakpoint pileups
|id|String|Combined ID retaining the IDs of all constituent breakpoints|
|category|BreakpointCategory|Breakpoint category|
|left_contig|String|Contig name for left side of breakpoint|
|left_min_pos|Int|Minimum coordinate of left breakends (1-based)|
|left_max_pos|Int|Maximum coordinate of left breakends (1-based)|
|left_min_pos|Int|Minimum coordinate of left breakends (1-based inclusive)|
|left_max_pos|Int|Maximum coordinate of left breakends (1-based inclusive)|
|left_strand|Char|Strand at left breakends|
|right_contig|String|Contig name for right side of breakpoint|
|right_min_pos|Int|Minimum coordinate of right breakends (1-based)|
|right_max_pos|Int|Maximum coordinate of right breakends (1-based)|
|right_min_pos|Int|Minimum coordinate of right breakends (1-based inclusive)|
|right_max_pos|Int|Maximum coordinate of right breakends (1-based inclusive)|
|right_strand|Char|Strand at right breakends|
|split_reads|Int|Total number of split reads supporting the breakpoints in the cluster|
|read_pairs|Int|Total number of read pairs supporting the breakpoints in the cluster|
@@ -82,10 +82,10 @@ the only information comes from read-pairs and the breakpoint information should
|------|----|-----------|
|id|String|An ID assigned to the breakpoint that can be used to lookup supporting reads in the BAM.|
|left_contig|String|The contig or chromosome on which the left-hand side of the breakpoint exists.|
|left_pos|Int|The position (possibly imprecise) of the left-hand breakend (1-based).|
|left_pos|Int|The position (possibly imprecise) of the left-hand breakend (1-based, inclusive).|
|left_strand|Char|The strand of the left-hand breakend; sequence reads would traverse this strand in order to arrive at the breakend and transit into the right-hand side of the breakpoint.|
|right_contig|String|The contig or chromosome on which the right-hand side of the breakpoint exists.|
|right_pos|Int|The position (possibly imprecise) of the right-hand breakend (1-based).|
|right_pos|Int|The position (possibly imprecise) of the right-hand breakend (1-based, inclusive).|
|right_strand|Char|The strand of the right-hand breakend; sequence reads would continue reading onto this strand after transiting the breakpoint from the left breakend.|
|split_reads|Int|The number of templates/inserts with split-read alignments that identified this breakpoint.|
|read_pairs|Int|The number of templates/inserts with read-pair alignments (and without split-read alignments) that identified this breakpoint.|
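For orientation while reading the diff, here is a minimal Scala sketch of a record matching the columns visible in this hunk. The class and field names are illustrative assumptions; the actual metric class in fgsv may carry additional fields not shown here.

```scala
// Illustrative sketch only: one row of the breakpoint table documented above.
// Positions are 1-based and, per this PR's doc change, inclusive.
case class BreakpointPileupRecord(
  id: String,           // ID used to look up supporting reads in the output BAM
  leftContig: String,   // contig/chromosome of the left-hand breakend
  leftPos: Int,         // possibly imprecise, 1-based
  leftStrand: Char,     // '+' or '-'
  rightContig: String,  // contig/chromosome of the right-hand breakend
  rightPos: Int,        // possibly imprecise, 1-based
  rightStrand: Char,    // '+' or '-'
  splitReads: Int,      // templates/inserts with split-read support
  readPairs: Int        // templates/inserts with read-pair-only support
)
```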
2 changes: 1 addition & 1 deletion docs/tools/AggregateSvPileup.md
@@ -36,7 +36,7 @@ of the overlapping target regions are copied from the `SvPileup` input (if present)
The output file is a tab-delimited table with one record per aggregated cluster of pileups. Aggregated
pileups are reported with the minimum and maximum (inclusive) coordinates of all pileups in the cluster, a
possible putative structural variant event type supported by the pileups, and the sum of read support from all
pileups in the cluster. Positions in this file are 1-based positions.
pileups in the cluster. Positions in this file are 1-based inclusive positions.
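As a rough illustration of the aggregation described above, each cluster reports the minimum and maximum coordinates of its constituent pileups and the sum of their read support. The sketch below uses assumed names and is not the tool's implementation.

```scala
// Illustrative sketch of the aggregation rule; type and field names are assumptions.
case class PileupSupport(leftPos: Int, rightPos: Int, splitReads: Int, readPairs: Int)

case class AggregateSummary(
  leftMinPos: Int, leftMaxPos: Int,    // 1-based inclusive bounds over the cluster
  rightMinPos: Int, rightMaxPos: Int,
  splitReads: Int, readPairs: Int      // summed read support across the cluster
)

def aggregate(cluster: Seq[PileupSupport]): AggregateSummary = {
  require(cluster.nonEmpty, "a cluster must contain at least one pileup")
  AggregateSummary(
    leftMinPos  = cluster.map(_.leftPos).min,
    leftMaxPos  = cluster.map(_.leftPos).max,
    rightMinPos = cluster.map(_.rightPos).min,
    rightMaxPos = cluster.map(_.rightPos).max,
    splitReads  = cluster.map(_.splitReads).sum,
    readPairs   = cluster.map(_.readPairs).sum
  )
}
```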

## Arguments

4 changes: 3 additions & 1 deletion docs/tools/SvPileup.md
@@ -15,7 +15,7 @@ Two output files will be created:

1. `<output-prefix>.txt`: a tab-delimited file describing SV pileups, one line per breakpoint event. The returned
breakpoint will be canonicalized such that the "left" side of the breakpoint will have the lower (or equal)
position on the genome vs. the "right" side. Positions in this file are 1-based positions.
position on the genome vs. the "right" side. Positions in this file are 1-based inclusive positions (a sketch of this canonicalization follows below).
2. `<output-prefix>.bam`: a SAM/BAM file containing reads that contain SV breakpoint evidence, annotated with a SAM
tag.
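
A minimal sketch of the canonicalization rule mentioned in item 1 above, assuming breakends are ordered by (contig index, position); the names here are illustrative rather than taken from the tool's code.

```scala
// Sketch only: the "left" breakend is the one with the lower (or equal) genomic coordinate.
case class Breakend(contigIndex: Int, pos: Int) // pos is 1-based inclusive

def canonicalize(a: Breakend, b: Breakend): (Breakend, Breakend) = {
  val aFirst = a.contigIndex < b.contigIndex ||
    (a.contigIndex == b.contigIndex && a.pos <= b.pos)
  if (aFirst) (a, b) else (b, a)
}
```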

@@ -91,4 +91,6 @@ Split read evidence will be returned in favor of across-read-pair evidence when
|slop|s|Int|The number of bases of slop to allow when determining which records to track for the left or right side of an aligned segment when merging segments.|Optional|1|5|
|targets-bed|t|FilePath|Optional bed file of target regions|Optional|1||
|targets-bed-requirement|T|Requirement|Requirement on if each side of the breakpoint must overlap a target. Will always annotate each side of the breakpoint.|Optional|1|AnnotateOnly|
|include-duplicates||Boolean|Include reads/templates marked as duplicates when constructing the pileup|Optional|1|false|
|include-qc-fails||Boolean|Include reads/templates marked as QC fail when constructing the pileup|Optional|1|false|
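The two new options reduce to a per-read predicate. The sketch below uses fgbio's `SamRecord` flag accessors and mirrors the change to `SvPileup.filterTemplate` later in this diff; the function name is illustrative, and with both flags at their defaults (false) duplicate and QC-failing reads are excluded from the pileup.

```scala
import com.fulcrumgenomics.bam.api.SamRecord

// Sketch of the per-read check implied by --include-duplicates / --include-qc-fails.
def keepForPileup(rec: SamRecord,
                  includeDuplicates: Boolean = false,
                  includeQcFails: Boolean = false): Boolean =
  (includeDuplicates || !rec.duplicate) && (includeQcFails || rec.pf)
```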

2 changes: 1 addition & 1 deletion docs/tools/index.md
@@ -4,7 +4,7 @@ title: fgsv tools

# fgsv tools

The following tools are available in fgsv version 0.2.0-d603e95.
The following tools are available in fgsv version 0.2.0-5ce8bc6.
Contributor: Could you please bump the major version, as excluding duplicates (and QC fails) by default will be a breaking change?

## Breakpoint and SV Tools

Primary tools for calling and transforming breakpoints and SVs.
24 changes: 17 additions & 7 deletions src/main/scala/com/fulcrumgenomics/sv/tools/SvPileup.scala
@@ -15,8 +15,6 @@ import enumeratum.EnumEntry
import htsjdk.samtools.SAMFileHeader.{GroupOrder, SortOrder}
import htsjdk.samtools.util.OverlapDetector
import htsjdk.tribble.bed.BEDFeature

import scala.collection.immutable.IndexedSeq
import scala.collection.{immutable, mutable}

/** An enumeration over how to utilize the input target BED file if given. */
@@ -122,7 +120,9 @@ class SvPileup
|side of an aligned segment when merging segments.""") slop: Int = 5,
@arg(flag='t', doc="Optional bed file of target regions") targetsBed: Option[FilePath] = None,
@arg(flag='T', doc="Requirement on if each side of the breakpoint must overlap a target. Will always annotate each side of the breakpoint.")
targetsBedRequirement: TargetBedRequirement.Requirement = TargetBedRequirement.AnnotateOnly
targetsBedRequirement: TargetBedRequirement.Requirement = TargetBedRequirement.AnnotateOnly,
@arg(doc="Include reads/templates marked as duplicates when constructing the pileup") val includeDuplicates: Boolean = false,
@arg(doc="Include reads/templates marked as QC fail when constructing the pileup") val includeQcFails: Boolean = false,
) extends SvTool {

import SvPileup._
@@ -145,7 +145,15 @@

Bams.templateIterator(source)
.tapEach(t => progress.record(t.allReads.next()))
.flatMap(template => filterTemplate(template, minPrimaryMapq=minPrimaryMappingQuality, minSupplementaryMapq=minSupplementaryMappingQuality))
.flatMap { template =>
filterTemplate(
template,
minPrimaryMapq=minPrimaryMappingQuality,
minSupplementaryMapq=minSupplementaryMappingQuality,
incDupes=includeDuplicates,
incQcFails=includeQcFails,
)
}
.foreach { template =>
// Find the breakpoints
val evidences = findBreakpoints(
@@ -297,9 +305,11 @@ object SvPileup extends LazyLogging {
*/
def filterTemplate(t: Template,
minPrimaryMapq: Int,
minSupplementaryMapq: Int): Option[Template] = {
val r1PrimaryOk = t.r1.exists(r => r.mapped && r.mapq >= minPrimaryMapq)
val r2PrimaryOk = t.r2.exists(r => r.mapped && r.mapq >= minPrimaryMapq)
minSupplementaryMapq: Int,
incDupes: Boolean,
incQcFails: Boolean): Option[Template] = {
val r1PrimaryOk = t.r1.exists(r => r.mapped && r.mapq >= minPrimaryMapq && (incDupes || !r.duplicate) && (incQcFails || r.pf))
val r2PrimaryOk = t.r2.exists(r => r.mapped && r.mapq >= minPrimaryMapq && (incDupes || !r.duplicate) && (incQcFails || r.pf))

if (!r1PrimaryOk && !r2PrimaryOk) None else Some(
Template(
62 changes: 52 additions & 10 deletions src/test/scala/com/fulcrumgenomics/sv/tools/SvPileupTest.scala
@@ -119,10 +119,12 @@ class SvPileupTest extends UnitSpec {
import SamBuilder.{Minus, Plus, Strand}

/** Construct a read/rec with the information necessary for breakpoint detection. */
def r(chrom: String, pos: Int, strand: SamBuilder.Strand, r: Int, cigar: String, supp: Boolean, mapq: Int = 60): SamRecord = {
def r(chrom: String, pos: Int, strand: SamBuilder.Strand, r: Int, cigar: String, supp: Boolean, mapq: Int = 60, dupe: Boolean = false, pf: Boolean = true): SamRecord = {
require(r == 0 || r == 1 || r == 2)
val rec = builder.addFrag(contig=builder.dict(chrom).index, start=pos, strand=strand, cigar=cigar, mapq=mapq).get
rec.supplementary = supp
rec.duplicate = dupe
rec.pf = pf

if (r > 0) {
rec.paired = true
@@ -332,7 +334,7 @@
r("chr1", 300, Plus, r=2, cigar="100M", supp=false, mapq=50),
)

SvPileup.filterTemplate(template, minPrimaryMapq=30, minSupplementaryMapq=20).value shouldBe template
SvPileup.filterTemplate(template, minPrimaryMapq=30, minSupplementaryMapq=20, incDupes=true, incQcFails=true).value shouldBe template
}

it should "do nothing to a template with high mapq primaries and supplementaries" in {
@@ -343,7 +345,7 @@
r("chr1", 120, Minus, r=2, cigar="30M70S", supp=true, mapq=50),
)

SvPileup.filterTemplate(template, minPrimaryMapq=30, minSupplementaryMapq=20).value shouldBe template
SvPileup.filterTemplate(template, minPrimaryMapq=30, minSupplementaryMapq=20, incDupes=true, incQcFails=true).value shouldBe template
}

it should "remove a low quality supplementary record" in {
@@ -354,7 +356,7 @@
r("chr1", 120, Minus, r=2, cigar="30M70S", supp=true, mapq=1),
)

SvPileup.filterTemplate(template, minPrimaryMapq=30, minSupplementaryMapq=20).value shouldBe template.copy(r2Supplementals=Nil)
SvPileup.filterTemplate(template, minPrimaryMapq=30, minSupplementaryMapq=20, incDupes=true, incQcFails=true).value shouldBe template.copy(r2Supplementals=Nil)
}

it should "remove all evidence of R2 if the primary mapping is low quality" in {
@@ -366,7 +368,7 @@
)

val expected = template.copy(r2=None, r2Supplementals=Nil)
SvPileup.filterTemplate(template, minPrimaryMapq=30, minSupplementaryMapq=20).value shouldBe expected
SvPileup.filterTemplate(template, minPrimaryMapq=30, minSupplementaryMapq=20, incDupes=true, incQcFails=true).value shouldBe expected
}

it should "return None if both R1 and R2 primaries are low quality" in {
@@ -377,7 +379,7 @@
r("chr1", 120, Minus, r=2, cigar="30M70S", supp=true, mapq=50),
)

SvPileup.filterTemplate(template, minPrimaryMapq=30, minSupplementaryMapq=20) shouldBe None
SvPileup.filterTemplate(template, minPrimaryMapq=30, minSupplementaryMapq=20, incDupes=true, incQcFails=true) shouldBe None
}

it should "remove an unmapped R2" in {
@@ -387,7 +389,7 @@
Seq(r("chr1", 100, Plus, r=2, cigar="100M", supp=false, mapq=0)).tapEach(_.unmapped = true).head,
)

SvPileup.filterTemplate(template, minPrimaryMapq=30, minSupplementaryMapq=20).value shouldBe template.copy(r2=None)
SvPileup.filterTemplate(template, minPrimaryMapq=30, minSupplementaryMapq=20, incDupes=true, incQcFails=true).value shouldBe template.copy(r2=None)
}

it should "handle a template with fragment data with high mapping quality" in {
@@ -396,7 +398,7 @@
r("chr7", 800, Plus, r=0, cigar="50S50M", supp=true, mapq=50),
)

SvPileup.filterTemplate(template, minPrimaryMapq=30, minSupplementaryMapq=20).value shouldBe template
SvPileup.filterTemplate(template, minPrimaryMapq=30, minSupplementaryMapq=20, incDupes=true, incQcFails=true).value shouldBe template
}

it should "remove low mapq supplementary records from a fragment template" in {
@@ -407,7 +409,7 @@
)

val expected = template.copy(r1Supplementals = template.r1Supplementals.filter(_.start != 800))
SvPileup.filterTemplate(template, minPrimaryMapq=30, minSupplementaryMapq=20).value shouldBe expected
SvPileup.filterTemplate(template, minPrimaryMapq=30, minSupplementaryMapq=20, incDupes=true, incQcFails=true).value shouldBe expected
}

it should "return None for a fragment template with a low quality primary mapping" in {
@@ -417,7 +419,47 @@
r("chr7", 820, Plus, r=0, cigar="70S30M", supp=true, mapq=50),
)

SvPileup.filterTemplate(template, minPrimaryMapq=30, minSupplementaryMapq=20) shouldBe None
SvPileup.filterTemplate(template, minPrimaryMapq=30, minSupplementaryMapq=20, incDupes=true, incQcFails=true) shouldBe None
}

it should "return a template with the duplicate read removed when one read is marked as duplicate" in {
val template = t(
r("chr1", 100, Plus, r=1, cigar="100M", supp=false, mapq=60, dupe=true),
r("chr7", 800, Minus, r=2, cigar="100M", supp=false, mapq=60, dupe=false),
)

val filtered = SvPileup.filterTemplate(template, minPrimaryMapq=30, minSupplementaryMapq=20, incDupes=false, incQcFails=false)
filtered.value.r1 shouldBe None
filtered.value.r2 shouldBe template.r2
}

it should "return None when both primary reads are marked as duplicates" in {
val template = t(
r("chr1", 100, Plus, r=1, cigar="100M", supp=false, mapq=60, dupe=true),
r("chr7", 800, Minus, r=2, cigar="100M", supp=false, mapq=60, dupe=true),
)

SvPileup.filterTemplate(template, minPrimaryMapq=30, minSupplementaryMapq=20, incDupes=false, incQcFails=false) shouldBe None
}

it should "return a template with the duplicate read removed when one read is marked as qc failing" in {
val template = t(
r("chr1", 100, Plus, r=1, cigar="100M", supp=false, mapq=60, dupe=false),
r("chr7", 800, Minus, r=2, cigar="100M", supp=false, mapq=60, dupe=false, pf=false),
)

val filtered = SvPileup.filterTemplate(template, minPrimaryMapq=30, minSupplementaryMapq=20, incDupes=false, incQcFails=false)
filtered.value.r1 shouldBe template.r1
filtered.value.r2 shouldBe None
}

it should "return None when both primary reads are marked as qc fail" in {
val template = t(
r("chr1", 100, Plus, r=1, cigar="100M", supp=false, mapq=60, dupe=false, pf=false),
r("chr7", 800, Minus, r=2, cigar="100M", supp=false, mapq=60, dupe=false, pf=false),
)

SvPileup.filterTemplate(template, minPrimaryMapq=30, minSupplementaryMapq=20, incDupes=false, incQcFails=false) shouldBe None
}

/** Writes the given BAM records to a BAM file */
Expand Down