diff --git a/README.md b/README.md index 09b2f19..6b5a88e 100644 --- a/README.md +++ b/README.md @@ -8,7 +8,7 @@ [bioconda-badge-link]: https://img.shields.io/conda/dn/bioconda/fgsv.svg?label=Bioconda [bioconda-link]: http://bioconda.github.io/recipes/fgsv/README.html -[github-badge]: https://github.com/fulcrumgenomics/fgsv/actions/workflows/unittests.yaml/badge.svg +[github-badge]: https://github.com/fulcrumgenomics/fgsv/actions/workflows/unittests.yaml/badge.svg?branch=main [github-link]: https://github.com/fulcrumgenomics/fgsv/actions/workflows/unittests.yaml [scala-badge]: https://img.shields.io/badge/language-scala-c22d40.svg [scala-link]: https://www.scala-lang.org/ @@ -17,8 +17,82 @@ [doi-badge]: https://zenodo.org/badge/454071954.svg [doi-link]: https://zenodo.org/doi/10.5281/zenodo.10452647 -Tools to find evidence for structural variation. +Tools to gather evidence for structural variation via breakpoint detection. ## Documentation -Documentation can be found in the [docs folder](docs/01_Introduction.md) +More detailed documentation can be found in the [docs folder](docs/01_Introduction.md). + +## Introduction to the `fgsv` Toolkit + +The `fgsv` toolkit contains tools for effective structural variant investigation. +These tools are not meant to be used as a structural variant calling toolchain in-and-of-itself; instead, it is better to think of `fgsv` as a breakpoint detection and structural variant exploration toolkit. + +> [!NOTE] +> When describing structural variation, we use the term **breakpoint** to mean a junction between two loci and the term **breakend** to refer to one of the loci on one side of a breakpoint. + +> [!IMPORTANT] +> All point intervals (1-length) reported by this toolkit are 1-based inclusive from the perspective of the reference sequence unless otherwise documented. + +### `SvPileup` + +Collates pileups of reads over breakpoint events. 
+ +```console +fgsv SvPileup \ + --input sample.bam \ + --output sample.svpileup +``` + +The tool [`fgsv SvPileup`](https://github.com/fulcrumgenomics/fgsv/blob/main/docs/tools/SvPileup.md) takes a queryname-grouped BAM file as input and scans each query group (template) of alignments for structural variant evidence. +As a simple example, a paired-end template may have one alignment per read (one alignment for read 1 and another alignment for read 2) mapped to different reference sequences, supporting a putative translocation. + +Primary and supplementary alignments for a template are used to construct a “chain” of aligned sub-segments in a way that honors the sub-segments' mapping locations and strandedness as compared to the reference sequence. +The aligned sub-segments in a chain relate to each other through typical alignment mechanisms like insertions and deletions but also contain information about the relative orientation of the sub-segment to the reference sequence and, importantly, jumps between reference sequences, which could indicate translocations. +See the [SAM Format Specification v1](https://samtools.github.io/hts-specs/SAMv1.pdf) for more information on how reads relate to alignments. + +For each chain of aligned sub-segments per template, outlier jumps are collected: a jump between segments within a read must span 100bp (by default) or greater, and a jump between reads (e.g. between the two reads of a paired-end template) must span 1000bp (by default) or greater. +At locations where these jumps occur, breakpoints are marked and given a unique ID based on the loci of the breakends and the directionality of the left and right strands leading into each breakend. +Where there is evidence for both a split-read jump and an inter-read jump, the split-read alignment evidence is favored since it gives a precise breakpoint. +This process creates a collection of candidate breakpoint locations. 
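The jump thresholds described above can be illustrated with a minimal Scala sketch. It is not the `fgsv` implementation: the `Segment` type, the `isJump` function, and the constant names are hypothetical, and strand and orientation handling is deliberately left out. Only the two default distances (100bp within a read, 1000bp between reads) come from the text.

```scala
// Hypothetical sketch: these names are illustrative and are not part of the fgsv API.
object JumpSketch {
  /** One aligned sub-segment of a template, in 1-based inclusive reference coordinates.
    * `read` is 1 or 2 for the read of the pair that the segment came from. */
  final case class Segment(contig: String, start: Int, end: Int, read: Int)

  val MinIntraReadJump: Int = 100  // default minimum jump between segments of one read
  val MinInterReadJump: Int = 1000 // default minimum jump between reads of one template

  /** True when moving from segment `a` to the next segment `b` in a chain looks like an
    * outlier jump. Orientation changes are ignored here (a simplifying assumption). */
  def isJump(a: Segment, b: Segment): Boolean = {
    if (a.contig != b.contig) true // a change of reference sequence is always a jump
    else {
      // Distance between the two segments on the reference; overlapping segments give 0.
      val gap       = math.max(0, math.max(a.start, b.start) - math.min(a.end, b.end))
      val threshold = if (a.read == b.read) MinIntraReadJump else MinInterReadJump
      gap >= threshold
    }
  }
}
```

For instance, `JumpSketch.isJump(Segment("chr1", 100, 200, 1), Segment("chr1", 350, 400, 1))` is true in this sketch because the 150bp gap within one read meets the 100bp threshold, while the same gap between read 1 and read 2 falls short of the 1000bp threshold.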
+ +The tool outputs a table of candidate breakpoints and a BAM file with each alignment tagged with the ID of the breakpoint it supports (if any). + +### `AggregateSvPileup` + +Aggregates and merges pileups that are likely to support the same breakpoint. + +```console +fgsv AggregateSvPileup \ + --bam sample.bam \ + --input sample.svpileup.txt \ + --output sample.svpileup.aggregate.txt +``` + +The tool [`fgsv AggregateSvPileup`](https://github.com/fulcrumgenomics/fgsv/blob/main/docs/tools/AggregateSvPileup.md) is used to aggregate nearby breakpoints into one event if they appear to support one true breakpoint. +This polishing step preserves true positive breakpoint events and is intended to reduce the number of false positive breakpoint events. + +Aggregating breakpoints is often necessary because of variability in typical short-read alignments caused by somatic mutation, sequencing error, alignment artifacts, or breakend sequence similarity/homology to the reference sequence. +Variability in short-read alignments means that it is not always possible to locate the exact nucleotide coordinate where either breakend of a breakpoint occurs. +Instead, either breakend of a true breakpoint may map to a plausible region (instead of a point coordinate) and, when this happens, the cluster of breakends can be aggregated to build up support for one true breakpoint. + +Clustered breakpoints are only merged if their left breakends map to the same strand of the same reference sequence, their right breakends map to the same strand of the same reference sequence, and their left and right genomic breakend positions are both within a given length threshold of 10bp (by default). 
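The merge condition above can be read as a pairwise predicate, sketched here in Scala. This is an illustration under stated assumptions rather than the tool's code: the `Breakpoint` type and `mergeable` function are hypothetical names, and treating "within 10bp" as an inclusive bound is an assumption.

```scala
// Hypothetical sketch: these names are illustrative and are not part of the fgsv API.
object MergeSketch {
  /** A candidate breakpoint: contig, 1-based position, and strand for each breakend. */
  final case class Breakpoint(
    leftContig: String, leftPos: Int, leftStrand: Char,
    rightContig: String, rightPos: Int, rightStrand: Char
  )

  val DefaultMaxDist: Int = 10 // default length threshold, from the text

  /** True when two breakpoints agree on contig and strand at both breakends and both
    * pairs of positions lie within `maxDist` (inclusive bound assumed). */
  def mergeable(a: Breakpoint, b: Breakpoint, maxDist: Int = DefaultMaxDist): Boolean =
    a.leftContig == b.leftContig && a.leftStrand == b.leftStrand &&
      a.rightContig == b.rightContig && a.rightStrand == b.rightStrand &&
      math.abs(a.leftPos - b.leftPos) <= maxDist &&
      math.abs(a.rightPos - b.rightPos) <= maxDist
}
```

A pairwise check like this says nothing about how clusters are grown from many breakpoints; that part of the aggregation is not covered by the sketch.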
+ +One shortcoming of the existing behavior, which should be corrected at some point, is that intra-read breakpoint evidence is considered similarly to inter-pair breakpoint evidence even though intra-read breakpoint evidence often has nucleotide-level alignment resolution and inter-pair breakpoint evidence does not. + +The tool outputs a table of aggregated breakpoints and a modified copy of the input BAM file where each alignment is tagged with the ID of the aggregate breakpoint it supports (if any). + +### `AggregateSvPileupToBedPE` + +Converts the output of `AggregateSvPileup` to the [BEDPE format](https://bedtools.readthedocs.io/en/latest/content/general-usage.html#bedpe-format). + +```console +fgsv AggregateSvPileupToBedPE \ + --input sample.svpileup.aggregate.txt \ + --output sample.svpileup.aggregate.bedpe +``` + +The tool [`fgsv AggregateSvPileupToBedPE`](https://github.com/fulcrumgenomics/fgsv/blob/main/docs/tools/AggregateSvPileupToBedPE.md) is used to convert the output of `AggregateSvPileup` to BEDPE so that it can be viewed in [IGV](https://igv.org/) and other BEDPE-supporting genome browsers. +For example: + +![BEDPE in IGV](docs/img/fgsv-bedpe.png) diff --git a/docs/01_Introduction.md b/docs/01_Introduction.md index 2d6262d..e99c2c0 100644 --- a/docs/01_Introduction.md +++ b/docs/01_Introduction.md @@ -6,15 +6,3 @@ The following sections will help you to get started. * [Contributing](03_Contributing.md) * [Metric Descriptions](04_Metrics.md) * [Tools Descriptions](05_Tools.md) - -## Overview - -`fgsv` contains tools for gathering evidence for structural variants -from aligned reads. The `SvPileup` tool searches for split read mappings -and read pairs that map across breakpoints, emitting verbose information -similar to other "piluep" tools for small variant detection, but in this -case for structural variation detection. 
The `AggregateSvPileup` attempts -to aggregate information across "nearby" pileups, which is useful as often -the genomic start and end of a breakpoint is not always precise. The tools -aim to be as sensitive as possible to find these evidence, but do neither -perform structural variation calling nor genotyping. diff --git a/docs/04_Metrics.md b/docs/04_Metrics.md index 1cd4c78..df3cace 100644 --- a/docs/04_Metrics.md +++ b/docs/04_Metrics.md @@ -29,12 +29,12 @@ Aggregated cluster of breakpoint pileups |id|String|Combined ID retaining the IDs of all constituent breakpoints| |category|BreakpointCategory|Breakpoint category| |left_contig|String|Contig name for left side of breakpoint| -|left_min_pos|Int|Minimum coordinate of left breakends (1-based)| -|left_max_pos|Int|Maximum coordinate of left breakends (1-based)| +|left_min_pos|Int|Minimum coordinate of left breakends (1-based inclusive)| +|left_max_pos|Int|Maximum coordinate of left breakends (1-based inclusive)| |left_strand|Char|Strand at left breakends| |right_contig|String|Contig name for right side of breakpoint| -|right_min_pos|Int|Minimum coordinate of right breakends (1-based)| -|right_max_pos|Int|Maximum coordinate of right breakends (1-based)| +|right_min_pos|Int|Minimum coordinate of right breakends (1-based inclusive)| +|right_max_pos|Int|Maximum coordinate of right breakends (1-based inclusive)| |right_strand|Char|Strand at right breakends| |split_reads|Int|Total number of split reads supporting the breakpoints in the cluster| |read_pairs|Int|Total number of read pairs supporting the breakpoints in the cluster| @@ -82,10 +82,10 @@ the only information comes from read-pairs and the breakpoint information should |------|----|-----------| |id|String|An ID assigned to the breakpoint that can be used to lookup supporting reads in the BAM.| |left_contig|String|The contig of chromosome on which the left hand side of the breakpoint exists.| -|left_pos|Int|The position (possibly imprecise) of the left-hand 
breakend (1-based).| +|left_pos|Int|The position (possibly imprecise) of the left-hand breakend (1-based, inclusive).| |left_strand|Char|The strand of the left-hand breakend; sequence reads would traverse this strand in order to arrive at the breakend and transit into the right-hand side of the breakpoint.| |right_contig|String|The contig of chromosome on which the left hand side of the breakpoint exists.| -|right_pos|Int|The position (possibly imprecise) of the right-hand breakend (1-based).| +|right_pos|Int|The position (possibly imprecise) of the right-hand breakend (1-based, inclusive).| |right_strand|Char|The strand of the right-hand breakend;. sequence reads would continue reading onto this strand after transiting the breakpoint from the left breakend| |split_reads|Int|The number of templates/inserts with split-read alignments that identified this breakpoint.| |read_pairs|Int|The number of templates/inserts with read-pair alignments (and without split-read alignments) that identified this breakpoint.| diff --git a/docs/img/fgsv-bedpe.png b/docs/img/fgsv-bedpe.png new file mode 100644 index 0000000..dea86c1 Binary files /dev/null and b/docs/img/fgsv-bedpe.png differ diff --git a/docs/tools/AggregateSvPileup.md b/docs/tools/AggregateSvPileup.md index 012d262..ab7e1ef 100644 --- a/docs/tools/AggregateSvPileup.md +++ b/docs/tools/AggregateSvPileup.md @@ -7,7 +7,7 @@ title: AggregateSvPileup ## Overview **Group:** Breakpoint and SV Tools -Merges nearby pileups of reads supporting putative breakpoints. +Aggregates and merges pileups that are likely to support the same breakpoint. Takes as input the file of pileups produced by `SvPileup`. 
That file contains a list of breakpoints, each consisting of a chromosome, position and strand for each side of the breakpoint, as well as quantified read support @@ -36,7 +36,7 @@ of the overlapping target regions are copied from the `SvPiluep` input (if prese The output file is a tab-delimited table with one record per aggregated cluster of pileups. Aggregated pileups are reported with the minimum and maximum (inclusive) coordinates of all pileups in the cluster, a possible putative structural variant event type supported by the pileups, and the sum of read support from all -pileups in the cluster. Positions in this file are 1-based positions. +pileups in the cluster. Positions in this file are 1-based inclusive positions. ## Arguments diff --git a/docs/tools/SvPileup.md b/docs/tools/SvPileup.md index a6290ec..7d194ba 100644 --- a/docs/tools/SvPileup.md +++ b/docs/tools/SvPileup.md @@ -7,7 +7,7 @@ title: SvPileup ## Overview **Group:** Breakpoint and SV Tools -Collates a pileup of putative structural variant supporting reads. +Collates pileups of reads over breakpoint events. ## Outputs @@ -15,7 +15,7 @@ Two output files will be created: 1. `.txt`: a tab-delimited file describing SV pileups, one line per breakpoint event. The returned breakpoint will be canonicalized such that the "left" side of the breakpoint will have the lower (or equal to) - position on the genome vs. the "right"s side. Positions in this file are 1-based positions. + position on the genome vs. the "right"s side. Positions in this file are 1-based inclusive positions. 2. `.bam`: a SAM/BAM file containing reads that contain SV breakpoint evidence annotated with SAM tag. diff --git a/docs/tools/index.md b/docs/tools/index.md index 495675b..0c8fe57 100644 --- a/docs/tools/index.md +++ b/docs/tools/index.md @@ -4,16 +4,16 @@ title: fgsv tools # fgsv tools -The following tools are available in fgsv version 0.2.0-d603e95. +The following tools are available in fgsv version 0.2.0-3f52590. 
## Breakpoint and SV Tools Primary tools for calling and transforming breakpoints and SVs. |Tool|Description| |----|-----------| -|[AggregateSvPileup](AggregateSvPileup.md)|Merges nearby pileups of reads supporting putative breakpoints| +|[AggregateSvPileup](AggregateSvPileup.md)|Aggregates and merges pileups that are likely to support the same breakpoint| |[FilterAndMerge](FilterAndMerge.md)|Filters and merges SVPileup output| -|[SvPileup](SvPileup.md)|Collates a pileup of putative structural variant supporting reads| +|[SvPileup](SvPileup.md)|Collates pileups of reads over breakpoint events| ## Utility Tools diff --git a/src/main/scala/com/fulcrumgenomics/sv/tools/AggregateSvPileup.scala b/src/main/scala/com/fulcrumgenomics/sv/tools/AggregateSvPileup.scala index 01fb081..f8e4531 100644 --- a/src/main/scala/com/fulcrumgenomics/sv/tools/AggregateSvPileup.scala +++ b/src/main/scala/com/fulcrumgenomics/sv/tools/AggregateSvPileup.scala @@ -17,7 +17,7 @@ import scala.collection.mutable @clp(group=ClpGroups.BreakpointAndSv, description= """ - |Merges nearby pileups of reads supporting putative breakpoints. + |Aggregates and merges pileups that are likely to support the same breakpoint. | |Takes as input the file of pileups produced by `SvPileup`. That file contains a list of breakpoints, each |consisting of a chromosome, position and strand for each side of the breakpoint, as well as quantified read support diff --git a/src/main/scala/com/fulcrumgenomics/sv/tools/SvPileup.scala b/src/main/scala/com/fulcrumgenomics/sv/tools/SvPileup.scala index 5b2eeca..f74dd59 100644 --- a/src/main/scala/com/fulcrumgenomics/sv/tools/SvPileup.scala +++ b/src/main/scala/com/fulcrumgenomics/sv/tools/SvPileup.scala @@ -39,7 +39,7 @@ object TargetBedRequirement extends FgBioEnum[TargetBedRequirement] { @clp(group=ClpGroups.BreakpointAndSv, description= """ - |Collates a pileup of putative structural variant supporting reads. + |Collates pileups of reads over breakpoint events. | |## Outputs |