Add a description of the toolkit to the README #34

Merged
merged 34 commits into from
Aug 13, 2024
Commits
798a5ac
Add a description of the toolkit to the README
clintval Nov 22, 2023
e3e3dc1
Generate docs files
Mar 13, 2024
0ade11f
Merge remote-tracking branch 'origin/main' into cv_README
clintval May 10, 2024
531a122
Generate docs files
May 10, 2024
17e55e4
Fix up README after a review
clintval May 10, 2024
983dbb7
Fix up README after a review
clintval May 10, 2024
5a903a0
Remove outdated intro in Overview
clintval May 10, 2024
a62b275
Fixup a sentence
clintval May 10, 2024
a8cc755
Whitespace
clintval May 10, 2024
9304a02
Generate docs files
May 10, 2024
475c3e9
Generate docs files
May 10, 2024
c2ca29a
Remove duplicate .gitignore line
clintval May 10, 2024
0090746
Generate docs files
May 10, 2024
7067311
Small review fixups
clintval May 10, 2024
9a45075
Generate docs files
May 10, 2024
af562e6
docs: revise docs based on @msto review
clintval Jul 23, 2024
818d158
Generate docs files
Jul 23, 2024
a29d0ca
docs: small docs fixups for clarity and formatting
clintval Jul 23, 2024
c6a7e11
Generate docs files
Jul 23, 2024
8d80f2a
docs: one more pass at docs clarity!
clintval Jul 23, 2024
88fff29
chore: query group and template definition
clintval Jul 23, 2024
697b07d
docs: move reference down
clintval Jul 23, 2024
34e9faf
docs: do not repeat thyself
clintval Jul 23, 2024
ecf1df4
Generate docs files
Jul 23, 2024
09ead37
docs: little fixup
clintval Jul 23, 2024
45a17c1
Generate docs files
Jul 23, 2024
ef7f8f2
docs: formatting to be the same
clintval Jul 23, 2024
ac5b334
Generate docs files
Jul 23, 2024
f56f0e7
chore: header fixup
clintval Jul 23, 2024
2297324
chore: header fixup
clintval Jul 23, 2024
51557f2
Generate docs files
Jul 23, 2024
df8ead1
Generate docs files
Jul 23, 2024
3f52590
docs: suit review from @nh13
clintval Aug 13, 2024
ddd45aa
Generate docs files
Aug 13, 2024
80 changes: 77 additions & 3 deletions README.md
@@ -8,7 +8,7 @@

[bioconda-badge-link]: https://img.shields.io/conda/dn/bioconda/fgsv.svg?label=Bioconda
[bioconda-link]: http://bioconda.github.io/recipes/fgsv/README.html
[github-badge]: https://github.com/fulcrumgenomics/fgsv/actions/workflows/unittests.yaml/badge.svg
[github-badge]: https://github.com/fulcrumgenomics/fgsv/actions/workflows/unittests.yaml/badge.svg?branch=main
[github-link]: https://github.com/fulcrumgenomics/fgsv/actions/workflows/unittests.yaml
[scala-badge]: https://img.shields.io/badge/language-scala-c22d40.svg
[scala-link]: https://www.scala-lang.org/
@@ -17,8 +17,82 @@
[doi-badge]: https://zenodo.org/badge/454071954.svg
[doi-link]: https://zenodo.org/doi/10.5281/zenodo.10452647

Tools to find evidence for structural variation.
Tools to gather evidence for structural variation via breakpoint detection.

## Documentation

Documentation can be found in the [docs folder](docs/01_Introduction.md)
More detailed documentation can be found in the [docs folder](docs/01_Introduction.md).

## Introduction to the `fgsv` Toolkit

The `fgsv` toolkit contains tools for effective structural variant debugging.
These tools are not meant to be used as a structural variant calling toolchain in and of themselves; instead, it is better to think of `fgsv` as an effective breakpoint detection and structural variant exploration toolkit.

> [!NOTE]
> When describing structural variation, we use the term **breakpoint** to mean a junction between two loci and the term **breakend** to refer to one of the loci in a breakpoint.

> [!IMPORTANT]
> All point intervals (1-length) reported by this toolkit are 1-based inclusive from the perspective of the reference sequence unless otherwise documented.

### `SvPileup`

Collates pileups of reads over breakpoint events.

```console
fgsv SvPileup \
--input sample.bam \
--output sample.svpileup
```

The tool [`fgsv SvPileup`](https://github.com/fulcrumgenomics/fgsv/blob/main/docs/tools/SvPileup.md) takes a queryname-grouped BAM file as input and scans each query group (template) of alignments for structural variant evidence.
As a simple example, a paired-end template may have one alignment per read: if the alignment for read 1 and the alignment for read 2 map to different reference sequences, the template supports a putative translocation.

Primary and supplementary alignments for a template are used to construct a “chain” of aligned sub-segments in a way that honors the sub-segments' mapping locations and strandedness as compared to the reference sequence.
The aligned sub-segments in a chain relate to each other through typical alignment mechanisms like insertions and deletions, but they also carry information about the orientation of each sub-segment relative to the reference sequence and, importantly, about jumps between reference sequences, which could indicate translocations.
See the [SAM Format Specification v1](https://samtools.github.io/hts-specs/SAMv1.pdf) for more information on how reads relate to alignments.

For each chain of aligned sub-segments per template, outlier jumps are collected: a jump between segments within a read must span at least 100bp (by default), and a jump between reads (e.g. between the two reads of a paired-end template) must span at least 1000bp (by default).
At locations where these jumps occur, breakpoints are marked and given a unique ID based on the loci of the breakends and the directionality of the left and right strands leading into each breakend.
When there is evidence for both a split-read jump and an inter-read jump, the split-read alignment evidence is favored since it gives a precise breakpoint.
This process creates a collection of candidate breakpoint locations.
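
The distance rule described above can be illustrated with a minimal Python sketch. It is not the `fgsv` implementation: the `Segment` class, the `gap`, `is_jump`, and `find_jumps` helpers, and the example coordinates are all hypothetical, and the sketch deliberately ignores strand changes, which the real tool also takes into account. Only the default thresholds (100bp within a read, 1000bp between reads) come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    """One aligned sub-segment of a template (hypothetical, simplified model)."""
    read: int    # 1 for read 1, 2 for read 2
    contig: str  # reference sequence name
    start: int   # 1-based inclusive start on the reference
    end: int     # 1-based inclusive end on the reference


def gap(a: Segment, b: Segment) -> int:
    """Distance in bp between two segments on the same contig (0 if they overlap)."""
    if b.start > a.end:
        return b.start - a.end
    if a.start > b.end:
        return a.start - b.end
    return 0


def is_jump(a: Segment, b: Segment, min_intra: int = 100, min_inter: int = 1000) -> bool:
    """True if moving from segment a to segment b looks like an outlier jump."""
    if a.contig != b.contig:
        return True  # a change of reference sequence always counts as a jump here
    # Assumption: the within-read threshold applies when both segments share a read.
    threshold = min_intra if a.read == b.read else min_inter
    return gap(a, b) >= threshold


def find_jumps(chain: List[Segment]) -> List[Tuple[Segment, Segment]]:
    """Return each consecutive pair of segments in the chain that forms a jump."""
    return [(a, b) for a, b in zip(chain, chain[1:]) if is_jump(a, b)]


chain = [
    Segment(read=1, contig="chr1", start=100, end=150),
    Segment(read=1, contig="chr1", start=400, end=450),    # 250bp away, same read
    Segment(read=2, contig="chr1", start=700, end=750),    # 250bp away, other read
    Segment(read=2, contig="chr2", start=5000, end=5050),  # different reference sequence
]
jumps = find_jumps(chain)
```

In this toy chain the first pair is flagged because 250bp exceeds the within-read threshold, the second pair is not flagged because 250bp is below the between-read threshold, and the last pair is flagged because the reference sequence changes.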

The tool outputs a table of candidate breakpoints and a BAM file with each alignment tagged with the ID of the breakpoint it supports (if any).
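
As a rough illustration of working with that table, the following Python sketch parses a tab-delimited breakpoint table and keeps the rows with split-read support. The column names are the ones listed in the metrics documentation touched by this pull request; the example rows, the `load_breakpoints` and `with_split_read_support` helpers, and the filtering threshold are hypothetical.

```python
import csv
import io

# Hypothetical example rows; real values would come from the SvPileup text output.
EXAMPLE = (
    "id\tleft_contig\tleft_pos\tleft_strand\t"
    "right_contig\tright_pos\tright_strand\tsplit_reads\tread_pairs\n"
    "1\tchr1\t1000\t+\tchr2\t5000\t-\t4\t2\n"
    "2\tchr1\t2000\t+\tchr1\t9000\t+\t0\t1\n"
)

INT_COLUMNS = ("left_pos", "right_pos", "split_reads", "read_pairs")


def load_breakpoints(handle):
    """Read a tab-delimited breakpoint table into a list of dicts with integer fields."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in INT_COLUMNS:
            row[key] = int(row[key])
        rows.append(row)
    return rows


def with_split_read_support(rows, min_split_reads=1):
    """Keep breakpoints backed by at least `min_split_reads` split-read templates."""
    return [r for r in rows if r["split_reads"] >= min_split_reads]


rows = load_breakpoints(io.StringIO(EXAMPLE))
supported = with_split_read_support(rows)
```

For a real file, the `io.StringIO(EXAMPLE)` argument would be replaced with an open file handle.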

### `AggregateSvPileup`

Aggregates and merges pileups that are likely to support the same breakpoint.

```console
fgsv AggregateSvPileup \
--bam sample.bam \
--input sample.svpileup.txt \
--output sample.svpileup.aggregate.txt
```

The tool [`fgsv AggregateSvPileup`](https://github.com/fulcrumgenomics/fgsv/blob/main/docs/tools/AggregateSvPileup.md) is used to aggregate nearby breakpoints into one event if they appear to support one true breakpoint.
This polishing step preserves true positive breakpoint events and is intended to reduce the number of false positive breakpoint events.

Aggregating breakpoints is often necessary because of variability in typical short-read alignments caused by somatic mutation, sequencing error, alignment artifact, or breakend sequence similarity/homology to the reference sequence.
Variability in short-read alignments means that it is not always possible to locate the exact nucleotide coordinate where either breakend in a breakpoint occurs.
Instead, either breakend of a true breakpoint may map to a plausible region rather than a point coordinate; when this happens, the cluster of breakends can be aggregated to build up support for one true breakpoint.

Clustered breakpoints are only merged if their left breakends map to the same strand of the same reference sequence, their right breakends map to the same strand of the same reference sequence, and their left and right genomic breakend positions are both within a given length threshold of 10bp (by default).
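
The merge condition can be written as a small predicate. The Python sketch below is only an illustration of the rule as stated here, not the tool's code: the `Breakpoint` class and `can_merge` function are hypothetical, and treating the 10bp threshold as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Breakpoint:
    """A candidate breakpoint with one breakend per side (hypothetical model)."""
    left_contig: str
    left_pos: int      # 1-based inclusive
    left_strand: str   # "+" or "-"
    right_contig: str
    right_pos: int     # 1-based inclusive
    right_strand: str  # "+" or "-"


def can_merge(a: Breakpoint, b: Breakpoint, max_dist: int = 10) -> bool:
    """True if a and b satisfy the merge rule described in the text."""
    same_left = a.left_contig == b.left_contig and a.left_strand == b.left_strand
    same_right = a.right_contig == b.right_contig and a.right_strand == b.right_strand
    # Assumption: "within 10bp" is inclusive of a distance of exactly 10.
    close = (
        abs(a.left_pos - b.left_pos) <= max_dist
        and abs(a.right_pos - b.right_pos) <= max_dist
    )
    return same_left and same_right and close


a = Breakpoint("chr1", 1000, "+", "chr2", 5000, "-")
b = Breakpoint("chr1", 1004, "+", "chr2", 5009, "-")  # both sides within 10bp of a
c = Breakpoint("chr1", 1004, "-", "chr2", 5009, "-")  # left strand differs from a
d = Breakpoint("chr1", 1000, "+", "chr2", 5020, "-")  # right side 20bp away from a
```

Here `a` and `b` would be merged, while `c` and `d` would each remain separate from `a`.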

One shortcoming of the existing behavior, which should be corrected at some point, is that intra-read breakpoint evidence is considered similarly to inter-pair breakpoint evidence even though intra-read breakpoint evidence often has nucleotide-level alignment resolution and inter-pair breakpoint evidence does not.

The tool outputs a table of aggregated breakpoints and a modified copy of the input BAM file where each alignment is tagged with the ID of the aggregate breakpoint it supports (if any).

### `AggregateSvPileupToBedPE`

Converts the output of `AggregateSvPileup` to the [BEDPE format](https://bedtools.readthedocs.io/en/latest/content/general-usage.html#bedpe-format).

```console
fgsv AggregateSvPileupToBedPE \
--input sample.svpileup.aggregate.txt \
--output sample.svpileup.aggregate.bedpe
```
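
One detail worth keeping in mind when moving to BEDPE is the coordinate system: this toolkit reports 1-based inclusive positions, whereas BEDPE uses 0-based starts with exclusive ends. The Python sketch below shows the standard conversion between the two conventions. It is an illustration under that assumption rather than a description of what the tool does internally, and the function names, the placeholder score, and the example values are hypothetical.

```python
from typing import List, Tuple


def to_zero_based_half_open(min_pos: int, max_pos: int) -> Tuple[int, int]:
    """Convert a 1-based inclusive interval to a 0-based half-open interval."""
    if min_pos < 1 or max_pos < min_pos:
        raise ValueError(f"invalid 1-based inclusive interval: {min_pos}-{max_pos}")
    return min_pos - 1, max_pos


def bedpe_fields(
    left_contig: str, left_min: int, left_max: int, left_strand: str,
    right_contig: str, right_min: int, right_max: int, right_strand: str,
    name: str, score: str = "0",
) -> List[str]:
    """Build the ten core BEDPE columns for one aggregated breakpoint."""
    start1, end1 = to_zero_based_half_open(left_min, left_max)
    start2, end2 = to_zero_based_half_open(right_min, right_max)
    return [
        left_contig, str(start1), str(end1),
        right_contig, str(start2), str(end2),
        name, score, left_strand, right_strand,
    ]


line = "\t".join(
    bedpe_fields("chr1", 1000, 1010, "+", "chr2", 5000, 5000, "-", name="bp1")
)
```

A 1-length point interval such as position 5000 becomes the half-open interval 4999 to 5000, so its length is preserved.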

The tool [`fgsv AggregateSvPileupToBedPE`](https://github.com/fulcrumgenomics/fgsv/blob/main/docs/tools/AggregateSvPileupToBedPE.md) is used to convert the output of `AggregateSvPileup` to BEDPE so that it can be viewed in [IGV](https://igv.org/) and other BEDPE-supporting genome browsers.
For example:

![BEDPE in IGV](docs/img/fgsv-bedpe.png)

pretty!

12 changes: 0 additions & 12 deletions docs/01_Introduction.md
@@ -6,15 +6,3 @@ The following sections will help you to get started.
* [Contributing](03_Contributing.md)
* [Metric Descriptions](04_Metrics.md)
* [Tools Descriptions](05_Tools.md)

## Overview

`fgsv` contains tools for gathering evidence for structural variants
from aligned reads. The `SvPileup` tool searches for split read mappings
and read pairs that map across breakpoints, emitting verbose information
similar to other "pileup" tools for small variant detection, but in this
case for structural variation detection. The `AggregateSvPileup` tool attempts
to aggregate information across "nearby" pileups, which is useful because
the genomic start and end of a breakpoint are not always precise. The tools
aim to be as sensitive as possible in finding this evidence, but perform
neither structural variation calling nor genotyping.
12 changes: 6 additions & 6 deletions docs/04_Metrics.md
@@ -29,12 +29,12 @@ Aggregated cluster of breakpoint pileups
|id|String|Combined ID retaining the IDs of all constituent breakpoints|
|category|BreakpointCategory|Breakpoint category|
|left_contig|String|Contig name for left side of breakpoint|
|left_min_pos|Int|Minimum coordinate of left breakends (1-based)|
|left_max_pos|Int|Maximum coordinate of left breakends (1-based)|
|left_min_pos|Int|Minimum coordinate of left breakends (1-based inclusive)|
|left_max_pos|Int|Maximum coordinate of left breakends (1-based inclusive)|
|left_strand|Char|Strand at left breakends|
|right_contig|String|Contig name for right side of breakpoint|
|right_min_pos|Int|Minimum coordinate of right breakends (1-based)|
|right_max_pos|Int|Maximum coordinate of right breakends (1-based)|
|right_min_pos|Int|Minimum coordinate of right breakends (1-based inclusive)|
|right_max_pos|Int|Maximum coordinate of right breakends (1-based inclusive)|
|right_strand|Char|Strand at right breakends|
|split_reads|Int|Total number of split reads supporting the breakpoints in the cluster|
|read_pairs|Int|Total number of read pairs supporting the breakpoints in the cluster|
@@ -82,10 +82,10 @@ the only information comes from read-pairs and the breakpoint information should
|------|----|-----------|
|id|String|An ID assigned to the breakpoint that can be used to lookup supporting reads in the BAM.|
|left_contig|String|The contig or chromosome on which the left hand side of the breakpoint exists.|
|left_pos|Int|The position (possibly imprecise) of the left-hand breakend (1-based).|
|left_pos|Int|The position (possibly imprecise) of the left-hand breakend (1-based, inclusive).|
|left_strand|Char|The strand of the left-hand breakend; sequence reads would traverse this strand in order to arrive at the breakend and transit into the right-hand side of the breakpoint.|
|right_contig|String|The contig or chromosome on which the right hand side of the breakpoint exists.|
|right_pos|Int|The position (possibly imprecise) of the right-hand breakend (1-based).|
|right_pos|Int|The position (possibly imprecise) of the right-hand breakend (1-based, inclusive).|
|right_strand|Char|The strand of the right-hand breakend; sequence reads would continue reading onto this strand after transiting the breakpoint from the left breakend.|
|split_reads|Int|The number of templates/inserts with split-read alignments that identified this breakpoint.|
|read_pairs|Int|The number of templates/inserts with read-pair alignments (and without split-read alignments) that identified this breakpoint.|
Binary file added docs/img/fgsv-bedpe.png
4 changes: 2 additions & 2 deletions docs/tools/AggregateSvPileup.md
@@ -7,7 +7,7 @@ title: AggregateSvPileup
## Overview
**Group:** Breakpoint and SV Tools

Merges nearby pileups of reads supporting putative breakpoints.
Aggregates and merges pileups that are likely to support the same breakpoint.

Takes as input the file of pileups produced by `SvPileup`. That file contains a list of breakpoints, each
consisting of a chromosome, position and strand for each side of the breakpoint, as well as quantified read support
@@ -36,7 +36,7 @@ of the overlapping target regions are copied from the `SvPileup` input (if prese
The output file is a tab-delimited table with one record per aggregated cluster of pileups. Aggregated
pileups are reported with the minimum and maximum (inclusive) coordinates of all pileups in the cluster, a
possible putative structural variant event type supported by the pileups, and the sum of read support from all
pileups in the cluster. Positions in this file are 1-based positions.
pileups in the cluster. Positions in this file are 1-based inclusive positions.
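
To make the reported fields concrete, here is a minimal Python sketch of how one cluster of pileups could be collapsed into a single record with minimum and maximum coordinates and summed read support. The field names follow the metrics tables in this pull request; the `summarize_cluster` helper, the example values, and the use of `_` to join constituent IDs are assumptions made for illustration only.

```python
def summarize_cluster(pileups):
    """Collapse a non-empty list of pileup dicts into one aggregated record."""
    if not pileups:
        raise ValueError("a cluster must contain at least one pileup")
    return {
        # Assumption: the combined ID simply joins the constituent IDs with "_".
        "id": "_".join(p["id"] for p in pileups),
        "left_min_pos": min(p["left_pos"] for p in pileups),
        "left_max_pos": max(p["left_pos"] for p in pileups),
        "right_min_pos": min(p["right_pos"] for p in pileups),
        "right_max_pos": max(p["right_pos"] for p in pileups),
        "split_reads": sum(p["split_reads"] for p in pileups),
        "read_pairs": sum(p["read_pairs"] for p in pileups),
    }


# Hypothetical cluster of two nearby pileups supporting the same breakpoint.
cluster = [
    {"id": "7", "left_pos": 1000, "right_pos": 5009, "split_reads": 3, "read_pairs": 1},
    {"id": "9", "left_pos": 1004, "right_pos": 5000, "split_reads": 1, "read_pairs": 2},
]
summary = summarize_cluster(cluster)
```

Because the minimum and maximum are taken per side, the left and right ranges are independent of the order in which pileups joined the cluster.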

## Arguments

4 changes: 2 additions & 2 deletions docs/tools/SvPileup.md
@@ -7,15 +7,15 @@ title: SvPileup
## Overview
**Group:** Breakpoint and SV Tools

Collates a pileup of putative structural variant supporting reads.
Collates pileups of reads over breakpoint events.

## Outputs

Two output files will be created:

1. `<output-prefix>.txt`: a tab-delimited file describing SV pileups, one line per breakpoint event. The returned
breakpoint will be canonicalized such that the "left" side of the breakpoint will have the lower (or equal to)
position on the genome vs. the "right" side. Positions in this file are 1-based positions.
position on the genome vs. the "right" side. Positions in this file are 1-based inclusive positions.
2. `<output-prefix>.bam`: a SAM/BAM file containing reads that contain SV breakpoint evidence, annotated with a SAM
tag.

6 changes: 3 additions & 3 deletions docs/tools/index.md
@@ -4,16 +4,16 @@ title: fgsv tools

# fgsv tools

The following tools are available in fgsv version 0.2.0-d603e95.
The following tools are available in fgsv version 0.2.0-51557f2.
## Breakpoint and SV Tools

Primary tools for calling and transforming breakpoints and SVs.

|Tool|Description|
|----|-----------|
|[AggregateSvPileup](AggregateSvPileup.md)|Merges nearby pileups of reads supporting putative breakpoints|
|[AggregateSvPileup](AggregateSvPileup.md)|Aggregates and merges pileups that are likely to support the same breakpoint|
|[FilterAndMerge](FilterAndMerge.md)|Filters and merges SVPileup output|
|[SvPileup](SvPileup.md)|Collates a pileup of putative structural variant supporting reads|
|[SvPileup](SvPileup.md)|Collates pileups of reads over breakpoint events|

## Utility Tools

@@ -17,7 +17,7 @@ import scala.collection.mutable

@clp(group=ClpGroups.BreakpointAndSv, description=
"""
|Merges nearby pileups of reads supporting putative breakpoints.
|Aggregates and merges pileups that are likely to support the same breakpoint.
|
|Takes as input the file of pileups produced by `SvPileup`. That file contains a list of breakpoints, each
|consisting of a chromosome, position and strand for each side of the breakpoint, as well as quantified read support
@@ -39,7 +39,7 @@ object TargetBedRequirement extends FgBioEnum[TargetBedRequirement] {

@clp(group=ClpGroups.BreakpointAndSv, description=
"""
|Collates a pileup of putative structural variant supporting reads.
|Collates pileups of reads over breakpoint events.
|
|## Outputs
|