Fix up README after a review
clintval committed May 10, 2024
1 parent 531a122 commit d36e72e
Showing 2 changed files with 64 additions and 19 deletions.
83 changes: 64 additions & 19 deletions README.md
@@ -17,30 +17,75 @@
[doi-badge]: https://zenodo.org/badge/454071954.svg
[doi-link]: https://zenodo.org/doi/10.5281/zenodo.10452647

Tools for calling breakpoints and exploring structural variation.
Tools to gather evidence for structural variation via breakpoint detection.

## Documentation

Documentation can be found in the [docs folder](docs/01_Introduction.md).

## Introduction to the `fgsv` Toolkit

The `fgsv` tools form an effective structural variant debugging toolkit but are not meant to be used as a structural variant calling toolchain in and of themselves.
Instead, it is better to think of `fgsv` as a breakpoint detection and structural variant exploration toolkit.

When describing structural variation, we use the term breakpoint to mean a junction between two loci and the term breakend to refer to one of the loci in a breakpoint.
Importantly, all point intervals (1-length) reported by this toolkit are 1-based inclusive from the perspective of the reference sequence.
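
The coordinate convention above matters when handing positions to tools that use 0-based half-open intervals (as BED-style formats do). The following is a minimal sketch of that conversion; the function name `point_to_half_open` is hypothetical and not part of `fgsv`.

```python
def point_to_half_open(pos_1based: int) -> tuple[int, int]:
    """Convert a 1-based inclusive point to a 0-based half-open (start, end) pair."""
    if pos_1based < 1:
        raise ValueError("1-based positions start at 1")
    # A 1-length interval at 1-based position p covers [p - 1, p) in 0-based half-open terms.
    return (pos_1based - 1, pos_1based)


print(point_to_half_open(100))
```

The resulting interval always has length one, which matches the "1-length" point intervals described above.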

### `fgsv SvPileup`

Collates a pileup of putative structural variant supporting reads.

```console
fgsv SvPileup \
--input sample.bam \
--output sample.svpileup
```

The tool [`fgsv SvPileup`](https://github.com/fulcrumgenomics/fgsv/blob/main/docs/tools/SvPileup.md) takes a query-grouped BAM file as input and scans through each template one at a time, where a template is the full collection of reads and alignments from a single source molecule.
Primary and supplemental alignments for a template are used to construct a “chain” of aligned sub-segments in a way that is order and strand-aware.
These aligned sub-segments relate to each other through typical alignment mechanisms like insertions and deletions but also contain information about the relative orientation of the sub-segment to the reference genome and importantly, jumps between reference sequences (chromosomes).

For each chain of aligned sub-segments per template, outlier jumps are collected where the minimum inter-segment distance within a read must be 100bp (by default) or greater and the minimum inter-read distance per pair must be 1000bp (by default) or greater.
In the case where there is both evidence for a split-read alignment and inter-read jump, the split-read alignment evidence is favored.
At locations where these jumps occur, breakpoints are marked, and the breakpoints are given a unique ID based on the position of the breakpoint, the directionality of the left and right strands, and the other location the aligned sub-segment jumps to.
The output of this process is simply a pileup of candidate breakpoint locations.
The output of this tool is a metrics file tabulating the breakpoints and a BAM file with each alignment having custom tags that indicate which breakpoint the alignment supports (by breakpoint ID), if any.

Because of variability in short-read sequence data and their alignments, evidence for a single breakpoint may span a few loci near the true breakpoint.
The tool [`fgsv AggregateSvPileup`](https://github.com/fulcrumgenomics/fgsv/blob/main/docs/tools/AggregateSvPileup.md) is used to coalesce nearby breakpoints into one call if they appear to belong to one breakpoint.
This polishing step preserves true positive breakpoint calls and should reduce the number of false positive breakpoint calls.
Adjacent breakpoints are only merged if their left sides map to the same reference sequence, their right sides map to the same reference sequence, the strandedness of the left and right aligned sub-segments is the same, and their left and right positions are both within a given length threshold.
One shortcoming of the existing behavior, that should be corrected at some point, is that inter-read breakpoint evidence is considered similarly to inter-pair breakpoint evidence even though inter-read breakpoint evidence often has nucleotide-level alignment resolution and inter-pair breakpoint evidence does not.
The output of this tool is a metrics file tabulating the coalesced breakpoints with all previous breakpoint IDs listed for the new breakpoint call and an estimation of the allele frequency of the call based on the alignments that support the breakpoint.

The `fgsv` tools are an effective structural variant debugging toolkit but are not meant to be considered as a structural variant calling toolchain in-and-of-itself.
Instead, it’s better to think of the `fgsv` toolkit as an effective “breakpoint caller”.
For example, a paired-end read may have an alignment per read: one alignment for read 1 and another alignment for read 2.
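
To illustrate why query-grouped input is required, here is a minimal sketch of collecting alignments into templates. The `Alignment` record and `group_by_template` function are hypothetical stand-ins for illustration only; they are not the types or functions that `fgsv` uses.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Alignment:
    """Hypothetical, simplified alignment record."""
    name: str                       # query (template) name
    read_num: int                   # 1 or 2 for paired-end data
    contig: str                     # reference sequence name
    start: int                      # 1-based inclusive start
    is_supplementary: bool = False


def group_by_template(records: Iterable[Alignment]) -> Iterator[tuple[str, list[Alignment]]]:
    """Yield (query name, alignments) for each run of records sharing a query name."""
    # groupby only merges *consecutive* records with the same key, so the
    # input must already be query-grouped for each template to come out whole.
    for name, group in groupby(records, key=lambda rec: rec.name):
        yield name, list(group)


records = [
    Alignment("q1", 1, "chr1", 100),
    Alignment("q1", 2, "chr1", 400),
    Alignment("q1", 1, "chr5", 9000, is_supplementary=True),
    Alignment("q2", 1, "chr2", 50),
    Alignment("q2", 2, "chr2", 300),
]
for name, alignments in group_by_template(records):
    print(name, len(alignments))
```

If the records for one query name were scattered through the input, this approach would split that template into several partial groups, which is the reason for the query-grouped requirement.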

Primary and supplementary alignments for a template (see the [SAM Format Specification v1](https://samtools.github.io/hts-specs/SAMv1.pdf) for more information) are used to construct a “chain” of aligned sub-segments in a way that honors the logical ordering of sub-segments and their strandedness in relation to the reference sequence.
These aligned sub-segments in a chain relate to each other through typical alignment mechanisms like insertions and deletions but also contain information about the relative orientation of the sub-segment to the reference sequence and, importantly, jumps between reference sequences such as translocations between chromosomes or contigs.

For each chain of aligned sub-segments per template, outlier jumps are collected where the minimum inter-segment distance within a read must be 100bp (by default) or greater, and the minimum inter-read distance across reads (e.g. between reads in a paired-end read) must be 1000bp (by default) or greater.
In the case where there is both evidence for a split-read alignment and inter-read jump, the split-read alignment evidence is favored since it gives a precise breakpoint.
At locations where these jumps occur, breakpoints are marked and the breakpoints are given a unique ID based on the positions of the breakends and the directionality of the left and right strands leading into each breakend.
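
The thresholding described above can be sketched roughly as follows. This is a simplified illustration and not the `fgsv` implementation: the `Segment` record and function names are hypothetical, the distance is taken as the reference gap between consecutive segments, and strand or orientation checks are deliberately left out. Only the 100bp and 1000bp defaults come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical aligned sub-segment of a template."""
    contig: str
    start: int      # 1-based inclusive
    end: int        # 1-based inclusive
    read_num: int   # which read of the template the segment came from


def is_outlier_jump(a: Segment, b: Segment,
                    min_intra_read: int = 100,
                    min_inter_read: int = 1000) -> bool:
    """Return True if moving from segment a to segment b looks like an outlier jump."""
    if a.contig != b.contig:
        # A change of reference sequence is always treated as a jump in this sketch.
        return True
    # Reference gap between the two segments (negative if they overlap).
    gap = max(a.start, b.start) - min(a.end, b.end)
    threshold = min_intra_read if a.read_num == b.read_num else min_inter_read
    return gap >= threshold


def find_jumps(chain: list[Segment]) -> list[tuple[Segment, Segment]]:
    """Collect consecutive segment pairs in a chain that form outlier jumps."""
    return [(a, b) for a, b in zip(chain, chain[1:]) if is_outlier_jump(a, b)]


chain = [
    Segment("chr1", 100, 150, read_num=1),
    Segment("chr1", 400, 450, read_num=1),   # 250bp gap within read 1
    Segment("chr1", 900, 950, read_num=2),   # 450bp gap across reads
    Segment("chr2", 10, 60, read_num=2),     # different reference sequence
]
print(len(find_jumps(chain)))
```

In this toy chain the within-read gap of 250bp and the change of reference sequence count as jumps, while the 450bp gap across reads falls below the 1000bp inter-read threshold.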

This process creates a collection of candidate breakpoint locations.
The output of this tool is a metrics file tabulating the breakpoints and a BAM file with each breakpoint-supporting alignment having custom tags that indicate which breakpoint the alignment supports.

### `fgsv AggregateSvPileup`

Merges nearby pileups of reads supporting putative breakpoints.

```console
fgsv AggregateSvPileup \
--bam sample.bam \
--input sample.svpileup.txt \
--output sample.svpileup.aggregate.txt
```

Because of variability in typical short-read alignments, evidence for a single breakpoint may span a few loci near the true breakend loci. For example, if the breakpoint only has inter-read evidence (between read 1 and read 2 of a pair), then the breakpoint could coincidentally occur within the unobserved bases between the two reads. In other cases, due to sequence similarity or homology between the breakend loci, it is not always possible to locate the exact nucleotide where each breakend occurs, and instead a plausible region may exist around each breakend locus.

The tool [`fgsv AggregateSvPileup`](https://github.com/fulcrumgenomics/fgsv/blob/main/docs/tools/AggregateSvPileup.md) is used to coalesce nearby breakpoints into one event if they appear to belong to one true breakpoint.
This polishing step preserves true positive breakpoint events and is intended to reduce the number of false positive breakpoint events.

Adjacent breakpoints are only merged if their left breakends map to the same reference sequence, their right breakends map to the same reference sequence, the strandedness of the left and right aligned sub-segments is the same, and their left and right genomic breakend positions are both within a given length threshold.
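
The merge conditions above can be written as a single predicate. The sketch below is illustrative only: the `Breakpoint` record, the `can_merge` function, and the `max_dist` parameter are hypothetical names, and the threshold of 10 used in the example is an arbitrary value, not a documented default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Breakpoint:
    """Hypothetical breakpoint with a left and a right breakend."""
    left_contig: str
    left_pos: int       # 1-based inclusive
    left_strand: str    # "+" or "-"
    right_contig: str
    right_pos: int      # 1-based inclusive
    right_strand: str   # "+" or "-"


def can_merge(a: Breakpoint, b: Breakpoint, max_dist: int) -> bool:
    """Return True if two breakpoints satisfy every merge condition."""
    return (
        a.left_contig == b.left_contig
        and a.right_contig == b.right_contig
        and a.left_strand == b.left_strand
        and a.right_strand == b.right_strand
        and abs(a.left_pos - b.left_pos) <= max_dist
        and abs(a.right_pos - b.right_pos) <= max_dist
    )


bp1 = Breakpoint("chr1", 1000, "+", "chr7", 5000, "-")
bp2 = Breakpoint("chr1", 1004, "+", "chr7", 4997, "-")
bp3 = Breakpoint("chr1", 1004, "+", "chr7", 4997, "+")
print(can_merge(bp1, bp2, max_dist=10))
print(can_merge(bp1, bp3, max_dist=10))
```

Here `bp1` and `bp2` agree on reference sequences and strands and lie within the threshold on both sides, whereas `bp3` differs in right-hand strand and so is kept separate.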

One shortcoming of the existing behavior, which should be corrected at some point, is that intra-read breakpoint evidence is considered similarly to inter-pair breakpoint evidence even though intra-read breakpoint evidence often has nucleotide-level alignment resolution and inter-pair breakpoint evidence does not.

The output of this tool is a metrics file tabulating the coalesced breakpoints with all previous breakpoint IDs listed for the new breakpoint event and an estimation of the allele frequency of the event based on the alignments that support the breakpoint.

### `fgsv AggregateSvPileupToBedPE`

Converts the output of `AggregateSvPileup` to BEDPE.

```console
fgsv AggregateSvPileupToBedPE \
--input sample.svpileup.aggregate.txt \
--output sample.svpileup.aggregate.bedpe
```

The tool [`fgsv AggregateSvPileupToBedPE`](https://github.com/fulcrumgenomics/fgsv/blob/main/docs/tools/AggregateSvPileupToBedPE.md) is used to convert the output of `AggregateSvPileup` to the [BEDPE format](https://bedtools.readthedocs.io/en/latest/content/general-usage.html#bedpe-format) so that it can be viewed in [IGV](https://igv.org/) and other BEDPE-supporting genome browsers. For example:

![BEDPE in IGV](docs/img/fgsv-bedpe.png)
Binary file added docs/img/fgsv-bedpe.png