ConSTRain is a short tandem repeat (STR) variant caller that can account for copy number variants (CNVs), aneuploidies, and polyploid genomes. It is accurate and fast, needing less than 20 minutes to genotype >1.7 million STRs in a 100X alignment of human whole-genome sequencing reads on 32 threads. For more detailed information on ConSTRain, please refer to the preprint.
- Method overview
- Installation
- Running ConSTRain
- Input files and their formats
- Command line arguments
- INFO and FORMAT fields
In case you encounter any problems with ConSTRain or have a question, feel free to open an issue or send an email to [email protected].
To infer STR genotypes, ConSTRain generates all possible allele length distributions for each locus and returns the one that best matches the observed allele length distribution. An overview of how ConSTRain works internally is shown below.
ConSTRain overview and example.
(1) An STR locus are loaded from the input files.
The locus reference information is parsed from the STR panel.
The STR copy number is set based on the karyotype, and optionally updated if the STR is affected by a CNV.
(2) Reads overlapping the STR region are extracted from the alignment file, and the length of the STR region in each read is determined.
(3) The observed distribution is sorted, and at most as many allele lengths as the STR copy number are kept.
(4) This yields the final observed allele length distribution.
(5) Next, all possible genotypes are generated for the STR copy number and stored in matrix G.
(6) From G, the matrix D is generated by multiplying it with the total number of mapped reads (51 in the example) divided by the STR copy number (3 in the example).
Each row in D corresponds to the expected allele length distribution of one of the genotypes in G.
(7) The expected distribution with the lowest error to the observed distribution is found by taking the absolute difference between each row in D and the observed distribution, then (8) taking the sum of rows and finding the one with the lowest value.
(9) The genotype in G with the lowest error is selected (10) and reported in the output.
The inferred genotype of the STR locus in this example consists of an allele of 4 CAG
units (present once), an allele of 5 CAG
units (present once), and an allele of 8 CAG
units (also present once).
ConSTRain releases ship with pre-compiled binaries for MacOS (Intel) and x86_64 Linux. You can find them here. Please notify us of any issues or if you'd like us to include pre-compiled binaries for other targets.
In order to build ConSTRain from the source code, you will need to install the Rust programming language. Then, clone this repository and build ConSTRain as follows:
git clone https://github.com/acg-team/ConSTRain.git
cd ConSTRain/ConSTRain
cargo build --release --bin ConSTRain
The binary will be created at ./target/release/ConSTRain
(alternatively, cargo install --path .
from within the ConSTRain/ConSTRain
directory will create the binary and put it in ~/.cargo/bin
).
Building hts-sys
fails with the following error:
Unable to find libclang: "couldn't find any valid shared libraries matching: ['libclang.so', 'libclang-*.so', 'libclang.so.*' 'libclang-*.so.*'], set the LIBCLANG_PATH environment variable to a path where one of these files can be found (invalid: [])"
This is caused because Bindgen
requires libclang
. The problem can be solved by installing clang
(see instructions here) or, if it is already installed, providing the location of the directory containing libclang.so
through the LIBCLANG_PATH
environment variable:
LIBCLANG_PATH=/path/to/clang/lib cargo build --release --bin ConSTRain
Building hts-sys
fails with an error like this:
./htslib/htslib/hts.h:31:10: fatal error: 'stddef.h' file not found
Caused because some system header file (e.g., stddef.h
) is not found. Address this by providing the location of the directory containing system header files through the C_INCLUDE_PATH
environment variable:
C_INCLUDE_PATH=/path/to/include cargo build --release --bin ConSTRain
Some basic ConSTRain invocations are shown here.
For details on all command line arguments, run ConSTRain --help
or go to the command line arguments section below.
Calling STR variants from aligned sequencing reads is done via the ConSTRain alignment
subcommand.
To run, ConSTRain alignment
needs at least a BED file specifiying locations of STRs in the reference genome, a JSON file encoding the target chromosomes and their ploidies, and an alignment in SAM/BAM/CRAM format.
For some species, input files are provided in resources directory (currently, only Homo sapiens and Musa acuminata).
A basic ConSTRain alignment
invocation looks like this:
ConSTRain alignment -a <ALIGNMENT> -k <KARYOTYPE> -r <REPEATS>
By default, ConSTRain prints logging information up to and including 'info' level to stderr.
To enable more in-depth logging information, set the RUST_LOG
environment variable, e.g.:
RUST_LOG=debug ConSTRain alignment -a <ALIGNMENT> -k <KARYOTYPE> -r <REPEATS>
If you know there are CNVs in the sample you're analysing, ConSTRain can account for these if you supply the affected regions as a BED file via the --cnvs
flag:
ConSTRain alignment -a <ALIGNMENT> -k <KARYOTYPE> -r <REPEATS> --cnvs <CNVs>
Once an alignment has been analysed using ConSTRain alignment
the generated VCF file can be used to re-run ConSTRain.
This is useful if you decide you want to use different filtering parameters, or maybe you have access to new CNV information that was not available when you first analysed a sample.
Rather than having to analyse the alignment again, you can use the ConSTRain vcf
subcommand to re-estimate STR genotypes.
This is possible because ConSTRain includes the observed allele length distribution for each STR as a FORMAT field in the VCF file.
Running ConSTRain vcf
is typically a matter of a few seconds.
The basic invocation of ConSTRain vcf
is:
ConSTRain vcf -v <VCF> -k <KARYOTYPE> -r <REPEATS> -s <SAMPLE>
Again, you can add CNV information via the --cnv
flag:
ConSTRain vcf -v <VCF> -k <KARYOTYPE> -r <REPEATS> -s <SAMPLE> --cnvs <CNVs>
ConSTRain expects alignments to adhere to file specifications maintained by the GA4GH, and outputs variant calls in VCF format. However, specific formats are expected for other input files, which are outlined here.
- Copy number variants: CNVs can be specified as a BED3+1 file, with the extra column indicating the (absolute!) copy number of the region affected by the CNV.
E.g.,
chr5 106680 106750 3
. - Karyotype: the organism's karyotype should be specified as a JSON file mapping the names of chromosomes to their ploidies.
E.g.,
{"chr1": 2, ... "chrX": 2, "chrY: 0"}
for a human female sample,{"chr1": 2, ... "chrX": 1, "chrY: 1"}
for a human male. It is critical that chromosome names in this file match chromosome names in the alignment file exactly. - Repeats: the location of repeats in the reference genome should be provided as a BED3+2 file, with the two extra columns indicating the repeat unit length and the repeat unit.
E.g.,
chr5 21004 21014 2 AT
specifies a dinucleotide repeat with sequenceATATATATAT
starting at position 21004 on chromosome 5.
All of ConSTRain's command line arguments are listed here. For details on expected formats of the CNV, karyotype, and repeat file, see the section on input file formats below.
long | short | subcommand | required | default | description |
---|---|---|---|---|---|
--alignment | -a | alignment | y | Input file to extract repeat allele lengths from. Can be SAM/BAM/CRAM | |
--cnvs | both | n | Copy number variants for this individual. Expected format is BED3+1 | ||
--flanksize | alignment | y | 5 | Size of flanking region around the target repeat that reads need to cover to be considered (both sides) | |
--karyotype | -k | alignment | y | File containing chromosome names and their base ploidies. Expected format is JSON | |
--max-cn | both | y | 20 | Maximum copy number to consider | |
--max-norm-depth | both | n | Maximum normalised depth of coverage to perform allele length estimation. E.g., max-norm-depth of 30. means loci with more than 60 reads for a locus with copy number 2 will be skipped, 90 for copy number 3 etc. |
||
--min-norm-depth | both | y | 1.0 | Minimum normalised depth of coverage to perform allele length estimation. E.g., min-norm-depth of 10. means at least 20 reads are needed for a locus with copy number 2, 30 for copy number 3 etc. (must be at least 1.) |
|
--reference | alignment | n | Reference genome. Expected format is FASTA (NOT gzipped), index file should exist right next to FASTA. Required if alignment is in CRAM format | ||
--repeats | -r | alignment | y | File specifying target repeat regions. Expected format is BED3+2 | |
--sample | both | n (y for vcf) | infer from file name | Sample name | |
--threads | both | y | 1 | Number of threads to use | |
--vcf | -v | vcf | y | Input file to extract repeat allele lengths from (VCF) |
ConSTRain provides several INFO and FORMAT fields in its VCF output to describe repeat loci and genotyping results.
INFO field | Description |
---|---|
END | End position of reference allele |
RU | Repeat unit |
PERIOD | Repeat period (length of unit) |
REF | Repeat allele length in reference |
FORMAT field | Description |
---|---|
GT | Genotype |
FT | Filter tag. Contains PASS if all filters passed, otherwise reason for filter |
CN | Copy number |
DP | Number of fully spanning reads mapped to locus |
FREQS | Frequencies observed for each allele length. Keys are allele lengths and values are the number of reads with that allele length |
REPLEN | Genotype given in the number of times the unit is repeated for each allele |
The FT FORMAT field may hold different values. An overview of the meaning of different FT values is given here:
FT value | Description |
---|---|
PASS | All filters passed |
UNDEF | Undefined ConSTRain filter |
DPZERO | No reads were mapped to locus |
DPOOR | Normalised depth of coverage at locus was out of range specified by --min-norm-depth and --max-norm-depth command line arguments |
CNZERO | Copy number was zero |
CNOOR | Copy number was out of range specified by --max-cn command line argument |
CNMISSING | No copy number set for locus. Can happen if contig is missing from karyotype or if only a part of the STR is affected by a CNA |
AMBGT | Multiple genotypes are equally likely |