Ricardo A. Vialle edited this page Jun 2, 2022 · 6 revisions

Welcome to the snakeSV wiki!

1. The pipeline

The snakeSV pipeline includes pre- and post-processing steps to deal with large-scale studies. The input data for the pipeline consists of a BAM file for each sample, a reference genome file (FASTA), and a configuration file in YAML format. Additionally, users can provide custom annotation files in BED format for SV interpretation, and VCF files with structural variants to be genotyped in addition to the discovery set.

2. Inputs

The minimal requirements for snakeSV to run are described in the following configuration file example (YAML format).

SAMPLE_KEY: "sampleKey.txt"

TOOLS:
  - "manta"
  - "smoove"
  - "delly"

# Reference genome files
REFERENCE_FASTA: "data/ref/human_g1k_v37.fasta.gz"
REF_BUILD: "37"

Description of each parameter:

SAMPLE_KEY: A file mapping sample IDs to BAM paths (tab-separated). Column names must be included as described below.

participant_id	bam
Sample_1	path_to/Sample_1.bam
Sample_2	path_to/Sample_2.bam
Sample_3	path_to/Sample_3.bam
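Because the sample key must be tab-separated, it can help to generate it with printf so the separator is a real tab rather than spaces. A minimal sketch (the sample names and BAM paths below are placeholders):

```shell
# Write a tab-separated sample key; \t guarantees real tab characters
printf 'participant_id\tbam\n' > sampleKey.txt
printf 'Sample_1\tpath_to/Sample_1.bam\n' >> sampleKey.txt
printf 'Sample_2\tpath_to/Sample_2.bam\n' >> sampleKey.txt
printf 'Sample_3\tpath_to/Sample_3.bam\n' >> sampleKey.txt

# Inspect the result; cat -A would show tabs as ^I if you want to verify
cat sampleKey.txt
```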

TOOLS: A list of tools to run for SV discovery. Each tool must be defined as Snakemake rules in its own file, where the name of the rule matches the name of the file, as follows: "${SNAKEDIR}/rules/tools/<tool_name>.smk"
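As an illustration of the naming convention, a tool rule file might look like the sketch below (the rule body, input/output paths, and command line are hypothetical; they must match the pipeline's actual interface):

```
# Hypothetical sketch of ${SNAKEDIR}/rules/tools/mytool.smk
# The rule name ("mytool") must match the file name.
rule mytool:
    input:
        bam = "input/{sample}.bam",
        ref = config["REFERENCE_FASTA"]
    output:
        vcf = "sv_discovery/mytool/{sample}.vcf"
    shell:
        "mytool call --bam {input.bam} --ref {input.ref} -o {output.vcf}"
```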

REFERENCE_FASTA: A reference genome fasta (e.g. human_g1k_v37.fasta)

REF_BUILD: The reference build (e.g. 37 or 38)

Optional parameters:

GTF: A GENCODE GTF file used to annotate SV consequences (e.g. gencode.v38lift37.annotation.nochr.gtf.gz). Based on these genic annotations, the impact of each SV can be classified as Loss of Function (LOF), Intragenic Exonic Duplication (DUP_LOF), Whole-Gene Copy Gain (COPY_GAIN), or Whole-Gene Inversion (INV_SPAN), as described by the gnomAD-SV consortium.

ANNOTATION_BED: List of paths to BED files with custom annotations to be used for annotating SVs. Each BED file requires the following columns: chr, start, end, element_name. SVs overlapping coordinates described in any BED file will be flagged with the element_name appended to the INFO field.

SV_PANEL: Path to a VCF file with SVs to be included in the genotyping step in addition to the discovered SVs.

TMP_DIR: Path to a temporary folder to be used by tools, if different from the default /tmp.
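Putting the optional parameters together with the minimal ones, a fuller configuration file might look like this (the GTF name comes from the example above; all other file paths are illustrative placeholders):

```yaml
SAMPLE_KEY: "sampleKey.txt"

TOOLS:
  - "manta"
  - "smoove"
  - "delly"

# Reference genome files
REFERENCE_FASTA: "data/ref/human_g1k_v37.fasta.gz"
REF_BUILD: "37"

# Optional parameters
GTF: "data/ref/gencode.v38lift37.annotation.nochr.gtf.gz"
ANNOTATION_BED:
  - "data/annot/enhancers.bed"
SV_PANEL: "data/panels/known_svs.vcf.gz"
TMP_DIR: "/scratch/tmp"
```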

3. Installation

3.1 Requirements:

The mandatory requirements are a Linux environment with Python and Git installed. The pipeline uses Conda environments to centralize tool management, but it can easily be customized to include other tools and methods, including ones not distributed via Anaconda (and derivatives).

This step can be skipped if Anaconda (or Miniconda) is already installed on the system.

# Download Miniconda installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Set permissions to execute
chmod +x Miniconda3-latest-Linux-x86_64.sh

# Execute the installer. Make sure to answer "yes" when asked to add conda to your PATH
./Miniconda3-latest-Linux-x86_64.sh

# Add channels
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda

3.2 Install snakeSV using Bioconda (recommended):

Install snakeSV using Bioconda:

conda install -c bioconda snakesv

Alternatively, install snakeSV in a separate environment (named "snakesv_env") with the command:

conda create -n snakesv_env -c bioconda snakesv
conda activate snakesv_env # Command to activate the environment. To deactivate use "conda deactivate"

After installing, you can test that everything is working by running the pipeline on the example data set included.

# First create a folder to run the test
mkdir snakesv_test
cd snakesv_test

# Run snakeSV using the example data
snakeSV --test_run

3.3 Or clone the snakeSV git repository:

# Clone the repo
git clone https://github.com/RajLabMSSM/snakeSV.git

# Go to the folder created
cd snakeSV

# Test run - Make sure you have snakemake installed in your system (conda install snakemake)
snakemake -s workflow/Snakefile --configfile example/tiny/config.yaml --config workdir="example/tiny/files/" OUT_FOLDER="${PWD}/results_snakesv" --cores 1 --use-conda --use-singularity -p

4. HPC run

From a cloned repository, make a copy of the cluster configuration file:

cp config/cluster_lsf.yaml cluster.yaml

Edit the file with your cluster specifications (threads, partitions, CPU/memory, etc.) for each rule.
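As a rough illustration, a per-rule cluster configuration might look like the fragment below (the keys, rule names, and resource values here are hypothetical; use the fields actually present in config/cluster_lsf.yaml and values suited to your LSF setup):

```yaml
# Hypothetical cluster.yaml fragment: defaults plus a per-rule override
__default__:
  queue: "normal"
  cores: 1
  mem: 4000
manta:
  cores: 8
  mem: 16000
```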

Run snakeSV via the wrapper script (LSF example):

./snakejob -u cluster.yaml -c config/config.yaml