Home
The snakeSV pipeline includes pre- and post-processing steps to handle large-scale studies. The input data for the pipeline consists of a BAM file for each sample, a reference genome file (FASTA), and a configuration file in YAML format. Users can also provide custom annotation files in BED format for SV interpretation, as well as VCF files with structural variants to be genotyped in addition to the discovery set.
The minimal requirements for running snakeSV are shown in the following example configuration file (YAML format):
SAMPLE_KEY: "sampleKey.txt"
TOOLS:
- "manta"
- "smoove"
- "delly"
# Reference genome files
REFERENCE_FASTA: "data/ref/human_g1k_v37.fasta.gz"
REF_BUILD: "37"
SAMPLE_KEY: A file mapping sample IDs to BAM paths (tab-separated). Column names must be included as described below.
participant_id bam
Sample_1 path_to/Sample_1.bam
Sample_2 path_to/Sample_2.bam
Sample_3 path_to/Sample_3.bam
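For cohorts with many samples, the sample key can be generated programmatically. A minimal sketch, assuming the hypothetical layout where all BAM files sit under a path_to/ directory and are named after their participant IDs:
# Hypothetical helper: build sampleKey.txt from a folder of BAM files
printf "participant_id\tbam\n" > sampleKey.txt
for bam in path_to/*.bam; do
  # participant ID taken from the file name, with the .bam suffix stripped
  printf "%s\t%s\n" "$(basename "$bam" .bam)" "$bam" >> sampleKey.txt
done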
TOOLS: A list of tools to run for SV discovery. Each tool must be defined as a Snakemake rule in its own file, where the name of the rule matches the name of the file, as follows: "${SNAKEDIR}/rules/tools/<tool_name>.smk"
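For illustration only, a custom caller could be wired in through a hypothetical rule file "${SNAKEDIR}/rules/tools/mytool.smk". Everything below (tool name, paths, input/output layout) is an assumption; match the actual input/output conventions to the bundled rule files (e.g. manta.smk) before using it:
# Hypothetical sketch of ${SNAKEDIR}/rules/tools/mytool.smk;
# the rule name must match the file name ("mytool")
rule mytool:
    input:
        bam = "path_to/{sample}.bam",
        ref = config["REFERENCE_FASTA"]
    output:
        vcf = "mytool/{sample}.vcf"
    shell:
        "mytool_call --ref {input.ref} --out {output.vcf} {input.bam}"
The tool would then be selected by adding "mytool" to the TOOLS list in the configuration file.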
REFERENCE_FASTA: A reference genome fasta (e.g. human_g1k_v37.fasta)
REF_BUILD: The reference build (e.g. 37 or 38)
GTF: A GENCODE GTF file used to annotate SV consequences (e.g. gencode.v38lift37.annotation.nochr.gtf.gz). Based on these genic annotations, the impact of each SV can be classified as Loss of Function (LOF), Intragenic Exonic Duplication (DUP_LOF), Whole-Gene Copy Gain (COPY_GAIN), or Whole-Gene Inversion (INV_SPAN), as described by the gnomAD-SV consortium.
ANNOTATION_BED: A list of paths to BED files with custom annotations to be used when annotating SVs. Each BED file requires four columns (chr, start, end, element_name). SVs overlapping coordinates described in any BED file will be flagged by the element_name appended to the INFO field.
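For example, a custom annotation BED file might look like this (tab-separated, no header; the element names are hypothetical, and chromosome naming must match the reference, e.g. no "chr" prefix for build 37):
1	100000	100500	enhancer_1
1	230000	231000	promoter_1
2	5000000	5000500	open_chromatin_1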
SV_PANEL: Path to a VCF file with SVs to be included in the genotyping step in addition to the discovered SVs.
TMP_DIR: Path to a temporary folder to be used by tools, if different from the default /tmp.
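Putting it together, a configuration using the optional keys could look like the sketch below (all keys come from the list above; the file paths are placeholders):
SAMPLE_KEY: "sampleKey.txt"
TOOLS:
  - "manta"
  - "smoove"
  - "delly"
# Reference genome files
REFERENCE_FASTA: "data/ref/human_g1k_v37.fasta.gz"
REF_BUILD: "37"
# Optional annotation and genotyping inputs
GTF: "data/ref/gencode.v38lift37.annotation.nochr.gtf.gz"
ANNOTATION_BED:
  - "data/annot/enhancers.bed"
  - "data/annot/promoters.bed"
SV_PANEL: "data/panel/extra_svs.vcf.gz"
TMP_DIR: "/scratch/tmp"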
The mandatory requirements are a Linux environment with Python and Git. The pipeline uses Conda environments to centralize all tool management, but it can easily be customized to include different tools and methods, including ones not distributed through Anaconda (and derivative) channels.
This step can be skipped if Anaconda (or Miniconda) is already installed on the system.
# Download Miniconda installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Set permissions to execute
chmod +x Miniconda3-latest-Linux-x86_64.sh
# Execute the installer. Make sure to answer "yes" when asked to add conda to your PATH
./Miniconda3-latest-Linux-x86_64.sh
# Add channels
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
Install snakeSV using Bioconda:
conda install -c bioconda snakesv
Alternatively, install snakeSV in a separate environment (named "snakesv_env") with the command:
conda create -n snakesv_env -c bioconda snakesv
conda activate snakesv_env # Command to activate the environment. To deactivate use "conda deactivate"
After installing, you can check that everything is working by running the pipeline with the bundled example data set.
# First create a folder to run the test
mkdir snakesv_test
cd snakesv_test
# Run snakeSV using the example data.
snakeSV --test_run
Alternatively, run the test directly from a cloned repository:
# Clone the repo
git clone https://github.com/RajLabMSSM/snakeSV.git
# Go to the folder created
cd snakeSV
# Test run - Make sure you have snakemake installed on your system (conda install snakemake)
snakemake -s workflow/Snakefile --configfile example/tiny/config.yaml --config workdir="example/tiny/files/" OUT_FOLDER="${PWD}/results_snakesv" --cores 1 --use-conda --use-singularity -p
From a cloned repository, make a copy of the cluster configuration file:
cp config/cluster_lsf.yaml cluster.yaml
Edit the file with your cluster specifications (threads, partitions, cpu/memory, etc.) for each rule.
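For instance, the edited cluster.yaml could contain entries along these lines (the rule names, keys, and values below are illustrative assumptions; keep the key names already used in config/cluster_lsf.yaml):
# Illustrative sketch of a per-rule cluster configuration
__default__:      # fallback settings for rules without their own entry
  queue: "express"
  cores: 1
  mem: 4000
manta:            # heavier settings for a specific rule
  queue: "premium"
  cores: 8
  mem: 16000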
Run snakeSV via wrapper (LSF example):
./snakejob -u cluster.yaml -c config/config.yaml