Usage examples
snakeSV can leverage SVs discovered from long-read assemblies to help identify SVs in short-read data. In this example, we will perform SV discovery in the Ashkenazi Jewish trio samples (HG002, son; HG003, father; HG004, mother) from the Personal Genome Project. In addition, we will use the diploid assembly for the son, HG002, generated by the Human Genome Structural Variation Consortium (HGSVC). For the purposes of this example, the analysis is restricted to chromosomes 22, X, and Y, but it can easily be expanded to the whole genome.
A supporting script can be used to download and preprocess files required for running this analysis (including reference genomes):
# Make sure snakeSV is installed
conda activate snakesv_env
# Clone snakeSV repo
git clone https://github.com/RajLabMSSM/snakeSV.git
# Download and process short read data from AJ trio (results will be saved in a folder named "data")
sh example/aj_trio/01_prepare_short_read.sh
sh example/aj_trio/02_download_gtf_annotation.sh
Next, run the following script to download the long-read diploid assemblies for HG002 and align them to chromosomes 22, X, and Y of the human reference genome (GRCh37) using minimap2. The script will also discover SVs using svim-asm.
# Download and process long-read assemblies from trio son HG002
sh example/aj_trio/03_prepare_long_read.sh
The resulting VCF file can then be added to the snakeSV config.yaml, as follows:
SV_PANEL:
- "data/sv_panel/hg002/variants.22XY.vcf.gz"
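Before launching the pipeline, it can be worth confirming that the panel VCF actually contains SV records. The following is a minimal sketch, not part of snakeSV: `count_svtypes` is a hypothetical helper, and it assumes each record carries an `SVTYPE=` key in its INFO column. The demo records are made up for illustration.

```shell
# Hypothetical helper (not part of snakeSV): count records per SVTYPE
# in an uncompressed VCF read from stdin. Assumes SVTYPE= is in INFO.
count_svtypes() {
    awk -F'\t' '!/^#/ {
        type = "NA"                       # used when no SVTYPE key is found
        n = split($8, kv, ";")            # INFO is the 8th VCF column
        for (i = 1; i <= n; i++)
            if (kv[i] ~ /^SVTYPE=/) type = substr(kv[i], 8)
        count[type]++
    }
    END { for (t in count) print t "\t" count[t] }' | sort
}

# Made-up records, only to illustrate the helper
{
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '22\t100\ta\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=400\n'
    printf '22\t900\tb\tN\t<INS>\t.\tPASS\tSVTYPE=INS;END=900\n'
    printf '22\t950\tc\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=1200\n'
} > demo_panel.vcf

count_svtypes < demo_panel.vcf
```

On the real panel, the same helper could be fed with `gzip -dc data/sv_panel/hg002/variants.22XY.vcf.gz | count_svtypes`.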
Then, the pipeline can be run as described below. A preconfigured config.yaml file is provided and can be used to run the analysis. Results will be saved in a folder named “results_study_case_1”.
snakeSV --configfile example/aj_trio/study_case_1/config.yaml -pr --cores 1 --use-conda
The genetics of brain diseases and traits often involves tissue- and cell-type-specific effects. Here we show how to add cell-type-specific enhancer information to improve the interpretation of SVs in a brain disease context. In snakeSV, users can supply custom annotations in BED format. We will use H3K27ac peak data to check for overlaps with SV coordinates. The BED files need some manipulation, as svtk requires the following columns: chr, start, end, element_name. This example uses the same files as the first example, so the following chunk of code can be skipped if it has already been executed.
# Make sure snakeSV is installed
conda activate snakesv_env
# Clone snakeSV repo
git clone https://github.com/RajLabMSSM/snakeSV.git
# Download and process short read data from AJ trio (results will be saved in a folder named "data")
sh example/aj_trio/01_prepare_short_read.sh
sh example/aj_trio/02_download_gtf_annotation.sh
In addition, we will obtain cell-type-specific ChIP-seq data from Nott et al. (Science, 2019). Use the supporting script to download and process the data accordingly.
sh example/aj_trio/04_download_custom_annotation.sh
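The supporting script takes care of the processing. Purely to illustrate the four-column layout (chr, start, end, element_name) that svtk expects, here is a minimal sketch for reformatting a peaks file of your own. `to_svtk_bed` is a hypothetical helper, not part of snakeSV, and it assumes a tab-separated input with at least three columns; the demo coordinates are made up.

```shell
# Hypothetical helper (not part of snakeSV): keep chr, start, end and
# label every interval with a single element name.
to_svtk_bed() {
    # $1 = input BED (tab-separated, >= 3 columns), $2 = element name
    awk -v name="$2" 'BEGIN { FS = OFS = "\t" }
        !/^(#|track|browser)/ && NF >= 3 { print $1, $2, $3, name }' "$1"
}

# Made-up peaks with extra columns, only to illustrate the transformation
printf '22\t100\t200\tpeak1\t55\n22\t300\t450\tpeak2\t12\n' > demo_peaks.bed

to_svtk_bed demo_peaks.bed microglia_H3K27ac > demo_peaks.svtk.bed
cat demo_peaks.svtk.bed
```

Using one element name per file means that the label attached to an overlapping SV identifies the cell type the peak came from.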
Next, add the paths to these files into our configuration file:
ANNOTATION_BED:
- "data/annotation/custom/astrocytes_H3K27ac.bed"
- "data/annotation/custom/microglia_H3K27ac.bed"
- "data/annotation/custom/neurons_H3K27ac.bed"
- "data/annotation/custom/oligodendrocytes_H3K27ac.bed"
A preconfigured config.yaml file is provided and can be used to run the analysis. The pipeline can then be run as described below. Results will be saved in a folder named “results_study_case_2”.
snakeSV --configfile example/aj_trio/study_case_2/config.yaml -pr --cores 1 --use-conda
After running the pipeline, the INFO field of the annotated VCF file will include the labels NONCODING_BREAKPOINT and NONCODING_SPAN for any SV that overlaps one of the H3K27ac peaks.
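To pull out the SVs that received one of these labels, the INFO column can be filtered directly. This is a minimal sketch under stated assumptions: `noncoding_hits` is a hypothetical helper, not part of snakeSV, and the demo VCF and its label values are made up. The location of the annotated VCF inside the results folder is not assumed here.

```shell
# Hypothetical helper (not part of snakeSV): print records whose INFO
# column (8th field) carries NONCODING_SPAN or NONCODING_BREAKPOINT.
noncoding_hits() {
    # $1 = uncompressed VCF; decompress a .vcf.gz with `gzip -dc` first
    awk -F'\t' '!/^#/ && $8 ~ /(^|;)NONCODING_(SPAN|BREAKPOINT)(=|;|$)/' "$1"
}

# Made-up annotated records, only to illustrate the filter
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '22\t1000\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=2000;NONCODING_SPAN=microglia_H3K27ac\n'
    printf '22\t5000\tsv2\tN\t<DUP>\t.\tPASS\tSVTYPE=DUP;END=9000\n'
} > demo_annotated.vcf

# Show chromosome, position, ID and INFO for labelled SVs
noncoding_hits demo_annotated.vcf | cut -f1-3,8
```

The pattern matches the label whether it appears as a bare flag or as a key with a value, since the exact encoding may differ between versions.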