JACUSA2 - A framework for the mapping of RNA modifications
We provide a Snakemake pipeline for RNA modification detection using nanopore Direct RNA-seq of multiple conditions and replicates.
- The pipeline is based on JACUSA2 variant calling, where the output of JACUSA2 is analyzed to detect RNA modification patterns and predict modified sites.
- JACUSA2 detects modification based on mapping characteristics, it requires BAM files as input.
- JACUSA2 can be run in single-mode (call-1) or paired mode (call-2). For sensitivity, we adopt paired mode.
- All data used in our manuscript can be found at Zenodo.
We recommend installing software dependencies via Conda
on Linux. You can find Miniconda installation instructions for Linux here.
Make sure you install the Miniconda Python3 distribution.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
In the Conda installer you will have to accept the license and select an installation directory. When asked also select to run 'conda init'. After the installer completed you can open a shell to get the basic Conda setup.
For performance and compatibility reasons you should install Mamba
via conda to install Snakemake. See Snakemake documentation for more details.
conda install -c conda-forge mamba
Once you have installed Conda and Mamba, you can download the Snakemake pipeline.
git clone https://github.com/dieterich-lab/MiMB_JACUSA2_chapter.git
cd MiMB_JACUSA2_chapter/Nanopore_HEK293
Then, install the required packages after creating an environment with the downloaded environment.yml
file.
mamba env create -f environment.yaml
conda activate JACUSA2pipeline_env
Before executing the Snakemake workflow, we recommend downloading JACUSA2 jar file and make sure that you set the path jar
in the config file. In case the jar file is not set, "JACUSA_v2.0.1.jar" will be automatically downloaded and used.
The following protocol describes how to detect m6A modification from nanopore direct RNA-seq data. The benchmark is obtained from PRJEB40872 and is composed of two samples from two conditions: wild-type cells (WT, modified RNAs) and Mettl3 knockout cells (KO, unmodified RNAs) with two replicates (2 and 3). The supervised analysis (pattern learning) and the evaluation are based on m6A sites from miCLIP_union.bed
file. The sites are reported in three miCLIP-based studies Bouliaset al. [2019], Koh et al. [2019], Körtel et al. [2021]. The consensus of the three studies annotated with "Boulias,Koertel,Koh" is used for the training process whereas the test set involves the rest of miCLIP sites and non miCLIP sites. [Following the use case 1 of the manuscript, we limited the analysis to the set of sites reported in 'data/selected_regions.bed' file, which refers to the intersection of samples of use cases 1 and 2 ]
We present hereafter the main steps to produce BAM files which would be the input of our pipeline.
- Base call the ionic current signal stored in FAST5 file using Guppy basecaller.
guppy_basecaller --compress_fastq -i path_to_fast5 -s path_to_output -c config_file.cfg --cpu_threads_per_caller 14 --num_callers 1
The inputs are the path to FAST5 files, the output folder, and the config file or the flowcell/kit combination. The output is FASTQ files that can
be compressed using the option –compress fastq
. Set the number of threads cpu threads per caller
and the number of parallel basecallers num caller
according to your resources. Additional details can be found here.
- Align reads to the transcriptome using the following Minimap2 command. To reduce the indexing time of the human genome, save the index with the option
-d
before the mapping and use the index instead of the reference file in the minimap2 command line.
minimap2 -d reference.mmi reference.fa
minimap2 -t 5 --MD -ax splice --junc-bonus 1 -k14 --secondary=no --junc-bed final_annotation_96.bed -ub reference.mmi Reads.fastq.gz |samtools view -bS > mapping.bam
The inputs are the FASTQ files Reads.fastq.gz
and the reference sequence reference.fasta
. The output is a SAM file that is converted to BAM file mapping.bam
using samtools command. The setting is used for spliced alignments of the direct RNA sequencing :
-ax splice --junc-bed annotation.bed --junc-bonus INT
where annotation.bed is the file containing the splice junctions and INT is the bonus score. The BED file can be generated using the following command:
paftools.js gff2bed annotation.gtf > annotation.bed
-ub
to align reads to both strands of the reference,- small k-mer size
-k [=14]
to enhance sensitivity.
We recommend outputting primary alignments –secondary=no
. –MD
parameter is used to add the reference sequence information to the alignment; this is recommended for the downstream analysis. Check Minimap2 manual for further details.
For the testing example, we consider BAM files of wild-type cells ("HEK293T-WT-rep2.bam","HEK293T-WT-rep3.bam") and knockout cells ("HEK293T-KO-rep2.bam","HEK293T-KO-rep3.bam") from Zenodo as our inputs.
The pipeline is composed of many target rules (fig. 4) and requires setting different parameters.
-
Be aware to set all parameters before running the pipeline. Please put all required data in a folder with the name
data
in the working directory. Here is the config file of the testing example.#label of the analysis label: "HEK293_WT_KO" #path to JACUSA2 JAR file jar: "JACUSA_v2.0.1-RC.jar" # path to the output directory, if it doesn't exist it will be created path_out: "output" # path to the directory containing inputs - all input files are relative to this directory path_inp: "data" # path to the reference squence reference: "data/GRCh38_96.fa" # path to the BED6 file containing known modified sites where 'name' refers to the annotation of the position. Useful for learning patterns (training and test set splitting). modified_sites: "data/miCLIP_union.bed" # path to the file containing size of chromosomes (Chromosome | size ) chr_size: "data/hg38.size" # BED6 file containing set of 5-mer (NNANN) to analyze, if = null, all 5-mers (NNANN) will be considered. regions: "data/selected_regions.bed" # dictionary of two keys (cond1, cond2) referring to the paired conditions inputs. The value is the list of replicates names without ".bam" extension. data: cond1: - HEK293T-WT-rep2 - HEK293T-WT-rep3 cond2: - HEK293T-KO-rep2 - HEK293T-KO-rep3 # dictionary where keys refer to parameters (e.g. [p: 16] to set the number of threads to 16). Please put "null" if no value is assigned to the parameter. We use the following parameters: jacusa_params: # mandatory parameter referring to the library of the first condition sample. P1: "FR-SECONDSTRAND" # mandatory parameter referring to the library of the second condition sample. P2: "FR-SECONDSTRAND" # filter reads by mapping quality m: 1 # filter reads by basecalling quality q: 1 # filter reads by coverage c: 4 # recommended parameter to filter sites within the holy-polymer regions. a: "Y" # customize the number of threads p: 16 # mandatory parameter to output deletion score. D: null # mandatory parameter to output insertion score. I: null # dictionary where you can add java specific parameters as keys e.g. Xmx20g for 20G as the maximum size of the memory allocation. java_params: null # specify patterns and their combinations to be used, please set to "null" if no value is affected to the field. pattern_params: # specify the annotation of the set of modified sites to be used as a training set. in case you use an external pattern put "null". internal_pattern: "Boulias,Koertel,Koh" # path to an external pattern in case you don't use internal_pattern, else put "null" external_pattern: null # patterns to combine, add as many combinations as you want as a [key(any name): value (pattern numbers)] combination. combined_patterns: pt1: [1, 2, 4, 6] pt2: [1, 2, 3, 4, 6]
Please check JACUSA2 manual for more details on how to use JACUSA2 parameters.
- Run
JACUSA2_call2
target to call variants using paired conditions.
$ snakemake --cores all jacusa2_call2
The output is a file called Cond1vsCond2Call2.out
under ./output/{label}/jacusa
and filtered BAM files under ./output/{label}/bam
.
- Run
get_features
target to preprocess JACUSA2 call2 output and extract features.
$ snakemake --cores all get_features
The output is an R object features.rds
under ./output/{label}/features/
.
- Run
get_pattern
target to learn patterns representing m6A modification.
$ snakemake --cores all get_pattern
The output is an R object NMF.rds
containing the factorization result, including basis and coefficient matrices, plus, plots showing the rank selection result, pattern scoring barplot and heatmap of NMF resulting matrices. The output is under ./output/{label}/pattern/
. Implicitly, training and test set files (resp. train_features.rds
, test_features.rds
) are created under features
folder and, subsequently, used for the learning model.
For the testing example, train_features.rds
is supposed to contain 1905 sites. The scoring of patterns will look like the following barplot.
Figure 1: Membership score of resulted patterns
- Run
visualize_pattern
target to visualize patterns and their combinations.
$ snakemake --cores all visualize_pattern
The output is a set of figures representing barplots for the produced patterns. The outputs can be found under ./output/{label}/pattern/viz/
.
For the testing example, the combination of patterns representing more than 80% of the training set will look like the following barplot:
Figure 2: Combination of patterns representing 80% of the training set
- Run
predict_modification
target to predict modified sites
$ snakemake --cores all predict_modification
The output is a BED6 file(s) containing scores of the selected pattern(s) for the test set under ./output/{label}/prediction/
and the corresponding eCDF (empirical cumulative distribution) and PPV (positive predictive values) plots.
For the testing example, the eCDF will look like the following figure:
Figure 3: eCDF of estimated scores of sites from diffrent miCLIP categories and non miCLIP sites
Note that rules are linked so that the workflow is determined from top (e.g. predict modification) to bottom (e.g. sort bam) and executed accordingly from bottom to top. Therefore, running ”predict_modification” rule leads to executing all rules on its pipeline. Check Snakemake documentation for further details.
Figure 4: Snakemake JACUSA2-based pipeline
Once the pipeline has run successfully you should expect the following files in the output directory:
bam/
: This will be skipped if you start from JACUSA2 output{mapping BAM file name}.sorted.bam
- sorted BAM file{mapping BAM file name}.filtered.bam
- filtered BAM file{mapping BAM file name}.filtered.bam.bai
- index of the filtered BAM file
jacusa/
:Cond1vsCond2Call2.out
- JACUSA2 call-2 output
features/
:features.rds
- tabular features (target sites (here, 'A') x 15 features)train_features.rds
- training set to extract RNA modification patternstest_features.rds
- test set
patterns/
:-
NMF.rds
- NMF factorization (R object) -
asses_NMF_1.pdf
- NMF rank survey -
asses_NMF_2.pdf
- silhouette + cophenetic correlation result -
pattern_scores.pdf
- barplot of the membership score of patterns based on basis matrix -
NMF_matrices.pdf
- heatmaps for basis and coefficient matrices of the NMF result -
viz/
:pattern_{pattern number}_barplot_NMF.pdf
- barplot of patterns (by position and score type) from coefficient matrix
-
prediction/
:pattern{pattern number}_prediction.bed
- estimated scores from selected patternspattern{pattern number}_ppv.pdf
- PPV plotpattern{pattern number}_ecdf.pdf
- eCDF plot
All dependencies and packages are added to the environment.yaml
file.
Software | Version |
---|---|
Minimap2 | 2.22 |
samtools | 1.12 |
openjdk | 11.0.13 |
R | 4.0.5 |
PERL | 5.28.1 |
bedtools | 2.29.2 |
snakemake | 6.8.1 |
R Package | Version |
---|---|
ggplot2 | 3.3.5 |
NMF | 0.23.0 |
pROC | 1.18.0 |
stringr | 1.4.0 |