ScNapBar (single-cell Nanopore barcode demultiplexer) is a workflow to assign barcodes to long-read single-cell sequencing data. ScNapBar enables cell barcode assignment with high accuracy using unique molecular identifiers (UMIs) or a naive Bayes probabilistic approach. It requires BAM files from both Nanopore and Illumina reads, then builds a model based on the parameters estimated from the two libraries.
If you use ScNapBar, please cite the following paper:
Wang Q, Boenigk S, Boehm V, Gehring NH, Altmueller J, Dieterich C. Single cell transcriptome sequencing on the Nanopore platform with ScNapBar. RNA. 2021 Apr 27;27(7):763–70. doi:10.1261/rna.078154.120.
- Software dependencies are managed using conda; for more information see https://docs.conda.io/projects/conda/en/latest/user-guide/install/.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda config --set auto_activate_base false
- Install scNapBar.
# Clone the repository; add --recursive to include the SeqAn submodule.
git clone --recursive https://github.com/dieterich-lab/single-cell-nanopore.git
cd single-cell-nanopore
# Create environment...
conda env create --name scNapBar --file environment.yaml
conda activate scNapBar
cmake .
make
- Test scNapBar using the example data under data. Download the reference genome and place the file under data. Edit the provided config.yaml file by changing the paths to the output and temporary directories, as well as the name of the reference genome. Run the snakemake command under the conda environment. Use the -j parameter to specify the number of available cores.
snakemake -j 12 --printshellcmds --verbose
You can also submit the job via a job scheduler. We provide an example using SLURM. Adjust the cluster.json file, or use your own snakemake profile.
snakemake -j 12 --until run_umi_seq --printshellcmds --verbose --cluster-config cluster.json --cluster "sbatch -A {cluster.account} --mem={cluster.mem} -t {cluster.time} -c {cluster.threads} -p {cluster.partition}"
- scNapBar general usage. Edit the provided config.yaml file to match your own sequence files, reference genome, annotations, etc. Update the adapter and polyT length to fit your libraries. Run the snakemake command under the conda environment.
- ScNapBar (option 1, default) uses a probabilistic model for barcode assignment, which performs very well in cases of low sequencing saturation.
- ScNapBar (option 2) assigns barcodes based on matched Illumina UMIs without additional probabilistic modeling. Use the following command to start this mode:
snakemake --until run_umi_seq
For testing purposes, we suggest downloading GSE130708 or PRJNA722142 (we used chr17 as an example dataset in the data folder, see Quick run).
We use the following naming convention:
- possorted_genome_bam.bam from Cell Ranger, or barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz from the filtered_feature_bc_matrix folder of Cell Ranger.
- Background cell barcodes renamed as barcodes_raw.tsv.gz, from the barcodes.tsv.gz of the raw_feature_bc_matrix folder of Cell Ranger.
- Nanopore reads (Nanopore.fq.gz) in FASTQ (compressed) format.
- Reference genome in FASTA format.
- Annotation file in refFlat format (for option 2). Note: The file must have the .refFlat extension. If running on the example data, the file must first be uncompressed. See also the UCSC genome annotation database.
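Before launching the pipeline, it can help to check that the inputs follow the naming convention above. The following is a minimal sketch and not part of scNapBar: the helper name missing_inputs is hypothetical, it assumes that every file sits directly under the data folder, and it treats all listed inputs as required (with the annotation checked only when one is given).

```python
from pathlib import Path

# File names taken from the naming convention above.
ILLUMINA_BAM = "possorted_genome_bam.bam"
ILLUMINA_MATRIX = ("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")
BACKGROUND_BARCODES = "barcodes_raw.tsv.gz"


def missing_inputs(data_dir, nanopore_fq, reference_genome, annotation=None):
    """Return a list of problems found in data_dir (empty list means OK)."""
    data = Path(data_dir)
    problems = []

    # Either the Cell Ranger BAM or the three filtered-matrix files are needed.
    has_bam = (data / ILLUMINA_BAM).is_file()
    has_matrix = all((data / name).is_file() for name in ILLUMINA_MATRIX)
    if not (has_bam or has_matrix):
        problems.append("need " + ILLUMINA_BAM + " or " + ", ".join(ILLUMINA_MATRIX))

    for name in (BACKGROUND_BARCODES, nanopore_fq, reference_genome):
        if not (data / name).is_file():
            problems.append("missing " + name)

    # The annotation is only used for option 2 and must end in .refFlat.
    if annotation is not None:
        if not annotation.endswith(".refFlat"):
            problems.append(annotation + " must have the .refFlat extension")
        elif not (data / annotation).is_file():
            problems.append("missing " + annotation)
    return problems
```

A call such as missing_inputs("data", "Nanopore.fq.gz", "reference.fa", "annotation.refFlat") returns an empty list when everything is in place; the reference and annotation file names here are placeholders for your own files.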
The output files are written to the results folder (this directory must exist before running the pipeline), or any folder specified by dir_out in the config.yaml file. Example output files from the quick run using the example data are provided under analysis. We also included SRSF2 isoforms characterized by barcodes from the published manuscript (SRSF2.GFP+.bam and SRSF2.GFP-.bam).
The main target output files are real.label (option 1) or real.umi (option 2):
- real.label: read_id, barcode, score. The barcode assignment of the real Nanopore reads with scores. The scores range from the cutoff set in config.yaml to 99. Reads assigned to multiple barcodes are removed if more than one of the assignments is above the score cutoff.
- real.umi: The barcode assignment of the Nanopore reads with matched Illumina UMIs from the same cell and the same gene.
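To make the real.label filtering rule concrete, here is a small sketch of the behaviour described above: candidate assignments below the cutoff are dropped, and a read is discarded when more than one barcode remains. This is illustrative Python, not the pipeline's own code. The function name, the in-memory (read_id, barcode, score) tuples, the made-up barcode sequences, and the choice that a score equal to the cutoff still passes are all assumptions.

```python
from collections import defaultdict


def filter_assignments(candidates, cutoff):
    """Keep reads with exactly one barcode scoring at or above the cutoff.

    candidates: iterable of (read_id, barcode, score) tuples.
    Returns a dict mapping read_id to (barcode, score).
    """
    passing = defaultdict(list)
    for read_id, barcode, score in candidates:
        # Assumption: a score equal to the cutoff still passes.
        if score >= cutoff:
            passing[read_id].append((barcode, score))

    # Reads with several passing barcodes are ambiguous and removed.
    return {read: hits[0] for read, hits in passing.items() if len(hits) == 1}


# Made-up candidates for three reads.
candidates = [
    ("read1", "AAACCCAAGAAACACT", 97),
    ("read2", "AAACCCAAGAAACCAT", 80),
    ("read2", "AAACCCAAGAAACCCA", 75),  # second barcode above the cutoff
    ("read3", "AAACCCAAGAAACCCG", 30),  # below the cutoff
]
print(filter_assignments(candidates, cutoff=50))
```

With a cutoff of 50, only read1 is kept: read2 has two barcodes above the cutoff and read3 has none.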
Other files include:
- Nanopore.bam: The mapping of the real Nanopore reads generated by minimap2. Note: the name of this file is given by the basename of the value of the config key nanopore_fq.
- sim.prob and real.prob: The complete feature tables used to generate the probability scores of the simulated reads and the real Nanopore reads, respectively (option 1).
- sim.label: The barcode assignment of the simulated reads with scores. The scores range from 0 to 99, and larger scores indicate higher confidence in the assignment. For reads assigned to multiple barcodes, only the assignment with the highest score is retained (option 1).
- sim_barcodes.txt: The ground truth of cell barcodes in the simulated reads (option 1).
- sim.model.rda: The naive Bayes model trained on the simulated reads (option 1).
- genome.fa: The artificial genome used for generating the simulated reads (option 1).
- sim_umi.fasta and real_umi.fasta: The remaining DNA sequences after removing the adapter and barcode sequences.
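Because the simulated reads come with a ground truth, the sim.label scores can be used to judge how a given cutoff would behave. The sketch below is a hedged illustration only: it works on in-memory data rather than on the actual sim.label and sim_barcodes.txt files (whose on-disk layout is not described here), and the function names and toy barcodes are hypothetical.

```python
def best_assignment(candidates):
    """Keep, for each read, only the barcode with the highest score."""
    best = {}
    for read_id, barcode, score in candidates:
        if read_id not in best or score > best[read_id][1]:
            best[read_id] = (barcode, score)
    return best


def precision_at(best, truth, cutoff):
    """Fraction of assignments with score >= cutoff that match the truth.

    Returns None when no assignment passes the cutoff.
    """
    kept = [(read, bc) for read, (bc, score) in best.items() if score >= cutoff]
    if not kept:
        return None
    correct = sum(1 for read, bc in kept if truth.get(read) == bc)
    return correct / len(kept)


# Toy data: (read_id, barcode, score) candidates and the true barcodes.
sim = [("r1", "AAAA", 90), ("r1", "AAAT", 40), ("r2", "CCCC", 60), ("r3", "GGGG", 20)]
truth = {"r1": "AAAA", "r2": "CCCA", "r3": "GGGG"}

best = best_assignment(sim)
print(precision_at(best, truth, cutoff=50))
```

Repeating the last call over a range of cutoffs gives a simple precision curve from which a value for the cutoff parameter could be chosen.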
The parameters are set in the config.yaml file. If an entry is a file, it must be placed under the data folder.
- reference_genome: The reference genome sequences in FASTA format.
- nanopore_fq: Nanopore reads you want to process in FASTQ (compressed) format.
- adapter: 10x Genomics P1 adapter sequences.
- polyTlength: Number of poly-Ts you want to simulate.
- cdnalength: Number of nucleotides appended to each entry in the artificial genome.
- umilength: Number of nucleotides of the UMI sequences.
- barcodelength: Number of nucleotides of the cell barcode sequences.
- numSimReads: Number of Nanopore reads to simulate.
- numSimReads: Number of Illumina reads to sample from the Illumina sequencing.
- cutoff: Score cutoff for the barcode assignment of the real Nanopore reads.
- percent_raw: A fraction representing the percentage of additional simulated reads you want to use as true negatives. These reads contain cell barcodes from the background rather than from the whitelist. In our experience, about 20% of the reads in a 10x Genomics library do not contain a cell barcode from the whitelist.
- threads: Number of CPUs for the multi-threaded jobs. Note: See Control the number of cores/threads per rule.
- cdnaseq: DNA sequence appended to each entry in the artificial genome.
- nano_seed: Seed for the pseudo-random number generator (NanoSim).
singleCellPipe, which performs the adapter and barcode alignments, is modified from flexbar. Most of the flexbar parameters are available in this program, but a few parameters are specific to it:
- -ul: Number of nucleotides of the barcode and UMI sequences combined. E.g., the parameter should be 26 for a library with a 16 bp barcode and a 10 bp UMI.
- -kb: Number of additional nucleotides to search after the adapter alignment.
- -fl: Number of nucleotides from both ends for searching the barcode sequences.
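The -ul value is simply the sum of the barcode and UMI lengths. A minimal sketch, assuming a dictionary that uses the barcodelength and umilength keys from config.yaml and a library in which the barcode is immediately followed by the UMI; the function names are hypothetical and the snippet does not call singleCellPipe.

```python
def barcode_umi_length(config):
    """Return the combined barcode + UMI length, i.e. the -ul value."""
    return int(config["barcodelength"]) + int(config["umilength"])


def split_barcode_umi(sequence, config):
    """Split the start of a sequence into (barcode, UMI).

    Assumes the adapter has already been removed, so the sequence starts
    with the barcode, directly followed by the UMI.
    """
    bc_len = int(config["barcodelength"])
    total = barcode_umi_length(config)
    return sequence[:bc_len], sequence[bc_len:total]


# Example from the text: 16 bp barcode and 10 bp UMI.
config = {"barcodelength": 16, "umilength": 10}
print(barcode_umi_length(config))  # 26

# Made-up sequence: 16 x A (barcode), 10 x C (UMI), then poly-T.
print(split_barcode_umi("A" * 16 + "C" * 10 + "TTTT", config))
```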
In this section, we explain the purpose of each major job in the pipeline.
- build_illumina: If an Illumina BAM file is not provided, we use the filtered cellular barcodes (barcodes.tsv.gz) detected by the Cell Ranger pipeline and produce an Illumina BAM file based on the frequencies in the matrices.
- find_dist: Add more background cellular barcodes to the cell barcode whitelist, to make sure we do not align a read to a sub-optimal barcode because the real barcode sequence is absent. It retrieves all other barcodes within an edit distance of two from the filtered cellular barcodes.
- get_cbfreq: Use the read counts for each barcode as prior knowledge in the Bayesian model.
- align_longreads: We align the Nanopore reads to the reference genome using Minimap2 with long-read settings.
- build_genome: We generate an artificial "genome" that contains only the cDNA primer from 10x Chromium Single Cell V3, followed by cellular barcode and UMI sequences with the same counts as in the Illumina library, a 20 bp oligo-dT, and 32 bp cDNA sequences, in order to estimate the likelihood of barcode mismatches and indels in our model.
- build_nanosim: The Nanopore error profile is produced using read_analysis.py from NanoSim, which creates a directory under the analysis folder that contains the error profile of the Nanopore reads.
- sim_reads: We generate a number of Nanopore reads with NanoSim, based on the artificial genome built previously.
- build_test: The generated Nanopore reads are trimmed to 100 bp and written to a BAM file. The ground-truth barcode sequences are obtained from the corresponding genomic locations in the artificial genome.
- run_pipe_sim and run_pipe_real: Run the feature extraction pipeline on the BAM files of the simulated and the real Nanopore reads, respectively.
- add_label: By comparing to the ground-truth barcode sequences, we assign a label of 0 or 1 to each barcode alignment, indicating whether the corresponding alignment is correct.
- build_model: Build a naive Bayes model based on the labels and the previously extracted barcode alignment features.
- pred: Use the naive Bayes model to predict the likelihood given the alignment features of other Nanopore reads sequenced from the same cDNA library. We then use Bayes' theorem to calculate the posterior probability that a barcode alignment is correct among all potential barcodes. The predicted probabilities allow us to benchmark the predictions on the other simulated reads, or to assign barcodes to the real Nanopore reads.
- get_gene_umi: Create a UMI whitelist for each gene based on the Illumina data.
- run_umi_sim: Perform UMI alignment against the UMI whitelist of the same gene, using the sequences from which the barcode has been trimmed.
- filter_sim and filter_pred: Output the reads that pass the cutoff on the predicted probabilities.
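The get_cbfreq and pred jobs combine a per-barcode prior with the likelihood of each barcode alignment. The sketch below shows only that final Bayes' theorem step on made-up numbers. In the pipeline the likelihoods come from the trained naive Bayes model; here they are plain inputs. The function name, the prior taken as proportional to the read counts of the candidate barcodes, and the toy barcodes are assumptions for illustration, not the pipeline's implementation.

```python
def posterior_probabilities(likelihoods, read_counts):
    """Posterior P(barcode | alignment) over the candidate barcodes of one read.

    likelihoods: dict barcode -> P(alignment features | barcode is correct)
    read_counts: dict barcode -> Illumina read count, used as the prior
    """
    total_counts = sum(read_counts[bc] for bc in likelihoods)
    # Prior proportional to how often each candidate barcode was seen.
    priors = {bc: read_counts[bc] / total_counts for bc in likelihoods}

    # Bayes' theorem: posterior = likelihood * prior / evidence.
    joint = {bc: likelihoods[bc] * priors[bc] for bc in likelihoods}
    evidence = sum(joint.values())
    return {bc: joint[bc] / evidence for bc in joint}


# Two made-up candidate barcodes for a single read.
likelihoods = {"AAACCCAAGAAACACT": 0.6, "AAACCCAAGAAACCAT": 0.3}
read_counts = {"AAACCCAAGAAACACT": 100, "AAACCCAAGAAACCAT": 300}
post = posterior_probabilities(likelihoods, read_counts)
print(post)
```

In this example the second barcode ends up with the larger posterior even though its alignment likelihood is lower, because it is three times more abundant in the Illumina data. The sketch does not guard against all likelihoods or counts being zero, which would need handling in real use.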
Q: fatal error: tbb/pipeline.h: No such file or directory when compiling singleCellPipe.
A: Please run conda install [--name scNapBar -c conda-forge] tbb=2020.3 tbb-devel=2020.3 to install the required TBB libraries.
Q: fatal error: seqan/basic.h: No such file or directory when compiling singleCellPipe.
A: Please download SeqAn first and move the SeqAn include folder to seqan, or make sure you clone the repository with the --recursive flag. If you forgot this flag, you can always run git submodule update --init afterwards.
- Qi Wang <[email protected]>
- Sven Bönigk <[email protected]>
- Christoph Dieterich
Currently maintained by Etienne Boileau <[email protected]>