The SAIGA pipeline was designed for processing and analyzing Oxford Nanopore Technologies (ONT) MinION DNA-barcoding sequence data.
SAIGA was developed for and accompanies the paper: Seah, A.; Lim, M.C.; McAloose, D.; Prost, S.; Seimon, T.A. MinION-Based DNA Barcoding of Preserved and Non-Invasively Collected Wildlife Samples. Genes 2020, 11, 445.
Our pipeline is called SAIGA to help bring awareness to conservation efforts for the Critically Endangered saiga antelope (Saiga tatarica). Saiga populations are threatened by habitat loss and fragmentation, illegal wildlife trade (meat, horn), and disease. Consequently, in addition to population crashes, the sex ratio has become skewed due to preferential hunting of males for their horns. Saiga horn is the primary component of several Traditional Chinese Medicines that claim to detox the body and treat colds and lung disease; however, saiga horn has no scientifically recognized medicinal value.1,2 As of 2019, the CITES Conference of the Parties voted to keep saiga on the CITES Appendix II list, but there is now a zero quota for international trade in wild saiga horn.3
1. https://china.wcs.org/Wildlife/Saiga-Antelope.aspx
2. https://medium.com/@WCS/saving-the-saiga-6829a429abeb
3. https://newsroom.wcs.org/News-Releases/articleType/ArticleView/articleId/12926/BREAKING-NEWS-From-CITES-CoP18-on-Saiga-Antelope.aspx
SAIGA logo by Natalie Tam
- Getting started
- Software dependencies
- Formatting input files
- Picking parameters
- Usage: Run SAIGA!
- SAIGA output files
- Software installation advice
Update Jan 2022: Thanks for stopping by! The SAIGA pipeline is no longer in active development, so there may be software dependency updates that require refinement of the parser scripts (see issues). For a similar clustering pipeline, please check out NGSpeciesID, which grew from the development of SAIGA.
- Download this Github repository using the 'Clone or Download' green button at the top of the Github page.
- Set up pipeline directories by running the following command within the downloaded repository.
python setuppipe.py
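For context, the setup step is roughly equivalent to creating the pipeline directories by hand; a minimal sketch using only directory names mentioned in this README (setuppipe.py may create others):
mkdir -p 0_MinKNOW_rawdata 1_basecalled 2a_samp_lists Blast_resources FinalResults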
- Install software. In the Software installation advice section below, we provide tips on installing software; however, you should consult the software links for specific installation instructions.
- Guppy v3.1.5+781ed575
- NanoPlot v1.21.0
- NanoFilt v2.5.0
- qcat v1.1.0
- MiniBar v0.21
- seqtk v1.3-r106
- isONclust v0.0.6
- spoa v3.0.1
- cd-hit-est v4.8.1
- medaka v0.10.0
- NCBI blast v2.8.1+
SAIGA scripts were written in Python3 and tested on macOS.
On Ubuntu, you will likely need to edit the conda path to use medaka; see the path notes in the medaka script.
See Demo/ for example input files.
- Move the MinKNOW folder containing fast5 files into the 0_MinKNOW_rawdata folder. The fast5 files for a given dataset should be in their own folder within the 0_MinKNOW_rawdata folder.
- Create a sample list text file and save it in the 2a_samp_lists folder. The file must be named [year][month][day]_sample_list.txt, where the date is the day of the sequencing run (you could name the section in [] whatever you want, but SAIGA scripts use this label to keep track of files from different sequencing runs). The file is tab-delimited. It requires the sample name, barcoding gene, amplicon length, and index (ONT or custom), one line per sample (see the example line below).
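For illustration, a hypothetical sample_list.txt line (tab-delimited columns: sample name, barcoding gene, amplicon length, index name; the values are made up):
SampleA	16s	600	NB01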
- If you're using MiniBar to demultiplex reads, create a primer and index sequence file and save it in the 2a_samp_lists folder. The file must be named [year][month][day]_primerindex.txt, where the date info is the same as above (this label must match!). The file is tab-delimited. It requires the sample name (same as above), barcoding primer name, barcoding primer sequences, and index sequences (including reverse complement sequences for both). See the MiniBar package website for more details. I have a script (Pipeline_scripts/revcomp.py) that generates reverse complement sequences for given input sequences (all_primerindex_seqs.csv), formatted like this:
MarkerName,MarkerSeq,Type
16s_FSn,AYHAGACSAGAAGACCC,primer
NB01,CACAAAGACACCGACAACTTTCTT,ONTindex
The _primerindex.txt file currently has to be created manually.
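If you just need the reverse complement of a single sequence, a bash one-liner along these lines also works (IUPAC codes S, W, and N are their own complements, so they pass through tr unchanged):
echo "CACAAAGACACCGACAACTTTCTT" | rev | tr "ACGTRYKMBVDH" "TGCAYRMKVBHD"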
- Create a fasta file with your Sanger or other reference sequences, save it in the Blast_resources folder. This file is used during the Blast step. The reliability of the Blast species identification depends on the quality of this reference database - curate carefully! Each sequence header should include the sample name, species identifier, and barcoding gene, separated by spaces.
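A hypothetical reference entry (the sequence is a placeholder; the header fields are sample name, species identifier, and barcoding gene):
>SampleA Saiga_tatarica 16s
ATGGCATACCCATACTACTAGGACTATGCC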
The pipeline is written to allow you to run different steps without rerunning all analyses over and over. The flags/options described below should be edited in the run_saiga.sh script. The top lines of this script run the demo dataset.
The first time the pipeline is run on a given dataset, you'll need to analyze the raw basecalled files with --rawNP, demultiplex samples with --demultgo, and filter reads with --filt.
SAIGA was written to implement demultiplexing with either qcat or MiniBar. qcat can (currently) only demultiplex ONT index kits, while MiniBar can demultiplex both ONT and custom indexes. To run qcat, you need the following flags: --demultgo y --demult qcat --qcat_minscore [ ] --ONTbarcodekit [ ]. To run MiniBar, you need the following flags: --demultgo y --demult minibar --mbseqs [ ] --mb_idx_dist [ ] --mb_pr_dist [ ].
The pipeline filters by read quality with a Phred score threshold and by read length with a buffer you set around the amplicon length provided in your input sample_list.txt file: --filt y --qs [ ] --buffer [ ]. For example, with a 600 bp amplicon and --buffer 100, reads between 500 and 700 bp are kept.
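Under the hood, filtering is done with NanoFilt; for the 600 bp example above, the equivalent standalone command would look roughly like this (file names are illustrative):
NanoFilt -q 7 -l 500 --maxlength 700 < sample_demultiplexed.fastq > sample_filtered.fastq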
Next, you can choose to analyze the full demultiplexed and filtered dataset (--subgo y --subset none) or subsets (--subgo y --subset 500 --subseed 100 for a subset of 500 reads drawn with random seed 100) for each sample.
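Subsetting is done with seqtk (see --subseed); the standalone equivalent for the example above is roughly (file names are illustrative):
seqtk sample -s100 sample_filtered.fastq 500 > sample_sub500.fastq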
To complete the pipeline analysis, use the --clust flag. You need to specify the demultiplexer (--demult), the subset (--subset), a threshold for the minimum percent of reads per isONclust cluster (--perthresh), a minimum cd-hit-est cluster similarity threshold (--cdhitsim), and a fasta file with sequences you'd like to compare consensus sequences to (--db): --clust y --demult [ ] --subset [ ] --perthresh [ ] --cdhitsim [ ] --db [ ].
You can run the pipeline steps in pieces - just set the option flags to n for any steps you don't want to re-run.
Example commands:
# Run all steps of pipeline at once
python saiga_wrapper.py --datID demo --samps demo_sample_list.txt --rawNP y --demultgo y --filt y --subgo y --clust y --demult qcat --qcat_minscore 99 --ONTbarcodekit PBC001 --qs 7 --buffer 100 --subset 50 --subseed 100 --perthresh 0.1 --cdhitsim 0.8 --db demo.fasta
# Run only through filtering
python saiga_wrapper.py --datID demo --samps demo_sample_list.txt --rawNP y --demultgo y --filt y --subgo n --clust n --demult minibar --mbseqs demo_primerindex.txt --mb_idx_dist 2 --mb_pr_dist 11 --qs 7 --buffer 100
# Run only the subsetting and clustering steps (assuming the previous steps had already been run)
python saiga_wrapper.py --datID demo --samps demo_sample_list.txt --rawNP n --demultgo n --filt n --subgo y --clust y --demult minibar --subset 50 --subseed 100 --perthresh 0.1 --cdhitsim 0.8 --db demo.fasta
REQUIRED flags:
Flag | Description |
---|---|
--datID | dataset identifier; typically the [year][month][day] of the sequencing run (e.g., 20190906 for Sept 6, 2019); must match the label used for the sample_list.txt and primerindex.txt files |
--samps | tab-delimited text file of sample name, barcoding gene, amplicon length, and index name (e.g., 20190906_sample_list.txt) |
--rawNP | Option to generate NanoPlots for raw reads. Options: y, n |
--demultgo | Option to demultiplex reads. Options: y, n. MiniBar requires --demult, --mbseqs, --mb_idx_dist, --mb_pr_dist. Qcat requires --demult, --qcat_minscore, --ONTbarcodekit flags |
--filt | Option to filter demultiplexed reads. Options: y, n. Requires --qs, --buffer flags |
--subgo | Option to make random data subsets. Options: y, n. Requires --subset, --subseed flags |
--clust | Option to cluster and Blast. Options: y, n. Requires --demult, --subset, --subseed, --perthresh, --cdhitsim, --db flags |
Additional flags:
Flag | Description |
---|---|
--demult | Options: qcat, minibar |
--mbseqs | For MiniBar demultiplexing, input primer and index seqs file (e.g., 20190906_primerindex.txt) |
--mb_idx_dist | MiniBar index edit distance (e.g., 2) |
--mb_pr_dist | MiniBar primer edit distance (e.g., 11) |
--qcat_minscore | qcat minimum alignment score (0-100 scale) |
--ONTbarcodekit | ONT barcode kit (e.g., PBC001) |
--qs | Phred quality score threshold to filter reads by |
--buffer | Buffer length +/- amplicon length to filter reads by |
--subset | Options: none OR integer subset of reads to be randomly selected (e.g., 500) |
--subseed | seed number to create random data subsets with seqtk |
--perthresh | Percent read threshold for keeping isONclust clusters (e.g., 0.1 for keeping clusters with >= 10% of reads) |
--cdhitsim | Sequence similarity threshold for cd-hit-est to cluster spoa consensus sequences by (e.g., 0.8 for clustering spoa consensus with at least 80% similarity) |
--db | Blast reference database fasta file |
Note: The demo files have already been basecalled.
- Go to Demo/ for input files. The demo files need to be moved to the correct pipeline directories:
  - Download the demo files
  - Move the entire demo_guppybasecallouts directory to 1_basecalled
    - Unzip the fastq files. In a bash terminal, you can enter: for file in *.fastq.gz; do gunzip $file; done
  - Move demo_primerindex.txt and demo_sample_list.txt to 2a_samp_lists
  - Move demo.fasta to Blast_resources
- Run the pipeline:
  - Open the run_saiga.sh script. Edit the python command as needed for script options. Demo dataset commands are at the top.
  - Run bash run_saiga.sh in a terminal
- Check the output results:
  - Outputs from the Blast search are in FinalResults/ (allsamps_parsedout.txt files)
- Add your data input files (see Formatting input files).
- Basecall MinKNOW fast5 files with the command below. Note that the directory paths are separated with spaces! The guppy_basecalling_wrapper.sh script requires the path to the fast5 files, the output path, and the directory path for SAIGA (this should be ./ since commands are executed from within the SAIGA directory!).
Example:
bash ./Pipeline_scripts/guppy_basecalling_wrapper.sh 0_MinKNOW_rawdata/yourdataset_minknow/fast5 yourdataset_guppybasecallouts ./
- Run the rest of the pipeline. You can simply run the pipeline with python saiga_wrapper.py [flags], but if you have many datasets or analyses to run, you can write all the python commands into run_saiga.sh and run them all with bash run_saiga.sh.
  - Edit run_saiga.sh: This script runs through all the pipeline steps for multiple datasets and/or analyses. Comment out the demo commands (add '#' to the beginning of those lines) so they aren't run.
  - Then edit the python command flags with the appropriate dataset name, files, and parameter values. Set each of the following flags to y or n to tell the pipeline whether to:
    - --rawNP: concatenate raw basecalled fastq files, output NanoPlot stats
    - --demultgo: demultiplex reads
    - --filt: filter reads
    - --subgo: create subsets
    - --clust: cluster reads, generate consensus sequences, Blast against the reference database
  - If you've already run the earlier steps and want to rerun the later ones, just switch the earlier flags to n. However, all steps have to be run at least once!
  - Run bash run_saiga.sh
- Check your results in FinalResults/. 🎉 👏
Brief descriptions of output files in SAIGA directories.
Outputs generated by the --rawNP
option:
- _guppybasecallouts/ contains guppy basecalled fastq and log files, as well as a concatenated fastq file.
- _raw_NanoPlots/ contains NanoPlots for the raw basecalled data, including figures and stats related to the overall sequence run. See the NanoPlot Github for more details on outputs.
Outputs generated by the --demultgo, --filt, and --subgo options:
- minibar_demultiplexouts/ contains MiniBar demultiplexed read outputs
- qcat_demultiplexouts/ contains qcat demultiplexed read outputs
- Both directories contain:
  - .fastq demultiplexed read files
  - _filtered.fastq filtered demultiplexed read files
  - .log files from NanoFilt after read filtering
  - sub/ directories containing filtered read subsets per sample, and _filtered_NanoPlots/ directories containing NanoPlots for the filtered reads
- If you run the pipeline on the full set of demultiplexed reads (--subset none), there will be _demultiplexed_NanoPlots/ and _filtered_NanoPlots/ NanoPlot directories for both the demultiplexed and the subsequently filtered reads.
- The --subset none option will also generate _demultnanostats.txt and _filtnanostats.txt files, which summarize NanoPlot stats from the _demultiplexedout/ directory for each dataset. These files make it easier to check read statistics quickly without digging through too many directories.
Outputs generated by the --clust option:
- readclstrs/ contains outputs from isONclust, SPOA, and cd-hit-est, as well as input files for Medaka, for each sample:
  - clstr_fqs/ contains fastq files for all isONclust clusters, as well as SPOA consensus sequence fasta files for clusters that pass the --perthresh threshold
  - readsformedaka/ contains the input read fastq file for Medaka
  - Output files related to isONclust (all generated by isONclust):
    - Numbered directories are intermediate outputs
    - sorted.fastq is an intermediate file
    - logfile.txt contains output statistics about read error rate
    - final_cluster_origins.tsv contains 1 representative read per isONclust cluster
    - final_clusters.tsv contains all reads per isONclust cluster
  - Output files related to SPOA (generated by SAIGA):
    - all_clstr_conseqs.fasta is the input file for cd-hit-est, containing all the SPOA consensus sequences
  - Output files related to cd-hit-est:
    - cdhit_[].fasta contains 1 representative sequence per cd-hit-est cluster; generated by cd-hit-est
    - cdhit_[].fasta.clstr shows how cd-hit-est clustered the SPOA consensus sequences; generated by cd-hit-est
    - final_clusters_subset.csv contains all reads that pass the cd-hit-est clustering round and is used to format the input file for Medaka; generated by SAIGA
  - Output files related to Medaka:
    - formedaka_singleline.fasta contains the SPOA consensus sequences that pass the cd-hit-est clustering round; generated by SAIGA
    - formedaka_singleline.fasta.firstclstr contains the SPOA consensus sequence from the majority-read isONclust cluster, and is used as input for Medaka; generated by SAIGA
    - Intermediate files generated by Medaka: formedaka_singleline.firstclstr.fai, formedaka_singleline.firstclstr.mmi
  - Read cluster tallies (generated by SAIGA):
    - reads_per_cluster.txt uses the final_clusters.tsv file to tally the number, percent, and cumulative sum of reads per isONclust cluster
    - parsedclstr_table.txt shows the number, percent, and cumulative sum of reads per isONclust cluster that passed the --perthresh threshold, as well as their cd-hit-est cluster assignments. The total number of reads that go on to the Medaka step is tallied from this table; it only includes reads in the top-line cd-hit-est cluster. The proportion of reads used to generate the final consensus sequence comes from this table.
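For a quick sanity check of per-cluster read counts without opening reads_per_cluster.txt, a rough bash equivalent of the tally, computed from final_clusters.tsv (whose first column is the isONclust cluster ID):
cut -f1 final_clusters.tsv | sort | uniq -c | sort -rn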
Additional outputs generated by the --clust option:
- _spID/ contains outputs from Medaka and the Blast search for each sample:
  - Intermediate files generated by Medaka: calls_to_draft.bam, calls_to_draft.bam.bai, consensus_probs.hdf
  - consensus.fasta contains the final consensus sequence from Medaka
  - blastout.sorted contains the Blast output (output format 6)
  - _finalparsed_output.txt contains the best Blast result based on bit score and lowest number of mismatches; generated by SAIGA
The FinalResults/ directory contains _allsamps_parsedout.txt files that summarize the following for the analysis run:
Column | Description |
---|---|
Sample | sample name |
Raw_count | total number of raw reads from entire sequence run |
Demultiplexed_count | number of demultiplexed filtered reads per sample |
isONclust_count | number of reads used to build final consensus |
prop_demreads_inclstr | proportion of reads used to build final consensus |
query | Blast query (Medaka consensus sequence) |
genbank | Blast database match ID |
per_id | Blast percent sequence identity |
aln_len | Blast alignment length of match |
num_mismatch | Blast number of mismatches |
gap_open | Blast alignment gaps |
qstart | Blast query start position |
qend | Blast query end position |
sstart | Blast database sequence start position |
send | Blast database sequence end position |
evalue | Blast evalue |
bitscore | Blast bit score |
db_sp_name | Species ID of Blast database sequence match |
MkConsSeq | Medaka consensus sequence |
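For reference, the "best Blast result" selection described above (highest bit score, then fewest mismatches) can be approximated directly on the outfmt-6 table with a bash one-liner (this is not SAIGA's actual parser; in outfmt 6, column 12 is the bit score and column 5 is the mismatch count):
sort -t$'\t' -k12,12gr -k5,5n blastout.sorted | head -n 1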
One of the easiest ways to install python libraries and other software is to use conda install from Anaconda.
- Download Anaconda3: https://www.anaconda.com/distribution/. We want Anaconda3 so that it downloads Python3. Follow installation instructions from website.
- Once you have Anaconda, configure the conda command to tell it where to look online for software. It does not matter what directory you're in.
conda config --add channels bioconda
conda config --add channels conda-forge
Note that this pipeline assumes the following software can be found on your PATH, as set in ~/.bashrc (loaded via source ~/.bashrc). You can add the export lines with a text editor, e.g., nano ~/.bashrc.
- guppy
- qcat
- MiniBar
- seqtk
- spoa
- cd-hit
- blast
https://community.nanoporetech.com/downloads
- I downloaded the Mac OSX version
- Unzip it:
unzip ont-guppy-cpu_2.3.1_osx64.zip
- Add guppy to the bashrc path (export PATH=$PATH:[your path]/ont-guppy-cpu/bin)
https://github.com/wdecoster/nanofilt
conda install -c bioconda nanofilt
- Runs as NanoFilt <flags>
- The conda install originally caused a problem until I added the bioconda channel to conda (see above)
https://github.com/wdecoster/NanoPlot
- Download the zipped folder of the github repository from the green button that says 'Clone or download'
- Unzip folder:
unzip NanoPlot-master.zip
- Set up NanoPlot:
cd ./NanoPlot-master
python setup.py develop
- Runs as NanoPlot <inputs>
- Alternatively, use pip or conda installation commands (the above is a work-around since pip and conda timed out for me)
https://github.com/nanoporetech/qcat
- Follow the qcat download instructions (I downloaded via the git clone option, v1.1.0)
- Add qcat to the bashrc path (export PATH=$PATH:[your path]/qcat)
https://github.com/calacademy-research/minibar
- Download MiniBar:
curl -O https://raw.githubusercontent.com/calacademy-research/minibar/master/minibar.py
- Permit execution of MiniBar's python script:
chmod 775 minibar.py
- minibar.py needs a Python library called edlib. Note that the python-edlib package below provides the Python bindings, which differ from edlib, a command-line version that works in a Unix/Linux environment rather than in Python.
conda install -c bioconda python-edlib
- Add MiniBar to the bashrc path (export PATH=$PATH:[your path]/MiniBar/)
https://github.com/lh3/seqtk
- Follow the github installation instructions
- Add seqtk to the bashrc path (export PATH=$PATH:[your path]/seqtk)
https://github.com/ksahlin/isONclust
- download:
pip install isONclust
https://github.com/rvaser/spoa
git clone --recursive https://github.com/rvaser/spoa spoa
cd spoa
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -Dspoa_build_executable=ON ..
make
- Add spoa to the bashrc path (export PATH=$PATH:[your path]/spoa/build/bin/)
https://github.com/weizhongli/cdhit/wiki/2.-Installation
- cd-hit-est needs gcc:
brew install gcc
- Download the cd-hit tar.gz file and unpack it:
tar xvf cd-hit-v4.8.1-2019-0228.tar.gz --gunzip
cd cd-hit-v4.8.1-2019-0228
make CC=/usr/local/Cellar/gcc/9.1.0/bin/g++-9
make
cd cd-hit-auxtools/
make
- Add cd-hit to the bashrc path (export PATH=$PATH:[your path]/cd-hit-v4.8.1-2019-0228)
https://github.com/nanoporetech/medaka
conda create -n medaka -c conda-forge -c bioconda medaka
- Medaka now has its own conda environment, which means you have to activate/deactivate the environment to run medaka:
conda activate medaka
conda deactivate
- If you get an error about numpy, try rolling back the version of a library called openblas (this fixed it for me):
conda install openblas=0.3.6
- Runs as medaka_consensus
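For reference, a typical standalone invocation looks like the following (file names are illustrative; SAIGA builds this call for you):
conda activate medaka
medaka_consensus -i readsformedaka.fastq -d formedaka_singleline.fasta.firstclstr -o medaka_out
conda deactivate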
https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
- Follow download instructions from NCBI
- Add ncbi-blast to the bashrc path (export PATH=$PATH:[your path]/ncbi-blast-2.8.1+/bin)