forked from sortmerna/sortmerna
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathSortMeRNA-User-Manual.tex
995 lines (832 loc) · 44.4 KB
/
SortMeRNA-User-Manual.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
\documentclass[10pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage{amsmath, amsthm, amssymb}
\usepackage[english]{isodate}
\usepackage[parfill]{parskip}
\usepackage{url}
\usepackage{keystroke}
\usepackage{graphicx}
\usepackage{fancyvrb}
\usepackage{color}
\usepackage[usenames,dvipsnames]{xcolor}
\usepackage{booktabs}
\usepackage{multirow}
\usepackage{hyperref}
\hypersetup{
colorlinks,
citecolor=black,
filecolor=black,
linkcolor=black,
urlcolor=black
}
\usepackage{tikz} % graphic diagrams
\usetikzlibrary{positioning,patterns,backgrounds,decorations.pathreplacing,decorations.markings,shapes,fit,calc,shadows} % fitting shapes to coordinates
\usetikzlibrary{automata,trees}
\newcommand\verbbf[1]{\textcolor[rgb]{0,0,1}{\textbf{#1}}}
\title{SortMeRNA User Manual}
\author{Evguenia Kopylova\\ {\em [email protected]}}
\date{Feb 2016, version 2.1}
\addtolength{\oddsidemargin}{-.7in}
\addtolength{\textwidth}{1.2in}
\begin{document}
\maketitle
\newpage
\tableofcontents
\newpage
\section{Introduction}
Copyright (C) 2012-2016 Bonsai Bioinformatics Research Group \\
(LIFL - Universit\'{e} Lille 1), CNRS UMR 8022, INRIA Nord-Europe, France \\
\url{http://bioinfo.lifl.fr/RNA/sortmerna/} \\
Copyright (C) 2014-2016 Knight Lab \\
Department of Pediatrics, UCSD School of Medicine, La Jolla, California, USA\\
\url{https://knightlab.colorado.edu}
SortMeRNA is a local sequence alignment tool for filtering, mapping and OTU-picking.
The core algorithm is based on approximate seeds and allows for fast and sensitive analyses
of NGS reads. The main application of SortMeRNA is filtering rRNA from metatranscriptomic data.
Additional applications include OTU-picking and taxonomy assignation available through QIIME v1.9+ (\url{http://qiime.org}).
SortMeRNA takes as input a file of reads (fasta or fastq format) and one or multiple rRNA
database file(s), and sorts apart aligned and rejected reads into two files specified by the user.
SortMeRNA works with Illumina, 454, Ion Torrent and PacBio data, and can produce SAM and
BLAST-like alignments.
For questions \& help, please contact:
\begin{verbatim}
1. Evguenia Kopylova [email protected]
2. Laurent Noe [email protected]
3. Helene Touzet [email protected]
4. Rob Knight [email protected]
\end{verbatim}
{\bf Important:} This user manual is strictly for SortMeRNA version 2.1.
\section{Installation}
\begin{figure}[here!]
\caption{\texttt{sortmerna- 2.1} directory tree}~\\
\centering
\tikzstyle{every node}=[draw=black,thick,anchor=west]
\tikzstyle{selected}=[draw=red,fill=red!30]
\tikzstyle{root}=[fill=gray!50]
\begin{tikzpicture}[%
grow via three points={one child at (0.5,-0.7) and
two children at (0.5,-0.7) and (0.5,-1.4)},
edge from parent path={(\tikzparentnode.south) |- (\tikzchildnode.west)}]
\node [root] {sortmerna- 2.1}
child { node {alp}}
child { node {cmph}}
child { node {src}}
child { node {include}}
child { node {scripts}}
child { node {tests}}
child { node {rRNA\_databases}
child { node {silva-bac-16s-id90.fasta}}
child { node {...}}
}
child [missing] {}
child [missing] {}
child { node [selected] {sortmerna} }
child { node [selected] {indexdb\_rna} }
;
\end{tikzpicture}
\label{fig:systemtree}
\end{figure}
\subsection{Install from tarball release}
\label{sec:install}
\begin{enumerate}
\item Download \texttt{sortmerna- 2.1.tar.gz} from \url{https://github.com/biocore/sortmerna/releases}
\item Extract the source code package into a directory of your choice, enter \texttt{sortmerna- 2.1} directory and type,
\begin{verbatim}
> bash ./build.sh
\end{verbatim}
\item At this point, two executables \texttt{indexdb\_rna} and \texttt{sortmerna} will be located
in the \texttt{sortmerna- 2.1} directory.
If the user would like to install the executables into their default installation directory (\texttt{/usr/local/bin} for Linux or \texttt{/opt/local/bin} for Mac) then type,
\begin{verbatim}
> make install (with root permissions)
\end{verbatim}
\item To begin using SortMeRNA, type `\texttt{indexdb\_rna -h}' or `\texttt{sortmerna -h}'. Databases must first be indexed using \texttt{indexdb\_rna}.
\end{enumerate}
\subsection{Install development version from git}
\label{sec:install}
\begin{enumerate}
\item Clone the sortmerna directory to your local system
\begin{verbatim}
> git clone https://github.com/biocore/sortmerna.git
\end{verbatim}
\item Build sortmerna
\begin{verbatim}
> cd sortmerna
> bash ./build.sh
\end{verbatim}
\end{enumerate}
\subsection{Install from precompiled code}
\begin{enumerate}
\item Download the latest binary distribution of SortMeRNA from \url{http://bioinfo.lifl.fr/RNA/sortmerna}
\item Extract the source code package into a directory of your choice,
\begin{verbatim}
> tar -xvf sortmerna- 2.1.tar.gz
> cd sortmerna- 2.1
\end{verbatim}
\item To begin using SortMeRNA, type `\texttt{indexdb\_rna -h}' or `\texttt{sortmerna -h}'. The user must firstly index
the databases with the command \texttt{indexdb\_rna} before they can run the command \texttt{sortmerna}.
\end{enumerate}
\subsection{Uninstall}
\noindent If the user installed SortMeRNA using the command \texttt{`make install'}, then they can use the command \texttt{`make uninstall'} to
uninstall SortMeRNA (with root permissions).
\section{Databases}
\noindent SortMeRNA comes prepackaged with 8 databases,\\
\resizebox{6.4in}{!}{
\begin{tabular}{l|l|l|l|l}
\textbf{representative database} & \textbf{\%id} & $\#$ \textbf{seq (clustered)} & \textbf{origin} & $\#$ \textbf{seq (original)} \\
\hline
silva-bac-16s-id90 & 90 & 12798 & SILVA SSU Ref NR v.119 & 464618 \\
silva-arc-16s-id95 & 95 & 3193 & SILVA SSU Ref NR v.119 & 18797 \\
silva-euk-18s-id95 & 95 & 7348 & SILVA SSU Ref NR v.119 & 51553 \\
silva-bac-23s-id98 & 98 & 4488 & SILVA LSU Ref v.119 & 43822 \\
silva-arc-23s-id98 & 98 & 251 & SILVA LSU Ref v.119 & 629 \\
silva-euk-28s-id98 & 98 & 4935 & SILVA LSU Ref v.119 & 13095 \\
rfam-5s-id98 & 98 & 59513 & RFAM & 116760 \\
rfam-5.8s-id98 & 98 & 13034 & RFAM & 225185 \\
\end{tabular}}
~\\
HMMER 3.1b1 and SumaClust v1.0.00 were used to reduce the size of the original databases to the similarity listed in column 2 (\%id) of the table above
(see {\tt/sortmerna/rRNA\_databases/README.txt} for a list of complete steps).
These representative databases were specifically made for fast filtering of rRNA. Approximately the same number of rRNA will be filtered
using silva-bac-16s-id90 (12802 rRNA) as using Greengenes 97\% (99322 rRNA), but the former will run significantly faster.
\noindent \textbf{id} $\%$: members of the cluster must have identity at least this \% id with the representative sequence \\
\noindent \textbf{Remark}: The user must first index the fasta database by using the command \texttt{indexdb\_rna} and
then filter/map reads against the database using the command \texttt{sortmerna}.
\section{How to run SortMeRNA}
\subsection{Index the rRNA database: command `\texttt{indexdb\_rna}'}
\noindent The executable \texttt{indexdb\_rna} indexes an rRNA database.\\
\noindent To see the man page for \texttt{indexdb\_rna},
\begin{Verbatim}[fontsize=\footnotesize]
>> indexdb_rna -h
Program: SortMeRNA version 2.1, 01/02/2016
Copyright: 2012-16 Bonsai Bioinformatics Research Group:
LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe
2014-16 Knight Lab, Department of Pediatrics, UCSD, La Jolla,
Disclaimer: SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the
implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU Lesser General Public License for more details.
Contact: Evguenia Kopylova, [email protected]
Laurent Noe, [email protected]
Helene Touzet, [email protected]
usage: ./indexdb_rna --ref db.fasta,db.idx [OPTIONS]:
--------------------------------------------------------------------------------------------------------
| parameter value description default |
--------------------------------------------------------------------------------------------------------
--ref STRING,STRING FASTA reference file, index file mandatory
(ex. --ref /path/to/file1.fasta,/path/to/index1)
If passing multiple reference sequence files, separate
them by ':',
(ex. --ref /path/to/file1.fasta,/path/to/index1:/path/to/file2.fasta,path/to/index2)
[OPTIONS]:
--fast BOOL suggested option for aligning ~99% related species off
--sensitive BOOL suggested option for aligning ~75-98% related species on
--tmpdir STRING directory where to write temporary files
-m INT the amount of memory (in Mbytes) for building the index 3072
-L INT seed length 18
--max_pos INT maximum number of positions to store for each unique L-mer 10000
(setting --max_pos 0 will store all positions)
-v BOOL verbose
-h BOOL help
\end{Verbatim}
~\\~\\
\noindent There are eight rRNA representative databases provided in the `\texttt{sortmerna- 2.1/rRNA\_databases}' folder.
All databases were derived from the SILVA SSU and LSU databases (release 119) and the RFAM databases using HMMER 3.1b1 and SumaClust v1.0.00.
Additionally, the user can index their own database. \\
\subsubsection{Example 1: indexdb\_rna using one database}
\begin{Verbatim}[fontsize=\footnotesize]
>> ./indexdb_rna --ref ./rRNA_databases/silva-bac-16s-id90.fasta,./index/silva-bac-16s-db -v
Program: SortMeRNA version 2.1, 01/02/2016
Copyright: 2012-16 Bonsai Bioinformatics Research Group:
LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe
2014-16 Knight Lab, Department of Pediatrics, UCSD, La Jolla,
Disclaimer: SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the
implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU Lesser General Public License for more details.
Contact: Evguenia Kopylova, [email protected]
Laurent Noe, [email protected]
Helene Touzet, [email protected]
Parameters summary:
K-mer size: 19
K-mer interval: 1
Maximum positions to store per unique K-mer: 10000
Total number of databases to index: 1
Begin indexing file ./rRNA_databases/silva-bac-16s-id90.fasta under index name ./index/silva-bac-16s-db:
Collecting sequence distribution statistics .. done [1.133206 sec]
start index part # 0:
(1/3) building burst tries .. done [23.643256 sec]
(2/3) building CMPH hash .. done [22.306709 sec]
(3/3) building position lookup tables .. done [54.958680 sec]
total number of sequences in this part = 12798
writing kmer data to ./index/silva-bac-16s-db.kmer_0.dat
writing burst tries to ./index/silva-bac-16s-db.bursttrie_0.dat
writing position lookup table to ./index/silva-bac-16s-db.pos_0.dat
writing nucleotide distribution statistics to ./index/silva-bac-16s-db.stats
done.
\end{Verbatim}
~\\
\subsubsection{Example 2: indexdb\_rna using multiple databases}
Multiple databases can be indexed simultaneously by passing them as a `:' separated list to \texttt{--ref} (no spaces allowed).
\begin{Verbatim}[fontsize=\footnotesize]
>> ./indexdb_rna --ref ./rRNA_databases/silva-bac-16s-id90.fasta,./index/silva-bac-16s-db:\
./rRNA_databases/silva-bac-23s-id98.fasta,./index/silva-bac-23s-db:\
./rRNA_databases/silva-arc-16s-id95.fasta,./index/silva-arc-16s-db:\
./rRNA_databases/silva-arc-23s-id98.fasta,./index/silva-arc-23s-db:\
./rRNA_databases/silva-euk-18s-id95.fasta,./index/silva-euk-18s-db:\
./rRNA_databases/silva-euk-28s-id98.fasta,./index/silva-euk-28s:\
./rRNA_databases/rfam-5s-database-id98.fasta,./index/rfam-5s-db:\
./rRNA_databases/rfam-5.8s-database-id98.fasta,./index/rfam-5.8s-db
\end{Verbatim}
\newpage
\subsection{A guide to choosing `{\bf sortmerna}' parameters for filtering and read mapping}
In SortMeRNA version 1.99 beta and up, users have the option to output sequence alignments for their matching rRNA reads in
the SAM or BLAST-like formats. Depending on the desired quality of alignments, different parameters choices must be set.
Table~\ref{tab:guide} presents a guide to setting parameters choices for most use cases. In all cases, output alignments are always guaranteed to reach
the threshold E-value score (default E-value=1). An E-value of 1 signifies that one random alignment is expected for aligning
\textbf{all} reads against the reference database. The E-value in SortMeRNA is computed for the entire search space, not per read.
\begin{table}[htp!]
\caption{SortMeRNA alignment parameter guide}
\label{tab:guide}
\centering
\footnotesize
\begin{tabular}{l | l | l}
\toprule
\parbox[t]{0.6in}{\sf option} & {\sf speed} & \parbox[t]{0.45in}{\sf description} \\
\midrule
\multirow{8}{*}{{\tt --num-alignments INT}}
& Very fast for {\tt INT = 1}& \parbox{6cm}{Output the first alignment passing E-value threshold ({\bf best choice if only filtering is needed})} \\
\cmidrule{2-3}
& Speed decreases for higher value {\tt INT} & \parbox{6cm}{Higher {\tt INT} signifies more alignments will be made \& output }\\
\cmidrule{2-3}
& Very slow for {\tt INT = 0} & \parbox{6cm}{All alignments reaching the E-value threshold are reported (this option is not suggested for high similarity rRNA databases, due to many possible alignments per read causing a very large file output)} \\
\midrule
\multirow{4}{*}{{\tt --best INT}}
& Fast for {\tt INT = 1} & \parbox{6cm}{Only one high-candidate reference sequence will be searched for alignments (determined heuristically using a Longest Increasing Subsequence of seed matches). The single best alignment of those will be reported }\\
\cmidrule{2-3}
& Speed decreases for higher value {\tt INT} & \parbox{6cm}{Higher {\tt INT} signifies more alignments will be made, though only the best one will be reported } \\
\cmidrule{2-3}
& Very slow for {\tt INT = 0} & \parbox{6cm}{All high-candidate reference sequences will be searched for alignments, though only the best one will be reported }\\
\bottomrule
\end{tabular}
\end{table}
\newpage
\subsection{Filter rRNA reads}
\noindent The executable \texttt{sortmerna} can filter rRNA reads against an indexed rRNA database.\\
\noindent To see the man page for \texttt{sortmerna},
\begin{Verbatim}[fontsize=\footnotesize]
>> ./sortmerna -h
Program: SortMeRNA version 2.1, 01/02/2016
Copyright: 2012-16 Bonsai Bioinformatics Research Group:
LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe
2014-16 Knight Lab, Department of Pediatrics, UCSD, La Jolla,
Disclaimer: SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the
implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU Lesser General Public License for more details.
Contact: Evguenia Kopylova, [email protected]
Laurent Noe, [email protected]
Helene Touzet, [email protected]
usage: ./sortmerna --ref db.fasta,db.idx --reads file.fa --aligned base_name_output [OPTIONS]:
-------------------------------------------------------------------------------------------------------------
| parameter value description default |
-------------------------------------------------------------------------------------------------------------
--ref STRING,STRING FASTA reference file, index file mandatory
(ex. --ref /path/to/file1.fasta,/path/to/index1)
If passing multiple reference files, separate
them using the delimiter ':',
(ex. --ref /path/to/file1.fasta,/path/to/index1:/path/to/file2.fasta,path/to/index2)
--reads STRING FASTA/FASTQ reads file mandatory
--aligned STRING aligned reads filepath + base file name mandatory
(appropriate extension will be added)
[COMMON OPTIONS]:
--other STRING rejected reads filepath + base file name
(appropriate extension will be added)
--fastx BOOL output FASTA/FASTQ file off
(for aligned and/or rejected reads)
--sam BOOL output SAM alignment off
(for aligned reads only)
--SQ BOOL add SQ tags to the SAM file off
--blast STRING output alignments in various Blast-like formats
`0' - pairwise
`1' - tabular (Blast -m 8 format)
`1 cigar' - tabular + column for CIGAR
`1 cigar qcov' - tabular + columns for CIGAR
and query coverage
`1 cigar qcov qstrand' - tabular + columns for CIGAR,
query coverage and strand
--log BOOL output overall statistics off
--num_alignments INT report first INT alignments per read reaching E-value -1
(--num_alignments 0 signifies all alignments will be output)
or (default)
--best INT report INT best alignments per read reaching E-value 1
by searching --min_lis INT candidate alignments
(--best 0 signifies all candidate alignments will be searched)
--min_lis INT search all alignments having the first INT longest LIS 2
LIS stands for Longest Increasing Subsequence, it is
computed using seeds' positions to expand hits into
longer matches prior to Smith-Waterman alignment.
--print_all_reads BOOL output null alignment strings for non-aligned reads off
to SAM and/or BLAST tabular files
--paired_in BOOL both paired-end reads go in --aligned fasta/q file off
(interleaved reads only, see Section 4.2.4 of User Manual)
--paired_out BOOL both paired-end reads go in --other fasta/q file off
(interleaved reads only, see Section 4.2.4 of User Manual)
--match INT SW score (positive integer) for a match 2
--mismatch INT SW penalty (negative integer) for a mismatch -3
--gap_open INT SW penalty (positive integer) for introducing a gap 5
--gap_ext INT SW penalty (positive integer) for extending a gap 2
-N INT SW penalty for ambiguous letters (N's) scored as --mismatch
-F BOOL search only the forward strand off
-R BOOL search only the reverse-complementary strand off
-a INT number of threads to use 1
-e DOUBLE E-value threshold 1
-m INT INT Mbytes for loading the reads into memory 1024
(maximum -m INT is 4096)
-v BOOL verbose off
[OTU PICKING OPTIONS]:
--id DOUBLE %id similarity threshold (the alignment must 0.97
still pass the E-value threshold)
--coverage DOUBLE %query coverage threshold (the alignment must 0.97
still pass the E-value threshold)
--de_novo_otu BOOL FASTA/FASTQ file for reads matching database < %id off
(set using --id) and < %cov (set using --coverage)
(alignment must still pass the E-value threshold)
--otu_map BOOL output OTU map (input to QIIME's make_otu_table.py) off
[ADVANCED OPTIONS] (see SortMeRNA user manual for more details):
--passes INT,INT,INT three intervals at which to place the seed on the read L,L/2,3
(L is the seed length set in ./indexdb_rna)
--edges INT number (or percent if INT followed by % sign) of 4
nucleotides to add to each edge of the read
prior to SW local alignment
--num_seeds INT number of seeds matched before searching 2
for candidate LIS
--full_search BOOL search for all 0-error and 1-error seed off
matches in the index rather than stopping
after finding a 0-error match (<1% gain in
sensitivity with up four-fold decrease in speed)
--pid BOOL add pid to output file names off
[HELP]:
-h BOOL help
--version BOOL SortMeRNA version number
\end{Verbatim}
~\\
\noindent The user can adjust the amount of memory allocated for loading the reads through the
command option \texttt{-m}. By default, \texttt{-m} is set to be high enough for 1GB.
If the reads file is larger than 1GB, then \texttt{sortmerna} internally divides the file into partial sections of
1GB and executes one section at a time. Hence, if a user has an input file of 15GB and only 1GB of RAM to store it, the
file will be processed in partial sections using \texttt{mmap} without having to physically split it prior to execution. Otherwise, the user
can increase \texttt{-m} to map larger portions of the file. The limit for \texttt{-m} is given by typing \texttt{sortmerna -h}.
\newpage
\subsubsection{Example 3: multiple databases and the fastest alignment option}
\begin{Verbatim}[fontsize=\footnotesize]
>> time ./sortmerna --ref ./rRNA_databases/silva-bac-16s-id90.fasta,./index/silva-bac-16s-db:\
./rRNA_databases/silva-bac-23s-id98.fasta,./index/silva-bac-23s-db:\
./rRNA_databases/silva-arc-16s-id95.fasta,./index/silva-arc-16s-db:\
./rRNA_databases/silva-arc-23s-id98.fasta,./index/silva-arc-23s-db:\
./rRNA_databases/silva-euk-18s-id95.fasta,./index/silva-euk-18s-db:\
./rRNA_databases/silva-euk-28s-id98.fasta,./index/silva-euk-28s:\
./rRNA_databases/rfam-5s-database-id98.fasta,./index/rfam-5s-db:\
./rRNA_databases/rfam-5.8s-database-id98.fasta,./index/rfam-5.8s-db\
--reads SRR106861.fasta --sam --num_alignments 1 --fastx --aligned SRR105861_rRNA\
--other SRR105861_non_rRNA --log -v
Program: SortMeRNA version 2.1, 01/02/2016
Copyright: 2012-16 Bonsai Bioinformatics Research Group:
LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe
2014-16 Knight Lab, Department of Pediatrics, UCSD, La Jolla,
Disclaimer: SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the
implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU Lesser General Public License for more details.
Contact: Evguenia Kopylova, [email protected]
Laurent Noe, [email protected]
Helene Touzet, [email protected]
Computing read file statistics ... done [2.16 sec]
size of reads file: 35238748 bytes
partial section(s) to be executed: 1 of size 35238748 bytes
Parameters summary:
Number of seeds = 2
Edges = 4 (as integer)
SW match = 2
SW mismatch = -3
SW gap open penalty = 5
SW gap extend penalty = 2
SW ambiguous nucleotide = -3
SQ tags are not output
Number of threads = 1
Begin mmap reads section # 1:
Time to mmap reads and set up pointers [0.11 sec]
Begin analysis of: ./rRNA_databases/silva-bac-16s-id90.fasta
Seed length = 18
Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
Gumbel lambda = 0.602397
Gumbel K = 0.328927
Minimal SW score based on E-value = 54
Loading index part 1/1 ... done [4.67 sec]
Begin index search ... done [83.53 sec]
Freeing index ... done [0.87 sec]
Begin analysis of: ./rRNA_databases/silva-bac-23s-id98.fasta
Seed length = 18
Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
Gumbel lambda = 0.603075
Gumbel K = 0.330488
Minimal SW score based on E-value = 53
Loading index part 1/1 ... done [3.63 sec]
Begin index search ... done [94.76 sec]
Freeing index ... done [0.41 sec]
Begin analysis of: ./rRNA_databases/silva-arc-16s-id95.fasta
Seed length = 18
Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
Gumbel lambda = 0.596230
Gumbel K = 0.322143
Minimal SW score based on E-value = 52
Loading index part 1/1 ... done [1.14 sec]
Begin index search ... done [22.63 sec]
Freeing index ... done [0.14 sec]
Begin analysis of: ./rRNA_databases/silva-arc-23s-id98.fasta
Seed length = 18
Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
Gumbel lambda = 0.597749
Gumbel K = 0.325630
Minimal SW score based on E-value = 49
Loading index part 1/1 ... done [0.50 sec]
Begin index search ... done [13.27 sec]
Freeing index ... done [0.06 sec]
Begin analysis of: ./rRNA_databases/silva-euk-18s-id95.fasta
Seed length = 18
Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
Gumbel lambda = 0.612228
Gumbel K = 0.334926
Minimal SW score based on E-value = 52
Loading index part 1/1 ... done [3.23 sec]
Begin index search ... done [30.28 sec]
Freeing index ... done [0.45 sec]
Begin analysis of: ./rRNA_databases/silva-euk-28s-id98.fasta
Seed length = 18
Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
Gumbel lambda = 0.612068
Gumbel K = 0.344763
Minimal SW score based on E-value = 53
Loading index part 1/1 ... done [3.43 sec]
Begin index search ... done [35.69 sec]
Freeing index ... done [0.48 sec]
Begin analysis of: ./rRNA_databases/rfam-5s-database-id98.fasta
Seed length = 18
Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
Gumbel lambda = 0.616617
Gumbel K = 0.341306
Minimal SW score based on E-value = 51
Loading index part 1/1 ... done [1.77 sec]
Begin index search ... done [13.50 sec]
Freeing index ... done [0.22 sec]
Begin analysis of: ./rRNA_databases/rfam-5.8s-database-id98.fasta
Seed length = 18
Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
Gumbel lambda = 0.617817
Gumbel K = 0.340589
Minimal SW score based on E-value = 49
Loading index part 1/1 ... done [0.60 sec]
Begin index search ... done [8.78 sec]
Freeing index ... done [0.07 sec]
Total number of reads mapped (incl. all reads file sections searched): 104243
Writing aligned FASTA/FASTQ ... done [1.13 sec]
Writing not-aligned FASTA/FASTQ ... done [0.10 sec]
\end{Verbatim}
~\\
\noindent The option `\texttt{--log}' will create an overall statistics file,\\
\begin{Verbatim}[fontsize=\footnotesize]
>> cat SRR105861_rRNA.log
Time and date
Command: sortmerna --ref ./rRNA_databases/silva-bac-16s-id90.fasta,./index/silva-bac-16s-db:\
./rRNA_databases/silva-bac-23s-id98.fasta,./index/silva-bac-23s-db:\
./rRNA_databases/silva-arc-16s-id95.fasta,./index/silva-arc-16s-db:\
./rRNA_databases/silva-arc-23s-id98.fasta,./index/silva-arc-23s-db:\
./rRNA_databases/silva-euk-18s-id95.fasta,./index/silva-euk-18s-db:\
./rRNA_databases/silva-euk-28s-id98.fasta,./index/silva-euk-28s:\
./rRNA_databases/rfam-5s-database-id98.fasta,./index/rfam-5s-db:\
./rRNA_databases/rfam-5.8s-database-id98.fasta,./index/rfam-5.8s-db\
--reads /Users/jenya/Downloads/SRR106861.fasta --sam --num_alignments 1\
--fastx --aligned SRR105861_rRNA --other SRR105861_non_rRNA.fasta fasta -v
Process pid = 1957
Parameters summary:
Index: ./index/silva-bac-16s-db
Seed length = 18
Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
Gumbel lambda = 0.602397
Gumbel K = 0.328927
Minimal SW score based on E-value = 54
Index: ./index/silva-bac-23s-db
Seed length = 18
Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
Gumbel lambda = 0.603075
Gumbel K = 0.330488
Minimal SW score based on E-value = 53
Index: ./index/silva-arc-16s-db
Seed length = 18
Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
Gumbel lambda = 0.596230
Gumbel K = 0.322143
Minimal SW score based on E-value = 52
Index: ./index/silva-arc-23s-db
Seed length = 18
Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
Gumbel lambda = 0.597749
Gumbel K = 0.325630
Minimal SW score based on E-value = 49
Index: ./index/silva-euk-18s-db
Seed length = 18
Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
Gumbel lambda = 0.612228
Gumbel K = 0.334926
Minimal SW score based on E-value = 52
Index: ./index/silva-euk-28s
Seed length = 18
Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
Gumbel lambda = 0.612068
Gumbel K = 0.344763
Minimal SW score based on E-value = 53
Index: ./index/rfam-5s-db
Seed length = 18
Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
Gumbel lambda = 0.616617
Gumbel K = 0.341306
Minimal SW score based on E-value = 51
Index: ./index/rfam-5.8s-db
Seed length = 18
Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
Gumbel lambda = 0.617817
Gumbel K = 0.340589
Minimal SW score based on E-value = 49
Number of seeds = 2
Edges = 4 (as integer)
SW match = 2
SW mismatch = -3
SW gap open penalty = 5
SW gap extend penalty = 2
SW ambiguous nucleotide = -3
SQ tags are not output
Number of threads = 1
Reads file = SRR106861.fasta
Results:
Total reads = 113128
Total reads passing E-value threshold = 104243 (92.15%)
Total reads failing E-value threshold = 8885 (7.85%)
Minimum read length = 59
Maximum read length = 1253
Mean read length = 267
By database:
./rRNA_databases/silva-bac-16s-id90.fasta 25.73%
./rRNA_databases/silva-bac-23s-id98.fasta 64.37%
./rRNA_databases/silva-arc-16s-id95.fasta 0.00%
./rRNA_databases/silva-arc-23s-id98.fasta 0.00%
./rRNA_databases/silva-euk-18s-id95.fasta 0.00%
./rRNA_databases/silva-euk-28s-id98.fasta 0.00%
./rRNA_databases/rfam-5s-database-id98.fasta 2.04%
./rRNA_databases/rfam-5.8s-database-id98.fasta 0.00%
\end{Verbatim}
\subsubsection{Filtering paired-end reads}
When writing aligned and non-aligned reads to FASTA/Q files, sometimes the situation arises
where one of the paired-end reads aligns and the other one doesn't. Since SortMeRNA
looks at each read individually, by default the reads will be split into two separate files. That is, the read that
aligned will go into the {\tt--aligned} FASTA/Q file and the pair that didn't align will go into the
{\tt--other} FASTA/Q file.
This situation would result in the splitting of some paired reads in the
output files and not optimal for users who require paired order of the reads for
downstream analyses.
For users who wish to keep the order of their paired-ended reads, two options are available.
If one read aligns and the other one not then,
\begin{enumerate}
\item[(1)] {\tt--paired-in} will put both reads into the file specified by {\tt--aligned}
\item[(2)] {\tt--paired-out} will put both reads into the file specified by {\tt--other}
\end{enumerate}
The first option, {\tt--paired-in} is optimal for users that want all reads in the {\tt--other} file
to be non-rRNA. However, there are small chances that reads which are non-rRNA will also be
put into the {\tt--aligned} file.
The second option, {\tt--paired-out} is optimal for users that want only rRNA reads in the
{\tt--aligned} file. However, there are small chances that reads which are rRNA will also be
put into the {\tt--other} file.
If neither of these two options is added to the {\tt sortmerna} command, then aligned and
non-aligned reads will be properly output to the {\tt--aligned} and {\tt--other} files, possibly breaking
the order for a set of paired reads between two output files.
{\bf It's important to note} that regardless of the options used, the {\tt--log} file will always
report the true number of reads classified as rRNA (not the number of reads in the {\tt--aligned}
file).
\subsubsection{Example 4: forward-reverse paired-end reads (2 input files)}
\begin{figure}[here!]
\centering
\resizebox{5in}{!}{
\tikzstyle{mybox} = [draw=OliveGreen, fill=blue!5, very thick,
rectangle, rounded corners, inner sep=10pt, inner ysep=20pt]
\tikzstyle{fancytitle} =[fill=OliveGreen, text=white, rectangle, rounded corners]
%
\begin{tikzpicture}
\node [mybox] (box1) {%
\begin{minipage}[t!]{2in}
{\footnotesize
@SEQUENCE\_ID\_1/\textbf{1} \\
ACTT..\\
+\\
QUALITY\_1/1\\
@SEQUENCE\_ID\_2/\textbf{1} \\
GTTA..\\
+\\
QUALITY\_2/1\\
..
}
\end{minipage}
};
\node [mybox] (box2) [right=of box1,xshift=2cm] {%
\begin{minipage}[t!]{2in}
{\footnotesize
@SEQUENCE\_ID\_1/\textbf{2} \\
GTAC..\\
+\\
QUALITY\_1/2\\
@SEQUENCE\_ID\_2/\textbf{2} \\
CCAC..\\
+\\
QUALITY\_2/2\\
..
}
\end{minipage}
};
\node[fancytitle] at (box1.north) {{\small FASTQ forward reads}};
\node[fancytitle] at (box2.north) {{\small FASTQ reverse reads}};
\draw [decorate,color=black!80,decoration={brace,mirror,amplitude=5pt,raise=2pt}] (3,0.3) -- node[right=10pt]{$~~pair~\#~1$}(3,1.8);
\draw [decorate,color=black!80,decoration={brace,amplitude=5pt,raise=2pt}] (5.8,0.3) -- node[right=10pt]{~}(5.8,1.8);
\draw [decorate,color=black!80,decoration={brace,amplitude=5pt,raise=2pt}] (3,0) -- node[right=10pt]{$~~pair~\#~2$}(3,-1.5);
\draw [decorate,color=black!80,decoration={brace,mirror,amplitude=5pt,raise=2pt}] (5.8,0) -- node[right=10pt]{~}(5.8,-1.5);
\end{tikzpicture}
}%resizebox
\caption{Forward and reverse reads in paired-end sequencing format}
\label{fig:format2}
\end{figure}
\begin{figure}[here!]
\centering
\resizebox{2.3in}{!}{
\tikzstyle{mybox} = [draw=OliveGreen, fill=blue!5, very thick,
rectangle, rounded corners, inner sep=10pt, inner ysep=20pt]
\tikzstyle{fancytitle} =[fill=OliveGreen, text=white, rectangle, rounded corners]
%
\begin{tikzpicture}
\node [mybox] (box) {%
\begin{minipage}[t!]{2in}
{\footnotesize
@SEQUENCE\_ID\_1/\textbf{1} \\
ACTT..\\
+\\
QUALITY\_1/1\\
@SEQUENCE\_ID\_1/\textbf{2} \\
GTAC..\\
+\\
QUALITY\_1/2\\
..
}
\end{minipage}
};
\node[fancytitle] at (box.north) {{\small FASTQ paired-end reads}};
\draw [decorate,color=black!80,decoration={brace,mirror,amplitude=10pt,raise=2pt}] (0.5,-1.5) -- node[right=10pt]{$~pair~\#~1$}(0.5,1.8);
\end{tikzpicture}
}%resizebox
\caption{Paired-end read format accepted by SortMeRNA}
\label{fig:format1}
\end{figure}
\noindent SortMeRNA accepts only 1 file as input for the reads. If a user has two input files, in the case for the
foward and reverse paired-end reads (see Figure~\ref{fig:format2}), they may use the \texttt{merge-paired-reads.sh} script found in
\texttt{`sortmerna/scripts'} folder to interleave the paired reads into the format of Figure~\ref{fig:format1}.\\
\noindent The command for \texttt{merge-paired-reads.sh} is the following,
\begin{verbatim}
> bash ./merge-paired-reads.sh forward-reads.fastq reverse-reads.fastq outfile.fastq
\end{verbatim}
\noindent Now, the user may input \texttt{outfile.fastq} to SortMeRNA for analysis.
\noindent Similarly, for unmerging the paired reads back into two separate files, use the command,
{\small
\begin{verbatim}
> bash ./unmerge-paired-reads.sh merged-reads.fastq forward-reads.fastq reverse-reads.fastq
\end{verbatim}}
{\bf Important:} unmerge-paired-reads.sh should only be used if one of the options {\tt--paired\_in} or {\tt--paired\_out}
was used during filtering. Otherwise it may give incorrect results if a paired-read was split during alignment (one
read aligned and the other one not).
\newpage
\subsection{Read mapping}
\subsubsection{Mapping reads for classification}
Although SortMeRNA is very sensitive with the small rRNA databases distributed with the source code,
these databases are not optimal for classification since often alignments with 75-90\% identity
will be returned (there are only several thousand rRNA in most of the databases, compared to the original
SILVA or Greengenes databases containing millions of rRNA). Classification at the species level generally
considers alignments at 97\% and above, so it is suggested to use a larger database is species classification
is the main goal.
Moreover, SortMeRNA is a local alignment tool, so it's also important to look at the query coverage \% for
each alignment. In the SAM output format, neither \% id or query coverage are reported. If the user wishes
for these values, then the Blast tabular format with CIGAR + query coverage option {(\tt--blast '1 cigar qcov')} is the way to go.
\subsubsection{Example 5: mapping reads against the 16S Greengenes 97\% id database with multithreading}
This example will generate SAM and BLAST tabular output files. Alignments are classified as significant
based on the E-value cutoff (default 1). SortMeRNA's E-value takes into consideration the full size of the
reference database as well as the query file, thus the E-value is higher than BLAST's (ex. equivalent to BLAST's 1e-5).
\begin{Verbatim}[fontsize=\footnotesize]
>> sortmerna --ref 97_otus_gg_13_8.fasta,./index/97_otus_gg_13_8\
--reads SRR106861.fasta --blast '1 cigar qcov' --sam --log --aligned SRR106861_gg_rRNA -a 20 -v
Program: SortMeRNA version 2.1, 01/02/2016
Copyright: 2012-16 Bonsai Bioinformatics Research Group:
LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe
2014-16 Knight Lab, Department of Pediatrics, UCSD, La Jolla,
Disclaimer: SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the
implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU Lesser General Public License for more details.
Contact: Evguenia Kopylova, [email protected]
Laurent Noe, [email protected]
Helene Touzet, [email protected]
Computing read file statistics ... done [0.44 sec]
size of reads file: 35238748 bytes
partial section(s) to be executed: 1 of size 35238748 bytes
Parameters summary:
Number of seeds = 2
Edges = 4 (as integer)
SW match = 2
SW mismatch = -3
SW gap open penalty = 5
SW gap extend penalty = 2
SW ambiguous nucleotide = -3
SQ tags are not output
Number of threads = 20
Begin mmap reads section # 1:
Time to mmap reads and set up pointers [0.10 sec]
Begin analysis of: 97_otus_gg_13_8.fasta
Seed length = 18
Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
Gumbel lambda = 0.600470
Gumbel K = 0.327880
Minimal SW score based on E-value = 57
Loading index part 1/1 ... done [10.76 sec]
Begin index search ... done [23.75 sec]
Freeing index ... done [1.44 sec]
Total number of reads mapped (incl. all reads file sections searched): 29089
Writing alignments ... done [7.71 sec]
\end{Verbatim}
This is almost the same number of 16S rRNA as identified by SortMeRNA using the smaller provided database,
\begin{Verbatim}[fontsize=\footnotesize]
>> cat SRR106861_gg_rRNA.log
Date and time
Command: sortmerna --ref 97_otus_gg_13_8.fasta,./index/97_otus_gg_13_8\
--reads SRR106861.fasta --blast '1 cigar qcov' --sam --log --aligned SRR106861_gg_rRNA -a 20 -v
Process pid = 44246
Parameters summary:
Index: ./index/97_otus_gg_13_8
Seed length = 18
Pass 1 = 18, Pass 2 = 9, Pass 3 = 3
Gumbel lambda = 0.600470
Gumbel K = 0.327880
Minimal SW score based on E-value = 57
Number of seeds = 2
Edges = 4 (as integer)
SW match = 2
SW mismatch = -3
SW gap open penalty = 5
SW gap extend penalty = 2
SW ambiguous nucleotide = -3
SQ tags are not output
Number of threads = 20
Reads file = SRR106861.fasta
Results:
Total reads = 113128
Total reads passing E-value threshold = 29089 (25.71%)
Total reads failing E-value threshold = 84039 (74.29%)
Minimum read length = 59
Maximum read length = 1253
Mean read length = 267
By database:
97_otus_gg_13_8.fasta 25.71%
\end{Verbatim}
\newpage
\subsection{OTU-picking}
SortMeRNA is implemented in QIIME's closed-reference and open-reference OTU-picking workflows.
The readers are referred to QIIME's tutorials for an in-depth discussion of these methods
\url{http://qiime.org/tutorials/otu_picking.html}.
\section{SortMeRNA advanced options}
\subsection*{{\tt --num\_seeds INT}}
The threshold number of seeds required to match in the primary seed-search filter before
moving on to the secondary seed-cluster filter. More specifically, the threshold number of
seeds required before searching for a longest increasing subsequence (LIS) of the seeds' positions
between the read and the closest matching reference sequence. By default, this is set to 2 seeds.
\subsection*{{\tt --passes INT,INT,INT}}
In the primary seed-search filter, SortMeRNA moves a seed of length $L$ (parameter of {\tt indexdb\_rna})
across the read using three passes. If at the end of each pass a threshold number of seeds (defined by {\tt --num\_seeds})
did not match to the reference database, SortMeRNA attempts to find more seeds by decreasing the interval at which the
seed is placed along the read by using another pass. In default mode, these intervals are set to
$L,L/2,3$ for Pass 1, 2 and 3, respectively. Usually, if the read is highly similar to the reference
database, a threshold number of seeds will be found in the first pass.
\subsection*{{\tt --edges INT(\%)}}
The number (or percentage if followed by \%) of nucleotides to add to each edge of the alignment region
on the reference sequence before performing Smith-Waterman alignment. By default, this is set to 4 nucleotides.
\subsection*{{\tt --full\_search FLAG}}
During the index traversal, if a seed match is found with 0-errors, SortMeRNA will stop searching for further
1-error matches. This heuristic is based upon the assumption that 0-error matches are more significant than
1-error matches. By turning it off using the {\tt--full\_search} flag, the sensitivity may increase (often
by less than 1\%) but with up to four-fold decrease in speed.
\subsection*{{\tt --pid FLAG}}
The pid of the running {\tt sortmerna} process will be added to the output files in order to avoid over-writing output if the same
{\tt --aligned STRING} base name is provided for different runs.
\section{Help}
Any issues or bug reports should be reported to \url{https://github.com/biocore/sortmerna/issues} or by e-mail
to the authors (see list of e-mails in Section 1 of this document). Comments and suggestions are also always appreciated!
\section{Citation}
If you use SortMeRNA please cite,
Kopylova E., No\'{e} L. and Touzet H., ``SortMeRNA: Fast and accurate filtering of ribosomal RNAs in metatranscriptomic data", {\it Bioinformatics} (2012), doi: 10.1093/bioinformatics/bts611.
\end{document}