---
title: "Nemabiome.ca - DADA2 Workflow"
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, cache = TRUE)
```
# DADA2 Workflow {.tabset .tabset-fade}
## Setup
The [DADA2 pipeline](https://www.nature.com/articles/nmeth.3869) is a method for correcting the errors introduced into sequencing data during amplicon sequencing. It is implemented as an open-source R package that takes you through the entire pipeline, including steps to filter, dereplicate, identify chimeras, and merge paired-end reads. We will use one external application, cutadapt, to remove primer sequences. The output of this pipeline is a table of amplicon sequence variants (ASVs), as opposed to the traditional OTUs produced by other amplicon sequencing workflows, which cluster similar sequences into a single OTU. DADA2 can resolve biological differences of as little as 1 or 2 nucleotides, so the ASVs it produces are of higher resolution than OTUs. For a more general and detailed tutorial, please see the dada2 [website](https://benjjneb.github.io/dada2).
### Requirements
* MacOS or Linux. Windows is possible if you use Windows Subsystem for Linux or a virtual machine such as VirtualBox.
* A modern desktop or laptop. The memory and CPU requirements for running dada2 are fairly modest; a low-powered computer will still work but will likely take much longer.
### Install R
To use DADA2, you must download and install R for your operating system, which can be done [here](https://cloud.r-project.org/). You should use the latest version of R. It is also highly recommended (but not required) to download and install RStudio, an integrated development environment (IDE) for R programming. This can be done through [this link](https://www.rstudio.com/download).
### Install dada2
Since dada2 is distributed through the Bioconductor project's package repository, first install the BiocManager package.
```
install.packages("BiocManager")
```
To install the DADA2 package (version 1.12.1 at the time of writing), type the following. Note that the `version` argument refers to the Bioconductor release (3.9), not the dada2 version:
```
BiocManager::install("dada2", version = "3.9")
```
*If you have previously installed R and Bioconductor, you may need to update them to the most recent versions. This tutorial relies on features available only in dada2 version 1.12.1 or greater.*
### Install cutadapt
To use the DADA2 pipeline, primers must be removed from the amplicon sequence data. In this tutorial we will call the external `cutadapt` tool from R to do this. Cutadapt can be installed as a conda package. If you do not already have conda installed, you can do so [here](https://docs.conda.io/en/latest/miniconda.html). Once you have conda installed, go to Terminal or Command Prompt and create a dedicated environment for cutadapt (this matches the path used later in this tutorial):
```
conda create -n cutadapt -c bioconda cutadapt
```
### Download files
This tutorial is run with the files listed on the [Analysis](analysis.html) page.
Please download those files and, if you want to follow along exactly, place them in a folder called `Miseq_Run` in your home directory.
Here we use only two samples to keep the run time short, but feel free to run it with all the example files if desired.
You'll also need the ITS2 database files formatted for IDTAXA.
The ITS2 database version used in this guide can be found at https://doi.org/10.5281/zenodo.3235802.
Please download the [Nematode_ITS2_1.0.0_idtaxa.fasta](https://zenodo.org/record/3235803/files/Nematode_ITS2_1.0.0_idtaxa.fasta?download=1)
sequence file and the [Nematode_ITS2_1.0.0_idtaxa.tax](https://zenodo.org/record/3235803/files/Nematode_ITS2_1.0.0_idtaxa.tax?download=1) taxonomy file.
For your own analysis, the latest version of our nematode [ITS2 database](its2_database.html) can be found under "Analysis" in the top navigation bar.
### Preparing
Once you have installed the above, you are ready to start using the DADA2 pipeline. The following is a sample analysis using the DADA2 pipeline (https://benjjneb.github.io/dada2/tutorial.html). This is meant to be an example and not a definitive workflow. First, we load packages and set a seed for reproducibility.
```{r libraries, echo=T, message = FALSE}
library(DECIPHER)
packageVersion("DECIPHER")
library(dada2)
packageVersion("dada2")
library(ShortRead)
packageVersion("ShortRead")
library(Biostrings)
packageVersion("Biostrings")
library(ggplot2)
packageVersion("ggplot2")
library(stringr) # not strictly required but handy
packageVersion("stringr")
library(readr)
packageVersion("readr")
set.seed(106)
```
To get started we'll need to do a little prep. We'll start by defining the path
of sequences to be processed (using "~/Miseq_Run" from previous examples).
The DADA2 pipeline is intended to be used with demultiplexed, paired-end fastq files.
That is to say, the samples must be separated into individual files and ordered
such that forward and reverse reads of the same sample are matched.
The pattern of the filenames should be the same for all forward and reverse files.
In this example, the names are formatted as such: `samplename_R1_001.fastq.gz`
for forward reads and `samplename_R2_001.fastq.gz` for reverse reads.
We then create a vector of file names, one for the forward reads and one for the reverse.
The `samples` vector contains our sample names, extracted from the file names using a regular expression.
For Illumina data, often the sample name is everything before the first underscore.
The primer sequences are defined here as well, because we'll be clipping those off shortly.
The forward primer will be at the beginning of the forward read and the reverse primer
will be at the beginning of the reverse read. The reverse complements are included
in case the amplicon sequence is shorter than our read length. When this happens the
sequence will contain the reverse complement of the opposite primer; for example, there may be a
reverse-complemented reverse primer at the end of the forward read.
```{r primer_initialization}
path <- "~/Miseq_Run"
fwd_files <- sort(list.files(path, pattern = "R1", full.names = TRUE))
rev_files <- sort(list.files(path, pattern = "R2", full.names = TRUE))
# It's also handy to have a vector of sample names, which in this case is everything up
# to the first underscore, which is what our regular expression captures. You may
# also create this manually if you don't have too many samples
samples <- str_extract(basename(fwd_files), "^[^_]+")
names(fwd_files) <- samples
names(rev_files) <- samples
fwd_primer <- "ACGTCTGGTTCAGGGTTGTT"
rev_primer <- "TTAGTTTCTTTTCCTCCGCT"
fwd_primer_rev <- as.character(reverseComplement(DNAStringSet(fwd_primer)))
rev_primer_rev <- as.character(reverseComplement(DNAStringSet(rev_primer)))
```
Before we move on, let's do a quick sanity check to make sure our primer sequences are detected. The code below counts the number of reads containing a primer hit in a fastq file. Note that this is a quick-and-dirty method to confirm we have the right primer sequences; for proper primer removal we'll need something more sophisticated, like `cutadapt`, which we'll demonstrate next.
```{r}
# This function counts number of reads in which the primer is found
count_primers <- function(primer, filename) {
num_hits <- vcountPattern(primer, sread(readFastq(filename)), fixed = FALSE)
return(sum(num_hits > 0))
}
count_primers(fwd_primer, fwd_files[[1]])
count_primers(rev_primer, rev_files[[1]])
```
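As an extra optional check, the same helper can look for the reverse-complemented primers, which should only appear when the amplicon is shorter than the read and the sequencer reads through into the opposite primer. A quick sketch using the objects defined above:
```{r, eval=FALSE}
# Optional read-through check: hits here come from reverse-complemented primers
# near the 3' ends of reads, which cutadapt's -a/-A options will remove below
count_primers(rev_primer_rev, fwd_files[[1]])
count_primers(fwd_primer_rev, rev_files[[1]])
```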
## Primer removal
Now we can use cutadapt to remove the primers, and then check that they have been successfully removed. Cutadapt was designed to remove sequencing adapters that sometimes get left on reads; more generally, its main function is to detect and trim off regions that match a known sequence, which is precisely what we want to do here.
Cutadapt is best installed using conda (as described in the Setup tab) and can also be used directly from the command line. Here we demonstrate how to run it from R using the `system2` command, but the results are the same either way. The first argument to `system2` is the program we wish to run and the second is a character vector of arguments and their values.
```{r}
# CHANGE ME to the cutadapt path on your machine. If you've installed conda according to the
# provided instructions then this will likely be the same for you.
cutadapt <- path.expand("~/miniconda3/envs/cutadapt/bin/cutadapt")
# Make sure it works
system2(cutadapt, args = "--version")
```
Next we define the output filenames as well as the parameters for cutadapt.
The critical parameters are the primer sequences and their orientation.
It's strongly recommended that you review the [cutadapt documentation](https://cutadapt.readthedocs.io/en/stable/guide.html)
for full details of each parameter.
Briefly:

* `-g`: sequence to trim off the 5' end of the forward read (the forward primer)
* `-a`: sequence to trim off the 3' end of the forward read (the reverse-complemented reverse primer)
* `-G`: sequence to trim off the 5' end of the reverse read (the reverse primer)
* `-A`: sequence to trim off the 3' end of the reverse read (the reverse-complemented forward primer)

We'll also add `-m 50` to get rid of very short, junky reads, `--max-n 1` to discard reads containing more than one N,
and `-n 2` so that cutadapt will remove multiple primer hits if there happens to be read-through.
`--discard-untrimmed` is also added so that only reads containing a primer are kept, ensuring we retain only valid amplicons.
Finally, `-q 15` trims low-quality bases from the 3' end.
<div class="alert alert-warning" role="alert">
<p>In our experience this small amount of quality trimming can remove the need to truncate the reads later on, but this may vary from run to run and is an important parameter to consider for your own data.</p>
</div>
```{r}
# Create an output directory to store the clipped files
cut_dir <- file.path(path, "cutadapt")
if (!dir.exists(cut_dir)) dir.create(cut_dir)
fwd_cut <- file.path(cut_dir, basename(fwd_files))
rev_cut <- file.path(cut_dir, basename(rev_files))
names(fwd_cut) <- samples
names(rev_cut) <- samples
# It's good practice to keep some log files so let's create some
# file names that we can use for those
cut_logs <- path.expand(file.path(cut_dir, paste0(samples, ".log")))
cutadapt_args <- c("-g", fwd_primer, "-a", rev_primer_rev,
                   "-G", rev_primer, "-A", fwd_primer_rev,
                   "-m", 50, "--max-n", 1, "-q", 15,
                   "-n", 2, "--discard-untrimmed")
# Loop over the list of files, running cutadapt on each file. If you don't have a vector of sample names or
# don't want to keep the log files you can set stdout = "" to output to the console or stdout = NULL to discard
for (i in seq_along(fwd_files)) {
system2(cutadapt,
args = c(cutadapt_args,
"-o", fwd_cut[i], "-p", rev_cut[i],
fwd_files[i], rev_files[i]),
stdout = cut_logs[i])
}
# quick check that we got something
head(list.files(cut_dir))
```
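To confirm the primers really are gone, we can reuse the `count_primers` helper from the sanity check above on the trimmed files; the counts should now be zero, or very close to it.
```{r}
# The trimmed files should contain (essentially) no primer hits
count_primers(fwd_primer, fwd_cut[[1]])
count_primers(rev_primer, rev_cut[[1]])
```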
## Quality filtering
### Inspect quality scores
Now that primers are removed, we can begin the DADA2 pipeline in full. We start by plotting quality profiles of our sequences to determine suitable quality-filtering parameters. For the sake of speed we only show 2 samples here, but feel free to look at all your samples (or more than 2, anyway).
```{r quality_profiles}
plotQualityProfile(fwd_cut[1:2]) + ggtitle("Forward")
```
Everything looks pretty good here. Most of our reads should now be 230 bp, since we trimmed 20 bp primers off the 250 bp reads. It looks like a very, very few did not get trimmed properly, but from the red line across the bottom, which shows the proportion of reads extending to each position, we can see that this is such a small percentage that we don't need to worry about it.
<div class="alert alert-success" role="alert">
<h4 class="alert-heading">Side note</h4>
<p>If you're paying attention you might be asking: if the primers always appear at the same position in the read, why can't we just clip the reads at that point? In fact, for some datasets, this one included, that would be a valid strategy, given that all the ITS2 sequences are longer than the read length (250 bp). However, this is not always the case and, as described above, if the amplicon length is less than the read length we'll have to find and remove the opposite primer. So in general it's good practice to use something like cutadapt, which will ensure the correct sequences are output.</p>
</div>
```{r}
plotQualityProfile(rev_cut[1:2]) + ggtitle("Reverse")
```
Reverse reads are lower quality, as is typical for Illumina data. However, the average quality is above 20 for most of the read length, so we'll go ahead with the analysis.
### Filter reads
For the purposes of our analysis, we chose to filter with a maximum of 2 expected errors for the forward reads and 5 for the reverse, and to truncate each read at the first base with a quality score of 2 or lower.
<div class="alert alert-warning" role="alert">
<p>Make your own decisions about trimming parameters! Your data will be different, possibly very different, from ours, so our parameter choices are likely not at all appropriate for your data. Don't just copy and paste this code - think carefully about your data and choose parameters appropriate for it.</p>
</div>
```{r}
# As with the clipping, we create an output directory to store the filtered files
filt_dir <- file.path(path, "filtered")
if (!dir.exists(filt_dir)) dir.create(filt_dir)
fwd_filt <- file.path(filt_dir, basename(fwd_files))
rev_filt <- file.path(filt_dir, basename(rev_files))
names(fwd_filt) <- samples
names(rev_filt) <- samples
filtered_out <- filterAndTrim(
fwd = fwd_cut,
filt = fwd_filt,
rev = rev_cut,
filt.rev = rev_filt,
maxEE = c(2, 5),
truncQ = 2,
rm.phix = TRUE,
compress = TRUE,
multithread = TRUE
)
head(filtered_out)
```
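If you want to see the effect of the filtering, the quality profiles can be plotted again on the filtered files, exactly as we did before filtering:
```{r, eval=FALSE}
# Optional: re-inspect quality after filtering
plotQualityProfile(fwd_filt[1:2]) + ggtitle("Forward (filtered)")
plotQualityProfile(rev_filt[1:2]) + ggtitle("Reverse (filtered)")
```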
## Main workflow
### Learn errors
Here dada2 learns the error profile of your data using an iterative approach. These error profiles are used in the next step to correct errors.
```{r}
err_fwd <- learnErrors(fwd_filt, multithread = TRUE)
err_rev <- learnErrors(rev_filt, multithread = TRUE)
plotErrors(err_fwd, nominalQ = TRUE)
```
The plot shows how well the estimated error rates (black lines) fit the observed rates (black points); they should line up reasonably well. This looks reasonable, so we continue on.
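The same diagnostic can be applied to the reverse reads:
```{r, eval=FALSE}
plotErrors(err_rev, nominalQ = TRUE)
```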
### Denoising (aka error-correcting)
We can now denoise our sequences using the DADA algorithm. Note that older versions of dada2 required a dereplication step that is now done automatically by the `dada` function.
```{r denoise}
dada_fwd <- dada(fwd_filt, err = err_fwd, multithread = TRUE)
dada_rev <- dada(rev_filt, err = err_rev, multithread = TRUE)
```
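Printing one of the returned objects gives a quick summary of how many sequence variants were inferred from the unique input sequences:
```{r, eval=FALSE}
# Inspect the denoising result for the first sample
dada_fwd[[1]]
```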
### Merge pairs
Now that we have accurate sequences, we can merge the paired-end reads to reconstruct the full amplicon sequence. Most sequences should successfully overlap, _unless_ your trimming parameters were too aggressive (i.e. too much sequence was trimmed off the end) _or_ you have amplicons longer than the combined read length (>500 bp, in this case). In the latter case the reads are still usable but will need special handling (more details to come on those cases).
```{r merge}
mergers <- mergePairs(
dadaF = dada_fwd,
dadaR = dada_rev,
derepF = fwd_filt,
derepR = rev_filt,
maxMismatch = 1,
verbose=TRUE
)
```
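Each element of `mergers` is a data.frame of merged sequences for one sample, including their abundances and overlap statistics (`nmatch`, `nmismatch`); it's worth a quick look to confirm that merging behaved sensibly:
```{r, eval=FALSE}
head(mergers[[1]])
```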
### Sequence table
Before we can assign taxonomy, we construct a sequence table, which records the number of times each sequence is observed in each sample. The dimensions of this table give the number of samples (rows) and the total number of unique sequences (columns).
```{r seq_table}
seqtab <- makeSequenceTable(mergers)
dim(seqtab)
```
### Remove chimeras
Don't be alarmed here if many of your sequences are discarded - the number of chimeras varies but can be quite high in some cases.
```{r}
seqtab_nochim <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE, verbose = TRUE)
dim(seqtab_nochim)
```
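Counting unique sequences can be misleading here; a more informative summary is the fraction of total reads that survive chimera removal, since the flagged sequences often account for only a small share of reads:
```{r}
# Fraction of reads (not unique sequences) retained after chimera removal
sum(seqtab_nochim) / sum(seqtab)
```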
## How did we do?
### Sequence length distribution
```{r}
table(nchar(getSequences(seqtab_nochim)))
```
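If this table shows merged sequences far outside the length range you expect for your amplicon, they are often non-specific amplification products and can optionally be removed. A hedged sketch - the 200-450 bp window below is purely illustrative, so choose a range appropriate for your own primers:
```{r, eval=FALSE}
# Optional: keep only sequences within an expected length window
# (the 200:450 range is an illustrative assumption, not a recommendation)
seqtab_nochim <- seqtab_nochim[, nchar(colnames(seqtab_nochim)) %in% 200:450, drop = FALSE]
```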
### Number of reads at each step
This is an important check, particularly if you're trying to track down where something went wrong.
```{r}
# small function to get the number of sequences
getN <- function(x) sum(getUniques(x))
track <- cbind(
filtered_out,
sapply(dada_fwd, getN),
sapply(dada_rev, getN),
sapply(mergers, getN),
rowSums(seqtab_nochim)
)
colnames(track) <- c("raw", "filtered", "denoised_fwd", "denoised_rev", "merged", "no_chim")
rownames(track) <- samples
head(track)
```
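It can be handy to keep this table with your results; one option is simply to write it out:
```{r, eval=FALSE}
# Optional: save the tracking table alongside the sequencing data
write.csv(track, file.path(path, "read_tracking.csv"))
```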
## Assigning taxonomy with IDTAXA
Now that we have a sequence table with the chimeras removed, we can classify/assign taxonomy to the sequence variants. Although there are multiple taxonomic classification methods, we will be using IdTaxa, which is available through the `DECIPHER` package. You can read more about IdTaxa [here](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-018-0521-5).
### Training the classifier
The IDTAXA classification method relies on a training set containing sequences representative of known taxa.
Although pre-trained classifiers exist for some markers, we will be using the nematode ITS2 database as our classifier.
Since it is not pre-trained, we must train the classifier using `LearnTaxa`,
which is also part of the `DECIPHER` package. The `train` parameter of `LearnTaxa`
takes the classifier's sequences as a DNAStringSet. Each sequence is named
such that the taxonomic levels are separated by semicolons.
Once the sequence headers of the classifier are parsed, the `taxonomy` parameter holds the semicolon-separated
taxonomic assignment for each sequence in `train`, and the rank information comes from the `.tax` file.
``` {r}
train <- readDNAStringSet("~/Nematode_ITS2_1.0.0_idtaxa.fasta")
tax <- read_tsv("~/Nematode_ITS2_1.0.0_idtaxa.tax")
# Train the classifier: names(train) supplies the taxonomy and `tax` the rank information
trainingSet <- LearnTaxa(train, names(train), tax)
```
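DECIPHER also provides a plot method for the training set, which summarizes the taxonomy and can highlight problem sequences or groups - a quick way to sanity-check the training step:
```{r, eval=FALSE}
plot(trainingSet)
```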
### Running IdTaxa
To use IdTaxa, we first convert the unique sequence variants that we want to classify into a DNAStringSet object. Now that we have our trained classifier (i.e. training set) and the sequences we want to classify, we can run `IdTaxa`. You can change various parameters of `IdTaxa` depending on your data; we recommend reviewing the [IdTaxa documentation](https://rdrr.io/bioc/DECIPHER/man/IdTaxa.html) for further information. Here, we have specified several parameters:
* `strand = "both"` : by setting the `strand` parameter to "both", each sequence variant is classified using both the forward and reverse complement orientation. The sequence varient will be classified as the result with the highest confidence.
* `threshold = 60` : setting the `threshold` to 60 specifies when the taxonomic classifications are truncated. A lower threshold usually results in higher taxonomic level classifications, but lower accuracy (i.e. confidence). A higher threshold usually results in lower taxonomic level classifications, but with higher accuracy.
* `bootstraps = 100` : this specifies the amount of times bootstrap replicates are performed for each sequence.
* `processors = NULL` : automatically uses all available processors.
* `verbose = TRUE` : displays progress.
* `type = "extended"` : by setting the `type` to "extended", the output for each sequence variant will contain the taxonomic classification, the names of each taxonomic rank, and the confidence of the assignment.
``` {r}
dna <- DNAStringSet(getSequences(seqtab_nochim))
idtaxa <- IdTaxa(dna,
trainingSet,
strand = "both",
threshold = 60,
bootstraps = 100,
processors = NULL,
verbose = TRUE,
type = "extended")
```
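The returned object also has a plot method that summarizes the classifications against the training set, which is helpful for judging whether the `threshold` was too strict or too lenient:
```{r, eval=FALSE}
plot(idtaxa, trainingSet)
```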
Although it contains all the necessary information, the output of `IdTaxa` is not as intuitive as we would like. It is a list with one element per sequence variant, each holding the taxonomic classification and the confidence for each taxon. To make the data easier to manipulate and understand, we convert this list into a matrix, with the rows as the sequence variants and the columns as their taxonomic classifications. We also drop the "Root" column.
```{r}
# Pad taxon vectors truncated at different ranks with NA, then drop "Root"
n_ranks <- max(sapply(idtaxa, function(x) length(x$taxon)))
taxid <- t(sapply(idtaxa, function(x)
  c(setNames(x$taxon, x$rank), rep(NA, n_ranks - length(x$taxon)))))[, -1]
```
### Over to phyloseq
The individual pieces of the analysis can be saved to text or Excel files, but if you've made it this far then why not continue on in R! [Phyloseq](https://joey711.github.io/phyloseq/) is a powerful toolkit for working with these types of data in R and is the recommended next step for analysis. More tutorials on downstream analysis to come!
```{r}
library(phyloseq)
# This is just a placeholder - normally this would be a dataframe containing all your sample information,
# with rownames matching the sample names
samp_data <- data.frame(
row.names = samples,
sample = samples
)
# We need better names for our sequences than the actual sequence, which is the dada2 default
asvs <- paste0("ASV_", seq_along(dna))
rownames(taxid) <- asvs
colnames(seqtab_nochim) <- asvs
names(dna) <- asvs
# Now we put everything into one phyloseq object (even the sequences) so it is easy to use
physeq <- phyloseq(
otu_table(seqtab_nochim, taxa_are_rows = FALSE),
tax_table(taxid),
sample_data(samp_data),
dna
)
physeq
```
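Finally, it's worth saving the assembled object so downstream analysis can pick up here without re-running the whole pipeline (the file name is just an example):
```{r, eval=FALSE}
saveRDS(physeq, file.path(path, "physeq.rds"))
```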