---
title: "Intuition for appropriate filtering parameters"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
This workflow discusses the steps to follow for choosing appropriate filtering parameters for analyzing amplicon sequence reads using **DADA2**.
We discuss how to visualize paired-end amplicon sequences and then choose trimming parameters that retain enough overlap between the forward and reverse reads after truncation. The truncated reads should be long enough to maintain 20 + biological.length.variation nucleotides of overlap between them, as discussed in the [tutorial](https://benjjneb.github.io/dada2/tutorial.html).
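To build intuition for how much overlap a given choice of truncation lengths leaves, a small back-of-the-envelope calculation can help. The sketch below is not part of `dada2`; the helper name `expected_overlap` and the example lengths are purely illustrative, and the amplicon length must be supplied for your own primer pair and region.
```{r overlap sketch, eval=FALSE}
# Hedged helper (not part of dada2): estimate the overlap left between a
# forward/reverse read pair after truncation and primer trimming.
expected_overlap <- function(truncLenF, truncLenR, trimLeftF = 0, trimLeftR = 0,
                             amplicon.len) {
  retainedF <- truncLenF - trimLeftF                  # biological bases kept on the forward read
  retainedR <- truncLenR - trimLeftR                  # biological bases kept on the reverse read
  insert.len <- amplicon.len - trimLeftF - trimLeftR  # primer-free amplicon length
  retainedF + retainedR - insert.len
}
# Purely illustrative numbers: no primers on the reads and an assumed 400 bp amplicon.
expected_overlap(240, 200, amplicon.len = 400)  # aim for >= 20 + biological.length.variation
```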
## Getting ready
First we load the `dada2` package. If you don't already have it, see the [dada2 installation instructions](https://benjjneb.github.io/dada2/dada-installation.html).
```{r dada2 Library, warning=FALSE, message=FALSE, tidy=TRUE, results='hold',comment= ""}
library(dada2)
packageVersion("dada2")
```
The dataset used here covers the 16S V3-V4 region, with the forward and reverse primers still present on the sequences. It was obtained from the ENA under study ID PRJEB6592 ([link](https://www.ebi.ac.uk/ena/data/view/PRJEB6592)), from the paper [Evaluating Bias of Illumina-based Bacterial 16S rRNA Gene Profiles](http://aem.asm.org/content/80/18/5717.long). The library was sequenced 2 x 250 bp on an Illumina MiSeq. The primers used were the forward primer `CCTACGGGAGGCAGCAG` (17 bp) and the reverse primer `GGACTACHVGGGTWTCTAAT` (20 bp).
```{r path, warning=FALSE, message=FALSE, tidy=TRUE, results='hold',comment= ""}
path <- "~/Desktop/Example_files"
list.files(path)
```
Identifying the lengths of the primers used in this library.
```{r primers, warning=FALSE, message=FALSE, tidy=TRUE, results='hold',comment= ""}
Forward.primer <- "CCTACGGGAGGCAGCAG"
nchar(Forward.primer)
Reverse.primer <- "GGACTACHVGGGTWTCTAAT"
nchar(Reverse.primer)
```
Now we read in the names of the fastq files, and perform string manipulation to get matched lists of the forward and reverse fastq files.
```{r reads, warning=FALSE, message=FALSE, tidy=TRUE, results='hold',comment= ""}
# Forward and reverse fastq filenames have format: SAMPLENAME_1.fastq.gz and SAMPLENAME_2.fastq.gz
fnFs <- sort(list.files(path, pattern="_1.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_2.fastq.gz", full.names = TRUE))
# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
sample.names
```
## Inspect read quality profiles
Here we plot the quality profile of the forward and reverse reads.
```{r quality profile, warning=FALSE, message=FALSE, tidy=TRUE, results='hold',comment= ""}
plotQualityProfile(c(fnFs[1], fnRs[1]))
```
The forward and reverse reads are of good quality here. We generally advise trimming the last few nucleotides to avoid the less well-controlled errors that can arise there. These quality profiles do not suggest that any additional trimming is needed.
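When a run contains many samples, it can be more convenient to summarize all files in a single plot. The optional sketch below uses the `aggregate` argument of `plotQualityProfile` for this.
```{r aggregate quality sketch, eval=FALSE}
# Optional: aggregate the quality profiles of all forward (or reverse) files into
# one summary panel instead of plotting each file separately.
plotQualityProfile(fnFs, aggregate = TRUE)
plotQualityProfile(fnRs, aggregate = TRUE)
```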
For the purpose of this workflow, we explore three scenarios with different parameters at the `filterAndTrim` step:
+ **Ensure primers are removed, maintaining 20 + biological.length.variation nucleotides of overlap between forward and reverse reads.**
+ **Maintaining 20 + biological.length.variation nucleotides of overlap between forward and reverse reads.**
+ **Following stringent trimming of poor quality bases.**
## Scenario 1
**Ensure primers are removed, maintaining 20 + biological.length.variation nucleotides of overlap between forward and reverse reads.**
We use the quality profile as a guide and truncate the forward and reverse reads at position 240 (trimming the last 10 nucleotides) by choosing `truncLen=c(240,240)` to avoid less well-controlled errors, and we also trim the primers used in this sequencing run from the reads by choosing `trimLeft=c(17,20)`. In theory, if you understand your amplicon sequencing setup, this is sufficient to continue. However, to confirm that you are trimming the primers in the correct orientation, or if you are unsure of the primer orientation, follow the steps outlined in the [ITS tutorial](https://benjjneb.github.io/dada2/ITS_workflow.html) to identify and trim primers.
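As a quick sanity check, you can count how often each primer (and its reverse complement) occurs in the reads, following the approach of the ITS workflow. The sketch below assumes the `ShortRead` and `Biostrings` packages are installed; `primerHits` is a helper defined here, not a `dada2` function.
```{r primer check sketch, eval=FALSE}
library(ShortRead)
# Count the reads in a fastq file that contain a given primer sequence;
# fixed=FALSE lets IUPAC ambiguity codes in the primer match.
primerHits <- function(primer, fn) {
  nhits <- Biostrings::vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
  sum(nhits > 0)
}
primerHits(Forward.primer, fnFs[[1]])              # expect hits on most forward reads
primerHits(dada2:::rc(Forward.primer), fnFs[[1]])  # reverse complement: expect few hits
primerHits(Reverse.primer, fnRs[[1]])              # expect hits on most reverse reads
```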
After choosing the trimming parameters, we apply the `dada2` workflow steps and track the number of retained reads. The steps discussed here are presented in abbreviated format. A detailed explanation of the `dada2` workflow can be found [here](https://benjjneb.github.io/dada2/tutorial.html).
## Filter and trim, learn the error rates, dereplicate, apply sample inference, and merge reads
Assigning the filenames for the filtered fastq.gz files.
```{r filt path, warning=FALSE, message=FALSE, tidy=TRUE, results='hold',comment= ""}
filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))
```
We will use standard filtering parameters: `maxN=0` (DADA2 requires no Ns), `truncQ=2`, `rm.phix=TRUE`, and `maxEE=2`, along with `truncLen=c(240,240)` and `trimLeft=c(17,20)` to maintain sufficient overlap and to remove the primers from the forward and reverse reads, respectively.
```{r filter and trim, warning=FALSE, message=FALSE, tidy=TRUE, results='hide',comment= ""}
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(240, 240), trimLeft = c(17,20), maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE) # On Windows set multithread=FALSE
errF <- learnErrors(filtFs, multithread=TRUE)
errR <- learnErrors(filtRs, multithread=TRUE)
derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)
dadaFs <- dada(derepFs, err=errF, multithread=TRUE)
dadaRs <- dada(derepRs, err=errR, multithread=TRUE)
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
```
## Construct sequence table and remove chimeras
```{r seqtab, warning=FALSE, message=FALSE, tidy=TRUE, results='hide',comment= ""}
seqtab <- makeSequenceTable(mergers)
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
```
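As an optional check (as in the standard tutorial), you can inspect the distribution of merged sequence lengths to confirm they fall in the range expected for the targeted region.
```{r seq length sketch, eval=FALSE}
# Distribution of merged sequence lengths in the sequence table.
table(nchar(getSequences(seqtab)))
```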
## Track reads through the pipeline
```{r track reads, warning=FALSE, message=FALSE, tidy=TRUE, comment= ""}
getN <- function(x) sum(getUniques(x))
track <- cbind(out, getN(dadaFs), getN(dadaRs), getN(mergers), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)
```
The table above shows that we kept the majority of our raw reads, and there is no large loss of reads at any single step. The chosen trimming parameters work well for this dataset.
## Scenario 2
**Maintain 20 + biological.length.variation nucleotides of overlap between forward and reverse reads.**
In this scenario, we test whether only truncating the forward and reverse reads with `truncLen=c(240,240)` to maintain sufficient overlap, without removing the primers, is enough at the trimming step.
## Filter and trim, learn the error rates, dereplicate, apply sample inference, and merge reads
Assigning the filenames for the filtered fastq.gz files.
```{r filt path no primer, warning=FALSE, message=FALSE, tidy=TRUE, results='hold',comment= ""}
filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))
```
We will use standard filtering parameters: `maxN=0` (DADA2 requires no Ns), `truncQ=2`, `rm.phix=TRUE`, and `maxEE=2`, along with `truncLen=c(240,240)` to maintain sufficient overlap.
```{r filter and trim no primer, warning=FALSE, message=FALSE, tidy=TRUE, results='hide',comment= ""}
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(240, 240), maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE) # On Windows set multithread=FALSE
errF <- learnErrors(filtFs, multithread=TRUE)
errR <- learnErrors(filtRs, multithread=TRUE)
derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)
dadaFs <- dada(derepFs, err=errF, multithread=TRUE)
dadaRs <- dada(derepRs, err=errR, multithread=TRUE)
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
```
## Construct sequence table and remove chimeras
```{r seqtab no primer, warning=FALSE, message=FALSE, tidy=TRUE, results='hide',comment= ""}
seqtab <- makeSequenceTable(mergers)
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
```
## Track reads through the pipeline
```{r track reads no primer, warning=FALSE, message=FALSE, tidy=TRUE, comment= ""}
getN <- function(x) sum(getUniques(x))
track <- cbind(out, getN(dadaFs), getN(dadaRs), getN(mergers), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)
```
The table above shows a large loss of reads at the `removeBimeraDenovo` step. The primers still present on the reads caused the majority of reads to be flagged as chimeric; ensuring that primers are removed is therefore important for retaining the majority of reads when analyzing amplicon data with `dada2`.
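A quick way to quantify this, as in the standard tutorial, is to compute the fraction of merged reads that survive chimera removal; a low value here points at unremoved primers rather than genuinely chimeric biology.
```{r chimera fraction sketch, eval=FALSE}
# Fraction of merged reads kept after chimera removal.
sum(seqtab.nochim) / sum(seqtab)
```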
## Scenario 3
**Stringent trimming of reads**
In this scenario, we test stricter truncation based on the quality profile of the reads. From the quality profile obtained above, we observe that quality tapers towards the end of the reads, so we trim about 30 nucleotides from the end of each read using `truncLen=c(220,220)`.
## Filter and trim, learn the error rates, dereplicate, apply sample inference, and merge reads
Assigning the filenames for the filtered fastq.gz files.
```{r filt path trimming, warning=FALSE, message=FALSE, tidy=TRUE, results='hold',comment= ""}
filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))
```
We will use standard filtering parameters: `maxN=0` (DADA2 requires no Ns), `truncQ=2`, `trimLeft=c(17,20)`, `rm.phix=TRUE`, and `maxEE=2`, along with `truncLen=c(220,220)`.
```{r filter and trim trimming, warning=FALSE, message=FALSE, tidy=TRUE, results='hide',comment= ""}
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(220, 220), trimLeft = c(17,20), maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE) # On Windows set multithread=FALSE
errF <- learnErrors(filtFs, multithread=TRUE)
errR <- learnErrors(filtRs, multithread=TRUE)
derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)
dadaFs <- dada(derepFs, err=errF, multithread=TRUE)
dadaRs <- dada(derepRs, err=errR, multithread=TRUE)
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
```
## Construct sequence table and remove chimeras
```{r seqtab trimming, warning=FALSE, message=FALSE, tidy=TRUE, results='hide',comment= ""}
# These steps are skipped here because no forward and reverse reads merged in this scenario.
#seqtab <- makeSequenceTable(mergers)
#seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
```
## Track reads through the pipeline
```{r track reads trimming, warning=FALSE, message=FALSE, tidy=TRUE, comment= ""}
getN <- function(x) sum(getUniques(x))
track <- cbind(out, getN(dadaFs), getN(dadaRs), getN(mergers))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged")
rownames(track) <- sample.names
head(track)
```
The table above shows that none of the forward and reverse reads were able to merge, owing to the stringent truncation of the reads. All reads were lost because we failed to maintain 20 + biological.length.variation nucleotides of overlap.
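If you want to confirm that insufficient overlap, rather than mismatches, is what prevents merging, `mergePairs` can retain the rejected pairings. The sketch below is a diagnostic suggestion; the object names are arbitrary.
```{r merge rejects sketch, eval=FALSE}
# Keep the read pairs that failed to merge (accept == FALSE) for inspection.
mergers.dbg <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, returnRejects = TRUE)
dbg1 <- if (is.data.frame(mergers.dbg)) mergers.dbg else mergers.dbg[[1]]  # first sample
table(dbg1$accept)    # how many pairings passed the merge criteria
summary(dbg1$nmatch)  # distribution of observed overlap lengths
```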
After considering the above three scenarios, removing the primers while maintaining 20 + biological.length.variation nucleotides of overlap, as in Scenario 1, appears to be the appropriate choice of trimming parameters for analyzing amplicon sequence data with `dada2`.
## Considerations for poor quality profile data
Here, we discuss the steps to follow for a dataset with relatively poor sequence quality.
In the dataset used here, the hypervariable V3-V4 region of the bacterial 16S rRNA gene was targeted with the universal bacterial primers 319F and 806R. The library was sequenced on the Illumina MiSeq using a paired-end 300 bp protocol and v3 reagents, with forward and reverse primers present on the sequences. The data were obtained from the ENA under study ID PRJNA473351. The primers used were the forward primer `ACTCCTACGGGAGGCAGCAG` (20 bp) and the reverse primer `GGACTACHVGGGTWTCTAAT` (20 bp).
## Obtaining sequence data
```{r poor sequence data, warning=FALSE, message=FALSE, tidy=TRUE, results='hold',comment= ""}
path <- "~/Desktop/poor_quality"
list.files(path)
```
Now we read in the names of the fastq files, and perform string manipulation to get matched lists of the forward and reverse fastq files.
```{r poor sequence data reads, warning=FALSE, message=FALSE, tidy=TRUE, results='hold',comment= ""}
# Forward and reverse fastq filenames have format: SAMPLENAME_1.fastq.gz and SAMPLENAME_2.fastq.gz
fnFs <- sort(list.files(path, pattern="_1.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_2.fastq.gz", full.names = TRUE))
# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
sample.names
```
## Inspect read quality profiles
```{r quality profile poor data, warning=FALSE, message=FALSE, tidy=TRUE, results='hold',comment= ""}
plotQualityProfile(c(fnFs[1], fnRs[1]))
```
From the obtained quality profile, we see that the reads are of considerably poorer quality, especially towards the end, and that the reverse read is worse than the forward read. We truncate the last 30 nucleotides of the forward read and the last 60 nucleotides of the reverse read to remove the poor quality sequence. We cannot truncate further, as we need to retain enough bases to maintain 20 + biological.length.variation nucleotides of overlap between the forward and reverse reads.
## Filter and trim, learn the error rates, dereplicate, apply sample inference, and merge reads
Assigning the filenames for the filtered fastq.gz files.
```{r filt path poor data, warning=FALSE, message=FALSE, tidy=TRUE, results='hold',comment= ""}
filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))
```
We will use standard filtering parameters: `maxN=0` (DADA2 requires no Ns), `truncQ=2`, `rm.phix=TRUE`, and `maxEE=2`, along with `truncLen=c(270,240)` and `trimLeft=c(20,20)`.
```{r out 250, warning=FALSE, message=FALSE,tidy=TRUE, results='hide',comment= ""}
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(270,240), maxN=0, maxEE=c(2,2), trimLeft = c(20,20), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE) # On Windows set multithread=FALSE
errF <- learnErrors(filtFs, multithread=TRUE)
errR <- learnErrors(filtRs, multithread=TRUE)
derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)
dadaFs <- dada(derepFs, err=errF, multithread=TRUE)
dadaRs <- dada(derepRs, err=errR, multithread=TRUE)
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
```
## Construct sequence table and remove chimeras
```{r seqtab poor data, warning=FALSE, message=FALSE, tidy=TRUE, results='hide',comment= ""}
seqtab <- makeSequenceTable(mergers)
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
```
## Track reads through the pipeline
```{r track reads poor data, warning=FALSE, message=FALSE, tidy=TRUE, comment= ""}
getN <- function(x) sum(getUniques(x))
track <- cbind(out, getN(dadaFs), getN(dadaRs), getN(mergers), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)
```
The table above shows that the biggest loss of sequences occurred at the initial filtering step, which is common for poor quality sequence data. After that, we retain the majority of our reads and there is no large loss at any single step.
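To make per-step losses easier to compare across samples, the tracking table can be expressed as percentages of the input reads; the short sketch below is a suggestion, not part of the original workflow.
```{r retention pct sketch, eval=FALSE}
# Per-step read retention as a percentage of the raw input reads.
track.pct <- round(sweep(track, 1, track[, "input"], "/") * 100, 1)
head(track.pct)
```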