diff --git a/analysis.Rmd b/analysis.Rmd index 8f468bd..5e7041e 100644 --- a/analysis.Rmd +++ b/analysis.Rmd @@ -102,7 +102,7 @@ Once the analysis has completed, you will have a file called `nemabiome_results. The following is a sample analysis using the DADA2 pipeline (https://benjjneb.github.io/dada2/tutorial.html). This is meant to be an example and not a definitive workflow. First we load packages and set a seed for reproducibility. -``` +```{r libraries, echo=T, results='hide'} library(DECIPHER) packageVersion("DECIPHER") @@ -115,17 +115,20 @@ packageVersion("ShortRead") library(Biostrings) packageVersion("Biostrings") +library(ggplot2) +packageVersion("ggplot2") + set.seed(106) ``` To use the DADA2 pipeline, primers must be removed from amplicons. This can be done in R using cutadapt. Here we specify the path of sequences to be processed (using "~/MiSeq_Run/" in this example) and the path of cutadapt on your machine (for example "/home/stefan/.local/bin/cutadapt"). You will also need to specify the primers you used as the "FWD" and "REV" variables. We start off by creating all possible orientations of the forward and reverse sequences. -``` -path<- "Miseq_Run" +```{r cutadapt_initialization} +path<- "~/MiSeq_Run/" list.files(path) -fnFs<-sort(list.files(path, pattern = "_R1.fastq.gz", full.names = TRUE)) -fnRs <-sort(list.files(path, pattern = "_R2.fastq.gz", full.names = TRUE)) +fnFs<-sort(list.files(path, pattern = "_R1_001.fastq.gz", full.names = TRUE)) +fnRs <-sort(list.files(path, pattern = "_R2_001.fastq.gz", full.names = TRUE)) #identifying the primers @@ -147,7 +150,7 @@ FWD.orients Next we will remove all sequences containing ambiguous bases, and count the number of times our primers are observed in the remaining sequences. -``` +```{r more_cutadapt_initialization} #filtering the sequences fnFs.filtN <- file.path(path, "filtN", basename(fnFs)) fnRs.filtN <- file.path(path, "filtN", basename(fnRs)) @@ -167,8 +170,8 @@ rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.filtN[[1]]), Now we can use cutadapt to remove primers, and check that they have been successfully removed. -``` -cutadapt <- "/home/gilleardlab/anaconda3/bin/cutadapt" # CHANGE ME to the cutadapt path on your machine +```{r cutadapt, echo=T, results='hide'} +cutadapt <- "/home/stefan/.local/bin/cutadapt" # CHANGE ME to the cutadapt path on your machine system2(cutadapt, args = "--version") # Run shell commands from R #now creating output filenames for cutadapted files, and define parameters to give cutadapt command @@ -202,9 +205,9 @@ rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]), Now that primers are removed, we can begin the DADA2 pipeline. We start with plotting quality profiles of our sequences to determine suitable quality filtering parameters. -``` -cutFs <- sort(list.files(path.cut, pattern = "_R1.fastq.gz", full.names = TRUE)) #remember to match file names -cutRs <- sort(list.files(path.cut, pattern = "_R2.fastq.gz", full.names = TRUE)) +```{r quality_profiles} +cutFs <- sort(list.files(path.cut, pattern = "_R1_001.fastq.gz", full.names = TRUE)) #remember to match file names +cutRs <- sort(list.files(path.cut, pattern = "_R2_001.fastq.gz", full.names = TRUE)) # Extract sample names, assuming filenames share the same format (adjust if needed)
get.sample.name <- function(fname) strsplit(basename(fname), "_R")[[1]][1] @@ -216,8 +219,8 @@ sample.names ``` For the purposes of our analysis, we chose to conduct quality filtering with maximum expected errors of 2 for the forward sequence and 5 for the reverse, to truncate each read at the first base with a quality score of 2 or lower, and to filter out amplicons shorter than 50 base pairs. Again, filters for your own pipeline may differ. After quality filtering, we can run our sequences through learnErrors to create an error model, and then dereplicate our sequences. -``` -#filtering reads +```{r} +#filtering reads filtFs<-file.path(path.cut, "filtered", basename(cutFs)) filtRs<-file.path(path.cut, "filtered", basename(cutRs)) @@ -231,7 +234,7 @@ errF<-learnErrors(filtFs, multithread=TRUE) errR<-learnErrors(filtRs, multithread=TRUE) #visualize the error rates with the code below -# plotErrors(errF, nominalQ=TRUE) +plotErrors(errF, nominalQ=TRUE) # denoising data derepFs<-derepFastq(filtFs, verbose = TRUE) derepRs<-derepFastq(filtRs, verbose=TRUE) After dereplication, we can denoise our sequences using the DADA algorithm. We will also merge our paired-end sequences, allowing for up to 1 mismatch in the overlap region and returning all sequences whether they merged or not. -``` +```{r denoise_and_merge, echo=T, results='hide'} dadaFs<-dada(derepFs, err=errF, multithread = TRUE) dadaRs<-dada(derepRs, err=errR, multithread=TRUE) @@ -249,7 +252,7 @@ mergers<-mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE, maxMismatch  Before assigning taxonomy, we construct a sequence table, remove chimeras using the removeBimeraDenovo command, and then create a table of how many sequences were retained after each processing step. -``` +```{r chimera_removal} seqtab<-makeSequenceTable(mergers) dim(seqtab) diff --git a/docs/analysis.html b/docs/analysis.html index 3937b4a..824c726 100644 --- a/docs/analysis.html +++ b/docs/analysis.html @@ -424,60 +424,135 @@

Running analysis

DADA2 Example Workflow

The following is a sample analysis using the DADA2 pipeline (https://benjjneb.github.io/dada2/tutorial.html). This is meant to be an example and not a definitive workflow. First we load packages and set a seed for reproducibility.

-
library(DECIPHER)
-packageVersion("DECIPHER")
-
-library(dada2)
-packageVersion("dada2") 
-
-library(ShortRead)
-packageVersion("ShortRead") 
+
library(DECIPHER)
+
## Loading required package: Biostrings
+
## Loading required package: BiocGenerics
+
## Loading required package: parallel
+
## 
+## Attaching package: 'BiocGenerics'
+
## The following objects are masked from 'package:parallel':
+## 
+##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
+##     clusterExport, clusterMap, parApply, parCapply, parLapply,
+##     parLapplyLB, parRapply, parSapply, parSapplyLB
+
## The following objects are masked from 'package:stats':
+## 
+##     IQR, mad, sd, var, xtabs
+
## The following objects are masked from 'package:base':
+## 
+##     anyDuplicated, append, as.data.frame, basename, cbind,
+##     colnames, dirname, do.call, duplicated, eval, evalq, Filter,
+##     Find, get, grep, grepl, intersect, is.unsorted, lapply, Map,
+##     mapply, match, mget, order, paste, pmax, pmax.int, pmin,
+##     pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
+##     setdiff, sort, table, tapply, union, unique, unsplit, which,
+##     which.max, which.min
+
## Loading required package: S4Vectors
+
## Loading required package: stats4
+
## 
+## Attaching package: 'S4Vectors'
+
## The following object is masked from 'package:base':
+## 
+##     expand.grid
+
## Loading required package: IRanges
+
## Loading required package: XVector
+
## 
+## Attaching package: 'Biostrings'
+
## The following object is masked from 'package:base':
+## 
+##     strsplit
+
## Loading required package: RSQLite
+
packageVersion("DECIPHER")
+
+library(dada2)
+
## Loading required package: Rcpp
+
## Registered S3 methods overwritten by 'ggplot2':
+##   method         from 
+##   [.quosures     rlang
+##   c.quosures     rlang
+##   print.quosures rlang
+
packageVersion("dada2") 
+
+library(ShortRead)
+
## Loading required package: BiocParallel
+
## Loading required package: Rsamtools
+
## Loading required package: GenomeInfoDb
+
## Loading required package: GenomicRanges
+
## Loading required package: GenomicAlignments
+
## Loading required package: SummarizedExperiment
+
## Loading required package: Biobase
+
## Welcome to Bioconductor
+## 
+##     Vignettes contain introductory material; view with
+##     'browseVignettes()'. To cite Bioconductor, see
+##     'citation("Biobase")', and for packages 'citation("pkgname")'.
+
## Loading required package: DelayedArray
+
## Loading required package: matrixStats
+
## 
+## Attaching package: 'matrixStats'
+
## The following objects are masked from 'package:Biobase':
+## 
+##     anyMissing, rowMedians
+
## 
+## Attaching package: 'DelayedArray'
+
## The following objects are masked from 'package:matrixStats':
+## 
+##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
+
## The following object is masked from 'package:Biostrings':
+## 
+##     type
+
## The following objects are masked from 'package:base':
+## 
+##     aperm, apply, rowsum
+
packageVersion("ShortRead") 
 
 library(Biostrings)
 packageVersion("Biostrings")
 
+library(ggplot2)
+packageVersion("ggplot2")
+
 set.seed(106)

To use the DADA2 pipeline, primers must be removed from amplicons. This can be done in R using cutadapt. Here we specify the path of sequences to be processed (using “~/MiSeq_Run/” in this example) and the path of cutadapt on your machine (for example “/home/stefan/.local/bin/cutadapt”). You will also need to specify the primers you used as the “FWD” and “REV” variables. We start off by creating all possible orientations of the forward and reverse sequences.

-
path<- "Miseq_Run"
-list.files(path)
-
-fnFs<-sort(list.files(path, pattern = "_R1.fastq.gz", full.names = TRUE)) 
-fnRs <-sort(list.files(path, pattern = "_R2.fastq.gz", full.names = TRUE))
-
-
-#identifying the primers
-FWD<-"ACGTCTGGTTCAGGGTTGTT"
-REV <- "TTAGTTTCTTTTCCTCCGCT"
-
-allOrients<-function(primer) {   #creates all orientations of the primer sequence
-  require(Biostrings)
-  dna<-DNAString(primer) #Biostrings works with DNAString objects
-  orients<-c(Forward=dna, Complement=complement(dna), Reverse=reverse(dna),
-             RevComp=reverseComplement(dna))
-  return(sapply(orients, toString)) #now converting back into character vector
-}
-FWD.orients<-allOrients(FWD)
-REV.orients<-allOrients(REV)
-
-FWD.orients
+
##  [1] "cutadapt"                     "filtN"                       
+##  [3] "G1_S1_L001_R1_001.fastq.gz"   "G1_S1_L001_R2_001.fastq.gz"  
+##  [5] "G10_S10_L001_R1_001.fastq.gz" "G10_S10_L001_R2_001.fastq.gz"
+##  [7] "G11_S11_L001_R1_001.fastq.gz" "G11_S11_L001_R2_001.fastq.gz"
+##  [9] "G12_S12_L001_R1_001.fastq.gz" "G12_S12_L001_R2_001.fastq.gz"
+## [11] "G13_S13_L001_R1_001.fastq.gz" "G13_S13_L001_R2_001.fastq.gz"
+## [13] "G14_S14_L001_R1_001.fastq.gz" "G14_S14_L001_R2_001.fastq.gz"
+## [15] "G15_S15_L001_R1_001.fastq.gz" "G15_S15_L001_R2_001.fastq.gz"
+## [17] "G16_S16_L001_R1_001.fastq.gz" "G16_S16_L001_R2_001.fastq.gz"
+## [19] "G17_S17_L001_R1_001.fastq.gz" "G17_S17_L001_R2_001.fastq.gz"
+## [21] "G18_S18_L001_R1_001.fastq.gz" "G18_S18_L001_R2_001.fastq.gz"
+## [23] "G19_S19_L001_R1_001.fastq.gz" "G19_S19_L001_R2_001.fastq.gz"
+## [25] "G2_S2_L001_R1_001.fastq.gz"   "G2_S2_L001_R2_001.fastq.gz"  
+## [27] "G20_S20_L001_R1_001.fastq.gz" "G20_S20_L001_R2_001.fastq.gz"
+## [29] "G21_S21_L001_R1_001.fastq.gz" "G21_S21_L001_R2_001.fastq.gz"
+## [31] "G22_S22_L001_R1_001.fastq.gz" "G22_S22_L001_R2_001.fastq.gz"
+## [33] "G23_S23_L001_R1_001.fastq.gz" "G23_S23_L001_R2_001.fastq.gz"
+## [35] "G24_S24_L001_R1_001.fastq.gz" "G24_S24_L001_R2_001.fastq.gz"
+## [37] "G25_S25_L001_R1_001.fastq.gz" "G25_S25_L001_R2_001.fastq.gz"
+## [39] "G26_S26_L001_R1_001.fastq.gz" "G26_S26_L001_R2_001.fastq.gz"
+## [41] "G3_S3_L001_R1_001.fastq.gz"   "G3_S3_L001_R2_001.fastq.gz"  
+## [43] "G4_S4_L001_R1_001.fastq.gz"   "G4_S4_L001_R2_001.fastq.gz"  
+## [45] "G5_S5_L001_R1_001.fastq.gz"   "G5_S5_L001_R2_001.fastq.gz"  
+## [47] "G6_S6_L001_R1_001.fastq.gz"   "G6_S6_L001_R2_001.fastq.gz"  
+## [49] "G7_S7_L001_R1_001.fastq.gz"   "G7_S7_L001_R2_001.fastq.gz"  
+## [51] "G8_S8_L001_R1_001.fastq.gz"   "G8_S8_L001_R2_001.fastq.gz"  
+## [53] "G9_S9_L001_R1_001.fastq.gz"   "G9_S9_L001_R2_001.fastq.gz"
+
##                Forward             Complement                Reverse 
+## "ACGTCTGGTTCAGGGTTGTT" "TGCAGACCAAGTCCCAACAA" "TTGTTGGGACTTGGTCTGCA" 
+##                RevComp 
+## "AACAACCCTGAACCAGACGT"

Next we will remove all sequences containing ambiguous bases, and count the number of times our primers are observed in the remaining sequences.

-
#filtering the sequences
-fnFs.filtN <- file.path(path, "filtN", basename(fnFs)) 
-fnRs.filtN <- file.path(path, "filtN", basename(fnRs))
-filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = TRUE)
-
-#counting number of primers 
-primerHits <- function(primer, fn) {
-  # Counts number of reads in which the primer is found
-  nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
-  return(sum(nhits > 0))
-}
-rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.filtN[[1]]), 
-      FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.filtN[[1]]), 
-      REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.filtN[[1]]), 
-      REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.filtN[[1]]))
+
##                  Forward Complement Reverse RevComp
+## FWD.ForwardReads   25344          0       0       0
+## FWD.ReverseReads       0          0       0       0
+## REV.ForwardReads       0          0       0       0
+## REV.ReverseReads   21503          0       0       0
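Again, only the output survives in the rendered page; the chunk itself, consolidated from the Rmd source, is below. primerHits counts the reads in which a given primer orientation occurs at least once:

```r
# filtering the sequences: discard reads containing ambiguous bases (Ns)
fnFs.filtN <- file.path(path, "filtN", basename(fnFs))
fnRs.filtN <- file.path(path, "filtN", basename(fnRs))
filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = TRUE)

# counting number of primers observed in the first sample
primerHits <- function(primer, fn) {
  # counts the number of reads in which the primer is found
  nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
  sum(nhits > 0)
}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.filtN[[1]]),
      FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.filtN[[1]]),
      REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.filtN[[1]]),
      REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.filtN[[1]]))
```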

Now we can use cutadapt to remove primers, and check that they have been successfully removed.

-
cutadapt <- "/home/gilleardlab/anaconda3/bin/cutadapt" # CHANGE ME to the cutadapt path on your machine
+
cutadapt <- "/home/stefan/.local/bin/cutadapt" # CHANGE ME to the cutadapt path on your machine
 system2(cutadapt, args = "--version") # Run shell commands from R
 
 #now creating output filenames for cutadapted files, and define parameters to give cutadapt command
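The cutadapt invocation itself falls outside the hunks shown in this diff. As a minimal sketch of what this step typically looks like, following the DADA2 ITS workflow (the exact flags used here are an assumption, not the command from this repository): -g/-G trim a 5' primer from R1/R2, -a/-A trim the reverse-complement of the opposite primer from the 3' end, and -n 2 performs both removals when present.

```r
path.cut <- file.path(path, "cutadapt")   # matches the "cutadapt" folder seen in list.files(path)
if (!dir.exists(path.cut)) dir.create(path.cut)
fnFs.cut <- file.path(path.cut, basename(fnFs))
fnRs.cut <- file.path(path.cut, basename(fnRs))

FWD.RC <- dada2:::rc(FWD)   # reverse-complement of the forward primer
REV.RC <- dada2:::rc(REV)   # reverse-complement of the reverse primer
R1.flags <- paste("-g", FWD, "-a", REV.RC)   # R1: FWD at 5', RC of REV at 3'
R2.flags <- paste("-G", REV, "-A", FWD.RC)   # R2: REV at 5', RC of FWD at 3'

for (i in seq_along(fnFs)) {
  system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2,
                             "-o", fnFs.cut[i], "-p", fnRs.cut[i],  # trimmed outputs
                             fnFs.filtN[i], fnRs.filtN[i]))         # N-filtered inputs
}
```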
@@ -505,65 +580,282 @@ 

DADA2 Example Workflow

rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]),
      FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.cut[[1]]),
      REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[1]]),
      REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[1]]))

Now that primers are removed, we can begin the DADA2 pipeline. We start with plotting quality profiles of our sequences to determine suitable quality filtering parameters.

-
cutFs <- sort(list.files(path.cut, pattern = "_R1.fastq.gz", full.names = TRUE)) #remember to match file names 
-cutRs <- sort(list.files(path.cut, pattern = "_R2.fastq.gz", full.names = TRUE))
-
-# Extract sample names, assuming filenames share the same format (adjust if needed)
-get.sample.name <- function(fname) strsplit(basename(fname), "_R")[[1]][1]
-sample.names <- unname(sapply(cutFs, get.sample.name))
-sample.names
-
-#plotQualityProfile(cutFs[1:2])+ggtitle("Forward")
-#plotQualityProfile(cutRs[1:2])+ggtitle("Reverse")
+
##  [1] "G1_S1_L001"   "G10_S10_L001" "G11_S11_L001" "G12_S12_L001"
+##  [5] "G13_S13_L001" "G14_S14_L001" "G15_S15_L001" "G16_S16_L001"
+##  [9] "G17_S17_L001" "G18_S18_L001" "G19_S19_L001" "G2_S2_L001"  
+## [13] "G20_S20_L001" "G21_S21_L001" "G22_S22_L001" "G23_S23_L001"
+## [17] "G24_S24_L001" "G25_S25_L001" "G26_S26_L001" "G3_S3_L001"  
+## [21] "G4_S4_L001"   "G5_S5_L001"   "G6_S6_L001"   "G7_S7_L001"  
+## [25] "G8_S8_L001"   "G9_S9_L001"
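The quality-profile plots at the end of this chunk are commented out in the source; with ggplot2 loaded they can be viewed with:

```r
# per-cycle quality profiles for the first two samples of each direction
plotQualityProfile(cutFs[1:2]) + ggtitle("Forward")
plotQualityProfile(cutRs[1:2]) + ggtitle("Reverse")
```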

For the purposes of our analysis, we chose to conduct quality filtering with maximum expected errors of 2 for the forward sequence and 5 for the reverse, to truncate each read at the first base with a quality score of 2 or lower, and to filter out amplicons shorter than 50 base pairs. Again, filters for your own pipeline may differ. After quality filtering, we can run our sequences through learnErrors to create an error model, and then dereplicate our sequences.
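Consolidated from the Rmd source, this is the chunk behind the filtering table, error-learning messages, and dereplication log shown further below:

```r
# filtering reads
filtFs <- file.path(path.cut, "filtered", basename(cutFs))
filtRs <- file.path(path.cut, "filtered", basename(cutRs))

# maxEE = c(2, 5): max expected errors (forward, reverse); truncQ = 2:
# truncate at the first base with quality <= 2; minLen = 50: drop short reads
out <- filterAndTrim(cutFs, filtFs, cutRs, filtRs, maxN = 0, maxEE = c(2, 5),
                     truncQ = 2, minLen = 50, rm.phix = TRUE,
                     compress = TRUE, multithread = TRUE)
head(out)

# learning errors
errF <- learnErrors(filtFs, multithread = TRUE)
errR <- learnErrors(filtRs, multithread = TRUE)
plotErrors(errF, nominalQ = TRUE)   # visualize the fitted error rates

# dereplicating
derepFs <- derepFastq(filtFs, verbose = TRUE)
derepRs <- derepFastq(filtRs, verbose = TRUE)
```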

-
#filtering reads 
-filtFs<-file.path(path.cut, "filtered", basename(cutFs))
-filtRs<-file.path(path.cut, "filtered", basename(cutRs))
-
-out <- filterAndTrim(cutFs, filtFs, cutRs, filtRs, maxN = 0, maxEE = c(2, 5), 
-                     truncQ = 2, minLen = 50, rm.phix = TRUE, compress = TRUE, multithread = TRUE)  
-head(out)
-out
-
-#learning errors
-errF<-learnErrors(filtFs, multithread=TRUE)
-errR<-learnErrors(filtRs, multithread=TRUE)
-
#visualize the error rates with the code below
-# plotErrors(errF, nominalQ=TRUE)
-
-# denoising data 
-derepFs<-derepFastq(filtFs, verbose = TRUE)
-derepRs<-derepFastq(filtRs, verbose=TRUE)
+
##                              reads.in reads.out
+## G1_S1_L001_R1_001.fastq.gz      28069     25002
+## G10_S10_L001_R1_001.fastq.gz    30391     27055
+## G11_S11_L001_R1_001.fastq.gz    26771     24258
+## G12_S12_L001_R1_001.fastq.gz    28793     25794
+## G13_S13_L001_R1_001.fastq.gz    28812     26158
+## G14_S14_L001_R1_001.fastq.gz    33590     31499
+
##                              reads.in reads.out
+## G1_S1_L001_R1_001.fastq.gz      28069     25002
+## G10_S10_L001_R1_001.fastq.gz    30391     27055
+## G11_S11_L001_R1_001.fastq.gz    26771     24258
+## G12_S12_L001_R1_001.fastq.gz    28793     25794
+## G13_S13_L001_R1_001.fastq.gz    28812     26158
+## G14_S14_L001_R1_001.fastq.gz    33590     31499
+## G15_S15_L001_R1_001.fastq.gz    26882     25290
+## G16_S16_L001_R1_001.fastq.gz    28561     26653
+## G17_S17_L001_R1_001.fastq.gz    26693     24744
+## G18_S18_L001_R1_001.fastq.gz    27489     25525
+## G19_S19_L001_R1_001.fastq.gz    28710     25830
+## G2_S2_L001_R1_001.fastq.gz      28680     25407
+## G20_S20_L001_R1_001.fastq.gz    26774     24102
+## G21_S21_L001_R1_001.fastq.gz    28060     25687
+## G22_S22_L001_R1_001.fastq.gz    30512     28070
+## G23_S23_L001_R1_001.fastq.gz    26251     23650
+## G24_S24_L001_R1_001.fastq.gz    24075     22083
+## G25_S25_L001_R1_001.fastq.gz    30804     27836
+## G26_S26_L001_R1_001.fastq.gz    27777     21986
+## G3_S3_L001_R1_001.fastq.gz      30407     27917
+## G4_S4_L001_R1_001.fastq.gz      27779     25207
+## G5_S5_L001_R1_001.fastq.gz      35661     32700
+## G6_S6_L001_R1_001.fastq.gz      32204     29821
+## G7_S7_L001_R1_001.fastq.gz      30276     28046
+## G8_S8_L001_R1_001.fastq.gz      26885     24639
+## G9_S9_L001_R1_001.fastq.gz      22455     20541
+
## 100272261 total bases in 436807 reads from 17 samples will be used for learning the error rates.
+
## 100154673 total bases in 436807 reads from 17 samples will be used for learning the error rates.
+
## Warning: Transformation introduced infinite values in continuous y-axis
+
+## Warning: Transformation introduced infinite values in continuous y-axis
+
[Figure: plotErrors(errF, nominalQ=TRUE), observed error rates vs. the fitted error model]
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G1_S1_L001_R1_001.fastq.gz
+
## Encountered 5232 unique sequences from 25002 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G10_S10_L001_R1_001.fastq.gz
+
## Encountered 6326 unique sequences from 27055 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G11_S11_L001_R1_001.fastq.gz
+
## Encountered 6039 unique sequences from 24258 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G12_S12_L001_R1_001.fastq.gz
+
## Encountered 4602 unique sequences from 25794 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G13_S13_L001_R1_001.fastq.gz
+
## Encountered 5201 unique sequences from 26158 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G14_S14_L001_R1_001.fastq.gz
+
## Encountered 5727 unique sequences from 31499 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G15_S15_L001_R1_001.fastq.gz
+
## Encountered 5875 unique sequences from 25290 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G16_S16_L001_R1_001.fastq.gz
+
## Encountered 7082 unique sequences from 26653 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G17_S17_L001_R1_001.fastq.gz
+
## Encountered 6133 unique sequences from 24744 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G18_S18_L001_R1_001.fastq.gz
+
## Encountered 6249 unique sequences from 25525 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G19_S19_L001_R1_001.fastq.gz
+
## Encountered 5496 unique sequences from 25830 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G2_S2_L001_R1_001.fastq.gz
+
## Encountered 5681 unique sequences from 25407 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G20_S20_L001_R1_001.fastq.gz
+
## Encountered 6041 unique sequences from 24102 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G21_S21_L001_R1_001.fastq.gz
+
## Encountered 6168 unique sequences from 25687 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G22_S22_L001_R1_001.fastq.gz
+
## Encountered 4610 unique sequences from 28070 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G23_S23_L001_R1_001.fastq.gz
+
## Encountered 6640 unique sequences from 23650 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G24_S24_L001_R1_001.fastq.gz
+
## Encountered 5175 unique sequences from 22083 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G25_S25_L001_R1_001.fastq.gz
+
## Encountered 6443 unique sequences from 27836 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G26_S26_L001_R1_001.fastq.gz
+
## Encountered 5719 unique sequences from 21986 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G3_S3_L001_R1_001.fastq.gz
+
## Encountered 5780 unique sequences from 27917 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G4_S4_L001_R1_001.fastq.gz
+
## Encountered 5610 unique sequences from 25207 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G5_S5_L001_R1_001.fastq.gz
+
## Encountered 5650 unique sequences from 32700 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G6_S6_L001_R1_001.fastq.gz
+
## Encountered 7026 unique sequences from 29821 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G7_S7_L001_R1_001.fastq.gz
+
## Encountered 5377 unique sequences from 28046 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G8_S8_L001_R1_001.fastq.gz
+
## Encountered 5675 unique sequences from 24639 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G9_S9_L001_R1_001.fastq.gz
+
## Encountered 5228 unique sequences from 20541 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G1_S1_L001_R2_001.fastq.gz
+
## Encountered 16218 unique sequences from 25002 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G10_S10_L001_R2_001.fastq.gz
+
## Encountered 19693 unique sequences from 27055 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G11_S11_L001_R2_001.fastq.gz
+
## Encountered 17647 unique sequences from 24258 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G12_S12_L001_R2_001.fastq.gz
+
## Encountered 17386 unique sequences from 25794 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G13_S13_L001_R2_001.fastq.gz
+
## Encountered 17229 unique sequences from 26158 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G14_S14_L001_R2_001.fastq.gz
+
## Encountered 20973 unique sequences from 31499 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G15_S15_L001_R2_001.fastq.gz
+
## Encountered 17507 unique sequences from 25290 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G16_S16_L001_R2_001.fastq.gz
+
## Encountered 18676 unique sequences from 26653 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G17_S17_L001_R2_001.fastq.gz
+
## Encountered 17774 unique sequences from 24744 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G18_S18_L001_R2_001.fastq.gz
+
## Encountered 18617 unique sequences from 25525 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G19_S19_L001_R2_001.fastq.gz
+
## Encountered 18784 unique sequences from 25830 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G2_S2_L001_R2_001.fastq.gz
+
## Encountered 16294 unique sequences from 25407 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G20_S20_L001_R2_001.fastq.gz
+
## Encountered 19700 unique sequences from 24102 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G21_S21_L001_R2_001.fastq.gz
+
## Encountered 18500 unique sequences from 25687 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G22_S22_L001_R2_001.fastq.gz
+
## Encountered 19470 unique sequences from 28070 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G23_S23_L001_R2_001.fastq.gz
+
## Encountered 18173 unique sequences from 23650 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G24_S24_L001_R2_001.fastq.gz
+
## Encountered 16476 unique sequences from 22083 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G25_S25_L001_R2_001.fastq.gz
+
## Encountered 19593 unique sequences from 27836 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G26_S26_L001_R2_001.fastq.gz
+
## Encountered 15743 unique sequences from 21986 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G3_S3_L001_R2_001.fastq.gz
+
## Encountered 19543 unique sequences from 27917 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G4_S4_L001_R2_001.fastq.gz
+
## Encountered 17698 unique sequences from 25207 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G5_S5_L001_R2_001.fastq.gz
+
## Encountered 20321 unique sequences from 32700 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G6_S6_L001_R2_001.fastq.gz
+
## Encountered 19925 unique sequences from 29821 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G7_S7_L001_R2_001.fastq.gz
+
## Encountered 18122 unique sequences from 28046 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G8_S8_L001_R2_001.fastq.gz
+
## Encountered 16517 unique sequences from 24639 total sequences read.
+
## Dereplicating sequence entries in Fastq file: ~/MiSeq_Run//cutadapt/filtered/G9_S9_L001_R2_001.fastq.gz
+
## Encountered 14416 unique sequences from 20541 total sequences read.

After dereplication, we can denoise our sequences using the DADA algorithm. We will also merge our paired-end sequences, allowing for up to 1 mismatch in the overlap region and returning all sequences whether they merged or not.

-
dadaFs<-dada(derepFs, err=errF, multithread = TRUE)
+
dadaFs<-dada(derepFs, err=errF, multithread = TRUE)
 dadaRs<-dada(derepRs, err=errR, multithread=TRUE)
 
 mergers<-mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE, maxMismatch = 1, returnRejects = TRUE)
+
## 21476 paired-reads (in 39 unique pairings) successfully merged out of 21586 (in 57 pairings) input.
+
## Duplicate sequences in merged output.
+
## 21945 paired-reads (in 31 unique pairings) successfully merged out of 22414 (in 74 pairings) input.
+
## Duplicate sequences in merged output.
+
## 19465 paired-reads (in 41 unique pairings) successfully merged out of 20198 (in 122 pairings) input.
+
## Duplicate sequences in merged output.
+
## 21511 paired-reads (in 21 unique pairings) successfully merged out of 21700 (in 41 pairings) input.
+
## Duplicate sequences in merged output.
+
## 21967 paired-reads (in 42 unique pairings) successfully merged out of 22204 (in 69 pairings) input.
+
## Duplicate sequences in merged output.
+
## 26347 paired-reads (in 41 unique pairings) successfully merged out of 26549 (in 64 pairings) input.
+
## Duplicate sequences in merged output.
+
## 20624 paired-reads (in 34 unique pairings) successfully merged out of 21157 (in 95 pairings) input.
+
## Duplicate sequences in merged output.
+
## 21829 paired-reads (in 49 unique pairings) successfully merged out of 22523 (in 134 pairings) input.
+
## Duplicate sequences in merged output.
+
## 19821 paired-reads (in 33 unique pairings) successfully merged out of 20156 (in 71 pairings) input.
+
## Duplicate sequences in merged output.
+
## 20346 paired-reads (in 31 unique pairings) successfully merged out of 20826 (in 125 pairings) input.
+
## Duplicate sequences in merged output.
+
## 20683 paired-reads (in 61 unique pairings) successfully merged out of 21071 (in 96 pairings) input.
+
## Duplicate sequences in merged output.
+
## 21762 paired-reads (in 43 unique pairings) successfully merged out of 22044 (in 72 pairings) input.
+
## Duplicate sequences in merged output.
+
## 17938 paired-reads (in 28 unique pairings) successfully merged out of 18422 (in 54 pairings) input.
+
## Duplicate sequences in merged output.
+
## 20305 paired-reads (in 56 unique pairings) successfully merged out of 21119 (in 147 pairings) input.
+
## Duplicate sequences in merged output.
+
## 22812 paired-reads (in 17 unique pairings) successfully merged out of 22891 (in 31 pairings) input.
+
## Duplicate sequences in merged output.
+
## 18534 paired-reads (in 39 unique pairings) successfully merged out of 19183 (in 88 pairings) input.
+
## Duplicate sequences in merged output.
+
## 17335 paired-reads (in 37 unique pairings) successfully merged out of 17965 (in 70 pairings) input.
+
## Duplicate sequences in merged output.
+
## 22173 paired-reads (in 41 unique pairings) successfully merged out of 22490 (in 76 pairings) input.
+
## Duplicate sequences in merged output.
+
## 18065 paired-reads (in 39 unique pairings) successfully merged out of 18185 (in 59 pairings) input.
+
## Duplicate sequences in merged output.
+
## 22088 paired-reads (in 42 unique pairings) successfully merged out of 23486 (in 167 pairings) input.
+
## Duplicate sequences in merged output.
+
## 20751 paired-reads (in 40 unique pairings) successfully merged out of 21169 (in 85 pairings) input.
+
## Duplicate sequences in merged output.
+
## 27881 paired-reads (in 33 unique pairings) successfully merged out of 28071 (in 58 pairings) input.
+
## Duplicate sequences in merged output.
+
## 24592 paired-reads (in 68 unique pairings) successfully merged out of 25461 (in 208 pairings) input.
+
## Duplicate sequences in merged output.
+
## 23582 paired-reads (in 36 unique pairings) successfully merged out of 23839 (in 62 pairings) input.
+
## Duplicate sequences in merged output.
+
## 20804 paired-reads (in 39 unique pairings) successfully merged out of 21123 (in 85 pairings) input.
+
## Duplicate sequences in merged output.
+
## 16965 paired-reads (in 34 unique pairings) successfully merged out of 17326 (in 74 pairings) input.
+
## Duplicate sequences in merged output.

Before assigning taxonomy, we construct a sequence table, remove chimeras using the removeBimeraDenovo command, and then create a table of how many sequences were retained after each processing step.
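Consolidated from the Rmd source, the chunk behind the output that follows:

```r
seqtab <- makeSequenceTable(mergers)
dim(seqtab)

# remove chimeras
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus",
                                    multithread = TRUE, verbose = TRUE)
# inspect the distribution of sequence lengths
table(nchar(getSequences(seqtab.nochim)))

# track reads through the pipeline; a large drop at any one step can point
# to a problem with that step's parameters
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN),
               sapply(mergers, getN), rowSums(seqtab.nochim))
# if processing a single sample, replace sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)
```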

-
seqtab<-makeSequenceTable(mergers)
-dim(seqtab) 
-
-#Remove chimeras 
-seqtab.nochim<-removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
-#inspect distribution sequence lengths
-table(nchar(getSequences(seqtab.nochim)))
-
-#track reads through the pipeline - inspect the number of reads that made it through each step in the pipeline -- IMPORTANT
-
-getN <- function(x) sum(getUniques(x))
-track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, 
-                                                                       getN), rowSums(seqtab.nochim))
-
-# If processing a single sample, remove the sapply calls: e.g. replace
-# sapply(dadaFs, getN) with getN(dadaFs)
-colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", 
-                     "nonchim")
-rownames(track) <- sample.names     #check that no single step loses an unusually large fraction of reads
-head(track)
+
## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+
## [1]  26 208
+
## Identified 144 bimeras out of 208 input sequences.
+
## 
+## 273 274 281 283 287 288 289 290 291 292 319 421 
+##   2   2  12   3   4   4   8   2  18   7   1   1
+
## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+## Duplicate sequences detected and merged.
+
##              input filtered denoisedF denoisedR merged nonchim
+## G1_S1_L001   28069    25002     24953     21629  21586   20483
+## G10_S10_L001 30391    27055     26921     22528  22414   20281
+## G11_S11_L001 26771    24258     24182     20259  20198   18804
+## G12_S12_L001 28793    25794     25738     21744  21700   21003
+## G13_S13_L001 28812    26158     26083     22264  22204   21187
+## G14_S14_L001 33590    31499     31439     26586  26549   24953

Download list

diff --git a/docs/analysis_files/figure-html/unnamed-chunk-1-1.png b/docs/analysis_files/figure-html/unnamed-chunk-1-1.png new file mode 100644 index 0000000..2541c4c Binary files /dev/null and b/docs/analysis_files/figure-html/unnamed-chunk-1-1.png differ