forked from OBiLab/VarCall2016_Napoli
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathd2t1_dataQC.html
87 lines (86 loc) · 13.7 KB
/
d2t1_dataQC.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
<table>
<tbody>
<tr class="odd">
<td align="left">> #### Learning Objectives</td>
</tr>
</tbody>
</table>
<h1 id="data-qc-pre-processing---practical-handbook">Data QC & Pre-processing - Practical Handbook</h1>
<p><em>Summary</em></p>
<p>During this practical you will take an Illumina paired end data set and complete the following quality control workflow: 1. Assess the quality of paired end data using FastQC 2. Use Trimmomatic to trim/remove poor quality bases/reads 3. Use Trimmomatic to remove 3’ adapter sequences from reads 4. Re-assess the data using FastQC</p>
<h2 id="step1-copy-and-check-the-data-files">Step1: Copy and check the data files</h2>
<p>The data we will be using today comes from <a href="http://goo.gl/dcKbB">this publication</a> where they used next generation sequencing to investigate population genetics of Vibrio cholera, following a cholera outbreak in the aftermath of the Haitian earthquake in 2010. The accession number for the complete data set is <strong>SRA039806</strong>, with the paired end data files having the accession number <strong>SRR308665</strong>.</p>
<p>Paired end 1 data file -> <code>paired_end1.fastq</code></p>
<p>Paired end 2 data file -> <code>paired_end2.fastq</code></p>
<p><strong>NOTE:</strong> remember that raw paired end data is contained in two separate files, one file for each pair.</p>
<p>First of all, open a connection to Cineca's HPC pico.</p>
<pre><code>>$ ssh -X [email protected]</code></pre>
<p>The first task is to copy the data to your scratch directory so you can use them. Use the <code>cd</code> (change directory) command to move to your scratch directory, the location of your scratch directory is defined according to the following format:</p>
<pre><code>/pico/scratch/userexternal/phallast/</code></pre>
<p>in this example my username is <code>phallast</code>, so my scratch directory is found at the path above.</p>
<p>Use the path to your scratch directory in the <code>cd</code> command below:</p>
<pre><code>>$ cd <path to your scratch directory></code></pre>
<p>The command ‘pwd’ which stands for ‘print working directory’ will tell you the path of your current directory.</p>
<pre><code>>$ pwd</code></pre>
<p>In my case the output from the ‘pwd’ command is:</p>
<pre><code>/pico/scratch/userexternal/phallast/</code></pre>
<p>In order to keep things tidy in your scratch directory it is good practise to create separate folders for the different exercises. We will now create a ‘Data_QC’ folder in your scratch directory then use the command <code>cp</code> (copy) to copy the files into it. The <code>mkdir</code> command, short for ‘make directory’ will create a directory with the name you specify:</p>
<pre><code>>$ mkdir Data_QC</code></pre>
<p>Then ‘cd’ to the Data_QC directory you just created:</p>
<pre><code>>$ cd Data_QC</code></pre>
<p>We will now use the <code>cp</code> command to copy the data files from their current location that is <code>/pico/scratch/userexternal/phallast/DataQC</code>:</p>
<pre><code>>$ cp /pico/scratch/userexternal/phallast/DataQC/paired_end1.fastq .
>$ cp /pico/scratch/userexternal/phallast/DataQC/paired_end2.fastq .
>$ cp /pico/scratch/userexternal/phallast/DataQC/TruSeq3-PE-2.fa .</code></pre>
<p>You will notice that there is a space then <strong>a full stop</strong> at the end of the three <code>cp</code> commands below, this is vital as it tells the <code>cp</code> command that you would like the files copied to your current directory. If you replaced the <code>.</code> with a path e.g. <code>home/MyDirectory/</code> then the files would be copied there instead. Note that the <code>TruSeq3-PE-2.fa</code> is a file of adapter sequences that Trimmomatic will need for the adapter removal stage later.</p>
<p>Once you have run the <code>cp</code> command use the <code>ls</code> (list) command to check the files have been copied to your directory.</p>
<pre><code>>$ ls</code></pre>
<p><code>.fastq</code> files are usually too large to open with text editors, but there are Unix commands to help you look at your data. Use the <code>more</code> command to view your <code>.fastq</code> files (note the <code>fastq</code> format as described in the presentation earlier).</p>
<pre><code>>$ more paired_end1.fastq</code></pre>
<p>To exit the <code>more</code> command and return to the command prompt press <code>q</code> or <code>Ctrl+C</code>. Try the <code>head</code> command which displays just the first 10 lines of a file:</p>
<pre><code>>$ head paired_end1.fastq</code></pre>
<h4 id="question-there-is-also-a-unix-command-called-tail-given-what-the-headcommand-does-what-do-you-think-tail-will-do-try-it-on-one-of-your-data-files.">Question – There is also a Unix command called ‘tail’, given what the ‘head’command does what do you think ‘tail’ will do? Try it on one of your data files.</h4>
<p>Now that you have checked the fastq files are there and viewed them you can proceed with the workflow.</p>
<h2 id="step-2-assess-the-quality-of-the-data-using-fastqc">Step 2: Assess the quality of the data using FastQC</h2>
<p>FastQC is a quality control tool for high throughput sequence data.</p>
<p>The FastQC software needs to be made accessible to you on PICO by using the <code>module load</code> command:</p>
<pre><code>>$ module load fastqc/0.11.2</code></pre>
<p>Then launch FastQC as follows:</p>
<pre><code>>$ fastqc &</code></pre>
<p>From the FastQC window click: <code>> File > Open</code></p>
<p>Then browse to the location of the <code>.fastq</code> files (this should be the <code>Data_QC</code> directory you just created in your scratch directory) and select both of the paired end fastq files then click ‘OK’. It might take a minute or two for FastQC to open the files and analyse them.</p>
<p>Work your way through the analysis modules on the left hand side of the FastQC window, using the <a href="http://goo.gl/8nSMkq">FastQC documentation</a> familiarize yourself with what each module is showing. Pay particular attention to the modules below as they will help direct the downstream trimming and adapter removal steps:</p>
<p>‘Basic Statistics’ ‘Per Base Sequence Quality’ ‘Per Sequence Quality Scores’ ‘Overrepresented Sequences’</p>
<p><strong>NOTE:</strong> There is a characteristic drop in quality scores of the 3’ bases of the reads (we will trim these later). The overrepresented sequences will also be removed.</p>
<h4 id="question-what-is-the-total-number-of-sequences-in-each-of-the-paired-end-.fastq-files-hint-use-the-basic-statistics-module">Question – what is the total number of sequences in each of the paired end <code>.fastq</code> files? (hint – use the basic statistics module)</h4>
<ul>
<li>paired_end1.fastq ?</li>
<li>paired_end2.fastq ?</li>
</ul>
<h4 id="question-what-type-of-encoding-is-used-in-the-fastq-files">Question – what type of encoding is used in the <code>fastq</code> files?</h4>
<h4 id="question-what-is-the-length-of-the-sequences-in-the-fastq-files">Question – what is the length of the sequences in the fastq files?</h4>
<p><strong>Note:</strong> Do not close the FastQC window as we will compare the original files to the ones we will produce after adapter removal, and quality filtering.</p>
<h2 id="step-3-use-trimmomatic-to-remove-adapter-sequences">Step 3: Use Trimmomatic to remove adapter sequences</h2>
<p>Trimmomatic is a java tool for performing a range of trimming tasks on Illumina paired end and single end read data. The documentation can be found <a href="http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf">here</a>.</p>
<p>Go back to the overrepresented sequences module of FastQC. This is where FastQC would tell you if a significant proportion (>1%) of your reads are contaminated with adapter sequences. As you can see from the ‘Possible Source’ column, FastQC has found a number of reads contaminated with TruSeq adapter sequences, so we will run Trimmomatic to remove the adapter sequences.</p>
<p>The command below will run Trimmomatic and remove any adapter sequence from the reads. Remember we will need to make the Trimmomatic software available to you by using the ‘module load’ command</p>
<pre><code>$ module load trimmomatic/0.33
$ java -Xmx1G -jar /cineca/prod/applications/trimmomatic/0.33/binary/bin/trimmomatic-0.33.jar PE -phred33 -threads 8 -trimlog logfile paired_end1.fastq paired_end2.fastq Left_paired.fastq Left_unpaired.fastq Right_paired.fastq Right_unpaired.fastq ILLUMINACLIP:TruSeq3-PE-2.fa:2:40:15 MINLEN:36</code></pre>
<p>The parameters used for Trimmomatic are defined as follows: - <code>PE</code>: data is paired end - <code>-phred33</code>: Quality scores are 33 offset - <code>-threads 8</code>: number of threads to use - <code>-trimlog logfile</code>: name of logfile for summary information - <code>paired_end1.fastq</code>: name of input fastq file for left reads - <code>paired_end2.fastq</code>: name of input fastq file for right reads - <code>Left_paired.fastq</code>: paired trimmed output fastq file for left reads - <code>Left_unpaired.fastq</code>: unpaired trimmed output fastq file for left reads - <code>Right_paired.fastq</code>: paired trimmed output fastq file for right reads - <code>Right_unpaired.fastq</code>: unpaired trimmed output fastq file for right reads - <code>ILLUMINACLIP</code>: parameters for the adapter clipping - <code>TruSeq3-PE-2.fa</code>: text file of adapter sequences to search for - <code>:2:40:15</code>: adapter-read alignment settings – see manual - <code>MINLEN:36</code>: delete reads trimmed below length MINLEN</p>
<h4 id="question-according-to-the-trimmomatic-screen-output-what-is-the-number-and-percentage-of-read-pairs-that-both-survived-adapter-trimming">Question – According to the Trimmomatic screen output, what is the number and percentage of read pairs that ‘both survived’ adapter trimming?</h4>
<h4 id="question---given-that-the-original-fastq-files-each-contained-4052587-reads-how-many-pairs-of-reads-have-been-trimmed-and-then-deleted-by-trimmomatic-in-this-step">Question - Given that the original fastq files each contained 4052587 reads, how many pairs of reads have been trimmed and then deleted by Trimmomatic in this step?</h4>
<p><strong>Note:</strong> Remember that trimmomatic only deletes reads if the length after trimming of adapter sequences is less than MINLEN (which we set to 36bp).</p>
<h2 id="step-4-use-trimmomatic-to-trim-low-quality-bases">Step 4: Use Trimmomatic to trim low quality bases</h2>
<p>The FastQC ‘Per Base Sequence Quality’ and ‘Per Sequence Quality Scores’ modules have already told us that there could be some issues with the quality scores of the last few bases of the reads. We also know from the presentation that the quality of 3’ bases of sequence reads does tend to decrease. We will use Trimmomatic to trim poor quality bases from the 3’ end of the reads. Trimmomatic also checks the 5’end for poor quality bases. The command below will carry out the trimming on the adapter trimmed fastq files we created above:</p>
<pre><code>$ java -Xmx1G -jar /cineca/prod/applications/trimmomatic/0.33/binary/bin/trimmom atic-0.33.jar PE -phred33 -threads 8 -trimlog logfile2 Left_paired.fastq Right_paired.fastq Left_trim_paired.fastq Left_trim_unpaired.fastq Right_trim_paired.fastq Right_trim_unpaired.fastq LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36</code></pre>
<p>The parameters used for Trimmomatic are defined as follows: - <code>PE</code>: data is paired end - <code>-phred33</code>: Quality scores are 33 offset - <code>-threads 8</code>: number of threads to use - <code>-trimlog logfile2</code>: name of logfile for summary information - <code>Left_paired.fastq</code>: name of input adapter trimmed left fastq file - <code>Right_paired.fastq</code>: name of input adapter trimmed right fastq file - <code>Left_trim_paired.fastq</code>: paired trimmed output fastq file for left reads - <code>Left_unpaired.fastq</code>: unpaired trimmed output fastq file for left reads - <code>Right_paired.fastq</code>: paired trimmed output fastq file for right reads - <code>Right_unpaired.fastq</code>: unpaired trimmed output fastq file for right reads - <code>LEADING:3</code>: Trim 5’ bases with quality score < 3 - <code>TRAILING:3</code>: Trim 3’ bases with quality score < 3 - <code>SLIDINGWINDOW:4:15</code>: see manual for explanation - <code>MINLEN:36</code>: delete reads trimmed below length MINLEN</p>
<h4 id="question-according-to-the-trimmomatic-screen-output-what-is-the-number-and-percentage-of-read-pairs-that-both-survived-low-quality-base-trimming">Question – According to the Trimmomatic screen output, what is the number and percentage of read pairs that ‘both survived’ low quality base trimming?</h4>
<h4 id="question---given-that-the-adapter-trimmed-fastq-files-contained-2165674-reads-how-many-pairs-of-reads-have-been-trimmed-and-then-deleted-by-trimmomatic-in-this-step">Question - Given that the adapter trimmed fastq files contained 2165674 reads, how many pairs of reads have been trimmed and then deleted by Trimmomatic in this step?</h4>
<h2 id="step-5-assess-again-the-quality-of-the-data-using-fastqc">Step 5: Assess again the quality of the data using FastQC</h2>
<p>Use FastQC to open the trimmed fastq files.</p>
<p>Go through the analysis modules on the left hand side to see what has changed compared to the original fastq files.</p>
<h4 id="question-what-are-the-main-changes-you-see-in-the-read-quality">Question – What are the main changes you see in the read quality?</h4>
<div class="figure">
<img src="img/cong.png" alt="cong" /><p class="caption">cong</p>
</div>
<h3 id="well-doneyou-have-successfully-qcd-and-pre-processed-a-set-of-paired-end-read-files">Well done……you have successfully QC’d and pre-processed a set of paired end read files!</h3>