-
Notifications
You must be signed in to change notification settings - Fork 14
/
Copy pathlab_qc_aftermapping.Rmd
177 lines (109 loc) · 7.65 KB
/
lab_qc_aftermapping.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
---
title: "Quality control"
subtitle: "Workshop on RNA-Seq"
output:
bookdown::html_document2:
highlight: textmate
toc: true
toc_float:
collapsed: true
smooth_scroll: true
print: false
toc_depth: 4
number_sections: true
df_print: default
code_folding: none
self_contained: false
keep_md: false
encoding: 'UTF-8'
css: "assets/lab.css"
include:
after_body: assets/footer-lab.html
---
```{r,child="assets/header-lab.Rmd"}
```
These steps are to be carried out after the fastq files have been aligned to the reference.
# Mapper log files
The first step after you have finished your mapping is to get a general feel of how the mapping went. Most mapping programs produce some sort of summary output, either to a file or to standard out. For example, if using the mapper Bowtie you need to pipe that output to a file to see the summary statistics.
The most important parts to look at are the proportion of uniquely mapping, multi-mapping and unmapped reads. We ideally want the uniquely mapping reads to be as high as possible. Multi-mapping or unmapped reads could indicate poor quality of the reads, adapter contamination or other reasons for low quality scores.
## MultiQC logs summary
After mapping with star of all samples, we ran MultiQC to summarize all the log-files. In this case we had a folder structure with **sampleName.hisat2_summary.txt**, so to make sure that MultiQC understands what is the sample name, we used the `-dd 2` command (e.g. it splits up the path and names the samples after the second last part).
```{sh,eval=FALSE,chunk.title=TRUE}
multiqc -d -dd 2 .
```
# RSeQC
The RSeQC package is one of many tools to get basic mapping statistics from your BAM files. This package provides a number of useful modules that can comprehensively evaluate high throughput sequence data, especially RNA-seq data. Some basic modules quickly inspect sequence quality, nucleotide composition bias, PCR bias and GC bias, while RNA-seq specific modules evaluate sequencing saturation, mapped reads distribution, coverage uniformity, strand specificity, etc. You can read more about the package at [the RSeQC website](http://rseqc.sourceforge.net/).
The RSeQC package contains many steps that are equivalent to FastQC analysis, e.g. read quality, sequence composition (NVC), GC-bias etc, but the results may be different since many of the low quality reads may not map to the genome and therefore will not be included in the BAM file.
Some steps of the RSeQC package require a file with gene annotations in BED format. These can be downloaded from sources such as UCSC, RefSeq and Ensembl. In this case, the RSeQC team have already created annotation files in some common formats that can be downloaded from their website, but if you have data for a less studied organism you may need to create a BED-file on your own.
Two annotation files have already been downloaded into subdirectory `references/annotations` for you to use. These are: **Mus_musculus.GRCm38.101.HouseKeepingGenes.bed** and **Mus_musculus.GRCm38.101.bed**.
In this tutorial, we will not run all the different parts of the RSeQC package, only the most relevant ones for this experiment. The different scripts in the RSeQC package are well described on [their website](http://rseqc.sourceforge.net/), so read the instructions there and specify the input/output files to fit your file names and folder structure.
The steps that we are going to run are:
1. geneBody_coverage.py
2. infer_experiment.py
3. junction_saturation.py
4. read_distribution.py
<i class="fas fa-exclamation-circle"></i> The **geneBody_coverage.py** script takes a very long time to run, so we have created a subsection of annotations to run it on. Use the file ` Mus_musculus.GRCm38.101.HouseKeepingGenes.bed`. If you want to run multiple files this might take to long also. Then you can even further reduce the file by only using the first 500 house keeping genes. For more details look at code below how to do that and how to run all files in a loop.
<i class="fas fa-lightbulb"></i> When running **read_distribution.py**, an out file cannot be specified. Instead you need to pipe (`>`) the output to a file, or look at the output in the terminal.
<i class="fas fa-clipboard-list"></i> Run RSeQC for one sample and have a look at your output.
* Do most of your reads map to genes?
* Are the RNAseq data stranded or unstranded and if so on what strand is the read coming from?
* Do you have even coverage along the genes?
* Do the reads cover most splice junctions?
## Creating Bed12 file
**Optional**
If you are working on a organism that does not have a annotation file in bed12 that is needed to use this RSeQC it is possible to transform a GTF file to Bed12 file. This is done by using tools from uscs and needs a extra step to first go from GTF to genePred and then from genePred to bed. *gtfToGenePred* and *genePredToBed* can both be downloaded from ucsc utilities homepage.
```{sh,eval=FALSE,chunk.title=TRUE}
#To unzip all annotation files if you don't have
gunzip references/annotations/*.gz
gtfToGenePred \
references/annotations/Mus_musculus.GRCm38.101.gtf \
references/annotations/Mus_musculus.GRCm38.101.genePred
genePredToBed \
references/annotations/Mus_musculus.GRCm38.101.genePred\
references/annotations/Mus_musculus.GRCm38.101.bed
```
## MultiQC summary
MutliQC can also include the information from RSeQC. Rerun MultiQC after you have done your RSeQC steps.
It was created using commands:
```{sh,eval=FALSE,chunk.title=TRUE}
multiqc -f -d -dd 2 .
```
<i class="fas fa-comments"></i> Have a look at the reports. What is your conclusion, do your samples look good? Is there anything that looks strange in any sample, or do you feel comfortable using all the samples in your analysis?
***
# Code
This is all that is needed to run all the steps above on data on course data.
```{sh ,eval=FALSE,chunk.title=TRUE}
##########################################
### RUN RSEQC scripts on bamfiles ###
#########################################
cd ~/RNAseq/
mkdir results/03-postAlignmentQC/geneBody_coverage
gunzip rawData/references/annotations/Mus_musculus.GRCm38.101.HouseKeepingGenes.bed.gz
head -n 500 rawData/references/annotations/Mus_musculus.GRCm38.101.HouseKeepingGenes.bed > rawData/references/annotations/Mus_musculus.GRCm38.101.HouseKeepingGenes.first500.bed
geneBody_coverage.py -r rawData/references/annotations/Mus_musculus.GRCm38.101.HouseKeepingGenes.first500.bed \
-i results/02-mapping/ -o results/03-postAlignmentQC/geneBody_coverage/allSamples
mkdir results/03-postAlignmentQC/read_distribution
mkdir results/03-postAlignmentQC/infer_experiment
mkdir results/03-postAlignmentQC/junction_saturation
mkdir results/03-postAlignmentQC/junction_annotation
gunzip rawData/references/annotations/Mus_musculus.GRCm38.101.bed
for i in $(ls results/02-mapping/*.sorted.bam); do
sample=${i%.sorted.bam}
sample=${sample#results/02-mapping/}
echo "running read distribution on $sample"
read_distribution.py -i $i \
-r rawData/references/annotations/Mus_musculus.GRCm38.101.bed \
>results/03-postAlignmentQC/read_distribution/$sample.readCoverage.txt
infer_experiment.py -i $i \
-r rawData/references/annotations/Mus_musculus.GRCm38.101.bed \
>results/03-postAlignmentQC/infer_experiment/$sample.infer_experiment.txt
junction_saturation.py -i $i \
-r rawData/references/annotations/Mus_musculus.GRCm38.101.bed \
-o results/03-postAlignmentQC/junction_saturation/$sample
junction_annotation.py -i $i \
-r rawData/references/annotations/Mus_musculus.GRCm38.101.bed \
-o results/03-postAlignmentQC/junction_annotation/$sample
done
cd ~/RNAseq/results/
multiqc -f -d -dd 2 .
```