lab_qc_aftermapping.Rmd

---
title: "Quality control"
subtitle: "Workshop on RNA-Seq"
output:
  bookdown::html_document2:
    highlight: textmate
    toc: true
    toc_float:
      collapsed: true
      smooth_scroll: true
      print: false
    toc_depth: 4
    number_sections: true
    df_print: default
    code_folding: none
    self_contained: false
    keep_md: false
    encoding: 'UTF-8'
    css: "assets/lab.css"
    include:
      after_body: assets/footer-lab.html
---

```{r,child="assets/header-lab.Rmd"}
```

These steps are to be carried out after the fastq files have been aligned to the reference.

# Mapper log files

The first step after you have finished your mapping is to get a general feel of how the mapping went. Most mapping programs produce some sort of summary output, either to a file or to standard out. For example, if using the mapper Bowtie you need to pipe that output to a file to see the summary statistics.

The most important parts to look at are the proportion of uniquely mapping, multi-mapping and unmapped reads. We ideally want the uniquely mapping reads to be as high as possible. Multi-mapping or unmapped reads could indicate poor quality of the reads, adapter contamination or other reasons for low quality scores.


## MultiQC logs summary

After mapping with star of all samples, we ran MultiQC to summarize all the log-files. In this case we had a folder structure with **sampleName.hisat2_summary.txt**, so to make sure that MultiQC understands what is the sample name, we used the `-dd 2` command (e.g. it splits up the path and names the samples after the second last part).

```{sh,eval=FALSE,chunk.title=TRUE}

multiqc -d -dd 2 .
```


# RSeQC

The RSeQC package is one of many tools to get basic mapping statistics from your BAM files. This package provides a number of useful modules that can comprehensively evaluate high throughput sequence data, especially RNA-seq data. Some basic modules quickly inspect sequence quality, nucleotide composition bias, PCR bias and GC bias, while RNA-seq specific modules evaluate sequencing saturation, mapped reads distribution, coverage uniformity, strand specificity, etc. You can read more about the package at [the RSeQC website](http://rseqc.sourceforge.net/).

The RSeQC package contains many steps that are equivalent to FastQC analysis, e.g. read quality, sequence composition (NVC), GC-bias etc, but the results may be different since many of the low quality reads may not map to the genome and therefore will not be included in the BAM file.


Some steps of the RSeQC package require a file with gene annotations in BED format. These can be downloaded from sources such as UCSC, RefSeq and Ensembl. In this case, the RSeQC team have already created annotation files in some common formats that can be downloaded from their website, but if you have data for a less studied organism you may need to create a BED-file on your own.


Two annotation files have already been downloaded into subdirectory `references/annotations` for you to use. These are: **Mus_musculus.GRCm38.101.HouseKeepingGenes.bed** and **Mus_musculus.GRCm38.101.bed**.


In this tutorial, we will not run all the different parts of the RSeQC package, only the most relevant ones for this experiment. The different scripts in the RSeQC package are well described on [their website](http://rseqc.sourceforge.net/), so read the instructions there and specify the input/output files to fit your file names and folder structure.

The steps that we are going to run are:

1. geneBody_coverage.py
2. infer_experiment.py
3. junction_saturation.py
4. read_distribution.py

<i class="fas fa-exclamation-circle"></i> The **geneBody_coverage.py** script takes a very long time to run, so we have created a subsection of annotations to run it on. Use the file ` Mus_musculus.GRCm38.101.HouseKeepingGenes.bed`. If you want to run multiple files this might take to long also. Then you can even further reduce the file by only using the first 500 house keeping genes. For more details look at code below how to do that and how to run all files in a loop.

<i class="fas fa-lightbulb"></i> When running **read_distribution.py**, an out file cannot be specified. Instead you need to pipe (`>`) the output to a file, or look at the output in the terminal.

<i class="fas fa-clipboard-list"></i> Run RSeQC for one sample and have a look at your output.

* Do most of your reads map to genes?
* Are the RNAseq data stranded or unstranded and if so on what strand is the read coming from?
* Do you have even coverage along the genes?
* Do the reads cover most splice junctions?

## Creating Bed12 file

**Optional**

If you are working on a organism that does not have a annotation file in bed12 that is needed to use this RSeQC it is possible to transform a GTF file to Bed12 file. This is done by using tools from uscs and needs a extra step to first go from GTF to genePred and then from genePred to bed. *gtfToGenePred* and *genePredToBed*  can both be downloaded from ucsc utilities homepage.


```{sh,eval=FALSE,chunk.title=TRUE}

#To unzip all annotation files if you don't have
gunzip references/annotations/*.gz

 gtfToGenePred \
   references/annotations/Mus_musculus.GRCm38.101.gtf \
   references/annotations/Mus_musculus.GRCm38.101.genePred

 genePredToBed \
   references/annotations/Mus_musculus.GRCm38.101.genePred\
   references/annotations/Mus_musculus.GRCm38.101.bed

```


## MultiQC summary


MutliQC can also include the information from  RSeQC. Rerun MultiQC after you have done your RSeQC steps.

It was created using commands:

```{sh,eval=FALSE,chunk.title=TRUE}
multiqc -f -d -dd 2 .

```

<i class="fas fa-comments"></i> Have a look at the reports. What is your conclusion, do your samples look good? Is there anything that looks strange in any sample, or do you feel comfortable using all the samples in your analysis?

***
# Code

This is all that is needed to run all the steps above on data on course data.
```{sh ,eval=FALSE,chunk.title=TRUE}


##########################################
###  RUN RSEQC scripts on bamfiles     ###
#########################################

cd ~/RNAseq/

mkdir results/03-postAlignmentQC/geneBody_coverage

gunzip rawData/references/annotations/Mus_musculus.GRCm38.101.HouseKeepingGenes.bed.gz  
head -n 500 rawData/references/annotations/Mus_musculus.GRCm38.101.HouseKeepingGenes.bed > rawData/references/annotations/Mus_musculus.GRCm38.101.HouseKeepingGenes.first500.bed

geneBody_coverage.py -r rawData/references/annotations/Mus_musculus.GRCm38.101.HouseKeepingGenes.first500.bed \
  -i results/02-mapping/ -o results/03-postAlignmentQC/geneBody_coverage/allSamples

mkdir results/03-postAlignmentQC/read_distribution
mkdir results/03-postAlignmentQC/infer_experiment
mkdir results/03-postAlignmentQC/junction_saturation
mkdir results/03-postAlignmentQC/junction_annotation

gunzip rawData/references/annotations/Mus_musculus.GRCm38.101.bed  

for i in $(ls results/02-mapping/*.sorted.bam); do

  sample=${i%.sorted.bam}
  sample=${sample#results/02-mapping/}


  echo "running read distribution on $sample"

  read_distribution.py  -i $i \
  -r rawData/references/annotations/Mus_musculus.GRCm38.101.bed \
  >results/03-postAlignmentQC/read_distribution/$sample.readCoverage.txt

  infer_experiment.py  -i $i  \
  -r rawData/references/annotations/Mus_musculus.GRCm38.101.bed \
  >results/03-postAlignmentQC/infer_experiment/$sample.infer_experiment.txt

  junction_saturation.py -i $i  \
  -r rawData/references/annotations/Mus_musculus.GRCm38.101.bed \
  -o results/03-postAlignmentQC/junction_saturation/$sample

  junction_annotation.py -i $i  \
  -r rawData/references/annotations/Mus_musculus.GRCm38.101.bed \
  -o results/03-postAlignmentQC/junction_annotation/$sample


done

cd ~/RNAseq/results/

multiqc -f -d -dd 2 .


```