-
Notifications
You must be signed in to change notification settings - Fork 2
Home
In this workshop, we will be performing both genome-guided and genome-free transcript reconstruction.
For genome-guided reconstruction, we will leverage the Tuxedo2 protocol, leveraging HISAT2 tool for aligning RNA-Seq reads to a target genome, and StringTie for reconstructing transcript isoforms.
We will also leverage Trinity for genome-free transcript reconstruction, and compare our de novo assembled transcripts to the genome-guided reconstructions by aligning our Trinity transcripts to the target genome using GMAP and visualizing all the data using IGV.
Our genome and RNA-Seq data can be found at:
% ls ~/workspace/data/trinity/rnaseq/*
.
/home/ubuntu/workspace/data/trinity/rnaseq/minigenome.fa
/home/ubuntu/workspace/data/trinity/rnaseq/samples.txt
/home/ubuntu/workspace/data/trinity/rnaseq/minigenome.gtf
/home/ubuntu/workspace/data/trinity/rnaseq/data:
sA_rep1_1.fastq.gz sA_rep2_2.fastq.gz sB_rep1_1.fastq.gz sB_rep2_2.fastq.gz
sA_rep1_2.fastq.gz sA_rep3_1.fastq.gz sB_rep1_2.fastq.gz sB_rep3_1.fastq.gz
sA_rep2_1.fastq.gz sA_rep3_2.fastq.gz sB_rep2_1.fastq.gz sB_rep3_2.fastq.gz
The genome data correspond to a small subset of the Drosophila genome (~200 genes) and the RNA-Seq data correspond to two conditions (sA & sB), with each containing three biological replicates (rep1,2,3).
The minigenome.gtf contains the reference transcript structure annotations, and the 'samples.txt' file describes the relationships between samples, replicates, and their corresponding paired-end RNA-Seq fastq files.
Create a 'transreco' directory in your workspace where you will perform your analyses.
% cd ~/workspace
% mkdir transreco
% cd transreco
Copy over the genome and rna-seq data to your current working directory 'transreco'.
% cp -r ~/workspace/data/trinity/rnaseq/* .
Before loading the genome into IGV, you'll need to index the genome fasta file:
% samtools faidx minigenome.fa
Load the 'minigenome.fa' into IGV as a new genome, and then load in the annotations 'minigenome.gtf'. Setting the annotations to 'squished' will enable you to see all the different reference isoforms for each gene.
![](wiki/images/minigenome_IGV.png)
Before we can align the reads to the genome using hisat2, we must first prepare (index) the target genome. This is a multi-step process involving the following steps, all involving tools included in the hisat2 installation:
Extract the splice sites from the genome annotation file:
% hisat2_extract_splice_sites.py minigenome.gtf > minigenome.gtf.ss
Extract the exon records from the genome annotation file:
% hisat2_extract_exons.py minigenome.gtf > minigenome.gtf.exons
Now build the genome index using these new files along with the genome and annotation file:
% hisat2-build --exon minigenome.gtf.exons \
--ss minigenome.gtf.ss \
-p 2 minigenome.fa minigenome.fa
Note, it's useful to run 'ls -ltr' to see which new files exist in your working directory.
Align the reads for each replicate separate. Each replicate corresponds to a pair of fastq files. Align each like so:
% hisat2 --dta -x minigenome.fa -p 2 \
-1 data/sA_rep1_1.fastq.gz -2 data/sA_rep1_2.fastq.gz \
| samtools view -Sb -F 4 \
| samtools sort -o sA_rep1.cSorted.hisat2.minigenome.fa.bam
% hisat2 --dta -x minigenome.fa -p 2 \
-1 data/sA_rep2_1.fastq.gz -2 data/sA_rep2_2.fastq.gz \
| samtools view -Sb -F 4 \
| samtools sort -o sA_rep2.cSorted.hisat2.minigenome.fa.bam
% hisat2 --dta -x minigenome.fa -p 2 \
-1 data/sA_rep3_1.fastq.gz -2 data/sA_rep3_2.fastq.gz \
| samtools view -Sb -F 4 \
| samtools sort -o sA_rep3.cSorted.hisat2.minigenome.fa.bam
% hisat2 --dta -x minigenome.fa -p 2 \
-1 data/sB_rep1_1.fastq.gz -2 data/sB_rep1_2.fastq.gz \
| samtools view -Sb -F 4 \
| samtools sort -o sB_rep1.cSorted.hisat2.minigenome.fa.bam
% hisat2 --dta -x minigenome.fa -p 2 \
-1 data/sB_rep2_1.fastq.gz -2 data/sB_rep2_2.fastq.gz \
| samtools view -Sb -F 4 \
| samtools sort -o sB_rep2.cSorted.hisat2.minigenome.fa.bam
% hisat2 --dta -x minigenome.fa -p 2 \
-1 data/sB_rep3_1.fastq.gz -2 data/sB_rep3_2.fastq.gz \
| samtools view -Sb -F 4 \
| samtools sort -o sB_rep3.cSorted.hisat2.minigenome.fa.bam
Note, for many samples and replicates, we would often script out the process instead of running each manually. Cutting/pasting commands is easy enough here, though.
The following command will run 'samtools index' on each of the .bam files:
% for file in *.bam
do
samtools index $file
echo indexed $file
done
Load a bam file from each of the sample types into IGV (ie. just rep1 from sA and sB) for viewing.
![](wiki/images/igv-hisat2.png)
For each of the bam files, we'll reconstruct transcript isoforms. Later, we'll merge those individual replicate-based transcript sets into a single non-redundant set of final transcripts.
Run StringTie on each of the bam files by running the following command, which will loop through the bam files run StringTie and generate a separate .stringtie.GTF output file containing transcript structures based on each set of aligned reads:
% for file in *.bam
do
stringtie $file -o $file.stringtie.GTF
echo done running stringtie on $file
done
use 'ls -ltr' to see the files just generated.
Next, merge these gtf files into our final set of genome-based reconstructed transcripts:
% stringtie --merge -o stringtie.final.GTF *.stringtie.GTF
Load the 'stringtie.final.GTF' file into IGV. You might relocate the tier within the viewer, set to 'expanded' and change the color to your preference.
![](wiki/images/igv-stringtie.png)
How well do the reconstructed transcripts compare to the input reads? to the reference annotations?
De novo transcript reconstruction involves assembling transcripts directly from the reads themselves and not requiring any genome sequence. We will do this using the Trinity software, and then align our reconstructed transcripts to our target genome as a way of investigating just how well it worked.
Set the TRINITY_HOME environmental variable to the path of the Trinity software installation:
% export TRINITY_HOME=~/trinityrnaseq-Trinity-v2.5.1
Run Trinity using the 'samples.txt' file like so:
% $TRINITY_HOME/Trinity --samples_file samples.txt \
--seqType fq --CPU 2 --max_memory 1G
This may take a few minutes to run and you can follow its progress in the terminal as it's executing.
Once Trinity is complete, you will find the output file as 'trinity_out_dir/Trinity.fasta'.
Have a look at this output file. How many transcripts does it contain?
There's a utility in the Trinity toolkit that will provide you some basic stats:
% $TRINITY_HOME/util/TrinityStats.pl trinity_out_dir/Trinity.fasta
Align the Trinity-reconstructed transcripts to the genome using GMAP.
Run the following to build a gmap index for the genome, required prior to searching:
% gmap_build -D . -T . -d minigenome.fa.gmap -k 13 minigenome.fa
Now, align the transcripts to the genome:
% gmap -D . -d minigenome.fa.gmap trinity_out_dir/Trinity.fasta \
-f samse -n 0 -x 50 -t 2 -B 5 \
| samtools view -Sb | samtools sort -o Trinity.gmap.bam
Index the bam file for viewing in IGV:
% samtools index Trinity.gmap.bam
Load the Trinity.gmap.bam file into IGV.
![](wiki/images/IGV-wTrinity.png)
How do the Trinity-reconstructed transcripts compare to the reference annotation and the genome-guided StringTie reconstructions? Better/worse/different?
That's all we have for the quick intro into transcript reconstruction from RNA-Seq data. For more details on Trinity reconstruction or using the Trinity protocol for downstream investigation of transcript expression, structure, function, etc., visit our website at http://trinityrnaseq.github.io