This repository houses the data components used by the public POTAGE web server (http://crobiad.agwine.adelaide.edu.au/potage).
This information is relevant for those seeking to:
- Add unpublished gene expression data sets to their own POTAGE instance
- Contribute a published gene expression data set back to the community of POTAGE users
POTAGE currently houses a limited number of publicly available gene expression data sets:
- IWGSC RNA-Seq tissue series under expression/001_iwgsc
- Meiosis data under expression/002_meiose
While we plan to add additional public data sets over time, we welcome contributions from the public via pull requests.
Here we present an example workflow for including additional RNA-Seq data sets in POTAGE. We omit the read QC and alignment steps and assume the user is able to generate one valid BAM file for each data point to be displayed in POTAGE. The appropriate reference for aligning the reads when using a splice-aware aligner is MiPS_genes_PlusMinus2kb_allTranscripts__AND__HCS_UNMAPPED_CDS.fasta.gz (TO BE ADDED TO REPO).
Alternatively, the user may align their reads to transcripts, or even quantify expression using an alignment-free approach such as Salmon or kallisto.
We include some of the settings used for aligning RNA-Seq reads from the Meiosis dataset:
STAR \
--runMode alignReads \
--readFilesIn ${R1} ${R2} \
--readFilesCommand pigz -dcp2 \
--outFileNamePrefix ${OUTBASENAME} \
--outSAMtype BAM SortedByCoordinate \
--outFilterMultimapScoreRange 0 \
--outFilterMultimapNmax 5 \
--outFilterMismatchNoverLmax 0.00 \
--outFilterMatchNminOverLread 1.0 \
--outSJfilterOverhangMin 35 20 20 20 \
--outSJfilterCountTotalMin 10 3 3 3 \
--outSJfilterCountUniqueMin 5 1 1 1 \
--alignEndsType EndToEnd \
--alignSoftClipAtReferenceEnds No \
--outSAMstrandField intronMotif \
--alignIntronMax 10000 \
--alignMatesGapMax 10000 \
--outSAMattrRGline ID:${SAMPLE} PL:Illumina PU:${SAMPLE} LB:${SAMPLE} SM:${SAMPLE%_?} || exit 1
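The read-group line above sets SM with the shell expansion ${SAMPLE%_?}, which removes a trailing underscore plus one further character, presumably so that samples differing only in such a suffix share one SM value. A minimal sketch of that expansion follows; the sample names and the sm_tag helper are made up for illustration and are not taken from the Meiosis dataset.

```shell
# Hypothetical helper: derive the SM value the same way the STAR call does.
sm_tag() {
  # %_? strips the shortest trailing match of "_" followed by any one character;
  # a name without such a suffix is returned unchanged.
  printf '%s\n' "${1%_?}"
}

# Illustrative sample names only.
for SAMPLE in leptotene_1 leptotene_2 pachytene; do
  printf '%s -> %s\n' "${SAMPLE}" "$(sm_tag "${SAMPLE}")"
done
```

With these names, the two leptotene samples map to the same SM value, while pachytene is left as it is.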
For each per-data-point BAM file (merged if necessary), run:
cufflinks -G MIPS_genes_PlusMinus2kb_allTranscripts.gtf data_point.bam \
-o expression/data_point \
--no-update-check --max-multiread-fraction 1 --library-type fr-unstranded
In the case of the Chinese Spring Meiosis dataset this could be done as follows:
for f in merged_bams/*.bam; do s=${f##*/};
cufflinks -G MIPS_genes_PlusMinus2kb_allTranscripts.gtf \
-o expression/${s%.bam} ${f};
done
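# Sort each sample's gene-level FPKM table so that rows line up across samples
for d in expression/* ; do
LAST=${d}/sorted_genes.fpkm_tracking
sort ${d}/genes.fpkm_tracking > ${LAST}
done
# Split the locus column (field 7, name:start-stop) into three tab-separated columns
cut -f7 ${LAST} | tr ':-' '\t\t' > genes.common
cp genes.common genes.joined
# Append each sample's FPKM column (field 10) to the joined table
for d in expression/*; do
tmp1=$(mktemp tmp.fpkms.joined.XXXXXXXXXX)
paste genes.joined <(cut -f10 ${d}/sorted_genes.fpkm_tracking) > ${tmp1} && mv ${tmp1} genes.joined
done
# Build the header row: three coordinate columns followed by one column per sample
printf 'gene_id\tstart\tstop' > header.txt
for d in expression/*; do
s=${d##*/}
printf '\t%s' "${s}" >> header.txt
done
# Combine header and data; tail -n +2 drops the first line of genes.joined,
# which relies on the Cufflinks header row having sorted to the top
cat header.txt <(printf '\n') <(tail -n +2 genes.joined) > meiose.tsv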
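Because the table is assembled with paste from several files, it may be worth confirming that every row ends up with the same number of columns as the header before handing the file to POTAGE. The following is a minimal sketch of such a check; check_tsv is a hypothetical helper and not part of POTAGE.

```shell
# Hypothetical helper (not part of POTAGE): report rows of a tab-separated
# file whose field count differs from that of the header row.
check_tsv() {
  awk -F'\t' '
    NR == 1 { n = NF; next }   # remember the number of header columns
    NF != n { printf "line %d: %d fields, expected %d\n", NR, NF, n; bad = 1 }
    END     { exit bad }       # non-zero exit status if any row is off
  ' "$1"
}
```

It could be called as check_tsv meiose.tsv; a zero exit status with no output means all rows match the header.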
For example, for the Chinese Spring Meiosis dataset, config.cfg might look as follows:
#Expression dataset configuration
ShortName Mei
LongName Chinese Spring - Meiosis
Unit FPKM
FileName meiose.tsv
POTAGE picks up the expression data sets from a directory specified in potage.cfg; by default this is expression. For an additional dataset to be loaded into POTAGE, we create a new sub-directory under expression, for example:
mkdir expression/002_meiose
We then place the file with expression values (meiose.tsv) and the data set configuration file config.cfg in the newly created sub-directory.
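The copy step can be sketched as a small shell function that also confirms the file named by the FileName entry is actually present next to config.cfg. This is an illustration only: install_dataset is a hypothetical helper, and it assumes each line of config.cfg is a key followed by whitespace and a value without spaces in the file name.

```shell
# Hypothetical helper (not part of POTAGE): copy a data set into place and
# check that the file named by FileName in config.cfg sits alongside it.
install_dataset() {
  tsv=$1; cfg=$2; dest=$3
  mkdir -p "$dest" || return 1
  cp "$tsv" "$dest"/ || return 1
  cp "$cfg" "$dest"/config.cfg || return 1
  # Assumption: "FileName <value>" with whitespace between key and value.
  name=$(awk '$1 == "FileName" { print $2 }' "$dest"/config.cfg)
  # Succeed only if a FileName entry exists and the named file is present.
  [ -n "$name" ] && [ -f "$dest/$name" ]
}
```

For the example above this could be run as install_dataset meiose.tsv config.cfg expression/002_meiose; a non-zero exit status indicates a failed copy or a FileName entry that does not match any file in the sub-directory.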