Snakemake is a workflow management tool widely used in bioinformatics. It proves valuable when analyzing large amounts of data with multiple tools. This script was written as a way to learn how to use a workflow manager. Nextflow is another option for managing large analysis workflows. Here, Snakemake was used to run everything that is usually run on Linux for an RNA-Seq analysis (here is my long-winded version of an RNA-Seq analysis of a p53 gene mutation in mice).
The Snakefile was developed to analyze sequencing data from a recent publication: "Dietary walnut altered gene expressions related to tumor growth, survival, and metastasis in breast cancer patients: a pilot clinical trial". The raw sequences were downloaded from the Sequence Read Archive (SRA) Run Selector using sra-tools.
- The genome and GTF files were downloaded, and an index of the genome was created.
- The MiniKraken database was downloaded and extracted.
- Homo sapiens rRNA sequences were downloaded from NCBI.
- PhiX sequences were downloaded from Illumina.
- FastQC
- Trimmomatic
- STAR
- featureCounts (from the Subread conda package)
- Bowtie2
- bwa
- Samtools
- Bam2Fastx
- MultiQC
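As a rough illustration of what a workflow manager does with a tool list like this one, the sketch below orders a few of the steps so that each runs only after the steps it depends on. The step names and the dependencies between them are assumptions made for illustration (the actual Snakefile may connect the tools differently), and Python's graphlib stands in for Snakemake's own scheduling.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each step maps to the steps whose output it needs.
# These are illustrative assumptions, not the wiring of the real Snakefile.
DEPENDS_ON = {
    "fastqc_raw": set(),
    "trimmomatic": set(),
    "star": {"trimmomatic"},
    "featurecounts": {"star"},
    "multiqc": {"fastqc_raw", "trimmomatic", "star", "featurecounts"},
}


def execution_order(depends_on):
    """Return one valid order in which every step follows its prerequisites."""
    return list(TopologicalSorter(depends_on).static_order())


order = execution_order(DEPENDS_ON)
print(" -> ".join(order))
```

Snakemake works this out from the input and output files declared in each rule rather than from a hand-written table like the one above, but the idea is the same: the order falls out of the dependencies.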
Snakemake was run on a large-memory cluster node. In the same folder as the Snakefile, I ran
snakemake -j 80
which tells Snakemake to use up to 80 cores. If no file name is given, Snakemake searches the current directory for a file named Snakefile.
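The default-file lookup described above can be pictured with a small Python sketch. It is only an approximation of the behavior, not Snakemake's actual code: the function name `find_snakefile` is hypothetical, and the candidate list reflects my understanding of the default locations, so check it against the Snakemake documentation.

```python
from pathlib import Path

# Default locations, in the order Snakemake is assumed to check them here.
# Treat this list as an assumption and confirm it in the Snakemake docs.
CANDIDATES = ("Snakefile", "snakefile", "workflow/Snakefile", "workflow/snakefile")


def find_snakefile(workdir="."):
    """Return the first candidate file that exists under workdir, or None."""
    for name in CANDIDATES:
        path = Path(workdir) / name
        if path.is_file():
            return path
    return None


# Report which workflow file would be picked up in the current directory.
print(find_snakefile("."))
```

If the workflow file lives somewhere else or has a different name, it can be passed explicitly with the `-s` (`--snakefile`) option, for example `snakemake -s path/to/Snakefile -j 80`, where the path is a placeholder.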
Running Snakemake produces a QC folder with results for all of the steps, along with a MultiQC report that summarizes those results in easy-to-read HTML pages.
MultiQC is not able to recognize the KrakenUniq output, so it does not create summaries of microbial contamination.