Welcome to fivepseq readme!

Fivepseq is a software package for analyzing the distribution of 5′ endpoints in RNA degradome sequencing datasets.

Homepage

The homepage is hosted on the Pelechano lab website at http://pelechanolab.com/software/fivepseq/.

User guide

Below is a quick manual to get you started. For detailed instructions and explanations on fivepseq output, please see the user guide at: https://fivepseq.readthedocs.io/en/latest/.

Citation

Nersisyan L, Ropat M, Pelechano V. Improved computational analysis of ribosome dynamics from 5′P degradome data using fivepseq. NAR Genomics and Bioinformatics, 2:4, 2020.

Installation

Fivepseq works with Python versions <=3.8. With a higher version of Python, you may run into problems with some dependencies.
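As a quick way to check the version constraint before installing, here is a minimal sketch. The helper name `python_version_supported` is hypothetical (it is not part of fivepseq), and the snippet assumes the interpreter is available on your PATH as `python`.

```shell
# Hypothetical helper: succeeds if a version string such as "3.8.10" is <= 3.8
python_version_supported() {
    major=${1%%.*}        # text before the first dot
    rest=${1#*.}          # text after the first dot
    minor=${rest%%.*}     # second version component
    [ "$major" -lt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -le 8 ]; }
}

# Query the interpreter on PATH, if there is one (assumed to be called "python")
if command -v python >/dev/null 2>&1; then
    current=$(python -c 'import sys; print("%d.%d.%d" % sys.version_info[:3])')
    if python_version_supported "$current"; then
        echo "Python $current is within the supported range"
    else
        echo "Python $current is newer than 3.8; consider a separate environment"
    fi
fi
```

If the check fails, one option is to install fivepseq into a dedicated environment that provides an older interpreter.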

Install dependencies:

To set up fivepseq, the following Python packages need to be installed manually beforehand using pip (if you don't have pip, install it first).

Paste the following lines into the shell terminal:

git clone https://github.com/joshuagryphon/plastid -b develop
cd plastid
python setup.py install
pip install --upgrade numpy==1.19.5 pysam==0.19.0 cython==0.29.28

To install fivepseq, clone the project from github:

git clone https://github.com/lilit-nersisyan/fivepseq.git
cd fivepseq
python setup.py install

To check if fivepseq was installed correctly, type the following in the command line:

fivepseq --version

This should display the currently installed version of fivepseq. To display the command-line arguments, you may type:

fivepseq --help

In order to enable exporting vector and portable image files, you'll also need to have phantomjs installed as follows:

conda install phantomjs selenium pillow

Running fivepseq

Fivepseq requires the following files to run:

Aligned reads (.bam)
Alignment index (.bai)
Genomic sequence file (.fasta / .fa)
Genomic annotation file (.gff/ .gtf)

This section assumes that you already have these files. If not, please refer to the section Preprocessing from FASTQ files.

Fivepseq usage

The fivepseq --help command will show fivepseq usage and will list all the arguments.

usage: fivepseq -b B -g G -a A [optional arguments]

Required arguments

-b B   the full path to one or many bam/sam files (many files should be provided with a pattern, **within double quotes**: e.g. ["your_bam_folder/*.bam"])
-g G   the full path to the fa/fasta file
-a A   the full path to the gtf/gff/gff3 file

Note:

  • The indexed alignment files should be in the same directory as bam files, with the same name, with .bai extension added.
  • Multiple bam files should be indicated with a pattern placed within double quotes: e.g. ["your_bam_folder/*.bam"]
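The index naming rule above can be checked before a run. The following is a minimal sketch; `missing_bam_indexes` is a hypothetical helper name, and it assumes the index for `sample.bam` is expected at `sample.bam.bai`, as the note describes.

```shell
# Hypothetical helper: print every BAM file in directory $1 that lacks
# a matching "<name>.bam.bai" index next to it
missing_bam_indexes() {
    for bam in "$1"/*.bam; do
        [ -e "$bam" ] || continue          # no BAM files matched the pattern
        [ -e "$bam.bai" ] || echo "$bam"   # index is missing for this file
    done
}

# Example usage (directory name is a placeholder):
#   missing_bam_indexes your_bam_folder
# A missing index can typically be created with: samtools index <file.bam>
```

An empty output means every BAM file in the directory has its index in place.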

Commonly, you will run fivepseq by also providing the name of the output folder ('fivepseq' by default) and the title of your run (determined from bam path otherwise):

fivepseq \
   -g <path_to_genome_fasta> \
   -a <path_to_annotation> \
   -b <path_to_bam_file(s)> \
   -o <output_directory> \
   -t <title_of_the_run>

Note: this is a single command line; the backslashes only continue it onto the next line for readability. Either copy-paste it as shown or type it on a single line without the backslashes.
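When -b is given a pattern, the double quotes matter because an unquoted pattern is expanded by the shell before the program starts. The sketch below illustrates this general shell behavior with a hypothetical `count_args` helper and a throwaway directory; it does not call fivepseq.

```shell
# Hypothetical helper: report how many arguments it received
count_args() { echo "$#"; }

# Throwaway directory standing in for your_bam_folder
demo_dir=$(mktemp -d)
touch "$demo_dir/a.bam" "$demo_dir/b.bam" "$demo_dir/c.bam"

count_args "$demo_dir"/*.bam    # pattern unquoted: the shell expands it, prints 3
count_args "$demo_dir/*.bam"    # pattern quoted: passed on as one literal argument, prints 1

rm -r "$demo_dir"
```

With the quotes in place, the pattern reaches the program as a single -b value, which is the form the required-arguments description asks for.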

Additional arguments

Type fivepseq --help to see the list of additional arguments. For a detailed description of available arguments, see the User guide at: https://fivepseq.readthedocs.io/en/latest/.

Preprocessing from FASTQ files

Fastq files need to be preprocessed and aligned to the reference genome before proceeding to fivepseq downstream analysis. Preprocessing proceeds with the following steps:

  • quality checks (with FASTQC and MULTIQC),
  • adapter and quality based trimming,
  • UMI extraction (if the library was generated with UMIs),
  • mapping to reference,
  • read deduplication (if the library was generated with UMIs),
  • bedgraph generation to view 5'P count distribution in genome viewers.

An example pre-processing pipeline can be found in the preprocess_scripts directory.

In order to run this pipeline, you need to have access to common bioinformatics software such as STAR, UMI-tools, bedtools, Samtools, FastQC, MultiQC and cutadapt.
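Before launching the pipeline, it may help to confirm that these tools are reachable. This is a minimal sketch with a hypothetical `missing_tools` helper; the executable names in the last line are the ones these packages commonly install and are an assumption about your setup.

```shell
# Hypothetical helper: print each named executable that is not found on PATH
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Executable names as commonly installed (an assumption; adjust to your installation)
missing_tools STAR umi_tools bedtools samtools fastqc multiqc cutadapt
```

Any name printed is a tool the shell could not find; an empty output means all of them are on PATH.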

To use it, navigate to the directory where the script is located and use the following command in the prompt:

./fivepseq_preprocess.sh -f [path to directory containing fastq files] -g [path to genome fasta] -a [path to annotation gff/gtf] -i [path to reference index, if exists] -o [output directory] -s [which steps to skip: either or combination of characters {cudqm} ]

The option -s specifies which steps of the pipeline you'd like to skip. Possible values are:

  • c skip trimming adapters with cutadapt
  • u skip UMI extraction
  • d skip deduplication after alignment
  • q skip initial quality check: FASTQC and MULTIQC
  • p skip post-processing quality check: FASTQC and MULTIQC
  • m skip mapping

You may use any combination of these characters, e.g. use -s cudqm to skip all of the corresponding steps.

This script will produce sub-folders in the output directory, containing results of each step of the pipeline. The bam files will be generated in the align_dedup folder.

In addition to performing the steps described above, the script also evaluates the distribution of reads across the genome, according to gene classes {"rRNA" "mRNA" "tRNA" "snoRNA" "snRNA" "ncRNA"}. These statistics are kept in the align_rna/rna_stats.txt file.

!!NOTE!! This example pipeline treats files as single-end libraries. If you have paired-end reads, you should only supply the first read (*_R1* files) to fivepseq.
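One way to supply only the first reads is to collect them in a separate directory and point the pipeline's -f option at it. The sketch below assumes the first-read files carry `_R1` in their names, as in the note; `link_first_reads` and the directory names in the usage comment are hypothetical.

```shell
# Hypothetical helper: symlink only the first-read (*_R1*) files
# from directory $1 into directory $2
link_first_reads() {
    src=$(cd "$1" && pwd) || return 1   # absolute path, so the symlinks stay valid
    mkdir -p "$2"
    for fq in "$src"/*_R1*; do
        [ -e "$fq" ] || continue        # nothing matched the pattern
        ln -sf "$fq" "$2/"
    done
}

# Example usage (directory names are placeholders):
#   link_first_reads raw_fastq first_reads_only
#   ./fivepseq_preprocess.sh -f first_reads_only ...
```

Symlinks are used here to avoid copying large fastq files; copying the files instead would work equally well.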

Have fun!