An application for analysis of 5′ endpoint distribution in RNA sequencing datasets. This is particularly useful for techniques that capture 5′ monophosphorylated RNAs, such as 5PSeq, PARE-seq or GMUC. It may also be useful for ribosome profiling datasets and the like. The main workflow of fivepseq is intended for downstream analysis of alignment fi…

Welcome to the fivepseq README!

Fivepseq is a software package for analysis of 5′ endpoint distribution in RNA degradome sequencing datasets.

Homepage

The homepage is hosted on the Pelechano lab website at http://pelechanolab.com/software/fivepseq/.

User guide

Below is a quick manual to get you started. For detailed instructions and explanations on fivepseq output, please see the user guide at: https://fivepseq.readthedocs.io/en/latest/.

Citation

Nersisyan L, Ropat M, Pelechano V. Improved computational analysis of ribosome dynamics from 5′P degradome data using fivepseq. NAR Genomics and Bioinformatics, 2:4, 2020.

Installation

Fivepseq works with Python versions <=3.8. With a newer version of Python you may run into problems with some dependencies.
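
The version constraint can be checked before installing. The following is a minimal sketch and not part of fivepseq: `python_version_ok` is a hypothetical helper name, and it applies the stated `<=3.8` bound literally to the major and minor version numbers.

```shell
# Hypothetical helper (not part of fivepseq): succeeds if a version string
# such as "3.8" or "3.7.12" is at most 3.8, the bound stated above.
python_version_ok() {
  major=${1%%.*}        # text before the first dot
  rest=${1#*.}          # text after the first dot
  minor=${rest%%.*}     # minor version, dropping any patch level
  [ "$major" -lt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -le 8 ]; }
}

# Example: test the interpreter that "python" resolves to, if there is one.
if command -v python >/dev/null 2>&1; then
  v=$(python -c 'import sys; print("%d.%d" % sys.version_info[:2])')
  python_version_ok "$v" || echo "Python $v is newer than 3.8" >&2
fi
```

If the check fails, a separate environment with an older interpreter (for example one created with conda) is one way to satisfy the requirement.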

Install dependencies:

To set up fivepseq, the following Python packages need to be installed manually beforehand using pip (if you don't have pip, you may install it as described here).

Paste the following lines into the shell terminal:

git clone https://github.com/joshuagryphon/plastid -b develop
cd plastid
python setup.py install
pip install --upgrade numpy==1.19.5 pysam==0.19.0 cython==0.29.28

To install fivepseq, clone the project from github:

git clone https://github.com/lilit-nersisyan/fivepseq.git
cd fivepseq
python setup.py install

To check if fivepseq was installed correctly, type the following in the command line:

fivepseq --version

This should display the currently installed version of fivepseq. To display the command-line arguments, type:

fivepseq --help

In order to enable exporting vector and portable image files, you'll also need to have phantomjs installed as follows:

conda install phantomjs selenium pillow

Running fivepseq

Fivepseq requires the following files to run:

Aligned reads (.bam)
Alignment index (.bai)
Genomic sequence file (.fasta / .fa)
Genomic annotation file (.gff/ .gtf)

This section assumes that you already have these files. If not, please refer to the section: Preprocessing from FASTQ files.

Fivepseq usage

The fivepseq --help command will show fivepseq usage and will list all the arguments.

usage: fivepseq -b B -g G -a A [optional arguments]

Required arguments

-b B   the full path to one or many bam/sam files (many files should be provided with a pattern, **within double quotes**: e.g. ["your_bam_folder/*.bam"])
-g G   the full path to the fa/fasta file
-a A   the full path to the gtf/gff/gff3 file

Note:

  • The indexed alignment files should be in the same directory as bam files, with the same name, with .bai extension added.
  • Multiple bam files should be indicated with a pattern placed within double quotes: e.g. ["your_bam_folder/*.bam"]
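
The index requirement in the note above can be checked before a run. This is a sketch under the naming rule stated there (the `.bai` extension appended to the bam file name); `missing_bam_indexes` is a hypothetical helper and not part of fivepseq.

```shell
# Hypothetical helper (not part of fivepseq): prints every given BAM path
# that has no "<path>.bai" index file next to it.
missing_bam_indexes() {
  for bam in "$@"; do
    if [ ! -f "$bam.bai" ]; then
      printf '%s\n' "$bam"
    fi
  done
}

# Example: missing_bam_indexes your_bam_folder/*.bam
# A missing index can be created with: samtools index <file.bam>
```

Here the pattern is left unquoted on purpose, because the helper needs the shell to expand it into individual file names.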

Commonly, you will run fivepseq by also providing the name of the output folder ('fivepseq' by default) and the title of your run (determined from bam path otherwise):

fivepseq \
   -g <path_to_genome_fasta> \
   -a <path_to_annotation> \
   -b <path_to_bam_file(s)> \
   -o <output_directory> \
   -t <title_of_the_run>

Note: this is a single command line; the backslashes only continue it onto the next line for readability. Either copy and paste it as shown or type it on a single line without the backslashes.
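
The double quotes around a multi-file bam pattern keep the shell from expanding it, so the pattern reaches the program as a single argument. The small demonstration below uses a hypothetical `count_args` function standing in for any command. It does not call fivepseq and creates two empty placeholder files in a temporary directory.

```shell
# Hypothetical stand-in for a command: reports how many arguments it received.
count_args() { echo "$#"; }

demo_dir=$(mktemp -d)
touch "$demo_dir/a.bam" "$demo_dir/b.bam"

quoted=$(count_args "$demo_dir/*.bam")   # pattern passed through as one argument
unquoted=$(count_args $demo_dir/*.bam)   # shell expands the pattern into file names first
echo "quoted: $quoted, unquoted: $unquoted"
```

With two matching files, the quoted form yields one argument and the unquoted form yields two, which is why the pattern given to -b should stay inside double quotes.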

Additional arguments

Type fivepseq --help to see the list of additional arguments. For a detailed description of available arguments, see the User guide at: https://fivepseq.readthedocs.io/en/latest/.

Preprocessing from FASTQ files

Fastq files need to be preprocessed and aligned to the reference genome before proceeding to fivepseq downstream analysis. Preprocessing proceeds with the following steps:

  • quality checks (with FASTQC and MULTIQC),
  • adapter and quality based trimming,
  • UMI extraction (if the library was generated with UMIs),
  • mapping to reference
  • read deduplication (if the library was generated with UMIs),
  • bedgraph generation to view 5'P count distribution in genome viewers

An example pre-processing pipeline can be found in the preprocess_scripts directory.

In order to run this pipeline, you need to have access to common bioinformatics software such as STAR, UMI-tools, bedtools, Samtools, FastQC, MultiQC and cutadapt.
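
Before running the pipeline, it may help to confirm that the tools are on the PATH. The sketch below assumes the executable names these tools are commonly installed under (`STAR`, `umi_tools`, `bedtools`, `samtools`, `fastqc`, `multiqc`, `cutadapt`); `missing_tools` is a hypothetical helper, and the names may differ on your system.

```shell
# Hypothetical helper: prints each given command name that is not on the PATH.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Assumed executable names for the software listed above.
missing_tools STAR umi_tools bedtools samtools fastqc multiqc cutadapt
```

An empty result means every listed command was found.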

To use it, navigate to the directory where the script is located and use the following command in the prompt:

./fivepseq_preprocess.sh -f [path to directory containing fastq files] -g [path to genome fasta] -a [path to annotation gff/gtf] -i [path to reference index, if exists] -o [output directory] -s [which steps to skip: either or combination of characters {cudqm} ]

The option -s specifies which steps of the pipeline you'd like to skip. Possible values are:

  • c skip trimming adapters with cutadapt
  • u skip UMI extraction
  • d skip deduplication after alignment
  • q skip initial quality check: FASTQC and MULTIQC
  • p skip post-processing quality check: FASTQC and MULTIQC
  • m skip mapping

You may use any combination of these characters, e.g. use -s cudqm to skip all.
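
A value for -s can be sanity-checked before launching a long run. The sketch below assumes that the accepted characters are exactly those in the list above (c, u, d, q, p, m); `valid_skip_steps` is a hypothetical helper and not part of the preprocessing script.

```shell
# Hypothetical helper: succeeds only if every character of the argument
# is one of the step letters listed above (assumed set: c u d q p m).
valid_skip_steps() {
  case "$1" in
    *[!cudqpm]*) return 1 ;;   # contains a character outside the listed set
    *) return 0 ;;             # only listed characters (or empty: skip nothing)
  esac
}

# Example: valid_skip_steps "cq" && echo "ok to pass -s cq"
```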

This script will produce sub-folders in the output directory, containing results of each step of the pipeline. The bam files will be generated in the align_dedup folder.

In addition to performing the steps described above, it also evaluates the distribution of reads across the genome, according to gene classes {"rRNA" "mRNA" "tRNA" "snoRNA" "snRNA" "ncRNA"}. These statistics are kept in the align_rna/rna_stats.txt file.

!!NOTE!! This example pipeline treats files as single-end libraries. If you have paired-end reads, you should only supply the first read (*_R1* files) to fivepseq.
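
For paired-end data, the first-read files can be picked out by the `*_R1*` naming pattern mentioned in the note. This is a minimal sketch; `list_r1_files` is a hypothetical helper, and it assumes your files follow that naming convention.

```shell
# Hypothetical helper: prints the files in a directory whose names match *_R1*.
list_r1_files() {
  for f in "$1"/*_R1*; do
    if [ -e "$f" ]; then       # skip the literal pattern when nothing matches
      printf '%s\n' "$f"
    fi
  done
}

# Example: list_r1_files path/to/fastq_dir
```

The listed files could then be copied or linked into a separate directory, so that only first reads are processed.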

Have fun!
