Fivepseq is a software package for analysis of 5′ endpoint distribution in RNA degradome sequencing datasets.
The homepage is hosted on the Pelechano lab website at http://pelechanolab.com/software/fivepseq/.
Below is a quick manual to get you started. For detailed instructions and explanations on fivepseq output, please see the user guide at: https://fivepseq.readthedocs.io/en/latest/.
Fivepseq works with Python versions <=3.8. If you have a newer version of Python, you may run into problems with some dependencies.
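A quick way to check the interpreter before installing is sketched below. It is only an illustration: the helper name python_is_supported and the constant MAX_SUPPORTED are hypothetical and not part of fivepseq; the 3.8 limit is taken from the statement above.

```python
import sys

# Highest Python version the text above says fivepseq works with.
MAX_SUPPORTED = (3, 8)


def python_is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version is at most 3.8."""
    return tuple(version_info[:2]) <= MAX_SUPPORTED


if __name__ == "__main__":
    current = "{}.{}".format(*sys.version_info[:2])
    if python_is_supported():
        print("Python " + current + " is within the supported range.")
    else:
        print("Python " + current + " is newer than 3.8; some dependencies may fail to install.")
```

If the check fails, a separate environment with an older interpreter is one way to proceed.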
Install dependencies:
To set up fivepseq, the following Python packages need to be pre-installed manually using pip (if you don't have pip, install it first).
Paste the following lines into the shell terminal:
git clone https://github.com/joshuagryphon/plastid -b develop
cd plastid
python setup.py install
pip install --upgrade numpy==1.19.5 pysam==0.19.0 cython==0.29.28
To install fivepseq, clone the project from github:
git clone https://github.com/lilit-nersisyan/fivepseq.git
cd fivepseq
python setup.py install
To check if fivepseq was installed correctly, type the following in the command line:
fivepseq --version
This should display the currently installed version of fivepseq. To display commandline arguments you may type:
fivepseq --help
In order to enable exporting vector and portable image files, you'll also need to have phantomjs installed as follows:
conda install phantomjs selenium pillow
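Fivepseq requires the following files to run:
- alignment file(s) in bam/sam format, together with their index (.bai) files,
- the genome sequence file (fa/fasta),
- the annotation file (gtf/gff/gff3).
This section assumes that you already have these files. If not, please refer to the section "Preparing data".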
The fivepseq --help command will show fivepseq usage and list all the arguments.
usage: fivepseq -b B -g G -a A [optional arguments]
-b B the full path to one or many bam/sam files (many files should be provided with a pattern, **within double quotes**: e.g. ["your_bam_folder/*.bam"])
-g G the full path to the fa/fasta file
-a A the full path to the gtf/gff/gff3 file
Note:
- The indexed alignment files should be in the same directory as bam files, with the same name, with .bai extension added.
- Multiple bam files should be indicated with a pattern placed within double quotes: e.g. ["your_bam_folder/*.bam"]
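The first note can be verified before a run with a small helper such as the sketch below. The function name missing_bam_indexes is hypothetical (not part of fivepseq), and it assumes the index naming described in the note, i.e. sample.bam is accompanied by sample.bam.bai in the same directory.

```python
import glob
import os


def missing_bam_indexes(bam_pattern):
    """Return the bam files matching the pattern that have no '<file>.bai' next to them."""
    missing = []
    for bam_path in sorted(glob.glob(bam_pattern)):
        # The index is expected under the same name with '.bai' appended.
        if not os.path.isfile(bam_path + ".bai"):
            missing.append(bam_path)
    return missing


if __name__ == "__main__":
    # Placeholder pattern; replace with your own bam folder.
    for path in missing_bam_indexes("your_bam_folder/*.bam"):
        print("No index found for " + path)
```

An empty result means every matched bam file has an index beside it.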
Commonly, you will run fivepseq by also providing the name of the output folder ('fivepseq' by default) and the title of your run (determined from bam path otherwise):
fivepseq \
-g <path_to_genome_fasta> \
-a <path_to_annotation> \
-b <path_to_bam_file(s)> \
-o <output_directory> \
-t <title_of_the_run>
Note: this is a single command line; the backslashes only continue it onto new lines for readability. Either copy-paste it as shown or write it on a single line without the backslashes.
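If you launch fivepseq from a Python script, the same command can be assembled as an argument list. This is a minimal sketch: build_fivepseq_command is a hypothetical helper, the file paths are placeholders, and only the -g, -a, -b, -o and -t options shown above are used.

```python
import shlex


def build_fivepseq_command(genome, annotation, bam, output_dir=None, title=None):
    """Assemble the fivepseq call as a list of arguments (hypothetical helper)."""
    cmd = ["fivepseq", "-g", genome, "-a", annotation, "-b", bam]
    # -o and -t are optional; fivepseq falls back to its defaults without them.
    if output_dir is not None:
        cmd += ["-o", output_dir]
    if title is not None:
        cmd += ["-t", title]
    return cmd


if __name__ == "__main__":
    # Placeholder paths, for illustration only.
    cmd = build_fivepseq_command(
        "genome.fa", "annotation.gff", "your_bam_folder/*.bam",
        output_dir="fivepseq_out", title="my_run",
    )
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the list to subprocess.run(cmd, check=True) would start the run. Because no shell is involved, the bam pattern reaches fivepseq unexpanded, which is the same effect the double quotes have on the command line.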
Type fivepseq --help to see the list of additional arguments. For a detailed description of available arguments, see the User guide at: https://fivepseq.readthedocs.io/en/latest/.
Fastq files need to be preprocessed and aligned to the reference genome before proceeding to fivepseq downstream analysis. Preprocessing proceeds with the following steps:
- quality checks (with FASTQC and MULTIQC),
- adapter and quality based trimming,
- UMI extraction (if the library was generated with UMIs),
- mapping to the reference genome,
- read deduplication (if the library was generated with UMIs),
- bedgraph generation to view 5'P count distribution in genome viewers.
An example of a pre-processing pipeline can be found in the preprocess_scripts directory.
In order to run this pipeline, you need to have access to common bioinformatics software such as STAR, UMI-tools, bedtools, Samtools, FastQC, MultiQC and cutadapt.
To use it, navigate to the directory where the script is located and use the following command in the prompt:
./fivepseq_preprocess.sh -f [path to directory containing fastq files] -g [path to genome fasta] -a [path to annotation gff/gtf] -i [path to reference index, if exists] -o [output directory] -s [which steps to skip: either or combination of characters {cudqm} ]
The option -s specifies which steps of the pipeline you'd like to skip. Possible values are:
- c skip trimming adapters with cutadapt
- u skip UMI extraction
- d skip deduplication after alignment
- q skip initial quality check: FASTQC and MULTIQC
- p skip post-processing quality check: FASTQC and MULTIQC
- m skip mapping
You may use any combination of these characters, e.g. -s cudqm skips adapter trimming, UMI extraction, deduplication, the initial quality check and mapping.
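The value passed to -s can be sanity-checked before launching the script. The sketch below is an illustration only: describe_skipped_steps and SKIP_STEPS are hypothetical names, and the accepted characters are assumed to be exactly those in the list above (including p, which the usage line does not show).

```python
# Step characters as listed above; 'p' is assumed to be accepted
# although the usage line only mentions {cudqm}.
SKIP_STEPS = {
    "c": "adapter trimming with cutadapt",
    "u": "UMI extraction",
    "d": "deduplication after alignment",
    "q": "initial quality check (FastQC and MultiQC)",
    "p": "post-processing quality check (FastQC and MultiQC)",
    "m": "mapping",
}


def describe_skipped_steps(skip_value):
    """Return descriptions of the steps a given -s value would skip."""
    unknown = sorted(set(skip_value) - set(SKIP_STEPS))
    if unknown:
        raise ValueError("unknown step character(s): " + "".join(unknown))
    # Report in the order of the list above, regardless of input order.
    return [SKIP_STEPS[ch] for ch in SKIP_STEPS if ch in skip_value]


if __name__ == "__main__":
    for step in describe_skipped_steps("cu"):
        print("will skip: " + step)
```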
This script will produce sub-folders in the output directory, containing results of each step of the pipeline. The bam files will be generated in the align_dedup folder.
In addition to performing the steps described above, the script also evaluates the distribution of reads across the genome according to gene classes {"rRNA" "mRNA" "tRNA" "snoRNA" "snRNA" "ncRNA"}. These statistics are kept in the align_rna/rna_stats.txt file.
!!NOTE!! This example pipeline treats files as single-end libraries. If you have paired-end reads, you should only supply the first read (*_R1* files) to fivepseq.
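To pick out the first-read files from a directory of paired-end fastq files, something like the following could be used. It is a sketch under the assumption that file names follow the *_R1* convention mentioned in the note; first_read_files is a hypothetical helper, not part of fivepseq or the preprocessing script.

```python
import fnmatch
import os


def first_read_files(fastq_dir):
    """Return sorted paths of files in fastq_dir whose names match *_R1*."""
    return sorted(
        os.path.join(fastq_dir, name)
        for name in os.listdir(fastq_dir)
        if fnmatch.fnmatch(name, "*_R1*")
    )


if __name__ == "__main__":
    # Placeholder directory; replace with your own fastq folder.
    fastq_dir = "your_fastq_folder"
    if os.path.isdir(fastq_dir):
        for path in first_read_files(fastq_dir):
            print(path)
```

The selected files can then be copied or linked into a separate directory to be used as the fastq input.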
Have fun!