This repository contains a number of useful scripts to analyze alternative splicing. Most of these scripts were used to process outputs of MISO. Therefore, these scripts are most useful if you have outputs from MISO to process. No guarantees using these scripts, but I hope you can glean something useful from them.
- Identify differentially spliced events between two groups of samples using t-test.
- Create bed files and fasta files from a list of MISO events.
- Generate heatmaps of PSI values either in pair-wise comparison or group comparison.
- Translate nucleotide sequences of cassette exons (or constitutive exons) to amino acid sequences
- Wrapper function to perform ANCHOR on exons to identify exons that may mediate protein-protein interactions.
- Wrapper function to perform MEME and TOMTOM analysis near genic regions of cassette exons to identify enriched motifs near alternatively spliced events and associate RNA binding proteins to these motifs.
- Perform conservation analysis of enriched motifs by calculating GERP scores of sites contributing to each motif.
- Visualize distribution of sites contributing to each enriched motif.
- Python 2.6 or 2.7
- SciPy
- NumPy
- Bio (for UniProt/SwissProt annotations)
- R 3.0.0 or R 3.0.1. May also work with R 2.14 and 2.15.
- ggplot2
- extrafont
- grid
- RColorBrewer
- gplots
cd dir_to_place_repository
git clone https://github.com/jakeyeung/alternative-splicing.git
- Perform t-test using
t_test_miso_output.py
. - Adjust p-values for multiple-test correction (Benjamini-Hochberg method) using
adjust_pvalues.R
- Append gene names to output file
Inputs for t_test_miso_output.py
--group1_file
: textfile containing sample names in group 1--group2_file
: textfile containing sample names in group 2--main_dir
: directory containing sample names (should match group 1 and group 2). Inside each directory contains MISO raw outputs (MISO events catalogued by chromosomes)--output_directory
: directory for output--output_filename
: name of output file--min_counts
: minimum junction read counts in a sample to be considered into a t-test. This value depends on the depth of your RNA-Seq data. Best practices suggestmin_counts
of 10 for a depth in the order of 50 million mapped read pairs
Example
cd alternative_splicing_scripts/miso_scripts
python t_test_miso_output.py --group1_file /path/to/group_1_samples.txt \
--group2_file /path/to/group_2_samples.txt \
--main_dir /path/containing/miso/outputs \
--output_directory output/path/file.txt \
--min_counts 10
- Input file
- Output file
Example
cd R
Rscript adjust_pvalues.R output_from_t_test.txt pval_adjusted_output_from_t_test.txt
Positional arguments for append_genenames.py
:
- Input file
- Annotations path (gff3 file containing MISO IDs and gene names). One can download the gff3 annotation file from the MISO website
- Output file
Example
cd alternative_splicing_scripts/miso_scripts
python append_gene_names_to_textfile.py pval_adjusted_file annotations.gff3 output
Scripts used to create bed files:
alternative_splicing_scripts/miso_scripts/extract_coordinates_from_miso_summary.py
alternative_splicing_scripts/miso_scripts/split_beds_into_inclusion_exclusion.py
Scripts and functions used to create fasta files:fastaFromBed
alternative_splicing_scripts/fasta_scripts/remove_chr_from_bed_file.py
(if human genome fasta file does not contain "chr" prefix in front of chromosome locations, you can remove them using this script.)
Quick and dirty bash script example gluing these scripts together: example_workflows/get_beds_fastas.full_pipeline.sh
Arguments from option flags:
--sample1_name
: name of sample 1--sample2_name
: name of sample 2
Positional arguments:
- MISO filtered output for pairwise comparison using Bayes Factor
- Output file
Example:
Rscript R/plot_heatmap_psi_values.R reshaped_miso_output.txt myfigure.eps
- Input file for
R/plot_heatmap_psi_values.R
should be the output after reshaping.
- MISO output file generated from groupwise comparison via t-test
- Filename containing sample names for group 1
- Filename containing sample names for group 2
- Output file
- MISO output in matrix format (reshaped)
- Output EPS file
- Translate nucleotide sequences (stored as
.fasta
format) to amino acid sequences. - Run ANCHOR on amino acid sequences.
- Plot enrichment of cassette exons compared to constitutive exons
Script: database_scripts/index_exon_info_to_pkl.py
Takes as input .gtf
file containing gene annotations from ENSEMBL and stores data as a python dictionary (.pkl
file)
Positional arguments:
.gtf
file of gene annotations from ENSEMBL.- Output for pickled python dictionary.
Script: database_scripts/index_unitprot_db.py
Positional arguments:
- Output pickle path
Other required arguments:
-f
: Path for.dat
file containing uniprot annotations from UniProt website.
Option flags:
- `-c --constitutive_exons: use this if your nucleotide sequences come from constitutive exons rather than cassette exons. --help for more information.
Positional arguments:
- Ensembl dictionary in pkl format (created from alternative_splicing_scripts/database_scripts/index_exon_info_to_pkl.py]
- Fasta files (nucleotides)
- 1 | 2 | 3. 2 extracts cassette exon. 1 extracts upstream exon. 3 extracts downstream exon.
- Output path
Option flags for motif_scripts/run_anchor_batch.py
:
-h, --help for more information
Positional arguments:
- Protein summary file from
database_scripts/create_dna_protein_summary_file.py
- Output directory
- Summary output file, placed inside output directory.
Example workflow for running ANCHOR on both inclusion and exclusion exons, see: example_workflows/run_anchor.sh
Script: alternative_splicing_scripts/motif_scripts/plot_anchor_results_comparisons.py
Option flags:
-h, --help for details
Usage:
plot_anchor_results_comparisons.py anchor_results1.txt anchor_results2.txt
- Index
.gtf
file and.dat
file from ENSEMBL and UniProt, respectively. - Create protein summary file.
- Annotate protein summary file with UniProt annotations.
Script: annotate_genes_with_swissprot.py
Outputs: creates a file of amino acids for each exon (if it found a match to the .gtf
file) and adds UniProt annotations. This allows one to understand the potential functions of protein segments encoded in exons.
Positional arguments:
- Protein summary file, created from
database_scripts/create_dna_protein_summary_file.py
. - Output file of proteins with UniProt/SwissProt annotations.
Required flags:
- -f Indexed file of UniProt database (i.e. output from
database_scripts/index_unitprot_db.py
).
Bash script that glues a number of python scripts together: example_workflows/meme_run_pipeline.full_pipeline.sh
Outputs: MEME results from the 14 inputs (see Positional Argument 3) and TOMTOM results summarizing the 14 inputs comparing discovered motifs with experimentally derived motifs from RNA binding proteins. The results can be found in main_dir/motif_outputs/meme_run1/summary/tomtom.summary
by default.
Positional arguments:
- Random seed - set to 0 (I found the choice of random seed no longer matters, since we do not randomly take a subset of rows, rather take the entire set of constitutive exons).
- Evalue - motif with a minimum E-value before MEME stops searching for additional motifs. Play around with this value (10^-5, 10^-10). Note: I found that just because MEME stops searching for additional motifs because the motif it just found was above minimum E-value, it does not mean that there does not exist any more motifs wtih E-values below the threshold. Therefore, sometimes setting E-value to something lower (e.g. 10^1) and then manually filtering out motifs that match the E-values may be less prone to false negatives.
- Main directory containing 3 directories:
fasta_files_100bp/fasta
,fasta_files_100bp/fasta_shuffled
andmotif_outputs
.fasta_files_100bp/fasta
contains 14.fasta
files (i.e. inputs to MEME discovery):- exon_1_inclusion.fasta, exon_1_exclusion.fasta
- exon_2_inclusion.fasta, exon_2_exclusion.fasta
- exon_3_inclusion.fasta, exon_3_exclusion.fasta
- intron_1_5p_inclusion.fasta, intron_1_5p_inclusion.fasta
- intron_1_3p_inclusion.fasta, intron_1_3p_inclusion.fasta
- intron_2_5p_inclusion.fasta, intron_2_5p_inclusion.fasta
- intron_3_3p_inclusion.fasta, intron_3_3p_inclusion.fasta
fasta_files_100bp/fasta_shuffled
contains the same *.fasta files as infasta_files_100bp/fasta
but shuffled using MEME toolfasta-shuffle-letters
. Names are identical.motif_outputs
: does not need anything inside. Output from MEME will be found here in directory namedmeme_run_seed_$MYSEED_evalue_$MYEVALUE
.
- Path to MEME database. Format should conform with MEME.
- Calculate GERP scores for discovered motifs using the motif discovery pipeline. To compare GERP scores for a null distribution, use
-n, --null_mode
inmotif_scripts/summarize_meme_results.py
- Plot GERP scores from discovered motifs against the null to see if there is a statistically significant enrichment.
Bash script gluing python scripts together: example_workflows/get_gerp_scores.sh
Outputs: Creates a MEME summary file that incorporates GERP conservation scores with discovered motifs. Of interest are:
.pkl
files with GERP information (e.g. may be namedsummary.gerp_updated.pkl
), which are used for plotting the GERP scores..summary
files with GERP information, basically the.pkl
file converted into a text file which can be opened in a spreadsheet or text editor.
Positional arguments:
- Main directory, full directory of motif outputs (e.g. if my MEME outputs are in
main_dir/motif_outputs/meme_run1
) then main directory ismain_dir/motif_outputs
) - Name of directory for motif outputs (e.g., if MEME outputs are in
main_dir/motif_outputs/meme_run1
, then set tomeme_run1
) - Filename containing TOMTOM summary (e.g., by default it would be the file
tomtom.summary
found inmain_dir/motif_outputs/meme_run1/summary
- Directory containing GERP base-wise conservation scores (i.e. *.maf.rates for chr1, chr2, chr3... chr22, chrX, chrY). They can be downloaded here(6.3GB!)
To calculate GERP scores for a null set, set the following option flags motif_scripts/summarize_meme_results
:
*-n, --null_mode
calculates a null distribution from discovered motifs.
Script: motif_scripts/plot_anchor_results_comparison.py
Positional arguments:
- GERP scores from discovered motifs:
.pkl
file with GERP information created fromexample_workflows/get_gerp_scores.sh
. - Null GERP scores:
.pkl
file with GERP information created fromexample_workflows/get_gerp_scores.sh
with-n, --null_mode
flag inmotif_scripts/summarize_meme_results.py
.
Script: alternative_splicing_scripts/motif_scripts
Positional argumemnts:
- Pickle file from
alternative_splicing_scripts/motif_scripts/summarize_meme_results.py
(e.g. may be namedsummary.gerp_updated.pkl
, see here)
Outputs:
Displays a plot for MEME positions.