Latest commit

c014bc8 · May 14, 2024

59 lines (47 loc) · 3.38 KB

README.md

Preparing splicing-related models

A simple utility that processes VCF files and generates correctly formatted input for a set of models that require sequence information (or that require a specific format to run as a web-based application).

Motivation

Many methods are available that predict splicing-related information (e.g. splice sites, branchpoints, splicing regulatory elements). Since they were not originally designed to predict the effect of genetic variants, it is not straightforward to use them for that task. This tool simplifies the process: it generates reference and mutated sequences from VCF files in the format each model expects, so that several models can be run on all variants at once, and it provides utilities to process the model output and generate a VCF with a final score (usually the score of the mutated allele minus the score of the reference allele).
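The idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used by this tool: the function names, the `context`/`offset` arguments and the example scores are all hypothetical, and only SNV-like substitutions are considered.

```python
# Illustrative sketch only: names and arguments are hypothetical,
# not part of the vcf2seq implementation.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def ref_and_mut_sequences(context, offset, ref, alt, strand="+"):
    """Build the reference and mutated sequences for one variant.

    context: plus-strand genomic sequence surrounding the variant
    offset:  0-based position of the variant inside `context`
    strand:  strand of the affected transcript ("+" or "-")
    """
    if context[offset:offset + len(ref)].upper() != ref.upper():
        raise ValueError("REF allele does not match the reference sequence")
    mutated = context[:offset] + alt + context[offset + len(ref):]
    if strand == "-":
        # Models expect the sequence as read on the transcript strand.
        return reverse_complement(context), reverse_complement(mutated)
    return context, mutated


def delta_score(ref_score, mut_score):
    """Final score: mutated allele score minus reference allele score."""
    return mut_score - ref_score


ref_seq, mut_seq = ref_and_mut_sequences("AAGGTAAGT", 3, "G", "A", strand="+")
print(ref_seq, mut_seq)
# Hypothetical model scores for the two sequences:
print(delta_score(ref_score=8.5, mut_score=2.0))
```

A negative difference here would mean the model scores the mutated sequence lower than the reference; how to interpret the sign depends on the model.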

Requirements

  • Variants must be annotated with Ensembl VEP so that strand information can be retrieved (and therefore the proper sequence context of each variant can be extracted).
  • Processing scripts expect chromosome names without the chr prefix (e.g. 1 rather than chr1).
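If your VCF uses chr-prefixed chromosome names, the prefix can be removed before running the tool. The snippet below is a minimal sketch using only the Python standard library; `strip_chr_prefix` is a hypothetical helper, not part of this package, and it does not handle other naming differences (such as chrM versus MT) or make the reference genome match.

```python
# Hypothetical helper: rewrites one VCF line so chromosome names lose "chr".
def strip_chr_prefix(line):
    """Remove a leading 'chr' from the chromosome name of a VCF line."""
    contig_tag = "##contig=<ID=chr"
    if line.startswith(contig_tag):
        # Header contig definitions, e.g. ##contig=<ID=chr1,length=...>
        return line.replace(contig_tag, "##contig=<ID=", 1)
    if line.startswith("#"):
        # Other header lines are left untouched.
        return line
    if line.startswith("chr"):
        # Data lines start with the CHROM column.
        return line[3:]
    return line


example = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=chr1,length=248956422>",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t12345\t.\tA\tG",
]
for vcf_line in example:
    print(strip_chr_prefix(vcf_line))
```

For compressed input, the same function could be applied line by line to a file opened with `gzip.open(path, "rt")`.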

Citation

If you find this package useful in any way, please consider citing the work for which it was developed:
Computational prediction of human deep intronic variation

Installation

git clone https://github.com/PedroBarbosa/Prepare_SplicingPredictors.git
cd Prepare_SplicingPredictors
conda env create --file conda_environment.yaml 
conda activate prepareSplicingTools
pip install .

Running

To run this utility, call vcf2seq and select the models you want to generate input for (check the available options with vcf2seq --help).

vcf2seq input.vcf.gz reference_genome.fa outbasename --maxentscan --splicerover ...

For models that predict splice sites, you may need to set the splice site flag (--ss donor or --ss acceptor). Each model folder (under the src folder in this repo) contains instructions on how to run that model and a script (get_mutation_effects.py) that processes the output and generates a VCF with the predictions.

Note: Do not change the FASTA headers of the generated sequences, since the get_mutation_effects.py scripts rely on the original names for proper processing.
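One way to guard against accidental header changes is to compare the headers of the generated FASTA with those of the file actually submitted to a model. The sketch below assumes plain-text FASTA input; the helper names and the example header strings are hypothetical and do not reflect the naming scheme used by this tool.

```python
# Hypothetical sanity check: confirm FASTA headers were not modified.
def fasta_headers(lines):
    """Return the header names (without '>') from an iterable of FASTA lines."""
    return [line[1:].strip() for line in lines if line.startswith(">")]


def headers_unchanged(original_lines, submitted_lines):
    """True if both FASTA inputs carry the same headers in the same order."""
    return fasta_headers(original_lines) == fasta_headers(submitted_lines)


# Header names below are made up for illustration.
original = [">variant_1_REF", "AAGGTAAGT", ">variant_1_MUT", "AAGATAAGT"]
edited = [">seq1", "AAGGTAAGT", ">seq2", "AAGATAAGT"]

print(headers_unchanged(original, original))
print(headers_unchanged(original, edited))
```

With real files, each argument could be an open file handle, since the functions only iterate over lines.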

Supported models

General methods

Splice site prediction

Splicing regulatory elements

Branchpoint signals

Limitations

For most models, only single-nucleotide variants (SNVs) are supported.

Contact

pbarbosa@lasige.di.fc.ul.pt