A simple utility to process VCF files and generate correctly formatted input for a set of models that require sequence information (or that require a specific input format to run through a web-based application).
Many available methods predict splicing-related information (e.g. splice sites, branchpoints, splicing regulatory elements). Since they were not originally designed to predict the effect of genetic variants, it is not straightforward to use these models for that task. This tool simplifies the process: it generates reference and mutated sequences from VCF files in the proper format to run several models for all variants at once, and it contains utilities to process the output and generate a VCF with a final score (usually the mutated allele score minus the reference allele score).
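To make the idea concrete, here is a minimal Python sketch of building a reference and a mutated sequence for one variant and computing the final score as a difference. It is not the code used by this package: the function names, the `offset` convention, and the way the `strand` argument is handled are all assumptions made for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(complement)[::-1]


def ref_and_mut_sequences(context: str, offset: int, ref: str, alt: str, strand: str = "+"):
    """Build reference and mutated sequences for one variant (hypothetical helper).

    context: forward-strand genomic sequence surrounding the variant
    offset:  0-based index in `context` where the REF allele starts
    """
    # Sanity check: the REF allele must match the reference context.
    if context[offset:offset + len(ref)].upper() != ref.upper():
        raise ValueError("REF allele does not match the reference sequence")
    mutated = context[:offset] + alt + context[offset + len(ref):]
    if strand == "-":
        # Assumption: minus-strand genes are scored on the reverse complement.
        return reverse_complement(context), reverse_complement(mutated)
    return context, mutated


def delta_score(ref_score: float, mut_score: float) -> float:
    """Variant effect expressed as mutated score minus reference score."""
    return mut_score - ref_score


ref_seq, mut_seq = ref_and_mut_sequences("ACGTACGTA", 4, "A", "G", strand="+")
print(ref_seq, mut_seq)
```

Each model would then be run on both sequences, and `delta_score` applied to the two predictions for the same variant.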
- The variants should be annotated with Ensembl VEP so that strand information can be retrieved (and therefore the proper sequence context of the variant can be extracted).
- Processing scripts do not expect chromosome notation to contain the `chr` string.
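If your VCF uses `chr`-prefixed chromosome names, one option is to strip the prefix before running the tool. The helper below is a minimal sketch of that step, not part of this package; the function name is hypothetical, and it assumes uncompressed, tab-separated VCF lines read as text.

```python
def strip_chr_prefix(vcf_line: str) -> str:
    """Remove a leading 'chr' from the CHROM column of a VCF line (hypothetical helper)."""
    if vcf_line.startswith("##contig=<ID=chr"):
        # Keep contig header lines consistent with the renamed records.
        return vcf_line.replace("##contig=<ID=chr", "##contig=<ID=", 1)
    if vcf_line.startswith("#"):
        # All other header lines are left untouched.
        return vcf_line
    if vcf_line.startswith("chr"):
        # CHROM is the first column, so the prefix is the first three characters.
        return vcf_line[3:]
    return vcf_line


print(strip_chr_prefix("chr1\t100\t.\tA\tG"))
```

The chromosome names in the reference FASTA need to follow the same notation as the VCF. Note that this simple rule turns `chrM` into `M`, whereas Ensembl-style references name the mitochondrial genome `MT`.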
If you find this package useful in any way, please consider citing the work for which it was developed:
Computational prediction of human deep intronic variation
git clone https://github.com/PedroBarbosa/Prepare_SplicingPredictors.git
cd Prepare_SplicingPredictors
conda env create --file conda_environment.yaml
conda activate prepareSplicingTools
pip install .
To run this utility, call `vcf2seq` and select the models you want to generate input for (check the available options with `vcf2seq --help`).
vcf2seq input.vcf.gz reference_genome.fa outbasename --maxentscan --splicerover ...
For models that predict splice sites, it may be necessary to set the splice site flag (`--ss donor`, `--ss acceptor`).
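When the tool is called from a pipeline, it can help to assemble the command programmatically. The sketch below only builds and prints the command line; it uses the positional arguments and flags shown above (`--maxentscan`, `--splicerover`, `--ss`) and nothing else. The helper name `build_vcf2seq_command` is hypothetical, and whether a given model needs `--ss` should be checked with `vcf2seq --help`.

```python
import shlex


def build_vcf2seq_command(vcf, fasta, outbasename, model_flags, splice_site=None):
    """Assemble a vcf2seq command line as a list of arguments (hypothetical helper)."""
    cmd = ["vcf2seq", vcf, fasta, outbasename]
    cmd.extend(model_flags)
    if splice_site is not None:
        if splice_site not in ("donor", "acceptor"):
            raise ValueError("splice_site must be 'donor' or 'acceptor'")
        cmd.extend(["--ss", splice_site])
    return cmd


cmd = build_vcf2seq_command(
    "input.vcf.gz", "reference_genome.fa", "outbasename",
    ["--maxentscan"], splice_site="donor",
)
# Print the command for inspection; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
print(shlex.join(cmd))
```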
Then, within each model folder (`src` folder in this repo), there are instructions on how to run each model and a script (`get_mutation_effects.py`) to process the output and generate a VCF with the predictions.
Note: Do not change the FASTA headers of the generated sequences, since the `get_mutation_effects.py` scripts require the original names for proper processing.
For most models, only single-nucleotide variants (SNVs) are supported.
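Because of this restriction, it may be worth checking which records are SNVs before generating input. The following is a minimal sketch of such a check on the REF and ALT fields of a VCF record; `is_snv` is a hypothetical helper, not a function provided by this package.

```python
def is_snv(ref: str, alt: str) -> bool:
    """Return True when REF and ALT are single, different nucleotides."""
    bases = {"A", "C", "G", "T"}
    ref, alt = ref.upper(), alt.upper()
    # Indels, multi-allelic ALT fields ("G,T") and symbolic alleles all fail this test.
    return ref in bases and alt in bases and ref != alt


print(is_snv("A", "G"), is_snv("A", "AT"))
```

Multi-allelic records would need to be split into one record per ALT allele before applying this check.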