From 525365acd4ab5a76f38913e1e9b9f9dec000453e Mon Sep 17 00:00:00 2001 From: "pedro.barbosa" Date: Fri, 5 Jul 2024 14:23:33 +0100 Subject: [PATCH] Add info to README --- README.md | 35 ++++++++++++++++++++++++++--------- 1 file changed, 26 insertions(+), 9 deletions(-) mode change 100644 => 100755 README.md diff --git a/README.md b/README.md old mode 100644 new mode 100755 index 91b7738..d828701 --- a/README.md +++ b/README.md @@ -1,12 +1,27 @@ [![pytest](https://github.com/PedroBarbosa/dress/actions/workflows/run_tests.yml/badge.svg)](https://github.com/PedroBarbosa/dress/actions/workflows/run_tests.yml) -# DRESS - Deep learning-based Resource for Exploring Splicing Signatures +#### DRESS - Deep learning-based Resource for Exploring Splicing Signatures -A toolkit for generating synthetic datasets related to RNA splicing. +This software aims to provide a flexible framework for RNA splicing research guided by performant deep learning models. -As for now, the package contains two commands: - - `generate` to generate synthetic datasets from a start sequence. - - `filter` to filter datasets by desired levels of splice site probability, PSI or dPSI. +At its core, dress relies on genetic programming to +perform data augmentations that create diversity in the semantic space of a deep learning oracle (e.g., SpliceAI, Pangolin). The augmentations - created through perturbations on exon triplets - are constrained with [grammars](https://en.wikipedia.org/wiki/Context-free_grammar), which restrict perturbations based on knowledge about the splicing domain. The package is build on top of [GeneticEngine](https://github.com/alcides/GeneticEngine), a framework for Genetic Programming in Python. + +Importantly, the expressive power of these grammars enhances model interpretability. The genotype of individuals (synthetic sequences) in the genetic programming population is closely related to RNA splicing concepts. + + + +These ideas were accepted for publication at Genetic +and Evolutionary Computation Conference (GECCO'24). The preprint is available at [arXiv](https://arxiv.org/abs/2407.02984v1). The sotware includes an extensive set of options to control the evolutionary search based on the desired in silico experiments and problem specific requirements. For example, by default dress avoids perturbing splice site regions (within a given genomic window), but that can easily be changed to avoid perturbing entire exons, forcing the search to focus on upstream/downstream introns of alternatively spliced exons. + +The software can be used for several applications, including: +- Generate local datasets for explainableAI (e.g., train intrinsically interpretable models, like LIME). +- Generate synthetic sequences for feature attribution analysis and motif discovery (e.g., TF-MoDISCo). +- Ablation studies for studying alternative splicing. + +As for now, the package contains two modules: + - `generate` is the main command. It generates synthetic data from a start sequence using a grammar and a deep learning oracle. + - `filter` command filters sequences based on desired levels of splice site probability, PSI or dPSI. ## Installation @@ -22,12 +37,14 @@ pip install . ## Running example -To run an evolutionary search with exon 6 of FAS gene: +dress accepts bed or tabular (see input example [here](data/examples/generate/raw_input/FAS_exon6/data.tsv)) inputs, and requires a locally stored pre-computed exon cache, along with the human reference genome. -`dress generate data/examples/generate/raw_input/FAS_exon6/data.tsv` +First, download the necessary files. The required transcript structure cache (from GENCODE v44) can be downloaded from [here](https://app.box.com/s/tbh293kqh1s9nbi624esl0c18maxuhss). Then, download the human genome hg38 (for example from [here](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/GRCh38.primary_assembly.genome.fa.gz), uncompress it and optionally simplify chromosome headers with `sed '/^>/s/ .*//'`. Place both files in a single directory, which is given in `--cache_dir`. By default, it expects this data to be in `data/cache`. -The required transcript structure cache (from GENCODE v44) can be downloaded from [here](https://app.box.com/s/tbh293kqh1s9nbi624esl0c18maxuhss). Then, download the human genome hg38 (for example from [here](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/GRCh38.primary_assembly.genome.fa.gz), uncompress it and optionally simplify chromosome headers with `sed '/^>/s/ .*//'`. Then, put both files in a single directory, which is given in `--cache_dir`. By default, it expects this data to be in `data/cache`. +To run an evolutionary search with an input exon (for example, the exon 6 of FAS gene), we simply call: + +`dress generate data/examples/generate/raw_input/FAS_exon6/data.tsv` The full list of argument options can be inspected with `dress generate --help` or by looking at one of the pre-configured yaml files at `dress/configs/generate*`. -Full documentation and tutorials will be available soon. \ No newline at end of file +Full documentation and tutorials detailing the complete capabilities of this software will be available soon. \ No newline at end of file