Merge pull request #5 from PedroBarbosa/dev_expl
Version 0.1.0
PedroBarbosa authored Apr 16, 2024
2 parents a9d79bb + c491a87 commit 4928171
Showing 946 changed files with 194,932 additions and 1,534 deletions.
48 changes: 48 additions & 0 deletions .github/workflows/run_tests.yml
@@ -0,0 +1,48 @@
name: pytest

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      max-parallel: 5
      fail-fast: false
      matrix:
        python-version: ["3.10", "3.11"]

    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: 'pip'
      - name: Add conda to system path
        run: |
          # $CONDA is an environment variable pointing to the root of the miniconda directory
          echo $CONDA/bin >> $GITHUB_PATH
      - name: Install dependencies
        run: |
          conda config --add channels conda-forge
          conda config --add channels bioconda
          conda install -y python=${{ matrix.python-version }} meme=5.5.5
          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
      - name: Lint with flake8
        run: |
          pip install flake8
          # stop the build if there are Python syntax errors or undefined names
          flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
          # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
          flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
      - name: Test with pytest
        run: |
          pip install pytest
          # Ignore tests that should use GPU
          pytest --ignore=tests/black_box_test.py --ignore=tests/prune_archive_test.py
4 changes: 1 addition & 3 deletions .gitignore
@@ -171,9 +171,7 @@ __pycache__/
build/
dist/

tests/motifs
data/cache/*
data/examples/explain
dress/datasetevaluation
dress/datasetexplanation
output*
.devcontainer
28 changes: 21 additions & 7 deletions README.md
@@ -1,19 +1,33 @@
[![pytest](https://github.com/PedroBarbosa/dress/actions/workflows/run_tests.yml/badge.svg)](https://github.com/PedroBarbosa/dress/actions/workflows/run_tests.yml)

# DRESS - Deep learning-based Resource for Exploring Splicing Signatures

A toolkit for generating synthetic datasets related to RNA splicing.

## Running example

For now, the package contains two commands:
- `generate` to generate synthetic datasets from a start sequence.
- `filter` to filter datasets by desired levels of PSI or dPSI.

- `filter` to filter datasets by desired levels of splice site probability, PSI or dPSI.

## Installation

Clone the repo, take care of dependencies with `conda` or `mamba` and install the package with `pip`:

```
git clone https://github.com/PedroBarbosa/dress.git
cd dress
conda env create -f conda_env.yml
conda activate dress
pip install .
```

## Running example

To run an evolutionary search with exon 6 of the FAS gene:

`dress generate data/examples/generate/raw_input/FAS_exon6/data.tsv`

To skip running the black box model (e.g., SpliceAI), run with `--dry_run`, which returns as fitness the proportional index (between 0 and 1) of the individual in the population.
The required transcript structure cache (from GENCODE v44) can be downloaded from [here](https://app.box.com/s/tbh293kqh1s9nbi624esl0c18maxuhss). Then, download the human genome hg38 (for example from [here](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/GRCh38.primary_assembly.genome.fa.gz)), uncompress it, and optionally simplify chromosome headers with `sed '/^>/s/ .*//'`. Finally, put both files in a single directory and pass it with `--cache_dir`. By default, this data is expected in `data/cache`.

The full list of argument options can be inspected with `dress generate --help` or by looking at the yaml configuration file at `dress/configs/generate.yaml`.
The full list of argument options can be inspected with `dress generate --help` or by looking at one of the pre-configured yaml files at `dress/configs/generate*`.

The required transcript structure cache (from GENCODE v44) can be downloaded from [here](https://app.box.com/s/tbh293kqh1s9nbi624esl0c18maxuhss). Then, download the human genome hg38 (for example from [here](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/GRCh38.primary_assembly.genome.fa.gz)), uncompress it, and optionally simplify chromosome headers with `sed '/^>/s/ .*//'`. Finally, put both files in a single directory and pass it with `--cache_dir`. By default, this data is expected in `data/cache`.
Full documentation and tutorials will be available soon.
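
For readers without `sed`, the optional header simplification mentioned in the README (`sed '/^>/s/ .*//'`) can be sketched in Python. This is only an illustration and not part of the package: the function names `simplify_fasta_header` and `simplify_fasta` are hypothetical.

```python
from typing import Iterable, Iterator


def simplify_fasta_header(line: str) -> str:
    """Mimic sed '/^>/s/ .*//': on header lines, drop everything from the first space on."""
    if not line.startswith(">"):
        return line  # sequence lines are passed through untouched
    body = line.rstrip("\n")
    newline = line[len(body):]  # keep the original line ending, if any
    return body.split(" ", 1)[0] + newline


def simplify_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Apply the header simplification lazily, so a whole genome is never held in memory."""
    for line in lines:
        yield simplify_fasta_header(line)


example = [">chr1 1\n", "ACGTNNNN\n", ">chr2 2\n", "TTGA\n"]
simplified = list(simplify_fasta(example))
```

Streaming line by line keeps memory use flat, which matters for a file the size of the hg38 primary assembly.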
17 changes: 6 additions & 11 deletions conda_env.yml
@@ -4,27 +4,22 @@ channels:
  - bioconda
dependencies:
  - python=3.10.12
  # - cudatoolkit=11.2.2
  # - cudnn=8.1.0
  - meme=5.5.3
  - pip=23.2.1
  - pip=24.0
  - meme=5.5.5
  - pip:
    - tensorflow==2.15.0.post1
    - spliceai==1.3.1
    - GeneticEngine==0.8.5
    - torch --index-url https://download.pytorch.org/whl/cu113
    - torch==2.2.0
    - git+https://github.com/tkzeng/Pangolin.git@5cf94b8
    - pandas==2.0.3
    - numpy==1.23.5
    - seaborn
    - pyranges
    - biopython
    - dna_features_viewer
    - statsmodels
    - pyranges==0.0.129
    - biopython==1.83
    - loguru
    - tqdm
    - rich_click
    - jsonschema
    - pyyaml
    - pandarallel
    - scikit-learn==1.4.0

Binary file not shown.
@@ -0,0 +1 @@
chr10 89010751 89010815 chr10:89010753-89010815 . + FAS ENST00000652046 ENSG00000026103 protein_coding
@@ -0,0 +1,2 @@
>chr10:89010439-89012181(+)_ENST00000652046
AAGATGAATAAAATGGCCCCTAATTTACAAAGTGCCATTGAAAATTATAAAGGAATTATTCTGCCAGGCTTTTGAATTTCTCCTGTATTTTTTTTTCTAGATGTGAACATGGAATCATCAAGGAATGCACACTCACCAGCAACACCAAGTGCAAAGAGGAAGGTAATTATTTTTTTACGGTTATATTCTCCTTTCCCCCAACCCCATGGAAAGATGTGAAGAAAAACCAATCACTCTTGATTACTAGAAAGTCCTTTATTTAATCTTAAAGATTGCTTATTTTCATATAAAATGTCCAATGTTCCAACCTACAGGATCCAGATCTAACTTGGGGTGGCTTTGTCTTCTTCTTTTGCCAATTCCACTAATTGTTTGGGGTAAGTTCTTGCTTTGTTCAAACTGCAGATTGAAATAACTTGGGAAGTAGTTCACAAAGATTTGCCTCATTCTTACCTATAAAAAGCTACCACTTTGGTAGATTTATGTATTGTTAATTTCTTGCCCCTGAATGCAGCCTTGAGAGCTGACTGATAAGAACAAATGAAATTATTCCTCAGCTAGTTTCTGAGCAACAGTTTTGGGGCATTGAGTGGTATTCTCATCCTTCCTATGAACAGGTGTTCTCTGCAGCAGCAGAATTGGCCAAAAATCAGAAGCAATTCTTCACTATTCATTGAGATCTCCCTATGCAAAAAGAGAACACAAGAAGCAAAGGCATTCCCAGGAAACACATTGCAGGGAACACTTTAAAAACTTGTACTTCACTGCCTCCTCTTCCTCGGCCTAATTGCTTGTTTTTAATTATTTCTCCTTCTTAACTTAAAATACTATGGGGACACATGTTATACAAAGGTGACTTAGTAGAGTCAGTAGAAAAGCCAAAATTAGATATTATCATAATTAGTCTAGAAAAATCCCTTTAAGTCATTCATCAACTACAGGGTCACACCAACTTTCAGTAACTTAGAAGTATTCAATTTTCCCTTCTCAGAACAATTATCTGTTTCTTCAGTTCAGTTGAAGAAGAAAGTTTGCCTTGCCTTTAGCGGTTGTTTAGCTGAAAATACATTTGGGATATTTAAGCACTGTAATTGTGCTCAGAGACATACAGATTCTTCTATCTCACATTGACTTTAATGCATACACCTATTGAGTATGTATGCTTGAGTTATTTGTGTGTGTATTTCATTTCTGGGCATCCATAGCAAGTTGATGTTGACTTGCTTGTCCTACGGCTTCTGCATCCTGCCATAGTCTTGCCGTCCACATCTTTGCTGGACAGAGAGTGGTGCTTGCCATATGGTAAGTCAAAAGCCATCTCCTTGCTAGGCCAGCCTGTGGTAATTAGATGACTAATTAAGATATGTCCTTTCACTAGAACACTTGACTTAGTAGTACGAAAGTTCCAAAATCAGCGGTCTCCTGCGATGTTTGGCCACTTTTAAGTTTCACTGAATTTCTCCTTTTTCCTTCTTATATTTCTCTTAGTGTGAAAGTATGTTCTCACATGCATTCTACAAGGCTGAGACCTGAGTTGATAAAATTTCTTTGTTCTTTCAGTGAAGAGAAAGGAAGTACAGAAAACATGCAGAAAGCACAGAAAGGAAAACCAAGGTTCTCATGAATCTCCAACTTTAAATCCTGTAGGTATTGAAATAGGTATCAGCTTTCCTTGAAAAGAAAAATAGAGAAATTAGTGATTTGGCTTTTTGTTACTTCCTTTTACTTTTTTGTTTCTTGTTT
@@ -0,0 +1,2 @@
header acceptor_idx donor_idx tx_id exon
chr10:89010439-89012181(+) 100;314;1560 161;376;1642 ENST00000652046 chr10:89010753-89010815
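
The row above can be read as offsets into the extracted sequence. The sketch below parses it and checks that the exon boundaries, expressed relative to the sequence start, appear among the listed indexes. The helper names are hypothetical, the fields are assumed to be tab-separated in the real file, and the offset convention has only been checked against this one plus-strand row.

```python
import re

# Matches "chr10:89010439-89012181(+)" as well as "chr10:89010753-89010815"
INTERVAL = re.compile(
    r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)(?:\((?P<strand>[+-])\))?$"
)


def parse_interval(text):
    """Split an interval string into (chrom, start, end, strand-or-None)."""
    match = INTERVAL.match(text)
    if match is None:
        raise ValueError(f"Unrecognised interval: {text!r}")
    return match["chrom"], int(match["start"]), int(match["end"]), match["strand"]


def parse_indexes(field):
    """Turn a ';'-separated index field into a list of ints."""
    return [int(value) for value in field.split(";")]


row = {
    "header": "chr10:89010439-89012181(+)",
    "acceptor_idx": "100;314;1560",
    "donor_idx": "161;376;1642",
    "tx_id": "ENST00000652046",
    "exon": "chr10:89010753-89010815",
}

_, seq_start, _, strand = parse_interval(row["header"])
_, exon_start, exon_end, _ = parse_interval(row["exon"])
acceptors = parse_indexes(row["acceptor_idx"])
donors = parse_indexes(row["donor_idx"])

# Offsets of the target exon relative to the first base of the sequence
exon_acceptor = exon_start - seq_start
exon_donor = exon_end - seq_start
```

For this row, `89010753 - 89010439 = 314` and `89010815 - 89010439 = 376`, which are the middle entries of `acceptor_idx` and `donor_idx`. Minus-strand rows may follow a different convention.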
2 changes: 2 additions & 0 deletions data/examples/generate/raw_input/large_seq/data.tsv
@@ -0,0 +1,2 @@
exon Strand gene_name transcript_id transcript_type dPSI
chr5:82058495-82058602 + ATG10 ENST00000282185 protein_coding -0.24
6 changes: 3 additions & 3 deletions dress/cli.py
@@ -6,13 +6,14 @@
from dress.datasetgeneration.validate_args import GENERATE_GROUP_OPTIONS
from dress.datasetfiltering.validate_args import FILTER_GROUP_OPTIONS
from dress.datasetgeneration import run as generate
from dress.datasetfiltering import run as filter
from dress.datasetfiltering import run as filtration

import os

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

click.rich_click.SHOW_METAVARS_COLUMN = False

click.rich_click.APPEND_METAVARS_HELP = True
click.rich_click.OPTION_GROUPS = {
    **GENERATE_GROUP_OPTIONS,
@@ -41,8 +42,7 @@ def cli():


cli.add_command(generate.generate, "generate")
cli.add_command(filter.filter, "filter")

cli.add_command(filtration.filter, "filter")

if __name__ == "__main__":
    cli()
@@ -1,64 +1,65 @@
---
# Mandatory arguments (input) should be provided as command line arguments to the script.
generate:
dry_run: false
verbosity: 0
shuffle_input:
seed: 0
model: 'spliceai'
model_scoring_metric: 'mean'
pangolin_mode: 'ss_usage'
model: spliceai
model_scoring_metric: mean
pangolin_mode: ss_usage
pangolin_tissue: null
disable_gpu: null
outdir: 'output'
disable_gpu: false
outdir: output
outbasename: null
preprocessing:
cache_dir: 'data/cache/'
genome: 'data/cache/Homo_sapiens.GRCh38.dna.primary_assembly.fa'
cache_dir: data/cache/
genome: data/cache/Homo_sapiens.GRCh38.dna.primary_assembly.fa
use_full_sequence: false
fitness:
minimize_fitness: false
fitness_function: 'bin_filler'
fitness_function: bin_filler
fitness_threshold: 0.0
archive:
archive_size: 5000
archive_diversity_metric: 'normalized_shannon'
prune_archive_individuals: false
prune_at_generations: null
archive_diversity_metric: normalized_shannon
prune_archive_individuals: true
prune_at_generations:
population:
population_size: 1000
individual:
individual_representation: 'tree_based'
population_size: 500
selection:
selection_method: 'tournament'
selection_method: tournament
tournament_size: 5
custom_mutation_operator: false
custom_mutation_operator_weight: 0.9
mutation_probability: 0.9
crossover_probability: 0.01
custom_mutation_operator: true
custom_mutation_operator_weight: 0.8
mutation_probability: 0.7
crossover_probability: 0.25
operators_weight:
- 0.6
- 0.8
elitism_weight:
- 0.05
- 0.0
novelty_weight:
- 0.35
- 0.1
update_weights_at_generation: null
stopping:
stopping_criterium:
- archive_size
- time
stop_at_value:
- 5000
- 30
- 5
stop_when_all: false
tracking_evolution:
disable_tracking: false
track_full_population: false
track_full_archive: false
grammar:
which_grammar: random
max_diff_units: 6
snv_weight: 0.33
insertion_weight: 0.33
deletion_weight: 0.33
snv_weight: 0.2
insertion_weight: 0.4
deletion_weight: 0.4
motif_substitution_weight: 0
motif_ablation_weight: 0
max_insertion_size: 5
max_deletion_size: 5
acceptor_untouched_range:
@@ -68,3 +69,9 @@ generate:
- -3
- 6
untouched_regions: null
motif_db: ATtRACT
motif_search: fimo
subset_rbps: encode
min_nucleotide_probability: 0.15
min_motif_length: 5
pvalue_threshold: 0.001
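
One way to read the grammar weights in this hunk is as relative sampling probabilities over edit types. The sketch below takes the 0.2 / 0.4 / 0.4 values listed above (with both motif weights at 0), normalises them, and draws edit types. How DRESS actually consumes these weights is not shown in this diff, so the function names and the sampling step are assumptions.

```python
import math
import random

# Values copied from the hunk above; the dict keys are shortened for illustration
grammar_weights = {
    "snv": 0.2,
    "insertion": 0.4,
    "deletion": 0.4,
    "motif_substitution": 0.0,
    "motif_ablation": 0.0,
}


def normalise(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("At least one weight must be positive")
    return {name: value / total for name, value in weights.items()}


def sample_edit_types(weights, n, seed=0):
    """Draw n edit types proportionally to their weights (assumed usage)."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=n)


probabilities = normalise(grammar_weights)
draws = sample_edit_types(probabilities, 1000)
```

Normalising first means a set such as 0.33 / 0.33 / 0.33, which sums to 0.99, behaves the same as exact thirds.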
77 changes: 77 additions & 0 deletions dress/configs/generate_binfiller_pwm_grammar.yaml
@@ -0,0 +1,77 @@
# Mandatory arguments (input) should be provided as command line arguments to the script.
generate:
dry_run: false
verbosity: 0
shuffle_input:
seed: 0
model: spliceai
model_scoring_metric: mean
pangolin_mode: ss_usage
pangolin_tissue: null
disable_gpu: false
outdir: outdir
outbasename:
preprocessing:
cache_dir: data/cache/
genome: data/cache/Homo_sapiens.GRCh38.dna.primary_assembly.fa
use_full_sequence: false
fitness:
minimize_fitness: false
fitness_function: bin_filler
fitness_threshold: 0.0
archive:
archive_size: 5000
archive_diversity_metric: normalized_shannon
prune_archive_individuals: true
prune_at_generations:
population:
population_size: 500
selection:
selection_method: tournament
tournament_size: 5
custom_mutation_operator: false
custom_mutation_operator_weight: 0.8
mutation_probability: 0.7
crossover_probability: 0.25
operators_weight:
- 0.8
elitism_weight:
- 0.0
novelty_weight:
- 0.1
update_weights_at_generation:
stopping:
stopping_criterium:
- archive_size
- time
stop_at_value:
- 5000
- 5
stop_when_all: false
tracking_evolution:
disable_tracking: false
track_full_population: false
track_full_archive: false
grammar:
which_grammar: motif_based
max_diff_units: 6
snv_weight: 0.1
insertion_weight: 0.25
deletion_weight: 0.25
motif_ablation_weight: 0.2
motif_substitution_weight: 0.2
max_insertion_size: 5
max_deletion_size: 5
acceptor_untouched_range:
- -10
- 2
donor_untouched_range:
- -3
- 6
untouched_regions:
motif_db: ATtRACT
motif_search: fimo
subset_rbps: encode
min_nucleotide_probability: 0.15
min_motif_length: 5
pvalue_threshold: 0.001
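
The `stopping` section pairs each entry of `stopping_criterium` with the value at the same position in `stop_at_value`, and `stop_when_all` suggests an any/all switch. A minimal sketch of that reading follows. The semantics are inferred from the key names only, the function `should_stop` is hypothetical, and the unit of `time` is not stated in this diff.

```python
def should_stop(criteria, thresholds, current, stop_when_all=False):
    """Return True when the configured stopping criteria are satisfied.

    criteria   -- names such as ["archive_size", "time"]
    thresholds -- values at matching positions, e.g. [5000, 5]
    current    -- mapping from criterion name to its current value
    """
    if len(criteria) != len(thresholds):
        raise ValueError("Each stopping criterion needs exactly one threshold")
    reached = [current[name] >= limit for name, limit in zip(criteria, thresholds)]
    return all(reached) if stop_when_all else any(reached)


criteria = ["archive_size", "time"]
thresholds = [5000, 5]

# Archive still small, but the time budget is used up
state = {"archive_size": 1200, "time": 5.2}
stop_any = should_stop(criteria, thresholds, state, stop_when_all=False)
stop_all = should_stop(criteria, thresholds, state, stop_when_all=True)
```

Under this reading, `stop_when_all: false` ends the run as soon as either the archive is full or the time budget is spent, whichever comes first.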