Merge pull request #142 from jyaacoub/jyaacoub-patch-1
Update README.md
jyaacoub authored Oct 16, 2024
2 parents 5d511fd + ef59d06 commit b4f3d60
Showing 1 changed file with 55 additions and 47 deletions.
102 changes: 55 additions & 47 deletions README.md
@@ -1,47 +1,55 @@
# MutDTA [![Python 3.10](https://img.shields.io/badge/python-3.10-blue.svg)](https://www.python.org/downloads/release/python-31011/) ![Python Tests](https://github.com/jyaacoub/MutDTA/actions/workflows/python-app.yml/badge.svg?branch=main)
Improving the precision oncology pipeline by providing binding affinity perturbation predictions on a priori identified cancer driver genes.

# Current Progress
- [ ] Data preprocessing
  - [x] PDBbind - simple enough to use. Protein sequences were downloaded from UniProt.
  - [x] PLATINUM - Working on getting mutated sequence data (done with mutalyzer)
  - [ ] KIBA and Davis - Kinase proteins are not super relevant but could be useful for pretraining since datasets are limited.
  - [ ] GENIE - This will require some physical docking methods since we have no binding affinity data for this.
- [x] Docking baseline
  - [x] Set up Docking on cluster
  - [x] Build scripts to automate ligand and protein prep (including grid for binding site).
  - [x] Run docking on PDBbind dataset
- [ ] Model baseline
  - [ ] DGraphDTA
    - [x] Evaluate pretrained model on PDBbind dataset
    - [x] Train model on *refined-set* PDBbind dataset and evaluate
    - [ ] Train model on *general* PDBbind dataset and evaluate

# AutoDock Vina Procedure
See [README/VINA_PROCEDURE.md](./docs/VINA_PROCEDURE.MD) for detailed steps

## Contribution
Try to follow [conventional commits](https://gist.github.com/Zekfad/f51cb06ac76e2457f11c80ed705c95a3).

### Quick examples
* `feat: new feature`
* `fix(scope): bug in scope`
* `feat!: breaking change` / `feat(scope)!: rework API`
* `chore(deps): update dependencies`
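
A commit header in this style can be checked mechanically. The following is a minimal sketch, assuming the header shape `type(scope)!: description` shown in the examples above and the types listed under "Commit types"; the function name `parse_header` and the exact regular expression are illustrative and not part of this repository.

```python
import re

# Commit types taken from the "Commit types" list in this README.
COMMIT_TYPES = (
    "build", "ci", "chore", "docs", "feat", "fix",
    "perf", "refactor", "revert", "style", "test",
)

# type, optional "(scope)", optional "!" for breaking changes, then ": description"
HEADER_RE = re.compile(
    r"^(?P<type>" + "|".join(COMMIT_TYPES) + r")"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def parse_header(header: str):
    """Return the header's parts as a dict, or None if it does not match."""
    match = HEADER_RE.match(header)
    if match is None:
        return None
    parts = match.groupdict()
    # Convert the optional "!" marker into a boolean flag.
    parts["breaking"] = parts["breaking"] is not None
    return parts


if __name__ == "__main__":
    for header in ("feat: new feature", "feat(scope)!: rework API", "update readme"):
        print(header, "->", parse_header(header))
```

Only the first line of a commit message is checked here; the body and footers are out of scope for this sketch.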

### Commit types
* `build`: Changes that affect the build system or external dependencies (example scopes: gulp, broccoli, npm)
* `ci`: Changes to CI configuration files and scripts (example scopes: Travis, Circle, BrowserStack, SauceLabs)
* **`chore`: Changes that don't modify source code or tests, e.g. changes to the build process, auxiliary tools, or libraries**
* `docs`: Documentation only changes
* **`feat`: A new feature**
* **`fix`: A bug fix**
* `perf`: A code change that improves performance
* `refactor`: A code change that neither fixes a bug nor adds a feature
* `revert`: Revert something
* `style`: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc)
* `test`: Adding missing tests or correcting existing tests

### Reminders
* Put a blank line before the extended commit body
* More details at **[conventionalcommits.org](https://www.conventionalcommits.org/)**
# H4H Directory structure and where to find things:

## H4H:

Everything important (e.g.: model weights) should be accessible through the shared project directory.

I primarily kept the code in my home directory because I could sync it with GitHub from there, whereas the shared project directory has no internet access for `git pull`/`git push` operations.

Thus, for files/folders that were too large to store in HOME, I used symbolic links to folders located in the shared project directory. However, I forget exactly how I laid everything out and I no longer have access to the VPN to connect and check.

Nonetheless, all the important stuff we would need, like model checkpoints, should be stored in the shared project directory.
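
The symlink arrangement described above can be sketched roughly as follows. This is only an illustration: the actual layout on H4H is not recorded here, so the directory names (`home/MutDTA`, `project_shared`, `model_checkpoints`) and the helper `link_large_dir` are placeholders, and the demo runs in a temporary directory.

```python
import tempfile
from pathlib import Path


def link_large_dir(home_repo: Path, shared_project: Path, name: str) -> Path:
    """Keep `name` in the shared project directory and symlink it into the repo."""
    target = shared_project / name
    target.mkdir(parents=True, exist_ok=True)
    link = home_repo / name
    # Only create the link if nothing (link or real folder) is already there.
    if not (link.is_symlink() or link.exists()):
        link.symlink_to(target, target_is_directory=True)
    return link


if __name__ == "__main__":
    # Throwaway demo; on the cluster these would be the real HOME checkout
    # and the shared project directory (both paths here are placeholders).
    with tempfile.TemporaryDirectory() as tmp:
        home_repo = Path(tmp) / "home" / "MutDTA"
        shared = Path(tmp) / "project_shared"
        home_repo.mkdir(parents=True)
        link = link_large_dir(home_repo, shared, "model_checkpoints")
        (link / "example.txt").write_text("stored on the shared directory")
        print(link.is_symlink(), (shared / "model_checkpoints" / "example.txt").exists())
```

Anything written through the link inside the code checkout ends up in the shared project directory, which is why the checkpoints should still be found there.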

## GitHub \- [https://github.com/jyaacoub/MutDTA/tree/main](https://github.com/jyaacoub/MutDTA/tree/main)

The [training splits](https://github.com/jyaacoub/MutDTA/tree/main/splits), along with all my most recent code, can be found on GitHub.

# GitHub issues

All the issues we encountered with this project are tracked via GitHub. I list some of the more relevant issues below:

## Summary of model checkpoints/issues (found in [MutDTA/results/](https://github.com/jyaacoub/MutDTA/tree/main/results)):

Basically the only ones that matter are `results/model_checkpoints` and `v103`. The rest are just some tests I did to resolve/debug issues.

* [`results/model_checkpoints`](https://github.com/jyaacoub/MutDTA/tree/main/results) \- These are the models trained on *random splits*
* [`v103`](https://github.com/jyaacoub/MutDTA/issues/103) \- **pocket-only** representation checkpoints
* [`v113`](https://github.com/jyaacoub/MutDTA/issues/113) \- new training split where we excluded highly targeted (OncoKB) proteins from training.
  * This leads to consistently worse performance across the board.
* [`v115`](https://github.com/jyaacoub/MutDTA/issues/115) \- because the "aflow" (AlphaFlow edge weights) models had a smaller dataset (due to memory issues when running AlphaFlow on amino-acid sequences of 1200+ residues), we *artificially reduced the sizes of the training sets* for the other models so that we could have a fair comparison.
  * This didn't change much.
* [`v128`](https://github.com/jyaacoub/MutDTA/issues/128) \- Test to see if new splits were the issue (they were)
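
To illustrate the idea behind `v115` (shrinking the other training sets so they match the smaller "aflow" set), here is a minimal sketch. It assumes plain random subsampling down to the smallest set; the procedure actually used may differ (see the linked issue), and the function name and sample IDs are made up for illustration.

```python
import random


def downsample_to_smallest(train_sets: dict, seed: int = 0) -> dict:
    """Randomly subsample every training set to the size of the smallest one."""
    target_size = min(len(ids) for ids in train_sets.values())
    rng = random.Random(seed)  # fixed seed so the subsets are reproducible
    return {
        name: sorted(rng.sample(sorted(ids), target_size))
        for name, ids in train_sets.items()
    }


if __name__ == "__main__":
    # Hypothetical sample IDs; the real IDs live in the split files.
    sets = {
        "aflow": ["s1", "s2", "s3"],
        "baseline": ["s1", "s2", "s3", "s4", "s5"],
    }
    for name, ids in downsample_to_smallest(sets).items():
        print(name, len(ids), ids)
```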

## OncoKB distribution drift issue with splits \- [Issue \#131](https://github.com/jyaacoub/MutDTA/issues/131)

When we originally started looking into OncoKB, I selected highly targeted proteins from OncoKB to be excluded from training sets.

- This caused a big distribution drift issue and resulted in much worse performance, particularly with PDBbind.

Stats on the distribution differences between the manually curated OncoKB dataset split and a random split can be found on the [issue page](https://github.com/jyaacoub/MutDTA/issues/131#issuecomment-2276366754).

- Click the *details* button to see figures.
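
As a rough illustration of the kind of split discussed in this section, the sketch below holds out a set of proteins from training and reports a crude drift signal (the difference in mean label between the two sides). The record fields `protein_id` and `affinity`, the function names, and the example values are all hypothetical; the actual statistics are the ones on the issue page.

```python
from statistics import mean


def split_by_protein(records, held_out_proteins):
    """Split records into (train, test) so held-out proteins never appear in train."""
    held_out = set(held_out_proteins)
    train = [r for r in records if r["protein_id"] not in held_out]
    test = [r for r in records if r["protein_id"] in held_out]
    return train, test


def label_shift(train, test, key="affinity"):
    """Mean label of test minus mean label of train; far from zero suggests drift."""
    return mean(r[key] for r in test) - mean(r[key] for r in train)


if __name__ == "__main__":
    # Made-up records purely for demonstration.
    records = [
        {"protein_id": "P1", "affinity": 5.0},
        {"protein_id": "P1", "affinity": 6.0},
        {"protein_id": "P2", "affinity": 7.0},
    ]
    train, test = split_by_protein(records, ["P2"])
    print(len(train), len(test), label_shift(train, test))
```

A random split would be expected to give a shift close to zero, which is one quick way to see why a curated hold-out can behave so differently.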

## Missing Amino Acids in PDBs for PDBbind \- [Issue\#102](https://github.com/jyaacoub/MutDTA/issues/102)

Because some of the provided PDB files are missing amino acids, we can’t readily use existing scripts to get the pocket sequence graph from those PDBs for the pocket versions of our models.

- It is possible to fix this, but it needs a LOT of effort since we would also need to retrain the PDBbind models that used graphs with the missing residues.
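
One quick way to spot structures that are likely affected is to look for jumps in residue numbering. The sketch below is only a heuristic and is not one of the project's scripts: it reads the chain ID and residue number from the fixed columns of `ATOM` records, and numbering can jump for reasons other than missing residues (PDB files list unobserved residues in `REMARK 465`). The function name `find_numbering_gaps` is made up for illustration.

```python
def find_numbering_gaps(pdb_lines):
    """Return (chain, last_seen, next_seen) for each jump in residue numbering."""
    gaps = []
    last_seen = {}  # chain ID -> last residue number seen on that chain
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        chain = line[21]            # column 22: chain identifier
        res_num = int(line[22:26])  # columns 23-26: residue sequence number
        prev = last_seen.get(chain)
        if prev is not None and res_num > prev + 1:
            gaps.append((chain, prev, res_num))
        last_seen[chain] = res_num
    return gaps


if __name__ == "__main__":
    # Three C-alpha atoms on chain A; residues 3 and 4 are absent.
    lines = [
        "ATOM      1  CA  ALA A   1",
        "ATOM      2  CA  GLY A   2",
        "ATOM      3  CA  SER A   5",
    ]
    print(find_numbering_gaps(lines))
```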

## Pocket representation version of our models \- [Issue\#103](https://github.com/jyaacoub/MutDTA/issues/103)

This tracks how the pocket representation of the Davis and KIBA models was built. [Pull request 135](https://github.com/jyaacoub/MutDTA/pull/135) resolves this; the results are in the [CSV files](https://github.com/jyaacoub/MutDTA/pull/135/files#diff-470793793283a1e1b2c3c5055749ddb946413c66b5581a70bb502db544660642).
