Merge pull request #142 from jyaacoub/jyaacoub-patch-1
Update README.md
Showing 1 changed file with 55 additions and 47 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,47 +1,55 @@ | ||
# MutDTA [![Python 3.10](https://img.shields.io/badge/python-3.10-blue.svg)](https://www.python.org/downloads/release/python-31011/) ![Python Tests](https://github.com/jyaacoub/MutDTA/actions/workflows/python-app.yml/badge.svg?branch=main) | ||
Improving the precision oncology pipeline by providing binding affinity perturbation predictions for a priori identified cancer driver genes. | ||
|
||
# Current Progress | ||
- [ ] Data preprocessing | ||
- [x] PDBbind - simple enough to use. Protein sequences were downloaded from UniProt. | ||
- [x] PLATINUM - Working on getting mutated sequence data (done with mutalyzer) | ||
- [ ] KIBA and Davis - Kinase proteins are not super relevant but could be useful for pretraining since datasets are limited. | ||
- [ ] GENIE - This will require some physical docking methods since we have no binding affinity data for this. | ||
- [x] Docking baseline | ||
- [x] Set up Docking on cluster | ||
- [x] Build scripts to automate ligand and protein prep (including grid for binding site). | ||
- [x] Run docking on PDBbind dataset | ||
- [ ] Model baseline | ||
- [ ] DGraphDTA | ||
- [x] Evaluate pretrained model on PDBbind dataset | ||
- [x] Train model on *refined-set* PDBbind dataset and evaluate | ||
- [ ] Train model on *general* PDBbind dataset and evaluate | ||
|
||
# AutoDock Vina Procedure | ||
See [README/VINA_PROCEDURE.md](./docs/VINA_PROCEDURE.MD) for detailed steps | ||
|
||
## Contribution | ||
Try to follow [conventional commits](https://gist.github.com/Zekfad/f51cb06ac76e2457f11c80ed705c95a3). | ||
|
||
### Quick examples | ||
* `feat: new feature` | ||
* `fix(scope): bug in scope` | ||
* `feat!: breaking change` / `feat(scope)!: rework API` | ||
* `chore(deps): update dependencies` | ||
|
||
### Commit types | ||
* `build`: Changes that affect the build system or external dependencies (example scopes: gulp, broccoli, npm) | ||
* `ci`: Changes to CI configuration files and scripts (example scopes: Travis, Circle, BrowserStack, SauceLabs) | ||
* **`chore`: Changes that don't change source code or tests, e.g. changes to the build process, auxiliary tools, libraries** | ||
* `docs`: Documentation only changes | ||
* **`feat`: A new feature** | ||
* **`fix`: A bug fix** | ||
* `perf`: A code change that improves performance | ||
* `refactor`: A code change that neither fixes a bug nor adds a feature | ||
* `revert`: Revert something | ||
* `style`: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc) | ||
* `test`: Adding missing tests or correcting existing tests | ||
|
||
### Reminders | ||
* Put a blank line before the extended commit body | ||
* More details at **[conventionalcommits.org](https://www.conventionalcommits.org/)** | ||
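
The conventions above can be checked mechanically. The following is a minimal sketch, not part of this repository: `check_commit_message` and `HEADER_RE` are hypothetical names, and the pattern only covers the commit types listed in this README plus the blank-line reminder.

```python
import re

# Commit types listed in this README.
COMMIT_TYPES = (
    "build", "ci", "chore", "docs", "feat", "fix",
    "perf", "refactor", "revert", "style", "test",
)

# type, optional "(scope)", optional "!" for breaking changes, then ": description".
HEADER_RE = re.compile(
    r"^(?P<type>" + "|".join(COMMIT_TYPES) + r")"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def check_commit_message(message: str) -> list[str]:
    """Return a list of problems found in a commit message (empty if none)."""
    lines = message.splitlines()
    problems = []
    if not lines or not HEADER_RE.match(lines[0]):
        problems.append("header does not match 'type(scope)!: description'")
    # An extended body must be separated from the header by a blank line.
    if len(lines) > 1 and lines[1].strip() != "":
        problems.append("missing blank line between header and body")
    return problems
```

For example, `check_commit_message("feat(scope)!: rework API")` returns an empty list, while a message such as `update readme` is reported as having a malformed header.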
# H4H Directory structure and where to find things: | ||
|
||
## H4H: | ||
|
||
Everything important (e.g.: model weights) should be accessible through the shared project directory. | ||
|
||
I primarily used my home directory to store the code since I could sync it with GitHub, whereas the shared project directory has no internet access for `git pull`/`git push` operations. | ||
|
||
Thus, for files/folders that were too large to store in HOME, I used symbolic links to folders located in the shared project directory. However, I forget exactly how I laid everything out, and I no longer have access to the VPN to connect and check. | ||
|
||
Nonetheless, all the important stuff we would need, like model checkpoints, should be stored in the shared project directory. | ||
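
As an illustration of the symlink layout described above, here is a minimal sketch. The actual H4H paths are not recorded here, so every path and the helper names `link_to_shared` and `demo` are hypothetical stand-ins; the demo runs entirely inside a temporary directory.

```python
import tempfile
from pathlib import Path


def link_to_shared(target: Path, link: Path) -> Path:
    """Create `link` as a symbolic link pointing at the directory `target`."""
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink():
        link.unlink()  # replace a stale link rather than failing
    link.symlink_to(target, target_is_directory=True)
    return link


def demo() -> tuple[bool, str]:
    """Build a throwaway layout in a temp dir and read a file through the link."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for a large folder in the shared project directory.
        shared = Path(tmp) / "shared_project" / "model_checkpoints"
        shared.mkdir(parents=True)
        (shared / "model.pt").write_text("weights placeholder")

        # Hypothetical stand-in for the code checkout in HOME.
        link = link_to_shared(shared, Path(tmp) / "home" / "MutDTA" / "checkpoints")
        return link.is_symlink(), (link / "model.pt").read_text()
```

The code in HOME then sees the large folder at a normal relative path, while the data itself stays in the shared project directory.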
|
||
## GitHub \- [https://github.com/jyaacoub/MutDTA/tree/main](https://github.com/jyaacoub/MutDTA/tree/main) | ||
|
||
[Training splits](https://github.com/jyaacoub/MutDTA/tree/main/splits) can be found on the GitHub page as well as all my most recent code. | ||
|
||
# GitHub issues | ||
|
||
All the issues we encountered with this project are tracked via GitHub. I list some of the more relevant issues below: | ||
|
||
## Summary of model checkpoints/issues (found in [MutDTA/results/](https://github.com/jyaacoub/MutDTA/tree/main/results)): | ||
|
||
Basically the only ones that matter are `results/model_checkpoints` and `v103`. The rest are just some tests I did to resolve/debug issues. | ||
|
||
* [`results/model_checkpoints`](https://github.com/jyaacoub/MutDTA/tree/main/results) \- These are the models trained on *random splits* | ||
* [`v103`](https://github.com/jyaacoub/MutDTA/issues/103) \- **pocket-only** representation checkpoints | ||
* [`v113`](https://github.com/jyaacoub/MutDTA/issues/113) \- new training split where we excluded highly targeted (OncoKB) proteins from training. | ||
* This leads to consistently worse performance across the board. | ||
* [`v115`](https://github.com/jyaacoub/MutDTA/issues/115) \- since the "aflow" (AlphaFlow edge weights) models had a smaller dataset (due to memory issues when running AlphaFlow on sequences of 1200+ amino acids), we *artificially reduced the sizes of the training sets* for the other models so that we could have a fair comparison | ||
* This didn't change much. | ||
* [`v128`](https://github.com/jyaacoub/MutDTA/issues/128) \- Test to see if new splits were the issue (they were) | ||
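
For the `v115` comparison, the list above does not say how the training sets were reduced. The sketch below assumes plain random subsampling to the size of the smallest set, with a fixed seed for reproducibility; `downsample_to_common_size` and the sample IDs are hypothetical and not taken from the repository.

```python
import random


def downsample_to_common_size(
    train_sets: dict[str, list[str]], seed: int = 0
) -> dict[str, list[str]]:
    """Randomly subsample every training set to the size of the smallest one."""
    target = min(len(ids) for ids in train_sets.values())
    rng = random.Random(seed)  # fixed seed so the reduced sets are reproducible
    return {
        name: sorted(rng.sample(ids, target))
        for name, ids in train_sets.items()
    }
```

An alternative that may be closer to a like-for-like comparison is to restrict every model to exactly the samples available to the smallest dataset; which of the two was used is tracked in the linked issue, not here.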
|
||
## OncoKB distribution drift issue with splits \- [Issue \#131](https://github.com/jyaacoub/MutDTA/issues/131) | ||
|
||
When we originally started looking into OncoKB, I selected highly targeted proteins from OncoKB to be excluded from training sets. | ||
|
||
- This caused a big distribution drift issue and resulted in much worse performance, particularly with PDBbind. | ||
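
To make the exclusion-based split concrete, here is a minimal sketch on toy data. It assumes the excluded proteins end up in the test set and uses the difference in mean affinity as a crude drift indicator; the tuple layout and the names `split_by_protein` and `mean_affinity_gap` are hypothetical, and the repository's own split code may differ.

```python
from statistics import mean

# (protein_id, ligand_id, affinity); a hypothetical record layout.
Sample = tuple[str, str, float]


def split_by_protein(
    samples: list[Sample], held_out: set[str]
) -> tuple[list[Sample], list[Sample]]:
    """Keep held-out proteins out of training; here they form the test set."""
    train = [s for s in samples if s[0] not in held_out]
    test = [s for s in samples if s[0] in held_out]
    return train, test


def mean_affinity_gap(train: list[Sample], test: list[Sample]) -> float:
    """Difference between mean test and mean train affinity."""
    return mean(s[2] for s in test) - mean(s[2] for s in train)
```

A gap far from zero on real data would be one sign that the held-out proteins are not distributed like the training set, which is the kind of drift described above.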
|
||
Stats on the distribution differences between the manually curated OncoKB dataset split and a random split can be found on the [issue page](https://github.com/jyaacoub/MutDTA/issues/131#issuecomment-2276366754). | ||
|
||
- Click the *details* button to see figures. | ||
|
||
## Missing Amino Acids in PDBs for PDBbind \- [Issue\#102](https://github.com/jyaacoub/MutDTA/issues/102) | ||
|
||
Some of the PDB files provided with PDBbind are missing amino acids. This means that for the pocket versions of our models we can't readily use existing scripts to get the pocket sequence graph from those PDBs. | ||
|
||
- It is possible to fix this, but it needs a LOT of effort since we would also need to retrain the PDBbind models that used graphs with the missing residues. | ||
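
One rough way to spot such files is to look for jumps in residue numbering. The sketch below is a heuristic under stated assumptions, not the project's script: `find_residue_gaps` is a hypothetical helper that reads fixed-column `ATOM` records, looks only at C-alpha atoms of one chain, and ignores insertion codes, so structures with legitimately discontinuous numbering will also be flagged.

```python
def find_residue_gaps(pdb_text: str, chain: str = "A") -> list[tuple[int, int]]:
    """Return (previous, next) residue-number pairs where the numbering skips."""
    seen: list[int] = []
    for line in pdb_text.splitlines():
        # Fixed-column PDB format: atom name in columns 13-16,
        # chain ID in column 22, residue number in columns 23-26.
        if len(line) < 26 or not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA" or line[21] != chain:
            continue
        res_num = int(line[22:26])
        if not seen or seen[-1] != res_num:
            seen.append(res_num)
    # A step larger than 1 between consecutive residues suggests missing ones.
    return [(a, b) for a, b in zip(seen, seen[1:]) if b - a > 1]
```

Calling it on the text of a PDB file returns an empty list when the chain is numbered contiguously, and one pair per break otherwise.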
|
||
## Pocket representation version of our models \- [Issue\#103](https://github.com/jyaacoub/MutDTA/issues/103) | ||
|
||
This tracks how the pocket representation of Davis and Kiba models was built. The [pull request 135](https://github.com/jyaacoub/MutDTA/pull/135) resolves this with the results in the [CSV files](https://github.com/jyaacoub/MutDTA/pull/135/files#diff-470793793283a1e1b2c3c5055749ddb946413c66b5581a70bb502db544660642). | ||
|
||