Consolidate all results #147

Open
jyaacoub opened this issue Nov 13, 2024 · 3 comments
jyaacoub commented Nov 13, 2024

image

code

import logging
from matplotlib import pyplot as plt

from src.analysis.figures import prepare_df, fig_combined, custom_fig

dft = prepare_df('./results/model_media/model_stats.csv')
dfv = prepare_df('./results/model_media/model_stats_val.csv')

models = {
    'DG': ('nomsa', 'binary', 'original', 'binary'),
    'esm': ('ESM', 'binary', 'original', 'binary'), # esm model
    'aflow': ('nomsa', 'aflow', 'original', 'binary'),
    # 'gvpP': ('gvp', 'binary', 'original', 'binary'),
    'gvpL': ('nomsa', 'binary', 'gvp', 'binary'),
    # 'aflow_ring3': ('nomsa', 'aflow_ring3', 'original', 'binary'),
    'gvpL_aflow': ('nomsa', 'aflow', 'gvp', 'binary'),
    # 'gvpl_esm':('ESM', 'binary', 'gvp', 'binary'),
    # 'gvpL_aflow_rng3': ('nomsa', 'aflow_ring3', 'gvp', 'binary'),
    #GVPL_ESMM_davis3D_nomsaF_aflowE_48B_0.00010636872718329864LR_0.23282479481785903D_2000E_gvpLF_binaryLE
    # 'gvpl_esm_aflow': ('ESM', 'aflow', 'gvp', 'binary'),
}

fig, axes = fig_combined(dft, datasets=['davis', 'kiba', 'PDBbind'], fig_callable=custom_fig,
            models=models, metrics=['cindex', 'mse'],
            fig_scale=(10,5), add_stats=True, title_postfix=" test set performance", box=True)
plt.xticks(rotation=45)

# fig, axes = fig_combined(dfv, datasets=['davis'], fig_callable=custom_fig,
#             models=models, metrics=['cindex', 'mse'],
#             fig_scale=(10,5), add_stats=True, title_postfix=" validation set performance", box=True, fold_labels=True)
# plt.xticks(rotation=45)

Final models - these are the ones we will show in the paper.
    models = {
        'DG': ('nomsa', 'binary', 'original', 'binary'),
        'esm': ('ESM', 'binary', 'original', 'binary'), # esm model
        'aflow': ('nomsa', 'aflow', 'original', 'binary'),
        # 'gvpP': ('gvp', 'binary', 'original', 'binary'),
        'gvpL': ('nomsa', 'binary', 'gvp', 'binary'),
        # 'aflow_ring3': ('nomsa', 'aflow_ring3', 'original', 'binary'),
        'gvpL_aflow': ('nomsa', 'aflow', 'gvp', 'binary'),
        # 'gvpl_esm':('ESM', 'binary', 'gvp', 'binary'),
        # 'gvpL_aflow_rng3': ('nomsa', 'aflow_ring3', 'gvp', 'binary'),
        #GVPL_ESMM_davis3D_nomsaF_aflowE_48B_0.00010636872718329864LR_0.23282479481785903D_2000E_gvpLF_binaryLE
        # 'gvpl_esm_aflow': ('ESM', 'aflow', 'gvp', 'binary'),
    }
jyaacoub added a commit that referenced this issue Dec 1, 2024
jyaacoub commented Dec 1, 2024

FIG 1

DATASET INFO

TABLE COUNTS

FULL TABLE COUNTS:

 | Dataset   |   Protein |   Compound |  Total Binding Entities |
 |-----------|-----------|------------|-------------------------|
 | davis     |       442 |         68 |                   30056 |
 | kiba      |       229 |       2111 |                  118254 |
 | pdbbind   |      3889 |      12639 |                   19443 |

USED TABLE COUNTS:

Due to memory limitations, some records were excluded from our runs. These are the counts that were actually used.

 | Dataset   |   Protein |   Compound |  Total Binding Entities |
 |-----------|-----------|------------|-------------------------|
 | davis     |       439 |         68 |                   29852 |
 | kiba      |       226 |       2111 |                  117590 |
 | pdbbind   |      3785 |      10950 |                   16265 |
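
To make the gap between the full and used tables explicit, here is a small sketch that computes how many binding entities were dropped per dataset. The numbers are copied from the two tables above; the helper name `excluded_summary` is only for illustration and is not part of the repository.

```python
# (full, used) "Total Binding Entities" counts copied from the tables above.
counts = {
    "davis": (30056, 29852),
    "kiba": (118254, 117590),
    "pdbbind": (19443, 16265),
}


def excluded_summary(full: int, used: int) -> tuple[int, float]:
    """Return (records dropped, percentage of the full set that was dropped)."""
    dropped = full - used
    return dropped, 100.0 * dropped / full


for name, (full, used) in counts.items():
    dropped, pct = excluded_summary(full, used)
    print(f"{name:8s} dropped {dropped:5d} of {full:6d} ({pct:.2f}%)")
```

PDBbind loses by far the largest share of its records, which is worth keeping in mind when comparing results across datasets.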

SEQUENCE LENGTH DISTRIBUTION

Non-overlaid, non-normalized plot

image

Normalized and overlaid plot

image

MODEL RESULTS

All (except pocket) results - 2x3 - MSE and Cindex

image

Stratified with pocket results

jyaacoub added a commit that referenced this issue Dec 1, 2024
Still need pocket versions to be plotted.
jyaacoub added a commit that referenced this issue Dec 1, 2024

jyaacoub commented Dec 2, 2024

FIG 3 - Platinum Dataset

DATASET INFO

TABLE COUNTS

	Unique protein sequence counts: 860
	            Unique protein IDs: 361
	          Unique ligand counts: 197
	                 Total records: 1962

Distribution for the number of mutations per protein

image

pkd distributions (stratified by # of mutations)

image
image

Model results

Stratified figures below are better at showing the model results #147 (comment)

Raw predictive performance

This plot shows how well the model predicts the pkd directly, given only the protein sequence and ligand SMILES.

  • Davis is terrible - no better than random
  • Kiba is better
  • PDBbind is best with cindex > 0.70
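
For reference when reading the cindex values: a concordance index of 0.5 is what a random ranking would score, and 1.0 is a perfect ranking. The following is a minimal pairwise sketch of the metric, not necessarily the implementation used in this repository (tie handling in particular is an assumption here).

```python
from itertools import combinations


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order.

    Pairs with equal true values are skipped as not comparable;
    pairs with tied predictions count as half concordant (an assumption).
    """
    concordant = 0.0
    comparable = 0
    for (t_i, p_i), (t_j, p_j) in combinations(zip(y_true, y_pred), 2):
        if t_i == t_j:
            continue  # no true ordering to recover for this pair
        comparable += 1
        if p_i == p_j:
            concordant += 0.5
        elif (p_i > p_j) == (t_i > t_j):
            concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy values for illustration only, not taken from the Platinum results.
print(concordance_index([5.0, 6.5, 7.2], [5.1, 6.0, 8.0]))
```

This version is quadratic in the number of records, which is fine for a dataset the size of Platinum (1962 records).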

image

Delta predictive performance

Instead of looking at absolute predictive performance, this plot shows how well the model predicts the delta in pkd between a mutated and an unmutated sequence.

  • Here all 3 are terrible - no better than random.
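
A rough sketch of how the delta comparison can be set up: each mutated record is paired with the wildtype record for the same protein and ligand, and the true and predicted differences are collected. The record layout (`protein_id`, `ligand`, `mutation`, `true_pkd`, `pred_pkd`) is hypothetical and chosen for illustration; the actual analysis code may organise this differently.

```python
def mutation_deltas(records):
    """Return (true_deltas, pred_deltas), each computed as mutant minus wildtype.

    `records` is a list of dicts with the hypothetical keys protein_id, ligand,
    mutation (None for wildtype), true_pkd and pred_pkd.
    """
    wildtype = {
        (r["protein_id"], r["ligand"]): r
        for r in records
        if r["mutation"] is None
    }
    true_deltas, pred_deltas = [], []
    for r in records:
        if r["mutation"] is None:
            continue
        wt = wildtype.get((r["protein_id"], r["ligand"]))
        if wt is None:
            continue  # no wildtype measurement to compare against
        true_deltas.append(r["true_pkd"] - wt["true_pkd"])
        pred_deltas.append(r["pred_pkd"] - wt["pred_pkd"])
    return true_deltas, pred_deltas


# Made-up values for illustration only.
example = [
    {"protein_id": "P1", "ligand": "L1", "mutation": None, "true_pkd": 7.0, "pred_pkd": 6.5},
    {"protein_id": "P1", "ligand": "L1", "mutation": "Q546K", "true_pkd": 6.0, "pred_pkd": 6.4},
]
print(mutation_deltas(example))
```

The two returned lists can then be scored with the same metrics (cindex, MSE) used for the raw predictions.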

image

jyaacoub added a commit that referenced this issue Dec 2, 2024
jyaacoub added a commit that referenced this issue Jan 14, 2025
…edictions #147

aflow models will only have one PDB file per protein, and it should be tied to the pid. This is a problem for some proteins because, for non-pdbbind proteins, we grab those files with `f.startswith(pid)`, which matches the wrong file when one pid is a prefix of another (e.g. PIK3CA.pdb and PIK3CA(Q546K).pdb).

#147
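
One way to avoid the prefix collision described above is to compare the file stem against the pid exactly rather than using `startswith`. This is only a sketch: the function name `find_pdb_for_pid` is hypothetical and the actual fix in the commit may differ.

```python
from pathlib import Path


def find_pdb_for_pid(filenames, pid):
    """Return the single .pdb file whose stem is exactly `pid`."""
    matches = [
        f for f in filenames
        if Path(f).suffix == ".pdb" and Path(f).stem == pid
    ]
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one PDB file for {pid!r}, found {len(matches)}"
        )
    return matches[0]


files = ["PIK3CA.pdb", "PIK3CA(Q546K).pdb"]

# The prefix test matches both files for the wildtype pid ...
print([f for f in files if f.startswith("PIK3CA")])
# ... while the exact stem comparison picks out one file per pid.
print(find_pdb_for_pid(files, "PIK3CA"))
print(find_pdb_for_pid(files, "PIK3CA(Q546K)"))
```

Raising when there is not exactly one match also surfaces missing structures early instead of silently using the wrong file.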
jyaacoub added a commit that referenced this issue Jan 15, 2025
Get mutations in pocket and out of pocket for stratified figures to show if there is a difference.

#147
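
A possible way to split mutations into in-pocket and out-of-pocket groups, assuming mutations are written like `Q546K` (original residue, position, new residue) and that the pocket is available as a set of residue positions. Both the helper names and the pocket positions below are hypothetical.

```python
import re

# Single-letter residue, 1-based position, single-letter residue (e.g. Q546K).
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def mutation_position(mutation: str) -> int:
    """Extract the residue position from a mutation string such as 'Q546K'."""
    match = MUTATION_RE.match(mutation)
    if match is None:
        raise ValueError(f"unrecognised mutation string: {mutation!r}")
    return int(match.group(2))


def split_by_pocket(mutations, pocket_positions):
    """Return (in_pocket, out_of_pocket) lists of mutation strings."""
    inside = [m for m in mutations if mutation_position(m) in pocket_positions]
    outside = [m for m in mutations if mutation_position(m) not in pocket_positions]
    return inside, outside


# Made-up pocket positions and mutation list, for illustration only.
pocket = {545, 546, 547}
print(split_by_pocket(["Q546K", "H1047R"], pocket))
```

For records with several mutations, a rule is still needed for labelling the whole record (for example, in-pocket if any mutation falls in the pocket); that choice is left open here.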
jyaacoub added a commit that referenced this issue Jan 15, 2025
jyaacoub added a commit that referenced this issue Jan 15, 2025
**Still missing kiba_gvpl_aflow**: the checkpoint is on Graham or Cedar, which are both down at the moment. Once they are back online, run the following to check whether the checkpoint is there:

```bash
# Quote the pattern so the shell does not glob-expand it before grep sees it.
find results/model_checkpoints -type f | grep 'GVPLM_kiba.*nomsaF.*_aflowE_16B.*\.model'
```

#147

jyaacoub commented Jan 15, 2025

Additional figures - stratified performance results

MODEL TEST DATASETS

pocket vs full protein representation

These are the only ones we trained with the pocket representation:

Image

NOTE: I don't think it's worth training the remaining 11 model configurations. That is a total of 55 models to train (one for each of the 5 folds), which would take at least a couple of months at my current pace.

Image

Platinum:

RAW model predictive performance (all vs only mutated vs only wildtype)

Image

In pocket vs out of pocket mutation differences

RAW predictive performance:

Image

DELTA predictive performance:

Image

single vs multiple mutations

RAW predictive performance:

Image

DELTA predictive performance:

Image

@jyaacoub jyaacoub self-assigned this Jan 15, 2025
jyaacoub added a commit that referenced this issue Jan 20, 2025
For comparing against different groups within our test sets.
jyaacoub added a commit that referenced this issue Jan 24, 2025
only davis and kiba checkpoints for DG and aflow models were available.