KDBNet (kinase–drug binding prediction neural network) is a neural network that integrates protein 3D structure and compound 3D structure to predict compound–protein binding affinity and quantify the prediction uncertainty. Please see our paper in Nature Machine Intelligence for details.
git clone https://github.com/luoyunan/KDBNet.git
cd KDBNet
export PYTHONPATH=$PWD:$PYTHONPATH
This package is tested with Python 3.9 and CUDA 11.6 on Ubuntu 20.04, with access to an Nvidia A40 GPU (48GB RAM), AMD EPYC 7443 CPU (2.85 GHz), and 512G RAM. Run the following to create a conda environment and install the required Python packages (modify pytorch-cuda=11.6
according to your CUDA version).
conda create -n kdbnet python=3.9
conda activate kdbnet
conda install pytorch pytorch-cuda=11.6 -c pytorch -c nvidia
conda install pyg -c pyg
conda install -c conda-forge uncertainty-toolbox rdkit pyyaml
Running the above lines of conda install
should be sufficient to install all KDBNet's required packages (and their dependencies). Specific versions of the packages we tested were listed in requirements.txt
.
- Download example data (~120MB) from Dropbox.
wget https://www.dropbox.com/s/owc45bzbfn05ix4/data.tar.gz tar -xf data.tar.gz
- Run the example code in
scripts/
. The following script trains and evaluates a KDBNet model using the Davis dataset of kinase-drug binding affinity. The script randomly splits 70% as training data, 10% as validation data, and 20% as test data.After the training and testing are finished, the evaluation metrics will be printed in the tarminal. The predicted binding affinity values will be saved as a TSV file in the output directory, with the following formatCUDA_VISIBLE_DEVICES=0 python run_example.py --task davis --n_epochs 100 --output_dir ../output/davis --save_prediction
drug protein y_true y_pred 11667893 MKK7 5 5.15088 ... ... ... ...
- Dataset split. By default, the dataset is randomly split to train/valid/test with a ratio 0.7/0.1/0.2. We also support dataset split by drug, protein, or both, such that the model is tested on unseen proteins, unseen drugs or both. To use this option, set the argument
--split_method
usingdrug
orboth
for the--split_method
method. For example, to test on unseen drugs, run the following scriptWe also support sequence identity-based split such that the model is tested on unseen proteins with sequence identity lower 50% (CUDA_VISIBLE_DEVICES=0 python run_example.py --task davis --n_epochs 100 --split_method drug
--split_method seqid
). To use other sequence identity cutoff, first run MMseqs2 (cluster
function) to generate the clustering file, and change the initialize the dataset class (DAVIS
orKIBA
indta.py
) with the file path, e.g.,dataset = DAVIS(mmseqs_seq_cluster_file='mmseqs_cluster.tsv')
. - Ensemble. To ensemble multiple models, use
--n_ensembles
argument. For example, to ensemble 5 models, runIn the output TSV file, theCUDA_VISIBLE_DEVICES=0 python run_example.py --task davis --n_epochs 100 --n_ensembles 5 --output_dir ../output/davis_ens --save_prediction
y_pred
column is the average of the predictions by the 5 models, and the predictions given by individual model is listed in they_pred_0
,y_pred_1
, ...,y_pred_4
columns. - Uncertainty estimation. To estimate the uncertainty of the model, use
--uncertainty
argument. When--uncertainty
is set toTrue
, the number of ensembles should be greater than 1. The model will output the mean and standard deviation of the prediction.In the output TSV file, there will be aCUDA_VISIBLE_DEVICES=0 python run_example.py --task davis --n_epochs 100 --n_ensembles 5 --uncertainty --output_dir ../output/davis_unc --save_prediction
y_std
column, which is the standard deviation of the predictions by the 5 models and can be used as the uncertainty estimates. - Parallel training. To train ensemble models in parallel, use
--parallel
argument. For example, to train 5 models in parallel, runYou can train a separate model in the ensembles on a single GPU by settingCUDA_VISIBLE_DEVICES=0 python run_example.py --task davis --n_epochs 100 --n_ensembles 5 --uncertainty --parallel
CUDA_VISIBLE_DEVICES=0,1,2,3,4
. When you have fewer GPUs than the number of ensembles, multiple models may be trained on a single GPU. For example, if you have 2 GPUs and want to train an ensemble of 5 models, you can setCUDA_VISIBLE_DEVICES=0,1
, and the 3 models will be trained on GPU 0, and the other 2 models will be trained on GPU 1. - Uncertainty recalibration. Use
--recalibrate
argument to perform uncertainty recalibration. For example, to ensemble 5 models and perform uncertainty recalibration, runIn the output TSV file, there will be aCUDA_VISIBLE_DEVICES=0,1,2 python run_example.py --task davis --n_epochs 100 --n_ensembles 5 --uncertainty --parallel --recalibrate --output_dir ../output/davis_recal --save_prediction
y_std_recalib
column, which is the recalibrated uncertainty.
data/DAVIS/
is the Davis dataset of kinase-drug binding affinity. Thedavis_data.tsv
contains the binding affinity data with the following formatwheredrug protein Kd y 5291 AAK1 10000 5 ... ... ... ...
drug
is the PubChem Compound ID (CID) of the drug,protein
is the kinbase name,Kd
is the binding affinity in nM, andy
is the log-transformed binding affinity (see paper). Thedavis_protein2pdb.yaml
file contains the mapping from a kinase name to its representative PDB structure ID. Thedavis_cluster_id50_cluster.tsv
is the clustering output file of the MMseqs2 clustering algorithm with a sequence identity cutoff 50% (the first column contains the representative sequences and the second column contains cluster members).data/KIBA/
is the KIBA dataset of kinase-inhibitor binding affinity. The files in this directory are similar to those indata/DAVIS/
. In thekiba_data.tsv
file, thedrug
column contains CHEMBL IDs of the drugs, and theprotein
column contains UniProt IDs of the kinases.data/structure
contains several structure files.- The
pockets_structure.json
contains the PDB structure data of representative kinase structures. The file is in JSON format where the key is the PDB ID, and the value is the corresponding PDB structure data in a dictionary format, including the following fields:name
(kinase name),UniProt_id
,PDB_id
,chain
(chain ID in the PDB structure),seq
(pocket protein sequence),coords
(coordinates of the N, CA, C, O atoms of residues in the pocket). Thecoords
is a dictionary with four fields, and each is a list of xyz coordinates of the N/CA/C/O atom in each residue, i.e.,coords={'N':[[x, y, z], ...]], 'CA': [...], 'C': [...], O: [...]}
- The
davis_moil3d_sdf
andkiba_moil3d_sdf
are diretories that contain the 3D structure (SDF format) of molecules in the Davis and KIBA datasets.
- The
data/esm1b
contains the pre-computed protein sequence embeddings by the ESM-1b model. The embeddings were saved as PyTorch tensors in the.pt
format.
Luo, Y., Liu, Y. & Peng, J. Calibrated geometric deep learning improves kinase–drug binding predictions. Nat Mach Intell (2023). https://doi.org/10.1038/s42256-023-00751-0
Please submit GitHub issues or contact Yunan Luo (luoyunan[at]gmail[dot]com) for any questions related to the source code.