Repository implementing idpSAM in PyTorch. idpSAM is a latent diffusion model for generating Cα conformations of intrinsically disordered proteins (IDPs) and peptides. The model was trained on a dataset of Markov Chain Monte Carlo simulations of 3,259 intrinsically disordered regions whose sequences were obtained from the DisProt database. The simulations were carried out using ABSINTH, an implicit solvent model, implemented in the CAMPARI 4.0 package. Here we provide the code and weights of a pre-trained idpSAM model.
This repository can be used for the following applications (see below for more information):
- Generate Cα ensembles with a pre-trained idpSAM model.
- Generate all-atom ensembles with a pre-trained idpSAM model and the cg2all model for all-atom reconstruction.
We recommend installing and running this package in a new Conda environment created from the sam.yml file in this repository. If you follow this strategy, use these commands:
- Clone the repository and go into its root directory:
git clone https://github.com/giacomo-janson/idpsam.git
- Install the dedicated conda environment and dependencies:
conda env create -f sam.yml
- Activate the environment:
conda activate sam
- Install the sam Python library in editable mode (this just makes the library importable from the repository directory, without copying it):
pip install -e .
- Optional, only if you want to perform all-atom reconstruction when using the idpSAM inference script: install the cg2all package inside the sam environment created above:
pip install git+http://github.com/huhlim/cg2all
Note: this is the command for performing a CPU-only installation of cg2all. You can also attempt the GPU installation, which involves more steps. If you can't install cg2all with GPU support, the CPU installation is still good for idpSAM applications. This is because for short peptides cg2all is reasonably fast when running on a CPU.
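After the installation steps, a quick check that the expected packages are visible from the active environment can save some debugging. The following is a minimal sketch, not part of the repository: the helper name `check_packages` is hypothetical, and the import names `sam`, `torch`, and `cg2all` are assumptions based on the package names mentioned above.

```python
import importlib.util


def check_packages(names):
    """Map each top-level package name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Assumed import names: "sam" (installed by `pip install -e .`),
# "torch" (from the Conda environment) and "cg2all" (optional).
status = check_packages(["sam", "torch", "cg2all"])
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

Run it with the sam environment activated. A missing cg2all only matters if you plan to use all-atom reconstruction.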
If you want to quickly use idpSAM on the cloud (no installation needed on your system), we have an idpSAM Colab notebook.
You can generate a structural ensemble of a custom peptide sequence via the scripts/generate_ensemble.py inference script. Its usage is:
python scripts/generate_ensemble.py -c config/models.yaml -s MFDNASTRNNKRERGKRQGKQTRTQRHADRSQT -o peptide -n 1000 -a -d cuda
Here is a description of the arguments:
- -c: configuration file for idpSAM. Use the default one provided in the config directory of the repository.
- -s: amino acid sequence of the intrinsically disordered peptide that you want to model.
- -o: output path. In this example, the command will save a series of output files named peptide.*. These are DCD trajectory files storing the conformations you generated and PDB files that you can use as topologies for parsing the DCD files. Files with the ca code store only Cα atoms (the original output of idpSAM); files with the aa code store all-atom conformations reconstructed by the cg2all model as a post-processing step.
- -n: number of conformations to generate.
- -a: flag for using cg2all to reconstruct all-atom details from Cα traces. You must first install the cg2all package to use this option.
- -d: PyTorch device for the idpSAM models. If you want to generate ensembles with a large number of conformations, we strongly recommend using GPU support, via the cuda value here. By default, the cg2all model will run on CPU, since it is still fast.
There are also other options that you can tweak. Use the --help flag to get the full list of them.
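If you need ensembles for several peptides, the command above can be assembled programmatically. The sketch below is one possible way to do it and only uses the flags described above (-c, -s, -o, -n, -a, -d). The helper names `build_command` and `run_all` and the `peptides` dictionary are hypothetical and not part of the repository.

```python
import subprocess
import sys


def build_command(sequence, out_prefix, n_conformations=1000,
                  all_atom=False, device="cpu",
                  config="config/models.yaml",
                  script="scripts/generate_ensemble.py"):
    """Assemble the argument list for one generate_ensemble.py call."""
    cmd = [sys.executable, script,
           "-c", config,
           "-s", sequence,
           "-o", out_prefix,
           "-n", str(n_conformations),
           "-d", device]
    if all_atom:
        cmd.append("-a")  # requires the cg2all package to be installed
    return cmd


def run_all(commands):
    """Run each command in turn; call this from the repository root."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Hypothetical input: output prefix -> amino acid sequence.
peptides = {
    "peptide": "MFDNASTRNNKRERGKRQGKQTRTQRHADRSQT",
}

commands = [build_command(seq, name, device="cuda", all_atom=True)
            for name, seq in peptides.items()]
for cmd in commands:
    print(" ".join(cmd))  # inspect the commands before running them
```

Calling `run_all(commands)` would then execute the jobs one after another. Printing the commands first makes it easy to confirm they match the usage line above.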
You can easily generate a Cα (and optionally all-atom) ensemble for a custom peptide using a Colab notebook on the cloud and download the ensemble to your local system. The output will consist of DCD files, which you can parse with MDTraj, for example. If you plan to generate large ensembles (> 1000 conformations), generation will probably take hours on a CPU runtime. If possible, use a GPU runtime, which reduces the idpSAM run time to a few minutes.
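As an illustration of parsing the DCD output with MDTraj, here is a minimal sketch. The file naming pattern (`<prefix>.<code>.dcd` with a `<prefix>.<code>.top.pdb` topology) is an assumption, so check the files actually produced and adjust it. The helper names are hypothetical; `mdtraj.load` and `mdtraj.compute_rg` are standard MDTraj functions.

```python
def ensemble_paths(prefix, code="ca"):
    """Return (trajectory, topology) paths under an ASSUMED naming pattern."""
    # Assumption: <prefix>.<code>.dcd and <prefix>.<code>.top.pdb.
    # Use code="ca" for Cα-only files and code="aa" for all-atom files.
    return f"{prefix}.{code}.dcd", f"{prefix}.{code}.top.pdb"


def load_ensemble(prefix, code="ca"):
    """Load a generated ensemble as an mdtraj.Trajectory."""
    import mdtraj  # imported lazily, so the path helper works without MDTraj

    dcd_path, pdb_path = ensemble_paths(prefix, code)
    # The PDB file provides the topology needed to interpret the DCD frames.
    return mdtraj.load(dcd_path, top=pdb_path)


def radius_of_gyration(prefix, code="ca"):
    """Per-conformation radius of gyration (in nm, MDTraj's length unit)."""
    import mdtraj

    return mdtraj.compute_rg(load_ensemble(prefix, code))
```

For example, `radius_of_gyration("peptide")` would return one value per generated conformation, provided the assumed file names match your output.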
Launch the notebook using the link below:
- 31/12/2023: initial release.
Janson G and Feig M. Transferable deep generative modeling of intrinsically disordered protein conformations. bioRxiv (2024).