ProSST


Key Features

This repository provides the official implementation of ProSST: A Pre-trained Protein Sequence and Structure Transformer with Disentangled Attention.

The paper introduces several key contributions to protein language modeling:

  1. Integration of Protein Sequences and Structures: The ProSST model integrates both protein sequences and structures using a structure quantization module and a Transformer architecture with disentangled attention, effectively capturing the relationship between protein residues and their structural context.

  2. Structure Quantization Module: This module converts 3D protein structures into discrete tokens by serializing residue-level local structures, embedding them in a dense vector space, and quantizing the embeddings with a pre-trained clustering model, yielding effective protein structure representations.

  3. Disentangled Attention Mechanism: ProSST uses a disentangled attention mechanism to explicitly learn the relationships between protein sequence tokens and structure tokens. This improves the model's ability to capture complex sequence-structure features and leads to state-of-the-art performance on a range of protein function prediction tasks (a toy sketch of the attention terms follows this list).
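For intuition, here is a toy sketch (not the repository's implementation; the layer names, shapes, and scaling are illustrative assumptions) of how disentangled attention logits can be formed as a sum of sequence-to-sequence, sequence-to-structure, and structure-to-sequence terms:

import math
import torch
import torch.nn as nn

class ToyDisentangledAttention(nn.Module):
    """Illustrative only: mixes sequence and structure token representations
    through three disentangled attention terms."""
    def __init__(self, d_model):
        super().__init__()
        self.q_seq = nn.Linear(d_model, d_model)
        self.k_seq = nn.Linear(d_model, d_model)
        self.q_str = nn.Linear(d_model, d_model)
        self.k_str = nn.Linear(d_model, d_model)

    def forward(self, seq_h, str_h):
        # seq_h, str_h: [batch, length, d_model] sequence / structure embeddings
        qs, ks = self.q_seq(seq_h), self.k_seq(seq_h)
        qt, kt = self.q_str(str_h), self.k_str(str_h)
        logits = (
            qs @ ks.transpose(-1, -2)    # sequence -> sequence
            + qs @ kt.transpose(-1, -2)  # sequence -> structure
            + qt @ ks.transpose(-1, -2)  # structure -> sequence
        ) / math.sqrt(3 * seq_h.size(-1))
        return torch.softmax(logits, dim=-1)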

Links

Dataset Links

ProteinGYM Benchmark

Download the dataset from Google Drive.

Get Started

Installation

git clone https://github.com/ginnm/ProSST.git
cd ProSST
pip install -r requirements.txt
export PYTHONPATH=$PYTHONPATH:$(pwd)

Structure quantizer

from prosst.structure.quantizer import PdbQuantizer
processor = PdbQuantizer(structure_vocab_size=2048) # can be 20, 128, 512, 1024, 2048, 4096
result = processor("example_data/p1.pdb", return_residue_seq=False)

Output:

[407, 998, 1841, 1421, 653, 450, 117, 822, ...]

Download Model

ProSST models have been uploaded to the Hugging Face 🤗 Hub and can be loaded with 🤗 Transformers:

from transformers import AutoModelForMaskedLM, AutoTokenizer
model = AutoModelForMaskedLM.from_pretrained("AI4Protein/ProSST-2048", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("AI4Protein/ProSST-2048", trust_remote_code=True)

See AI4Protein/ProSST-* for more models.

Zero-shot mutant effect prediction

Example notebook: Zero-shot mutant effect prediction
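A common zero-shot recipe for masked protein language models scores a substitution by the log-odds of the mutant versus the wild-type residue at the mutated position. The sketch below only assumes you already have per-position logits for the wild-type sequence; see the example notebook for ProSST's exact procedure, which also supplies structure tokens to the model.

import torch

def zero_shot_score(logits, position, wt_token_id, mt_token_id):
    """Log-odds score for a single substitution.
    logits: [sequence_length, vocab_size] masked-LM logits for the wild-type sequence.
    Higher scores suggest the mutation is more favorable under the model."""
    log_probs = torch.log_softmax(logits, dim=-1)
    return (log_probs[position, mt_token_id] - log_probs[position, wt_token_id]).item()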

Run ProteinGYM Benchmark

Download the dataset from Google Drive (this archive contains the quantized structures for ProteinGYM).

cd example_data
unzip proteingym_benchmark.zip
cd ..
python zero_shot/proteingym_benchmark.py --model_path AI4Protein/ProSST-2048 \
--structure_dir example_data/structure_sequence/2048
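ProteinGYM-style evaluation is typically reported as the Spearman correlation between model scores and experimental measurements for each assay. The minimal sketch below is independent of the benchmark script's actual output format (an assumption here) and only illustrates the metric itself:

from scipy.stats import spearmanr

# Hypothetical per-assay evaluation: correlate predicted mutant scores with
# experimentally measured fitness values.
predicted = [0.12, -1.3, 0.8, -0.2]
measured = [0.30, -0.9, 1.1, 0.0]
rho, _ = spearmanr(predicted, measured)
print(f"Spearman rho = {rho:.3f}")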

🛡️ License

This project is under the GPL-3.0 license. See LICENSE for details.

📝 Citation

If you find this repository useful, please consider citing this paper:

@article {Li2024.04.15.589672,
	author = {Li, Mingchen and Tan, Yang and Ma, Xinzhu and Zhong, Bozitao and Zhou, Ziyi and Yu, Huiqun and Ouyang, Wanli and Hong, Liang and Zhou, Bingxin and Tan, Pan},
	title = {ProSST: Protein Language Modeling with Quantized Structure and Disentangled Attention},
	elocation-id = {2024.04.15.589672},
	year = {2024},
	doi = {10.1101/2024.04.15.589672},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2024/05/17/2024.04.15.589672.1},
	eprint = {https://www.biorxiv.org/content/early/2024/05/17/2024.04.15.589672.1.full.pdf},
	journal = {bioRxiv}
}
