This repository provides the official implementation of ProSST: A Pre-trained Protein Sequence and Structure Transformer with Disentangled Attention.
The paper introduces several key contributions to protein language modeling:
- Integration of Protein Sequences and Structures: The ProSST model integrates both protein sequences and structures using a structure quantization module and a Transformer architecture with disentangled attention, effectively capturing the relationship between protein residues and their structural context.
- Structure Quantization Module: This module converts 3D protein structures into discrete tokens by serializing residue-level local structures, embedding them into a dense vector space, and quantizing the embeddings with a pre-trained clustering model, yielding effective protein structure representations.
- Disentangled Attention Mechanism: ProSST uses a disentangled attention mechanism to explicitly learn the relationships between protein sequence tokens and structure tokens, improving the model's ability to capture complex sequence-structure features and yielding state-of-the-art performance on a range of protein function prediction tasks (a minimal sketch of this attention pattern follows the list).
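For intuition, here is a minimal sketch of a DeBERTa-style disentangled attention layer over residue (content) states and structure-token embeddings: the attention logits are a sum of content-to-content, content-to-structure, and structure-to-content terms. This is an illustrative single-head toy, not the repository's implementation; all names and dimensions are assumptions.

import torch
import torch.nn as nn

class DisentangledAttentionSketch(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q_c = nn.Linear(d_model, d_model)  # queries from residue (content) states
        self.k_c = nn.Linear(d_model, d_model)  # keys from residue states
        self.q_s = nn.Linear(d_model, d_model)  # queries from structure-token embeddings
        self.k_s = nn.Linear(d_model, d_model)  # keys from structure-token embeddings
        self.v = nn.Linear(d_model, d_model)    # values come from residue states
        self.scale = (3 * d_model) ** -0.5      # DeBERTa-style scaling for a sum of 3 terms

    def forward(self, h: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        # h: (batch, length, d_model) residue hidden states
        # s: (batch, length, d_model) embedded structure tokens
        logits = (
            self.q_c(h) @ self.k_c(h).transpose(-1, -2)    # content -> content
            + self.q_c(h) @ self.k_s(s).transpose(-1, -2)  # content -> structure
            + self.q_s(s) @ self.k_c(h).transpose(-1, -2)  # structure -> content
        ) * self.scale
        return logits.softmax(dim=-1) @ self.v(h)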
ProteinGym Benchmark
Download the dataset from Google Drive.
Installation
git clone https://github.com/ginnm/ProSST.git
cd ProSST
pip install -r requirements.txt
export PYTHONPATH=$PYTHONPATH:$(pwd)
Structure quantizer
from prosst.structure.quantizer import PdbQuantizer
processor = PdbQuantizer(structure_vocab_size=2048) # can be 20, 128, 512, 1024, 2048, 4096
result = processor("example_data/p1.pdb", return_residue_seq=False)
Output:
[407, 998, 1841, 1421, 653, 450, 117, 822, ...]
Download Model
ProSST models are available on the Hugging Face 🤗 Hub:
from transformers import AutoModelForMaskedLM, AutoTokenizer
model = AutoModelForMaskedLM.from_pretrained("AI4Protein/ProSST-2048", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("AI4Protein/ProSST-2048", trust_remote_code=True)
See AI4Protein/ProSST-* for more models.
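The sketch below combines the structure quantizer with the pretrained model to get masked-LM logits. It is a sketch under assumptions: the keyword for passing structure tokens (ss_input_ids here) is defined by the remote code and may differ, and special-token alignment may need extra handling; consult the model card and zero_shot/proteingym_benchmark.py for the exact pipeline.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
from prosst.structure.quantizer import PdbQuantizer

model = AutoModelForMaskedLM.from_pretrained("AI4Protein/ProSST-2048", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("AI4Protein/ProSST-2048", trust_remote_code=True)

# One structure token per residue, from the quantizer shown above.
struct_tokens = PdbQuantizer(structure_vocab_size=2048)(
    "example_data/p1.pdb", return_residue_seq=False
)

sequence = "MEEVVIV"  # illustrative placeholder; use the actual residue sequence of p1.pdb
inputs = tokenizer(sequence, return_tensors="pt")

# ASSUMPTION: the remote-code model accepts structure tokens via `ss_input_ids`,
# aligned with the tokenized sequence (special tokens may require offsets/padding).
ss_input_ids = torch.tensor([struct_tokens], dtype=torch.long)

with torch.no_grad():
    logits = model(**inputs, ss_input_ids=ss_input_ids).logits  # (1, seq_len, vocab)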
Zero-shot mutant effect prediction
Download the dataset from Google Drive. (This archive contains the quantized structures used within ProteinGym.)
cd example_data
unzip proteingym_benchmark.zip
python zero_shot/proteingym_benchmark.py --model_path AI4Protein/ProSST-2048 \
--structure_dir example_data/structure_sequence/2048
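Under the hood, zero-shot mutant effect prediction with a masked language model is typically scored as the log-odds ratio between the mutant and wild-type amino acids at the mutated position. Below is a minimal sketch of that rule, assuming logits aligned one-to-one with the wild-type residues (no special-token offset); the helper itself is hypothetical, so see zero_shot/proteingym_benchmark.py for the actual scoring.

import torch

def log_odds_score(logits, tokenizer, wt_seq: str, mutant: str) -> float:
    # Score a substitution written like "A123G": wild-type A, 1-indexed site 123, mutant G.
    log_probs = torch.log_softmax(logits, dim=-1)  # (1, len(wt_seq), vocab)
    wt, pos, mt = mutant[0], int(mutant[1:-1]) - 1, mutant[-1]
    assert wt_seq[pos] == wt, "mutant string disagrees with the wild-type sequence"
    wt_id = tokenizer.convert_tokens_to_ids(wt)
    mt_id = tokenizer.convert_tokens_to_ids(mt)
    return (log_probs[0, pos, mt_id] - log_probs[0, pos, wt_id]).item()

A higher score means the model assigns more probability to the mutant residue than to the wild type at that site, which ProteinGym evaluates against experimentally measured fitness.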
License
This project is released under the GPL-3.0 license. See LICENSE for details.
Citation
If you find this repository useful, please consider citing the paper:
@article{Li2024.04.15.589672,
author = {Li, Mingchen and Tan, Yang and Ma, Xinzhu and Zhong, Bozitao and Zhou, Ziyi and Yu, Huiqun and Ouyang, Wanli and Hong, Liang and Zhou, Bingxin and Tan, Pan},
title = {ProSST: Protein Language Modeling with Quantized Structure and Disentangled Attention},
elocation-id = {2024.04.15.589672},
year = {2024},
doi = {10.1101/2024.04.15.589672},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2024/05/17/2024.04.15.589672.1},
eprint = {https://www.biorxiv.org/content/early/2024/05/17/2024.04.15.589672.1.full.pdf},
journal = {bioRxiv}
}