Integrating Protein Language Models and Automatic Biofoundry for Enhanced Protein Evolution

The official implementation of the paper: Integrating Protein Language Models and Automatic Biofoundry for Enhanced Protein Evolution

Requirements

Python 3.7 or higher
PyTorch
NumPy
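POT (Python Optimal Transport)
Standard-library modules (bundled with Python, no installation needed): copy, random, pickle, itertools, heapq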

Usage

Module 1

We aim to analyze the impact of single-point mutations on a given protein sequence. The specific goals are:

Mutation Library Generation: Create a library of mutated sequences by introducing single-point mutations at each position in the original sequence. Given an original sequence with N amino acids and 20 possible amino acids at each position, this results in a library of 20×N sequences. One of the 20 choices at each position reproduces the original residue, so 19×N of these sequences actually differ from the original.
Likelihood Calculation: Use ESM (Evolutionary Scale Modeling), a protein language model, to calculate the likelihood of each mutated sequence. The likelihood scores serve as a proxy for how functional or stable each sequence is likely to be.
Top Sequence Selection: Based on the calculated likelihoods, rank all mutated sequences and select the top 96 sequences with the highest likelihood scores.
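The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code in module_1.py: the function names are hypothetical, the sketch skips the original residue at each position (so it produces 19×N entries), and toy_score is a placeholder standing in for an ESM-derived likelihood.

```python
import heapq
from typing import Callable, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def single_point_library(wild_type: str) -> List[Tuple[int, str, str]]:
    """Return (position, new_residue, mutated_sequence) for every substitution.

    Positions are 1-based. The original residue is skipped at each site, so
    the library holds 19 * N entries.
    """
    library = []
    for i, original in enumerate(wild_type):
        for aa in AMINO_ACIDS:
            if aa == original:
                continue
            mutated = wild_type[:i] + aa + wild_type[i + 1:]
            library.append((i + 1, aa, mutated))
    return library


def select_top(library, score_fn: Callable[[str], float], k: int = 96):
    """Keep the k entries whose sequences score highest under score_fn."""
    return heapq.nlargest(k, library, key=lambda entry: score_fn(entry[2]))


def toy_score(sequence: str) -> float:
    # Placeholder only: a real run would return a likelihood from ESM here.
    return sequence.count("A") / len(sequence)


library = single_point_library("MQYK")
top = select_top(library, toy_score, k=5)
```

Swapping toy_score for a function that returns a model likelihood leaves the rest of the sketch unchanged, since select_top only needs a callable that maps a sequence to a number.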

Run the script from the script folder using:

python module_1.py

Module 2

Mask Your Mutation Sites: In the module_2.py file, modify line 14. Replace "GB1" and its sequence with your protein of interest, and put a '<mask>' token at each mutation site, as shown below:

data = [("GB1","MQYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNG<mask><mask><mask>EWTYDDATKTFT<mask>TE")]
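If you would rather not edit the masked string by hand, a small helper can build it from the unmasked sequence and a list of positions. The sketch below is an illustration only: mask_sites is a hypothetical helper that is not part of module_2.py, and the unmasked GB1 sequence and the 1-based positions 39, 40, 41 and 54 are assumptions chosen so that the output matches the line above.

```python
from typing import Iterable

MASK_TOKEN = "<mask>"


def mask_sites(sequence: str, positions: Iterable[int]) -> str:
    """Replace the residues at the given 1-based positions with MASK_TOKEN."""
    targets = set(positions)
    out_of_range = [p for p in targets if not 1 <= p <= len(sequence)]
    if out_of_range:
        raise ValueError(f"positions outside the sequence: {sorted(out_of_range)}")
    return "".join(
        MASK_TOKEN if i in targets else residue
        for i, residue in enumerate(sequence, start=1)
    )


# Assumed unmasked GB1 sequence (56 residues); verify it against your own source.
GB1_WT = "MQYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE"
data = [("GB1", mask_sites(GB1_WT, [39, 40, 41, 54]))]
```

Using 1-based positions keeps the helper consistent with the way mutation sites are usually written (for example V39), at the cost of an offset from Python's 0-based indexing.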

Run the script from the script folder using:

python module_2.py

This will generate the file select_96.json.

Finetuning

Set the parameters in 'scripts/run_fitness.sh'.
Set the paths to your data in 'tasks/fitness.py': 'path_to_train_data.csv' and 'path_to_test_data.csv'.
Run the script from the script folder using:

sh run_fitness.sh

Protein Mutation Analysis

UBC9.py and RPL40A.py analyze protein mutation data for two proteins, UBC9 and RPL40A. Each script provides functions to process protein sequences, generate mutations, and select optimal sequences based on likelihood scores from a pre-trained ESM model. Their outputs support mutation analysis and data preparation for further study or model training.

Run either script:

python ./Zero-shot/UBC9.py
python ./Zero-shot/RPL40A.py

Outputs:
- A CSV file containing the top 96 mutations, labels, mutation positions, and amino acids.
- A CSV file with 96 randomly generated mutations for comparative analysis.
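As a rough illustration of the random baseline, the sketch below draws 96 distinct single-point substitutions and formats them as CSV text. It assumes the mutations are single-point substitutions; the function names, the column names, and the short example sequence are all hypothetical and may not match what UBC9.py and RPL40A.py actually write.

```python
import csv
import io
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def random_single_mutations(wild_type: str, n: int = 96, seed: int = 0):
    """Draw n distinct single-point substitutions uniformly at random."""
    candidates = [
        (pos, wt, aa)
        for pos, wt in enumerate(wild_type, start=1)
        for aa in AMINO_ACIDS
        if aa != wt
    ]
    rng = random.Random(seed)  # fixed seed so the baseline is reproducible
    return rng.sample(candidates, n)


def to_csv(mutations) -> str:
    """Format (position, original, new) tuples as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["mutation", "position", "amino_acid"])  # hypothetical columns
    for pos, wt, aa in mutations:
        writer.writerow([f"{wt}{pos}{aa}", pos, aa])
    return buffer.getvalue()


# Example sequence for illustration only, not UBC9 or RPL40A.
baseline = random_single_mutations("MQYKLILNGK", n=96)
csv_text = to_csv(baseline)
```

Sampling without replacement from the full candidate list guarantees 96 distinct mutations, which requires the sequence to offer at least 96 possible substitutions (six or more residues).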

Contact: [email protected]

About

Releases 1

Packages

Languages

HICAI-ZJU/PLMeAE

Folders and files

Latest commit

History

Repository files navigation

Integrating Protein Language Models and Automatic Biofoundry for Enhanced Protein Evolution

Requirements

Usage

Module 1

Module 2

Finetuning

Protein Mutation Analysis

Contact: [email protected]

About

Resources

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages