Sequence_Fitness_Prediction

Introduction

The goal of this project is to predict protein fitness from sequence information alone and to enhance the fitness score using machine learning algorithms and language model techniques. The prediction process is optimized with ensemble learning, and data imbalance is addressed with sampling methods including SMOTE and R-oversampling. For protein sequence representation, four methods are analyzed: One-Hot Encoding, Physicochemical Encoding, UniRep (Next-Token Prediction Embedding), and ESM (Masked-Token Prediction Embedding). These analyses are carried out on two distinct datasets, one for affinity prediction and one for stability prediction.
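As an illustration of the simplest of the four representations, the following is a minimal sketch of one-hot encoding a protein sequence. It is not taken from the repository: the function name `one_hot_encode`, the alphabetical residue order, and the all-zero handling of unknown residues are assumptions made for this example.

```python
import numpy as np

# The 20 canonical amino acids in a fixed order (assumed: alphabetical one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence: str) -> np.ndarray:
    """Return a flattened one-hot vector of length len(sequence) * 20.

    Hypothetical helper for illustration; not the repository's implementation.
    """
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for position, residue in enumerate(sequence.upper()):
        column = AA_INDEX.get(residue)
        # Residues outside the 20-letter alphabet (e.g. 'X') stay as an all-zero row.
        if column is not None:
            encoded[position, column] = 1.0
    return encoded.ravel()


features = one_hot_encode("ACDX")
print(features.shape)  # (80,)
```

Each position contributes 20 features, so sequences must share a common length (or be padded) before they can be stacked into a feature matrix for a scikit-learn style model.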

The main questions addressed are:

How do different representation methods perform in predicting distinct fitness attributes such as stability or affinity?
How do sampling methods perform in the imbalanced protein dataset?
Is ensemble learning over different protein representations helpful in boosting the performance of discriminative models?
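To make the third question concrete, one common way to ensemble over representations is soft voting: train one classifier per representation and average their predicted class probabilities. The sketch below assumes this scheme for illustration only; the project may combine models differently. The function name `soft_vote` is hypothetical, and the probability values are made-up toy numbers, not results.

```python
import numpy as np


def soft_vote(probabilities_per_model, weights=None):
    """Average class-probability arrays, each of shape (n_samples, n_classes).

    Hypothetical helper: returns the predicted labels and the averaged probabilities.
    """
    stacked = np.stack(probabilities_per_model)  # (n_models, n_samples, n_classes)
    averaged = np.average(stacked, axis=0, weights=weights)
    return averaged.argmax(axis=1), averaged


# Toy probabilities (invented for illustration) from two models,
# e.g. one trained on one-hot features and one on ESM embeddings.
p_onehot = np.array([[0.60, 0.40], [0.30, 0.70], [0.55, 0.45]])
p_esm = np.array([[0.20, 0.80], [0.40, 0.60], [0.70, 0.30]])

labels, averaged = soft_vote([p_onehot, p_esm])
print(labels)  # [1 1 0]
```

Passing `weights` (one value per model) lets a stronger representation count for more in the average.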

Requirements

Python 3.x
NumPy
Pandas
scikit-learn (sklearn)
modlamp
Optuna
seaborn
Matplotlib
SciPy

The two datasets used in this study are Affinity Binding, for affinity prediction, and NESP, for stability prediction. Please refer to the manuscript for more detailed information on their attributes (protein type, size, data imbalance, etc.).
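Because both datasets are described as imbalanced, a short sketch of oversampling may help. It assumes that "R-oversampling" refers to random oversampling, that is, duplicating minority-class samples at random until all classes are the same size. The function name `random_oversample` and the toy sequences are hypothetical and use only the Python standard library; this is not the repository's code.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class matches the largest.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra copies with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy example: four sequences in class 0, one in class 1.
sequences = ["AAA", "AAC", "AAD", "AAE", "CCC"]
classes = [0, 0, 0, 0, 1]
balanced_sequences, balanced_classes = random_oversample(sequences, classes)
print(Counter(balanced_classes))
```

Resampling is normally applied to the training split only, so that duplicated samples do not leak into the validation or test data.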

Folders and files

Affibody
Get_Embedding
NESP
MCDA.py
README.md
Seq.py

WoldringLabMSU/Sequence_Fitness_Prediction