WoldringLabMSU/Sequence_Fitness_Prediction

Sequence_Fitness_Prediction

Introduction

The goal of this project is to predict protein fitness from sequence information alone and to enhance the fitness score using machine learning algorithms and language model techniques. The prediction process is optimized through ensemble learning, and data imbalance is addressed with sampling methods including SMOTE and R-oversampling. For protein sequence representation, four different methods are analyzed: One-Hot Encoding, Physicochemical Encoding, UniRep (Next-Token Prediction Embedding) and ESM (Masked-Token Prediction Embedding). These analyses are carried out on two distinct datasets, one for affinity prediction and one for stability prediction.
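To make the simplest of the four representations concrete, the following is a minimal sketch of one-hot encoding for a protein sequence. It is an illustration only, not the code used in this repository: the function name `one_hot_encode`, the residue ordering in `AMINO_ACIDS`, and the choice to leave unknown residues as all-zero rows are assumptions.

```python
# Illustrative one-hot encoding of a protein sequence.
# The alphabet order and all names here are assumptions, not this repository's API.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return a flat feature list of length 20 * len(sequence)."""
    encoded = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(residue)
        if idx is not None:  # unknown residues (e.g. 'X') stay all-zero
            row[idx] = 1
        encoded.extend(row)
    return encoded


features = one_hot_encode("ACD")
print(len(features))  # 3 residues * 20 positions
```

Each residue contributes a 20-element block with a single 1, so sequences of equal length map to feature vectors of equal length. Sequences of different lengths would need padding or alignment before being passed to a model.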

The main questions addressed are:

  • How do different representation methods perform in predicting distinct fitness attributes such as stability or affinity?
  • How do sampling methods perform in the imbalanced protein dataset?
  • Is ensemble learning over different protein representations helpful in boosting the performance of discriminative models?
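The second and third questions can be illustrated with a small standard-library sketch. It assumes that "R-oversampling" refers to random oversampling (duplicating minority-class samples) and that the ensemble combines models trained on different representations by averaging their predicted probabilities (soft voting). Both are assumptions made for illustration; the function names `random_oversample` and `soft_vote` and the toy data are hypothetical and do not come from this repository.

```python
import random


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for features in rows + extra:
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out


def soft_vote(probabilities_per_model, threshold=0.5):
    """Average positive-class probabilities across models (one list per
    model, one entry per sample) and threshold the average."""
    n_models = len(probabilities_per_model)
    averaged = [sum(col) / n_models for col in zip(*probabilities_per_model)]
    return [1 if p >= threshold else 0 for p in averaged]


# Toy imbalanced data: four negatives, one positive (made-up values).
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)

# Hypothetical probabilities from three models, e.g. one per representation.
labels = soft_vote([[0.9, 0.2, 0.6], [0.7, 0.4, 0.3], [0.8, 0.1, 0.5]])
```

Oversampling should be applied to the training split only, so that duplicated samples do not leak into the evaluation data. SMOTE differs from this sketch in that it synthesizes new minority samples by interpolating between neighbors rather than duplicating existing ones.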

Requirements

  • Python 3.x
  • NumPy
  • pandas
  • scikit-learn
  • modlamp
  • Optuna
  • seaborn
  • Matplotlib
  • SciPy

The two datasets used in this study are Affinity Binding and NESP for stability prediction. Please refer to the manuscript for more detailed information on their attributes (protein type, size, data imbalance, etc.).
