We introduce PlasmidGPT, a generative language model pretrained on 153k engineered plasmid sequences from Addgene (https://www.addgene.org/). PlasmidGPT generates de novo sequences that share similar characteristics with engineered plasmids but show low sequence identity to the training data. We demonstrate its ability to generate plasmids in a controlled manner based on the input sequence or specific design constraint. Moreover, our model learns informative embeddings of both engineered and natural plasmids, allowing for efficient prediction of a wide range of sequence-related attributes.
Python package dependencies:
- torch 2.0.1
- transformers 4.37.2
- pandas 2.2.0
- seaborn 0.13.2
We recommend using Conda to install our packages. For convenience, we have provided a conda environment file with package versions that are compatiable with the current version of the program. The conda environment can be setup with the following comments:
-
Clone this repository:
git clone https://github.com/lingxusb/PlasmidGPT.git cd PlasmidGPT
-
Create and activate the Conda environment:
conda env create -f env.yml conda activate PlasmidGPT
The trained model and tokenizer is availale at huggingface.
pretrained_model.pt
, pretrained PlasmidGPT model, can be accessed hereaddgene_trained_dna_tokenizer.json
, trained BPE tokenizer on Addgene plasmid sequences, can be accessed here
import torch
# load the model
device = 'cpu' # use 'cuda' for GPU
model = torch.load(pt_file_path).to(device)
model.eval()
# start sequence
input_ids = tokenizer.encode(start_sequence, return_tensors='pt').to(device)
# model generation
outputs = model.generate(
input_ids,
max_length=300,
num_return_sequences=1,
temperature=1.0,
do_sample=True,
generation_config=GenerationConfig.from_model_config(model.config)
)
# transform tokens back to DNA ucleotide sequence:
generated_sequence = tokenizer.decode(outputs[0], skip_special_tokens=True)
To generate plasmid sequence using the model, please run the following command:
python generate.py --model_dir ../pretrained_model
The ../pretrained_model
folder should contain the model file and the tokenizer.
For a full list of options, please run:
python generate.py -h
Arguments description
argument | description |
---|---|
-h , --help |
show help message and exit |
-m , --model_dir |
path to the directory containing the pretrained model and tokenizer, required |
-s , --start_sequence |
starting DNA sequence for sequence generation |
-f , --fasta_file |
FASTA file containing the starting sequence |
-n , --num_sequences |
number of sequences to generate, default value: 1 |
-l , --max_length |
maximum length of the tokenized generated sequence, default value: 300 |
-t , --temperature |
temperature for sequence generation (controls randomness), default value: 1.0 |
-o , --output |
output file name for the generated sequences, default value: generated_sequence.fasta |
The model output will be stored in the generated_sequence.fasta
file. The script should automatically detect whether to use CUDA (GPU) or CPU based on availability. If you encounter a CUDA-related error when running on a CPU-only machine, the script will handle this by falling back to CPU.
Please also check our jupyter notebook PlasmidGPT_generate.ipynb.
Or, you can easily use our Colab Notebook in the browser. Please make sure to connect to a GPU instance (e.g., T4 GPU). The notebook will automatically download the pretrained model and tokenizer. The plasmid sequence can be generated based on the user's specified start sequence and downloaded in the .fasta
file format.
# calculation of model embeddings
model.config.output_hidden_states = True
# Inference to obtain hidden states
with torch.no_grad():
outputs = model(input_ids)
hidden_states = outputs.hidden_states[-1].cpu().numpy()
hidden_states_mean = np.mean(hidden_states, axis=1).reshape(-1)
embedding.append(hidden_states_mean)
To generate plasmid sequence embeddings, please run the following command:
python embeddings.py [-h] -m MODEL_DIR -f FASTA_FILE [-o OUTPUT_FILE]
Arguments description
argument | description |
---|---|
-h , --help |
show help message and exit |
-m , --model_dir |
path to the directory containing the pretrained model and tokenizer, required |
-f , --fasta_file |
FASTA file containing DNA sequences for the embedding calculation, required |
-o , --output_file |
output file name for saving the embeddings |
The model output will be save in the embeddings.txt
file.
For prediction of attributes, please check our models in the prediction_models
folder.
To predict lab of origin based on input fasta file, please run the following command:
python prediction.py [-h] -m MODEL_DIR -i INPUT_FILE [-e] -nn NN_MODEL -l LAB_LIST [-o OUTPUT_FILE] [-n TOP_N]
The neural network model for lab prediction is provided in ./prediction_models/embedding_prediction_labs.pth
. The lab labels are provided in ./prediction_models/lab_list.txt
.
Arguments description
argument | description |
---|---|
-h , --help |
show help message and exit |
-m , --model_dir |
path to the directory containing the pretrained model and tokenizer, required |
-i , --input_file |
FASTA file or embeddings file as input, required |
-e , --embedding_file |
indicates if the input is an embedding file |
-nn , --nn_model |
path to the neural network model for lab prediction, required |
-l , --lab_list |
path to the file containing the lab labels, required |
-o , --output_file |
output file name for lab predictions |
-n , --top_n |
number of top predictions to output, default value: 10 |
The top predictions will be stored in the file lab_predictions.txt
, where each row corresponds to one input sequence.
We have provided the jupyter notebook PlasmidGPT_predict.ipynb for the prediction of lab of origin.
The Colab Notebook can be easily used in the browser to predict the lab of origin, species, and vector type for the input sequence. The notebook will automatically download all related models and make predictions based on the user's input plasmid sequence. Please use the drop-down list to select the feature to predict, and the top 10 predictions will be displayed.