We present TXpredict, a transcriptome prediction tool that generalizes to novel microbial genomes. By leveraging information learned by a large protein language model (ESM2), TXpredict achieves an average Spearman correlation of 0.53 when predicting gene expression for new bacterial genomes. We further extend this framework to predict transcriptomes for 900 additional microbial genomes spanning 280 genera, a large proportion of which remain uncharacterized at the transcriptional level. Additionally, TXpredict enables the prediction of condition-specific gene expression, providing a powerful tool for understanding microbial adaptation and facilitating the rational design of gene regulatory sequences.
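To make the evaluation metric concrete, here is a minimal sketch of how a Spearman correlation between measured and predicted expression can be computed for one genome. It is not taken from the TXpredict code base: the function names and the expression values are hypothetical, and a real evaluation would more likely use an existing statistics library.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-gene expression values, for illustration only.
measured = [5.1, 0.3, 2.2, 8.7, 1.0]
predicted = [4.0, 1.1, 0.9, 6.5, 2.0]
print(f"Spearman correlation: {spearman(measured, predicted):.2f}")
```

Because the metric depends only on ranks, it rewards getting the ordering of genes by expression level right rather than matching absolute values.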
Our transcriptome prediction models are available from Hugging Face.
We provide Colab notebooks for running transcriptome prediction in the web browser. Please also check our Colab instructions.
- The only required inputs are a genome sequence file (.fna or .fasta) and an annotation file (.gtf, .gff or .gff3). Please check our example data.
- Please connect to a GPU instance (e.g. T4: Runtime -> Change runtime type -> T4 GPU).
- Predicting the transcriptome for a genome with about 4,000 genes takes roughly 20 minutes.
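Before uploading files to the notebook, it can help to check them locally. The sketch below is not part of TXpredict: it checks the file extensions listed above, counts annotated features, and gives a rough runtime guess. The helper names are hypothetical. It also assumes that `CDS` is the feature type to count and that runtime scales linearly from the figure of about 20 minutes per 4,000 genes.

```python
from pathlib import Path

# Extensions listed in the instructions above.
GENOME_EXTS = {".fna", ".fasta"}
ANNOTATION_EXTS = {".gtf", ".gff", ".gff3"}


def check_inputs(genome_path, annotation_path):
    """Raise ValueError if either file name has an unexpected extension."""
    genome, annotation = Path(genome_path), Path(annotation_path)
    if genome.suffix.lower() not in GENOME_EXTS:
        raise ValueError(f"unexpected genome file extension: {genome.name}")
    if annotation.suffix.lower() not in ANNOTATION_EXTS:
        raise ValueError(f"unexpected annotation file extension: {annotation.name}")


def count_features(lines, feature_type="CDS"):
    """Count GTF/GFF records whose third column equals feature_type.

    Assumption: CDS records approximate the number of genes to predict;
    the feature type TXpredict actually reads may differ.
    """
    count = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9 and fields[2] == feature_type:
            count += 1
    return count


def estimate_minutes(n_genes, minutes_per_4k=20.0):
    """Rough runtime guess, assuming linear scaling with gene count."""
    return n_genes / 4000 * minutes_per_4k
```

For example, after `check_inputs("genome.fna", "genome.gff3")` succeeds, opening the annotation file and passing the file handle to `count_features` gives a gene count that `estimate_minutes` can turn into an approximate waiting time. The file names here are placeholders.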
We deeply appreciate the experimental work and datasets that made this project possible.