High Similarity in Predictions Across Perturbation Conditions #277

Open
ellieujin opened this issue Dec 31, 2024 · 2 comments

@ellieujin

Hello, and thank you for this amazing tool!

I recently fine-tuned scGPT using perturb-seq data to tackle a perturbation prediction task. Specifically, I aimed to predict gene expression levels for each perturbation condition where a single gene was perturbed at a time.

Here are the key details:

  1. Fine-tuning process: I fine-tuned scGPT using perturb-seq data with standard hyperparameters and followed the recommended pipeline.
  2. Prediction results: After fine-tuning, I generated predictions for all perturbation conditions. However, I observed that the pairwise Pearson R² values between the predicted gene expression profiles for different perturbations are consistently around 0.99, suggesting highly similar predictions regardless of the perturbation (a sketch of how I computed these is included below).

This high similarity in predictions was unexpected, as I anticipated more variation in the predicted expression profiles for different perturbations.
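For reference, here is roughly how I computed the pairwise correlations between predicted profiles (the variable names are illustrative, not from the scGPT codebase):

```python
import itertools

import numpy as np
from scipy.stats import pearsonr

def pairwise_pearson_r2(pred_profiles):
    """Pearson R^2 between every pair of predicted perturbation profiles.

    pred_profiles: dict mapping a condition name to its predicted mean
    expression vector over genes (e.g. averaged over the predicted cells
    for that condition).
    """
    r2 = {}
    for cond_a, cond_b in itertools.combinations(pred_profiles, 2):
        r, _ = pearsonr(np.asarray(pred_profiles[cond_a]),
                        np.asarray(pred_profiles[cond_b]))
        r2[(cond_a, cond_b)] = r ** 2
    return r2

# Values near 1.0 for (almost) every pair mean the model predicts nearly the
# same expression profile for every perturbation.
```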

My questions:

  1. Have others encountered similar results when using scGPT for perturb-seq data or similar tasks?
  2. What could be the possible reasons for this behavior? Could it be related to:
    • Model architecture or loss function configuration?
    • Insufficient fine-tuning or suboptimal hyperparameters?
    • Data preprocessing or the inherent nature of perturb-seq data?
  3. What strategies would you recommend to improve the model's sensitivity to different perturbations and generate more distinct predictions?

Thank you for your time and support! I'm looking forward to any insights or suggestions on how to address this issue.

@jumbokun

Hi @ellieujin!
I am currently stuck on finding the so-called condition tokens. I stepped through the model's input and it seems that only the binned gene values are gathered, not what the paper describes: emb = gene_id + gene_value + condition_token.
I wonder how you set your perturbation conditions. Were they part of the condition tokens?

Thanks a lot in advance for your help!

@ellieujin
Author

ellieujin commented Jan 15, 2025

Hi @jumbokun,
I believe the TransformerGenerator in scgpt/model/generation_model.py handles the gene_id + gene_value + condition_token embedding, specifically in its _encode method.
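Conceptually, the pattern is something like the sketch below (layer and variable names are illustrative, not copied from generation_model.py, so please check the actual source):

```python
from torch import nn

class PertEmbeddingSketch(nn.Module):
    """Illustrative sketch of emb = gene_id + gene_value + condition_token."""

    def __init__(self, n_genes, n_bins, n_pert_flags, d_model):
        super().__init__()
        self.gene_emb = nn.Embedding(n_genes, d_model)        # gene_id tokens
        self.value_emb = nn.Embedding(n_bins, d_model)        # binned gene values
        self.pert_emb = nn.Embedding(n_pert_flags, d_model)   # perturbation / condition flags

    def forward(self, gene_ids, values, pert_flags):
        # The three embeddings are summed element-wise into one token embedding.
        return self.gene_emb(gene_ids) + self.value_emb(values) + self.pert_emb(pert_flags)
```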

When setting the perturbation conditions, I simply used the condition column from adata.obs before processing it with PertData.
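For example, something along these lines (the file path and dataset name are placeholders, and the PertData calls follow the GEARS tutorial, so please verify them against your GEARS version):

```python
import anndata as ad
from gears import PertData

# Assumed GEARS-style convention (double-check the GEARS docs):
# obs["condition"] is "ctrl" for control cells and "GENE+ctrl" for cells
# in which the single gene GENE was perturbed.
adata = ad.read_h5ad("my_perturb_seq.h5ad")  # placeholder path
adata.obs["condition"] = adata.obs["condition"].astype(str)

# PertData derives the perturbation flags / condition tokens from this column;
# the method names below follow the GEARS tutorial and may differ by version.
pert_data = PertData("./data")
pert_data.new_data_process(dataset_name="my_perturb_seq", adata=adata)
```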

Hope that clears things up!
