High Similarity in Predictions Across Perturbation Conditions #277

Open
ellieujin opened this issue Dec 31, 2024 · 2 comments

@ellieujin

Hello, and thank you for this amazing tool!

I recently fine-tuned scGPT using perturb-seq data to tackle a perturbation prediction task. Specifically, I aimed to predict gene expression levels for each perturbation condition where a single gene was perturbed at a time.

Here are the key details:

  1. Fine-tuning process: I fine-tuned scGPT using perturb-seq data with standard hyperparameters and followed the recommended pipeline.
  2. Prediction results: After fine-tuning, I generated predictions for all perturbation conditions. However, I observed that the pairwise Pearson R² values between the predicted gene expression profiles for different perturbations are consistently around 0.99, suggesting highly similar predictions regardless of the perturbation (a sketch of how I computed these is included below).

This high similarity in predictions was unexpected, as I anticipated more variation in the predicted expression profiles for different perturbations.
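For reference, here is roughly how I computed the pairwise correlations between predicted profiles (the variable names are illustrative, not from the scGPT codebase):

```python
import itertools

import numpy as np
from scipy.stats import pearsonr

def pairwise_pearson_r2(pred_profiles):
    """Pearson R^2 between every pair of predicted perturbation profiles.

    pred_profiles: dict mapping a condition name to its predicted mean
    expression vector over genes (e.g. averaged over the predicted cells
    for that condition).
    """
    r2 = {}
    for cond_a, cond_b in itertools.combinations(pred_profiles, 2):
        r, _ = pearsonr(np.asarray(pred_profiles[cond_a]),
                        np.asarray(pred_profiles[cond_b]))
        r2[(cond_a, cond_b)] = r ** 2
    return r2

# Values near 1.0 for (almost) every pair mean the model predicts nearly the
# same expression profile for every perturbation.
```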

My questions:

  1. Have others encountered similar results when using scGPT for perturb-seq data or similar tasks?
  2. What could be the possible reasons for this behavior? Could it be related to:
    • Model architecture or loss function configuration?
    • Insufficient fine-tuning or suboptimal hyperparameters?
    • Data preprocessing or the inherent nature of perturb-seq data?
  3. What strategies would you recommend to improve the model's sensitivity to different perturbations and generate more distinct predictions?

Thank you for your time and support! I'm looking forward to any insights or suggestions on how to address this issue.

@jumbokun

Hi @ellieujin!
I am currently stuck on finding the so-called condition tokens. I stepped through the model's input and it seems that only the binned gene values are gathered, not what the paper describes: emb = gene_id + gene_value + condition_token.
I wonder how you set your perturbation conditions. Were they part of the condition tokens?

Thanks a lot in advance for your help!

@ellieujin
Author

ellieujin commented Jan 15, 2025

Hi @jumbokun,
I believe the TransformerGenerator in scgpt/model/generation_model.py handles the gene_id + gene_value + condition_token embedding, specifically in its _encode method.
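Conceptually, the pattern is something like the sketch below (layer and variable names are illustrative, not copied from generation_model.py, so please check the actual source):

```python
from torch import nn

class PertEmbeddingSketch(nn.Module):
    """Illustrative sketch of emb = gene_id + gene_value + condition_token."""

    def __init__(self, n_genes, n_bins, n_pert_flags, d_model):
        super().__init__()
        self.gene_emb = nn.Embedding(n_genes, d_model)        # gene_id tokens
        self.value_emb = nn.Embedding(n_bins, d_model)        # binned gene values
        self.pert_emb = nn.Embedding(n_pert_flags, d_model)   # perturbation / condition flags

    def forward(self, gene_ids, values, pert_flags):
        # The three embeddings are summed element-wise into one token embedding.
        return self.gene_emb(gene_ids) + self.value_emb(values) + self.pert_emb(pert_flags)
```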

When setting the perturbation conditions, I simply used the condition column from adata.obs before processing it with PertData.
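For example, something along these lines (the file path and dataset name are placeholders, and the PertData calls follow the GEARS tutorial, so please verify them against your GEARS version):

```python
import anndata as ad
from gears import PertData

# Assumed GEARS-style convention (double-check the GEARS docs):
# obs["condition"] is "ctrl" for control cells and "GENE+ctrl" for cells
# in which the single gene GENE was perturbed.
adata = ad.read_h5ad("my_perturb_seq.h5ad")  # placeholder path
adata.obs["condition"] = adata.obs["condition"].astype(str)

# PertData derives the perturbation flags / condition tokens from this column;
# the method names below follow the GEARS tutorial and may differ by version.
pert_data = PertData("./data")
pert_data.new_data_process(dataset_name="my_perturb_seq", adata=adata)
```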

Hope that clears things up!
