Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get lineage phenotypes #1

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open

Get lineage phenotypes #1

wants to merge 8 commits into from

Conversation

plsteinberg
Copy link
Collaborator

@plsteinberg plsteinberg commented Sep 20, 2024

Modified run for assigning phenotpes to lineages.

papermill -p config_yaml XBB_config.yaml lineage_phenotypes.ipynb results/lineage_phenotypes.ipynb

Added:

Modified:

  • lineage_phenotypes.ipnyb only outputs the intermediate files and lineage_phenotypes.csv and replaced "clade" to "lineage" throughout notebook and config.
  • Split the spike pseudovirus DMS input data by region (RBD, S2, NDT, other) and saved intermediate file pivot_spike_pseudovirus_DMS_XBB.1.5.csv.
  • XBB_configs.yaml selects for spike RBD (region=RBD) and spike non-RBD (region=S2). Should I keep NDT and other as well?
  • XBB_configs.yaml only uses XBB as reference. Should we eventually create a seperate <variant>_config.yaml for BA2.86 and BA.2? In that case, it would make sense to provide the config in the papermill command rather than in the notebook itself.
  • Overwrote mutation_phenotypes.csv and mutation_phenotypes_randomized.csv while running lineage_phenotypes.ipnyb. Sorry, I am guessing it would be good to keep the original ones too.

Removed from SARS2-spike-predictor-phenos.ipnyb:

Results:

Header of lineage_phenotypes.csv

'lineage', 'date', 'parent', 'spike muts from Wuhan-Hu-1',
     'number spike muts from Wuhan-Hu-1', 'spike muts from XBB.1.5',
     'descendant of XBB',
     'spike pseudovirus DMS RBD human sera escape relative to XBB.1.5',
     'spike pseudovirus DMS RBD ACE2 binding relative to XBB.1.5',
     'spike pseudovirus DMS RBD spike mediated entry relative to XBB.1.5',
     'spike pseudovirus DMS S2 human sera escape relative to XBB.1.5',
     'spike pseudovirus DMS S2 ACE2 binding relative to XBB.1.5',
     'spike pseudovirus DMS S2 spike mediated entry relative to XBB.1.5',
     'RBD yeast-display DMS ACE2 affinity relative to XBB.1.5',
     'RBD yeast-display DMS RBD expression relative to XBB.1.5',
     'RBD yeast-display DMS escape relative to XBB.1.5',
     'EVEscape relative to XBB.1.5', 'Hamming distance relative to XBB.1.5'

Missing from Trevor's list in slack:

  • number of ORF1ab mutations
  • number of accessory protein mutations

Copy link
Member

@jbloom jbloom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@plsteinberg, nice to see work on this.

Honestly, the scope of what you are doing here is too much for me to provide detailed code reviews, so I am approving this (so you can merge whatever you want) but also making some high-level comments of things you might want to consider changing before you merge. Also tagging @trvrb in case he has input.

  • The README seems out of date after your changes. (This is fine if this is just an intermediate pull request, but wanted to note it).
  • You can use your judgment how much to keep my code versus just totally replace it. Repo might be clunky if you add without replacing. Don't worry too much about overwriting stuff. We can always go back in git history to get older things.
  • Regarding your question on NTD. Basically, spike has several domains. It can be divided various ways, but typically it is NTD - RBD - [a bit more of S1 sometimes called SD1] - S2. The NTD is probably the second most important part evolutionarily (at least for antibody escape) after the RBD. So you do not want to drop the NTD (or any region really). I think it would be reasonable to divide in either of the following ways: (a) RBD and everything else; (b) NTD, RBD, and everything else. The "everything else" category will mostly include S2, but will also include bits of S1.
  • This is really minor, but a few of your CSVs have a first column that is just the index (empty column with sequential numbers). Write your CSV files using pandas with .to_csv(<file>, index=False) to get rid of this.
  • I think you might eventually want different configs for different variants as best I remember, but I'm not sure.

@plsteinberg
Copy link
Collaborator Author

Thanks for your comments!

Good point, I will update the README with the commands and descriptions of the second notebook once I get other people's input on whether this is what we want/need.

And I just changed the config to also include the NTD, thanks!

@trvrb
Copy link
Member

trvrb commented Dec 23, 2024

@plsteinberg and @jbloom --- Is this now ready to merge? Is looks like it might be?

@jbloom
Copy link
Member

jbloom commented Jan 2, 2025

@plsteinberg is probably better suited than me to answer that question, but I think yes. Although I should not that it looks like Cornelius is no longer maintaining his Pango lineage definition JSON starting as of a few months ago.

@plsteinberg
Copy link
Collaborator Author

plsteinberg commented Jan 10, 2025

sorry! It does not seem like Marlin ended up using the lineage_phenotypes, so I don't know if it should be merged. I am not sure if that is because we need to be using different data (ie not use the yeast RBD data or use updated data in general) or I need to fix my code. If there is a specific ask let me know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants