Set up to support contribution of selection alerts #81

Open
wants to merge 12 commits into
base: main

alimanfoo
Contributor

@alimanfoo alimanfoo commented Aug 21, 2024

Resolves #80.

  • Upgrade the conda environment - a newer version of malariagen_data is needed to support authenticated access to GCS.
  • Upgrade cohorts analysis.
  • Depend on google-cloud-sdk instead of gsutil - required to get authentication working.
  • Add documentation on how to save/restore a successful workflow build to GCS.
  • Fix snakemake workflow to ensure a new workflow run will reuse outputs restored from GCS.

@alimanfoo
Contributor Author

Hi @sanjaynagi,

I've done some maintenance here to get the build working with newer versions of malariagen_data, which are needed for authenticated access to GCS.

I've also added some examples to the README on how to save the outputs from a successful build to GCS, and then restore them to a local filesystem (maybe on a different computer).

Basically, this command saves a build to GCS:

gsutil -m rsync -r build/ gs://vo_selection_atlas_dev_us_central1/build/2024-08-21/

...and then this command restores a build from GCS to a local filesystem (maybe on a different computer from the one used to create the build):

rm -r build/*
gsutil -m rsync -r gs://vo_selection_atlas_dev_us_central1/build/2024-08-21/ build/
find build -type f -exec touch {} +
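
The save and restore steps above could be wrapped in a small script so that the date-stamped path is only written in one place. This is a minimal sketch, not something from the PR: the function names and the BUCKET variable are hypothetical, and the bucket name is simply the one used in the commands above.

```shell
#!/bin/sh
# Hypothetical helpers wrapping the save/restore commands from this comment.
BUCKET="vo_selection_atlas_dev_us_central1"  # bucket name taken from the thread

# Print the GCS URL for a dated build, e.g. build_url 2024-08-21
build_url() {
    printf 'gs://%s/build/%s/\n' "$BUCKET" "$1"
}

# Save the local build/ directory to GCS under the given date.
save_build() {
    gsutil -m rsync -r build/ "$(build_url "$1")"
}

# Restore a dated build from GCS, then refresh the timestamps on the
# restored files. Each step only runs if the previous one succeeded.
# (rm -rf is used instead of rm -r so an empty build/ is not an error.)
restore_build() {
    mkdir -p build &&
    rm -rf build/* &&
    gsutil -m rsync -r "$(build_url "$1")" build/ &&
    find build -type f -exec touch {} +
}
```

With these defined, `save_build 2024-08-21` and `restore_build 2024-08-21` would correspond to the two sets of commands above.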

However, if I then run snakemake -c1 on the computer where the build has been restored to, it starts rerunning big parts of the workflow. I.e., it doesn't just build the jupyter book site, which is what I was hoping for.

Initially I thought this was down to file modification times: the input code files might have a newer modification time than the build files restored from GCS. For example, the build might have been run and saved to GCS, and the code files then cloned or checked out on a different computer later. Git does not preserve file modification times.

That's why I added the find ... touch ... command, to manually update all the timestamps on the restored build files.
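
The timestamp reasoning can be illustrated in isolation. The following is a self-contained sketch that runs in a throwaway directory; the file names are made up for the demo and nothing here touches the real build. It counts code files that are newer than a build output, before and after the touch step.

```shell
#!/bin/sh
# Demo of the mtime problem in a temporary directory (all paths hypothetical).
set -eu
demo=$(mktemp -d)
mkdir -p "$demo/workflow" "$demo/build"

# A restored build output carrying an old timestamp...
touch -t 202408210000 "$demo/build/output.csv"
# ...and a code file checked out later, which gets the current time.
touch "$demo/workflow/rule.py"

# Count code files that are newer than the build output.
count_newer_code() {
    find "$demo/workflow" -type f -newer "$demo/build/output.csv" | wc -l | tr -d ' '
}

before=$(count_newer_code)

# Same fix as in the restore step: refresh every restored file's mtime.
find "$demo/build" -type f -exec touch {} +

after=$(count_newer_code)
echo "code files newer than output: before=$before after=$after"
rm -r "$demo"
```

Before the touch, the code file looks newer than the output, which is the condition that would mark the output as stale on modification time alone. After the touch, it no longer does.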

But even with that, I'm still finding the workflow is rerunning too much. I suspect it's something to do with checkpointing, but I'm not sure.
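
One way to narrow this down, not tried in this thread, would be to compare dry runs. Snakemake keeps provenance records in a `.snakemake/` directory in the working directory, which sits outside `build/` and so would not be carried over by the rsync commands above. Whether that explains the reruns here is not established. The sketch below assumes Snakemake 7.8 or later, where the `--rerun-triggers` option is available; the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical diagnostic helpers. Neither function executes any jobs:
# -n requests a dry run, which only reports what would be run.

# Dry run with Snakemake's default rerun triggers.
dry_run_default() {
    snakemake -n -c1 "$@"
}

# Dry run that only considers file modification times.
# Assumes Snakemake >= 7.8 (availability of --rerun-triggers).
dry_run_mtime_only() {
    snakemake -n -c1 --rerun-triggers mtime "$@"
}
```

If `dry_run_mtime_only` plans far fewer jobs than `dry_run_default`, that would suggest the reruns come from something other than timestamps.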

Perhaps the workflow needs to be broken up? Separate out the book build into a different workflow? That way someone who just wants to author or edit an alert page could do so and be sure to run only the book rebuild?

@sanjaynagi
Collaborator

sanjaynagi commented Jan 2, 2025

I've now split the workflow into two separate workflows, one for doing the analysis and one for building the site. This is actually really simple: we just have two separate snakefiles (Snakefile-analysis, Snakefile-site-build), and everything else basically remains the same.

This should mean that we can download a build from the GCS bucket and run the site build workflow, and it won't want to re-run the analysis. I can't test this, though, as the gsutil command says I don't have permissions. I've tested both split-up workflows on Datalab and they work perfectly; I just haven't tested them with data explicitly downloaded from GCS.

To run snakemake with a specific snakefile, pass the --snakefile argument:

snakemake --snakefile workflow/Snakefile-site-build 
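
With two snakefiles, a small wrapper could select the right one by stage name. This is only a sketch: the function names are hypothetical, and while `workflow/Snakefile-site-build` is the path from the command above, placing `Snakefile-analysis` under `workflow/` as well is an assumption.

```shell
#!/bin/sh
# Hypothetical wrapper: map a stage name to its Snakefile.
# The analysis path is assumed to sit next to the site-build one.
snakefile_for() {
    case "$1" in
        analysis)   echo "workflow/Snakefile-analysis" ;;
        site-build) echo "workflow/Snakefile-site-build" ;;
        *)          echo "unknown stage: $1" >&2; return 1 ;;
    esac
}

# Run one stage on a single core; extra arguments are passed to snakemake.
run_stage() {
    snakefile=$(snakefile_for "$1") || return 1
    shift
    snakemake -c1 --snakefile "$snakefile" "$@"
}
```

For example, `run_stage site-build -n` would do a dry run of only the site build.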

Silly that it took me 4 months to attempt this, because it only took about 20 minutes.

Development

Successfully merging this pull request may close these issues.

Set up to support contribution of selection alerts
2 participants