feat(docs): add information on how to speed up large pipelines #1599
base: main
Conversation
I think the added information here is already present in the doc... but perhaps not finding it is a good indicator that we need to adjust things. See the discussion/comments inline.
# Loading large datasets

## Running a large pipeline

To speed up the execution time of large pipelines (such as the Short Variants pipeline), add additional worker nodes to the dataproc cluster you create:

```
./deployctl dataproc-cluster start variants --num-workers 32
```
This doc is kind of already about running large pipelines, so this feels a little redundant. Adding additional workers is documented in Step 4 -- however, that step instructs using preemptible workers instead of persistent ones. Is there a big difference here in your experience?
This being said, things are still a little confusing; perhaps we could have an intro paragraph here that more clearly lays out selecting the number of dataproc nodes/loading pods, and the relationship between the two (namely, that they should be equal)? Or maybe a tl;dr that overviews the process to make following the doc easier?
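For illustration, a minimal sketch of the sizing relationship described here, assuming the `es` cluster name used later in this thread and the `--num-workers` flag from the diff; the preemptible-worker option and the exact pod-scaling command are covered by Step 4 and not reproduced here:

```
# Assumed sketch of the sizing relationship described above: the number of
# dataproc workers on the loading ("es") cluster and the number of
# Elasticsearch loading pods should be equal.
./deployctl dataproc-cluster start es --num-workers 16
# ...then scale the Elasticsearch loading pods to 16 as well, following Step 4.
```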
Ah, I think these are two similar, but different steps of the overall pipeline process that each need more resources to allow large datasets to be processed in a reasonable amount of time.
The added part about including non-preemptible workers speeds up the execution of the pipeline that produces the final hail table (e.g. `./deployctl data-pipeline run --cluster v4p1 gnomad_v4_variants`). Last time I ran the full variants pipeline, secondary workers and preemptible workers both did nothing to speed up execution of this computational portion of the pipeline.
The part documented in Step 4 speeds up the loading of the final hail table into Elasticsearch (e.g. `./deployctl elasticsearch load-datasets --dataproc-cluster es gnomad_v4_variants`). This section is certainly better documented than the little bit I added.
I do think it would be nice to have all the information on how to run a computationally expensive pipeline and then load the resulting hail table into Elasticsearch in a single doc.
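Putting the commands quoted in this thread into one hypothetical sequence (cluster and dataset names are taken from the examples above; the compute cluster is called `variants` here to match the first command, whereas the thread's own example used `v4p1`):

```
# Phase 1: compute. Extra non-preemptible workers speed up the pipeline
# that produces the final Hail table.
./deployctl dataproc-cluster start variants --num-workers 32
./deployctl data-pipeline run --cluster variants gnomad_v4_variants

# Phase 2: load. A separate "es" cluster, sized per Step 4, loads the
# resulting Hail table into Elasticsearch.
./deployctl elasticsearch load-datasets --dataproc-cluster es gnomad_v4_variants
```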
Adds one small paragraph about adding more workers to a dataproc cluster to run the computational portion of the variants pipeline in a shorter timeframe (~2 hours).
I had meant to add this the last time I ran the browser pipeline, and found this information in my Slack messages. I figure it's preferable to have it in the docs, as they currently only document how to speed up the hail table -> Elasticsearch portion, not the generation of the hail table itself.
Edited for clarity