feat(docs): add information on how to speed up large pipelines #1599
base: main
Conversation
I think the added information here is already present in the doc... but perhaps not finding it is a good indicator that we need to adjust things. See the discussion/comments inline.
# Loading large datasets

## Running a large pipeline

To speed up the execution time of large pipelines (such as the Short Variants pipeline), add additional worker nodes to the dataproc cluster you create:

```
./deployctl dataproc-cluster start variants --num-workers 32
```
This doc is kind of already about running large pipelines, so this feels a little redundant. Adding additional workers is documented in Step 4 -- however, that step instructs using preemptible workers instead of persistent ones. Is there a big difference here in your experience?
This being said, things are still a little confusing; perhaps we could have an intro paragraph here that more clearly lays out selecting the number of dataproc nodes/loading pods, and the relationship between the two (namely, that they should be equal)? Or maybe a tl;dr that overviews the process to make following the doc easier?
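For illustration, a minimal sketch of the sizing relationship described here, assuming the `es` cluster name used later in this thread and the `--num-workers` flag from the diff; the preemptible-worker option and the exact pod-scaling command are covered by Step 4 and not reproduced here:

```
# Assumed sketch of the sizing relationship described above: the number of
# dataproc workers on the loading ("es") cluster and the number of
# Elasticsearch loading pods should be equal.
./deployctl dataproc-cluster start es --num-workers 16
# ...then scale the Elasticsearch loading pods to 16 as well, following Step 4.
```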
Ah, I think these are two similar, but different steps of the overall pipeline process that each need more resources to allow large datasets to be processed in a reasonable amount of time.
The added part about including non-preemptible workers speeds up the execution of the pipeline that produces the final hail table (e.g. `./deployctl data-pipeline run --cluster v4p1 gnomad_v4_variants`). Last time I ran the full variants pipeline, secondary workers and preemptible workers both did nothing to speed up execution of this computational portion of the pipeline.
The part documented in Step 4 speeds up the loading of the final hail table into Elasticsearch (e.g. `./deployctl elasticsearch load-datasets --dataproc-cluster es gnomad_v4_variants`). This section is certainly better documented than the little bit I added.
I do think it would be nice to have all the information on how to run a computationally expensive pipeline and then load the resulting hail table into Elasticsearch in a single doc.
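Putting the commands quoted in this thread into one hypothetical sequence (cluster and dataset names are taken from the examples above; the compute cluster is called `variants` here to match the first command, whereas the thread's own example used `v4p1`):

```
# Phase 1: compute. Extra non-preemptible workers speed up the pipeline
# that produces the final Hail table.
./deployctl dataproc-cluster start variants --num-workers 32
./deployctl data-pipeline run --cluster variants gnomad_v4_variants

# Phase 2: load. A separate "es" cluster, sized per Step 4, loads the
# resulting Hail table into Elasticsearch.
./deployctl elasticsearch load-datasets --dataproc-cluster es gnomad_v4_variants
```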
Adds one small paragraph about adding more workers to a dataproc cluster to run the computational portion of the variants pipeline in a shorter timeframe (~2 hours).
I had meant to add this the last time I ran the browser pipeline, and found this information in my Slack messages. I figure it's preferable to have it in the docs, as they currently only document how to speed up the hail table -> Elasticsearch portion, not the generation of the hail table itself.
Edited for clarity