
feat(docs): add information on how to speed up large pipelines #1599

Open · wants to merge 1 commit into base: main
36 changes: 23 additions & 13 deletions deploy/docs/LoadingLargeDatasets.md
@@ -1,9 +1,19 @@
# Loading large datasets

## Running a large pipeline

To reduce the execution time of large pipelines (such as the short variants pipeline), add additional worker nodes to the Dataproc cluster you create.

```
./deployctl dataproc-cluster start variants --num-workers 32
```
Comment on lines 1 to +9
Contributor

This doc is kind of already about running large pipelines, so this feels a little redundant. Adding additional workers is documented in Step 4 -- however, that step instructs using preemptible workers instead of persistent ones. Is there a big difference here in your experience?

This being said, things are still a little confusing; perhaps we could have an intro paragraph here that more clearly lays out selecting the number of dataproc nodes/loading pods, and the relationship between the two (namely, that they should be equal)? Or maybe a tl;dr that overviews the process to make following the doc easier?

Contributor Author
@rileyhgrant · Jul 30, 2024

Ah, I think these are two similar, but different steps of the overall pipeline process that each need more resources to allow large datasets to be processed in a reasonable amount of time.

The added part about including non-preemptible workers speeds up the execution of the pipeline that produces the final hail table (e.g. ./deployctl data-pipeline run --cluster v4p1 gnomad_v4_variants). Last time I ran the full variants pipeline, secondary workers and preemptible workers both did nothing to speed up execution of this computational portion of the pipeline.

The part documented in step 4 speeds up the loading of the final hail table into Elasticsearch (e.g. ./deployctl elasticsearch load-datasets --dataproc-cluster es gnomad_v4_variants). This section is certainly better documented than the little bit I added.

I do think it would be nice to have all the information on how to run a computationally expensive pipeline and then load the resulting Hail table into Elasticsearch in a single doc.


## Loading large datasets into Elasticsearch

To speed up loading large datasets into Elasticsearch, spin up many temporary pods and spread indexing across them.
Then move ES shards from temporary pods onto permanent pods.

## 1. Set [shard allocation filters](https://www.elastic.co/guide/en/elasticsearch/reference/current/shard-allocation-filtering.html)
### 1. Set [shard allocation filters](https://www.elastic.co/guide/en/elasticsearch/reference/current/shard-allocation-filtering.html)

This configures existing indices so that data stays on permanent data pods and does not migrate to temporary ingest pods. You can issue the following request to the `_all` index to pin existing indices to a particular nodeSet.

@@ -13,11 +23,11 @@ curl -u "elastic:$ELASTICSEARCH_PASSWORD" "http://localhost:9200/_all/_settings"
EOF
```
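
For reference, a shard allocation filter that pins indices to the permanent data pods might look roughly like the following; this is a sketch only, and the attribute and node-name pattern below are hypothetical and should be taken from the actual cluster manifest:

```
# Node-name pattern below is illustrative; match it to the permanent data nodeSet pod names.
curl -u "elastic:$ELASTICSEARCH_PASSWORD" "http://localhost:9200/_all/_settings" -XPUT -H 'Content-Type: application/json' -d @- <<EOF
{
  "index.routing.allocation.require._name": "gnomad-es-data-*"
}
EOF
```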

## 2. Add nodes to the es-ingest node pool
### 2. Add nodes to the es-ingest node pool

This can be done by adjusting the `pool_num_nodes` variable in our [terraform deployment](https://github.com/broadinstitute/gnomad-terraform/blob/8519ea09e697afc7993b278f1c2b4240ae21c8a4/exac-gnomad/services/browserv4/main.tf#L99) and opening a PR to review and apply the infrastructure change.
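
Once the Terraform change has been applied, a quick way to check that the new nodes have registered is shown below; this assumes GKE's default node naming, which includes the node pool name:

```
# New es-ingest nodes should appear and eventually report Ready.
kubectl get nodes | grep es-ingest
```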

## 3. When the new es-ingest GKE nodes are ready, add temporary pods to the Elasticsearch cluster
### 3. When the new es-ingest GKE nodes are ready, add temporary pods to the Elasticsearch cluster

```
./deployctl elasticsearch apply --n-ingest-pods=48
@@ -27,7 +37,7 @@ The number of ingest pods should match the number of nodes in the `es-ingest` no

Watch pods' readiness with `kubectl get pods -w`.
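
Once the pods are ready, you can optionally confirm that the new Elasticsearch nodes have joined the cluster; this sketch assumes the same local port-forward used for the other curl commands in this doc:

```
# Each ingest pod should show up as a node in the cluster.
curl -u "elastic:$ELASTICSEARCH_PASSWORD" "http://localhost:9200/_cat/nodes?v"
```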

## 4. Create a Dataproc cluster and load a Hail table.
### 4. Create a Dataproc cluster and load a Hail table.

The number of workers in the cluster should match the number of ingest pods.

@@ -37,7 +47,7 @@ The number of workers in the cluster should match the number of ingest pods.
./deployctl dataproc-cluster stop es
```

## 5. Determine available space on the persistent data nodes, and how much space you'll need for the new data
### 5. Determine available space on the persistent data nodes, and how much space you'll need for the new data

First, Look at the total size of all indices in Elasticsearch to see how much storage will be required for permanent pods. Add up the values in the `store.size` column output from the [cat indices API](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/cat-indices.html).
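
For example, the sizes can be pulled and summed with the cat indices API; the query parameters below are illustrative, not necessarily the exact invocation used elsewhere in this doc:

```
# List each index with its store size in GB, largest first.
curl -u "elastic:$ELASTICSEARCH_PASSWORD" "http://localhost:9200/_cat/indices?v&h=index,store.size&bytes=gb&s=store.size:desc"
```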

@@ -57,7 +67,7 @@ Depending on how much additional space you need, you can take one of three optio
2. You need slightly more space, so you can increase the size of the disks in the current persistent data nodeSet.
3. You need at least as much space as ~80% of a data pod holds (around ~1.4TB as of this writing), so you add another persistent data node.

### Option 1: No modifications
#### Option 1: No modifications

You don't need to make modifications, so you can simply move your new index to the permanent data pods. Skip to [Move data to persistent data pods](#6-move-data-to-persistent-data-pods).

@@ -67,7 +77,7 @@ curl -u "elastic:$ELASTICSEARCH_PASSWORD" "http://localhost:9200/_all/_settings"
EOF
```

### Option 2: Increase disk size
#### Option 2: Increase disk size

You can increase the size of the disks of the existing data pods. This will do an online resize of the disks. It's a good idea to ensure you have a recent snapshot of the cluster before doing this. See [Elasticsearch snapshots](./ElasticsearchSnapshots.md) for more information.

@@ -79,7 +89,7 @@ Then apply the changes:
./deployctl elasticsearch apply --n-ingest-pods=48
```
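
One way to confirm the online resize has completed is to watch the persistent volume claims until their reported capacity matches the new size; this is a minimal sketch, assuming the data pods' PVCs live in the namespace you are already working in:

```
# CAPACITY for the data pods' claims updates once the resize finishes.
kubectl get pvc -w
```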

### Option 3: Add another persistent data node
#### Option 3: Add another persistent data node

Edit [elasticsearch.yaml.jinja2](../manifests/elasticsearch/elasticsearch.yaml.jinja2) and add a new pod to the persistent nodeSet by incrementing the `count` parameter in the `data-{green,blue}` nodeSet. Note that when applied, this will cause data movement as Elasticsearch rebalances shards across the persistent nodeSet. This is generally low-impact, but it's a good idea to do this during a low-traffic period.

@@ -89,7 +99,7 @@ Apply the changes:
./deployctl elasticsearch apply --n-ingest-pods=48
```

## 6. Move data to persistent data pods
### 6. Move data to persistent data pods

Set [shard allocation filters](https://www.elastic.co/guide/en/elasticsearch/reference/current/shard-allocation-filtering.html) on new indices to move shards to the persistent data nodeSet. Do this for any newly loaded indices as well as any pre-existing indices that will be kept. Replace $INDEX with the names of the indices you need to move.

@@ -117,7 +127,7 @@ curl -u "elastic:$ELASTICSEARCH_PASSWORD" -XPUT "localhost:9200/_cluster/settings"

```
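
Shard moves can take a while for large indices. A minimal sketch for watching progress, assuming the same port-forward as above; once no shards report RELOCATING, the move is done:

```
# Prints only shards that are still moving between nodes.
curl -u "elastic:$ELASTICSEARCH_PASSWORD" "http://localhost:9200/_cat/shards?v" | grep RELOCATING
```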

## 7. Once data is done being moved, remove the ingest pods
### 7. Once data is done being moved, remove the ingest pods

```
./deployctl elasticsearch apply --n-ingest-pods=0
@@ -129,11 +139,11 @@ Watch the cluster to ensure that the ingest pods are successfully terminated:
kubectl get pods -w
```

## 8. Once the ingest pods are terminated, resize the es-ingest node pool to 0
### 8. Once the ingest pods are terminated, resize the es-ingest node pool to 0

Set the `pool_num_nodes` variable for the es-ingest node pool to 0 in our [terraform deployment](https://github.com/broadinstitute/gnomad-terraform/blob/8519ea09e697afc7993b278f1c2b4240ae21c8a4/exac-gnomad/services/browserv4/main.tf#L99) and open a PR to review and apply the infrastructure change.

## 9. Clean up, delete any unused indices.
### 9. Clean up, delete any unused indices.
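
For example, unused indices can be removed with the standard delete index API; $INDEX is a placeholder, as elsewhere in this doc, and it's worth double-checking aliases before deleting anything:

```
# List indices first, then delete only the ones that are no longer needed.
curl -u "elastic:$ELASTICSEARCH_PASSWORD" "http://localhost:9200/_cat/indices?v"
curl -u "elastic:$ELASTICSEARCH_PASSWORD" -XDELETE "http://localhost:9200/$INDEX"
```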

If cleaning up unused indices affords you enough space to remove a persistent data node, you can do so by editing the `count` parameter in the `data-{green/blue}` nodeSet in [elasticsearch.yaml.jinja2](../manifests/elasticsearch/elasticsearch.yaml.jinja2). Note that applying this will cause data movement, and it's a good idea to do this during a low-traffic period.

@@ -143,7 +153,7 @@ If cleaning up unused indices affords you enough space to remove a persistent da

Lastly, be sure to update relevant [Elasticsearch index aliases](./ElasticsearchIndexAliases.md) and [clear caches](./RedisCache.md).

## References
### References

- [Elastic Cloud on K8S](https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-overview.html)
- [Run Elasticsearch on ECK](https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-elasticsearch-specification.html)