Explicit Comms Object Not Cleared After Cluster State Change or Restart #1450
Comments
Thanks for the excellent reproducer. It seems like we need the cached comms object to be keyed by the client, something like the following diff:

diff --git a/dask_cuda/explicit_comms/comms.py b/dask_cuda/explicit_comms/comms.py
index 0fe5422..4ff5f76 100644
--- a/dask_cuda/explicit_comms/comms.py
+++ b/dask_cuda/explicit_comms/comms.py
@@ -1,4 +1,5 @@
 import asyncio
+import weakref
 import concurrent.futures
 import contextlib
 import time
@@ -9,7 +10,8 @@ import distributed.comm
 from distributed import Client, Worker, default_client, get_worker
 from distributed.comm.addressing import parse_address, parse_host_port, unparse_address

-_default_comms = None
+# Mapping client ID to CommsContext
+_comms_cache: weakref.WeakValueDictionary[str, "CommsContext"] = weakref.WeakValueDictionary()


 def get_multi_lock_or_null_context(multi_lock_context, *args, **kwargs):
@@ -53,10 +55,14 @@ def default_comms(client: Optional[Client] = None) -> "CommsContext":
     comms: CommsContext
         The default comms object
     """
-    global _default_comms
-    if _default_comms is None:
-        _default_comms = CommsContext(client=client)
-    return _default_comms
+    # Comms are unique to a client, so we need to know.
+    client = client or default_client()
+    maybe_comms = _comms_cache.get(client.id)
+    if maybe_comms is None:
+        maybe_comms = CommsContext(client=client)
+        _comms_cache[client.id] = maybe_comms
+
+    return maybe_comms

That passes a test ensuring that the default_comms match the worker addresses for a newly created cluster.
I'll need to better understand the "workers are updated" case. Normally I would recommend a scheduler plugin, but this seems to live on the client.
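As background, a toy illustration of the weakref.WeakValueDictionary semantics the diff above relies on (CommsLike is a made-up stand-in, not a dask-cuda class): once nothing else holds a strong reference to the cached value, its entry disappears from the mapping on its own.

import weakref

class CommsLike:
    """Stand-in for CommsContext; only used to illustrate the cache semantics."""

cache: "weakref.WeakValueDictionary[str, CommsLike]" = weakref.WeakValueDictionary()

obj = CommsLike()
cache["client-1"] = obj
assert cache.get("client-1") is obj   # entry is present while a strong reference exists

del obj  # drop the last strong reference to the value
# On CPython the object is collected immediately, so the entry disappears.
assert cache.get("client-1") is None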
Thanks for looking into this @TomAugspurger ! I was thinking of a very similar fix. My one qualm with your current suggestion is that we may want to create a token for the specific worker addresses and the client, e.g.:

from dask.tokenize import tokenize

cache_key = tokenize(client.id, client.scheduler_info()["workers"].keys())

I realize it may be an "edge case", but it's possible that the user could add or lose workers between shuffle operations, and we want to make sure the comms context is updated if the worker addresses are different for the same client. Does that make sense?
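For illustration, a rough sketch of how such a tokenized key might slot into default_comms, building on the diff above. It assumes the code lives in dask_cuda/explicit_comms/comms.py next to CommsContext, and it is a sketch rather than the final implementation:

import weakref
from typing import Optional

from dask.tokenize import tokenize
from distributed import Client, default_client

# Assumes this lives in dask_cuda/explicit_comms/comms.py, where CommsContext
# is defined; the cache is keyed by client *and* current worker addresses.
_comms_cache: weakref.WeakValueDictionary = weakref.WeakValueDictionary()


def default_comms(client: Optional[Client] = None) -> "CommsContext":
    client = client or default_client()
    # Tokenizing the worker addresses means that adding or losing workers
    # between shuffles yields a fresh CommsContext for the same client.
    workers = sorted(client.scheduler_info()["workers"].keys())
    cache_key = tokenize(client.id, workers)
    comms = _comms_cache.get(cache_key)
    if comms is None:
        comms = CommsContext(client=client)
        _comms_cache[cache_key] = comms
    return comms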
Honestly, in Curator, we lose workers a decent number of times at scale. If we use explicit-comms shuffle twice and lose workers in between, that could be problematic. We should ensure this case is captured as well.
Having never used explicit comms before, what's the expected behavior when the cluster changes (workers added, removed, or dying) after the comms have been established? I see that
The original explicit-comms design definitely assumed the cluster was static, and made no attempt to handle worker-failures. I don't think we should try to handle the case that workers are lost during a shuffle. However, I do think it's reasonable to handle the case that workers are lost in between shuffles (unless this support proves unrealistic).
Right - I haven't exactly vetted this idea yet, so it may not be realistic. If that turns out to be the case, it may still make sense to keep track of the worker addresses and warn the user that the workers have changed.
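A rough sketch of that fallback idea, with an illustrative helper (warn_if_workers_changed is not a real dask-cuda function):

import warnings

from distributed import Client


def warn_if_workers_changed(comms, client: Client) -> None:
    """Warn if the scheduler's current workers differ from the addresses the
    cached comms object was built with (illustrative helper, not dask-cuda API)."""
    current = set(client.scheduler_info()["workers"].keys())
    cached = set(comms.worker_addresses)
    if current != cached:
        warnings.warn(
            "Cluster workers have changed since explicit-comms was initialized; "
            "the cached comms object may be stale.",
            RuntimeWarning,
        )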
All that makes sense. Within-shuffle failures seem extremely hard to reason about. Making sure that
Awesome - Thanks @TomAugspurger ! @VibhuJawa - For now, you may be able to add the following code before you spin up a new dask-cuda cluster to manually clear the cache:

import dask_cuda.explicit_comms.comms

dask_cuda.explicit_comms.comms._default_comms = None
#1451 should take care of this.
This PR updates the CommsContext caching to be keyed by some information about the cluster, rather than a single global. This prevents us from using a stale comms object after the cluster changes (workers added or removed) or is recreated entirely. Closes #1450

Authors:
- Tom Augspurger (https://github.com/TomAugspurger)

Approvers:
- Richard (Rick) Zamora (https://github.com/rjzamora)

URL: #1451
Description
The explicit_comms module in dask_cuda does not properly reset or clean up after a cluster state change or when a new cluster is started. This results in default_comms().worker_addresses retaining references to old worker addresses, even after the original cluster has been cleaned up. In other words, the communication object does not respect the lifetime of the worker or cluster objects.
[Screenshot: both addresses align]
[Screenshot: addresses do not align anymore]
Observed Behavior
When the first cluster is cleaned up, comms.default_comms().worker_addresses still references the old worker.

After starting a new cluster, default_comms().worker_addresses continues to reference the worker from the previous cluster rather than updating to the new worker.

Expected Behavior
comms.default_comms().worker_addresses should be cleared or reset when the cluster is shut down or workers are updated.
When a new cluster is started, default_comms().worker_addresses should reflect the new worker(s) only.
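For concreteness, a minimal sketch of that expectation; it assumes a GPU machine and dask_cuda.LocalCUDACluster, and is an illustration rather than the original reproducer:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from dask_cuda.explicit_comms import comms

# First cluster: the comms object should describe exactly this cluster's workers.
with LocalCUDACluster(n_workers=1) as cluster, Client(cluster) as client:
    assert set(comms.default_comms().worker_addresses) == set(
        client.scheduler_info()["workers"]
    )

# After tearing that cluster down and starting a new one, the comms object
# should reflect the new worker(s) only; before the fix, the cached object
# still pointed at the previous cluster's worker.
with LocalCUDACluster(n_workers=1) as cluster, Client(cluster) as client:
    assert set(comms.default_comms().worker_addresses) == set(
        client.scheduler_info()["workers"]
    )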
Additional Context:
This issue led to multiple unexpected CI failures in NeMo Curator (PR #540), which took significant effort to diagnose and debug.
CC: @ayushdg, @sarahyurick, who did a lot of work triaging this issue.