fix: improve remote runtime reliability on large-scale evaluation #4869

xingyaoww · 2024-11-09T18:58:53Z

Give a summary of what the PR does, explaining any non-trivial design decisions

When running large-scale evaluation, a pod can take more than 180 seconds to switch from Pending to Running, and the upper bound on that time is unknown: some images are very large and slow to pull, or the cluster may be full, so a new node has to be created before the pod can be scheduled.

This PR tries to add back the while-loop logic for the "Pending" and "Running" states: we no longer throw RuntimeNotReadyError for these two states and instead wait patiently for the pod. Without this, evaluation breaks pretty easily (evaluating 300 instances needs 5-8 restarts, which is hardly acceptable).

To run this PR locally, use the following command:

docker run -it --rm \
  -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:3781da6-nikolaik \
  --name openhands-app-3781da6 \
  docker.all-hands.dev/all-hands-ai/openhands:3781da6

rbren · 2024-11-09T19:23:03Z

openhands/runtime/impl/remote/remote_runtime.py

+        # Wait for pending status
+        while pod_status in ('Pending', 'Running'):
+            time.sleep(2)


while X: sleep() is definitely not good practice 😬. We need to bail out eventually.

rbren

I think you should just make the stop_after_delay a configurable var, and in eval mode we can set it to something big
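For illustration only, here is a minimal sketch of a polling loop that waits through "Pending" and "Running" but still bails out after a configurable timeout. It uses only the standard library as a stand-in for the retry decorator's stop_after_delay; it is not the code from this PR. The function name wait_until_ready, the get_pod_status callable, the 'Ready' terminal status, and the locally defined RuntimeNotReadyError are all assumptions made for the example.

```python
import time
from typing import Callable


class RuntimeNotReadyError(Exception):
    """Local stand-in: raised when the pod is not ready before the deadline."""


def wait_until_ready(
    get_pod_status: Callable[[], str],  # hypothetical status lookup
    timeout_s: float = 180.0,
    poll_interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the pod reports 'Ready', or raise once timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        status = get_pod_status()
        if status == 'Ready':  # assumed name for the terminal "good" state
            return status
        if status not in ('Pending', 'Running'):
            # Any other state is treated as a hard failure in this sketch.
            raise RuntimeError(f'Unexpected pod status: {status}')
        if clock() >= deadline:
            raise RuntimeNotReadyError(
                f'Pod still {status} after {timeout_s} seconds'
            )
        sleep(poll_interval_s)
```

With this shape, an interactive session could keep a short timeout while an evaluation run passes a much larger timeout_s, so the loop always terminates.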

xingyaoww · 2024-11-09T19:34:35Z

@rbren maybe something like this?

rbren · 2024-11-09T19:54:00Z

openhands/runtime/impl/remote/remote_runtime.py

@@ -89,6 +89,7 @@ def __init__(
        )
        self.runtime_id: str | None = None
        self.runtime_url: str | None = None
+        self.runtime_init_timeout = self.config.sandbox.remote_runtime_init_timeout


Probably don't need to set this as a member var; can just use self.config.sandbox.remote_runtime_init_timeout directly.
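A rough sketch of the difference, assuming simplified stand-in config classes (SandboxConfig, AppConfig, and RemoteRuntimeSketch here are hypothetical and much smaller than the real ones): reading the value from the config at the point of use means a later override, such as a larger eval-mode timeout, is picked up without keeping a second copy in sync. The 3600 value is purely illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class SandboxConfig:
    # Default mirrors the 180-second limit mentioned in the PR description.
    remote_runtime_init_timeout: int = 180


@dataclass
class AppConfig:
    sandbox: SandboxConfig = field(default_factory=SandboxConfig)


class RemoteRuntimeSketch:
    """Hypothetical runtime that keeps only a reference to the config."""

    def __init__(self, config: AppConfig) -> None:
        self.config = config

    def init_timeout(self) -> int:
        # Read from the config when needed; there is no cached member copy.
        return self.config.sandbox.remote_runtime_init_timeout


config = AppConfig()
runtime = RemoteRuntimeSketch(config)
config.sandbox.remote_runtime_init_timeout = 3600  # illustrative eval-mode value
```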

rbren

Yup this looks great!

xingyaoww added 3 commits November 9, 2024 18:55

fix pending status

d7d905f

remote runtime tweak

33c26a2

revert to 180

2e740d1

xingyaoww requested a review from rbren November 9, 2024 18:58

tofarr approved these changes Nov 9, 2024

rbren reviewed Nov 9, 2024

rbren requested changes Nov 9, 2024

xingyaoww added 2 commits November 9, 2024 19:26

revert

7369a0b

add really large timeout for eval specifically

8dec355

xingyaoww requested a review from rbren November 9, 2024 19:34

bump timeout for eval infer as well

c577449

rbren reviewed Nov 9, 2024

rbren approved these changes Nov 9, 2024

remove extra member var

3781da6

xingyaoww marked this pull request as ready for review November 9, 2024 20:13

xingyaoww enabled auto-merge (squash) November 9, 2024 20:13

xingyaoww merged commit a07e827 into main Nov 9, 2024
12 checks passed

xingyaoww deleted the xw/remote-runtime-reliability branch November 9, 2024 20:17
