Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: improve remote runtime reliability on large-scale evaluation #4869

Merged
merged 7 commits into from
Nov 9, 2024

Conversation

xingyaoww
Copy link
Collaborator

@xingyaoww xingyaoww commented Nov 9, 2024

End-user friendly description of the problem this fixes or functionality that this introduces

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions

When running large-scale evaluation, it could take more than 180 seconds for a pod to switch from Pending to Running - and the ceiling of the time taken is actually unknown (e.g., some images are very large that take a while to pull, maybe the cluster is full and we need to create new node to put the pod, etc).

This PR try to add back the while logic for the "Pending" and "Running" states - we no longer throw RuntimeNotReadyError for these two states and instead wait patiently for it. Without this, evaluation break pretty easily (eval 300 instances need 5-8 restarts which is a little bit unacceptable).


Link of any specific issues this addresses


To run this PR locally, use the following command:

docker run -it --rm   -p 3000:3000   -v /var/run/docker.sock:/var/run/docker.sock   --add-host host.docker.internal:host-gateway   -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:3781da6-nikolaik   --name openhands-app-3781da6   docker.all-hands.dev/all-hands-ai/openhands:3781da6

@xingyaoww xingyaoww requested a review from rbren November 9, 2024 18:58
Comment on lines 290 to 292
# Wait for pending status
while pod_status in ('Pending', 'Running'):
time.sleep(2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while X: sleep() is definitely not a good practice 😬 we need to bail out eventually

Copy link
Collaborator

@rbren rbren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should just make the stop_after_delay a configurable var, and in eval mode we can set it to something big

@xingyaoww xingyaoww requested a review from rbren November 9, 2024 19:34
@xingyaoww
Copy link
Collaborator Author

@rbren maybe smth like this?

@@ -89,6 +89,7 @@ def __init__(
)
self.runtime_id: str | None = None
self.runtime_url: str | None = None
self.runtime_init_timeout = self.config.sandbox.remote_runtime_init_timeout
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably don't need to set this as a member var--can just use self.config.sandbox.remote_runtime_init_timeout

Copy link
Collaborator

@rbren rbren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup this looks great!

@xingyaoww xingyaoww marked this pull request as ready for review November 9, 2024 20:13
@xingyaoww xingyaoww enabled auto-merge (squash) November 9, 2024 20:13
@xingyaoww xingyaoww merged commit a07e827 into main Nov 9, 2024
12 checks passed
@xingyaoww xingyaoww deleted the xw/remote-runtime-reliability branch November 9, 2024 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants