-
Notifications
You must be signed in to change notification settings - Fork 5.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: improve remote runtime reliability on large-scale evaluation #4869
Conversation
# Wait for pending status | ||
while pod_status in ('Pending', 'Running'): | ||
time.sleep(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while X: sleep()
is definitely not a good practice 😬 we need to bail out eventually
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should just make the stop_after_delay
a configurable var, and in eval mode we can set it to something big
@rbren maybe smth like this? |
@@ -89,6 +89,7 @@ def __init__( | |||
) | |||
self.runtime_id: str | None = None | |||
self.runtime_url: str | None = None | |||
self.runtime_init_timeout = self.config.sandbox.remote_runtime_init_timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably don't need to set this as a member var--can just use self.config.sandbox.remote_runtime_init_timeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup this looks great!
End-user friendly description of the problem this fixes or functionality that this introduces
Give a summary of what the PR does, explaining any non-trivial design decisions
When running large-scale evaluation, it could take more than 180 seconds for a pod to switch from Pending to Running - and the ceiling of the time taken is actually unknown (e.g., some images are very large that take a while to pull, maybe the cluster is full and we need to create new node to put the pod, etc).
This PR try to add back the while logic for the "Pending" and "Running" states - we no longer throw
RuntimeNotReadyError
for these two states and instead wait patiently for it. Without this, evaluation break pretty easily (eval 300 instances need 5-8 restarts which is a little bit unacceptable).Link of any specific issues this addresses
To run this PR locally, use the following command: