-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core][runtime_env] Retry on read error in rt env agent client response read #45513
Conversation
Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
A high level question: is it guaranteed to be safe to retry upon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM after nit
Co-authored-by: Hongchao Deng <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]>
I have reproduced the issue and verified that it is safe to retry when Here's the log from runtime env agent:
It will reply STATUS_OK with correct context. |
agent is still starting up, we submit a lot of tasks to the cluster. The tasks | ||
should wait for the runtime env agent to start up and then run. | ||
https://github.com/ray-project/ray/issues/45353 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not accurate? since we won't wait but just retry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we wait every 1s before retry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually waits for agent_manager_retry_interval_ms_ = 100ms, until total = agent_register_timeout_ms = 10000ms = 10s and hard error out.
Signed-off-by: Ruiyang Wang <[email protected]>
Test timeout fixed - it turns out one previous test set some timeout-ing os env var and did not set back. Changed all such env vars to monkeypatch. |
…se read (ray-project#45513) Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Richard Liu <[email protected]>
#45353 reports issues on runtime env agent start up process + task pressure. Some times we get
on_read
errors. This PR fixes by retrying on those requests. Specifically, all HTTP errors before we get a well-formed HTTP response are retried transparently with a ray log.Also I took this chance to do some quality-of-life improvements to the runtime env agent:
Added a unit test for this situation.
Fixes #45353.