[core][runtime_env] Retry on read error in rt env agent client response read #45513

rynewang · 2024-05-23T05:44:24Z

#45353 reports issues when the runtime env agent is still starting up and is under task pressure: sometimes we get on_read errors. This PR fixes that by retrying those requests. Specifically, all HTTP errors that occur before we get a well-formed HTTP response are retried transparently, with a Ray log.

I also took this chance to make some quality-of-life improvements to the runtime env agent:

Moved the log initialization earlier, so the "process started" and "ready to serve traffic" logs are emitted separately.
Allows the HTTP client to set the method.
Explicitly sets HTTP/1.1.
Uses Beast's way to set Content-Length.
Sets never-expire on each step.
Returns NotFound on resolve and connect failures, Disconnected on on_write and on_read failures, and IOError on user errors (HTTP non-OK).
Retries on NotFound and Disconnected.
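
To make the status mapping and retry rule above concrete, here is a minimal Python sketch. It is not the C++ code in runtime_env_agent_client.cc; the `Status` enum, the stage names, and both helper functions are hypothetical names chosen for illustration only.

```python
from enum import Enum, auto


class Status(Enum):
    """Illustrative stand-ins for the statuses named in the PR description."""

    OK = auto()
    NOT_FOUND = auto()     # failure while resolving or connecting
    DISCONNECTED = auto()  # failure in on_write or on_read
    IO_ERROR = auto()      # well-formed HTTP response with a non-OK status


# Stage names are hypothetical labels for this sketch, not identifiers from Ray.
_STAGE_TO_STATUS = {
    "resolve": Status.NOT_FOUND,
    "connect": Status.NOT_FOUND,
    "write": Status.DISCONNECTED,
    "read": Status.DISCONNECTED,
    "http_non_ok": Status.IO_ERROR,
}


def status_for_failure(stage: str) -> Status:
    """Map the stage at which a request failed to the status reported to the caller."""
    return _STAGE_TO_STATUS[stage]


def should_retry(status: Status) -> bool:
    """Retry only failures that happened before a well-formed HTTP response arrived."""
    return status in (Status.NOT_FOUND, Status.DISCONNECTED)
```

Under this rule a read failure is retried, while a non-OK HTTP response is surfaced to the caller as an error.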

Added a unit test for this situation.

Fixes #45353.

Signed-off-by: Ruiyang Wang <[email protected]>

jjyao · 2024-05-25T04:24:33Z

A high level question: is it guaranteed to be safe to retry upon on_read errors?

hongchaodeng

LGTM after nit

src/ray/raylet/runtime_env_agent_client.cc

Co-authored-by: Hongchao Deng <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]>

hongchaodeng · 2024-06-03T20:44:16Z

I have reproduced the issue and verified that it is safe to retry when on_read fails.

Here's the log from runtime env agent:

2024-06-03 20:33:53,759 INFO runtime_env_agent.py:380 -- Creating runtime env: {"_ray_commit": "{{RAY_COMMIT_SHA}}", "pip": {"packages": ["emoji"], "pip_check": false}} with timeout 600 seconds.
2024-06-03 20:33:57,751 INFO runtime_env_agent.py:428 -- Successfully created runtime env: {"_ray_commit": "{{RAY_COMMIT_SHA}}", "pip": {"packages": ["emoji"], "pip_check": false}}, the context: {"command_prefix": ["source", "/tmp/ray/session_2024-06-03_20-25-13_549115_257179/runtime_resources/pip/69cd483986f3d2762be1e312f9cd49dac02e9f97/virtualenv/bin/activate", "1>&2", "&&"], "env_vars": {}, "py_executable": "/tmp/ray/session_2024-06-03_20-25-13_549115_257179/runtime_resources/pip/69cd483986f3d2762be1e312f9cd49dac02e9f97/virtualenv/bin/python", "resources_dir": null, "override_worker_entrypoint": null, "java_jars": []}
2024-06-03 20:33:57,751 INFO runtime_env_agent.py:464 -- Runtime env already created successfully. Env: {"_ray_commit": "{{RAY_COMMIT_SHA}}", "pip": {"packages": ["emoji"], "pip_check": false}}, context: {"command_prefix": ["source", "/tmp/ray/session_2024-06-03_20-25-13_549115_257179/runtime_resources/pip/69cd483986f3d2762be1e312f9cd49dac02e9f97/virtualenv/bin/activate", "1>&2", "&&"], "env_vars": {}, "py_executable": "/tmp/ray/session_2024-06-03_20-25-13_549115_257179/runtime_resources/pip/69cd483986f3d2762be1e312f9cd49dac02e9f97/virtualenv/bin/python", "resources_dir": null, "override_worker_entrypoint": null, "java_jars": []}
2024-06-03 20:33:58,753 INFO runtime_env_agent.py:464 -- Runtime env already created successfully. Env: {"_ray_commit": "{{RAY_COMMIT_SHA}}", "pip": {"packages": ["emoji"], "pip_check": false}}, context: {"command_prefix": ["source", "/tmp/ray/session_2024-06-03_20-25-13_549115_257179/runtime_resources/pip/69cd483986f3d2762be1e312f9cd49dac02e9f97/virtualenv/bin/activate", "1>&2", "&&"], "env_vars": {}, "py_executable": "/tmp/ray/session_2024-06-03_20-25-13_549115_257179/runtime_resources/pip/69cd483986f3d2762be1e312f9cd49dac02e9f97/virtualenv/bin/python", "resources_dir": null, "override_worker_entrypoint": null, "java_jars": []}

The agent replies STATUS_OK with the correct context.
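
The log suggests why a retried request is harmless: the second and third requests hit the "already created" path and get back the same context. A minimal Python sketch of that idempotent behavior follows. `FakeRuntimeEnvAgent`, its fields, and the placeholder context are hypothetical and only model the behavior seen in the log, not the real runtime_env_agent.py.

```python
class FakeRuntimeEnvAgent:
    """Toy model of an agent that creates each runtime env at most once."""

    def __init__(self):
        self._contexts = {}    # serialized env -> context created for it
        self.create_count = 0  # how many envs were actually built

    def get_or_create(self, serialized_env: str) -> dict:
        # A repeated (retried) request finds the cached context and skips the build.
        if serialized_env not in self._contexts:
            self.create_count += 1
            # Placeholder context; a real agent would build the env here.
            self._contexts[serialized_env] = {"env_index": self.create_count}
        return {"status": "STATUS_OK", "context": self._contexts[serialized_env]}


agent = FakeRuntimeEnvAgent()
env = '{"pip": {"packages": ["emoji"], "pip_check": false}}'
first = agent.get_or_create(env)
retried = agent.get_or_create(env)  # e.g. the client retried after a read error
```

In this model `first` and `retried` are equal and `create_count` stays at 1, which is the property that makes a transparent retry safe.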

jjyao · 2024-06-03T20:49:17Z

python/ray/tests/test_runtime_env.py

+    agent is still starting up, we submit a lot of tasks to the cluster. The tasks
+    should wait for the runtime env agent to start up and then run.
+    https://github.com/ray-project/ray/issues/45353


This is not accurate, is it? We don't wait; we just retry.

We wait 1s before each retry.

Actually, it waits agent_manager_retry_interval_ms_ = 100ms between retries until the total reaches agent_register_timeout_ms = 10000ms = 10s, and then hard errors out.
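
The timing described in this thread can be sketched as a fixed-interval retry loop with an overall deadline. This is a Python illustration under assumptions, not Ray's C++ logic: `retry_until_deadline`, the injectable `now_ms` and `sleep_ms` hooks, and the use of `TimeoutError` in place of the hard error are all hypothetical. Only the 100ms interval and 10000ms timeout come from the comment above.

```python
import time
from typing import Callable, Optional


def retry_until_deadline(
    attempt: Callable[[], bool],
    retry_interval_ms: float = 100,   # agent_manager_retry_interval_ms_ in the thread
    timeout_ms: float = 10_000,       # agent_register_timeout_ms in the thread
    now_ms: Optional[Callable[[], float]] = None,
    sleep_ms: Optional[Callable[[float], None]] = None,
) -> None:
    """Call attempt() until it returns True, sleeping a fixed interval between tries.

    Raises TimeoutError once another sleep would pass the overall deadline.
    """
    now_ms = now_ms or (lambda: time.monotonic() * 1000.0)
    sleep_ms = sleep_ms or (lambda ms: time.sleep(ms / 1000.0))
    deadline = now_ms() + timeout_ms
    while True:
        if attempt():
            return
        if now_ms() + retry_interval_ms > deadline:
            raise TimeoutError("agent did not respond before the deadline")
        sleep_ms(retry_interval_ms)
```

The clock and sleep functions are parameters so the loop can be exercised with a fake clock instead of real waiting.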

rynewang · 2024-06-03T22:34:23Z

Fixed the test timeout: it turns out an earlier test set a timeout-related OS env var and never restored it. Changed all such env vars to use monkeypatch.
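
For readers unfamiliar with this failure mode, the sketch below shows how a scoped environment change avoids leaking into later tests. pytest's `monkeypatch.setenv` undoes its changes at test teardown. To keep the example runnable without pytest, it swaps in the standard library's `unittest.mock.patch.dict`, which restores `os.environ` on exit in a similar way. The variable name `EXAMPLE_TIMEOUT_SECONDS` and both functions are hypothetical, not names from the Ray test suite.

```python
import os
from unittest import mock

# Hypothetical variable name, used only for this illustration.
ENV_VAR = "EXAMPLE_TIMEOUT_SECONDS"


def read_timeout(default: int = 600) -> int:
    """Read the timeout from the environment, falling back to a default."""
    return int(os.environ.get(ENV_VAR, default))


def run_with_scoped_env() -> tuple:
    """Set the variable only inside the block; it is restored on exit."""
    with mock.patch.dict(os.environ, {ENV_VAR: "1"}):
        inside = read_timeout()
    after = read_timeout()  # the override no longer applies here
    return inside, after
```

Assigning to `os.environ` directly in a test has no such cleanup, so every later test in the same process would see the short timeout.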

…se read (ray-project#45513) Signed-off-by: Ruiyang Wang <[email protected]> Signed-off-by: Richard Liu <[email protected]>

best effort updates for our http client

fecd202

Signed-off-by: Ruiyang Wang <[email protected]>

rynewang assigned jjyao May 23, 2024

reduce number of tasks

8ee1ef2

Signed-off-by: Ruiyang Wang <[email protected]>

Merge branch 'master' into rea-probe

b08fc9b

hongchaodeng self-assigned this May 29, 2024

hongchaodeng approved these changes May 29, 2024

src/ray/raylet/runtime_env_agent_client.cc (outdated)

Update src/ray/raylet/runtime_env_agent_client.cc

c36e366

Co-authored-by: Hongchao Deng <[email protected]> Signed-off-by: Ruiyang Wang <[email protected]>

jjyao reviewed Jun 3, 2024

jjyao approved these changes Jun 3, 2024

rynewang and others added 2 commits June 3, 2024 14:11

Merge branch 'master' into rea-probe

9339a7e

Fix test timeout

0572243

Signed-off-by: Ruiyang Wang <[email protected]>

jjyao enabled auto-merge (squash) June 4, 2024 00:29

github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) Jun 4, 2024

jjyao merged commit cd1fc84 into ray-project:master Jun 4, 2024
7 checks passed

rynewang deleted the rea-probe branch June 5, 2024 19:09