[core][runtime_env] Retry on read error in rt env agent client response read #45513

Merged (6 commits) on Jun 4, 2024

Contributor

@rynewang rynewang commented May 23, 2024

#45353 reports issues with the runtime env agent's start-up process under task pressure: sometimes we get on_read errors. This PR fixes that by retrying those requests. Specifically, every HTTP error that occurs before we get a well-formed HTTP response is retried transparently, with a Ray log message.

I also took this chance to make some quality-of-life improvements to the runtime env agent:

  • Moved log initialization earlier, so the "process started" and "ready to serve traffic" logs are emitted separately.
  • Allow the HTTP client to set the method.
  • Explicitly set HTTP/1.1.
  • Use Beast's way of setting Content-Length.
  • Set never-expire on each step.
  • Return NotFound on resolve and connect errors, Disconnected on on_write and on_read errors, and IOError on user errors (non-OK HTTP status).
  • Retry on NotFound and Disconnected.
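
To illustrate the last two bullets, here is a minimal Python sketch of the status mapping and the retry decision. It is not the code in this PR (the client is C++); the stage names, the `Status` enum, and the helper functions are hypothetical and exist only for this example.

```python
from enum import Enum


class Status(Enum):
    OK = "OK"
    NOT_FOUND = "NotFound"
    DISCONNECTED = "Disconnected"
    IO_ERROR = "IOError"


# Hypothetical stage names; they only label where in the request the failure happened.
STAGE_TO_STATUS = {
    "resolve": Status.NOT_FOUND,
    "connect": Status.NOT_FOUND,
    "write": Status.DISCONNECTED,
    "read": Status.DISCONNECTED,
    "http_non_ok": Status.IO_ERROR,
}

# Failures that happen before a well-formed HTTP response arrives are retried.
RETRYABLE = frozenset({Status.NOT_FOUND, Status.DISCONNECTED})


def classify(stage: str) -> Status:
    """Map the stage at which a request failed to the status reported to the caller."""
    return STAGE_TO_STATUS[stage]


def should_retry(status: Status) -> bool:
    """Return True only for statuses that are retried transparently."""
    return status in RETRYABLE
```

Under this mapping, a failed read is retried, while a non-OK HTTP response is surfaced to the caller as IOError.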

Added a unit test for this situation.

Fixes #45353.

Signed-off-by: Ruiyang Wang <[email protected]>
@jjyao
Collaborator

jjyao commented May 25, 2024

A high level question: is it guaranteed to be safe to retry upon on_read errors?

@hongchaodeng hongchaodeng self-assigned this May 29, 2024
Member

@hongchaodeng hongchaodeng left a comment

LGTM after nit

src/ray/raylet/runtime_env_agent_client.cc (outdated, resolved)
Co-authored-by: Hongchao Deng <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
@hongchaodeng
Member

I have reproduced the issue and verified that it is safe to retry when on_read fails.

Here's the log from runtime env agent:

2024-06-03 20:33:53,759 INFO runtime_env_agent.py:380 -- Creating runtime env: {"_ray_commit": "{{RAY_COMMIT_SHA}}", "pip": {"packages": ["emoji"], "pip_check": false}} with timeout 600 seconds.
2024-06-03 20:33:57,751 INFO runtime_env_agent.py:428 -- Successfully created runtime env: {"_ray_commit": "{{RAY_COMMIT_SHA}}", "pip": {"packages": ["emoji"], "pip_check": false}}, the context: {"command_prefix": ["source", "/tmp/ray/session_2024-06-03_20-25-13_549115_257179/runtime_resources/pip/69cd483986f3d2762be1e312f9cd49dac02e9f97/virtualenv/bin/activate", "1>&2", "&&"], "env_vars": {}, "py_executable": "/tmp/ray/session_2024-06-03_20-25-13_549115_257179/runtime_resources/pip/69cd483986f3d2762be1e312f9cd49dac02e9f97/virtualenv/bin/python", "resources_dir": null, "override_worker_entrypoint": null, "java_jars": []}
2024-06-03 20:33:57,751 INFO runtime_env_agent.py:464 -- Runtime env already created successfully. Env: {"_ray_commit": "{{RAY_COMMIT_SHA}}", "pip": {"packages": ["emoji"], "pip_check": false}}, context: {"command_prefix": ["source", "/tmp/ray/session_2024-06-03_20-25-13_549115_257179/runtime_resources/pip/69cd483986f3d2762be1e312f9cd49dac02e9f97/virtualenv/bin/activate", "1>&2", "&&"], "env_vars": {}, "py_executable": "/tmp/ray/session_2024-06-03_20-25-13_549115_257179/runtime_resources/pip/69cd483986f3d2762be1e312f9cd49dac02e9f97/virtualenv/bin/python", "resources_dir": null, "override_worker_entrypoint": null, "java_jars": []}
2024-06-03 20:33:58,753 INFO runtime_env_agent.py:464 -- Runtime env already created successfully. Env: {"_ray_commit": "{{RAY_COMMIT_SHA}}", "pip": {"packages": ["emoji"], "pip_check": false}}, context: {"command_prefix": ["source", "/tmp/ray/session_2024-06-03_20-25-13_549115_257179/runtime_resources/pip/69cd483986f3d2762be1e312f9cd49dac02e9f97/virtualenv/bin/activate", "1>&2", "&&"], "env_vars": {}, "py_executable": "/tmp/ray/session_2024-06-03_20-25-13_549115_257179/runtime_resources/pip/69cd483986f3d2762be1e312f9cd49dac02e9f97/virtualenv/bin/python", "resources_dir": null, "override_worker_entrypoint": null, "java_jars": []}

The repeated requests hit the "Runtime env already created successfully" path, so the agent replies STATUS_OK with the correct context.

Comment on lines +424 to +426
agent is still starting up, we submit a lot of tasks to the cluster. The tasks
should wait for the runtime env agent to start up and then run.
https://github.com/ray-project/ray/issues/45353
Collaborator

Is this accurate? We won't wait; we just retry.

Contributor Author

We wait 1s before each retry.

Contributor Author

Correction: it actually waits agent_manager_retry_interval_ms_ = 100ms between retries, until the total reaches agent_register_timeout_ms = 10000ms = 10s, and then hard-errors out.
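
For readers who want the timing spelled out, the following is a rough Python sketch of a fixed-interval retry loop with an overall deadline. It only mirrors the two numbers quoted above (100ms interval, 10s total); the function name, its parameters, and the use of `TimeoutError` as a stand-in for the hard error are assumptions for illustration, not the raylet's actual implementation.

```python
import time
from typing import Callable


def retry_until_deadline(
    attempt: Callable[[], bool],
    interval_s: float = 0.1,  # mirrors agent_manager_retry_interval_ms_ = 100ms
    timeout_s: float = 10.0,  # mirrors agent_register_timeout_ms = 10000ms
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `attempt` until it returns True and return the number of calls made.

    Raises TimeoutError when another wait would run past the overall deadline.
    """
    deadline = clock() + timeout_s
    calls = 0
    while True:
        calls += 1
        if attempt():
            return calls
        # Give up if sleeping one more interval would exceed the deadline.
        if clock() + interval_s > deadline:
            raise TimeoutError(f"gave up after {calls} attempts")
        sleep(interval_s)
```

The clock and sleep functions are injectable so that the loop can be exercised in a test without actually waiting.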

@rynewang
Contributor Author

rynewang commented Jun 3, 2024

Test timeout fixed. It turns out a previous test set a timeout-inducing OS env var and did not restore it. Changed all such env vars to use monkeypatch.
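
As an illustration of that change, here is a small pytest sketch that sets an environment variable through the `monkeypatch` fixture, so the previous value is restored when the test ends. The variable name `EXAMPLE_AGENT_TIMEOUT_S` and the `read_timeout_s` helper are hypothetical and are not taken from the Ray test suite.

```python
import os

# Hypothetical variable name, used only for this illustration.
ENV_VAR = "EXAMPLE_AGENT_TIMEOUT_S"


def read_timeout_s(default: float = 10.0) -> float:
    """Return the timeout configured through the environment, else `default`."""
    return float(os.environ.get(ENV_VAR, default))


def test_short_timeout(monkeypatch):
    # pytest undoes this change after the test, so later tests are unaffected.
    monkeypatch.setenv(ENV_VAR, "1")
    assert read_timeout_s() == 1.0


def test_default_timeout(monkeypatch):
    # Remove the variable so the test does not depend on the outer environment.
    monkeypatch.delenv(ENV_VAR, raising=False)
    assert read_timeout_s() == 10.0
```

Assigning to `os.environ` directly inside a test would leave the value in place for every test that runs afterwards in the same process, which is the kind of leak described above.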

@jjyao jjyao enabled auto-merge (squash) June 4, 2024 00:29
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Jun 4, 2024
@jjyao jjyao merged commit cd1fc84 into ray-project:master Jun 4, 2024
7 checks passed
@rynewang rynewang deleted the rea-probe branch June 5, 2024 19:09
richardsliu pushed a commit to richardsliu/ray that referenced this pull request Jun 12, 2024
Successfully merging this pull request may close these issues.

[core] Raylet should wait for RuntimeEnvAgent to start before receiving tasks.