[Fix][RayService] Use LRU cache for ServeConfigs #2683

Conversation

@MortalHappiness (Member) commented on Dec 23, 2024

Why are these changes needed?

See the description in the corresponding issue for details.

Related issue number

Closes: #2549

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@MortalHappiness force-pushed the bugfix/#2549-serveconfigs-memory-leak branch from 59fba01 to 1e581e9 on December 23, 2024 18:09
@MortalHappiness marked this pull request as ready for review on December 23, 2024 18:33
@kevin85421 (Member) commented:

cc @rueian would you mind reviewing this PR?

@@ -191,6 +191,8 @@ const (

// KubeRayController represents the value of the default job controller
KubeRayController = "ray.io/kuberay-operator"

ServeConfigLRUSize = 1000
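
For context, here is a minimal sketch of the bounded-cache idea behind this constant. The hashicorp/golang-lru library and the namespace/name key are illustrative assumptions only; the PR's actual cache implementation and key format may differ.

```go
package main

import (
	"fmt"

	lru "github.com/hashicorp/golang-lru"
)

// ServeConfigLRUSize mirrors the constant added in this diff.
const ServeConfigLRUSize = 1000

func main() {
	// A bounded LRU replaces an unbounded map of serve configs: once
	// ServeConfigLRUSize entries exist, the least recently used entry is
	// evicted instead of the map growing without limit.
	cache, err := lru.New(ServeConfigLRUSize)
	if err != nil {
		panic(err)
	}

	// Hypothetical key: the RayService's namespaced name.
	key := "default/my-rayservice"
	cache.Add(key, "applications:\n  - name: app1")

	if cfg, ok := cache.Get(key); ok {
		fmt.Println("cached serve config:\n" + cfg.(string))
	}
}
```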
@rueian (Contributor) commented on Dec 26, 2024:

LGTM. One small question: Is it ok to re-apply serve configs due to cache evictions when we have more than 1000 active RayServices? If there are more than 1000 active RayServices, the checkIfNeedSubmitServeDeployment function will start spuriously returning true for the evicted entries, triggering unnecessary re-applies.
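
To make the eviction concern concrete, here is a simplified, hypothetical version of that check (the real KubeRay function has a different signature and more logic): a miss caused by LRU eviction looks the same as "never applied", so the check conservatively reports that a submission is needed.

```go
package main

import "fmt"

// configCache is a minimal stand-in for the LRU cache that backs the serve
// configs in this sketch; only Get is needed to illustrate the point.
type configCache interface {
	Get(key string) (string, bool)
}

// checkIfNeedSubmitServeDeployment here is a simplified, hypothetical version
// of the KubeRay helper with the same name: it reports true when the desired
// config differs from whatever is cached. With an LRU, an evicted entry is
// indistinguishable from "never applied", so the function returns true and the
// serve config gets re-applied even though nothing changed.
func checkIfNeedSubmitServeDeployment(cache configCache, key, desiredConfig string) bool {
	cached, ok := cache.Get(key)
	if !ok {
		return true // cache miss: never applied OR evicted by the LRU
	}
	return cached != desiredConfig
}

// mapCache is a toy cache used only to exercise the sketch.
type mapCache map[string]string

func (m mapCache) Get(key string) (string, bool) {
	v, ok := m[key]
	return v, ok
}

func main() {
	cache := mapCache{}
	fmt.Println(checkIfNeedSubmitServeDeployment(cache, "default/svc", "cfg")) // true: miss
	cache["default/svc"] = "cfg"
	fmt.Println(checkIfNeedSubmitServeDeployment(cache, "default/svc", "cfg")) // false: unchanged
}
```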

@MortalHappiness (Member, Author) commented on Dec 26, 2024:

I think this can be handled using ObservedGeneration. Currently, we are not using ObservedGeneration to determine whether to update the CR. Once we start using it, this issue should be resolved. Perhaps we should create an issue to handle ObservedGeneration.

As a workaround for now, maybe we can set the cache size to a sufficiently large value? Do you think 1000 is enough?
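
A sketch of the ObservedGeneration idea mentioned above, following the usual Kubernetes controller convention rather than any code in this PR:

```go
package main

import "fmt"

// specChangedSinceLastReconcile sketches the ObservedGeneration idea: the API
// server bumps metadata.generation on every spec change, and the controller
// records the generation it last acted on in status.observedGeneration, so a
// plain comparison can decide whether a re-submit is needed without keeping a
// copy of the previous serve config around.
func specChangedSinceLastReconcile(generation, observedGeneration int64) bool {
	return generation != observedGeneration
}

func main() {
	fmt.Println(specChangedSinceLastReconcile(3, 2)) // true: spec changed since last reconcile
	fmt.Println(specChangedSinceLastReconcile(3, 3)) // false: already handled
}
```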

@rueian (Contributor) commented:

I am not sure, but I am also afraid that 1000 is already too large, almost equivalent to the "memory leak" in the original issue. If re-applying serve configs is okay, we should probably shrink the number.

@MortalHappiness (Member, Author) commented on Dec 26, 2024:

I think it should be fine. ServeConfig is a YAML string, which is at most KB-level in size. Even with 1000 of them, it would only be at the MB level.
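
As a rough, back-of-the-envelope illustration (assuming around 10 KB per serve config YAML, which is an assumption rather than a measured figure): 1000 entries × 10 KB ≈ 10 MB of cached strings.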

@kevin85421 (Member) commented:

> Is it ok to re-apply serve configs

Re-applying serveConfig is fine.

Ray encourages users to run multiple Serve applications in a single RayCluster. It's hard for me to imagine a user managing more than 100 RayService CRs with a single KubeRay operator.

@kevin85421 merged commit efbd35e into ray-project:master on Dec 27, 2024
24 checks passed
This pull request closes: [RayService][Refactor] ServeConfigs memory leak (#2549)

3 participants