[Fix][RayService] Use LRU cache for ServeConfigs #2683
Conversation
Closes: ray-project#2549 Signed-off-by: Chi-Sheng Liu <[email protected]>
Force-pushed from 59fba01 to 1e581e9.
cc @rueian would you mind reviewing this PR?
@@ -191,6 +191,8 @@ const (
 	// KubeRayController represents the value of the default job controller
 	KubeRayController = "ray.io/kuberay-operator"
+
+	ServeConfigLRUSize = 1000
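For context, here is a minimal sketch of how a bounded LRU cache can back the serve-config check. This is illustrative only: it assumes `hashicorp/golang-lru/v2` and made-up key and variable names, not the actual package or reconciler code in this PR.

```go
package main

import (
	"fmt"

	lru "github.com/hashicorp/golang-lru/v2"
)

// ServeConfigLRUSize mirrors the constant added in this PR.
const ServeConfigLRUSize = 1000

func main() {
	// Bounded cache: once more than ServeConfigLRUSize RayServices have
	// cached configs, the least-recently-used entry is evicted.
	cache, err := lru.New[string, string](ServeConfigLRUSize)
	if err != nil {
		panic(err)
	}

	key := "default/my-rayservice"     // hypothetical namespace/name key
	newConfig := "applications: [...]" // the serve config YAML string

	cached, ok := cache.Get(key)
	needSubmit := !ok || cached != newConfig
	if needSubmit {
		// Submit the config to the Ray dashboard here, then record it so
		// the next reconcile can skip an identical submission.
		cache.Add(key, newConfig)
	}
	fmt.Println("needSubmit:", needSubmit)
}
```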
LGTM. One small question: is it OK to re-apply serve configs due to cache evictions when we have more than 1000 active RayServices? If there are more than 1000 active RayServices, the `checkIfNeedSubmitServeDeployment` function will start returning false positives, reporting `true` for configs that were already applied.
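To make the concern concrete, here is a hypothetical sketch (again assuming `hashicorp/golang-lru/v2`, not KubeRay's actual code) of how exceeding the capacity evicts the oldest entry and turns the next check into a false positive:

```go
package main

import (
	"fmt"

	lru "github.com/hashicorp/golang-lru/v2"
)

func main() {
	cache, _ := lru.New[string, string](1000) // capacity = ServeConfigLRUSize
	for i := 0; i <= 1000; i++ {              // 1001 RayServices
		cache.Add(fmt.Sprintf("ns/rayservice-%d", i), "unchanged-config")
	}
	// rayservice-0 was evicted when the 1001st entry was added, so the
	// cache lookup misses and the controller would re-submit its config.
	_, ok := cache.Get("ns/rayservice-0")
	fmt.Println("cache hit:", ok) // false -> spurious "needs submit"
}
```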
I think this can be handled using `ObservedGeneration`. Currently, we are not using `ObservedGeneration` to determine whether to update the CR. Once we start using it, this issue should be resolved. Perhaps we should create an issue to handle `ObservedGeneration`.
As a workaround for now, maybe we can set the cache size to a sufficiently large value? Do you think 1000 is enough?
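For illustration, here is a hedged sketch of the `ObservedGeneration` idea: the status field records the last `metadata.generation` the controller acted on, so a match means nothing needs resubmitting. The `RayServiceStatus` stand-in type and the `needsResubmit` helper are hypothetical, not KubeRay's actual API.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// RayServiceStatus is a stand-in for the real status type; only the
// ObservedGeneration field matters for this sketch.
type RayServiceStatus struct {
	ObservedGeneration int64
}

// needsResubmit is a hypothetical helper: metadata.generation increments
// on every spec change, so a mismatch means the applied serve config may
// be stale, while a match means the controller already acted on this spec.
func needsResubmit(meta metav1.ObjectMeta, status RayServiceStatus) bool {
	return status.ObservedGeneration != meta.Generation
}

func main() {
	meta := metav1.ObjectMeta{Generation: 3}
	status := RayServiceStatus{ObservedGeneration: 3}
	fmt.Println(needsResubmit(meta, status)) // false: spec already observed
}
```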
I am not sure, but I am also afraid that 1000 is already too large, almost equivalent to the "memory leak" in the original issue. If re-applying serve configs is okay, we should probably shrink the number.
I think it should be fine. ServeConfig is a YAML string, which is at most KB-level in size. Even with 1000 of them, it would only be at the MB level.
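As a rough sanity check (the per-config size here is an assumption, not a measurement): 1000 entries at roughly 10 KB of YAML each comes to about 10 MB, which is modest for an operator process.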
> Is it ok to re-apply serve configs

Re-applying the serveConfig is fine. Ray encourages users to run multiple Serve applications in a single RayCluster, so it's hard for me to imagine a user managing more than 100 RayService CRs with a single KubeRay operator.
Why are these changes needed?
See the description in the corresponding issue for details.
Related issue number
Closes: #2549