[Fix][RayService] Use LRU cache for ServeConfigs #2683

Conversation

@MortalHappiness (Member) commented on Dec 23, 2024

Why are these changes needed?

See the description in the corresponding issue for details.

Related issue number

Closes: #2549

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@MortalHappiness force-pushed the bugfix/#2549-serveconfigs-memory-leak branch from 59fba01 to 1e581e9 on December 23, 2024 18:09
@MortalHappiness marked this pull request as ready for review on December 23, 2024 18:33
@kevin85421 (Member) commented:

cc @rueian would you mind reviewing this PR?

@@ -191,6 +191,8 @@ const (

// KubeRayController represents the value of the default job controller
KubeRayController = "ray.io/kuberay-operator"

ServeConfigLRUSize = 1000
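
For context, here is a minimal sketch of the bounded-cache idea behind this constant. The hashicorp/golang-lru library and the namespace/name key are illustrative assumptions only; the PR's actual cache implementation and key format may differ.

```go
package main

import (
	"fmt"

	lru "github.com/hashicorp/golang-lru"
)

// ServeConfigLRUSize mirrors the constant added in this diff.
const ServeConfigLRUSize = 1000

func main() {
	// A bounded LRU replaces an unbounded map of serve configs: once
	// ServeConfigLRUSize entries exist, the least recently used entry is
	// evicted instead of the map growing without limit.
	cache, err := lru.New(ServeConfigLRUSize)
	if err != nil {
		panic(err)
	}

	// Hypothetical key: the RayService's namespaced name.
	key := "default/my-rayservice"
	cache.Add(key, "applications:\n  - name: app1")

	if cfg, ok := cache.Get(key); ok {
		fmt.Println("cached serve config:\n" + cfg.(string))
	}
}
```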
@rueian (Contributor) commented on Dec 26, 2024:

LGTM. One small question: Is it ok to re-apply serve configs due to cache evictions when we have more than 1000 active RayServices? If there are more than 1000 active RayServices, the checkIfNeedSubmitServeDeployment function will start spuriously returning true for the evicted entries, triggering unnecessary re-applies.
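
To make the eviction concern concrete, here is a simplified, hypothetical version of that check (the real KubeRay function has a different signature and more logic): a miss caused by LRU eviction looks the same as "never applied", so the check conservatively reports that a submission is needed.

```go
package main

import "fmt"

// configCache is a minimal stand-in for the LRU cache that backs the serve
// configs in this sketch; only Get is needed to illustrate the point.
type configCache interface {
	Get(key string) (string, bool)
}

// checkIfNeedSubmitServeDeployment here is a simplified, hypothetical version
// of the KubeRay helper with the same name: it reports true when the desired
// config differs from whatever is cached. With an LRU, an evicted entry is
// indistinguishable from "never applied", so the function returns true and the
// serve config gets re-applied even though nothing changed.
func checkIfNeedSubmitServeDeployment(cache configCache, key, desiredConfig string) bool {
	cached, ok := cache.Get(key)
	if !ok {
		return true // cache miss: never applied OR evicted by the LRU
	}
	return cached != desiredConfig
}

// mapCache is a toy cache used only to exercise the sketch.
type mapCache map[string]string

func (m mapCache) Get(key string) (string, bool) {
	v, ok := m[key]
	return v, ok
}

func main() {
	cache := mapCache{}
	fmt.Println(checkIfNeedSubmitServeDeployment(cache, "default/svc", "cfg")) // true: miss
	cache["default/svc"] = "cfg"
	fmt.Println(checkIfNeedSubmitServeDeployment(cache, "default/svc", "cfg")) // false: unchanged
}
```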

@MortalHappiness (Member, Author) commented on Dec 26, 2024:

I think this can be handled using ObservedGeneration. Currently, we are not using ObservedGeneration to determine whether to update the CR. Once we start using it, this issue should be resolved. Perhaps we should create an issue to handle ObservedGeneration.

As a workaround for now, maybe we can set the cache size to a sufficiently large value? Do you think 1000 is enough?
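
A sketch of the ObservedGeneration idea mentioned above, following the usual Kubernetes controller convention rather than any code in this PR:

```go
package main

import "fmt"

// specChangedSinceLastReconcile sketches the ObservedGeneration idea: the API
// server bumps metadata.generation on every spec change, and the controller
// records the generation it last acted on in status.observedGeneration, so a
// plain comparison can decide whether a re-submit is needed without keeping a
// copy of the previous serve config around.
func specChangedSinceLastReconcile(generation, observedGeneration int64) bool {
	return generation != observedGeneration
}

func main() {
	fmt.Println(specChangedSinceLastReconcile(3, 2)) // true: spec changed since last reconcile
	fmt.Println(specChangedSinceLastReconcile(3, 3)) // false: already handled
}
```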

@rueian (Contributor) commented:

I am not sure, but I am also afraid that 1000 is already too large, almost equivalent to the "memory leak" in the original issue. If re-applying serve configs is okay, we should probably shrink the number.

@MortalHappiness (Member, Author) commented on Dec 26, 2024:

I think it should be fine. ServeConfig is a YAML string, which is at most KB-level in size. Even with 1000 of them, it would only be at the MB level.
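
As a rough, back-of-the-envelope illustration (assuming around 10 KB per serve config YAML, which is an assumption rather than a measured figure): 1000 entries × 10 KB ≈ 10 MB of cached strings.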

@kevin85421 (Member) commented:

> Is it ok to re-apply serve configs

Re-applying serveConfig is fine.

Ray encourages users to run multiple Serve applications in a single RayCluster. It's hard for me to imagine a user managing more than 100 RayService CRs with a single KubeRay operator.

@kevin85421 merged commit efbd35e into ray-project:master on Dec 27, 2024
24 checks passed
This pull request closes: [RayService][Refactor] ServeConfigs memory leak (#2549)

3 participants