@pdettori AFAIK MultiKueue does not offer a similar capability. It's all manual: when a new cluster is added, the Kueue admin updates the cluster queues (global quota) in the management cluster.
The admin must also copy all local queues and namespaces to the new cluster, as well as all cluster queues (using WEC quota).
Although the proposed feature still requires the admin to manually edit the ClusterResourceClaim, the rest will be automated. Meaning, the controllers in the integration will apply new capacity to cluster queues as the admin specified in the ClusterResourceClaim. The controllers will also copy localqueues and create clusterqueues (with WEC quota) in wds1, and those objects will be copied to the new WEC.
The validating webhook could also validate admin quota assignments in ClusterResourceClaim to guard against overprovisioning.
Feature Description
Problem Statement
In Kubernetes environments using Kueue for resource management and KubeStellar for multi-cluster orchestration, there is a need for dynamic resource allocation as new clusters join the environment. Currently, there is no standardized way to assign resources from newly joined clusters to existing Kueue clusterqueues, taking into account different resource flavors and the dynamic nature of cluster resources. Typically, a Kueue administrator is tasked with discovering a new cluster and its capacity, and then manually distributing the new resources across Kueue's clusterqueues.
In a dynamic environment, cluster capacities are also subject to change: new nodes may be added, and nodes may be removed due to failure or other reasons. These capacity changes need to be watched, and the corresponding quota adjustments applied.
Proposed Solution
Introduce a new Custom Resource Definition (CRD) called ClusterResourceClaim. A Kubernetes controller will create one ClusterResourceClaim for each new cluster joining KubeStellar. The object will record the cluster's nominal resources (CPU, memory, GPU). The actual assignment of those resources to specific Kueue clusterqueues is left to the Kueue administrator, who manually edits the ClusterResourceClaim object; in response to such an edit, the controller will update the specified clusterqueues.
Design Details
API Changes
New CRD: ClusterResourceClaim
Data Model Changes
New ClusterResourceClaim object storing cluster resource information and assignments
Updates to ClusterQueue objects to reflect assigned resources from ClusterResourceClaim
Component Design
The integration is built from three cooperating controllers:
A claim controller that:
Watches for new ClusterResourceClaim objects
Validates resource assignments
Updates corresponding ClusterQueue objects
A capacity monitor that:
Watches ClusterMetrics objects for capacity changes
Updates ClusterResourceClaim status
Triggers ClusterQueue reconciliation when necessary
A quota reconciler that:
Watches for ClusterResourceClaim changes
Adjusts ClusterQueue resources based on claims and current availability
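The reconciliation step above can be sketched as a pure function: compute the per-clusterqueue quota implied by a claim's assignments, then diff it against the live queues to find which ones need patching. This is a simplification under assumed names; real code would use controller-runtime clients and resource.Quantity:

```go
package main

import "fmt"

// Assignment is one grant of quota from a claim to a clusterqueue.
type Assignment struct {
	ClusterQueue string
	Resource     string // e.g. "cpu", "memory", "nvidia.com/gpu"
	Amount       int64  // in base units, for simplicity
}

// desiredQuota totals, per clusterqueue, the quota assigned to it in a claim.
func desiredQuota(assignments []Assignment) map[string]map[string]int64 {
	out := map[string]map[string]int64{}
	for _, a := range assignments {
		if out[a.ClusterQueue] == nil {
			out[a.ClusterQueue] = map[string]int64{}
		}
		out[a.ClusterQueue][a.Resource] += a.Amount
	}
	return out
}

// staleQueues returns the queues whose live quota differs from the desired
// quota, i.e. the set of ClusterQueue updates the reconciler must issue.
func staleQueues(desired, live map[string]map[string]int64) []string {
	var stale []string
	for q, want := range desired {
		got := live[q]
		for r, n := range want {
			if got == nil || got[r] != n {
				stale = append(stale, q)
				break
			}
		}
	}
	return stale
}

func main() {
	desired := desiredQuota([]Assignment{
		{"team-a", "cpu", 32}, {"team-b", "cpu", 32},
	})
	live := map[string]map[string]int64{"team-a": {"cpu": 32}, "team-b": {"cpu": 16}}
	fmt.Println(staleQueues(desired, live)) // only team-b needs an update
}
```

Keeping the desired-state computation side-effect free makes the reconciler easy to unit test and keeps API writes limited to queues that actually changed.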
Technical Approach
Implement new controllers using the Kubernetes operator pattern
Implement a k8s validating webhook to guard against misconfigurations
Implement efficient watch mechanisms to minimize API server load
Performance Considerations
Optimize update frequency to balance responsiveness and API server load
Implement caching mechanisms to reduce API calls
Use informers for efficient object tracking
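One way to balance responsiveness against API-server load, as described above, is to coalesce rapid capacity changes and flush them on an interval, so a flapping node does not become a storm of ClusterQueue patches. A minimal sketch (the structure and names are illustrative, not a prescribed design):

```go
package main

import (
	"fmt"
	"sync"
)

// coalescer batches per-cluster capacity updates: many observed changes
// within one flush interval collapse into a single downstream write.
type coalescer struct {
	mu      sync.Mutex
	pending map[string]int64 // cluster -> latest observed CPU capacity
}

func newCoalescer() *coalescer {
	return &coalescer{pending: map[string]int64{}}
}

// observe records the latest capacity; older unflushed values are overwritten.
func (c *coalescer) observe(cluster string, cpu int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pending[cluster] = cpu
}

// flush drains the batch, handing one consolidated update per cluster to
// sink (e.g. a function that patches the ClusterResourceClaim status).
// A real controller would call flush from a ticker goroutine.
func (c *coalescer) flush(sink func(cluster string, cpu int64)) int {
	c.mu.Lock()
	batch := c.pending
	c.pending = map[string]int64{}
	c.mu.Unlock()
	for cl, cpu := range batch {
		sink(cl, cpu)
	}
	return len(batch)
}

func main() {
	c := newCoalescer()
	// Three rapid-fire observations for the same cluster...
	c.observe("wec2", 64)
	c.observe("wec2", 60)
	c.observe("wec2", 62)
	// ...become a single write carrying the latest value.
	n := c.flush(func(cl string, cpu int64) { fmt.Println(cl, cpu) })
	fmt.Println("writes:", n)
}
```

controller-runtime's rate-limited workqueues provide similar behavior out of the box; the sketch only illustrates the coalescing idea.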
Security Considerations
Implement RBAC to restrict access to ClusterResourceClaim objects
Validate resource assignments to prevent over-allocation
Ensure secure communication between controllers and API server
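The over-allocation check can be expressed as a pure function: the total quota handed out across clusterqueues must not exceed the cluster's nominal capacity for any resource. A sketch of the logic a validating webhook might run (names are illustrative; a real webhook would decode the AdmissionReview and compare resource.Quantity values):

```go
package main

import "fmt"

// validateAssignments rejects a ClusterResourceClaim edit in which the total
// quota assigned to clusterqueues exceeds the cluster's nominal capacity for
// any resource. nominal maps resource name to capacity; each entry of
// assigned is one clusterqueue's quota.
func validateAssignments(nominal map[string]int64, assigned []map[string]int64) error {
	total := map[string]int64{}
	for _, a := range assigned {
		for res, n := range a {
			total[res] += n
		}
	}
	for res, n := range total {
		limit, known := nominal[res]
		if !known || n > limit {
			return fmt.Errorf("resource %q overprovisioned: assigned %d, nominal capacity %d", res, n, limit)
		}
	}
	return nil
}

func main() {
	nominal := map[string]int64{"cpu": 64}
	ok := []map[string]int64{{"cpu": 32}, {"cpu": 32}}  // exactly at capacity
	bad := []map[string]int64{{"cpu": 48}, {"cpu": 32}} // 80 > 64
	fmt.Println(validateAssignments(nominal, ok))
	fmt.Println(validateAssignments(nominal, bad))
}
```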
Alternatives Considered
Open Questions
How to handle resource conflicts between ClusterResourceClaims?
What is the optimal update frequency for ClusterMetrics sync that stays responsive while avoiding thrashing?
Want to contribute?
Additional Context
No response