Enabling variable resource granularity across instances #3707
4 comments · 6 replies
-
In a ☕ hour call, @grondo pointed out that we could prototype this by creating an rc1 script for the sub-instance that reads in the R allocated by the parent instance, augments it, and then re-inserts it with all the new resources and finer-granularity info.
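To make that concrete, here's a minimal sketch (untested, with invented resource details) of what such an rc1 helper could look like: it reads an RFC 20 (R version 1) document on stdin, injects finer-grained children into each R_lite entry, and writes the augmented R to stdout. The RFC 20 layout is real; everything else is an assumption.

```python
#!/usr/bin/env python3
"""Hypothetical rc1 helper (a sketch, not a tested tool): read the R the
parent instance assigned us on stdin, inject finer-grained resources, and
emit the augmented R on stdout.  The RFC 20 (R version 1) layout is real;
the specific resources added below are invented for illustration."""
import json
import sys

R = json.load(sys.stdin)
assert R.get("version") == 1, "this sketch only understands R version 1"

for entry in R["execution"]["R_lite"]:
    children = entry.setdefault("children", {})
    # Assumption: the parent tracked only cores; add the gpus we know these
    # nodes actually have (a real script might probe hwloc here instead).
    children.setdefault("gpu", "0-3")

json.dump(R, sys.stdout, indent=2)
```

In the rc1 script this would sit between fetching the parent-assigned R (from `resource.R` in the KVS, if I have the key right) and handing the result to the resource/scheduler modules.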
-
From the point of view of Fluxion and its resource model, this can already be done by using hwloc readers. For example, the top-level Fluxion instance can use …
-
Why can't the top-level R include all the detail that is known about the system, while the scheduler only schedules nodes (or whatever)? Then the R allocated to a job includes all the cores, gpus, caches, etc. under the allocated nodes, and the scheduler at that level can be configured to schedule at a finer granularity. I guess I don't understand why it helps to have a simpler R at the top level, or where the missing info would be derived at the lower levels (hwloc, I guess?). Edit: I missed the coffee discussion, so sorry if I'm off base!
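For what it's worth, here's a hand-written fragment (a Python literal, not output from a real instance) of what that would look like under RFC 20: the job is granted whole nodes, but the R it receives keeps the per-node detail.

```python
# Hand-written illustration (not real instance output): a job is granted
# whole nodes 0-1, and the allocated R still carries every core and gpu
# under them, so the child scheduler needs no re-discovery of those.
R_alloc = {
    "version": 1,
    "execution": {
        "R_lite": [
            {"rank": "0-1", "children": {"core": "0-63", "gpu": "0-3"}},
        ],
        "nodelist": ["node[0-1]"],
        "starttime": 0.0,
        "expiration": 0.0,
    },
}
```

One caveat: R_lite in practice only carries core/gpu children, so caches and the like would have to ride along elsewhere, e.g. in the optional `scheduling` (JGF) key that Fluxion already uses, if I remember right.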
-
Another possible approach for reducing granularity in a parent scheduler would be to store resources that do not need to be scheduled individually at that level in the aggregate instead of as individual vertices. For instance, on a node-scheduled cluster the top-level resource graph could be composed of nodes with a core count instead of each core being represented as an individual child. This would allow resource requests that call for a count of cores independent of the number of requested nodes, and would presumably reduce the complexity of the match (I'm making a big assumption here, since I'm not familiar with the internals of Fluxion). Then, as above, all subinstances could augment each node with individual cores for more granular scheduling. This approach could work for other resource types as well (a count or "pool" of resources is expanded into individual vertices at a lower level).
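To illustrate (with a toy data structure, not Fluxion's actual internals), the expansion step at the lower level could be as simple as:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Toy graph vertex (not Fluxion's representation): resources that don't
# need individual scheduling at this level live in `pools` as counts.
@dataclass
class Vertex:
    type: str
    id: int
    pools: Dict[str, int] = field(default_factory=dict)
    children: List["Vertex"] = field(default_factory=list)

def expand(node: Vertex) -> Vertex:
    """Expand each pooled count into individual child vertices, as a
    sub-instance would before loading its own finer-grained scheduler."""
    for rtype, count in node.pools.items():
        node.children.extend(Vertex(rtype, i) for i in range(count))
    node.pools = {}
    return node

# Parent view: 4 vertices total.  Child view after expand(): 4 + 4*64.
cluster = [Vertex("node", r, pools={"core": 64}) for r in range(4)]
fine = [expand(n) for n in cluster]
```

A top-level match for "any 128 cores" would then just sum pool counts across node vertices instead of walking 128 individual core vertices.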
-
In the original Flux workshop paper, there is a line that states: "A parent scheduler schedules at coarse granularity over a large collection of resources and leases different resource subsets to its children schedulers". Maybe I'm stretching that statement beyond its original intent, but regardless, I think it would be cool/beneficial if higher-level Flux instances could schedule and manage coarser-granularity views of resources, while lower-level Flux instances schedule and manage resources at finer granularities.
As one motivating example, maybe a top-level scheduler sees only nodes, the network topology, and cpus/gpus as direct children of the node. Once an allocation is made by the top-level scheduler, additional resource granularity is injected into the allocated resource graph, exposing both new resources and new relationships between resources. For example, it could add L1-L3 caches, socket-level info (including which cores/gpus are on which socket), memory, node-local storage, and NIC/network switch resources. This would allow the top-level scheduler to focus solely on optimizing node placement within the network and ensuring the nested instance gets the proper number of cores/gpus (while also reaping the speedups that come from having fewer resources in the graph), and it would allow the sub-instance to make memory-hierarchy-aware pinning decisions and avoid oversubscribing network resources with too many simultaneous parallel launches.
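A toy sketch of that injection step (all numbers and type names invented; a real implementation would presumably pull the layout from hwloc rather than assume one):

```python
from typing import Dict, List

def inject_hierarchy(node_id: int, cores: List[int],
                     cores_per_socket: int = 8) -> Dict:
    """Regroup a flat core list into socket -> L3 -> core subtrees.
    Purely illustrative: a real implementation would read the actual
    topology (e.g. via hwloc) rather than assume a fixed layout."""
    sockets = []
    for s, start in enumerate(range(0, len(cores), cores_per_socket)):
        chunk = cores[start:start + cores_per_socket]
        sockets.append({
            "type": "socket", "id": s,
            "children": [{
                "type": "l3cache", "id": s,
                "children": [{"type": "core", "id": c} for c in chunk],
            }],
        })
    return {"type": "node", "id": node_id, "children": sockets}

# The parent only ever saw node 0 with 16 cores; after injection the
# sub-instance sees 2 sockets x 1 L3 x 8 cores and can pin accordingly.
tree = inject_hierarchy(0, list(range(16)))
```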