Enabling variable resource granularity across instances #3707

SteVwonder · 2021-06-09T22:31:38Z

SteVwonder
Jun 9, 2021
Maintainer

In the original Flux workshop paper, there is a line that states: "A parent scheduler schedules at coarse granularity over a large collection of resources and leases different resource subsets to its children schedulers". Maybe I'm stretching that statement beyond its original intent, but regardless, I think it would be cool/beneficial if higher-level Flux instances could schedule and manage coarser-granularity views of resources, while lower-level Flux instances schedule and manage resources at finer granularities.

As one motivating example, maybe a top-level scheduler see only nodes, the network topology, and cpus/gpus as direct children of the node. Once an allocation is made from the top-level scheduler, additional resource granularity is injected into the allocated resource graph, exposing both new resources and new relationships between resources. For example, it could add L1-L3 caches, socket-level info (and which cores/gpus are on which socket), memory, node-local storage, NIC/network switch resources. This would allow the top-level scheduler to focus solely on optimizing node placement within the network and ensuring the nested instance gets the proper number of cores/gpus (i.e., while reaping the speedups associated with fewer resources in the graph), and it would also allow the sub-instance to make "memory hierarchy aware" pinning decisions and avoid oversubscribing network resources with too many simultaneous parallel launches.

SteVwonder · 2021-06-09T22:32:31Z

SteVwonder
Jun 9, 2021
Maintainer Author

In a ☕ hour call, @grondo pointed out that we could prototype this by creating an rc1 script for the sub-instance that reads in the R allocated by the parent instance, augments it, and then re-inserts it with all the new resources and finer granularity info.

3 replies

dongahn Jun 9, 2021
Maintainer

This is essentially what Fluxion does -- inserting JGF section into R.

SteVwonder Jun 9, 2021
Maintainer Author

Yeah, I think the idea is to leverage that JGF section: de-serialize it, augment it, and then re-serialize it. And do that within a prototype script called by an RC1 script in the sub-instance. If that works well, then we could take the lessons learned from that and use them to modify Fluxion (e.g., maybe a new augment plugin is added to resource that does a walk on the allocated sub-graph and can modify it before serialization).

dongahn Jun 9, 2021
Maintainer

An approach similar to flux ion-R might be useful to transform an R.

dongahn · 2021-06-09T22:41:17Z

dongahn
Jun 9, 2021
Maintainer

From the point of view of Fluxion and its resource model, this can already be done by using hwloc readers. For example, the top level Fluxion instance can use load-allowlist=node and a nested instance can use load-allowlist=node,socket,core,gpu (or you can even add l1, l2...).

0 replies

garlick · 2021-06-09T23:01:12Z

garlick
Jun 9, 2021
Maintainer

Why can't the top level R include all the detail that is known about the system, while the scheduler only schedules nodes (or whatever)? Then R allocated to a job includes all the cores, gpus, caches etc under the allocated nodes, and the scheduler at that level can be configured to schedule at a finer granularity?

I guess I don't understand why it helps to have a simpler R at the top level, or where the missing info would be derived at the lower levels (hwloc I guess?)

Edit: I missed the coffee discussion so sorry if I'm off base!

3 replies

dongahn Jun 9, 2021
Maintainer

At least at the conceptual level, I think the original thought was including every details at the top level could be a big scalability challenge envisioning for center-wide scheduling. I also missed the coffee hour discussion so I don't know what drove this conversation. Probably multi-tiered storage support...

SteVwonder Jun 10, 2021
Maintainer Author

I think the original thought was including every details at the top level could be a big scalability challenge envisioning for center-wide scheduling

Yeah, exactly. The typically user probably won't care about L1-L3 caches or switch/NIC resources, but there are power users that will. We could certainly include all of those resources at the top-level, but then everyone pays the cost via increased graph traversal costs in Fluxion. If we could dynamically inject those resources into the nested instances, then you get the best of both worlds (scalability at the top level and fine-grained resources in nested instances). All of this is with the caveat that jumping straight to this is a pre-mature optimization, and should only be employed if the scalability at the top-level is demonstrated to be insufficient.

where the missing info would be derived at the lower levels (hwloc I guess?)

Yeah, hwloc could be a source of info for the caches. For the big ATS systems, we could also have custom script that does the insertion of vendor-specific resources.

garlick Jun 10, 2021
Maintainer

Thanks - I had assumed Fluxion was already performant here.

That technique might reduce growth of the content cache too, come to think of it.

grondo · 2021-06-10T15:56:28Z

grondo
Jun 10, 2021
Maintainer

Another possible approach for reducing granularity in a parent scheduler would be to store resources that do not need to be scheduled individually at that level in the aggregate instead of as individual vertices. For instance, on a node-scheduled cluster the top-level resource graph could be composed of nodes with a core count instead of each core being represented as an individual child. This would allow resource requests that call for a count of cores independent of a number of requested nodes, but would presumably reduce complexity of the match (I'm making a big assumption here since I'm not familiar with internals of Fluxion).

Then, as above, all subinstances could augment each node with individual cores for more granular scheduling.

This approach could work for other resource types (count or "pool" of resources is expanded to individual vertices at a lower level).

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enabling variable resource granularity across instances #3707

{{title}}

Replies: 4 comments 6 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Enabling variable resource granularity across instances #3707

SteVwonder Jun 9, 2021 Maintainer

Replies: 4 comments · 6 replies

SteVwonder Jun 9, 2021 Maintainer Author

dongahn Jun 9, 2021 Maintainer

SteVwonder Jun 9, 2021 Maintainer Author

dongahn Jun 9, 2021 Maintainer

dongahn Jun 9, 2021 Maintainer

garlick Jun 9, 2021 Maintainer

dongahn Jun 9, 2021 Maintainer

SteVwonder Jun 10, 2021 Maintainer Author

garlick Jun 10, 2021 Maintainer

grondo Jun 10, 2021 Maintainer

SteVwonder
Jun 9, 2021
Maintainer

Replies: 4 comments 6 replies

SteVwonder
Jun 9, 2021
Maintainer Author

dongahn Jun 9, 2021
Maintainer

SteVwonder Jun 9, 2021
Maintainer Author

dongahn Jun 9, 2021
Maintainer

dongahn
Jun 9, 2021
Maintainer

garlick
Jun 9, 2021
Maintainer

dongahn Jun 9, 2021
Maintainer

SteVwonder Jun 10, 2021
Maintainer Author

garlick Jun 10, 2021
Maintainer

grondo
Jun 10, 2021
Maintainer