memory-constrained application scheduling #550

Open
SteVwonder opened this issue Dec 19, 2019 · 8 comments

@SteVwonder
Member

@jameshcorbett mentioned that they are running ensembles of memory-constrained applications (i.e., the memory capacity of the node is the limiting resource, not the number of cores).

Supporting this use-case in flux will require a few modifications:

  • Need a way to produce a jobspec with memory as a child of node and a sibling of core (see the sketch after this list)
    • Easy to specify with flux run (e.g., flux run --slot-shape (core[4],mem[24g])). Potentially can be added to flux mini run (will require modification of Jobspec V1 and some support within flux-core, even if just rejecting jobs of this type)
  • The resource module will need a way to produce a resource graph with multiple memory chunks per node (since a single resource graph-node can only be allocated to a single job currently).
    • May require new parameters to resource and modification of the hwloc parser
  • (Stretch use-case) need to consider the case where heterogeneous jobspecs are submitted (jobspecs with core+memory requirements and jobspecs with only cores). In this case, a node's memory may be completely allocated by the core+memory jobspecs, and then core-only jobspecs are scheduled on the node too (implicitly oversubscribing the node's memory). Thanks @grondo for pointing this out
    • Potentially can be solved at scheduler level by requesting a default amount of memory per task/core/node/jobspec when not explicitly specified
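
As a rough sketch of the first bullet (the field names, nesting, and GB-chunk unit here are illustrative only, not a finalized Jobspec V1 schema), the resources section of such a jobspec could nest memory next to core inside the slot:

# Hypothetical resources section with memory as a sibling of core under a slot.
resources = [
    {
        "type": "node",
        "count": 1,
        "with": [
            {
                "type": "slot",
                "count": 1,
                "label": "default",
                "with": [
                    {"type": "core", "count": 4},
                    {"type": "memory", "count": 24},  # e.g., 24 chunks of 1GB
                ],
            }
        ],
    }
]
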
@dongahn
Member

dongahn commented Dec 20, 2019

Thanks @SteVwonder for summarizing our discussion yesterday.

(Stretch use-case) need to consider the case where heterogeneous jobspecs are submitted (jobspecs with core+memory requirements and jobspecs with only cores). In this case, a node's memory may be completely allocated by the core+memory jobspecs, and then core-only jobspecs are scheduled on the node too (implicitly oversubscribing the node's memory). Thanks @grondo for pointing this out
Potentially can be solved at scheduler level by requesting a default amount of memory per task/core/node/jobspec when not explicitly specified.

This is an excellent point. We can either allocate (total memory / core count) or just a minimum amount of memory (e.g., 256MB). My initial thought is that the minimum works better?
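
A rough sketch of the two default policies (the function name, policy names, and numbers below are purely illustrative, not part of any Flux configuration):

# Hypothetical sketch of the two defaults discussed above.
def default_memory_mb_per_core(node_memory_mb, cores_per_node, policy="minimum"):
    """Memory to attach to each core when a jobspec omits a memory request."""
    if policy == "proportional":
        # each core gets an even share of the node's total memory
        return node_memory_mb // cores_per_node
    # "minimum": attach only a small floor (e.g., 256MB), leaving the rest free
    return 256

# e.g., on a 192GB (196608MB), 48-core node:
#   proportional -> 4096MB per core; minimum -> 256MB per core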

Still, I am not sure if this value should come from a scheduler config file or from a jobspec as generated by flux-core. With the latter, this becomes more flexible?

BTW, is there a way for the execution service to enforce this limit without having to use cgroup?

@grondo
Contributor

grondo commented Dec 20, 2019

Still, I am not sure if this value should come from a scheduler config file or from a jobspec as generated by flux-core. With the latter, this becomes more flexible?

If the scheduler doesn't enforce it then a user could craft a jobspec by hand which allocates a core with no memory, or other nonsensical request.

BTW, is there a way for the execution service to enforce this limit without having to use cgroup?

There is no strict enforcement, but a memory policy can be set similarly to CPU affinity, which can keep pages allocated to the process at least local to a NUMA node or nodes; e.g., see the numactl(8) --membind option.
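
For illustration only, a launcher or job shell could prepend numactl to steer page allocations toward the NUMA nodes backing the granted memory; the wrapper below is a sketch, not part of Flux's execution service, and it provides placement rather than a hard capacity limit:

import subprocess

def run_with_membind(cmd, numa_nodes):
    """Run cmd with page allocations restricted to the given NUMA nodes.

    Best-effort placement via numactl(8) --membind; unlike a cgroup limit,
    it does not cap how much memory the process can consume.
    """
    bind = ",".join(str(n) for n in numa_nodes)
    return subprocess.run(["numactl", f"--membind={bind}"] + list(cmd))

# e.g., run_with_membind(["./my_app"], numa_nodes=[0])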

@dongahn
Member

dongahn commented Dec 20, 2019

If the scheduler doesn't enforce it then a user could craft a jobspec by hand which allocates a core with no memory, or other nonsensical request.

I was thinking in terms of a tool like flux jobspec filling in some of the missing requests like this. If this doesn't make sense, I'd be happy to have this come from a sched config.

@grondo
Contributor

grondo commented Dec 20, 2019

I was thinking in terms of a tool like flux jobspec filling in some of the missing requests like this. If this doesn't make sense, I'd be happy to have this come from a sched config.

Yeah, that could be done. However, there may be multiple jobspec generators (right now there are flux jobspec, flux mini, and flux run). Though eventually they may all use the same Python module to do their generation, at the end of the day a raw jobspec file, which could be drafted by hand by savvy users, can be submitted to the ingest module.

We could have the validator(s) reject job requests that do not include a memory component, but we wouldn't be able to amend the jobspec at that point, because it has already been signed by the user.

It just might be a little cleaner to have "defaults" set in the scheduler.
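
For example, a frontend validator could perform a check along these lines (a hypothetical sketch, not an existing flux-core validator; it assumes a parsed resources list like the one sketched in the opening comment):

def every_core_slot_has_memory(resources):
    # Walk the resource tree and reject any slot that requests cores but no memory.
    for vertex in resources:
        children = vertex.get("with", [])
        if vertex.get("type") == "slot":
            child_types = {c.get("type") for c in children}
            if "core" in child_types and "memory" not in child_types:
                return False
        if not every_core_slot_has_memory(children):
            return False
    return True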

@grondo
Contributor

grondo commented Dec 20, 2019

Other quick thoughts:

  • will the resource module organize cores + memory by socket/package/numa node so that memory allocated is "closest" to allocated cores?
  • If a request for a single core + memory exceeds the memory available of a socket, would you allow memory allocation from a remote socket?
  • If you are splitting memory into distinct "chunks" e.g. 1G, will jobspec be required to specify memory as a number of chunks of 1G memory, or could we use the units syntax (e.g. 1024M)?
  • Could users ask for less than the size of a chunk, e.g. 256M (knowing they'd be allocated a minimum)?

@dongahn
Member

dongahn commented Dec 20, 2019

  • will the resource module organize cores + memory by socket/package/numa node so that memory allocated is "closest" to allocated cores?

We discussed this yesterday. This can be done when the granularity of the resource graph contains a "socket" layer. Our current RC1 script sets the hwloc whitelist to remove the socket layer, so this won't work by default. I told @jameshcorbett that he can set an environment variable to specify an alternative hwloc whitelist. So yes, in that case this will be supported.

If a request for a single core + memory exceeds the memory available of a socket, would you allow memory allocation from a remote socket?

With the socket layer in the graph, I can't do this. When a jobspec says:

slot[1]->core[2]
       ->memory[32]

The traverser will implicitly match this to a socket first at the highest level and see how many cores and memories are underneath it.

Without the socket layer in the graph, this will get the memory from another socket (without explicit scheduling control on sockets). But then you won't get the memory affinity.

@dongahn
Member

dongahn commented Dec 20, 2019

  • If you are splitting memory into distinct "chunks" e.g. 1G, will jobspec be required to specify memory as a number of chunks of 1G memory, or could we use the units syntax (e.g. 1024M)?

I will have to go back and look at the code. If a more user-friendly unit specification is needed, this can of course be added easily.

Could users ask for less than the size of a chunk, e.g. 256M (knowing they'd be allocated a minimum)

They can ask for it. But they will get the whole chunk.
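
In other words, requests are rounded up to whole chunks; a minimal sketch of the rounding, assuming 1G chunks:

import math

CHUNK_MB = 1024  # assumed chunk size (1G)

def chunks_allocated(request_mb):
    # Requests are rounded up to whole chunks, so a 256M request still
    # occupies a full 1G chunk.
    return max(1, math.ceil(request_mb / CHUNK_MB))

# chunks_allocated(256) -> 1, chunks_allocated(1536) -> 2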

@dongahn
Member

dongahn commented Mar 10, 2022

@cmisale is adding memory scheduling as part of KubeFlux. If she sees issues with the current memory scheduling support, she can either use this ticket or create a new one.
