memory-constrained application scheduling #550
Thanks @SteVwonder for summarizing our discussion yesterday.
This is an excellent point. Still, I am not sure whether this value should come from a scheduler config file or from a jobspec as generated by flux-core. With the latter, wouldn't this be more flexible? BTW, is there a way for the execution service to enforce this limit without having to use cgroups?
If the scheduler doesn't enforce it, then a user could craft a jobspec by hand which allocates a core with no memory, or makes some other nonsensical request.
There is no strict enforcement, but a memory policy can be set, similar to CPU affinity, which can keep pages allocated to the processes at least local to the NUMA node(s).
I was thinking in terms of a tool like flux jobspec filling in some of the missing requests like this. If this doesn't make sense, I'd be happy to make this come from a sched config.
Yeah, that could be done. However, there may be multiple jobspec generators. We could have the validator(s) reject job requests that do not fulfill a memory component, but we wouldn't be able to amend the jobspec at that point, because it has already been signed by the user. It just might be a little cleaner to have "defaults" set in the scheduler.
Other quick thoughts:
We discussed this yesterday. This can be done when the granularity of the resource graph contains a "socket" layer. Our current RC1 script sets an hwloc whitelist that removes the socket layer, so this won't work. I told @jameshcorbett that he can set an environment variable to specify an alternative hwloc whitelist. So yes, in that case this will be supported.
With the socket layer in the graph, I can't do this. When a jobspec says:
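The following is only a sketch of such a request, in canonical-jobspec-style YAML; the counts and the `unit` field are illustrative assumptions rather than the exact snippet from the discussion:

```yaml
# Sketch only: a slot asking for cores plus memory, with no socket named.
resources:
  - type: node
    count: 1
    with:
      - type: slot
        count: 1
        label: default
        with:
          - type: core
            count: 2
          - type: memory
            count: 8
            unit: GB     # the unit field is an assumption
```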
The traverser will implicitly match this to a socket first at the highest level and see how many cores and how much memory are underneath it. Without the socket layer in the graph, this may end up getting the memory from another socket (without explicit scheduling control on sockets), but then you won't get the memory affinity.
I will have to go back and look at the code. If a more user-friendly unit specification is needed, it can be added easily, of course.
They can ask for it. But they will get the whole chunk.
@cmisale is adding memory scheduling as part of KubeFlux. If she sees issues with the current memory scheduling support, she can either use this ticket or create a new one.
@jameshcorbett mentioned that they are running ensembles of memory-constrained applications (i.e., the memory capacity of the node is the limiting resource, not the number of cores).
Supporting this use-case in flux will require a few modifications:
- `memory` will need to be supported as a child of `node` and a sibling of `core`, and be requestable via `flux run` (i.e., `flux run --slot-shape (core[4],mem[24g])`; see the sketch below). Potentially this can also be added to `flux mini run` (will require modification of Jobspec V1 and some support within flux-core, even if just rejecting jobs of this type).
- The `resource` module will need a way to produce a resource graph with multiple `memory` chunks per node (since a single resource graph-node can only be allocated to a single job currently). This will require support within `resource` and modification of the hwloc parser.
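To make the proposed slot shape concrete, here is a rough sketch of what `flux run --slot-shape (core[4],mem[24g])` might translate to in a Jobspec-V1-style request. This is an illustration only: the `memory` type, the `unit` field, and the expansion itself are assumptions about the proposed feature, not existing flux-core behavior.

```yaml
# Hypothetical expansion of --slot-shape (core[4],mem[24g]); illustration only.
resources:
  - type: node
    count: 1
    with:
      - type: slot
        count: 1
        label: default
        with:
          - type: core          # four cores per slot
            count: 4
          - type: memory        # memory as a sibling of core (not in Jobspec V1 today)
            count: 24
            unit: GB            # unit handling is an assumption
tasks:
  - command: ["app"]            # placeholder application
    slot: default
    count:
      per_slot: 1
```

Written this way, each slot carries both its core count and its memory requirement, so the scheduler could pack slots onto a node until either cores or memory run out, which matches the memory-constrained use case described above.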