plugin: enforce max resource limits across an association's running jobs #559

Open
6 of 9 tasks
cmoussa1 opened this issue Jan 7, 2025 · 2 comments
Labels
feature tracking (Tracking issue for larger feature made up of smaller issues); plugin (related to the multi-factor priority plugin)

Comments

cmoussa1 commented Jan 7, 2025

Creating a tracking issue here to outline the idea for enforcing a max number of resources used across an association's set of running jobs. I already have a couple of open issues similar to this but it would probably be useful to re-organize some thoughts after some helpful offline discussion.

The need here is to be able to limit how many resources (e.g., nodes, cores) an association can have at any given time across all of their running jobs. As noted in flux-config-policy(5), the limit checks take place before the scheduler sees the request because [the plugin] does not have detailed resource information.

So, it seems a realistic solution here would be to configure a max resources limit that is both a max nodes and a max cores limit. The priority plugin should be able to keep track of both when a job enters RUN state by looking at the jobspec. It can increment/decrement current node and core counts per-association across all of their running jobs. Then, when a submitted job enters DEPEND state, the job's size can be checked to see if adding its resources to the association's currently allocated resources would put them over the max (i.e., over either the nodes or the cores limit). If so, the job can be held until a currently running job exits.
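A minimal sketch of the bookkeeping described above, written in Python purely for illustration (the plugin itself is not Python). The `Association` class and its method names are hypothetical; only the `max_nodes`, `max_cores`, `cur_nodes`, and `cur_cores` field names come from this issue.

```python
from dataclasses import dataclass


@dataclass
class Association:
    """Hypothetical per-association record; field names follow the issue text."""
    max_nodes: int
    max_cores: int
    cur_nodes: int = 0
    cur_cores: int = 0

    def would_exceed(self, nnodes: int, ncores: int) -> bool:
        # A job is over the limit if EITHER the node or the core max is passed.
        return (self.cur_nodes + nnodes > self.max_nodes
                or self.cur_cores + ncores > self.max_cores)

    def job_started(self, nnodes: int, ncores: int) -> None:
        # Called when a job enters RUN: add its size to the running totals.
        self.cur_nodes += nnodes
        self.cur_cores += ncores

    def job_finished(self, nnodes: int, ncores: int) -> None:
        # Called when a running job exits: give its resources back.
        self.cur_nodes -= nnodes
        self.cur_cores -= ncores
```

With `max_nodes=4` and `max_cores=16`, an association running one 2-node, 8-core job could start a second job of the same size, but a 3-node job would be held.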

There are a couple of prerequisites to get this kind of support into flux-accounting:

Tasks

  1. database merge-when-passing new feature plugin
  2. merge-when-passing new feature plugin
  3. new feature plugin

I've done some playing around today with a rough sketch and it looks like the first four tasks listed are pretty straightforward; copying over the jj code from flux-core, I'm able to extract job size counts and add/subtract them from an association's cur_nodes and cur_cores attributes as jobs enter RUN and INACTIVE states.

I'll plan to start opening incremental PRs to add this kind of support into flux-accounting.

@cmoussa1 cmoussa1 added feature tracking Tracking issue for larger feature made up of smaller issues plugin related to the multi-factor priority plugin labels Jan 7, 2025
cmoussa1 commented Jan 8, 2025

Had a helpful offline discussion with @ryanday36 about a possible implementation plan for how this might work in the priority plugin:

The priority plugin will store max_nodes, max_cores, cur_nodes, and cur_cores per-association in its internal map. This information can be queried with flux jobtap query to see where an association stands at any given time.

When a job proceeds to job.state.run, its resource information will be extracted from jobspec. It will use the jj code to count both nnodes and ncores and increment the association's cur_nodes and cur_cores count accordingly.
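As a rough illustration of what counting nnodes and ncores from a jobspec involves, here is a simplified Python walker. It is not the jj code from flux-core; it assumes a V1-style `resources` list in which every `count` is a plain integer, and the function name `count_resources` is hypothetical.

```python
def count_resources(resources, multiplier=1, counts=None):
    """Walk a nested jobspec `resources` list and total up nodes and cores.

    Each level's count is multiplied by the counts of its ancestors, so
    2 nodes x 1 slot x 4 cores yields 8 cores.
    """
    if counts is None:
        counts = {"node": 0, "core": 0}
    for res in resources:
        total = multiplier * res["count"]
        if res["type"] in counts:
            counts[res["type"]] += total
        # Descend into children (e.g. node -> slot -> core).
        count_resources(res.get("with", []), total, counts)
    return counts


jobspec = {
    "resources": [
        {"type": "node", "count": 2, "with": [
            {"type": "slot", "count": 1, "label": "task", "with": [
                {"type": "core", "count": 4},
            ]},
        ]},
    ],
}
print(count_resources(jobspec["resources"]))  # {'node': 2, 'core': 8}
```

In this sketch, a jobspec that requests only slots and cores reports zero nodes, so such a job would only ever be checked against the core limit.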

As jobs get submitted and are running, subsequently submitted jobs will have their resource counts checked in job.state.depend. If the resource counts (nnodes or ncores) would put the association over either their max_nodes or max_cores limit, the job will have an accounting-specific dependency added to it describing that the association has hit their max resources limit, and the job will be held.

Jobs will be held until a currently running job transitions to INACTIVE. When the running job transitions to INACTIVE, its resources will again be extracted from jobspec and decremented from the association's cur_nodes and cur_cores count. Then, when the association's cur_running_jobs count is checked to ensure that they are allowed to have a running job at this moment, the held job's resource count (I need to see if I can retrieve a jobspec in a jobtap plugin with just the jobid??) will be checked to ensure that the association would not be over their max. If not, the job can be released and proceed to RUN.

@cmoussa1

I am getting closer to being able to actually enforce dependencies on jobs where an association is already at their max resource limits, but while coming up with some test cases I ran into a bit of a hiccup with one particular scenario.

To summarize, here is the current workflow I've written when a job enters job.state.depend:

job.state.depend

  • A check is made to ensure the user is not at their max running jobs limit. If they are, a max-running-jobs-user-limit dependency is added, and the callback exits.
  • If the user is not at their max running jobs limit, a follow-up check is made to ensure the user is not at their resource limit. The jobspec for the current job is unpacked and the resources are checked, and if the proposed job would put the user over their max, a max-resource-user-limit dependency is added.

In this callback, I have it written so that at most one of these dependencies is added to a job.
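The two checks above could be sketched roughly as follows. This is an illustrative Python model, not the plugin code: the `user` dictionary and its `cur_running_jobs` / `max_running_jobs` keys are assumptions, while the two dependency names are taken from this comment.

```python
MAX_RUN_DEP = "max-running-jobs-user-limit"
MAX_RES_DEP = "max-resource-user-limit"


def check_depend(user, nnodes, ncores):
    """Return the single dependency to add to the job, or None to let it through."""
    # First check: running jobs limit. If hit, stop here (early exit).
    if user["cur_running_jobs"] >= user["max_running_jobs"]:
        return MAX_RUN_DEP
    # Second check: would this job push the user past either resource max?
    if (user["cur_nodes"] + nnodes > user["max_nodes"]
            or user["cur_cores"] + ncores > user["max_cores"]):
        return MAX_RES_DEP
    return None
```

Because of the early exit, a job that would hit both limits only ever carries the running-jobs dependency in this model.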

job.state.inactive

  • If there is a held job in the user's held_jobs queue, its jobspec is unpacked and its resources are checked. If the held job would not put the user over their max resources limit AND would not put the user over their max running jobs limit, the plugin attempts to remove both dependencies from the job to release it (even though only one dependency was added to the job).
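A rough Python model of this release step, under the assumption that the finished job's resources have already been subtracted from the user's totals. The dictionary layout (`deps`, `nnodes`, `ncores`, `cur_running_jobs`, `max_running_jobs`) is hypothetical and only stands in for the plugin's internal state.

```python
MAX_RUN_DEP = "max-running-jobs-user-limit"
MAX_RES_DEP = "max-resource-user-limit"


def try_release(user, held_jobs):
    """Release the first held job if it satisfies both limits; return its id or None."""
    if not held_jobs:
        return None
    job = held_jobs[0]
    under_run_limit = user["cur_running_jobs"] < user["max_running_jobs"]
    fits_resources = (user["cur_nodes"] + job["nnodes"] <= user["max_nodes"]
                      and user["cur_cores"] + job["ncores"] <= user["max_cores"])
    if not (under_run_limit and fits_resources):
        return None  # job stays held with whatever dependency it already has
    held_jobs.pop(0)
    # Attempt to remove both dependencies; discard() is a no-op for the one
    # that was never added.
    job["deps"].discard(MAX_RUN_DEP)
    job["deps"].discard(MAX_RES_DEP)
    return job["id"]
```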

The scenario where I am a bit stuck is the case where a job held due to the max running jobs limit would still not satisfy the resource limit for the user, so the job continues to be held. If I have just one dependency added to the job (e.g., max-running-jobs-user-limit in this example), the dependency message would be confusing, because the user is technically under their max running jobs limit but not under their max resource limit.

I have two immediate thoughts on how to maybe restructure the dependencies in the plugin to get around this, but would be open to any feedback or advice:

  1. Don't only add one dependency on a job if it would put the user at both their max running jobs limit and their max resources limit. If it would hit both limits, add both dependencies. Then, in job.state.inactive, this callback would have to be reworked to look to remove one or both dependencies if the job has them (is it possible to remove multiple dependencies from a job in the same callback? Would I have to perform a lookup to see which set of dependencies the job has in the first place before trying to remove them?).
  2. Combine the two limits (max running jobs and max resources) into one, more general max-accounting-user limit that could be applied to both. Then, the callback for job.state.inactive could look at a held job and if it meets all limit conditions, then just this limit can be removed. I'm not sure how much of a fan I am of this because it is not immediately clear to the user why their job is being held. They would probably have to query the jobtap plugin and look for their userid and their corresponding limits to see why the job is being held.
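To make option 1 more concrete, here is one possible shape for it, sketched in Python. It assumes the plugin keeps its own set of dependency names per held job, so no lookup against the job is needed before removing them; that bookkeeping, and every name below other than the two dependency strings, is an assumption rather than something the plugin does today.

```python
RUN_DEP = "max-running-jobs-user-limit"
RES_DEP = "max-resource-user-limit"


def limits_hit(user, nnodes, ncores):
    """Return the set of dependencies that apply to a job of this size right now."""
    hit = set()
    if user["cur_running_jobs"] >= user["max_running_jobs"]:
        hit.add(RUN_DEP)
    if (user["cur_nodes"] + nnodes > user["max_nodes"]
            or user["cur_cores"] + ncores > user["max_cores"]):
        hit.add(RES_DEP)
    return hit


def update_held_job(job_deps, user, nnodes, ncores):
    """Drop dependencies whose limit no longer applies; return the ones removed.

    The job is released only once job_deps is empty.
    """
    removed = job_deps - limits_hit(user, nnodes, ncores)
    job_deps.difference_update(removed)
    return removed
```

In this model, a job that hit both limits at submission keeps max-resource-user-limit after the running-jobs limit clears, so the remaining dependency names the limit that is still blocking it.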
