job.state.priority: remove raising exception when no aux item found #568

Merged
merged 2 commits into flux-framework:master from the job.state.priority-fix branch on Feb 4, 2025

@cmoussa1 cmoussa1 commented Jan 28, 2025

Scenario

A job held in PRIORITY and waiting to be allocated resources gets reprioritized and sent back to job.state.priority or job.priority.get after a restart (which clears its aux items), when the plugin is already loaded and has already seen the job.

Problem

The multi-factor priority plugin raises an exception on a job in job.state.priority when it cannot find the aux item containing the association information for the job. However, when a system instance is restarted, the aux items set on a job are cleared. When jobs are then reprioritized and sent back to job.state.priority or job.priority.get, the plugin raises an exception on the job because it can't unpack the aux item containing the flux-accounting information associated with that job. See the eventlog below for an example:

...
{"timestamp":1737671575.5923743,"name":"priority","context":{"priority":19668}}
{"timestamp":1737736372.7067723,"name":"priority","context":{"priority":19620}}
{"timestamp":1737757970.5174642,"name":"priority","context":{"priority":19743}}
{"timestamp":1737801156.5343764,"name":"priority","context":{"priority":19865}}
{"timestamp":1737822769.2824628,"name":"priority","context":{"priority":19160}}
{"timestamp":1737998525.3946743,"name":"exception","context":{"type":"mf_priority","severity":0,"note":"internal error: bank info is missing","userid":767}}
{"timestamp":1737998525.3947248,"name":"clean"}
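
For anyone checking whether their own jobs hit this, the eventlog above is one JSON object per line, so it can be scanned for the exception. The following is only a rough sketch: the function name `find_exceptions` is made up for illustration and is not part of flux-core or flux-accounting, and the sample text is just the last three lines of the eventlog above.

```python
import json

# Last three lines of the eventlog shown above, copied verbatim.
EVENTLOG = """\
{"timestamp":1737822769.2824628,"name":"priority","context":{"priority":19160}}
{"timestamp":1737998525.3946743,"name":"exception","context":{"type":"mf_priority","severity":0,"note":"internal error: bank info is missing","userid":767}}
{"timestamp":1737998525.3947248,"name":"clean"}
"""


def find_exceptions(eventlog_text, exc_type="mf_priority"):
    """Return the context of every exception event of the given type.

    Hypothetical helper for illustration only.
    """
    found = []
    for line in eventlog_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        # Events such as "clean" carry no context at all.
        context = event.get("context", {})
        if event.get("name") == "exception" and context.get("type") == exc_type:
            found.append(context)
    return found


if __name__ == "__main__":
    for ctx in find_exceptions(EVENTLOG):
        print(ctx["userid"], ctx["note"])
```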

This PR removes raising a job exception in job.state.priority in the case where the job does not have an aux item for the accounting information associated with the job. Instead, it attempts to perform another lookup for the flux-accounting information for the association that submitted the job, which is already what it does when a job is submitted before the plugin is loaded with any flux-accounting information.
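
The change in control flow described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the plugin's actual C++ code: `Association`, `Job`, `association_for_priority`, the dictionary keyed by (userid, bank), and the bank name are all assumptions made up for the example. What happens when the second lookup also fails is left to the caller here, since the description does not spell that out.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Association:
    """Hypothetical stand-in for the plugin's per-association accounting data."""
    userid: int
    bank: str
    fairshare: float


@dataclass
class Job:
    """Hypothetical job record; `aux` plays the role of the job's aux item."""
    userid: int
    bank: str
    aux: Optional[Association] = None


def association_for_priority(
    job: Job, associations: Dict[Tuple[int, str], Association]
) -> Optional[Association]:
    """Return the association to use when calculating the job's priority.

    Old behavior (as described): a missing aux item raised a job exception.
    New behavior (sketched): look the association up again and re-attach it.
    """
    if job.aux is not None:
        return job.aux
    # The aux item is gone, e.g. cleared by an instance restart.
    # Try the plugin's own accounting data before giving up.
    assoc = associations.get((job.userid, job.bank))
    if assoc is not None:
        job.aux = assoc  # analogous to re-attaching with aux_set
    return assoc


if __name__ == "__main__":
    known = {(767, "bankA"): Association(767, "bankA", 0.5)}
    pending = Job(userid=767, bank="bankA")  # aux item was cleared
    print(association_for_priority(pending, known))
```

The point of the sketch is only that a cleared aux item is recoverable whenever the plugin still holds accounting data for the association that submitted the job.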

Fixes #575

@cmoussa1 cmoussa1 added the bug-fix (A proposal for something that isn't working) and plugin (related to the multi-factor priority plugin) labels Jan 28, 2025
@cmoussa1 cmoussa1 changed the title from "[WIP] job.state.priority: remove raising exception when no aux item found" to "job.state.priority: remove raising exception when no aux item found" Jan 31, 2025
@cmoussa1 cmoussa1 marked this pull request as ready for review January 31, 2025 18:30
@cmoussa1 cmoussa1 force-pushed the job.state.priority-fix branch from fcd00d2 to c176db3 Compare January 31, 2025 19:28

cmoussa1 commented Jan 31, 2025

@grondo I believe I've found the source of those job failures I was telling you about a few days ago with the jobs that failed to load the Association aux item after a system instance restart (I wrote up a description with my thoughts in #575). I'm in the process of trying to add a sharness test for this case but I can't seem to figure out how to simulate this scenario. I've added a [WIP] commit to this PR to get me started, but I could use some advice on how to actually execute the restart and send the pending job back to PRIORITY with the aux items cleared. Do you have any advice or are there any examples that I can look at in flux-core for this?

grondo commented Jan 31, 2025

You can take a look at t3200-instance-restart.t which uses a series of flux start commands to simulate restarts.
Otherwise, you could try just reloading the job-manager module and restarting the flux-accounting Python service to see whether that reproduces the original issue.

@cmoussa1 cmoussa1 force-pushed the job.state.priority-fix branch from f3bf982 to a4d3361 Compare January 31, 2025 22:41
@cmoussa1

OK, I believe I have a working test file now to reproduce the behavior I was describing in #575, so I've dropped [WIP] from that commit.

@cmoussa1 cmoussa1 requested a review from grondo January 31, 2025 22:48

@grondo grondo left a comment

This LGTM and seems to be a nice solution! Just one comment suggestion inline.

// The flux-accounting information associated with this job could not
// be found by the time this job got to job.state.priority. Attempt to
// look up the association again and attach its information to the job
// with aux_set
Contributor

Suggestion: Amend the comment to indicate why the aux item may not be set or the bank info is missing. This may help a future developer. E.g. "This could be due to either ..."

Member Author

Good suggestion, thanks! I've just force-pushed up an amendment to this comment as well as referenced RFC 21 for the job state diagram; the image in that RFC is super helpful.

Problem: the multi-factor priority plugin raises an exception on a job
in job.state.priority when it cannot find the aux item containing the
association information for the job. However, when a system instance is
restarted with pending jobs that are reprioritized, the aux items are
cleared from the job. So, the job will have an exception raised on it
even if the plugin has accounting data for the association that
submitted the job.

Remove raising a job exception in job.state.priority in the case where
the job does not have an aux item for the accounting information
associated with the job. Instead, attempt to perform another lookup for
the flux-accounting information for the association that submitted the
job.

Problem: There are no tests for pending jobs that are sent back to
PRIORITY while they are pending.

Add some tests.
@cmoussa1 cmoussa1 force-pushed the job.state.priority-fix branch from a4d3361 to d8106fc Compare February 4, 2025 16:57

cmoussa1 commented Feb 4, 2025

Thanks for reviewing this @grondo! I'll set MWP (merge-when-passing) here

@mergify mergify bot merged commit f753499 into flux-framework:master Feb 4, 2025
13 checks passed
Labels
bug-fix: A proposal for something that isn't working
merge-when-passing
plugin: related to the multi-factor priority plugin
Development

Successfully merging this pull request may close these issues.

plugin: jobs in SCHED state lose their Association aux item on flux-restart
2 participants