-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
job.state.priority
: remove raising exception when no aux item found
#568
job.state.priority
: remove raising exception when no aux item found
#568
Conversation
job.state.priority
: remove raising exception when no aux item foundjob.state.priority
: remove raising exception when no aux item found
fcd00d2
to
c176db3
Compare
@grondo I believe I've found the source of those job failures I was telling you about a few days ago with the jobs that failed to load the |
You can take a look at |
f3bf982
to
a4d3361
Compare
OK, I believe I have a working test file now to reproduce the behavior I was describing in #575, so I've dropped [WIP] from that commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM and seems to be a nice solution! Just one comment suggestion inline.
src/plugins/mf_priority.cpp
Outdated
// The flux-accounting information associated with this job could not | ||
// be found by the time this job got to job.state.priority. Attempt to | ||
// look up the association again and attach its information to the job | ||
// with aux_set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: Amend the comment to indicate why the aux item may not be set or the bank info is missing. This may help a future developer. E.g. "This could be due to either ..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion, thanks! I've just force-pushed up an amendment to this comment as well as referenced RFC 21 for the job state diagram; the image in that RFC is super helpful.
Problem: the multi-factor priority plugin raises an exception on a job in job.state.priority when it cannot find the aux item containing the association information for the job. However, when a system instance is restarted with pending jobs that are reprioritized, the aux items are cleared from the job. So, the job will have an exception raised on it even if the plugin has accounting data for the association that submitted the job. Remove raising a job exception in job.state.priority in the case where the job does not have an aux item for the accounting information associated with the job. Instead, attempt to perform another lookup for the flux-accounting information for the association that submitted the job.
Problem: There are no tests for pending jobs that are sent back to PRIORITY while they are pending. Add some tests.
a4d3361
to
d8106fc
Compare
Thanks for reviewing this @grondo! I'll set MWP here |
Scenario
A job held in PRIORITY and waiting to be allocated resources gets reprioritized and sent back to
job.state.priority
orjob.priority.get
after a restart (clearing its aux items) when a plugin is already loaded and seen the job.Problem
The multi-factor priority plugin raises an exception on a job in
job.state.priority
when it cannot find the aux item containing the association information for the job. However, when a system instance is reloaded, the aux items set on a job are cleared. When jobs are reprioritized and are sent back tojob.state.priority
orjob.priority.get
, the plugin will raise an exception on the job because it can't unpack the aux item containing the flux-accounting information associated with that job. See the below eventlog for an example:This PR removes raising a job exception in
job.state.priority
in the case where the job does not have an aux item for the accounting information associated with the job. Instead, it attempts to perform another lookup for the flux-accounting information for the association that submitted the job, which is already what it does when a job is submitted before the plugin is loaded with any flux-accounting information.Fixes #575