Apparently under heavy load, LSF's bjobs output does not always report all job IDs. When a job in the running state is no longer reported, it is classified as finished/exited. However, if the job later reappears in the JSON output, its state is probably not changed back. We observed a job missing from 3 queries within 10 minutes, although it was reported both before and after this interval.
Suggested Solution:
Only report jobs as exited (maybe COMPLETED_UNKNOWN) if they are explicitly marked as EXITED or DONE. A job that disappears from the list without such a mark always indicates that something is wrong (unless the system is configured to prune exited jobs from the list after a short time, e.g. 2 minutes).
Make it configurable how long to wait for lost jobs and how long LSF maintains the list (i.e. after what time jobs are certainly no longer expected to appear in it).
Add a job state such as "missing" or "not-reported", or alternatively a timestamp on the "running" state recording when the job was last seen in the list.
Warn if a job is lost from the list while it is still expected to be visible. This is always an abnormal situation.
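To make the suggestions above concrete, here is a minimal sketch of how the tracking could look. It is not taken from the existing code base: `JobTracker`, `TrackedJob`, `grace_seconds` and the shape of the `reported` mapping (job ID to state string, as parsed from the bjobs JSON output) are all hypothetical names chosen for illustration, and the terminal states are assumed to be the `DONE` and `EXIT` values that bjobs prints.

```python
import time
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumption: states that LSF reports explicitly for finished jobs.
TERMINAL_STATES = {"DONE", "EXIT"}
# Hypothetical fallback state, named as proposed in this issue.
UNKNOWN_STATE = "COMPLETED_UNKNOWN"


@dataclass
class TrackedJob:
    job_id: str
    state: str            # last state reported by LSF (or the fallback)
    last_seen: float      # time of the last poll that listed the job
    missing: bool = False  # True while the job is absent from the list


class JobTracker:
    """Keeps job states across polls and tolerates temporary gaps."""

    def __init__(self, grace_seconds: float = 900.0) -> None:
        # Configurable: how long a lost job may stay "missing".
        self.grace_seconds = grace_seconds
        self.jobs: Dict[str, TrackedJob] = {}

    def update(self, reported: Dict[str, str],
               now: Optional[float] = None) -> List[str]:
        """Apply one poll result and return warning messages."""
        now = time.time() if now is None else now
        warnings: List[str] = []

        # Jobs present in this poll: refresh state, clear the missing flag.
        for job_id, state in reported.items():
            job = self.jobs.get(job_id)
            if job is None:
                self.jobs[job_id] = TrackedJob(job_id, state, now)
                continue
            if job.missing:
                warnings.append(f"job {job_id} reappeared as {state}")
            job.state, job.last_seen, job.missing = state, now, False

        # Jobs absent from this poll: mark missing, give up after the grace period.
        for job_id, job in self.jobs.items():
            if job_id in reported:
                continue
            if job.state in TERMINAL_STATES or job.state == UNKNOWN_STATE:
                continue
            if not job.missing:
                job.missing = True
                warnings.append(f"job {job_id} missing from bjobs output")
            if now - job.last_seen > self.grace_seconds:
                job.state = UNKNOWN_STATE
                warnings.append(f"job {job_id} not seen for "
                                f"{now - job.last_seen:.0f}s, giving up")
        return warnings
```

With this approach a job that drops out of a few polls keeps its last known state and only carries a `missing` flag. It falls back to `COMPLETED_UNKNOWN` only after the grace period, and a later reappearance restores the state that LSF reports.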