Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[jobs] fix possible managed job status issues #4637

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

cg505
Copy link
Collaborator

@cg505 cg505 commented Feb 3, 2025

Descriptions from the commits:

[jobs] don't set end_at for CANCELLING
CANCELLING is no longer a terminal status - we shouldn't set end_at.

[jobs] don't require job to be CANCELLING to become CANCELLED
It's possible that the job status erroneously gets set to some other value (e.g. RUNNING) by another process after getting set to CANCELLING - we should avoid this case, but if we end up here, we should still transition to CANCELLED.

[jobs] avoid status transitions that don't make sense
I saw a bug where a job that was CANCELLING got updated by RUNNING by a different process. Avoid state transitions like this that obviously violate the expected state machine.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • smoke test managed jobs
  • quicktest-core

cg505 added 3 commits February 3, 2025 15:34
CANCELLING is no longer a terminal status - we shouldn't set end_at.
It's possible that the job status erroneously gets set to some other
value (e.g. RUNNING) by another process after getting set to
CANCELLING - we should avoid this case, but if we end up here, we
should still transition to CANCELLED.
I saw a bug where a job that was CANCELLING got updated by RUNNING by
a different process. Avoid state transitions like this that obviously
violate the expected state machine.
@cg505 cg505 requested a review from Michaelvll February 3, 2025 23:59
@cg505
Copy link
Collaborator Author

cg505 commented Feb 3, 2025

/quicktest-core

@cg505
Copy link
Collaborator Author

cg505 commented Feb 3, 2025

/smoke-test --managed-jobs

@cg505
Copy link
Collaborator Author

cg505 commented Feb 4, 2025

/smoke-test --managed-jobs

cg505 added a commit to cg505/skypilot that referenced this pull request Feb 4, 2025
Usually, a job should always transition to a terminal status before it
is set to DONE. However, if a bug makes that fail for some
reason (e.g. the issues fixed in skypilot-org#4637), we should still catch this
state.
@Michaelvll
Copy link
Collaborator

I saw a bug where a job that was CANCELLING got updated by RUNNING by a different process. Avoid state transitions like this that obviously violate the expected state machine.

Is it possible to just avoid the job from being set to RUNNING when it is CANCELLING, instead of getting all the fixes here? Otherwise, more unknown behaviors may just accumulate, causing the complexity to grow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants