-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[YUNIKORN-2504] Support canonical labels for queue/applicationId in scheduler #54
Conversation
005af49
to
6057c13
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #54 +/- ##
==========================================
+ Coverage 67.21% 67.29% +0.08%
==========================================
Files 70 70
Lines 7640 7706 +66
==========================================
+ Hits 5135 5186 +51
- Misses 2287 2297 +10
- Partials 218 223 +5 ☔ View full report in Codecov by Sentry. |
ed22fdd
to
3bf0269
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make it easier for the reviewer to read, add some notes.
- Can start from 'task.sanityCheckBeforeScheduling()' in applications.go
- Can check e2e tests in basic_scheduling_test and recover_and_restart first.
log.Log(log.ShimCacheApplication).Info("new pod status", zap.String("status", string(pod.Status.Phase))) | ||
} | ||
} | ||
|
||
func (app *Application) handleFailApplicationEvent(errMsg string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to failTaskPodWithReasonAndMsg() to task.go
change
- podCopy := task.GetTaskPod().DeepCopy()
to - podCopy := task.pod.DeepCopy()
to prevent deadlock when task state machine is handling TaskRejected event.
// if the task is not ready for scheduling, we keep it in New state | ||
// if the task pod is bounded and have conflicting metadata, we move the task to Rejected state | ||
err, rejectTask := task.sanityCheckBeforeScheduling() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perform a sanity check before move this task to Pending state.
Before this PR, sanity check only check PVC's readiness
- If sanity check passed, move task state from 'New' -> 'Pending'
- If sanity check failed, task state remains in 'New' (Will be checked again in next schedule cycle)
After this PR (Sanity check check PVC and Pod Metadata)
- if sanity check passed, 'New' -> 'Pending'
- if sanity check fails due to PVC -> 'New' (No change)
- if sanity check fails due to a unbound pod with inconsistent metadata (AppID/Label), move task state from 'New' to 'Rejected'
Design decision: Only reject unbound pods because we don't want to failed existing running pod after restart YK.
constants.CanonicalLabelApplicationID: app.GetApplicationID(), | ||
constants.CanonicalLabelQueueName: app.GetQueue(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note:
We can directly use canonical representation for placeholder here.
The newer version shim allows legacy and canonical representation metadata coexists.
func (task *Task) postTaskRejected(reason string) { | ||
// if task is rejected because of conflicting metadata, we should fail the pod with reason | ||
if strings.Contains(reason, constants.TaskPodInconsistMetadataFailure) { | ||
task.failTaskPodWithReasonAndMsg(constants.TaskRejectedFailure, reason) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fail the pod if the task's reject reason is inconsistent metadata.
@@ -104,12 +104,21 @@ func IsAssignedPod(pod *v1.Pod) bool { | |||
} | |||
|
|||
func GetQueueNameFromPod(pod *v1.Pod) string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The order to get 'queue' from pod:
Before this PR:
- Label: constants.LabelQueueName
- Annotation: constants.AnnotationQueueName
- Default: constants.ApplicationDefaultQueue
After this PR
1. Label: constants.CanonicalLabelQueueName (New)
2. Label: constants.LabelQueueName
3. Annotation: constants.AnnotationQueueName
4. Default: constants.ApplicationDefaultQueue
@@ -154,15 +163,26 @@ func GetApplicationIDFromPod(pod *v1.Pod) string { | |||
} | |||
} | |||
|
|||
// Application ID can be defined in annotation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The order to get 'app-id' from pod:
Before this PR:
- Annotation: constants.AnnotationApplicationID
- Label: constants.LabelApplicationID
- Label: constants.SparkLabelAppID
After this PR
- Label: constants.CanonicalLabelApplicationID (New)
- Label: constants.LabelApplicationID
- Label: constants.SparkLabelAppID
- Annotation: constants.AnnotationApplicationID(Move to the last, label first)
3bf0269
to
1d4979b
Compare
1d4979b
to
197e0ca
Compare
What is this PR for?
Support canonical Queue/ApplicationId labels in Pod: (The existing label are kept without deprecation.)
Check config consistency in sanityCheckBeforeScheduling(). Reject task if the task pod is not bound and
Not bound and
Conflict appId detect
What type of PR is it?
Todos
What is the Jira issue?
[YUNIKORN-2] Gang scheduling interface parameters
How should this be tested?
Screenshots (if appropriate)
Questions: