[YUNIKORN-2504] Support canonical labels for queue/applicationId in scheduler #54

Closed
wants to merge 1 commit into from

Conversation

@chenyulin0719 chenyulin0719 commented Jun 18, 2024

What is this PR for?

Support canonical Queue/ApplicationId labels on Pods. (The existing labels are kept and not deprecated.)

  • yunikorn.apache.org/app-id (New, Canonical Label)
  • yunikorn.apache.org/queue (New, Canonical Label)

Check config consistency in sanityCheckBeforeScheduling(). Reject the task if both of the following hold:

  1. The task pod is not bound, and

  2. Conflicting appId values are detected.

What type of PR is it?

  • - Bug Fix
  • - Improvement
  • - Feature
  • - Documentation
  • - Hot Fix
  • - Refactoring

Todos

  • - Task

What is the Jira issue?

How should this be tested?

Screenshots (if appropriate)

Questions:

  • - The licenses files need update.
  • - There is breaking changes for older versions.
  • - It needs documentation.

codecov bot commented Jun 18, 2024

Codecov Report

Attention: Patch coverage is 75.72816% with 25 lines in your changes missing coverage. Please review.

Project coverage is 67.29%. Comparing base (e59162d) to head (3bf0269).

Current head 3bf0269 differs from pull request most recent head 197e0ca

Please upload reports for the commit 197e0ca to get more accurate results.

Files                      Patch %   Lines
pkg/cache/task.go          75.00%    6 Missing and 5 partials ⚠️
pkg/cache/application.go   28.57%    10 Missing ⚠️
pkg/cache/task_state.go    50.00%    2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master      #54      +/-   ##
==========================================
+ Coverage   67.21%   67.29%   +0.08%     
==========================================
  Files          70       70              
  Lines        7640     7706      +66     
==========================================
+ Hits         5135     5186      +51     
- Misses       2287     2297      +10     
- Partials      218      223       +5     


@chenyulin0719 chenyulin0719 force-pushed the YUNIKORN-2504 branch 6 times, most recently from ed22fdd to 3bf0269 Compare June 21, 2024 11:42
@chenyulin0719 chenyulin0719 left a comment

Some notes to make this easier to review:

  1. Start from 'task.sanityCheckBeforeScheduling()' in applications.go.
  2. Check the e2e tests in basic_scheduling_test and recover_and_restart first.

log.Log(log.ShimCacheApplication).Info("new pod status", zap.String("status", string(pod.Status.Phase)))
}
}

func (app *Application) handleFailApplicationEvent(errMsg string) {

Moved to failTaskPodWithReasonAndMsg() in task.go.

Changed

  • podCopy := task.GetTaskPod().DeepCopy()
    to
  • podCopy := task.pod.DeepCopy()

to prevent a deadlock when the task state machine is handling the TaskRejected event.
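
The deadlock risk can be illustrated with a minimal sketch. It assumes that the getter acquires the task's lock and that the TaskRejected callback already runs with that lock held; neither detail is shown in this conversation. The types `pod` and `task` and the methods `getTaskPod`, `deepCopy`, and `onRejected` are hypothetical stand-ins, not the shim's real code.

```go
package main

import (
	"fmt"
	"sync"
)

// pod and task are hypothetical stand-ins, not the shim's real types.
type pod struct{ name string }

// deepCopy returns a distinct copy of the pod.
func (p *pod) deepCopy() *pod {
	c := *p
	return &c
}

type task struct {
	lock sync.RWMutex
	pod  *pod
}

// getTaskPod mimics a getter that takes the read lock (an assumption here).
func (t *task) getTaskPod() *pod {
	t.lock.RLock()
	defer t.lock.RUnlock()
	return t.pod
}

// onRejected mimics an event callback that runs while the write lock is held.
func (t *task) onRejected() *pod {
	t.lock.Lock()
	defer t.lock.Unlock()
	// Calling t.getTaskPod() here would block forever: sync.RWMutex is not
	// reentrant, so RLock waits on the Lock this goroutine already holds.
	return t.pod.deepCopy() // direct field access needs no second lock
}

func main() {
	tk := &task{pod: &pod{name: "demo-pod"}}
	fmt.Println(tk.onRejected().name)
	fmt.Println(tk.getTaskPod().name)
}
```

Reading the field directly is only safe because the caller already holds the lock; outside such a callback the locking getter remains the right choice.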

// if the task is not ready for scheduling, we keep it in New state
// if the task pod is unbound and has conflicting metadata, we move the task to Rejected state
err, rejectTask := task.sanityCheckBeforeScheduling()


Perform a sanity check before moving this task to the Pending state.

Before this PR, the sanity check only checked PVC readiness:

  • If the sanity check passed, the task state moved from 'New' -> 'Pending'.
  • If the sanity check failed, the task state remained 'New' (checked again in the next scheduling cycle).

After this PR, the sanity check covers both PVC readiness and Pod metadata:

  • If the sanity check passes, 'New' -> 'Pending'.
  • If the sanity check fails due to a PVC, the task stays in 'New' (no change).
  • If the sanity check fails due to an unbound pod with inconsistent metadata (AppID/Label), the task state moves from 'New' to 'Rejected'.

Design decision: Only reject unbound pods, because we don't want to fail existing running pods after YK restarts.
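
The transitions described above can be summarized in a small sketch. The types `taskState` and `checkResult` and the function `nextState` are hypothetical and only model the decision; the real check inspects the pod and its volumes. Evaluating the metadata conflict before the PVC check is an assumption of this sketch, as the order is not stated here.

```go
package main

import "fmt"

// taskState is a hypothetical stand-in for the shim's task states.
type taskState string

const (
	stateNew      taskState = "New"
	statePending  taskState = "Pending"
	stateRejected taskState = "Rejected"
)

// checkResult holds hypothetical, precomputed inputs to the decision.
type checkResult struct {
	pvcReady           bool // all PVCs used by the pod are ready
	podBound           bool // the pod is already bound to a node
	metadataConsistent bool // app ID / queue metadata do not conflict
}

// nextState returns the state a task in 'New' moves to after the sanity check.
func nextState(r checkResult) taskState {
	// Only unbound pods are rejected, so pods that were already running
	// before a scheduler restart are never failed by this check.
	if !r.podBound && !r.metadataConsistent {
		return stateRejected
	}
	// A PVC that is not ready keeps the task in 'New' for the next cycle.
	if !r.pvcReady {
		return stateNew
	}
	return statePending
}

func main() {
	fmt.Println(nextState(checkResult{pvcReady: true, podBound: false, metadataConsistent: true}))
	fmt.Println(nextState(checkResult{pvcReady: false, podBound: false, metadataConsistent: true}))
	fmt.Println(nextState(checkResult{pvcReady: true, podBound: false, metadataConsistent: false}))
	fmt.Println(nextState(checkResult{pvcReady: true, podBound: true, metadataConsistent: false}))
}
```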

Comment on lines +93 to +94
constants.CanonicalLabelApplicationID: app.GetApplicationID(),
constants.CanonicalLabelQueueName: app.GetQueue(),

Note:
We can use the canonical representation directly for placeholders here.
Newer versions of the shim allow legacy and canonical metadata representations to coexist.

func (task *Task) postTaskRejected(reason string) {
// if task is rejected because of conflicting metadata, we should fail the pod with reason
if strings.Contains(reason, constants.TaskPodInconsistMetadataFailure) {
task.failTaskPodWithReasonAndMsg(constants.TaskRejectedFailure, reason)

Fail the pod if the task's reject reason is inconsistent metadata.

@@ -104,12 +104,21 @@ func IsAssignedPod(pod *v1.Pod) bool {
}

func GetQueueNameFromPod(pod *v1.Pod) string {

The order to get 'queue' from pod:

Before this PR:

  1. Label: constants.LabelQueueName
  2. Annotation: constants.AnnotationQueueName
  3. Default: constants.ApplicationDefaultQueue

After this PR:

  1. Label: constants.CanonicalLabelQueueName (New)
  2. Label: constants.LabelQueueName
  3. Annotation: constants.AnnotationQueueName
  4. Default: constants.ApplicationDefaultQueue
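
The new lookup order for the queue can be sketched as follows. This is an illustration, not the PR's implementation: `podMeta` and `queueName` are hypothetical, plain maps replace the Kubernetes pod type, and every key except the canonical label is a placeholder value. Treating an empty value as missing is also an assumption.

```go
package main

import "fmt"

const (
	canonicalLabelQueue = "yunikorn.apache.org/queue" // from the PR description
	legacyLabelQueue    = "legacy-queue-label"        // placeholder key
	annotationQueue     = "legacy-queue-annotation"   // placeholder key
	defaultQueue        = "default-queue"             // placeholder value
)

// podMeta is a hypothetical stand-in for the pod's metadata.
type podMeta struct {
	labels      map[string]string
	annotations map[string]string
}

// queueName returns the first non-empty value in the order:
// canonical label, legacy label, annotation, default.
func queueName(p podMeta) string {
	if v := p.labels[canonicalLabelQueue]; v != "" {
		return v
	}
	if v := p.labels[legacyLabelQueue]; v != "" {
		return v
	}
	if v := p.annotations[annotationQueue]; v != "" {
		return v
	}
	return defaultQueue
}

func main() {
	p := podMeta{
		labels: map[string]string{
			canonicalLabelQueue: "root.canonical",
			legacyLabelQueue:    "root.legacy",
		},
	}
	fmt.Println(queueName(p))
	fmt.Println(queueName(podMeta{}))
}
```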

@@ -154,15 +163,26 @@ func GetApplicationIDFromPod(pod *v1.Pod) string {
}
}

// Application ID can be defined in annotation

The order to get 'app-id' from pod:

Before this PR:

  1. Annotation: constants.AnnotationApplicationID
  2. Label: constants.LabelApplicationID
  3. Label: constants.SparkLabelAppID

After this PR

  1. Label: constants.CanonicalLabelApplicationID (New)
  2. Label: constants.LabelApplicationID
  3. Label: constants.SparkLabelAppID
  4. Annotation: constants.AnnotationApplicationID (moved to last; labels now take precedence)
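
The reordered app-id lookup can be sketched with an ordered table of sources. The names `source`, `appIDSources`, and `applicationID` are hypothetical, and every key except the canonical label is a placeholder. Returning an empty string when nothing matches is an assumption, since the fallback is not covered in this comment.

```go
package main

import "fmt"

// source describes one place to look for the application ID.
type source struct {
	fromLabel bool   // true: look in labels; false: look in annotations
	key       string // metadata key to read
}

// appIDSources lists the lookup order after this PR: labels first,
// the annotation last. Only the canonical key comes from the PR text.
var appIDSources = []source{
	{fromLabel: true, key: "yunikorn.apache.org/app-id"}, // canonical label
	{fromLabel: true, key: "legacy-app-id-label"},        // placeholder key
	{fromLabel: true, key: "spark-app-id-label"},         // placeholder key
	{fromLabel: false, key: "legacy-app-id-annotation"},  // placeholder key
}

// applicationID returns the first non-empty value found in source order.
func applicationID(labels, annotations map[string]string) string {
	for _, s := range appIDSources {
		m := annotations
		if s.fromLabel {
			m = labels
		}
		if v := m[s.key]; v != "" {
			return v
		}
	}
	return "" // fallback behavior is outside this sketch
}

func main() {
	labels := map[string]string{"legacy-app-id-label": "app-from-label"}
	annotations := map[string]string{"legacy-app-id-annotation": "app-from-annotation"}
	fmt.Println(applicationID(labels, annotations))
}
```

With the values in `main`, the label wins over the annotation, which is the behavior change this comment describes.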
