Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add kep-162 Colocated Placement #168

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

vie-serendipity
Copy link
Contributor

What type of PR is this?

/kind documentation

What this PR does / why we need it

Support colocated placement of multiple pod groups

Which issue(s) this PR fixes

Fixes #162

Special notes for your reviewer

I don't know how to upload image to aws and I just push image to repo. This should be fixed later.

Does this PR introduce a user-facing change?


@k8s-ci-robot k8s-ci-robot added kind/documentation Categorizes issue or PR as related to documentation. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 1, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vie-serendipity
Once this PR has been reviewed and has the lgtm label, please assign kerthcet for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link
Contributor

Welcome @vie-serendipity!

It looks like this is your first PR to kubernetes-sigs/lws 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/lws has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 1, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @vie-serendipity. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 1, 2024
@vie-serendipity vie-serendipity changed the title add kep-165 Colocated Placement add kep-162 Colocated Placement Jul 2, 2024
@liurupeng
Copy link
Collaborator

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 2, 2024
@vie-serendipity
Copy link
Contributor Author

any suggestions?

@kerthcet
Copy link
Contributor

kerthcet commented Jul 8, 2024

Will visit this API later for bandwidth, sorry for that.

@liurupeng
Copy link
Collaborator

@vie-serendipity sry for the late reply, I was on vacation last week, checking it now

know that this has succeeded?
-->

- Add a new way of topology scheduling and allow multiple pod groups land on the same domain
Copy link
Collaborator

@liurupeng liurupeng Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for one pod group (one multi-host inference replica), we need to colocate the pods, but for different pod groups, why we need to put them in the same domain since different replicas won't communicate with each other?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might confuse you with my expression, I mean allowing all leaders to schedule freely across any topology, not restricting one pod group to a topology just like exclusive placement.


### SubGroup Policy Support
Colocated placement can support subgroup policy as well.
Compared with [exclusive support for subgroup policy](https://github.com/kubernetes-sigs/lws/assets/86417275/ff9fc93d-c738-4c09-abc8-50a7b16d49df), the workflow is almost identical, as follows:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the subgroup feature will only have one leader pod for all the pods in the same topology

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the subgroup policy, it seems like it just needs to ensure that the leader and workers are in the same topology, so I think colocated placement should meet its requirements.

@liurupeng
Copy link
Collaborator

overall lgtm, but one more part to consider is the existing lws workload that use exclusive placement. Do we want to keep two set of APIs to use different placement strategies?

@liurupeng
Copy link
Collaborator

@ahg-g and @kerthcet any objection to this feature?

@liurupeng
Copy link
Collaborator

@ahg-g and @kerthcet any objection to this feature?

Abdullah is on vacation, @kerthcet could you review this as well. In general, I think allowing configuring more customized scheduling semantics will be very helpful

@ahg-g
Copy link
Contributor

ahg-g commented Jul 17, 2024

I will take a look after I come back from vacation, first week of August.

@k8s-ci-robot
Copy link
Contributor

@vie-serendipity: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-lws-verify-main f9a1557 link true /test pull-lws-verify-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@liurupeng
Copy link
Collaborator

@ahg-g could you help check this one when got some time? thanks!


const (
ExclusiveTopologyPlacementPolicyType TopologyPlacementPolicyType = "Exclusive"
ColocatedTopologyPlacementPolicyType TopologyPlacementPolicyType = "Colocated"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will we implement this?

If we implement this using pod-affinity (i.e., same as exclusive placement, but removing the anti-affinity term), we will run into race conditions: consider the case where two groups try to schedule on the same domain at the same time but there is capacity only for one, there is a possibility that both will partially schedule on the domain, and none will schedule fully.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That‘s a tricky case I didn't consider before. It seems like there‘s already an issue when scheduling multiple pod groups with limited resources. Refer to #167
Co-scheduling plugin conflicts with LeaderReady startup policy.
In this situation, I think it has to rely on the autoscaler or adding more resources manully.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that a deadlock with the plugin only happens when the leaderready startup policy is adopted. We can do gang scheduling with the pod group declared in the plugin. This can fix the problem you mentioned.
It's worth noting that colocated placement policy and leaderready startup policy can't be set together, and we have to validate in the webhook.
But I'm not sure whether relying on an out-of-tree scheduler plugin is appropriate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is more related to scheduling rather than workload itself? This is an exist problem for all distributed ones. Unless we have a queueing system ourself just like WaitForPodsReady in kueue.

@kerthcet
Copy link
Contributor

kerthcet commented Aug 6, 2024

Sorry @liurupeng I leave the community for couple of days for some reasons. I'll back to this ASAP.

Copy link
Contributor

@kerthcet kerthcet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have a build-in feature gate system as kubernetes or it will impact our api version a lot, like this one, I have no idea whether we're on the right path because we don't have much feedbacks, without feature gates, we'll have to bump the apiversion to avoid backwards compatibilities. WDYT? @ahg-g @liurupeng

TopologyPlacementPolicy TopologyPlacementPolicy `json:"topologyPlacementPolicy",omitempty`
}

type TopologyPlacementPolicy struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe GroupPlacementPolicy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.


type LeaderWorkerTemplate struct {
// +optional
TopologyPlacementPolicy TopologyPlacementPolicy `json:"topologyPlacementPolicy",omitempty`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this a slice in case we'll append more than one policies to the lws, and the policies should be ANDed?


const (
ExclusiveTopologyPlacementPolicyType TopologyPlacementPolicyType = "Exclusive"
ColocatedTopologyPlacementPolicyType TopologyPlacementPolicyType = "Colocated"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is more related to scheduling rather than workload itself? This is an exist problem for all distributed ones. Unless we have a queueing system ourself just like WaitForPodsReady in kueue.

// +kubebuilder:default=None
// +kubebuilder:validation=Enum={ExclusiveTopologyPlacementPolicyType,ColocatedTopologyPlacementPolicyType,NoneTopologyPlacementPolicyType}
type TopologyPlacementPolicyType `json:"type"`
topologyKey *string `json:"topologyKey",omitempty`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why a pointer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the policy is none, topologyKey should be nil, otherwise, it should be explicitly declared. So I think pointer is more suitable.

### User Stories (Optional)

#### Story 1
Each pod group (leader and workers) is colocated in one domain(i.e., more than one group could land on the same domain)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you think of a real case, this one is somehow a more theoretical one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my understanding, assuming a nodepool is a topology domain, each nodepool contains many nodes with the same GPU(easy to scale). So, it makes sense to schedule more than one group to a nodepool naturally.
And it's reasonable to schedule a group of pods to a single nodepool since they have the same GPU, ensuring equal inference speeds.

type TopologyPlacementPolicy struct {
// +kubebuilder:default=None
// +kubebuilder:validation=Enum={ExclusiveTopologyPlacementPolicyType,ColocatedTopologyPlacementPolicyType,NoneTopologyPlacementPolicyType}
type TopologyPlacementPolicyType `json:"type"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add the constraint mode like prefer or required? But this is not that vital, we can add later.

@liurupeng
Copy link
Collaborator

@vie-serendipity we are planning the release v0.5.0, do you want to proceed with this KEP? thanks!

@ahg-g
Copy link
Contributor

ahg-g commented Sep 14, 2024

@vie-serendipity we are planning the release v0.5.0, do you want to proceed with this KEP? thanks!

The proposal must have a way to address #168 (comment)

@vie-serendipity
Copy link
Contributor Author

@liurupeng I'm glad to push this work forward.

@vie-serendipity
Copy link
Contributor Author

@ahg-g I agree with kerthcet. I think it's more about scheduling and it's not the scope lws should cover. How do you think?

@ahg-g
Copy link
Contributor

ahg-g commented Oct 5, 2024

@ahg-g I agree with kerthcet. I think it's more about scheduling and it's not the scope lws should cover. How do you think?

But we can't offer an API that is not backed by an implementation.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/documentation Categorizes issue or PR as related to documentation. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enhance scheduling capabilities for a group of pods
6 participants