Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2170: Add Kubeflow Trainer Pipeline Framework Design #2439

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tenzen-y
Copy link
Member

What this PR does / why we need it:
As we implemented in the Trainer runtime package and sharing previously in https://docs.google.com/presentation/d/1HyEsBa7hxWpIoBXaX6uECiB48FWB85SG1kx15mO8hug/edit?usp=sharing, we added an internal mechanism for easily expanding runtime mechanism.

So, I added the design to KEP to help understand for contributors what the framework mechanism is.

Note that we intend not to add actual interfaces to avoid double management interfaces between codes and documentation since those could potentially be broken in future developments.

cc @kubeflow/wg-training-leads @astefanutti

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from tenzen-y. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y tenzen-y force-pushed the add-trainer-pipeline-framework-documentation branch 4 times, most recently from 7843907 to c1f759d Compare February 15, 2025 23:59
@tenzen-y tenzen-y force-pushed the add-trainer-pipeline-framework-documentation branch from c1f759d to 8a60919 Compare February 16, 2025 00:00
Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this @tenzen-y!
/assign @kubeflow/wg-training-leads @astefanutti @saileshd1402 @Electronic-Waste

@@ -1692,6 +1692,61 @@ _Will be added after initial implementation for PyTorch._

_Will be added after initial implementation for PyTorch._

## Pipeline Framework
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thank you for adding this @tenzen-y!
Please can you create a dedicated issue so we can update the Kubeflow Trainer operators docs with this guide.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sure. In that case, where do we want to put this?
I am guessing if we should create Kubeflow Trainer > User Guide > PlatformDeveloper Guide.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, we should add them under Operator Guides: https://www.kubeflow.org/docs/components/trainer/operator-guides/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't find that. That sounds great.

As described in the following, each phase is basically executed step by step although `Startup Phase` is executed only once
during starting trainer-controller-manager:

- `Startup Phase`: Initialize internal components at once when the trainer-controller-manager starts.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- `Startup Phase`: Initialize internal components at once when the trainer-controller-manager starts.
- `Startup Phase`: Initialize internal components at once when the `kubeflow-trainer-controller-manager` starts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM


![Kubeflow TrainerPipelineFramework](./TrainerPipelineFramework.drawio.svg)

- `Startup Phase`:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if we should also say that cluster operators can:

  1. Register new plugins into existing TrainingRuntime and ClusterTrainingRuntime
  2. Register new runtimes into runtime framework (e.g. SlurmRuntime).
    Or you think, it is out-of-scope since we don't support registration of new runtimes yet ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Register new runtimes into runtime framework (e.g. SlurmRuntime).
Or you think, it is out-of-scope since we don't support registration of new runtimes yet ?

Yeah, the current framework does not support inserting arbitrary plugins and runtimes (for sure, technically, we could do it). However, I think we can add another InternalAPI Initialize Trainer Framework Pipelines to diagram to clarify when pipelines are initialized. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that would be nice.

Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically LGTM. Thanks! @tenzen-y

Comment on lines +1714 to +1715
The Extension Point is exposed and could be added operations within the scope of the Pipeline Framework Plugins Interfaces as plugins
and those plugins are performed in any order.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The Extension Point is exposed and could be added operations within the scope of the Pipeline Framework Plugins Interfaces as plugins
and those plugins are performed in any order.
The Extension Point is exposed, allowing operations to be added as plugins within the scope of the Pipeline Framework Plugins Interfaces. These plugins can be executed in any order.

Rephrase this sentence so it would be more readable:)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, These plugins can be executed in any order. is incorrect since we can not handle the order.
So, I will refine this with These plugins are executed in any order.

- `PostExecution Phase`:
- Internal API:
- `SupendedCondition`: Check if TrainJob is suspended state, and then add `Suspended` condition to TrainJob.
- `CreatedConditon`: Check if TrainJob is created state, and then add `Created` condition to TrainJob.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe more a general comment rather than something strictly in the scope of this PR, I find the name of this condition a bit confusing as the TrainJob has to be "created", even before that condition can be applied.

So either it could be stated that this condition is about whether the children components of the TrainJob have been created, or possibly rename the condition to something like InitializedCondition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. That makes sense. The current Created condition problem is mixed semantics for "created TrainJob" and "created sub-objects based on runtime and job".

So, we might want to reconsider it. @astefanutti @andreyvelich What about trying to change "Created" condition type name in a follow-up PR? If we decide which name we should use instead of "Created", we can fix this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about trying to change "Created" condition type name in a follow-up PR?

That sounds good to me 👍🏼.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, let's discuss it separately.
@astefanutti So do you suggest that we split Created condition for TrainJob between two ?
https://github.com/kubeflow/trainer/blob/8a6091907a2df6ab1d55b9713ead631f72ece56e/docs/proposals/2170-kubeflow-training-v2/README.md#state-transition

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich I was more simply thinking about renaming it to something like Initialized or ComponentsCreated more than splitting it. As it stands it may be interpreted as the TrainJob is created, which is not what it is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants