-
Notifications
You must be signed in to change notification settings - Fork 731
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-2170: Add Kubeflow Trainer Pipeline Framework Design #2439
base: master
Are you sure you want to change the base?
KEP-2170: Add Kubeflow Trainer Pipeline Framework Design #2439
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
7843907
to
c1f759d
Compare
Signed-off-by: Yuki Iwai <[email protected]>
c1f759d
to
8a60919
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this @tenzen-y!
/assign @kubeflow/wg-training-leads @astefanutti @saileshd1402 @Electronic-Waste
@@ -1692,6 +1692,61 @@ _Will be added after initial implementation for PyTorch._ | |||
|
|||
_Will be added after initial implementation for PyTorch._ | |||
|
|||
## Pipeline Framework |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, thank you for adding this @tenzen-y!
Please can you create a dedicated issue so we can update the Kubeflow Trainer operators docs with this guide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, sure. In that case, where do we want to put this?
I am guessing if we should create Kubeflow Trainer > User Guide > PlatformDeveloper Guide
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, we should add them under Operator Guides: https://www.kubeflow.org/docs/components/trainer/operator-guides/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I didn't find that. That sounds great.
As described in the following, each phase is basically executed step by step although `Startup Phase` is executed only once | ||
during starting trainer-controller-manager: | ||
|
||
- `Startup Phase`: Initialize internal components at once when the trainer-controller-manager starts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- `Startup Phase`: Initialize internal components at once when the trainer-controller-manager starts. | |
- `Startup Phase`: Initialize internal components at once when the `kubeflow-trainer-controller-manager` starts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
|
||
data:image/s3,"s3://crabby-images/9c167/9c167f77c82fda2b7645cbba212fc6b6760a50e7" alt="Kubeflow TrainerPipelineFramework" | ||
|
||
- `Startup Phase`: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if we should also say that cluster operators can:
- Register new plugins into existing TrainingRuntime and ClusterTrainingRuntime
- Register new runtimes into runtime framework (e.g. SlurmRuntime).
Or you think, it is out-of-scope since we don't support registration of new runtimes yet ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Register new runtimes into runtime framework (e.g. SlurmRuntime).
Or you think, it is out-of-scope since we don't support registration of new runtimes yet ?
Yeah, the current framework does not support inserting arbitrary plugins and runtimes (for sure, technically, we could do it). However, I think we can add another InternalAPI Initialize Trainer Framework Pipelines
to diagram to clarify when pipelines are initialized. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think that would be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically LGTM. Thanks! @tenzen-y
The Extension Point is exposed and could be added operations within the scope of the Pipeline Framework Plugins Interfaces as plugins | ||
and those plugins are performed in any order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Extension Point is exposed and could be added operations within the scope of the Pipeline Framework Plugins Interfaces as plugins | |
and those plugins are performed in any order. | |
The Extension Point is exposed, allowing operations to be added as plugins within the scope of the Pipeline Framework Plugins Interfaces. These plugins can be executed in any order. |
Rephrase this sentence so it would be more readable:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, These plugins can be executed in any order.
is incorrect since we can not handle the order.
So, I will refine this with These plugins are executed in any order.
- `PostExecution Phase`: | ||
- Internal API: | ||
- `SupendedCondition`: Check if TrainJob is suspended state, and then add `Suspended` condition to TrainJob. | ||
- `CreatedConditon`: Check if TrainJob is created state, and then add `Created` condition to TrainJob. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe more a general comment rather than something strictly in the scope of this PR, I find the name of this condition a bit confusing as the TrainJob has to be "created", even before that condition can be applied.
So either it could be stated that this condition is about whether the children components of the TrainJob have been created, or possibly rename the condition to something like InitializedCondition
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. That makes sense. The current Created
condition problem is mixed semantics for "created TrainJob" and "created sub-objects based on runtime and job".
So, we might want to reconsider it. @astefanutti @andreyvelich What about trying to change "Created" condition type name in a follow-up PR? If we decide which name we should use instead of "Created", we can fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about trying to change "Created" condition type name in a follow-up PR?
That sounds good to me 👍🏼.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, let's discuss it separately.
@astefanutti So do you suggest that we split Created condition for TrainJob between two ?
https://github.com/kubeflow/trainer/blob/8a6091907a2df6ab1d55b9713ead631f72ece56e/docs/proposals/2170-kubeflow-training-v2/README.md#state-transition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich I was more simply thinking about renaming it to something like Initialized
or ComponentsCreated
more than splitting it. As it stands it may be interpreted as the TrainJob is created, which is not what it is.
What this PR does / why we need it:
As we implemented in the Trainer runtime package and sharing previously in https://docs.google.com/presentation/d/1HyEsBa7hxWpIoBXaX6uECiB48FWB85SG1kx15mO8hug/edit?usp=sharing, we added an internal mechanism for easily expanding runtime mechanism.
So, I added the design to KEP to help understand for contributors what the framework mechanism is.
Note that we intend not to add actual interfaces to avoid double management interfaces between codes and documentation since those could potentially be broken in future developments.
cc @kubeflow/wg-training-leads @astefanutti
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Fixes #
Checklist: