feat(sdk/backend): Add support for placeholders in resource limits #11501
base: master
Conversation
Force-pushed from 9a90c6a to c0343b9.
Note that I labeled this as a feature, but it's also a bug fix, as mentioned in the Slack thread here: Additionally, it seems that this recent unreleased commit added placeholder support to the wrong accelerator type field, so it'd be nice to merge this PR before the next KFP release:
Force-pushed from c0343b9 to 4d49c1d.
I've tested the SDK portion, as we're eagerly awaiting the capability of setting cpu/ram/accelerator type/count dynamically in our projects. I also see the following in the compiled pipeline configuration for a step using those dynamic properties:

resources:
  accelerator:
    resourceCount: '{{$.inputs.parameters[''pipelinechannel--some-step-accelerator_count'']}}'
    resourceType: '{{$.inputs.parameters[''pipelinechannel--some-step-accelerator_type'']}}'
  resourceCpuLimit: '{{$.inputs.parameters[''pipelinechannel--some-step-cpu'']}}'
  resourceMemoryLimit: '{{$.inputs.parameters[''pipelinechannel--some-step-ram'']}}'
@@ -362,6 +362,35 @@ func Container(ctx context.Context, opts Options, mlmd *metadata.Client, cacheCl
	return execution, nil
}

// getPodResource will accept the new field that accepts placeholders (e.g. resourceMemoryLimit) and the old float64
// field (e.g. memoryLimit) and return the resolved value as a Quanity. If the returned Quantity is nil, it was not set
Suggested change:
- // field (e.g. memoryLimit) and return the resolved value as a Quanity. If the returned Quantity is nil, it was not set
+ // field (e.g. memoryLimit) and return the resolved value as a Quantity. If the returned Quantity is nil, it was not set
|
	resolved, err = resolvePodSpecInputRuntimeParameter(new, executorInput)
	if err != nil {
		return nil, fmt.Errorf("failed to init podSpecPatch: %w", err)
nit: would prefer if this error was specific to getPodResource(), something like "failed to resolve executor input when retrieving pod resource", since we're already reporting the init podSpecPatch error in the caller function
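For concreteness, the suggested wrapping would look something like this (illustrative wording, not the final code):

	resolved, err = resolvePodSpecInputRuntimeParameter(new, executorInput)
	if err != nil {
		return nil, fmt.Errorf("failed to resolve executor input when retrieving pod resource: %w", err)
	}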
@@ -362,6 +362,35 @@ func Container(ctx context.Context, opts Options, mlmd *metadata.Client, cacheCl
	return execution, nil
}

// getPodResource will accept the new field that accepts placeholders (e.g. resourceMemoryLimit) and the old float64
nit: maybe worth mentioning in the getPodResource doc comment that new values take precedence over old, if both are provided
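For example, the doc comment could gain a line like this (illustrative wording):

// If both the new placeholder field and the old float64 field are set, the new field takes precedence.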
Note that this also resolves: #11375

Looks like a safe change. Folks pre-KFP 2.4 and post-2.4 will have continued support of the old fields, even with pipelines rendered pre-SDK 2.10, when submitted. So removing them from the pipeline render makes sense. Recompiling will remove these old fields, which does seem like it's API-changing (the API is implicit here in the YAML even when not defined in the proto) but looks backwards compatible, since we'll continue to handle it on the driver side. Since the initial change completely removed it from the Vertex side, I do not think this is likely to have downstream effects there, but regardless, fyi @chensun.

I tested this with deprecated and new values, and with parameter values, which are correctly resolved. Which is great, thanks @mprahl!

Some nit comments above @mprahl, up to you to address them or not; let me know, otherwise I can approve.

/lgtm
Description of your changes:
The API introduced new fields prefixed with Resource (e.g. ResourceCpuLimit) to replace the old fields without the prefix. The Driver hadn't been updated to honor those fields, but the SDK started using them, which led to unexpected behavior.
The Driver now honors both fields but prioritizes the new fields. The SDK now only sets the new fields.
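As a rough sketch of that precedence logic (the function signature and the resolvePlaceholder helper are illustrative stand-ins, not the actual Driver code):

package driver

import (
	"fmt"
	"strconv"

	"k8s.io/apimachinery/pkg/api/resource"
)

// getPodResourceSketch illustrates the behavior described above: the new
// string field (which may carry an input-parameter placeholder) takes
// precedence over the deprecated float64 field; a nil return means the
// resource was not set at all.
func getPodResourceSketch(newField string, oldField float64, resolvePlaceholder func(string) (string, error)) (*resource.Quantity, error) {
	var raw string
	switch {
	case newField != "":
		// The new field wins; resolve any placeholder against the executor input first.
		resolved, err := resolvePlaceholder(newField)
		if err != nil {
			return nil, fmt.Errorf("failed to resolve executor input when retrieving pod resource: %w", err)
		}
		raw = resolved
	case oldField != 0:
		// Fall back to the deprecated numeric field.
		raw = strconv.FormatFloat(oldField, 'f', -1, 64)
	default:
		// Neither field was set.
		return nil, nil
	}
	q, err := resource.ParseQuantity(raw)
	if err != nil {
		return nil, fmt.Errorf("invalid resource quantity %q: %w", raw, err)
	}
	return &q, nil
}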
The outcome is that resource limits/requests can now use input parameters.
Note that pipeline_spec_builder.py was doing some validation on the limits/requests being set, but that's already handled in the user-facing method (e.g. set_cpu_limit).
Resolves:
#11500
Checklist: