Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(sdk/backend): Add support for placeholders in resource limits #11501

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mprahl
Copy link
Contributor

@mprahl mprahl commented Jan 6, 2025

Description of your changes:

The API introduced new fields prefixed with Resource (e.g. ResourceCpuLimit) to replace the old fields without the prefix. The Driver hadn't been updated to honor those fields but the SDK started using them which led to unexpected behavior.

The Driver now honors both fields but prioritizes the new fields. The SDK now only sets the new fields.

The outcome is that resource limits/requests can now use input parameters.

Note that pipeline_spec_builder.py was doing some validation on the limits/requests being set, but that's already handled in the user facing method (e.g. set_cpu_limit).

Resolves:
#11500

Checklist:

Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chensun for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mprahl mprahl force-pushed the fix-param-resources branch 2 times, most recently from 9a90c6a to c0343b9 Compare January 6, 2025 22:26
@mprahl
Copy link
Contributor Author

mprahl commented Jan 6, 2025

/cc @HumairAK @rimolive

Note that I labeled this as a feature but it's also a bug fix as mentioned in the Slack thread here:
https://cloud-native.slack.com/archives/C073N7BMLB1/p1735746170303549

Additionally, it seems that this recent unreleased commit added placeholder support to the wrong accelerator type field, so it'd be nice to merge this PR before the next KFP release:
#11404

@google-oss-prow google-oss-prow bot requested review from HumairAK and rimolive January 6, 2025 22:27
The API introduced new fields prefixed with Resource (e.g.
ResourceCpuLimit) to replace the old fields without the prefix. The
Driver hadn't been updated to honor those fields but the SDK started
using them which led to unexpected behavior.

The Driver now honors both fields but prioritizes the new fields. The
SDK now only sets the new fields.

The outcome is that resource limits/requests can now use input
parameters.

Note that pipeline_spec_builder.py was doing some validation on the
limits/requests being set, but that's already handled in the user facing
method (e.g. set_cpu_limit).

Resolves:
kubeflow#11500

Signed-off-by: mprahl <[email protected]>
@mprahl mprahl force-pushed the fix-param-resources branch from c0343b9 to 4d49c1d Compare January 6, 2025 22:36
@HumairAK HumairAK added this to the KFP 2.4.0 milestone Jan 8, 2025
@tomzx
Copy link

tomzx commented Jan 8, 2025

I've tested the SDK portion as we're eagerly awaiting the capability of setting cpu/ram/accelerator type/count dynamically in our projects.
I can confirm that it works, as in it passes compilation and can be submitted to GCP Vertex AI (for example).
The Vertex AI pipeline does use the dynamically defined cpu/ram/accelerator type/count to find the proper machine to use and to attach accelerators if needed.

I also see the following in the compiled pipeline configuration for a step using those dynamic properties.

        resources:
          accelerator:
            resourceCount: '{{$.inputs.parameters[''pipelinechannel--some-step-accelerator_count'']}}'
            resourceType: '{{$.inputs.parameters[''pipelinechannel--some-step-accelerator_type'']}}'
          resourceCpuLimit: '{{$.inputs.parameters[''pipelinechannel--some-step-cpu'']}}'
          resourceMemoryLimit: '{{$.inputs.parameters[''pipelinechannel--some-step-ram'']}}'

@@ -362,6 +362,35 @@ func Container(ctx context.Context, opts Options, mlmd *metadata.Client, cacheCl
return execution, nil
}

// getPodResource will accept the new field that accepts placeholders (e.g. resourceMemoryLimit) and the old float64
// field (e.g. memoryLimit) and return the resolved value as a Quanity. If the returned Quantity is nil, it was not set
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// field (e.g. memoryLimit) and return the resolved value as a Quanity. If the returned Quantity is nil, it was not set
// field (e.g. memoryLimit) and return the resolved value as a Quantity. If the returned Quantity is nil, it was not set


resolved, err = resolvePodSpecInputRuntimeParameter(new, executorInput)
if err != nil {
return nil, fmt.Errorf("failed to init podSpecPatch: %w", err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would prefer if this error was specific to getPodResource(), something like "failed to resolve executor input when retrieving pod resource", since we're already reporting the initpodspecpatch error in the caller function

@@ -362,6 +362,35 @@ func Container(ctx context.Context, opts Options, mlmd *metadata.Client, cacheCl
return execution, nil
}

// getPodResource will accept the new field that accepts placeholders (e.g. resourceMemoryLimit) and the old float64
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe worth mentioning in the getPodResource call new values take precedent over old, if both are provided

@HumairAK
Copy link
Collaborator

HumairAK commented Jan 9, 2025

Note that this also resolves: #11375

Looks like a safe change. Folks pre-kfp 2.4 and post 2.4 will have continued support of old fields, even with pipelines rendered pre sdk 2.10, when submitted. So removing them from the pipeline render makes sense. Recompiling will remove these old fields, which does seem like it's api-changing (the api is implicit here in the yaml even when not defined in the proto) but looks backwards compatible since we'll continue to handle it on driver side. Since initial change completely removed it from vertex side, I do not think this is likely to have downstream effects there, but regardless fyi @chensun.

I tested this with deprecated and new values, and with parameter values which are correctly resolved. Which is great, thanks @mprahl!

some nit comments above @mprahl upto you to address them or not, let me know otherwise I can approve

/lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants