Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(backend): Fix enable_caching issues when handling PVC creation/deletion #11411

Merged
merged 1 commit into from
Jan 16, 2025

Conversation

rimolive
Copy link
Member

Description of your changes:
Resolves #10188

When setting enable_caching=false, KFP Driver still tries to create the cache entry. Added a condition to ensure cache entry will only get created if enable_caching=true.

Checklist:

@github-actions github-actions bot added the ci-passed All CI tests on a pull request have passed label Nov 26, 2024
err = createCache(ctx, createdExecution, opts, taskStartedTime, fingerPrint, cacheClient)
if err != nil {
return "", createdExecution, pb.Execution_FAILED, fmt.Errorf("failed to create cache entrty for create pvc: %w", err)
if opts.Task.GetCachingOptions().GetEnableCache() && ecfg.CachedMLMDExecutionID != "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ecfg.CachedMLMDExecutionID != "" condition might possibly not be necessary here, as creating a PVC doesn't always depend on a cached execution ID. The cached execution ID might not exist at the time of PVC creation.
Unlike here, as deleting a PVC without a valid cached execution ID could result in errors or unintended behavior (e.g., trying to delete a PVC that doesn’t exist or isn't linked to the cache).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right about createPVC, but it seems to me like this won't have the intended result for deletepvc either

Unlike here, as deleting a PVC without a valid cached execution ID could result in errors or unintended behavior (e.g., trying to delete a PVC that doesn’t exist or isn't linked to the cache).

when would we ever hit this in either case?

this exact condition is checked here, and this results in an early exit.

Regardless, how can ecfg.CachedMLMDExecutionID != "" ever be true if we never create a cache outside this condition?

@rimolive did you test to see if deletepvc is even getting cached?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HumairAK Yes, tested with the same pipeline but creating/deleting the PVC and with cache enabled. The cache fingerprint is generated in both tasks, so maybe this condition should also apply to deletePVC.

@github-actions github-actions bot added ci-passed All CI tests on a pull request have passed and removed ci-passed All CI tests on a pull request have passed labels Nov 27, 2024
@github-actions github-actions bot added ci-passed All CI tests on a pull request have passed and removed ci-passed All CI tests on a pull request have passed labels Nov 27, 2024
Copy link
Contributor

@hbelmiro hbelmiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@VaniHaripriya
Copy link
Contributor

/lgtm

Copy link
Contributor

@mprahl mprahl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a test?

It seems that samples/v2/pipeline_with_volume.py already tests it with the cache enabled. It should be simple enough to copy this test but with the cache disabled?

@rimolive
Copy link
Member Author

rimolive commented Jan 14, 2025

Could you please add a test?

I'm afraid it is not possible because it's a special condition when you create a run directly from the SDK:

client.create_run_from_pipeline_func(test_pipeline, arguments={}, enable_caching=False)

@mprahl Sample test does not interact with KFP that way, it only compiles the sample and submit as-is to KFP. https://github.com/kubeflow/pipelines/blob/master/samples/v2/sample_test.py#L92

Personally, I think duplicating this code just to cover one single case is overengineering. I can take a look at other options, but I'm open for other suggestions.

@mprahl
Copy link
Contributor

mprahl commented Jan 14, 2025

Could you please add a test?
I'm afraid it is not possible because it's a special condition when you create a run directly from the SDK:

client.create_run_from_pipeline_func(test_pipeline, arguments={}, enable_caching=False)

@mprahl Sample test does not interact with KFP that way, it only compiles the sample and submit as-is to KFP. https://github.com/kubeflow/pipelines/blob/master/samples/v2/sample_test.py#L92

Personally, I think duplicating this code just to cover one single case is overengineering. I can take a look at other options, but I'm open for other suggestions.

Could you please add a test?
I'm afraid it is not possible because it's a special condition when you create a run directly from the SDK:

client.create_run_from_pipeline_func(test_pipeline, arguments={}, enable_caching=False)

@mprahl Sample test does not interact with KFP that way, it only compiles the sample and submit as-is to KFP. https://github.com/kubeflow/pipelines/blob/master/samples/v2/sample_test.py#L92

Personally, I think duplicating this code just to cover one single case is overengineering. I can take a look at other options, but I'm open for other suggestions.

pipeline_with_volume

I disagree that this is "overengineering". Since this used to work before and was broken and not caught (i.e. regression) in a newer release, that signals to me that the test coverage is not sufficient.

Submitting the job with enable_caching=False as you showed in your sample is an override of the compiled YAML. You can explicitly disable the cache through the SDK. Here is a diff of the existing sample that disables caching. I'm not suggesting to change the existing test but to add an additional one with this diff.

diff --git a/samples/v2/pipeline_with_volume.py b/samples/v2/pipeline_with_volume.py
index c1a8b1035..216e0947a 100644
--- a/samples/v2/pipeline_with_volume.py
+++ b/samples/v2/pipeline_with_volume.py
@@ -41,10 +41,10 @@ def pipeline_with_volume():
         access_modes=['ReadWriteOnce'],
         size='5Mi',
         storage_class_name='standard',
-    )
+    ).set_caching_options(False)
 
-    task1 = producer()
-    task2 = consumer().after(task1)
+    task1 = producer().set_caching_options(False)
+    task2 = consumer().set_caching_options(False).after(task1)
 
     kubernetes.mount_pvc(
         task1,
@@ -58,11 +58,11 @@ def pipeline_with_volume():
     )
 
     delete_pvc1 = kubernetes.DeletePVC(
-        pvc_name=pvc1.outputs['name']).after(task2)
+        pvc_name=pvc1.outputs['name']).set_caching_options(False).after(task2)
 
 if __name__ == '__main__':
     # execute only if run as a script
     from kfp import compiler
     compiler.Compiler().compile(
         pipeline_func=pipeline_with_volume,
-        package_path='pipeline_with_volume.json')
\ No newline at end of file
+        package_path='pipeline_with_volume.yaml')
\ No newline at end of file

mount_path='/data',
)

delete_pvc1 = kubernetes.DeletePVC(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to also have caching disabled for the deletePvc function test coverage?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, as the Driver changes apply to both functions. I'm adding it to the test.

@@ -16,9 +16,9 @@
.PHONY: all
all:
@for f in $$(find . -name "*.py"); do \
compiled="$${f%.*}_pipeline.json"; \
compiled="$${f%.*}_pipeline.yaml"; \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you'll need to update the clean Make target as well since it looks for all JSON files still.

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Pipeline with volume creation, mount and deletion in v2 engine pipeline."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Pipeline with volume creation, mount and deletion in v2 engine pipeline."""
"""Pipeline with no caching on volume creation, mount and deletion in v2 engine pipeline."""

Copy link
Contributor

@mprahl mprahl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@rimolive
Copy link
Member Author

@HumairAK Can you PTAL?

@HumairAK
Copy link
Collaborator

/approve

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: HumairAK, mprahl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 027ca8b into kubeflow:master Jan 16, 2025
31 of 32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved ci-passed All CI tests on a pull request have passed lgtm size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[sdk] enable_caching breaks when using CreatePVC: must specify FingerPrint
6 participants