Use the /dbfs/put API endpoint to upload smaller DBFS files #1951

Open · wants to merge 33 commits into base: main
Conversation

shreyas-goenka (Contributor) commented Dec 2, 2024

Changes

The DBFS put API is significantly faster and more efficient than the streaming upload API (create/add-block/close); it's recommended for files smaller than 2 GB.

This PR modifies the DBFS filer client to use the PUT API for local files that are smaller than the 2 GB threshold.

This PR uses the reader and writer pair returned by io.Pipe() to stream the upload request body without buffering the whole file in memory.

Why don't we use the PUT API for non-local files?

The most important use case for the databricks fs cp command is uploading local files to DBFS, so I did not extend this optimization to non-local sources; doing so would add complexity for little benefit.

Tests

Unit and integration tests, plus manual verification.

Before:

venv➜  bundle-playground git:(master) ✗ databricks fs cp databricks.yml dbfs:/Users/[email protected]/foo.yml -p dogfood --debug
12:41:12 DEBUG POST /api/2.0/dbfs/create
> {
>   "path": "/Users/[email protected]/foo.yml"
> }
< HTTP/2.0 200 OK
< {
<   "handle": 1648410211255868
< } pid=66693 sdk=true
12:41:12 DEBUG POST /api/2.0/dbfs/add-block
> {
>   "data": "IyB5YW1sLWxhbmd1YWdlLXNlcnZlcjogJHNjaGVtYT0uL3NjaGVtYS5qc29uCmJ1bmRsZToKICBuYW1lOiBidW5kbGUtcGxh... (144 more bytes)",
>   "handle": 1648410211255868
> }
< HTTP/2.0 200 OK
< {} pid=66693 sdk=true
12:41:12 DEBUG POST /api/2.0/dbfs/close
> {
>   "handle": 1648410211255868
> }
< HTTP/2.0 200 OK
< {} pid=66693 sdk=true
databricks.yml -> dbfs:/Users/[email protected]/foo.yml

After:

.venv➜  bundle-playground git:(master) ✗ cli  fs cp databricks.yml dbfs:/Users/[email protected]/bar.yml -p dogfood --debug
18:48:22 DEBUG POST /api/2.0/dbfs/put
> <io.Reader>
< HTTP/2.0 200 OK
< {} pid=97432 sdk=true

eng-dev-ecosystem-bot (Collaborator):

Test Details: go/deco-tests/12129581169

shreyas-goenka (Contributor, Author):

@denik We can skip proper automated benchmarks here. DBFS as a platform feature is almost deprecated and users are highly encouraged to use UC Volumes.

The legacy Databricks CLI used the PUT API (https://github.com/databricks/databricks-cli), but this new CLI does not, so this PR is really meant to handle that regression.

github-actions bot commented Jan 2, 2025

If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:

Trigger:
go/deco-tests-run/cli

Inputs:

  • PR number: 1951
  • Commit SHA: e9b0afb337ec152a79235449a4baec9322b6c96c

Checks will be approved automatically on success.

@@ -114,7 +228,36 @@ func (w *DbfsClient) Write(ctx context.Context, name string, reader io.Reader, m
}
}

handle, err := w.workspaceClient.Dbfs.Open(ctx, absPath, fileMode)
Contributor:

This new approach really belongs in the SDK.

There is an existing interface for dealing with DBFS files (that is called here).

The details of streaming a file or doing a single put call can be abstracted there and surfaced here with a dedicated FileMode to indicate whether it should be a single call or multiple calls.

The size of a file can be retrieved through the io.Seeker interface.

The change here should really be limited to determining the file mode and not the implementation.

The SDK guarantees the correctness of the implementation in either streaming or single-call mode.

Contributor Author:

Given that DBFS (at least its public-facing part) is headed for deprecation, I think we should keep this in the CLI rather than invest time defining interfaces on the SDK side and using them here.

This PR is really meant to address the regression from the legacy Databricks CLI.

Contributor Author:

Happy to move it to the SDK if you disagree.
