Use the /dbfs/put API endpoint to upload smaller DBFS files #1951

Open · wants to merge 33 commits into base: main
Conversation

shreyas-goenka (Contributor) commented Dec 2, 2024

Changes

The DBFS put API is significantly faster and more efficient than the streaming upload API (create/add-block/close); it's recommended for files smaller than 2 GB.

This PR modifies the DBFS filer client to use the PUT API for local files that are smaller than the 2 GB threshold.

This PR uses the reader and writer pair returned by io.Pipe() to stream the upload request body without buffering the whole file in memory.

Why don't we use the PUT API for non-local files?

The most important use case for the databricks fs cp command is uploading local files to DBFS, so I did not extend this optimization to non-local sources; doing so would add complexity for little benefit.

Tests

Unit and integration tests, plus manual verification.

Before:

venv➜  bundle-playground git:(master) ✗ databricks fs cp databricks.yml dbfs:/Users/[email protected]/foo.yml -p dogfood --debug
12:41:12 DEBUG POST /api/2.0/dbfs/create
> {
>   "path": "/Users/[email protected]/foo.yml"
> }
< HTTP/2.0 200 OK
< {
<   "handle": 1648410211255868
< } pid=66693 sdk=true
12:41:12 DEBUG POST /api/2.0/dbfs/add-block
> {
>   "data": "IyB5YW1sLWxhbmd1YWdlLXNlcnZlcjogJHNjaGVtYT0uL3NjaGVtYS5qc29uCmJ1bmRsZToKICBuYW1lOiBidW5kbGUtcGxh... (144 more bytes)",
>   "handle": 1648410211255868
> }
< HTTP/2.0 200 OK
< {} pid=66693 sdk=true
12:41:12 DEBUG POST /api/2.0/dbfs/close
> {
>   "handle": 1648410211255868
> }
< HTTP/2.0 200 OK
< {} pid=66693 sdk=true
databricks.yml -> dbfs:/Users/[email protected]/foo.yml

After:

.venv➜  bundle-playground git:(master) ✗ cli  fs cp databricks.yml dbfs:/Users/[email protected]/bar.yml -p dogfood --debug
18:48:22 DEBUG POST /api/2.0/dbfs/put
> <io.Reader>
< HTTP/2.0 200 OK
< {} pid=97432 sdk=true

eng-dev-ecosystem-bot (Collaborator):

Test Details: go/deco-tests/12129581169

shreyas-goenka (Contributor, Author):

@denik We can skip proper automated benchmarks here. DBFS as a platform feature is almost deprecated and users are highly encouraged to use UC Volumes.

The legacy Databricks CLI used the PUT API (https://github.com/databricks/databricks-cli), but this new CLI does not, so this PR is really meant to handle that regression.

github-actions bot commented Jan 2, 2025

If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:

Trigger:
go/deco-tests-run/cli

Inputs:

  • PR number: 1951
  • Commit SHA: e9b0afb337ec152a79235449a4baec9322b6c96c

Checks will be approved automatically on success.

@@ -114,7 +228,36 @@ func (w *DbfsClient) Write(ctx context.Context, name string, reader io.Reader, m
}
}

handle, err := w.workspaceClient.Dbfs.Open(ctx, absPath, fileMode)
Contributor:

This new approach really belongs in the SDK.

There is an existing interface for dealing with DBFS files (that is called here).

The details of streaming a file or doing a single put call can be abstracted there and surfaced here with a dedicated FileMode to indicate whether it should be a single call or multiple calls.

The size of a file can be retrieved through the io.Seeker interface.

The change here should really be limited to determining the file mode and not the implementation.

The SDK guarantees the correctness of the implementation in either streaming or single-call mode.

Contributor Author:

Given that DBFS (at least its public-facing part) is headed for deprecation, I think we should keep this in the CLI rather than invest time defining interfaces on the SDK side and using them here.

This PR is really meant to address the regression from the legacy Databricks CLI.

Contributor Author:

Happy to move it to the SDK if you disagree.
