Use the /dbfs/put API endpoint to upload smaller DBFS files #1951
base: main
Conversation
Test Details: go/deco-tests/12129581169

@denik We can skip proper automated benchmarks here. DBFS as a platform feature is almost deprecated and users are highly encouraged to use UC Volumes. The legacy Databricks CLI used the PUT API (https://github.com/databricks/databricks-cli), but this new CLI does not, so this PR is really meant to handle that regression.
This reverts commit 8ec1e07.
If integration tests don't run automatically, an authorized user can run them manually by following the instructions below.
Checks will be approved automatically on success.
```
@@ -114,7 +228,36 @@ func (w *DbfsClient) Write(ctx context.Context, name string, reader io.Reader, m
	}
}

handle, err := w.workspaceClient.Dbfs.Open(ctx, absPath, fileMode)
```
This new approach really belongs in the SDK. There is an existing interface for dealing with DBFS files (that is called here). The details of streaming a file or doing a single put call can be abstracted there and surfaced here with a dedicated FileMode to indicate whether it should be a single call or multiple calls. The size of a file can be retrieved through the io.Seeker interface. The change here should really be limited to determining the file mode, not the implementation. The SDK guarantees the correctness of the implementation in either streaming or single-call mode.
Given that DBFS is a service that will be deprecated (at least the public-facing part), I think we should just keep this in the CLI rather than invest time defining the interfaces on the SDK side and using them here.
This PR is really meant to address the regression from the legacy Databricks CLI.
Happy to move it to the SDK if you disagree.
Changes
The DBFS put API is much more efficient and faster than the streaming upload API. It's recommended for files smaller than 2 GB.
This PR modifies the DBFS filer client to use the PUT API for local files that are smaller than the 2 GB threshold.
This PR utilizes a reader and a writer from io.Pipe() to make streaming uploads work.

Why don't we use the PUT API for non-local files?

The most important use case for the databricks fs cp command is uploading local files to DBFS. Thus I did not invest time into making that use case work well with this optimization, since it'll lead to additional complexity.

Tests
Unit and integration tests. Also manually.
Before:
After: