Implement partial or streaming reads/writes (CloudFile abstraction) #9

Open
pjbull opened this issue Aug 17, 2020 · 5 comments
Labels
enhancement New feature or request

Comments

@pjbull
Member

pjbull commented Aug 17, 2020

Currently, if you want to read a file we download the whole file and then open it for reading. Most backends will have some way to do streaming reads. We may be able to improve the experience if we can do the same.

There may be some tricky bits with caching here. Can we stream to the user and then cache the streamed portion at the same time? Is there an async way that makes sense to do this?
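One way to picture "stream to the user and cache the streamed portion at the same time" is a tee-style reader. The following is only a minimal sketch under stated assumptions: `CachingReader` is a hypothetical name, not part of cloudpathlib, and the source is any readable binary stream (here an in-memory `io.BytesIO` stands in for a backend's streaming response).

```python
import io
import tempfile
from pathlib import Path


class CachingReader(io.RawIOBase):
    """Hypothetical helper: pass reads through to `source` and copy every
    byte handed to the caller into a local cache file."""

    def __init__(self, source, cache_path):
        super().__init__()
        self._source = source
        self._cache = open(cache_path, "wb")
        # True only once the source has been read to the end, so a partial
        # cache file is never mistaken for the whole object.
        self.complete = False

    def readable(self):
        return True

    def readinto(self, buffer):
        chunk = self._source.read(len(buffer))
        if not chunk:
            self.complete = True
            self._cache.flush()
            return 0
        n = len(chunk)
        buffer[:n] = chunk
        self._cache.write(chunk)
        return n

    def close(self):
        if not self.closed:
            self._cache.close()
            self._source.close()
        super().close()


def demo():
    payload = b"0123456789" * 100
    with tempfile.TemporaryDirectory() as tmp:
        cache_path = Path(tmp) / "cached.bin"
        reader = CachingReader(io.BytesIO(payload), cache_path)
        first = reader.read(10)   # caller gets bytes as they arrive
        rest = reader.read()      # read to the end of the stream
        complete = reader.complete
        reader.close()
        return first, len(rest), complete, cache_path.read_bytes() == payload
```

The `complete` flag is the tricky bit in miniature: if the caller stops reading early, the cache file holds only a prefix and would have to be discarded or resumed rather than treated as the cached object.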

@pjbull
Member Author

pjbull commented Jul 29, 2024

From @moradology in #455 originally:


Glad to see that there's some willingness to explore options outside the current scope of cloud providers. HTTP presents some unique issues vs the more-like-a-real-filesystem cloud provider options already supported - hopefully those don't prove to be more than an annoyance here.

One thing I'm wondering about is range reads. In boto, ranges can be read like this:

import boto3

s3 = boto3.client('s3')

bucket_name = 'your-bucket-name'
object_key = 'path/to/your/object'
start_byte = 0
end_byte = 1023  # First 1KB

response = s3.get_object(
    Bucket=bucket_name,
    Key=object_key,
    Range=f'bytes={start_byte}-{end_byte}'
)

data = response['Body'].read() # Just the bytes we want

I'm new to the lib and certainly haven't gone through the source in detail, but I wonder how well the Path abstraction fits with this. Here's what a pathlib (stdlib) read looks like for only a selected range:

from pathlib import Path

file_path = Path('path/to/your/file')
start_byte = 0
end_byte = 1023  # First 1KB

with file_path.open('rb') as file:
    file.seek(start_byte)
    data = file.read(end_byte - start_byte + 1)

It occurs to me that the expected behavior in this instance is a bit ambiguous, right? Like, if the file is remote, would cloudpathlib download the whole thing locally and then seek through the bytes, or would it read only the bytes that are actually needed?
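For reference, the two snippets above ask for the same bytes: an HTTP-style `Range` value is inclusive at both ends, while `seek` plus `read` takes an offset and a length. A small sketch of that correspondence, where `range_header` and `read_inclusive` are hypothetical helper names used only for illustration:

```python
import io


def range_header(start_byte, end_byte):
    """Build an HTTP-style Range value; both ends are inclusive."""
    if start_byte < 0 or end_byte < start_byte:
        raise ValueError("need 0 <= start_byte <= end_byte")
    return f"bytes={start_byte}-{end_byte}"


def read_inclusive(fileobj, start_byte, end_byte):
    """Read bytes start_byte..end_byte (inclusive) from a seekable binary file."""
    fileobj.seek(start_byte)
    return fileobj.read(end_byte - start_byte + 1)


# An in-memory file stands in for a local file opened with mode "rb".
data = bytes(range(256)) * 8  # 2048 bytes
first_kb = read_inclusive(io.BytesIO(data), 0, 1023)
header = range_header(0, 1023)
```

Here `header` is the string passed as `Range=` in the boto3 call, and `first_kb` holds the same 1024 bytes that request would return for an object with identical contents.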

@pjbull
Member Author

pjbull commented Jul 29, 2024

Like you point out, there is not a pathlib API for partial or streaming read/write. That is handled by the File abstractions in io.

Our current caching model is whole-file based. So the code above would execute; it would just download the whole file first, which is probably not what a user wants.

We have discussed CloudFile abstractions as a potential scope extension to enable these scenarios. Other folks have reported success using smart_open and cloudpathlib together for these scenarios. That said, it is a pretty complicated implementation (e.g., take a look at the smart_open s3 version). Given that scope, I think a File abstraction is a longer way out.

We have also discussed a CloudPath-only read_range API. This would be substantially easier to implement, but it breaks code that wants to handle both Path and CloudPath in the same way.
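To illustrate the trade-off, here is a rough sketch of the shim that calling code would need if only cloud paths had a range method. Everything here is an assumption for illustration: `read_range` is not an existing cloudpathlib API, its `(start, length)` signature is made up, and `FakeRemotePath` is a stand-in rather than a real `CloudPath`.

```python
import tempfile
from pathlib import Path


def read_range(path, start, length):
    """Return `length` bytes beginning at offset `start`.

    Uses the object's own `read_range` method if it has one (hypothetical),
    otherwise falls back to open/seek/read as with a plain pathlib.Path.
    """
    native = getattr(path, "read_range", None)
    if callable(native):
        return native(start, length)
    with path.open("rb") as f:
        f.seek(start)
        return f.read(length)


class FakeRemotePath:
    """Stand-in for a cloud path that can serve byte ranges itself."""

    def __init__(self, data):
        self._data = data

    def read_range(self, start, length):
        return self._data[start:start + length]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        local = Path(tmp) / "example.bin"
        local.write_bytes(b"abcdefghij")
        from_local = read_range(local, 2, 3)
    from_remote = read_range(FakeRemotePath(b"abcdefghij"), 2, 3)
    return from_local, from_remote
```

The `getattr` branch is exactly the kind of special-casing that code handling both Path and CloudPath would have to carry around, which is the downside noted above.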

@TomNicholas

Thanks so much for the engagement here @pjbull ! We're very interested in this (our entire stack / approach to science basically relies on a step like this).

> We have discussed CloudFile abstractions

I like the idea of a CloudFile abstraction as a way to still use pathlib-like syntax for local/remote files. The idea of importing AnyPath and having everything else just work is extremely enticing.

@pjbull
Member Author

pjbull commented Jul 29, 2024

Yeah, we also like the idea; we're just a little wary of the implementation complexity and maintenance burden.

We'd be happy to consider a PR that does the following:

  • Plays nice with our open implementation
  • Plays nice with our caching strategy
  • Mirrors one of the io abstractions in the way we do for pathlib.Path
  • Does not introduce additional third-party dependencies (this is a preference; there's a little wiggle room here for optional features)
  • Works with the core providers we support
  • Usual goodies of tests, docs, typing, etc.
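As a rough picture of what "mirrors one of the io abstractions" could mean with no third-party dependencies, here is a minimal sketch of an `io.RawIOBase` subclass that fetches bytes only when they are read. The names `RangeRawReader`, `fetch_range`, and `make_fake_remote` are hypothetical; a real version would back `fetch_range` with a provider's range request and would also need to work with the caching strategy, which this sketch ignores.

```python
import io


class RangeRawReader(io.RawIOBase):
    """Hypothetical read-only raw file backed by range requests.

    `fetch_range(start, end)` must return the bytes from `start` through
    `end` inclusive, and never more than that.
    """

    def __init__(self, fetch_range, size):
        super().__init__()
        self._fetch_range = fetch_range
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self._pos + offset
        elif whence == io.SEEK_END:
            new_pos = self._size + offset
        else:
            raise ValueError(f"invalid whence: {whence}")
        if new_pos < 0:
            raise ValueError("negative seek position")
        self._pos = new_pos
        return self._pos

    def readinto(self, buffer):
        if self._pos >= self._size or len(buffer) == 0:
            return 0
        end = min(self._pos + len(buffer), self._size) - 1
        chunk = self._fetch_range(self._pos, end)
        n = len(chunk)
        buffer[:n] = chunk
        self._pos += n
        return n


def make_fake_remote(data):
    """In-memory stand-in for a remote object; records each range requested."""
    requests = []

    def fetch_range(start, end):
        requests.append((start, end))
        return data[start:end + 1]

    return fetch_range, requests


data = bytes(range(256)) * 16  # 4096 bytes
fetch, log = make_fake_remote(data)
raw = RangeRawReader(fetch, len(data))
raw.seek(1000)
chunk = raw.read(24)  # only bytes 1000..1023 are requested
```

Because it is a raw stream, it can be wrapped in `io.BufferedReader` to get buffered reads, at the cost of fetching a full buffer's worth of bytes per request rather than exactly what the caller asked for.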

It may be the case that we decide it should be a separate repo/project that we have as an optional dependency and use if available, or we may decide it should be part of the cloudpathlib core.

pjbull changed the title from "Implement streaming reads/writes" to "Implement partial or streaming reads/writes" on Jul 29, 2024
pjbull changed the title from "Implement partial or streaming reads/writes" to "Implement partial or streaming reads/writes (CloudFile abstraction)" on Jul 30, 2024
@msmitherdc

I've been doing this direct streaming via the smart-open method in #264
