Implement partial or streaming reads/writes (CloudFile abstraction) #9

pjbull · 2020-08-17T17:48:52Z

Currently, if you want to read a file we download the whole file and then open it for reading. Most backends will have some way to do streaming reads. We may be able to improve the experience if we can do the same.

There may be some tricky bits with caching here. Can we stream to the user and then cache the streamed portion at the same time? Is there an async way that makes sense to do this?
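One way to picture "stream to the user and cache the streamed portion at the same time" is a tee-style reader. The following is only a minimal sketch using the standard library: `TeeReader`, `source`, and `cache_path` are hypothetical names, not part of cloudpathlib, and `source` stands in for whatever streaming body a backend returns. It assumes a cache entry should only become visible once the stream has been read to the end.

```python
import io
from pathlib import Path


class TeeReader(io.RawIOBase):
    """Hypothetical wrapper: serves reads from `source` and copies each chunk to disk."""

    def __init__(self, source, cache_path):
        self._source = source  # any object with a blocking read(n) -> bytes
        self._cache_path = Path(cache_path)
        # Write under a temporary name so a partial download is never
        # mistaken for a complete cache entry.
        self._partial_path = self._cache_path.with_name(self._cache_path.name + ".partial")
        self._cache_file = open(self._partial_path, "wb")
        self._complete = False

    def readable(self):
        return True

    def readinto(self, buffer):
        chunk = self._source.read(len(buffer))
        if not chunk:
            # End of stream reached: the cached copy is now complete.
            self._complete = True
            return 0
        buffer[: len(chunk)] = chunk
        self._cache_file.write(chunk)
        return len(chunk)

    def close(self):
        if not self.closed:
            self._cache_file.close()
            if self._complete:
                # Promote the finished download to the real cache location.
                self._partial_path.replace(self._cache_path)
            else:
                # Stream was abandoned early: discard the partial copy.
                self._partial_path.unlink()
            super().close()
```

In this sketch a reader that stops early leaves no cache entry behind, which keeps the whole-file cache honest at the cost of discarding the bytes already fetched. An async variant would need a different design and is not attempted here.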

pjbull · 2024-07-29T18:55:00Z

From @moradology in #455 originally:

Glad to see that there's some willingness to explore options outside the current scope of cloud providers. HTTP presents some unique issues vs the more-like-a-real-filesystem cloud provider options already supported - hopefully those don't prove to be more than an annoyance here.

One thing I'm wondering about is range reads. In boto, ranges can be read like this:

import boto3

s3 = boto3.client('s3')

bucket_name = 'your-bucket-name'
object_key = 'path/to/your/object'
start_byte = 0
end_byte = 1023  # First 1KB

response = s3.get_object(
    Bucket=bucket_name,
    Key=object_key,
    Range=f'bytes={start_byte}-{end_byte}'
)

data = response['Body'].read() # Just the bytes we want

I'm new to the lib and certainly haven't gone through the source in detail, but I wonder how well the Path abstraction fits with this. Here's what a pathlib (stdlib) read looks like for only a selected range:

from pathlib import Path

file_path = Path('path/to/your/file')
start_byte = 0
end_byte = 1023  # First 1KB

with file_path.open('rb') as file:
    file.seek(start_byte)
    data = file.read(end_byte - start_byte + 1)

It occurs to me that the expected behavior in this instance is a bit ambiguous, right? If the file is remote, would cloudpathlib download the whole thing locally and then seek through the bytes, or would it read only the bytes it needs?

pjbull · 2024-07-29T18:55:12Z

Like you point out, there is not a pathlib API for partial or streaming read/write. That is handled by the File abstractions in io.

Our current caching model is whole-file based. So the code above would execute, but it would download the whole file first, which is probably not what a user wants.

We have discussed CloudFile abstractions as a potential scope extension to enable these scenarios. Other folks have reported success with smart_open + cloudpathlib together for these scenarios. That said, it is a pretty complicated implementation (e.g., take a look at the smart_open s3 version). Given that scope, I think a File abstraction is a longer way out.

We have also discussed a CloudPath-only read_range API. This would be substantially easier to implement, but it breaks code that wants to handle both Path and CloudPath in the same way.
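To make the trade-off concrete, here is a rough sketch of how calling code might paper over the difference. Everything in it is an assumption for illustration: `read_range` is a hypothetical method name (no such method exists on CloudPath today), and the free function below is a made-up helper, not a proposed API. It uses a native range read when the path object offers one and otherwise falls back to the open/seek/read pattern that works for any pathlib-style object.

```python
from pathlib import Path


def read_range(path, start, end):
    """Return bytes start..end (inclusive) from `path`.

    Hypothetical helper: prefers a `read_range` method if the path object
    provides one, and otherwise falls back to open/seek/read.
    """
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    native = getattr(path, "read_range", None)
    if callable(native):
        # A cloud path could fetch only the requested bytes here.
        return native(start, end)
    # Plain pathlib.Path (or anything with a compatible open()).
    with path.open("rb") as f:
        f.seek(start)
        return f.read(end - start + 1)
```

A wrapper like this is exactly the extra layer that a shared File abstraction would make unnecessary, which is the cost being weighed above.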

TomNicholas · 2024-07-29T19:10:48Z

Thanks so much for the engagement here @pjbull ! We're very interested in this (our entire stack / approach to science basically relies on a step like this).

> We have discussed CloudFile abstractions

I like the idea of a CloudFile abstraction as a way to still use pathlib-like syntax for local/remote files. The idea of importing AnyPath and having everything else just work is extremely enticing.

pjbull · 2024-07-29T19:57:28Z

Yeah, we also like the idea—just are a little wary of the implementation complexity and maintenance burden.

We'd be happy to consider a PR that does the following:

Plays nice with our open implementation
Plays nice with our caching strategy
Mirrors one of the io abstractions in the way we do for pathlib.Path
Does not introduce additional third-party dependencies (this is a preference; there's a little wiggle room here for optional features)
Works with the core providers we support
Usual goodies of tests, docs, typing, etc.

It may be the case that we decide it should be a separate repo/project that we have as an optional dependency and use if available, or we may decide it should be part of the cloudpathlib core.
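As a rough illustration of what "mirrors one of the io abstractions" could look like, the sketch below subclasses `io.RawIOBase` and turns each read into a range request. It is not cloudpathlib code: `RangeReadFile` and `fetch_range` are hypothetical names, and `fetch_range(start, end)` is assumed to return the inclusive byte range, for example by wrapping a ranged `get_object` call like the boto3 snippet earlier in the thread. Caching and writes are deliberately left out.

```python
import io


class RangeReadFile(io.RawIOBase):
    """Hypothetical read-only, seekable file backed by a range-fetching callable.

    fetch_range(start, end) must return bytes start..end inclusive.
    """

    def __init__(self, fetch_range, size):
        self._fetch_range = fetch_range
        self._size = size  # total object size in bytes, known up front
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self._pos + offset
        elif whence == io.SEEK_END:
            new_pos = self._size + offset
        else:
            raise ValueError(f"invalid whence: {whence}")
        if new_pos < 0:
            raise ValueError("negative seek position")
        # Seeking is free: no bytes are fetched until the next read.
        self._pos = new_pos
        return self._pos

    def readinto(self, buffer):
        if self._pos >= self._size or len(buffer) == 0:
            return 0
        # Request only as many bytes as the caller asked for.
        end = min(self._pos + len(buffer), self._size) - 1
        chunk = self._fetch_range(self._pos, end)
        buffer[: len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)
```

Because it is a raw stream, it can be wrapped in `io.BufferedReader` to avoid one request per small read. How such a class would interact with the existing open implementation and the whole-file cache is the hard part of the checklist above and is not addressed by this sketch.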

msmitherdc · 2024-10-10T22:18:44Z

I've been doing this direct streaming via the smart-open method in #264

pjbull added the enhancement New feature or request label Oct 2, 2020

pjbull mentioned this issue Jun 30, 2021

Hello from fsspec! #96

Closed

pjbull mentioned this issue Jun 2, 2022

Add ability to turn off caching (or automatically remove it once files are uploaded or loaded into memory) #233

Closed

pjbull changed the title ~~Implement streaming reads~~ Implement streaming reads/writes Jun 2, 2022

pjbull mentioned this issue Aug 19, 2022

Implement smart_open instead of .open() to allow efficient streaming (saving/loading) of large files to cloud bucket. #264

Open

pjbull mentioned this issue May 25, 2023

OverwriteNewerLocalError when reading same resource in parallel #283

Open

pjbull mentioned this issue Jul 29, 2024

Read from http - httppathlib? #455

Open

pjbull changed the title ~~Implement streaming reads/writes~~ Implement partial or streaming reads/writes Jul 29, 2024

pjbull changed the title ~~Implement partial or streaming reads/writes~~ Implement partial or streaming reads/writes (CloudFile abstraction) Jul 30, 2024

scottyhq mentioned this issue Nov 25, 2024

Use cloudpathlib instead of fsspec? zarr-developers/VirtualiZarr#172

Open
