Implement partial or streaming reads/writes (CloudFile abstraction) #9
From @moradology in #455 originally:

Glad to see that there's some willingness to explore options outside the current scope of cloud providers. HTTP presents some unique issues vs the more-like-a-real-filesystem cloud provider options already supported - hopefully those don't prove to be more than an annoyance here. One thing I'm wondering about is range reads. In boto, ranges can be read like this:

```python
import boto3

s3 = boto3.client('s3')
bucket_name = 'your-bucket-name'
object_key = 'path/to/your/object'
start_byte = 0
end_byte = 1023  # First 1KB

response = s3.get_object(
    Bucket=bucket_name,
    Key=object_key,
    Range=f'bytes={start_byte}-{end_byte}'
)
data = response['Body'].read()  # Just the bytes we want
```

I'm new to the lib and certainly haven't gone through the source in detail, but I wonder how well the

```python
from pathlib import Path

file_path = Path('path/to/your/file')
start_byte = 0
end_byte = 1023  # First 1KB

with file_path.open('rb') as file:
    file.seek(start_byte)
    data = file.read(end_byte - start_byte + 1)
```

It occurs to me that the expected behavior in this instance is a bit ambiguous, right? Like, if the file is remote, would
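The `seek`/`read` idiom in the snippet above can be wrapped as a small helper that mirrors the inclusive semantics of an HTTP/S3 `Range` header. This is an illustrative sketch only; `read_range` is a hypothetical name, not an API of cloudpathlib or pathlib, demonstrated here against a local file standing in for a remote object:

```python
import tempfile
from pathlib import Path


def read_range(path: Path, start_byte: int, end_byte: int) -> bytes:
    """Read bytes [start_byte, end_byte] inclusive, like an HTTP Range header."""
    with path.open("rb") as f:
        f.seek(start_byte)
        return f.read(end_byte - start_byte + 1)


# Demo on a local file standing in for a "remote" object.
with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp) / "object.bin"
    p.write_bytes(bytes(range(256)) * 8)   # 2048 bytes of sample data
    first_kb = read_range(p, 0, 1023)      # first 1KB, matching the boto3 example
    print(len(first_kb))                   # 1024
```

For a local path this is cheap; the open question in this thread is what the same call should cost when the path is remote.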
Like you point out, there is not a

Our current caching model is whole-file based. So the code above would execute; it would just download the whole file first, which is probably not what a user wants.

We have discussed

We also have discussed a
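Because the current cache is whole-file based, a partial-read abstraction would need some form of range-level caching of its own. Here is a toy sketch of the idea, under loudly stated assumptions: `FakeRemote` and `CloudFile` are hypothetical names invented for illustration (the only "network" call is `get_range`), and a real implementation would merge overlapping ranges rather than cache them verbatim:

```python
class FakeRemote:
    """Stand-in for a remote object store; get_range is the only 'network' call."""
    def __init__(self, data: bytes):
        self.data = data
        self.calls = []  # record range requests for demonstration

    def get_range(self, start: int, end: int) -> bytes:
        self.calls.append((start, end))
        return self.data[start:end + 1]  # inclusive end, like an HTTP Range


class CloudFile:
    """Hypothetical partial-read wrapper: fetch and cache only requested ranges."""
    def __init__(self, remote):
        self.remote = remote
        self.cache = {}  # (start, end) -> bytes; a real impl would merge ranges

    def read_range(self, start: int, end: int) -> bytes:
        key = (start, end)
        if key not in self.cache:
            self.cache[key] = self.remote.get_range(start, end)
        return self.cache[key]


remote = FakeRemote(b"x" * 10_000)
cf = CloudFile(remote)
cf.read_range(0, 1023)   # one range request, not a whole-file download
cf.read_range(0, 1023)   # served from the range cache; no second request
print(remote.calls)      # [(0, 1023)]
```

The point of the sketch is the contract, not the implementation: repeated reads of the same range hit the network once, and no call ever fetches the whole object.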
Thanks so much for the engagement here @pjbull! We're very interested in this (our entire stack / approach to science basically relies on a step like this).
I like the idea of a
Yeah, we also like the idea; we're just a little wary of the implementation complexity and maintenance burden. We'd be happy to consider a PR that does the following:

It may be the case that we decide it should be a separate repo/project that we have as an optional dependency and use if available, or we may decide it should be part of the
I've been doing this direct streaming via the smart-open method in #264
Currently, if you want to read a file, we download the whole file and then open it for reading. Most backends have some way to do streaming reads; we may be able to improve the experience if we can do the same.
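The shape of a streaming read is simple regardless of backend: pull fixed-size chunks from whatever file-like handle the backend returns instead of materializing the whole object. A minimal sketch, assuming a hypothetical `stream_chunks` helper and an arbitrary chunk size (nothing here is cloudpathlib API); `io.BytesIO` stands in for the backend's streaming handle:

```python
import io

CHUNK_SIZE = 64 * 1024  # assumed tunable, not a value any backend defines


def stream_chunks(fileobj, chunk_size=CHUNK_SIZE):
    """Yield a file-like object's contents chunk by chunk instead of all at once."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            return
        yield chunk


# io.BytesIO stands in for whatever streaming handle a backend returns.
source = io.BytesIO(b"abc" * 100_000)
total = sum(len(c) for c in stream_chunks(source))
print(total)  # 300000
```

Peak memory stays at one chunk rather than the full object, which is the experience improvement the issue is after.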
There may be some tricky bits with caching here. Can we stream to the user and cache the streamed portion at the same time? Is there an async approach that makes sense for this?
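On the stream-and-cache question, one possible answer is a tee: wrap the chunk iterator so each chunk is written to the cache as it is yielded to the user. This is a sketch under assumptions (the helper names are invented, and `io.BytesIO` objects stand in for both the remote stream and the cache file):

```python
import io


def iter_chunks(fileobj, chunk_size=4096):
    """Yield fixed-size chunks from a file-like object."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            return
        yield chunk


def stream_and_cache(chunks, cache_file):
    """Yield chunks to the caller while writing the same bytes to a cache file."""
    for chunk in chunks:
        cache_file.write(chunk)
        yield chunk


source = io.BytesIO(b"0123456789" * 1000)  # stand-in for the remote stream
cache = io.BytesIO()                       # stand-in for the local cache file
consumed = b"".join(stream_and_cache(iter_chunks(source), cache))
print(consumed == cache.getvalue())        # True
```

One caveat this sketch surfaces: if the user stops consuming mid-stream, the cache holds only a prefix, so a real implementation would need to mark such cache entries as partial.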