-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added checksum support to s3 operations. #17
base: main
Are you sure you want to change the base?
Conversation
remote_vector_index_builder/core/common/models/index_build_parameters.py
Outdated
Show resolved
Hide resolved
remote_vector_index_builder/core/object_store/s3/s3_object_store.py
Outdated
Show resolved
Hide resolved
|
||
# Start with default values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't S3 provide some type of integrity checking by default already? https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like it does - but it does not use CRC32 by default, according to those docs. I still think it makes sense to give the caller of this library some control over these parameters, as they may need to be tuned in the future. In fact, to match OpenSearch's behavior, we should be overriding the s3 integrity defaults with CRC32
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment was in regards to the ChecksumMode
on the download path. Just checking if we actually needed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question. I'm actually not sure - I looked into the S3Transfer code and wasn't able to determine if its enabled by default. My guess is that it is, but to be safe, we should enable it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, for default configuration are we going to use CRC checksum or anything else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CRC32 checksum for uploads. For downloads, the checksum will be based on the checksum algorithm used when the object was first uploaded to s3.
remote_vector_index_builder/core/object_store/s3/s3_object_store.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Rohan Chitale <[email protected]>
8ef764f
to
1dfe935
Compare
except TypeError as e: | ||
raise BlobError(f"Error calling boto3.download_fileobj: {e}") from e | ||
except ClientError as e: | ||
raise BlobError(f"Error downloading file: {e}") from e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need additional handling for checksum errors? Also is there any way to actually test that the checksum is working?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checksum errors get raised as ClientError
, so I don't think it makes sense to add additional error handling. I think testing if the checksum works or not belongs in the integration tests, not unit tests. But I'd need to think more about how to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we track the integration tests we need to add later in a GH issue then?
Description
Adding support for checksums on s3 file download and upload operations. Defaulting to CRC32 for uploads.
Issues Resolved
Closes #12
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.