Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added checksum support to s3 operations. #17

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

rchitale7
Copy link
Member

@rchitale7 rchitale7 commented Mar 3, 2025

Description

Adding support for checksums on s3 file download and upload operations. Defaulting to CRC32 for uploads.

Issues Resolved

Closes #12

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@rchitale7 rchitale7 added the enhancement New feature or request label Mar 3, 2025
@rchitale7 rchitale7 self-assigned this Mar 3, 2025
@rchitale7 rchitale7 marked this pull request as ready for review March 3, 2025 21:51

# Start with default values
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't S3 provide some type of integrity checking by default already? https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it does - but it does not use CRC32 by default, according to those docs. I still think it makes sense to give the caller of this library some control over these parameters, as they may need to be tuned in the future. In fact, to match OpenSearch's behavior, we should be overriding the s3 integrity defaults with CRC32

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment was in regards to the ChecksumMode on the download path. Just checking if we actually needed it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. I'm actually not sure - I looked into the S3Transfer code and wasn't able to determine if its enabled by default. My guess is that it is, but to be safe, we should enable it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, for default configuration are we going to use CRC checksum or anything else?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CRC32 checksum for uploads. For downloads, the checksum will be based on the checksum algorithm used when the object was first uploaded to s3.

Comment on lines +206 to 209
except TypeError as e:
raise BlobError(f"Error calling boto3.download_fileobj: {e}") from e
except ClientError as e:
raise BlobError(f"Error downloading file: {e}") from e
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need additional handling for checksum errors? Also is there any way to actually test that the checksum is working?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checksum errors get raised as ClientError, so I don't think it makes sense to add additional error handling. I think testing if the checksum works or not belongs in the integration tests, not unit tests. But I'd need to think more about how to do that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we track the integration tests we need to add later in a GH issue then?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Implementation] Remote Vector Index Builder - Create Vectors Dataset and Upload Index Tasks
3 participants