Remove Catalog from spec #18
Conversation
Beautiful, thanks for writing this.
I think this makes sense to me. To confirm my understanding: every icechunk dataset will have an `icechunk.json` file that represents the latest state of the data, and when we commit a change to a dataset, we are essentially committing a new state file and overwriting the existing `icechunk.json` with the contents of this new state file?
Yes that's right. But this raises a few issues / concerns
Just a clarifying comment about how this fits into the whole catalog discussion. By eliminating the Catalog from the icechunk spec, we basically eliminate what Orestis and Lindsey dubbed the "technical catalog". That leaves us free to just focus on the "federated catalog". Our platform would provide a "synchronization endpoint" service for Icechunk stores. This would make our service aware of any time that a user read from or committed to a store. It would be up to us how we would integrate this into the federated catalog. I really like this because it provides a key role for our platform without getting overly specific about much else at the Icechunk level.
Here are my thoughts on these issues. Bottom line, I don't think any of them is a blocker.
This is an acceptable tradeoff for not having to have an external catalog. For stores that support atomic compare-and-swap, it's completely fine. For S3, it will likely be tolerable for many use cases, particularly when using a synchronization endpoint. The alternative would be to require listing the object store and getting the most recent state file in a numerical sequence, similar to LanceDB. That has downsides as well.
Not a problem as long as the commit process ensures that these are consistent. Writing the immutable state file will always succeed, whereas overwriting icechunk.json may fail if there has been a conflicting update in the meantime. The immutable state file will preserve the orphaned snapshot, which is a nice feature IMO.
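To make the two-step commit concrete, here is a minimal sketch of that flow. Everything here is illustrative, not part of the spec: the in-memory `store` dict stands in for object storage, and the `commit` helper, the `states/` prefix, and the generation-based conflict check are assumptions about how a compare-and-swap commit could work.

```python
import json
import uuid

# Toy in-memory "object store" standing in for S3/GCS (illustrative only).
store = {}

def commit(new_snapshot: dict, expected_generation: int) -> str:
    """Write an immutable state file, then compare-and-swap icechunk.json."""
    state = dict(new_snapshot, generation=expected_generation + 1)
    # Step 1: writing the immutable state file always succeeds
    # (random key, so no chance of a conflict).
    key = f"states/{uuid.uuid4().hex}.json"
    store[key] = json.dumps(state)
    # Step 2: overwriting icechunk.json may fail on a conflicting update.
    current = json.loads(store.get("icechunk.json", '{"generation": 0}'))
    if current["generation"] != expected_generation:
        # The immutable state file at `key` survives as an orphaned snapshot.
        raise RuntimeError("conflicting update: retry from latest state")
    store["icechunk.json"] = json.dumps(state)
    return key

commit({"id": "store-1"}, expected_generation=0)
print(json.loads(store["icechunk.json"])["generation"])  # -> 1
```

A losing writer raises in step 2, but its step-1 state file remains in the store, which is the "orphaned snapshot is preserved" behavior described above.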
Based on my response above, I do think it is nice to have both.
I think I am pretty aligned with everything you are proposing here! I would be interested to hear @digitaltopo's thoughts
I lean more towards file overwriting here as well. It makes it easier to know what the latest state is (since we don't have a catalog), and we will still have the statefiles for the previous versions. This does bring up the question in my mind of whether we will need to consider some sort of garbage collection if we accumulate a lot of statefiles.
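A garbage-collection pass over accumulated statefiles could be as simple as the following sketch. The dict-as-object-store, the `states/` prefix, and the `statefile` pointer field inside `icechunk.json` are all assumptions made for illustration:

```python
import json

def gc_statefiles(objects: dict, keep: int = 10) -> list:
    """Delete all but the `keep` newest state files, never the one
    currently referenced by icechunk.json. Returns the deleted keys."""
    current = json.loads(objects["icechunk.json"]).get("statefile")
    # Newest first, ordered by the generation recorded in each state file.
    states = sorted(
        (k for k in objects if k.startswith("states/")),
        key=lambda k: json.loads(objects[k])["generation"],
        reverse=True,
    )
    doomed = [k for k in states[keep:] if k != current]
    for k in doomed:
        del objects[k]
    return doomed
```

A real implementation would also need to avoid deleting snapshots still reachable from branches or tags, which is the usual complication with GC in this kind of design.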
Agreed, I would also lean towards having both, even if there is data duplication.
Thanks for getting this started @rabernat. I want to encourage you to finish your thoughts on the API spec to help the service refactor project see how this will fit together.
@@ -131,7 +129,7 @@ The state file is a JSON file with the following JSON schema:
|--|--|--|--|
| id | YES | str UID | A unique identifier for the store |
| generation | YES | int | An integer which must be incremented whenever the state file is updated |
- | store_root | NO | str | A URI which points to the root location of the store in object storage. If blank, the store root is assumed to be in the same directory as the state file itself. |
+ | store_root | YES | str | A URI which points to the root location of the store in object storage. |
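For concreteness, a state file carrying just the three fields from this table might look like the following; all of the values (the id, the bucket URI) are invented for illustration:

```python
import json

# A made-up state file instance using only the fields from the table above.
state_file = {
    "id": "3f9c2b1a",                      # str UID: unique identifier for the store
    "generation": 7,                       # incremented on every update
    "store_root": "s3://bucket/mystore/",  # root location of the store in object storage
}
print(json.dumps(state_file, indent=2))
```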
Can you explain why this is required now? For stores NOT managed by arraylake, this is going to make it very hard to move icechunk data.
I agree with you that it should not be required. Not sure why I changed my mind here.
```python
def get_store_statefile_location(store_identifier) -> URI:
    """Get the location of a store state file."""
    ...
```

## Synchronization Endpoint
pushing this just a bit further will help us scope the changes on the service side.
What additional detail do you think is needed to unblock work on the service side?
The idea is that, instead of reading and writing `icechunk.json` directly in object storage, the client should read (GET) and write (POST) its contents to the synchronization endpoint. The service should store this data in a database.
Getting more specific means specifying in detail the contents of the `icechunk.json` file. This is just the state file. From my POV, that's already pretty detailed (individual fields are specified), although it could be tightened up a bit.
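The GET/POST interaction described above could be sketched as a tiny client like this. The base URL, the route shape, and the payload format are all assumptions for illustration; the spec only says the client reads and writes the contents of `icechunk.json` through the endpoint instead of touching object storage directly:

```python
import json
import urllib.request

# Hypothetical synchronization-endpoint base URL (not a real service).
BASE = "https://example.invalid/api/stores"

def fetch_state(store_id: str) -> dict:
    """GET the current icechunk.json contents from the service."""
    with urllib.request.urlopen(f"{BASE}/{store_id}/state") as resp:
        return json.load(resp)

def push_state(store_id: str, state: dict) -> None:
    """POST new icechunk.json contents; the service persists them in its database."""
    req = urllib.request.Request(
        f"{BASE}/{store_id}/state",
        data=json.dumps(state).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)
```

Because the service sits in the write path, it can reject a POST whose generation conflicts with the stored one, giving the compare-and-swap semantics that plain S3 overwrites lack.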
Like the rest of the spec, this is likely to evolve a bit as we actually implement it. And as @paraseba has frequently reminded us, the state file is really the last step in terms of implementation. So getting overly specific about the state file contents now feels a little premature.
Happy to do more work here, just looking for guidance on what would be helpful.
A follow up question here (which may be easier to talk through synchronously): what is your perspective on the benefit of duplicating the latest state file contents in the `icechunk.json`? Just to clarify here, I'm thinking about this as outside of the "synchronization endpoint"; that is, in the offline, OS version of icechunk.
@lindseynield Yes, this is purely a performance optimization. In fact, we should probably have both things, the contents and the link, in the `icechunk.json` file. Doing increasing numbers for the different `icechunk.json` versions could get tricky; in particular, it needs the "just arrived" write-if-not-exists feature of S3. It would be safer to have random keys for the icechunk files.
Fewer IO operations is definitely the main motivation here. For the synchronization endpoint option, an additional benefit of duplicating the state file metadata is that we get this information in our database, so we have more context about what sort of updates are happening. If we are only storing a very lightweight pointer, we know basically nothing about the dataset and its evolution without being able to read the object store.
Totally, the benefits make sense to me in the context of the synchronization endpoint; I was just curious about the case outside of the endpoint where we only have the `icechunk.json`.
Thanks @paraseba, just to disambiguate here, are you saying there are multiple `icechunk.json` files? The definitions I was working with so far have been:
No no, a single `icechunk.json`, multiple state files.
As discussed on Slack, going to merge this as is and then make a new PR for the newly proposed statefile update mechanism.
Following our discussion today, this PR removes the concept of a Catalog from the spec entirely and replaces it with a more lightweight Synchronization Endpoint.
TODO: