
Remove Catalog from spec #18

Merged · rabernat merged 4 commits into main on Aug 26, 2024
Conversation

@rabernat (Contributor) commented Aug 16, 2024

Following our discussion today, this PR removes the concept of a Catalog from the spec entirely and replaces it with a more lightweight Synchronization Endpoint.

TODO:

  • Finish synchronization endpoint API spec

@paraseba (Collaborator) left a comment

Beautiful, thanks for writing this.

rabernat marked this pull request as draft August 16, 2024 21:30
@lindseynield (Contributor) left a comment

I think this makes sense to me. To confirm my understanding: every icechunk dataset will have an icechunk.json file that represents the latest state of the data, and when we commit a change to a dataset, we are essentially committing a new state file and overwriting the existing icechunk.json with the contents of that new state file?

@rabernat (Contributor, Author)

Yes, that's right. But this raises a few issues/concerns:

  • it violates the "files are immutable once written" constraint
  • the same data is duplicated between the immutable state file and icechunk.json
  • do we really need a new state file for every commit if we are actually just overwriting icechunk.json?
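
For concreteness, a minimal sketch of that commit flow (the store client and key names are hypothetical; the state fields follow the schema table quoted later in this thread):

```python
import json
import uuid

def commit(store, store_id: str, generation: int, store_root: str) -> None:
    """Naive commit: write an immutable state file, then overwrite
    the mutable icechunk.json with the same contents."""
    state = {
        "id": store_id,
        "generation": generation + 1,  # incremented on every update
        "store_root": store_root,
    }
    payload = json.dumps(state)

    # 1. A new immutable state file for every commit.
    store.put(f"{uuid.uuid4().hex}.statefile.json", payload)

    # 2. Overwrite the pointer file. This is the step that breaks the
    #    "files are immutable once written" constraint and duplicates
    #    the state file's contents.
    store.put("icechunk.json", payload)
```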

@rabernat (Contributor, Author)

Just a clarifying comment about how this fits into the whole catalog discussion.

By eliminating the Catalog from the icechunk spec, we basically eliminate what Orestis and Lindsey dubbed the "technical catalog". That leaves us free to just focus on the "federated catalog".

Our platform would provide a "synchronization endpoint" service for Icechunk stores. This would make our service aware of any time that a user read from or committed to a store. It would be up to us how we would integrate this into the federated catalog.

I really like this because it provides a key role for our platform without getting overly specific about much else at the Icechunk level.

@rabernat (Contributor, Author)

Here are my thoughts on these issues. Bottom line, I don't think any of them is a blocker.

  • it violates the "files are immutable once written" constraint

This is an acceptable tradeoff for not having to have an external catalog. For stores that support atomic compare-and-swap, it's completely fine. For S3, it will likely be tolerable for many use cases, particularly when using a synchronization endpoint.

The alternative would be to require listing the object store and getting the most recent state file in a numerical sequence, similar to LanceDB. That has downsides as well.
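
For comparison, a rough sketch of that listing-based alternative (hypothetical `list` API; assumes zero-padded sequential keys so lexicographic order matches numeric order):

```python
def latest_state_key(store) -> str:
    """Find the newest state file with no mutable pointer file at all."""
    # Requires an extra LIST operation on every read -- one of the
    # downsides relative to a single well-known icechunk.json key.
    keys = store.list(suffix=".statefile.json")  # e.g. "00000042.statefile.json"
    if not keys:
        raise FileNotFoundError("no state files found in store")
    return max(keys)  # zero-padding makes the lexicographic max the latest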

  • the same data is duplicated between the immutable state file and icechunk.json

Not a problem as long as the commit process ensures that these are consistent. Writing the immutable state file will always succeed, whereas overwriting icechunk.json may fail if there has been a conflicting update in the meantime. The immutable state file will preserve the orphaned snapshot, which is a nice feature IMO.

  • do we really need a new state file for every commit if we are actually just overwriting icechunk.json?

Based on my response above, I do think it is nice to have both.
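
To make the sequencing concrete, a minimal sketch of such a commit (the `put_if_generation` conditional-write primitive is hypothetical; true compare-and-swap stores provide something like it directly, while plain S3 can only approximate it):

```python
import json

class CommitConflict(Exception):
    """icechunk.json was updated by someone else mid-commit."""

def commit(store, new_state: dict, statefile_key: str) -> None:
    payload = json.dumps(new_state)

    # Step 1 always succeeds: the immutable state file is written
    # under a fresh key, so nothing is ever overwritten.
    store.put(statefile_key, payload)

    # Step 2 may fail: overwrite icechunk.json only if its generation
    # is still the one this commit was based on.
    if not store.put_if_generation(
        "icechunk.json", payload, expected=new_state["generation"] - 1
    ):
        # The state file from step 1 survives as an orphaned snapshot,
        # preserving the failed commit's data for inspection or retry.
        raise CommitConflict("concurrent update to icechunk.json")
```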

@lindseynield (Contributor)

I think I am pretty aligned with everything you are proposing here! I would be interested to hear @digitaltopo's thoughts

  • it violates the "files are immutable once written" constraint
    This is an acceptable tradeoff for not having to have an external catalog. For stores that support atomic compare-and-swap, it's completely fine. For S3, it will likely be tolerable for many use cases, particularly when using a synchronization endpoint.

The alternative would be to require listing the object store and getting the most recent state file in a numerical sequence, similar to LanceDB. That has downsides as well.

I lean more towards file overwriting here as well. It makes it easier to know what the latest state is (since we don't have a catalog), and we will still have the statefiles for the previous versions. This does raise the question of whether we will need some sort of garbage collection if we accumulate a lot of statefiles.

  • the same data is duplicated between the immutable state file and icechunk.json
    Not a problem as long as the commit process ensures that these are consistent. Writing the immutable state file will always succeed, whereas overwriting icechunk.json may fail if there has been a conflicting update in the meantime. The immutable state file will preserve the orphaned snapshot, which is a nice feature IMO.
  • do we really need a new state file for every commit if we are actually just overwriting icechunk.json?
    Based on my response above, I do think it is nice to have both.

Agreed, I would also lean towards having both, even if there is data duplication

@jhamman (Member) left a comment

Thanks for getting this started @rabernat. I want to encourage you to finish your thoughts on the API spec to help the service refactor project see how this will fit together.

@@ -131,7 +129,7 @@ The state file is a JSON file with the following JSON schema:
 |--|--|--|--|
 | id | YES | str UID | A unique identifier for the store |
 | generation | YES | int | An integer which must be incremented whenever the state file is updated |
-| store_root | NO | str | A URI which points to the root location of the store in object storage. If blank, the store root is assumed to be in the same directory as the state file itself. |
+| store_root | YES | str | A URI which points to the root location of the store in object storage. |
Member left a comment

Can you explain why this is required now? For stores NOT managed by arraylake, this is going to make it very hard to move icechunk data.

@rabernat (Contributor, Author)

I agree with you that it should not be required. Not sure why I changed my mind here.
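
For illustration, the fallback behavior the optional field described might look like this (hypothetical helper name):

```python
def resolve_store_root(state: dict, statefile_uri: str) -> str:
    """If store_root is absent, assume the store root is the directory
    containing the state file itself. This is what keeps a store
    relocatable: copy the whole prefix and nothing needs rewriting."""
    explicit = state.get("store_root")
    return explicit if explicit else statefile_uri.rsplit("/", 1)[0]
```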

def get_store_statefile_location(store_identifier) -> URI:
    """Get the location of a store state file."""
    ...

## Synchronization Endpoint
Member left a comment

pushing this just a bit further will help us scope the changes on the service side.

@rabernat (Contributor, Author) commented Aug 21, 2024

What additional detail do you think is needed to unblock work on the service side?

The idea is that, instead of reading and writing icechunk.json directly in object storage, the client should read (GET) and write (POST) its contents to the synchronization endpoint. The service should store this data in a database.
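
To sketch that flow concretely (the endpoint URL and routes are hypothetical placeholders, since the API spec is still the open TODO above):

```python
import requests

SYNC_ENDPOINT = "https://example.com/icechunk"  # hypothetical service URL

def read_state(store_id: str) -> dict:
    """GET the current icechunk.json contents from the service."""
    resp = requests.get(f"{SYNC_ENDPOINT}/stores/{store_id}/state")
    resp.raise_for_status()
    return resp.json()

def commit_state(store_id: str, new_state: dict) -> None:
    """POST the new icechunk.json contents; the service persists them
    in its database (and thereby sees every read and commit)."""
    resp = requests.post(f"{SYNC_ENDPOINT}/stores/{store_id}/state", json=new_state)
    resp.raise_for_status()
```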

Getting more specific means specifying in detail the contents of the icechunk.json file. This is just the state file. From my POV, that's already pretty detailed (individual fields are specified), although it could be tightened up a bit.

Like the rest of the spec, this is likely to evolve a bit as we actually implement it. And as @paraseba has frequently reminded us, the state file is really the last step in terms of implementation. So getting overly specific about the state file contents now feels a little premature.

Happy to do more work here, just looking for guidance on what would be helpful.

@lindseynield (Contributor) commented Aug 21, 2024

A follow up question here (which may be easier to talk through synchronously): What is your perspective on the benefit of duplicating the latest state file contents in the icechunk.json file (that is, the file that points to the latest version of the dataset)? Is there a reason that this can't just be a pointer to the state file? This would require two IO operations instead of one to read the latest dataset state file contents, but it's not super clear to me if that is the only benefit.

Just to clarify here, I'm thinking about this as outside of the "synchronization endpoint"; that is, in the offline, open-source version of icechunk.

@paraseba (Collaborator) commented Aug 21, 2024

This would require two IO operations instead of one to read the latest dataset state file contents, but it's not super clear to me if that is the only benefit.

@lindseynield Yes, this is purely a performance optimization. In fact, we should probably have both things, the contents and the link in the icechunk.json file. Doing increasing numbers for the different icechunk.json versions could get tricky, in particular, it needs the "just arrived" write-if-not-exists feature of S3. Would be safer to have random keys for the icechunk files.
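
To illustrate the two key-naming schemes being weighed (a sketch only):

```python
import uuid

def sequential_key(current_generation: int) -> str:
    # Readable and sortable, but two concurrent committers can both
    # compute the same "next" number, so this is only safe with a
    # put-if-absent (write-if-not-exists) conditional write.
    return f"{current_generation + 1:08d}.statefile.json"

def random_key() -> str:
    # No coordination needed and nothing can be overwritten; the
    # icechunk.json pointer records which key is the latest.
    return f"{uuid.uuid4().hex}.statefile.json"
```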

@rabernat (Contributor, Author)

Is there a reason that this can't just be a pointer to the state file? This would require two IO operations instead of one to read the latest dataset state file contents, but it's not super clear to me if that is the only benefit.

Fewer IO operations is definitely the main motivation here.

For the synchronization endpoint option, an additional benefit of duplicating the state file metadata is that we get this information in our database, so we have more context about what sort of updates are happening. If we are only storing a very lightweight pointer, we know basically nothing about the dataset and its evolution without being able to read the object store.

@lindseynield (Contributor)

Fewer IO operations is definitely the main motivation here.

For the synchronization endpoint option, an additional benefit of duplicating the state file metadata is that we get this information in our database, so we have more context about what sort of updates are happening. If we are only storing a very lightweight pointer, we know basically nothing about the dataset and its evolution without being able to read the object store.

Totally, the benefits make sense to me in the context of the synchronization endpoint; I was just curious about the case outside of the endpoint, where we only have the icechunk.json. Sounds like reducing the number of read operations is the justification for including the contents of the latest state file in icechunk.json 👍 thanks all!

@digitaltopo (Contributor)

This would require two IO operations instead of one to read the latest dataset state file contents, but it's not super clear to me if that is the only benefit.

@lindseynield Yes, this is purely a performance optimization. In fact, we should probably have both things, the contents and the link in the icechunk.json file. Doing increasing numbers for the different icechunk.json versions could get tricky, in particular, it needs the "just arrived" write-if-not-exists feature of S3. Would be safer to have random keys for the icechunk files.

Thanks @paraseba, just to disambiguate here, are you saying there are multiple icechunk.json files?

The definitions I was working with so far have been:

  • Multiple x.statefile.json files in s3 that get created every time there is a commit.
  • One icechunk.json file in s3 that serves as the "pointer" to the latest statefile.json (by duplicating the contents of the latest x.statefile.json)

@paraseba (Collaborator)

Thanks @paraseba, just to disambiguate here, are you saying there are multiple icechunk.json files?

The definitions I was working with so far have been:

  • Multiple x.statefile.json files in s3 that get created every time there is a commit.
  • One icechunk.json file in s3 that serves as the "pointer" to the latest statefile.json (by duplicating the contents of the latest x.statefile.json)

No no, a single icechunk.json, multiple x.statefile.json or whatever we want to call them. I was just arguing against x being an increasing number, unless we want to use write-if-not-exists.
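
So the resulting layout would look something like this (key names purely illustrative, following the x.statefile.json convention above):

```
s3://bucket/store-prefix/
  icechunk.json               # the single mutable pointer (embeds the latest state)
  a1b2c3d4e5.statefile.json   # immutable state files, one per commit, random keys
  f6a7b8c9d0.statefile.json
```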

@rabernat
Copy link
Contributor Author

As discussed on Slack, going to merge this as is and then make a new PR for the newly proposed statefile update mechanism.

rabernat marked this pull request as ready for review August 26, 2024 15:32
rabernat merged commit ddb289f into main Aug 26, 2024
1 check passed
rabernat deleted the ryan/spec/remove-catalog branch August 26, 2024 21:06