
Remove Catalog from spec #18

Merged · rabernat merged 4 commits into main on Aug 26, 2024
Conversation

@rabernat (Contributor) commented Aug 16, 2024

Following our discussion today, this PR removes the concept of a Catalog from the spec entirely and replaces it with a more lightweight Synchronization Endpoint.

TODO:

  • Finish synchronization endpoint API spec

@paraseba (Collaborator) left a comment

Beautiful, thanks for writing this.

rabernat marked this pull request as draft August 16, 2024 21:30
@lindseynield (Contributor) left a comment

I think this makes sense to me. To confirm my understanding: every icechunk dataset will have an icechunk.json file that represents the latest state of the data, and when we commit a change to a dataset, we are essentially committing a new state file and overwriting the existing icechunk.json with the contents of that new state file?

@rabernat (Contributor, Author)

Yes, that's right. But this raises a few issues/concerns:

  • it violates the "files are immutable once written" constraint
  • the same data is duplicated between the immutable state file and icechunk.json
  • do we really need a new state file for every commit if we are actually just overwriting icechunk.json?
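
For concreteness, a minimal sketch of that commit flow (the store client and key names are hypothetical; the state fields follow the schema table quoted later in this thread):

```python
import json
import uuid

def commit(store, store_id: str, generation: int, store_root: str) -> None:
    """Naive commit: write an immutable state file, then overwrite
    the mutable icechunk.json with the same contents."""
    state = {
        "id": store_id,
        "generation": generation + 1,  # incremented on every update
        "store_root": store_root,
    }
    payload = json.dumps(state)

    # 1. A new immutable state file for every commit.
    store.put(f"{uuid.uuid4().hex}.statefile.json", payload)

    # 2. Overwrite the pointer file. This is the step that breaks the
    #    "files are immutable once written" constraint and duplicates
    #    the state file's contents.
    store.put("icechunk.json", payload)
```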

@rabernat (Contributor, Author)

Just a clarifying comment about how this fits into the whole catalog discussion.

By eliminating the Catalog from the icechunk spec, we basically eliminate what Orestis and Lindsey dubbed the "technical catalog". That leaves us free to just focus on the "federated catalog".

Our platform would provide a "synchronization endpoint" service for Icechunk stores. This would make our service aware of any time that a user read from or committed to a store. It would be up to us how we would integrate this into the federated catalog.

I really like this because it provides a key role for our platform without getting overly specific about much else at the Icechunk level.

@rabernat (Contributor, Author)

Here are my thoughts on these issues. Bottom line, I don't think any of them is a blocker.

  • it violates the "files are immutable once written" constraint

This is an acceptable tradeoff for not having to have an external catalog. For stores that support atomic compare-and-swap, it's completely fine. For S3, it will likely be tolerable for many use cases, particularly when using a synchronization endpoint.

The alternative would be to require listing the object store and getting the most recent state file in a numerical sequence, similar to LanceDB. That has downsides as well.
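
For comparison, a rough sketch of that listing-based alternative (hypothetical `list` API; assumes zero-padded sequential keys so lexicographic order matches numeric order):

```python
def latest_state_key(store) -> str:
    """Find the newest state file with no mutable pointer file at all."""
    # Requires an extra LIST operation on every read -- one of the
    # downsides relative to a single well-known icechunk.json key.
    keys = store.list(suffix=".statefile.json")  # e.g. "00000042.statefile.json"
    if not keys:
        raise FileNotFoundError("no state files found in store")
    return max(keys)  # zero-padding makes the lexicographic max the latest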

  • the same data is duplicated between the immutable state file and icechunk.json

Not a problem as long as the commit process ensures that these are consistent. Writing the immutable state file will always succeed, whereas overwriting icechunk.json may fail if there has been a conflicting update in the meantime. The immutable state file will preserve the orphaned snapshot, which is a nice feature IMO.

  • do we really need a new state file for every commit if we are actually just overwriting icechunk.json?

Based on my response above, I do think it is nice to have both.
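
To make the sequencing concrete, a minimal sketch of such a commit (the `put_if_generation` conditional-write primitive is hypothetical; true compare-and-swap stores provide something like it directly, while plain S3 can only approximate it):

```python
import json

class CommitConflict(Exception):
    """icechunk.json was updated by someone else mid-commit."""

def commit(store, new_state: dict, statefile_key: str) -> None:
    payload = json.dumps(new_state)

    # Step 1 always succeeds: the immutable state file is written
    # under a fresh key, so nothing is ever overwritten.
    store.put(statefile_key, payload)

    # Step 2 may fail: overwrite icechunk.json only if its generation
    # is still the one this commit was based on.
    if not store.put_if_generation(
        "icechunk.json", payload, expected=new_state["generation"] - 1
    ):
        # The state file from step 1 survives as an orphaned snapshot,
        # preserving the failed commit's data for inspection or retry.
        raise CommitConflict("concurrent update to icechunk.json")
```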

@lindseynield (Contributor)

I think I am pretty aligned with everything you are proposing here! I would be interested to hear @digitaltopo's thoughts

  • it violates the "files are immutable once written" constraint
    This is an acceptable tradeoff for not having to have an external catalog. For stores that support atomic compare-and-swap, it's completely fine. For S3, it will likely be tolerable for many use cases, particularly when using a synchronization endpoint.

The alternative would be to require listing the object store and getting the most recent state file in a numerical sequence, similar to LanceDB. That has downsides as well.

I lean more towards file overwriting here as well. It makes it easier to know what the latest state is (since we don't have a catalog), and we will still have the statefiles for the previous versions. This does raise the question of whether we will need some sort of garbage collection if we accumulate a lot of statefiles.

  • the same data is duplicated between the immutable state file and icechunk.json
    Not a problem as long as the commit process ensures that these are consistent. Writing the immutable state file will always succeed, whereas overwriting icechunk.json may fail if there has been a conflicting update in the meantime. The immutable state file will preserve the orphaned snapshot, which is a nice feature IMO.
  • do we really need a new state file for every commit if we are actually just overwriting icechunk.json?
    Based on my response above, I do think it is nice to have both.

Agreed, I would also lean towards having both, even if there is data duplication

@jhamman (Member) left a comment

Thanks for getting this started @rabernat. I want to encourage you to finish your thoughts on the API spec to help the service refactor project see how this will fit together.

@@ -131,7 +129,7 @@ The state file is a JSON file with the following JSON schema:
 |--|--|--|--|
 | id | YES | str UID | A unique identifier for the store |
 | generation | YES | int | An integer which must be incremented whenever the state file is updated |
-| store_root | NO | str | A URI which points to the root location of the store in object storage. If blank, the store root is assumed to be in the same directory as the state file itself. |
+| store_root | YES | str | A URI which points to the root location of the store in object storage. |
Member left a comment

Can you explain why this is required now? For stores NOT managed by arraylake, this is going to make it very hard to move icechunk data.

@rabernat (Contributor, Author)

I agree with you that it should not be required. Not sure why I changed my mind here.
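
For illustration, the fallback behavior the optional field described might look like this (hypothetical helper name):

```python
def resolve_store_root(state: dict, statefile_uri: str) -> str:
    """If store_root is absent, assume the store root is the directory
    containing the state file itself. This is what keeps a store
    relocatable: copy the whole prefix and nothing needs rewriting."""
    explicit = state.get("store_root")
    return explicit if explicit else statefile_uri.rsplit("/", 1)[0]
```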

def get_store_statefile_location(store_identifier) -> URI:
    """Get the location of a store state file."""
    ...

## Synchronization Endpoint
Member left a comment

pushing this just a bit further will help us scope the changes on the service side.

@rabernat (Contributor, Author) commented Aug 21, 2024

What additional detail do you think is needed to unblock work on the service side?

The idea is that, instead of reading and writing icechunk.json directly in object storage, the client should read (GET) and write (POST) its contents to the synchronization endpoint. The service should store this data in a database.
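
To sketch that flow concretely (the endpoint URL and routes are hypothetical placeholders, since the API spec is still the open TODO above):

```python
import requests

SYNC_ENDPOINT = "https://example.com/icechunk"  # hypothetical service URL

def read_state(store_id: str) -> dict:
    """GET the current icechunk.json contents from the service."""
    resp = requests.get(f"{SYNC_ENDPOINT}/stores/{store_id}/state")
    resp.raise_for_status()
    return resp.json()

def commit_state(store_id: str, new_state: dict) -> None:
    """POST the new icechunk.json contents; the service persists them
    in its database (and thereby sees every read and commit)."""
    resp = requests.post(f"{SYNC_ENDPOINT}/stores/{store_id}/state", json=new_state)
    resp.raise_for_status()
```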

Getting more specific means specifying in detail the contents of the icechunk.json file. This is just the state file. From my POV, that's already pretty detailed (individual fields are specified), although it could be tightened up a bit.

Like the rest of the spec, this is likely to evolve a bit as we actually implement it. And as @paraseba has frequently reminded us, the state file is really the last step in terms of implementation. So getting overly specific about the state file contents now feels a little premature.

Happy to do more work here, just looking for guidance on what would be helpful.

@lindseynield (Contributor) commented Aug 21, 2024

A follow up question here (which may be easier to talk through synchronously): What is your perspective on the benefit of duplicating the latest state file contents in the icechunk.json file (that is, the file that points to the latest version of the dataset)? Is there a reason that this can't just be a pointer to the state file? This would require two IO operations instead of one to read the latest dataset state file contents, but it's not super clear to me if that is the only benefit.

Just to clarify here, I'm thinking about this as outside of the "synchronization endpoint"; that is, in the offline, open-source version of icechunk.

@paraseba (Collaborator) commented Aug 21, 2024

This would require two IO operations instead of one to read the latest dataset state file contents, but it's not super clear to me if that is the only benefit.

@lindseynield Yes, this is purely a performance optimization. In fact, we should probably have both things, the contents and the link in the icechunk.json file. Doing increasing numbers for the different icechunk.json versions could get tricky, in particular, it needs the "just arrived" write-if-not-exists feature of S3. Would be safer to have random keys for the icechunk files.
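
To illustrate the two key-naming schemes being weighed (a sketch only):

```python
import uuid

def sequential_key(current_generation: int) -> str:
    # Readable and sortable, but two concurrent committers can both
    # compute the same "next" number, so this is only safe with a
    # put-if-absent (write-if-not-exists) conditional write.
    return f"{current_generation + 1:08d}.statefile.json"

def random_key() -> str:
    # No coordination needed and nothing can be overwritten; the
    # icechunk.json pointer records which key is the latest.
    return f"{uuid.uuid4().hex}.statefile.json"
```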

@rabernat (Contributor, Author)

Is there a reason that this can't just be a pointer to the state file? This would require two IO operations instead of one to read the latest dataset state file contents, but it's not super clear to me if that is the only benefit.

Fewer IO operations is definitely the main motivation here.

For the synchronization endpoint option, an additional benefit of duplicating the state file metadata is that we get this information in our database, so we have more context about what sort of updates are happening. If we are only storing a very lightweight pointer, we know basically nothing about the dataset and its evolution without being able to read the object store.

@lindseynield (Contributor)

Fewer IO operations is definitely the main motivation here.

For the synchronization endpoint option, an additional benefit of duplicating the state file metadata is that we get this information in our database, so we have more context about what sort of updates are happening. If we are only storing a very lightweight pointer, we know basically nothing about the dataset and its evolution without being able to read the object store.

Totally, the benefits make sense to me in the context of the synchronization endpoint; I was just curious about the case outside of the endpoint, where we only have the icechunk.json. Sounds like reducing the number of read operations is the justification for including the contents of the latest state file in icechunk.json 👍 thanks all!

@digitaltopo (Contributor)

This would require two IO operations instead of one to read the latest dataset state file contents, but it's not super clear to me if that is the only benefit.

@lindseynield Yes, this is purely a performance optimization. In fact, we should probably have both things, the contents and the link in the icechunk.json file. Doing increasing numbers for the different icechunk.json versions could get tricky, in particular, it needs the "just arrived" write-if-not-exists feature of S3. Would be safer to have random keys for the icechunk files.

Thanks @paraseba, just to disambiguate here, are you saying there are multiple icechunk.json files?

The definitions I was working with so far have been:

  • Multiple x.statefile.json files in s3 that get created every time there is a commit.
  • One icechunk.json file in s3 that serves as the "pointer" to the latest statefile.json (by duplicating the contents of the latest x.statefile.json)

@paraseba (Collaborator)

Thanks @paraseba, just to disambiguate here, are you saying there are multiple icechunk.json files?

The definitions I was working with so far have been:

  • Multiple x.statefile.json files in s3 that get created every time there is a commit.
  • One icechunk.json file in s3 that serves as the "pointer" to the latest statefile.json (by duplicating the contents of the latest x.statefile.json)

No no, a single icechunk.json, multiple x.statefile.json or whatever we want to call them. I was just arguing against x being an increasing number, unless we want to use write-if-not-exists.
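
So the resulting layout would look something like this (key names purely illustrative, following the x.statefile.json convention above):

```
s3://bucket/store-prefix/
  icechunk.json               # the single mutable pointer (embeds the latest state)
  a1b2c3d4e5.statefile.json   # immutable state files, one per commit, random keys
  f6a7b8c9d0.statefile.json
```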

@rabernat
Copy link
Contributor Author

As discussed on Slack, going to merge this as is and then make a new PR for the newly proposed statefile update mechanism.

rabernat marked this pull request as ready for review August 26, 2024 15:32
rabernat merged commit ddb289f into main Aug 26, 2024
1 check passed
rabernat deleted the ryan/spec/remove-catalog branch August 26, 2024 21:06