# Icechunk Specification

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

## Introduction

The Icechunk specification is a storage specification for [Zarr](https://zarr-specs.readthedocs.io/en/latest/specs.html) data.
Icechunk is inspired by Apache Iceberg and borrows many concepts and ideas from the [Iceberg Spec](https://iceberg.apache.org/spec/#version-2-row-level-deletes).

This specification describes a single Icechunk **store**.
A store is defined as a Zarr store containing one or more Arrays and Groups.
The most common scenario is for a store to contain a single Zarr group with multiple arrays, each corresponding to different physical variables but sharing common spatiotemporal coordinates.
However, formally a store can be any valid Zarr hierarchy, from a single Array to a deeply nested structure of Groups and Arrays.
Users of Icechunk SHOULD aim to scope their stores only to related arrays and groups that require consistent transactional updates.

All the data and metadata for a store are stored in a **warehouse**, typically a directory in object storage or file storage.
A separate **catalog** is used to track the latest version of the store.

## Goals

The goals of the specification are as follows:

1. **Serializable isolation** - Reads will be isolated from concurrent writes and always use a committed snapshot of a store. Writes to arrays will be committed via a single atomic operation and will not be partially visible. Readers will not acquire locks.
2. **Chunk sharding and references** - Chunk storage is decoupled from specific file names. Multiple chunks can be packed into a single object (sharding). Zarr-compatible chunks within other file formats (e.g. HDF5, NetCDF) can be referenced.
3. **Time travel** - Previous snapshots of a store remain accessible after new ones have been written. Reverting to an earlier snapshot is trivial and inexpensive.
4. **Schema evolution** - Arrays and Groups can be added, renamed, and removed from the hierarchy with minimal overhead.

### Non-Goals

1. **Low latency** - Icechunk is designed to support analytical workloads on large stores. We accept that the extra layers of metadata files and indirection will introduce additional cold-start latency compared to regular Zarr.

### Filesystem Operations

Icechunk only requires that file systems support the following operations:

- **In-place write** - Files are not moved or altered once they are written. Strong read-after-write consistency is expected.
- **Seekable reads** - Chunk file formats may require seek support (e.g. shards).
- **Deletes** - Stores delete files that are no longer used (via a garbage-collection operation).

These requirements are compatible with object stores, like S3, as well as with filesystems.

Stores do not require random-access writes. Once written, chunk and metadata files are immutable until they are deleted.

## Specification

### Overview

Icechunk uses a series of linked metadata files to describe the state of the store.

- The **state file** is the entry point to the store. It stores a record of snapshots, each of which is a pointer to a single structure file.
- The **structure file** records all of the different arrays and groups in the store, plus their metadata. Every new commit creates a new structure file. The structure file contains pointers to one or more chunk manifest files and [optionally] attribute files.
- **Chunk manifests** store references to individual chunks. A single manifest may store references for multiple arrays or a subset of all the references for a single array.
- **Attributes files** provide a way to store additional user-defined attributes for arrays and groups outside of the structure file. This is important when the attributes are very large.
- **Chunk files** store the actual compressed chunk data, potentially containing data for multiple chunks in a single file.

When reading a store, the client receives a pointer to a state file from the catalog, reads the state file, and chooses a structure file corresponding to a specific snapshot to open.
The client then reads the structure file to determine the structure and hierarchy of the store.
When fetching data from an array, the client first examines the chunk manifest file[s] for that array and finally fetches the chunks referenced therein.

When writing a new store snapshot, the client first writes a new set of chunks and chunk manifests, and then generates a new structure file and state file.
Finally, in an atomic swap operation, it updates the pointer to the state file in the catalog.
Ensuring atomicity of the swap operation is the responsibility of the [catalog](#catalog).

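The read path described above can be sketched in Python pseudocode. This is an illustrative sketch, not a normative API: a plain `dict` stands in for the warehouse, the state file is assumed to be JSON as described below, and the function name is invented.

```python
import json

def open_snapshot(warehouse: dict, statefile_key: str, snapshot_id=None) -> str:
    """Resolve the structure file key for a snapshot.

    `warehouse` stands in for object storage: a mapping of keys to bytes.
    If no snapshot_id is given, default to the most recently committed snapshot.
    """
    # Step 1: read the state file (its location comes from the catalog)
    state = json.loads(warehouse[statefile_key])

    # Step 2: choose a snapshot from the state file's snapshot list
    if snapshot_id is None:
        snap = max(state["snapshots"], key=lambda s: s["timestamp-ms"])
    else:
        snap = next(s for s in state["snapshots"] if s["snapshot-id"] == snapshot_id)

    # Step 3: return the key of the structure file for that snapshot
    return "s/" + snap["structure-file"]
```

From here the client would read the structure file to discover arrays and groups, then consult chunk manifests to fetch chunk data.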
```mermaid
flowchart TD
  subgraph catalog
    cat_pointer[Current Statefile Pointer]
  end
  subgraph metadata
    subgraph state_files
      old_state[State File 1]
      state[State File 2]
    end
    subgraph structure
      structure1[Structure File 1]
      structure2[Structure File 2]
    end
    subgraph attributes
      attrs[Attribute File]
    end
    subgraph manifests
      manifestA[Chunk Manifest A]
      manifestB[Chunk Manifest B]
    end
  end
  subgraph data
    chunk1[Chunk File 1]
    chunk2[Chunk File 2]
    chunk3[Chunk File 3]
    chunk4[Chunk File 4]
  end

  cat_pointer --> state
  state -- snapshot ID --> structure2
  structure1 --> attrs
  structure1 --> manifestA
  structure2 --> attrs
  structure2 --> manifestA
  structure2 --> manifestB
  manifestA --> chunk1
  manifestA --> chunk2
  manifestB --> chunk3
  manifestB --> chunk4
```

### File Layout

All data and metadata files are stored in a warehouse (typically an object store) using the following directory structure.

- `$ROOT` base URI (s3, gcs, file, etc.)
- `$ROOT/t/` state files
- `$ROOT/s/` structure files
- `$ROOT/a/` attribute files
- `$ROOT/m/` array chunk manifests
- `$ROOT/c/` array chunks

### State File

The **state file** records the current state and history of the store.
All commits occur by creating a new state file and updating the pointer in the catalog to this new state file.
The state file contains a list of active (non-expired) snapshots.
Each snapshot includes a pointer to the structure file for that snapshot.

The state file is a JSON file with the following JSON schema:

[TODO: convert to JSON schema]

| Name | Required | Type | Description |
|--|--|--|--|
| id | YES | str UID | A unique identifier for the store |
| generation | YES | int | An integer which must be incremented whenever the state file is updated |
| store_root | NO | str | A URI which points to the root location of the store in object storage. If blank, the store root is assumed to be in the same directory as the state file itself. |
| snapshots | YES | array[snapshot] | A list of all of the snapshots. |
| refs | NO | mapping[reference] | A mapping of references (string names) to snapshots |

A snapshot contains the following properties:

| Name | Required | Type | Description |
|--|--|--|--|
| snapshot-id | YES | str UID | Unique identifier for the snapshot |
| parent-snapshot-id | YES | null OR str UID | Parent snapshot (null for no parent) |
| timestamp-ms | YES | int | When the snapshot was committed |
| structure-file | YES | str | Name of the structure file for this snapshot |
| properties | NO | object | Arbitrary user-defined attributes to associate with this snapshot |

References are a mapping of string names to snapshots:

| Name | Required | Type | Description |
|--|--|--|--|
| name | YES | str | Name of the reference |
| snapshot-id | YES | str UID | The snapshot the reference points to |
| type | YES | "tag" / "branch" | Whether the reference is a tag or a branch |

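For illustration, a state file following the tables above might look like the following. All identifiers and values here are invented, and the shaping of `refs` as an object keyed by reference name is one plausible reading of the mapping described above.

```json
{
  "id": "store-8f2c9e4a",
  "generation": 2,
  "snapshots": [
    {
      "snapshot-id": "snap-001",
      "parent-snapshot-id": null,
      "timestamp-ms": 1724956800000,
      "structure-file": "structure-001.parquet",
      "properties": {"author": "alice"}
    },
    {
      "snapshot-id": "snap-002",
      "parent-snapshot-id": "snap-001",
      "timestamp-ms": 1724960400000,
      "structure-file": "structure-002.parquet"
    }
  ],
  "refs": {
    "main": {"snapshot-id": "snap-002", "type": "branch"},
    "v1.0": {"snapshot-id": "snap-001", "type": "tag"}
  }
}
```
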
### Structure Files

The structure file fully describes the schema of the store, including all arrays and groups.

The structure file is a Parquet file.
Each row of the file represents an individual node (array or group) of the Zarr store.

The structure file has the following Arrow schema:

```
id: uint32 not null
  -- field metadata --
  description: 'unique identifier for the node'
type: string not null
  -- field metadata --
  description: 'array or group'
path: string not null
  -- field metadata --
  description: 'path to the node within the store'
array_metadata: struct<shape: list<item: uint16> not null, data_type: string not null, fill_value: binary, dimension_names: list<item: string>, chunk_grid: struct<name: string not null, configuration: struct<chunk_shape: list<item: uint16> not null> not null>, chunk_key_encoding: struct<name: string not null, configuration: struct<separator: string not null> not null>, codecs: list<item: struct<name: string not null, configuration: binary>>>
  child 0, shape: list<item: uint16> not null
      child 0, item: uint16
  child 1, data_type: string not null
  child 2, fill_value: binary
  child 3, dimension_names: list<item: string>
      child 0, item: string
  child 4, chunk_grid: struct<name: string not null, configuration: struct<chunk_shape: list<item: uint16> not null> not null>
      child 0, name: string not null
      child 1, configuration: struct<chunk_shape: list<item: uint16> not null> not null
          child 0, chunk_shape: list<item: uint16> not null
              child 0, item: uint16
  child 5, chunk_key_encoding: struct<name: string not null, configuration: struct<separator: string not null> not null>
      child 0, name: string not null
      child 1, configuration: struct<separator: string not null> not null
          child 0, separator: string not null
  child 6, codecs: list<item: struct<name: string not null, configuration: binary>>
      child 0, item: struct<name: string not null, configuration: binary>
          child 0, name: string not null
          child 1, configuration: binary
  -- field metadata --
  description: 'All the Zarr array metadata'
inline_attrs: binary
  -- field metadata --
  description: 'user-defined attributes, stored inline with this entry'
attrs_reference: struct<attrs_file: string not null, row: uint16 not null, flags: uint16>
  child 0, attrs_file: string not null
  child 1, row: uint16 not null
  child 2, flags: uint16
  -- field metadata --
  description: 'user-defined attributes, stored in a separate attributes file'
manifests: list<item: struct<manifest_id: uint16 not null, row: uint16 not null, extent: list<item: fixed_size_list<item: uint16>[2]> not null, flags: uint16>>
  child 0, item: struct<manifest_id: uint16 not null, row: uint16 not null, extent: list<item: fixed_size_list<item: uint16>[2]> not null, flags: uint16>
      child 0, manifest_id: uint16 not null
      child 1, row: uint16 not null
      child 2, extent: list<item: fixed_size_list<item: uint16>[2]> not null
          child 0, item: fixed_size_list<item: uint16>[2]
              child 0, item: uint16
      child 3, flags: uint16
```

### Attributes Files

Attribute files hold user-defined attributes separately from the structure file.

### Chunk Manifest Files

A chunk manifest file stores chunk references.
Chunk references from multiple arrays can be stored in the same chunk manifest.
The chunks from a single array can also be spread across multiple manifests.

Chunk manifest files are Parquet files.
They have the following Arrow schema:

```
array_id: uint32 not null
coord: binary not null
inline_data: binary
chunk_id: binary
virtual_path: string
offset: uint64
length: uint32 not null
```

- **array_id** - ID of the array this chunk belongs to
- **coord** - position of the chunk within the array. See _chunk coord encoding_ for more detail
- **inline_data** - chunk data stored inline within the manifest itself
- **chunk_id** - ID of the chunk, used to locate the chunk file
- **virtual_path** - path to an external file in which the chunk resides (for virtual chunk references)
- **offset** - offset of the chunk within its file, in bytes
- **length** - size of the chunk, in bytes

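One way a client might resolve a single manifest reference to chunk bytes is sketched below. This is a non-normative sketch: the precedence of `inline_data` over file-based references, and the fallback from `virtual_path` to `chunk_id`, are assumptions about the fields above, and the function name is invented.

```python
import io

def read_chunk(ref: dict, open_file) -> bytes:
    """Resolve one manifest reference (a dict of the fields above) to bytes.

    `open_file` maps a file name to a seekable binary file object.
    """
    # Assumption: inline data, if present, takes precedence over file references
    if ref.get("inline_data") is not None:
        return ref["inline_data"]
    # Assumption: virtual references point into external files (e.g. HDF5/NetCDF);
    # otherwise chunk_id identifies a native chunk file
    name = ref.get("virtual_path") or ref["chunk_id"]
    f = open_file(name)
    f.seek(ref["offset"])
    return f.read(ref["length"])
```
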
#### Chunk Coord Encoding

Chunk coords are tuples of non-negative ints (e.g. `(5, 30, 10)`).

In normal Zarr, chunk keys are encoded as strings (e.g. `5.30.10`).
We want an encoding that is:

- efficient (minimal storage size)
- sortable
- usable as a predicate in Arrow

The first two requirements rule out string encoding.
The latter requirement rules out structs or lists.

So we opt for a variable-length binary encoding.
The chunk coord is created by encoding each element of the tuple as a big-endian `uint16` and then simply concatenating the bytes together in order. For the common case of arrays <= 4 dimensions, this would use 8 bytes or less per chunk coord.

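The encoding above can be sketched in a few lines of Python (function names are illustrative; `uint16` elements are assumed, per the paragraph above):

```python
import struct

def encode_chunk_coord(coord: tuple) -> bytes:
    """Encode each element as a big-endian uint16 and concatenate in order."""
    return b"".join(struct.pack(">H", c) for c in coord)

def decode_chunk_coord(data: bytes) -> tuple:
    """Split the buffer into 2-byte big-endian uint16 values."""
    return tuple(struct.unpack(">H", data[i:i + 2])[0]
                 for i in range(0, len(data), 2))
```

Because the elements are fixed-width and big-endian, lexicographic ordering of the encoded bytes matches ordering of the original tuples, which is what makes the encoding sortable.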
### Chunk Files

Chunk files contain the compressed binary chunks of a Zarr array.
Icechunk permits quite a bit of flexibility about how chunks are stored.
Chunk files can be:

- One chunk per chunk file (i.e. standard Zarr)
- Multiple contiguous chunks from the same array in a single chunk file (similar to Zarr V3 shards)
- Chunks from multiple different arrays in the same file
- Other file types (e.g. NetCDF, HDF5) which contain Zarr-compatible chunks

Applications may choose to arrange chunks within files in different ways to optimize I/O patterns.

## Catalog

An Icechunk _catalog_ is a database for keeping track of pointers to state files for Icechunk Stores.
This specification is limited to the Store itself, and does not specify in detail all of the possible features or capabilities of a catalog.

A catalog MUST support the following basic logical interface (here defined in Python pseudocode):

```python
def get_store_statefile_location(store_identifier) -> URI:
    """Get the location of a store state file."""
    ...

def set_store_statefile_location(store_identifier, statefile_location,
                                 previous_statefile_location) -> None:
    """Set the location of a store state file.
    Should fail if the client's previous_statefile_location
    is not consistent with the catalog."""
    ...

def delete_store(store_identifier) -> None:
    """Remove a store from the catalog."""
    ...
```

A catalog MAY also store the state metadata directly within its database, eliminating the need for an additional request to fetch the state file.
This does not remove the need for the state file to be stored in the warehouse.

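To make the required compare-and-swap behavior concrete, here is a minimal in-memory sketch of the catalog interface. This is illustrative only: a real catalog would be backed by a transactional database, and the class name and error type are invented.

```python
import threading

class InMemoryCatalog:
    """Toy catalog: maps store identifiers to state file locations,
    with compare-and-swap semantics on update."""

    def __init__(self):
        self._lock = threading.Lock()
        self._stores = {}  # store_identifier -> statefile location (URI string)

    def get_store_statefile_location(self, store_identifier):
        return self._stores[store_identifier]

    def set_store_statefile_location(self, store_identifier,
                                     statefile_location,
                                     previous_statefile_location):
        # The update succeeds only if the caller's view of the previous
        # location matches the catalog's; this is what makes commits atomic.
        with self._lock:
            current = self._stores.get(store_identifier)
            if current != previous_statefile_location:
                raise RuntimeError("conflict: state file was updated concurrently")
            self._stores[store_identifier] = statefile_location

    def delete_store(self, store_identifier):
        with self._lock:
            del self._stores[store_identifier]
```

A writer that loses the race (its `previous_statefile_location` is stale) must re-read the new state file, rebase its changes, and retry the swap.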
## Algorithms

### Initialize New Store

### Write Snapshot

### Read Snapshot

### Expire Snapshots

## Appendices

### Comparison with Iceberg

Like Iceberg, Icechunk uses a series of linked metadata files to describe the state of the store.
But while Iceberg describes a table, the Icechunk store is a Zarr store (a hierarchical structure of Arrays and Groups).

| Iceberg Entity | Icechunk Entity | Comment |
|--|--|--|
| Table | Store | The fundamental entity described by the spec |
| Column | Array | The logical container for a homogeneous collection of values |
| Metadata File | State File | The highest-level entry point into the dataset |
| Snapshot | Snapshot | A single committed snapshot of the dataset |
| Catalog | Catalog | A central place to track changes to one or more state files |