rename dataset -> store again
rabernat committed Aug 8, 2024
1 parent 14a4e66 commit 2e998a1
Showing 1 changed file with 9 additions and 9 deletions.
18 changes: 9 additions & 9 deletions spec/icechunk_spec.md
@@ -3,10 +3,10 @@
The Icechunk specification is a storage specification for [Zarr](https://zarr-specs.readthedocs.io/en/latest/specs.html) data.
Icechunk is inspired by Apache Iceberg and borrows many concepts and ideas from the [Iceberg Spec](https://iceberg.apache.org/spec/#version-2-row-level-deletes).

-This specification describes a single Icechunk **dataset**.
-A dataset is defined as a Zarr store containing one or more interrelated Arrays and Groups which must be updated consistently.
-The most common scenario is for a dataset to contain a single Zarr group with multiple arrays, each corresponding to different physical variables but sharing common spatiotemporal coordinates.
-However, formally a dataset can be any valid Zarr hierarchy, from a single Array to a deeply nested structure of Groups and Arrays.
+This specification describes a single Icechunk **store**.
+A store is defined as a Zarr store containing one or more interrelated Arrays and Groups which must be updated consistently.
+The most common scenario is for a store to contain a single Zarr group with multiple arrays, each corresponding to different physical variables but sharing common spatiotemporal coordinates.
+However, formally a store can be any valid Zarr hierarchy, from a single Array to a deeply nested structure of Groups and Arrays.

## Goals

@@ -39,17 +39,17 @@ Stores do not require random-access writes. Once written, chunk and metadata fil

Icechunk uses a series of linked metadata files to describe the state of the store.

-- The **state file** is the entry point to the dataset. It stores a record of snapshots, each of which is a pointer to a single structure file.
-- The **structure file** records all of the different arrays and groups in the dataset, plus their metadata. Every new commit creates a new structure file. The structure file contains pointers to one or more chunk manifests files and [optionally] attribute files.
+- The **state file** is the entry point to the store. It stores a record of snapshots, each of which is a pointer to a single structure file.
+- The **structure file** records all of the different arrays and groups in the store, plus their metadata. Every new commit creates a new structure file. The structure file contains pointers to one or more chunk manifests files and [optionally] attribute files.
- **Chunk Manifests** store references to individual chunks. A single manifest may store references for multiple arrays or a subset of all the references for a single array.
- **Attributes files** provide a way to store additional user-defined attributes for arrays and groups outside of the structure file. This is important when the attributes are very large.
- **Chunk files** store the actual compressed chunk data, potentially containing data for multiple chunks in a single file.
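The chain of pointers described in the list above can be sketched as plain Python data classes. This is only an illustration of how the files link to one another, not the on-disk format: every class and field name here (`ChunkRef`, `manifest_ids`, `offset`, `length`, and so on) is hypothetical, and the byte-range fields are an assumption based on a chunk file possibly holding several chunks.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkRef:
    """Reference to one chunk inside a chunk file (hypothetical layout)."""
    chunk_file: str  # key of the chunk file holding the bytes
    offset: int      # assumed byte offset, since one file may hold many chunks
    length: int      # assumed byte length of the chunk


@dataclass
class ChunkManifest:
    """Maps (array path, chunk index) to a chunk reference."""
    refs: dict = field(default_factory=dict)


@dataclass
class StructureFile:
    """Arrays and groups of one snapshot, plus pointers to other files."""
    nodes: dict                 # node path -> Zarr metadata for that array or group
    manifest_ids: list          # one or more chunk manifests
    attribute_file_ids: list = field(default_factory=list)  # optional


@dataclass
class StateFile:
    """Entry point: each snapshot points to exactly one structure file."""
    snapshots: dict = field(default_factory=dict)  # snapshot id -> structure file id


# Wire up a tiny store with one array and one chunk.
manifest = ChunkManifest()
manifest.refs[("/temperature", (0, 0))] = ChunkRef("chunks/abc", offset=0, length=128)
structure = StructureFile(
    nodes={"/temperature": {"node_type": "array"}},
    manifest_ids=["manifest-1"],
)
state = StateFile(snapshots={"snap-1": "structure-1"})
```

Following a chunk from the entry point then means going from `state.snapshots` to a structure file, from its `manifest_ids` to a manifest, and from the manifest's `refs` to a byte range in a chunk file.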

-When reading a dataset, the client first opens the state file and chooses a structure file corresponding to a specific snapshot to open.
-The client then reads the structure file to determine the structure and hierarchy of the dataset.
+When reading a store, the client first opens the state file and chooses a structure file corresponding to a specific snapshot to open.
+The client then reads the structure file to determine the structure and hierarchy of the store.

When fetching data from an array, the client first examines the chunk manifest file[s] for that array and finally fetches the chunks referenced therein.
-When writing a new dataset snapshot, the client first writes a new set of chunks and chunk manifests, and then generates a new structure file. Finally, in an atomic swap operation, it replaces the state file with a new state file recording the presence of the new snapshot.
+When writing a new store snapshot, the client first writes a new set of chunks and chunk manifests, and then generates a new structure file. Finally, in an atomic swap operation, it replaces the state file with a new state file recording the presence of the new snapshot.
Ensuring atomicity of the swap operation is the responsibility of the [catalog](#catalog).
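The read and write paths above can be sketched with an in-memory toy. This is a minimal illustration under stated assumptions, not the Icechunk implementation: `InMemoryCatalog`, `write_snapshot`, `read_chunk`, the key naming, and the version-counter compare-and-swap are all hypothetical, and a plain `dict` stands in for object storage.

```python
import copy
import uuid


class InMemoryCatalog:
    """Toy catalog (hypothetical): holds the state file and swaps it atomically."""

    def __init__(self):
        self.state = {"snapshots": []}  # each entry points to one structure file
        self.version = 0

    def load(self):
        # Return the current version together with a private copy of the state.
        return self.version, copy.deepcopy(self.state)

    def compare_and_swap(self, expected_version, new_state):
        # Replace the state file only if nobody else committed in between.
        if self.version != expected_version:
            return False
        self.state = new_state
        self.version += 1
        return True


def write_snapshot(objects, catalog, array_path, chunks):
    """Write chunks, then a manifest, then a structure file, then swap the state."""
    manifest = {}
    for index, data in chunks.items():
        chunk_key = f"chunks/{uuid.uuid4().hex}"
        objects[chunk_key] = data  # write-once: keys are never overwritten
        manifest[index] = chunk_key
    manifest_key = f"manifests/{uuid.uuid4().hex}"
    objects[manifest_key] = {array_path: manifest}

    structure_key = f"structures/{uuid.uuid4().hex}"
    objects[structure_key] = {"arrays": {array_path: {"manifests": [manifest_key]}}}

    version, state = catalog.load()
    state["snapshots"].append(structure_key)
    if not catalog.compare_and_swap(version, state):
        raise RuntimeError("state file changed concurrently; retry the commit")
    return structure_key


def read_chunk(objects, catalog, array_path, index, snapshot=-1):
    """State file -> structure file -> chunk manifest(s) -> chunk."""
    _, state = catalog.load()
    structure = objects[state["snapshots"][snapshot]]
    for manifest_key in structure["arrays"][array_path]["manifests"]:
        chunk_key = objects[manifest_key].get(array_path, {}).get(index)
        if chunk_key is not None:
            return objects[chunk_key]
    raise KeyError((array_path, index))


objects = {}
catalog = InMemoryCatalog()
write_snapshot(objects, catalog, "temperature", {(0, 0): b"abc"})
data = read_chunk(objects, catalog, "temperature", (0, 0))
```

Because chunks, manifests, and structure files are written under fresh keys before the swap, a failed `compare_and_swap` leaves the previously committed snapshot untouched; readers only ever see state that was fully written.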

