From 86730387869af8a5bb7e0d29ad89ca5b2361783a Mon Sep 17 00:00:00 2001 From: Ryan Abernathey Date: Sun, 4 Aug 2024 05:13:34 -0400 Subject: [PATCH 01/11] add spec doc --- spec/icechunk_spec.md | 271 ++++++++++++++++++++++++++++++++++++++++++ spec/schema.py | 110 +++++++++++++++++ 2 files changed, 381 insertions(+) create mode 100644 spec/icechunk_spec.md create mode 100644 spec/schema.py diff --git a/spec/icechunk_spec.md b/spec/icechunk_spec.md new file mode 100644 index 00000000..b599c76b --- /dev/null +++ b/spec/icechunk_spec.md @@ -0,0 +1,271 @@ +# Icechunk Specification + +The Icechunk specification is a storage specification for [Zarr](https://zarr-specs.readthedocs.io/en/latest/specs.html) data. +Icechunk is inspired by Apache Iceberg and borrows many concepts and ideas from the [Iceberg Spec](https://iceberg.apache.org/spec/#version-2-row-level-deletes). + +This specification describes a single Icechunk **dataset**. +A dataset is defined as a Zarr store containing one or more interrelated Arrays and Groups which must be updated consistently. +The most common scenarios is for a dataset to contain a single Zarr group with multiple arrays, each corresponding to different physical variables but sharing common spatiotemporal coordinates. + +## Comparison with Iceberg + +| Iceberg Entity | Icechunk Entity | +|--|--| +| Table | Dataset | +| Column | Array | +| Catalog | State File | +| Snapshot | Snapshot | + +## Goals + +The goals of the specification are as follows: + +1. **Serializable isolation** - Reads will be isolated from concurrent writes and always use a committed snapshot of a dataset. Writes across multiple arrays and chunks will be commited via a single atomic operation and will not be partially visible. Readers will not acquire locks. +2. **Chunk sharding and references** - Chunk storage is decoupled from specific file names. Multiple chunks can be packed into a single object (sharding). Zarr-compatible chunks within other file formats (e.g. HDF5, NetCDF) can be referenced. + +[TODO: there must be more, but these seem like the big ones for now] + +### Filesytem Operations + +The required filesystem operations are identical to Iceberg. Icechunk only requires that file systems support the following operations: + +- **In-place write** - Files are not moved or altered once they are written. +- **Seekable reads** - Chunk file formats may require seek support (e.g. shards). +- **Deletes** - Datasets delete files that are no longer used (via a garbage-collection operation). + +These requirements are compatible with object stores, like S3. + +Datasets do not require random-access writes. Once written, chunk and metadata files are immutable until they are deleted. + +## Specification + +### Overview + +Like Iceberg, Icechunk uses a series of linked metadata files to describe the state of the dataset. + +- The **state file** is the entry point to the dataset. It stores a record of snapshots, each of which is a pointer to a single structure file. +- The **structure file** records all of the different arrays and groups in the dataset, plus their metadata. Every new commit creates a new structure file. The structure file contains pointers to one or more chunk manifests files and [optionally] attribute files. +- **Chunk Manifests** store references to individual chunks. +- **Attributes files** provide a way to store additional user-defined attributes for arrays and groups outside of the structure file. +- **Chunk files** store the actual compressed chunk data. 
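+
+To make these relationships concrete, the sketch below models the pointers between the five file types listed above as plain Python dataclasses.
+It is purely illustrative: the class and field names are hypothetical, and the normative file formats and schemas are defined in the sections that follow.
+
+```python
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+
+
+@dataclass
+class ChunkRef:
+    """One chunk manifest entry: where the bytes for a single chunk live."""
+    array_id: int
+    coord: bytes     # encoded chunk coordinates within the array
+    chunk_file: str  # chunk file (or other referenced file) holding the bytes
+    offset: int      # byte offset within that file
+    length: int      # size of the stored chunk in bytes
+
+
+@dataclass
+class ChunkManifest:
+    """Chunk references, possibly covering several arrays."""
+    chunk_refs: list[ChunkRef] = field(default_factory=list)
+
+
+@dataclass
+class StructureFile:
+    """One entry per array or group, plus pointers to manifest and attribute files."""
+    nodes: list[dict] = field(default_factory=list)
+    manifest_files: list[str] = field(default_factory=list)
+    attribute_files: list[str] = field(default_factory=list)
+
+
+@dataclass
+class Snapshot:
+    """A committed version of the dataset, pointing at one structure file."""
+    snapshot_id: str
+    structure_file: str
+
+
+@dataclass
+class StateFile:
+    """Entry point: the record of all snapshots of the dataset."""
+    dataset_id: str
+    generation: int
+    snapshots: list[Snapshot] = field(default_factory=list)
+```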
+ +When reading a dataset, the client first open the state file and chooses a specific snapshot to open. +The client then reads the structure file to determine the structure and hierarchy of the dataset. +When fetching data from an array, the client first examines the chunk manifest file[s] for that array and finally fetches the chunks referenced therein. + +When writing a new dataset snapshot, the client first writes a new set of chunks and chunk manifests, and then generates a new structure file. Finally, in an atomic swap operation, it replaces the state file with a new state file recording the presence of the new snapshot . + + +```mermaid +flowchart TD + subgraph catalog + state[State File] + end + subgraph metadata + subgraph structure + structure1[Structure File 1] + structure2[Structure File 2] + end + subgraph attributes + attrs[Attribute File] + end + subgraph manifests + manifestA[Chunk Manifest A] + manifestB[Chunk Manifest B] + end + end + subgraph data + chunk1[Chunk File 1] + chunk2[Chunk File 2] + chunk3[Chunk File 3] + chunk4[Chunk File 4] + end + + state -- snapshot ID --> structure2 + structure1 --> attrs + structure1 --> manifestA + structure2 --> attrs + structure2 -->manifestA + structure2 -->manifestB + manifestA --> chunk1 + manifestA --> chunk2 + manifestB --> chunk3 + manifestB --> chunk4 + +``` + +### State File + +The **state file** records the current state of the dataset. +All transactions occur by updating or replacing the state file. +The state file contains, at minimum, a pointer to the latest structure file snapshot. + + +The state file is a JSON file. It contains the following required and optional fields. + +[TODO: convert to JSON schema] + +| Name | Required | Type | Description | +|--|--|--|--| +| id | YES | str UID | A unique identifier for the dataset | +| generation | YES | int | An integer which must be incremented whenever the state file is updated | +| store_root | NO | str | A URI which points to the root location of the store in object storage. If blank, the store root is assumed to be in the same directory as the state file itself. | +| snapshots | YES | array[snapshot] | A list of all of the snapshots. | +| refs | NO | mapping[reference] | A mapping of references to snapshots | + +A snapshot contains the following properties + +| Name | Required | Type | Description | +|--|--|--|--| +| snapshot-id | YES | str UID | Unique identifier for the snapshot | +| parent-snapshot-id | NO | str UID | Parent snapshot (null for no parent) | +| timestamp-ms | YES | int | When was snapshot commited | +| structure-file | YES | str | Name of the structure file for this snapshot | +| properties | NO | object | arbitrary user-defined attributes to associate with this snapshot | + +References are a mapping of string names to snapshots + + +| Name | Required | Type | Description | +|--|--|--|--| +| name | YES | str | Name of the reference| +| snapshot-id | YES | str UID | What snaphot does it point to | +| type | YES | "tag" / "branch" | Whether the reference is a tag or a branch | + +### File Layout + +The state file can be stored separately from the rest of the data or together with it. The rest of the data files in the dataset must be kept in a directory with the following structure. + +- `$ROOT` base URI (s3, gcs, file, etc.) 
+- `$ROOT/state.json` (optional) state file +- `$ROOT/s/` for the structure files +- `$ROOT/a/` arrays and groups attribute information +- `$ROOT/i/` array chunk manifests (i for index or inventory) +- `$ROOT/c/` array chunks + +### Structure Files + +The structure file fully describes the schema of the dataset, including all arrays and groups. + +The structure file is a Parquet file. +Each row of the file represents an individual node (array or group) of the Zarr dataset. + +The structure file has the following Arrow schema: + +``` +id: uint16 not null + -- field metadata -- + description: 'unique identifier for the node' +type: string not null + -- field metadata -- + description: 'array or group' +path: string not null + -- field metadata -- + description: 'path to the node within the store' +array_metadata: struct not null, data_type: string not null, fill_value: binary, dimension_names: list, chunk_grid: struct not null> not null>, chunk_key_encoding: struct not null>, codecs: list>> + child 0, shape: list not null + child 0, item: uint16 + child 1, data_type: string not null + child 2, fill_value: binary + child 3, dimension_names: list + child 0, item: string + child 4, chunk_grid: struct not null> not null> + child 0, name: string not null + child 1, configuration: struct not null> not null + child 0, chunk_shape: list not null + child 0, item: uint16 + child 5, chunk_key_encoding: struct not null> + child 0, name: string not null + child 1, configuration: struct not null + child 0, separator: string not null + child 6, codecs: list> + child 0, item: struct + child 0, name: string not null + child 1, configuration: binary + -- field metadata -- + description: 'All the Zarr array metadata' +inline_attrs: binary + -- field metadata -- + description: 'user-defined attributes, stored inline with this entry' +attrs_reference: struct + child 0, attrs_file: string not null + child 1, row: uint16 not null + child 2, flags: uint16 + -- field metadata -- + description: 'user-defined attributes, stored in a separate attributes ' + 4 +inventories: list[2]> not null, flags: uint16>> + child 0, item: struct[2]> not null, flags: uint16> + child 0, inventory_file: string not null + child 1, row: uint16 not null + child 2, extent: list[2]> not null + child 0, item: fixed_size_list[2] + child 0, item: uint16 + child 3, flags: uint16 +``` + +### Attributes Files + +[TODO: do we really need attributes files?] + +### Chunk Manifest Files + +A chunk manifest file stores chunk references. +Chunk references from multiple arrays can be stored in the same chunk manifest. +The chunks from a single array can also be spread across multiple manifests. + +Chunk manifest files are Parquet files. +They have the following arrow schema. + +``` +id: uint32 not null +array_id: uint32 not null +coord: binary not null +inline_data: binary +chunk_file: string +offset: uint64 +length: uint32 not null +``` + +- **id** - unique ID for the chunk. +- **array_id** - ID for the array this is part of +- **coord** - position of the chunk within the array. See _chunk coord encoding_ for more detail +- **chunk_file** - the name of the file in which the chunk resides +- **offset** - offset in bytes +- **length** - size in bytes + +#### Chunk Coord Encoding + +Chunk coords are tuples of positive ints (e.g. `(5, 30, 10)`). +In normal Zarr, chunk keys are encoded as strings (e.g. `5.30.10`). 
+We want an encoding is: +- efficient (minimal storage size) +- sortable +- useable as a predicate in Arrow + +The first two requirements rule out string encoding. +The latter requirement rules out structs or lists. + +So we opt for a variable length binary encoding. + +### Chunk Files + +Chunk files contain the compressed binary chunks of a Zarr array. +Icechunk permits quite a bit of flexibility about how chunks are stored. +Chunk files can be: + +- One chunk per chunk file (i.e. standard Zarr) +- Multiple contiguous chunks from the same array in a single chunk file (similar to Zarr V3 shards) +- Chunks from multiple different arrays in the same file +- Other file types (e.g. NetCDF, HDF5) which contain Zarr-compatible chunks + +Applications may choose to arrange chunks within files in different ways to optimize I/O patterns. + +## Algorithms + +### Initialize New Store + +### Write Snapshot + +### Read Snapshot + +### Expire Snapshots diff --git a/spec/schema.py b/spec/schema.py new file mode 100644 index 00000000..236f2b62 --- /dev/null +++ b/spec/schema.py @@ -0,0 +1,110 @@ +import pyarrow as pa + +structure_schema = pa.schema( + [ + pa.field("id", pa.uint16(), nullable=False, metadata={"description": "unique identifier for the node"}), + pa.field("type", pa.string(), nullable=False, metadata={"description": "array or group"}), + pa.field("path", pa.string(), nullable=False, metadata={"description": "path to the node within the store"}), + pa.field( + "array_metadata", + pa.struct( + [ + pa.field("shape", pa.list_(pa.uint16()), nullable=False), + pa.field("data_type", pa.string(), nullable=False), + pa.field("fill_value", pa.binary(), nullable=True), + pa.field("dimension_names", pa.list_(pa.string())), + pa.field( + "chunk_grid", + pa.struct( + [ + pa.field("name", pa.string(), nullable=False), + pa.field( + "configuration", + pa.struct( + [ + pa.field("chunk_shape", pa.list_(pa.uint16()), nullable=False), + ] + ), + nullable=False + ) + ] + ) + ), + pa.field( + "chunk_key_encoding", + pa.struct( + [ + pa.field("name", pa.string(), nullable=False), + pa.field( + "configuration", + pa.struct( + [ + pa.field("separator", pa.string(), nullable=False), + ] + ), + nullable=False + ) + ] + ) + ), + pa.field( + "codecs", + pa.list_( + pa.struct( + [ + pa.field("name", pa.string(), nullable=False), + pa.field("configuration", pa.binary(), nullable=True) + ] + ) + ) + ) + ] + ), + nullable=True, + metadata={"description": "All the Zarr array metadata"} + ), + pa.field("inline_attrs", pa.binary(), nullable=True, metadata={"description": "user-defined attributes, stored inline with this entry"}), + pa.field( + "attrs_reference", + pa.struct( + [ + pa.field("attrs_file", pa.string(), nullable=False), + pa.field("row", pa.uint16(), nullable=False), + pa.field("flags", pa.uint16(), nullable=True) + ] + ), + nullable=True, + metadata={"description": "user-defined attributes, stored in a separate attributes file"} + ), + pa.field( + "inventories", + pa.list_( + pa.struct( + [ + pa.field("inventory_file", pa.string(), nullable=False), + pa.field("row", pa.uint16(), nullable=False), + pa.field("extent", pa.list_(pa.list_(pa.uint16(), 2)), nullable=False), + pa.field("flags", pa.uint16(), nullable=True) + ] + ) + ), + nullable=True + ), + ] +) + +print(structure_schema) + +manifest_schema = pa.schema( + [ + pa.field("id", pa.uint32(), nullable=False), + pa.field("array_id", pa.uint32(), nullable=False), + pa.field("coord", pa.binary(), nullable=False), + pa.field("inline_data", pa.binary(), 
nullable=True), + pa.field("chunk_file", pa.string(), nullable=True), + pa.field("offset", pa.uint64(), nullable=True), + pa.field("length", pa.uint32(), nullable=False) + ] +) + +print(manifest_schema) \ No newline at end of file From 5e5393664150c2d1f663270ae3e7f8c2e12e7bc5 Mon Sep 17 00:00:00 2001 From: Ryan Abernathey Date: Sun, 4 Aug 2024 05:42:06 -0400 Subject: [PATCH 02/11] add part about chunk coord encoding --- spec/icechunk_spec.md | 1 + 1 file changed, 1 insertion(+) diff --git a/spec/icechunk_spec.md b/spec/icechunk_spec.md index b599c76b..a80c03f7 100644 --- a/spec/icechunk_spec.md +++ b/spec/icechunk_spec.md @@ -246,6 +246,7 @@ The first two requirements rule out string encoding. The latter requirement rules out structs or lists. So we opt for a variable length binary encoding. +The chunk coord is created by encoding each element of the tuple a big endian `uint16` and then simply concatenating the bytes together in order. For the common case of arrays <= 4 dimensions, this would use 8 bytes or less per chunk coord. ### Chunk Files From bba35093f168157049e8b84429504b589b6e3b09 Mon Sep 17 00:00:00 2001 From: Ryan Abernathey Date: Wed, 7 Aug 2024 12:22:15 -0400 Subject: [PATCH 03/11] Update spec/icechunk_spec.md --- spec/icechunk_spec.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/spec/icechunk_spec.md b/spec/icechunk_spec.md index a80c03f7..5b54fe5f 100644 --- a/spec/icechunk_spec.md +++ b/spec/icechunk_spec.md @@ -13,7 +13,7 @@ The most common scenarios is for a dataset to contain a single Zarr group with m |--|--| | Table | Dataset | | Column | Array | -| Catalog | State File | +| Metadata File | State File | | Snapshot | Snapshot | ## Goals From a77bb67e8b6fbcf1871c3957ef04a538617e8445 Mon Sep 17 00:00:00 2001 From: Ryan Abernathey Date: Wed, 7 Aug 2024 13:27:10 -0400 Subject: [PATCH 04/11] add catalog to comparison --- spec/icechunk_spec.md | 1 + 1 file changed, 1 insertion(+) diff --git a/spec/icechunk_spec.md b/spec/icechunk_spec.md index 5b54fe5f..6fc3c7d8 100644 --- a/spec/icechunk_spec.md +++ b/spec/icechunk_spec.md @@ -15,6 +15,7 @@ The most common scenarios is for a dataset to contain a single Zarr group with m | Column | Array | | Metadata File | State File | | Snapshot | Snapshot | +| Catalog | Catalog | ## Goals From e1690ebb16f080d0bb78577cacd80a183af24b4e Mon Sep 17 00:00:00 2001 From: Ryan Abernathey Date: Thu, 8 Aug 2024 10:56:30 -0400 Subject: [PATCH 05/11] updates from Seba's review --- spec/icechunk_spec.md | 50 +++++++++++++++++++++++-------------------- spec/schema.py | 6 +++--- 2 files changed, 30 insertions(+), 26 deletions(-) diff --git a/spec/icechunk_spec.md b/spec/icechunk_spec.md index 6fc3c7d8..b336fc11 100644 --- a/spec/icechunk_spec.md +++ b/spec/icechunk_spec.md @@ -6,29 +6,19 @@ Icechunk is inspired by Apache Iceberg and borrows many concepts and ideas from This specification describes a single Icechunk **dataset**. A dataset is defined as a Zarr store containing one or more interrelated Arrays and Groups which must be updated consistently. The most common scenarios is for a dataset to contain a single Zarr group with multiple arrays, each corresponding to different physical variables but sharing common spatiotemporal coordinates. 
- -## Comparison with Iceberg - -| Iceberg Entity | Icechunk Entity | -|--|--| -| Table | Dataset | -| Column | Array | -| Metadata File | State File | -| Snapshot | Snapshot | -| Catalog | Catalog | +However, formally a dataset can be any valid Zarr hierarchy, from a single Array to a deeply nested structure of Groups and Arrays. ## Goals The goals of the specification are as follows: -1. **Serializable isolation** - Reads will be isolated from concurrent writes and always use a committed snapshot of a dataset. Writes across multiple arrays and chunks will be commited via a single atomic operation and will not be partially visible. Readers will not acquire locks. +1. **Serializable isolation** - Reads will be isolated from concurrent writes and always use a committed snapshot of a dataset. Writes to arrays will be commited via a single atomic operation and will not be partially visible. Readers will not acquire locks. 2. **Chunk sharding and references** - Chunk storage is decoupled from specific file names. Multiple chunks can be packed into a single object (sharding). Zarr-compatible chunks within other file formats (e.g. HDF5, NetCDF) can be referenced. - -[TODO: there must be more, but these seem like the big ones for now] +3. **Time travel** - Previous snapshots of a dataset remain accessible after new ones have been written. Reverting to an early snapshot is trivial and inexpensive. ### Filesytem Operations -The required filesystem operations are identical to Iceberg. Icechunk only requires that file systems support the following operations: +Icechunk only requires that file systems support the following operations: - **In-place write** - Files are not moved or altered once they are written. - **Seekable reads** - Chunk file formats may require seek support (e.g. shards). @@ -47,14 +37,15 @@ Like Iceberg, Icechunk uses a series of linked metadata files to describe the st - The **state file** is the entry point to the dataset. It stores a record of snapshots, each of which is a pointer to a single structure file. - The **structure file** records all of the different arrays and groups in the dataset, plus their metadata. Every new commit creates a new structure file. The structure file contains pointers to one or more chunk manifests files and [optionally] attribute files. - **Chunk Manifests** store references to individual chunks. -- **Attributes files** provide a way to store additional user-defined attributes for arrays and groups outside of the structure file. +- **Attributes files** provide a way to store additional user-defined attributes for arrays and groups outside of the structure file. This is important when the attributes are very large. - **Chunk files** store the actual compressed chunk data. -When reading a dataset, the client first open the state file and chooses a specific snapshot to open. +When reading a dataset, the client first opens the state file and chooses a specific snapshot to open. The client then reads the structure file to determine the structure and hierarchy of the dataset. When fetching data from an array, the client first examines the chunk manifest file[s] for that array and finally fetches the chunks referenced therein. -When writing a new dataset snapshot, the client first writes a new set of chunks and chunk manifests, and then generates a new structure file. Finally, in an atomic swap operation, it replaces the state file with a new state file recording the presence of the new snapshot . 
+When writing a new dataset snapshot, the client first writes a new set of chunks and chunk manifests, and then generates a new structure file. Finally, in an atomic swap operation, it replaces the state file with a new state file recording the presence of the new snapshot. +Ensuring atomicity of the swap operation is the responsibility of the catalog. ```mermaid @@ -140,9 +131,9 @@ The state file can be stored separately from the rest of the data or together wi - `$ROOT` base URI (s3, gcs, file, etc.) - `$ROOT/state.json` (optional) state file - `$ROOT/s/` for the structure files -- `$ROOT/a/` arrays and groups attribute information -- `$ROOT/i/` array chunk manifests (i for index or inventory) -- `$ROOT/c/` array chunks +- `$ROOT/a/` for attribute files +- `$ROOT/m/` for array chunk manifests +- `$ROOT/c/` for array chunks ### Structure Files @@ -154,7 +145,7 @@ Each row of the file represents an individual node (array or group) of the Zarr The structure file has the following Arrow schema: ``` -id: uint16 not null +id: uint32 not null -- field metadata -- description: 'unique identifier for the node' type: string not null @@ -218,11 +209,11 @@ Chunk manifest files are Parquet files. They have the following arrow schema. ``` -id: uint32 not null array_id: uint32 not null coord: binary not null inline_data: binary -chunk_file: string +chunk_id: binary +virtual_path: string offset: uint64 length: uint32 not null ``` @@ -271,3 +262,16 @@ Applications may choose to arrange chunks within files in different ways to opti ### Read Snapshot ### Expire Snapshots + + +## Appendices + +### Comparison with Iceberg + +| Iceberg Entity | Icechunk Entity | +|--|--| +| Table | Dataset | +| Column | Array | +| Metadata File | State File | +| Snapshot | Snapshot | +| Catalog | Catalog | \ No newline at end of file diff --git a/spec/schema.py b/spec/schema.py index 236f2b62..3a81de63 100644 --- a/spec/schema.py +++ b/spec/schema.py @@ -2,7 +2,7 @@ structure_schema = pa.schema( [ - pa.field("id", pa.uint16(), nullable=False, metadata={"description": "unique identifier for the node"}), + pa.field("id", pa.uint32(), nullable=False, metadata={"description": "unique identifier for the node"}), pa.field("type", pa.string(), nullable=False, metadata={"description": "array or group"}), pa.field("path", pa.string(), nullable=False, metadata={"description": "path to the node within the store"}), pa.field( @@ -97,11 +97,11 @@ manifest_schema = pa.schema( [ - pa.field("id", pa.uint32(), nullable=False), pa.field("array_id", pa.uint32(), nullable=False), pa.field("coord", pa.binary(), nullable=False), pa.field("inline_data", pa.binary(), nullable=True), - pa.field("chunk_file", pa.string(), nullable=True), + pa.field("chunk_id", pa.binary(), nullable=True), + pa.field("virtual_path", pa.string(), nullable=True), pa.field("offset", pa.uint64(), nullable=True), pa.field("length", pa.uint32(), nullable=False) ] From 6a38427115b777735f2f83a3ace57854dacf5ab8 Mon Sep 17 00:00:00 2001 From: Ryan Abernathey Date: Thu, 8 Aug 2024 11:44:22 -0400 Subject: [PATCH 06/11] add catalog details --- spec/icechunk_spec.md | 60 ++++++++++++++++++++++++++++++++----------- spec/schema.py | 4 +-- 2 files changed, 47 insertions(+), 17 deletions(-) diff --git a/spec/icechunk_spec.md b/spec/icechunk_spec.md index b336fc11..51d88463 100644 --- a/spec/icechunk_spec.md +++ b/spec/icechunk_spec.md @@ -32,7 +32,7 @@ Datasets do not require random-access writes. 
Once written, chunk and metadata f ### Overview -Like Iceberg, Icechunk uses a series of linked metadata files to describe the state of the dataset. +Icechunk uses a series of linked metadata files to describe the state of the dataset. - The **state file** is the entry point to the dataset. It stores a record of snapshots, each of which is a pointer to a single structure file. - The **structure file** records all of the different arrays and groups in the dataset, plus their metadata. Every new commit creates a new structure file. The structure file contains pointers to one or more chunk manifests files and [optionally] attribute files. @@ -45,7 +45,7 @@ The client then reads the structure file to determine the structure and hierarch When fetching data from an array, the client first examines the chunk manifest file[s] for that array and finally fetches the chunks referenced therein. When writing a new dataset snapshot, the client first writes a new set of chunks and chunk manifests, and then generates a new structure file. Finally, in an atomic swap operation, it replaces the state file with a new state file recording the presence of the new snapshot. -Ensuring atomicity of the swap operation is the responsibility of the catalog. +Ensuring atomicity of the swap operation is the responsibility of the [catalog](#catalog). ```mermaid @@ -91,9 +91,10 @@ flowchart TD The **state file** records the current state of the dataset. All transactions occur by updating or replacing the state file. The state file contains, at minimum, a pointer to the latest structure file snapshot. +A state file doesn't actually have to be a file; responsibility for storing, retrieving, and updating a state file lies with the [catalog](#catalog), and different catalog implementations may do this in different ways. +Below we describe the state file as a JSON file, which is the most straightforward implementation. - -The state file is a JSON file. It contains the following required and optional fields. +The contents of the state file metadata must be compatible with the following JSON schema: [TODO: convert to JSON schema] @@ -185,9 +186,9 @@ attrs_reference: struct[2]> not null, flags: uint16>> - child 0, item: struct[2]> not null, flags: uint16> - child 0, inventory_file: string not null +manifests: list[2]> not null, flags: uint16>> + child 0, item: struct[2]> not null, flags: uint16> + child 0, manifest_id: uint16 not null child 1, row: uint16 not null child 2, extent: list[2]> not null child 0, item: fixed_size_list[2] @@ -197,7 +198,7 @@ inventories: list None + """Create a new dataset in the catalog""" + ... + +def load_dataset(dataset_identifier) -> StateMetadata: + """Retrieve the state metadata for a single dataset.""" + ... + +def commit_dataset(dataset_identifier, previous_generation: int, new_state: StateMetadata) -> None: + """Atomically update a dataset's statefile. + Should fail if another session has incremented the generation parameter.""" + ... + +def delete_dataset(dataset_identifier) -> None: + """Remove a dataset from the catalog.""" + ... 
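+
+# Illustrative, non-normative usage of the interface above: committing a new
+# snapshot with optimistic concurrency. `write_new_snapshot_files` and
+# `build_new_state` are hypothetical helpers standing in for the chunk,
+# manifest, and structure file writes described elsewhere in this spec, and
+# `CommitConflictError` is a hypothetical exception type.
+def commit_with_retry(dataset_identifier, max_attempts: int = 5) -> None:
+    for _ in range(max_attempts):
+        current = load_dataset(dataset_identifier)
+        snapshot = write_new_snapshot_files(current)
+        new_state = build_new_state(current, snapshot)
+        try:
+            # Fails if another session has already incremented the generation.
+            commit_dataset(dataset_identifier, current.generation, new_state)
+            return
+        except CommitConflictError:
+            continue  # another writer won the race; reload and retry
+    raise RuntimeError("could not commit after repeated conflicts")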
+``` + ## Algorithms ### Initialize New Store @@ -268,10 +295,13 @@ Applications may choose to arrange chunks within files in different ways to opti ### Comparison with Iceberg -| Iceberg Entity | Icechunk Entity | -|--|--| -| Table | Dataset | -| Column | Array | -| Metadata File | State File | -| Snapshot | Snapshot | -| Catalog | Catalog | \ No newline at end of file +Like Iceberg, Icechunk uses a series of linked metadata files to describe the state of the dataset. +But while Iceberg describes a table, the Icechunk dataset is a Zarr store (hierarchical structure of Arrays and Groups.) + +| Iceberg Entity | Icechunk Entity | Comment | +|--|--|--| +| Table | Dataset | The fundamental entity described by the spec | +| Column | Array | The logical container for a homogenous collection of values | +| Metadata File | State File | The highest-level entry point into the dataset | +| Snapshot | Snapshot | A single committed snapshot of the dataset | +| Catalog | Catalog | A central place to track changes to one or more state files | \ No newline at end of file diff --git a/spec/schema.py b/spec/schema.py index 3a81de63..f003c14d 100644 --- a/spec/schema.py +++ b/spec/schema.py @@ -77,11 +77,11 @@ metadata={"description": "user-defined attributes, stored in a separate attributes file"} ), pa.field( - "inventories", + "manifests", pa.list_( pa.struct( [ - pa.field("inventory_file", pa.string(), nullable=False), + pa.field("manifest_id", pa.uint16(), nullable=False), pa.field("row", pa.uint16(), nullable=False), pa.field("extent", pa.list_(pa.list_(pa.uint16(), 2)), nullable=False), pa.field("flags", pa.uint16(), nullable=True) From c6640b314210f50cb77838700f5dfc535a5a15f6 Mon Sep 17 00:00:00 2001 From: Ryan Abernathey Date: Thu, 8 Aug 2024 16:58:28 -0400 Subject: [PATCH 07/11] replace Dataset with Store; add goals and non goals --- spec/icechunk_spec.md | 69 +++++++++++++++++++++++-------------------- 1 file changed, 37 insertions(+), 32 deletions(-) diff --git a/spec/icechunk_spec.md b/spec/icechunk_spec.md index 51d88463..0015c647 100644 --- a/spec/icechunk_spec.md +++ b/spec/icechunk_spec.md @@ -3,18 +3,23 @@ The Icechunk specification is a storage specification for [Zarr](https://zarr-specs.readthedocs.io/en/latest/specs.html) data. Icechunk is inspired by Apache Iceberg and borrows many concepts and ideas from the [Iceberg Spec](https://iceberg.apache.org/spec/#version-2-row-level-deletes). -This specification describes a single Icechunk **dataset**. -A dataset is defined as a Zarr store containing one or more interrelated Arrays and Groups which must be updated consistently. -The most common scenarios is for a dataset to contain a single Zarr group with multiple arrays, each corresponding to different physical variables but sharing common spatiotemporal coordinates. -However, formally a dataset can be any valid Zarr hierarchy, from a single Array to a deeply nested structure of Groups and Arrays. +This specification describes a single Icechunk **store**. +A store is a Zarr store containing one or more interrelated Arrays and Groups, which must be updated consistently. +The most common scenarios is for a store to contain a single Zarr group with multiple arrays, each corresponding to different physical variables but sharing common spatiotemporal coordinates. +However, formally a store can be any valid Zarr hierarchy, from a single Array to a deeply nested structure of Groups and Arrays. ## Goals The goals of the specification are as follows: -1. 
**Serializable isolation** - Reads will be isolated from concurrent writes and always use a committed snapshot of a dataset. Writes to arrays will be commited via a single atomic operation and will not be partially visible. Readers will not acquire locks. +1. **Serializable isolation** - Reads will be isolated from concurrent writes and always use a committed snapshot of a store. Writes to arrays will be commited via a single atomic operation and will not be partially visible. Readers will not acquire locks. 2. **Chunk sharding and references** - Chunk storage is decoupled from specific file names. Multiple chunks can be packed into a single object (sharding). Zarr-compatible chunks within other file formats (e.g. HDF5, NetCDF) can be referenced. -3. **Time travel** - Previous snapshots of a dataset remain accessible after new ones have been written. Reverting to an early snapshot is trivial and inexpensive. +3. **Time travel** - Previous snapshots of a store remain accessible after new ones have been written. Reverting to an early snapshot is trivial and inexpensive. +4. **Schema Evolution** - Arrays and Groups can be added, renamed, and removed from the hierarchy with minimal overhead. + +### Non Goals + +1. **Low Latency** - Icechunk is designed to support analytical workloads for large stores. We accept that the extra layers of metadata files and indirection will introduce additional cold-start latency compared to regular Zarr. ### Filesytem Operations @@ -22,29 +27,29 @@ Icechunk only requires that file systems support the following operations: - **In-place write** - Files are not moved or altered once they are written. - **Seekable reads** - Chunk file formats may require seek support (e.g. shards). -- **Deletes** - Datasets delete files that are no longer used (via a garbage-collection operation). +- **Deletes** - Stores delete files that are no longer used (via a garbage-collection operation). These requirements are compatible with object stores, like S3. -Datasets do not require random-access writes. Once written, chunk and metadata files are immutable until they are deleted. +Stores do not require random-access writes. Once written, chunk and metadata files are immutable until they are deleted. ## Specification ### Overview -Icechunk uses a series of linked metadata files to describe the state of the dataset. +Icechunk uses a series of linked metadata files to describe the state of the store. -- The **state file** is the entry point to the dataset. It stores a record of snapshots, each of which is a pointer to a single structure file. -- The **structure file** records all of the different arrays and groups in the dataset, plus their metadata. Every new commit creates a new structure file. The structure file contains pointers to one or more chunk manifests files and [optionally] attribute files. +- The **state file** is the entry point to the store. It stores a record of snapshots, each of which is a pointer to a single structure file. +- The **structure file** records all of the different arrays and groups in the store, plus their metadata. Every new commit creates a new structure file. The structure file contains pointers to one or more chunk manifests files and [optionally] attribute files. - **Chunk Manifests** store references to individual chunks. - **Attributes files** provide a way to store additional user-defined attributes for arrays and groups outside of the structure file. This is important when the attributes are very large. 
- **Chunk files** store the actual compressed chunk data. -When reading a dataset, the client first opens the state file and chooses a specific snapshot to open. -The client then reads the structure file to determine the structure and hierarchy of the dataset. +When reading a store, the client first opens the state file and chooses a specific snapshot to open. +The client then reads the structure file to determine the structure and hierarchy of the store. When fetching data from an array, the client first examines the chunk manifest file[s] for that array and finally fetches the chunks referenced therein. -When writing a new dataset snapshot, the client first writes a new set of chunks and chunk manifests, and then generates a new structure file. Finally, in an atomic swap operation, it replaces the state file with a new state file recording the presence of the new snapshot. +When writing a new store snapshot, the client first writes a new set of chunks and chunk manifests, and then generates a new structure file. Finally, in an atomic swap operation, it replaces the state file with a new state file recording the presence of the new snapshot. Ensuring atomicity of the swap operation is the responsibility of the [catalog](#catalog). @@ -88,7 +93,7 @@ flowchart TD ### State File -The **state file** records the current state of the dataset. +The **state file** records the current state of the store. All transactions occur by updating or replacing the state file. The state file contains, at minimum, a pointer to the latest structure file snapshot. A state file doesn't actually have to be a file; responsibility for storing, retrieving, and updating a state file lies with the [catalog](#catalog), and different catalog implementations may do this in different ways. @@ -100,7 +105,7 @@ The contents of the state file metadata must be compatible with the following JS | Name | Required | Type | Description | |--|--|--|--| -| id | YES | str UID | A unique identifier for the dataset | +| id | YES | str UID | A unique identifier for the store | | generation | YES | int | An integer which must be incremented whenever the state file is updated | | store_root | NO | str | A URI which points to the root location of the store in object storage. If blank, the store root is assumed to be in the same directory as the state file itself. | | snapshots | YES | array[snapshot] | A list of all of the snapshots. | @@ -127,7 +132,7 @@ References are a mapping of string names to snapshots ### File Layout -The state file can be stored separately from the rest of the data or together with it. The rest of the data files in the dataset must be kept in a directory with the following structure. +The state file can be stored separately from the rest of the data or together with it. The rest of the data files in the store must be kept in a directory with the following structure. - `$ROOT` base URI (s3, gcs, file, etc.) - `$ROOT/state.json` (optional) state file @@ -138,10 +143,10 @@ The state file can be stored separately from the rest of the data or together wi ### Structure Files -The structure file fully describes the schema of the dataset, including all arrays and groups. +The structure file fully describes the schema of the store, including all arrays and groups. The structure file is a Parquet file. -Each row of the file represents an individual node (array or group) of the Zarr dataset. +Each row of the file represents an individual node (array or group) of the Zarr store. 
The structure file has the following Arrow schema: @@ -256,27 +261,27 @@ Applications may choose to arrange chunks within files in different ways to opti ## Catalog -An Icechunk _catalog_ is a database for keeping track of one or more state files for Icechunk Datasets. -This specification is limited to the Dataset itself, and does not specify in detail all of the possible features or capabilities of a catalog. +An Icechunk _catalog_ is a database for keeping track of one or more state files for Icechunk Stores. +This specification is limited to the Store itself, and does not specify in detail all of the possible features or capabilities of a catalog. A catalog must support the following basic logical interface (here defined in Python pseudocode): ```python -def create_dataset(dataset_identifier, initial_state: StateMetadata) -> None - """Create a new dataset in the catalog""" +def create_store(store_identifier, initial_state: StateMetadata) -> None + """Create a new store in the catalog""" ... -def load_dataset(dataset_identifier) -> StateMetadata: - """Retrieve the state metadata for a single dataset.""" +def load_store(store_identifier) -> StateMetadata: + """Retrieve the state metadata for a single store.""" ... -def commit_dataset(dataset_identifier, previous_generation: int, new_state: StateMetadata) -> None: - """Atomically update a dataset's statefile. +def commit_store(store_identifier, previous_generation: int, new_state: StateMetadata) -> None: + """Atomically update a store's statefile. Should fail if another session has incremented the generation parameter.""" ... -def delete_dataset(dataset_identifier) -> None: - """Remove a dataset from the catalog.""" +def delete_store(store_identifier) -> None: + """Remove a store from the catalog.""" ... ``` @@ -295,12 +300,12 @@ def delete_dataset(dataset_identifier) -> None: ### Comparison with Iceberg -Like Iceberg, Icechunk uses a series of linked metadata files to describe the state of the dataset. -But while Iceberg describes a table, the Icechunk dataset is a Zarr store (hierarchical structure of Arrays and Groups.) +Like Iceberg, Icechunk uses a series of linked metadata files to describe the state of the store. +But while Iceberg describes a table, the Icechunk store is a Zarr store (hierarchical structure of Arrays and Groups.) | Iceberg Entity | Icechunk Entity | Comment | |--|--|--| -| Table | Dataset | The fundamental entity described by the spec | +| Table | Store | The fundamental entity described by the spec | | Column | Array | The logical container for a homogenous collection of values | | Metadata File | State File | The highest-level entry point into the dataset | | Snapshot | Snapshot | A single committed snapshot of the dataset | From 03a329fea4a10d28726274b5c0b5abff74df8d2d Mon Sep 17 00:00:00 2001 From: Ryan Abernathey Date: Thu, 8 Aug 2024 17:04:20 -0400 Subject: [PATCH 08/11] more review comments --- spec/icechunk_spec.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/spec/icechunk_spec.md b/spec/icechunk_spec.md index 0015c647..4c285e10 100644 --- a/spec/icechunk_spec.md +++ b/spec/icechunk_spec.md @@ -29,7 +29,7 @@ Icechunk only requires that file systems support the following operations: - **Seekable reads** - Chunk file formats may require seek support (e.g. shards). - **Deletes** - Stores delete files that are no longer used (via a garbage-collection operation). -These requirements are compatible with object stores, like S3. 
+These requirements are compatible with object stores, like S3, as well as with filesystems. Stores do not require random-access writes. Once written, chunk and metadata files are immutable until they are deleted. @@ -42,7 +42,7 @@ Icechunk uses a series of linked metadata files to describe the state of the sto - The **state file** is the entry point to the store. It stores a record of snapshots, each of which is a pointer to a single structure file. - The **structure file** records all of the different arrays and groups in the store, plus their metadata. Every new commit creates a new structure file. The structure file contains pointers to one or more chunk manifests files and [optionally] attribute files. - **Chunk Manifests** store references to individual chunks. -- **Attributes files** provide a way to store additional user-defined attributes for arrays and groups outside of the structure file. This is important when the attributes are very large. +- **Attributes files** provide a way to store additional user-defined attributes for arrays and groups outside of the structure file. This is important when the attributes are very large, to prevent the structure file from becoming too big. - **Chunk files** store the actual compressed chunk data. When reading a store, the client first opens the state file and chooses a specific snapshot to open. @@ -132,7 +132,7 @@ References are a mapping of string names to snapshots ### File Layout -The state file can be stored separately from the rest of the data or together with it. The rest of the data files in the store must be kept in a directory with the following structure. +The state file can be stored separately from the rest of the data or together with it. The rest of the data files must be kept in a directory with the following structure. - `$ROOT` base URI (s3, gcs, file, etc.) - `$ROOT/state.json` (optional) state file From ad737334f510a15a6ef2993e80ccb6e52776f682 Mon Sep 17 00:00:00 2001 From: Ryan Abernathey Date: Thu, 8 Aug 2024 17:04:26 -0400 Subject: [PATCH 09/11] Apply suggestions from code review Co-authored-by: Deepak Cherian --- spec/icechunk_spec.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/spec/icechunk_spec.md b/spec/icechunk_spec.md index 51d88463..f0ba55a5 100644 --- a/spec/icechunk_spec.md +++ b/spec/icechunk_spec.md @@ -5,7 +5,7 @@ Icechunk is inspired by Apache Iceberg and borrows many concepts and ideas from This specification describes a single Icechunk **dataset**. A dataset is defined as a Zarr store containing one or more interrelated Arrays and Groups which must be updated consistently. -The most common scenarios is for a dataset to contain a single Zarr group with multiple arrays, each corresponding to different physical variables but sharing common spatiotemporal coordinates. +The most common scenario is for a dataset to contain a single Zarr group with multiple arrays, each corresponding to different physical variables but sharing common spatiotemporal coordinates. However, formally a dataset can be any valid Zarr hierarchy, from a single Array to a deeply nested structure of Groups and Arrays. ## Goals @@ -36,14 +36,14 @@ Icechunk uses a series of linked metadata files to describe the state of the dat - The **state file** is the entry point to the dataset. It stores a record of snapshots, each of which is a pointer to a single structure file. - The **structure file** records all of the different arrays and groups in the dataset, plus their metadata. 
Every new commit creates a new structure file. The structure file contains pointers to one or more chunk manifests files and [optionally] attribute files. -- **Chunk Manifests** store references to individual chunks. +- **Chunk Manifests** store references to individual chunks. A single manifest may store references for multiple arrays or a subset of all the references for a single array. - **Attributes files** provide a way to store additional user-defined attributes for arrays and groups outside of the structure file. This is important when the attributes are very large. -- **Chunk files** store the actual compressed chunk data. +- **Chunk files** store the actual compressed chunk data, potentially containing data for multiple chunks in a single file. -When reading a dataset, the client first opens the state file and chooses a specific snapshot to open. +When reading a dataset, the client first opens the state file and chooses a structure file corresponding to a specific snapshot to open. The client then reads the structure file to determine the structure and hierarchy of the dataset. -When fetching data from an array, the client first examines the chunk manifest file[s] for that array and finally fetches the chunks referenced therein. +When fetching data from an array, the client first examines the chunk manifest file[s] for that array and finally fetches the chunks referenced therein. When writing a new dataset snapshot, the client first writes a new set of chunks and chunk manifests, and then generates a new structure file. Finally, in an atomic swap operation, it replaces the state file with a new state file recording the presence of the new snapshot. Ensuring atomicity of the swap operation is the responsibility of the [catalog](#catalog). @@ -104,14 +104,14 @@ The contents of the state file metadata must be compatible with the following JS | generation | YES | int | An integer which must be incremented whenever the state file is updated | | store_root | NO | str | A URI which points to the root location of the store in object storage. If blank, the store root is assumed to be in the same directory as the state file itself. | | snapshots | YES | array[snapshot] | A list of all of the snapshots. | -| refs | NO | mapping[reference] | A mapping of references to snapshots | +| refs | NO | mapping[reference] | A mapping of references (string names) to snapshots | A snapshot contains the following properties | Name | Required | Type | Description | |--|--|--|--| | snapshot-id | YES | str UID | Unique identifier for the snapshot | -| parent-snapshot-id | NO | str UID | Parent snapshot (null for no parent) | +| parent-snapshot-id | YES | null OR str UID | Parent snapshot (null for no parent) | | timestamp-ms | YES | int | When was snapshot commited | | structure-file | YES | str | Name of the structure file for this snapshot | | properties | NO | object | arbitrary user-defined attributes to associate with this snapshot | From 2e998a1aadcecffd4a8bfe45b0dda2c853a53276 Mon Sep 17 00:00:00 2001 From: Ryan Abernathey Date: Thu, 8 Aug 2024 17:08:59 -0400 Subject: [PATCH 10/11] rename dataset -> store again --- spec/icechunk_spec.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/spec/icechunk_spec.md b/spec/icechunk_spec.md index 7b811fd7..e6317955 100644 --- a/spec/icechunk_spec.md +++ b/spec/icechunk_spec.md @@ -3,10 +3,10 @@ The Icechunk specification is a storage specification for [Zarr](https://zarr-specs.readthedocs.io/en/latest/specs.html) data. 
Icechunk is inspired by Apache Iceberg and borrows many concepts and ideas from the [Iceberg Spec](https://iceberg.apache.org/spec/#version-2-row-level-deletes). -This specification describes a single Icechunk **dataset**. -A dataset is defined as a Zarr store containing one or more interrelated Arrays and Groups which must be updated consistently. -The most common scenario is for a dataset to contain a single Zarr group with multiple arrays, each corresponding to different physical variables but sharing common spatiotemporal coordinates. -However, formally a dataset can be any valid Zarr hierarchy, from a single Array to a deeply nested structure of Groups and Arrays. +This specification describes a single Icechunk **store**. +A store is defined as a Zarr store containing one or more interrelated Arrays and Groups which must be updated consistently. +The most common scenario is for a store to contain a single Zarr group with multiple arrays, each corresponding to different physical variables but sharing common spatiotemporal coordinates. +However, formally a store can be any valid Zarr hierarchy, from a single Array to a deeply nested structure of Groups and Arrays. ## Goals @@ -39,17 +39,17 @@ Stores do not require random-access writes. Once written, chunk and metadata fil Icechunk uses a series of linked metadata files to describe the state of the store. -- The **state file** is the entry point to the dataset. It stores a record of snapshots, each of which is a pointer to a single structure file. -- The **structure file** records all of the different arrays and groups in the dataset, plus their metadata. Every new commit creates a new structure file. The structure file contains pointers to one or more chunk manifests files and [optionally] attribute files. +- The **state file** is the entry point to the store. It stores a record of snapshots, each of which is a pointer to a single structure file. +- The **structure file** records all of the different arrays and groups in the store, plus their metadata. Every new commit creates a new structure file. The structure file contains pointers to one or more chunk manifests files and [optionally] attribute files. - **Chunk Manifests** store references to individual chunks. A single manifest may store references for multiple arrays or a subset of all the references for a single array. - **Attributes files** provide a way to store additional user-defined attributes for arrays and groups outside of the structure file. This is important when the attributes are very large. - **Chunk files** store the actual compressed chunk data, potentially containing data for multiple chunks in a single file. -When reading a dataset, the client first opens the state file and chooses a structure file corresponding to a specific snapshot to open. -The client then reads the structure file to determine the structure and hierarchy of the dataset. +When reading a store, the client first opens the state file and chooses a structure file corresponding to a specific snapshot to open. +The client then reads the structure file to determine the structure and hierarchy of the store. When fetching data from an array, the client first examines the chunk manifest file[s] for that array and finally fetches the chunks referenced therein. -When writing a new dataset snapshot, the client first writes a new set of chunks and chunk manifests, and then generates a new structure file. 
Finally, in an atomic swap operation, it replaces the state file with a new state file recording the presence of the new snapshot. +When writing a new store snapshot, the client first writes a new set of chunks and chunk manifests, and then generates a new structure file. Finally, in an atomic swap operation, it replaces the state file with a new state file recording the presence of the new snapshot. Ensuring atomicity of the swap operation is the responsibility of the [catalog](#catalog). From 202de3a6232e242a08def6af27a64be960e430fc Mon Sep 17 00:00:00 2001 From: Ryan Abernathey Date: Fri, 9 Aug 2024 17:25:46 -0400 Subject: [PATCH 11/11] closes #6 --- spec/icechunk_spec.md | 82 +++++++++++++++++++++++++------------------ 1 file changed, 47 insertions(+), 35 deletions(-) diff --git a/spec/icechunk_spec.md b/spec/icechunk_spec.md index e6317955..699ce008 100644 --- a/spec/icechunk_spec.md +++ b/spec/icechunk_spec.md @@ -1,12 +1,20 @@ # Icechunk Specification +The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119. + +## Introduction + The Icechunk specification is a storage specification for [Zarr](https://zarr-specs.readthedocs.io/en/latest/specs.html) data. Icechunk is inspired by Apache Iceberg and borrows many concepts and ideas from the [Iceberg Spec](https://iceberg.apache.org/spec/#version-2-row-level-deletes). This specification describes a single Icechunk **store**. -A store is defined as a Zarr store containing one or more interrelated Arrays and Groups which must be updated consistently. +A store is defined as a Zarr store containing one or more Arrays and Groups. The most common scenario is for a store to contain a single Zarr group with multiple arrays, each corresponding to different physical variables but sharing common spatiotemporal coordinates. However, formally a store can be any valid Zarr hierarchy, from a single Array to a deeply nested structure of Groups and Arrays. +Users of Icechunk SHOULD aim to scope their stores only to related arrays and groups that require consistent transactional updates. + +All the data and metadata for a store are stored in **warehouse**, typically a directory in object storage or file storage. +A separate **catalog** is used to track the latest version of of store. ## Goals @@ -25,7 +33,7 @@ The goals of the specification are as follows: Icechunk only requires that file systems support the following operations: -- **In-place write** - Files are not moved or altered once they are written. +- **In-place write** - Files are not moved or altered once they are written. Strong read-after-write consistency is expected. - **Seekable reads** - Chunk file formats may require seek support (e.g. shards). - **Deletes** - Stores delete files that are no longer used (via a garbage-collection operation). @@ -45,20 +53,25 @@ Icechunk uses a series of linked metadata files to describe the state of the sto - **Attributes files** provide a way to store additional user-defined attributes for arrays and groups outside of the structure file. This is important when the attributes are very large. - **Chunk files** store the actual compressed chunk data, potentially containing data for multiple chunks in a single file. -When reading a store, the client first opens the state file and chooses a structure file corresponding to a specific snapshot to open. 
+When reading a store, the client receives a pointer to a state file from the catalog, read the state file, and chooses a structure file corresponding to a specific snapshot to open. The client then reads the structure file to determine the structure and hierarchy of the store. - When fetching data from an array, the client first examines the chunk manifest file[s] for that array and finally fetches the chunks referenced therein. -When writing a new store snapshot, the client first writes a new set of chunks and chunk manifests, and then generates a new structure file. Finally, in an atomic swap operation, it replaces the state file with a new state file recording the presence of the new snapshot. + +When writing a new store snapshot, the client first writes a new set of chunks and chunk manifests, and then generates a new structure file and state file. +Finally, in an atomic swap operation, it updates the pointer to the state file in the catalog. Ensuring atomicity of the swap operation is the responsibility of the [catalog](#catalog). ```mermaid flowchart TD subgraph catalog - state[State File] + cat_pointer[Current Statefile Pointer] end subgraph metadata + subgraph state_files + old_state[State File 1] + state[State File 2] + end subgraph structure structure1[Structure File 1] structure2[Structure File 2] @@ -78,6 +91,7 @@ flowchart TD chunk4[Chunk File 4] end + cat_pointer --> state state -- snapshot ID --> structure2 structure1 --> attrs structure1 --> manifestA @@ -91,15 +105,25 @@ flowchart TD ``` +### File Layout + +All data and metadata files are stored in a warehouse (typically an object store) using the following directory structure. + +- `$ROOT` base URI (s3, gcs, file, etc.) +- `$ROOT/t/` state files +- `$ROOT/s/` for the structure files +- `$ROOT/a/` for attribute files +- `$ROOT/m/` for array chunk manifests +- `$ROOT/c/` for array chunks + ### State File -The **state file** records the current state of the store. -All transactions occur by updating or replacing the state file. -The state file contains, at minimum, a pointer to the latest structure file snapshot. -A state file doesn't actually have to be a file; responsibility for storing, retrieving, and updating a state file lies with the [catalog](#catalog), and different catalog implementations may do this in different ways. -Below we describe the state file as a JSON file, which is the most straightforward implementation. +The **state file** records the current state and history of the store. +All commits occur by creating a new state file and updating the pointer in the catalog to this new state file. +The state file contains a list of active (non-expired) snapshots. +Each snapshot includes a pointer to the structure file for that snapshot. -The contents of the state file metadata must be compatible with the following JSON schema: +The state file is a JSON file with the following JSON schema: [TODO: convert to JSON schema] @@ -123,24 +147,12 @@ A snapshot contains the following properties References are a mapping of string names to snapshots - | Name | Required | Type | Description | |--|--|--|--| | name | YES | str | Name of the reference| | snapshot-id | YES | str UID | What snaphot does it point to | | type | YES | "tag" / "branch" | Whether the reference is a tag or a branch | -### File Layout - -The state file can be stored separately from the rest of the data or together with it. The rest of the data files must be kept in a directory with the following structure. - -- `$ROOT` base URI (s3, gcs, file, etc.) 
-- `$ROOT/state.json` (optional) state file -- `$ROOT/s/` for the structure files -- `$ROOT/a/` for attribute files -- `$ROOT/m/` for array chunk manifests -- `$ROOT/c/` for array chunks - ### Structure Files The structure file fully describes the schema of the store, including all arrays and groups. @@ -261,23 +273,20 @@ Applications may choose to arrange chunks within files in different ways to opti ## Catalog -An Icechunk _catalog_ is a database for keeping track of one or more state files for Icechunk Stores. +An Icechunk _catalog_ is a database for keeping track of pointers to state files for Icechunk Stores. This specification is limited to the Store itself, and does not specify in detail all of the possible features or capabilities of a catalog. -A catalog must support the following basic logical interface (here defined in Python pseudocode): +A catalog MUST support the following basic logical interface (here defined in Python pseudocode): ```python -def create_store(store_identifier, initial_state: StateMetadata) -> None - """Create a new store in the catalog""" +def get_store_statefile_location(store_identifier) -> URI: + """Get the location of a store state file.""" ... -def load_store(store_identifier) -> StateMetadata: - """Retrieve the state metadata for a single store.""" - ... - -def commit_store(store_identifier, previous_generation: int, new_state: StateMetadata) -> None: - """Atomically update a store's statefile. - Should fail if another session has incremented the generation parameter.""" +def set_store_statefile_location(store_identifier, previous_statefile_location) -> None: + """Set the location of a store state file. + Should fail of the client's previous_statefile_location + is not consistent with the catalog.""" ... def delete_store(store_identifier) -> None: @@ -285,6 +294,9 @@ def delete_store(store_identifier) -> None: ... ``` +A catalog MAY also store the state metadata directly within its database, eliminating the need for an additional request to fetch the state file. +This does not remove the need for the state file to be stored in the warehouse. + ## Algorithms ### Initialize New Store
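+
+A possible initialization sequence is sketched below in Python-style pseudocode.
+It is illustrative only: the helpers `write_parquet`, `write_json`, and `register_with_catalog` are hypothetical, and the sketch assumes only the file layout and state file fields defined above.
+
+```python
+import time
+import uuid
+
+
+def initialize_new_store(root_uri: str, store_identifier: str) -> None:
+    """Illustrative sketch: create the initial files for an empty store."""
+    # 1. Write an empty structure file (no arrays or groups yet).
+    structure_file = f"{root_uri}/s/{uuid.uuid4().hex}.parquet"
+    write_parquet(structure_file, rows=[])  # hypothetical helper
+
+    # 2. Write the initial state file containing a single root snapshot.
+    snapshot = {
+        "snapshot-id": uuid.uuid4().hex,
+        "parent-snapshot-id": None,
+        "timestamp-ms": int(time.time() * 1000),
+        "structure-file": structure_file,
+    }
+    state = {
+        "id": store_identifier,
+        "generation": 0,
+        "store_root": root_uri,
+        "snapshots": [snapshot],
+    }
+    state_file = f"{root_uri}/t/{uuid.uuid4().hex}.json"
+    write_json(state_file, state)  # hypothetical helper
+
+    # 3. Point the catalog at the new state file (see the Catalog section).
+    register_with_catalog(store_identifier, state_file)  # hypothetical wrapper
+```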