[Draft] Icechunk CLI Design Document #714

Draft. Wants to merge 9 commits into base: main
1 change: 1 addition & 0 deletions .gitignore
@@ -12,3 +12,4 @@
/devel

.ipynb_checkpoints
.vscode
148 changes: 148 additions & 0 deletions design-docs/008-command-line-interface.md
@@ -0,0 +1,148 @@
# Icechunk Command Line Interface

This document outlines the design of the Icechunk command line interface.

## Functionality

Here is a list of tasks a user might want to do with Icechunk:

- List repositories in the configuration
- List the history of a repo
- List branches in a repo
- List tags in a repo
- Print the zarr hierarchy
- Get repo statistics (e.g. `getsize`)
- Create a new repository
- Check configuration
- Diff between two commits
- Invoke administrative tasks (garbage collection, compaction, etc.)

Collaborator:

I'd add:

  • Print the zarr hierarchy
  • Get repo statistics, example: getsize
  • Fetch metadata for a node
  • Update node user attributes?
  • One day search metadata (IC cannot do this today)
  • In the future we could include "export" functionality, like "export array foo to a zarr store"

Contributor Author @DahnJ (Feb 10, 2025):

This is not yet reflected in the current Interface section. Detailed API is out of scope, but it might be useful to think about this to see if the proposed command structure works.


This is not an exhaustive list.

## Interface

General command structure

```bash
icechunk <object> <action> <args>
```

Examples

```bash
icechunk repos list

icechunk repo create <repo>
icechunk repo info <repo>
icechunk repo tree <repo>
icechunk repo delete <repo>

icechunk branch list <repo>
icechunk branch create <repo> <branch_name>
icechunk snapshot list <repo>
icechunk snapshot diff <repo> <snapshot_id_1> <snapshot_id_2>
icechunk ref list <repo>

icechunk config init # init: interactive setup
# Review comment (Collaborator): nice! interactive could be very useful
icechunk config list
icechunk config get <key>
icechunk config set <key> <value>

```
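
The `<object> <action> <args>` structure can be sketched as a two-level dispatch: first on the object, then on the action. The design names clap for the real implementation; the block below is a dependency-free stand-in using only the standard library, and the `Command` enum and `parse` function are hypothetical names chosen for illustration. It covers only a subset of the example commands.

```rust
/// One variant per `<object> <action>` pair. Hypothetical type for illustration;
/// only a subset of the commands from the examples is modelled.
#[derive(Debug, PartialEq)]
pub enum Command {
    ReposList,
    RepoCreate { repo: String },
    RepoInfo { repo: String },
    BranchList { repo: String },
    BranchCreate { repo: String, branch_name: String },
    SnapshotDiff { repo: String, from: String, to: String },
}

/// Parses the arguments that follow the program name.
/// The first element selects the object, the second selects the action.
pub fn parse(args: &[&str]) -> Result<Command, String> {
    match args {
        ["repos", "list"] => Ok(Command::ReposList),
        ["repo", "create", repo] => Ok(Command::RepoCreate { repo: repo.to_string() }),
        ["repo", "info", repo] => Ok(Command::RepoInfo { repo: repo.to_string() }),
        ["branch", "list", repo] => Ok(Command::BranchList { repo: repo.to_string() }),
        ["branch", "create", repo, name] => Ok(Command::BranchCreate {
            repo: repo.to_string(),
            branch_name: name.to_string(),
        }),
        ["snapshot", "diff", repo, from, to] => Ok(Command::SnapshotDiff {
            repo: repo.to_string(),
            from: from.to_string(),
            to: to.to_string(),
        }),
        // An object and action were given, but the pair is unknown or the
        // number of arguments does not match any command above.
        [object, action, ..] => Err(format!(
            "unrecognized command or wrong number of arguments: {object} {action}"
        )),
        // Fewer than two arguments: show the general command structure.
        _ => Err("usage: icechunk <object> <action> <args>".to_string()),
    }
}
```

Adding a new object in this shape means adding variants and match arms without touching existing ones, which is the extensibility argument made for this structure.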

### Git-like interface

An alternative would be a more git-like structure (`git diff`, `git show`, ...).

The git interface is familiar, but the `<object> <action>` structure is preferred because:

- The differences between git and Icechunk can be deceptive to new users
- The git interface is (arguably) not very user-friendly if you're not familiar with it
- The `<object> <action>` structure is more extensible
- Example: Docker has been [adopting](https://www.docker.com/blog/whats-new-in-docker-1-13/) this structure over time (`docker ps` -> `docker container ls`)

## Configuration

Two guiding use cases:

- User just wants to `icechunk repo create s3://bucket/path`, get credentials from environment/aws config, and use default repo settings.
- User wants to manage multiple repositories stored in different locations, with different credentials and settings.

Following Icechunk's config module, there are four types of information needed to work with a repository:

- Location: `bucket`, `path`
- Credentials: `access_key_id`, `secret_access_key`, ..
- Options: `region`, `endpoint_url`, ..
- Repo configuration: `compression`, `caching`, `virtual_chunk_containers`, ..

There are three ways to provide this information, in the standard order of precedence:

1. Command line arguments
2. Environment variables
3. Configuration file
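
The precedence order above amounts to taking the first source that provides a value. A minimal sketch, assuming each source has already been read into an `Option`; the function names and the `ICECHUNK_REGION` variable name are hypothetical and not defined anywhere in this design:

```rust
/// Returns the first value that is set, checking the sources in precedence
/// order: command line argument, then environment variable, then config file.
pub fn resolve<T>(
    cli_arg: Option<T>,
    env_var: Option<T>,
    config_file: Option<T>,
) -> Option<T> {
    cli_arg.or(env_var).or(config_file)
}

/// Resolves the `region` option for one invocation.
/// `ICECHUNK_REGION` is a hypothetical variable name used only for this sketch.
pub fn resolve_region(cli_arg: Option<String>, config_file: Option<String>) -> Option<String> {
    let env_var = std::env::var("ICECHUNK_REGION").ok();
    resolve(cli_arg, env_var, config_file)
}
```

For example, `resolve(None, Some("eu-west-1"), Some("us-east-1"))` yields `Some("eu-west-1")`, because the environment variable outranks the configuration file.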


### Repositories configuration

The CLI reads repository definitions from a repositories configuration file.

> Note: This configuration could also be used by the library.

A first draft of the structure:

```rust
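use std::collections::HashMap;

use crate::config::{Credentials, ObjectStoreConfig, RepositoryConfig};

/// Where the repository lives in object storage.
pub struct RepoLocation {
    bucket: String,
    prefix: String,
}

/// Everything needed to open a single repository.
pub struct RepositoryDefinition {
    location: RepoLocation,
    object_store_config: ObjectStoreConfig,
    credentials: Credentials,
    config: RepositoryConfig,
}

/// User-chosen name for a repository. It is used as a map key, so it must be hashable.
#[derive(PartialEq, Eq, Hash)]
pub struct RepositoryAlias(String);

/// All repositories known to the CLI, keyed by alias.
pub struct Repositories {
    repos: HashMap<RepositoryAlias, RepositoryDefinition>,
}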
```
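
To connect the two guiding use cases with this structure, the sketch below shows one way a `<repo>` argument could be resolved: first as an alias from the configuration, then as a location such as `s3://bucket/path`. This is an assumption about behaviour the design does not specify. The types are simplified stand-ins (a plain `String` alias and only the location part of a definition), and `parse_location` and `resolve_repo` are hypothetical names.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the `RepoLocation` struct in the draft above.
#[derive(Debug, Clone, PartialEq)]
pub struct RepoLocation {
    pub bucket: String,
    pub prefix: String,
}

/// Parses `s3://bucket/prefix` into a location.
/// Only the `s3` scheme is handled in this sketch; anything else yields `None`.
pub fn parse_location(url: &str) -> Option<RepoLocation> {
    let rest = url.strip_prefix("s3://")?;
    // Everything up to the first `/` is the bucket; the remainder is the prefix.
    let (bucket, prefix) = rest.split_once('/').unwrap_or((rest, ""));
    if bucket.is_empty() {
        return None;
    }
    Some(RepoLocation {
        bucket: bucket.to_string(),
        prefix: prefix.to_string(),
    })
}

/// Looks up `arg` as a configured alias first and falls back to parsing it
/// as a URL. The alias-first order is an assumption, not part of the design.
pub fn resolve_repo(
    repos: &HashMap<String, RepoLocation>,
    arg: &str,
) -> Option<RepoLocation> {
    repos.get(arg).cloned().or_else(|| parse_location(arg))
}
```

With this shape, a user with no configuration file can still pass a full URL, while a user managing several repositories can refer to each by its alias.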

## Python packaging

Packaging follows maturin's [Python entrypoint](https://www.maturin.rs/bindings#both-binary-and-library) approach for shipping both a binary and a library:

- CLI implemented in `icechunk/src/cli/`
- CLI exposed to Rust in `icechunk/src/bin/icechunk/`
- CLI exposed to Python through an entrypoint function, declared in `pyproject.toml`

```toml
[project.scripts]
icechunk = "icechunk._icechunk_python:cli_entrypoint"
```

The disadvantage is that Python users invoke the CLI through the Python interpreter, which adds hundreds of milliseconds of startup latency to each command.

The user can also install the Rust binary directly through `cargo install`.
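
Since the same CLI is reached from both a Rust binary and a Python entrypoint, one possible layout is a single function that takes the argument list explicitly, with a thin wrapper per caller. The sketch below uses only the standard library; `run_cli`, `binary_entrypoint`, and `python_entrypoint` are hypothetical names, and the actual Python binding layer is left out.

```rust
/// Runs the CLI on an explicit argument list (program name already removed)
/// and returns a process exit code. Hypothetical helper for illustration;
/// a real implementation would dispatch to the command handlers here.
pub fn run_cli(args: &[String]) -> i32 {
    if args.len() < 2 {
        eprintln!("usage: icechunk <object> <action> <args>");
        return 2;
    }
    println!("would run: {} {}", args[0], args[1]);
    0
}

/// What the `main` function of the Rust binary could call.
/// The first element of `std::env::args()` is the program name, so it is skipped.
pub fn binary_entrypoint() -> i32 {
    let args: Vec<String> = std::env::args().skip(1).collect();
    run_cli(&args)
}

/// What the Python-facing entrypoint could call after reading `sys.argv`.
/// `sys.argv[0]` is the wrapper script name, so it is dropped as well.
pub fn python_entrypoint(sys_argv: Vec<String>) -> i32 {
    let args: Vec<String> = sys_argv.into_iter().skip(1).collect();
    run_cli(&args)
}
```

Passing the arguments in explicitly keeps the command logic independent of how the process was started, so both entry points share the same behaviour.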

## Implementation details

Implemented with

- [clap](https://crates.io/crates/clap) for the CLI
- [clap_complete](https://crates.io/crates/clap_complete) for shell completion
- [anyhow](https://crates.io/crates/anyhow) for error handling
- [serde_yaml_ng](https://crates.io/crates/serde_yaml_ng) for configuration
- [dialoguer](https://crates.io/crates/dialoguer) for user input

## Optional features

- Structured output option (e.g. JSON)
- Short version of the command (e.g. `ic`)
- Support for tab completion
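
As an illustration of the structured output option, the sketch below renders a branch listing either as plain text or as a JSON array. The `OutputFormat` enum and `render_branches` function are hypothetical, and the JSON is built by hand only to keep the example free of dependencies; a real implementation would more likely use a serialization crate.

```rust
/// Output formats a command could support. Hypothetical type for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum OutputFormat {
    Text,
    Json,
}

/// Encodes a string as a JSON string literal, escaping quotes, backslashes,
/// and control characters.
fn json_string(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Renders a list of branch names in the requested format:
/// one name per line for text, a JSON array of strings for JSON.
pub fn render_branches(branches: &[&str], format: OutputFormat) -> String {
    match format {
        OutputFormat::Text => branches.join("\n"),
        OutputFormat::Json => {
            let items: Vec<String> = branches.iter().map(|b| json_string(b)).collect();
            format!("[{}]", items.join(","))
        }
    }
}
```

Keeping rendering separate from the command logic means each command produces data once and the format flag only affects the final step.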
2 changes: 1 addition & 1 deletion icechunk/src/config.rs
@@ -203,7 +203,7 @@ impl ManifestConfig {

#[derive(Clone, Debug, Serialize, Deserialize, PartialEq, Eq, Default)]
pub struct RepositoryConfig {
- /// Chunks smaller than this will be stored inline in the manifst
+ /// Chunks smaller than this will be stored inline in the manifest
pub inline_chunk_threshold_bytes: Option<u16>,
/// Unsafely overwrite refs on write. This is not recommended, users should only use it at their
/// own risk in object stores for which we don't support write-object-if-not-exists. There is
2 changes: 1 addition & 1 deletion icechunk/src/storage/mod.rs
@@ -244,7 +244,7 @@ pub trait Storage: fmt::Debug + private::Sealed + Sync + Send {
) -> StorageResult<Box<dyn AsyncRead + Unpin + Send>>;
/// Returns whatever reader is more efficient.
///
- /// For example, if processesed with multiple requests, it will return a synchronous `Buf`
+ /// For example, if processed with multiple requests, it will return a synchronous `Buf`
/// instance pointing the different parts. If it was executed in a single request, it's more
/// efficient to return the network `AsyncRead` directly
async fn fetch_manifest_known_size(