[Draft] Icechunk CLI Design Document #714

Draft
wants to merge 9 commits into
base: main
Changes from 7 commits
1 change: 1 addition & 0 deletions .gitignore
@@ -12,3 +12,4 @@
/devel

.ipynb_checkpoints
+ .vscode
143 changes: 143 additions & 0 deletions design-docs/008-command-line-interface.md
@@ -0,0 +1,143 @@
# Icechunk Command Line Interface

This document outlines the design of the Icechunk command line interface.

## Functionality

Here is a list of tasks a user might want to do with Icechunk:

- List my repositories
Collaborator

Listing repositories is something Icechunk cannot do today. It can only verify whether a repository exists at a given location.

We don't expect to add this functionality. We tend to treat anything outside the repo prefix as "unknown" to Icechunk.

Contributor Author

How about listing the repositories defined in the proposed repo config?

Collaborator

+1

Contributor Author

What should this list? Probably all repositories in the repositories config?

It would also be great to be able to point at a location and auto-discover repos.

Collaborator

Auto discover would be cool... seems like an advanced feature we may not need for a while.

- List the history of a repo
- List branches in a repo
- List tags in a repo
- Create a new repository
- Check configuration
- Diff between two commits
- Invoke administrative tasks (garbage collection, compaction, etc.)
Collaborator

I'd add:

  • Print the zarr hierarchy
  • Get repo statistics, example: getsize
  • Fetch metadata for a node
  • Update node user attributes?
  • One day search metadata (IC cannot do this today)
  • In the future we could include "export" functionality, like "export array foo to a zarr store"

Contributor Author

@DahnJ DahnJ Feb 10, 2025

This is not yet reflected in the current Interface section. Detailed API is out of scope, but it might be useful to think about this to see if the proposed command structure works.


## Interface

General command structure

```bash
icechunk <object> <action> <args>
```

Examples

```bash
icechunk repo list

icechunk repo create <repo>
icechunk repo info <repo>
icechunk repo tree <repo>
icechunk repo delete <repo>

icechunk branch list <repo>
icechunk branch create <repo> <branch_name>
icechunk snapshot list <repo>
icechunk snapshot diff <repo> <snapshot_id_1> <snapshot_id_2>
icechunk ref list <repo>

icechunk config init # init: interactive setup
icechunk config list
icechunk config get <key>
icechunk config set <key> <value>
```

Collaborator (on `icechunk config init`)

nice! interactive could be very useful
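As a rough illustration of the `<object> <action> <args>` shape, the sketch below maps a few of the example commands onto a single enum using only the standard library. The `Command` enum and `parse` function are hypothetical names invented for this illustration. The design doc proposes `clap` for the real implementation, which would replace this hand-written matching.

```rust
// Hypothetical sketch: `Command` and `parse` are illustrative names only and
// are not part of Icechunk. A real implementation would likely use `clap`.

#[derive(Debug, PartialEq)]
enum Command {
    RepoList,
    RepoCreate { repo: String },
    BranchList { repo: String },
    BranchCreate { repo: String, branch_name: String },
    SnapshotDiff { repo: String, from: String, to: String },
}

/// Parse `<object> <action> <args>` (the arguments after the program name).
fn parse(args: &[&str]) -> Result<Command, String> {
    match args {
        ["repo", "list"] => Ok(Command::RepoList),
        ["repo", "create", repo] => Ok(Command::RepoCreate { repo: repo.to_string() }),
        ["branch", "list", repo] => Ok(Command::BranchList { repo: repo.to_string() }),
        ["branch", "create", repo, name] => Ok(Command::BranchCreate {
            repo: repo.to_string(),
            branch_name: name.to_string(),
        }),
        ["snapshot", "diff", repo, from, to] => Ok(Command::SnapshotDiff {
            repo: repo.to_string(),
            from: from.to_string(),
            to: to.to_string(),
        }),
        // Two or more tokens that match no known object/action/arity combination.
        [object, action, ..] => Err(format!(
            "unsupported or incomplete command: {} {}",
            object, action
        )),
        // Fewer than two tokens: not even an object and an action.
        _ => Err("usage: icechunk <object> <action> <args>".to_string()),
    }
}
```

Calling `parse(&["branch", "create", "my-repo", "dev"])` yields a `Command::BranchCreate` value. Adding a new object or action only means adding a variant and a match arm, which is the extensibility argument for this structure.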

### Git-like interface

An alternative would be a more git-like structure (`git diff`, `git show`, ...).

The git interface is familiar, but there are reasons to prefer the `<object> <action>` structure proposed above:

- The differences between git and Icechunk can be deceptive to new users
- The git interface is (arguably) not very user-friendly if you're not familiar with it
- The `<object> <action>` structure is more extensible
- Example: Docker has been [adopting](https://www.docker.com/blog/whats-new-in-docker-1-13/) this structure over time (`docker ps` -> `docker container ls`)

## Configuration

Two guiding use cases:

- User just wants to `icechunk repo create s3://bucket/path`, get credentials from environment/aws config, and use default repo settings.
- User wants to manage multiple repositories stored in different locations, with different credentials and settings.

Following Icechunk's config module, there are four types of information needed to work with a repository:

- Location: `bucket`, `path`
- Credentials: `access_key_id`, `secret_access_key`, ..
- Options: `region`, `endpoint_url`, ..
- Repo configuration: `compression`, `caching`, `virtual_chunk_containers`, ..

There are three ways to provide this information, in the standard order of precedence:

1. Command line arguments
2. Environment variables
3. Configuration file
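The precedence order above can be sketched as a chain of fallbacks. This is a minimal illustration only: `resolve_setting` and `resolve_region` are hypothetical helpers, and the environment variable name `ICECHUNK_REGION` and the `region` key lookup are assumptions made for the example, not names defined by Icechunk.

```rust
use std::collections::HashMap;

/// Pick the first value that is present, in the order:
/// command line argument, environment variable, configuration file.
fn resolve_setting(
    cli_arg: Option<&str>,
    env_value: Option<&str>,
    file_value: Option<&str>,
) -> Option<String> {
    cli_arg.or(env_value).or(file_value).map(str::to_owned)
}

/// Example wiring for a single option.
/// `ICECHUNK_REGION` is a hypothetical variable name used for illustration.
fn resolve_region(cli_arg: Option<&str>, file: &HashMap<String, String>) -> Option<String> {
    let env_value = std::env::var("ICECHUNK_REGION").ok();
    resolve_setting(
        cli_arg,
        env_value.as_deref(),
        file.get("region").map(String::as_str),
    )
}
```

Keeping `resolve_setting` free of I/O makes the precedence rule easy to test on its own, while `resolve_region` shows where the environment and the parsed configuration file would plug in.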


### Repositories configuration

This section describes the CLI repositories configuration file.

> Note: This configuration could also be used by the library.

A first draft of the structure:

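```rust
use std::collections::HashMap;

use crate::config::{Credentials, ObjectStoreConfig, RepositoryConfig};

/// Where the repository lives in object storage.
pub struct RepoLocation {
    bucket: String,
    prefix: String,
}

/// Everything needed to open one repository.
pub struct RepositoryDefinition {
    location: RepoLocation,
    object_store_config: ObjectStoreConfig,
    credentials: Credentials,
    config: RepositoryConfig,
}

/// User-chosen name for a repository.
/// Derives `Eq` and `Hash` so it can be used as a `HashMap` key.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct RepositoryAlias(String);

/// All repositories known to the CLI, keyed by alias.
pub struct Repositories {
    repos: HashMap<RepositoryAlias, RepositoryDefinition>,
}
```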
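One way the two guiding use cases could share a single `<repo>` argument is to look the argument up as an alias first and fall back to parsing it as a location. The sketch below assumes exactly that; the design doc does not specify how `<repo>` is resolved, so the alias-first rule, the `resolve_repo` function, and the `s3://bucket/prefix` parsing are all assumptions. `RepoLocation` here is a simplified stand-in for the draft struct above, and the alias map uses plain `String` keys to stay self-contained.

```rust
use std::collections::HashMap;

// Simplified stand-in for the draft `RepoLocation`; credentials and other
// settings are left out of this illustration.
#[derive(Debug, Clone, PartialEq)]
struct RepoLocation {
    bucket: String,
    prefix: String,
}

/// Resolve a `<repo>` argument (hypothetical rule):
/// 1. if it matches a configured alias, use that definition;
/// 2. otherwise try to parse it as `s3://bucket/prefix`.
fn resolve_repo(arg: &str, aliases: &HashMap<String, RepoLocation>) -> Option<RepoLocation> {
    if let Some(location) = aliases.get(arg) {
        return Some(location.clone());
    }
    // Not an alias: only `s3://` URLs are understood in this sketch.
    let rest = arg.strip_prefix("s3://")?;
    // A URL without a slash after the bucket has an empty prefix.
    let (bucket, prefix) = rest.split_once('/').unwrap_or((rest, ""));
    if bucket.is_empty() {
        return None;
    }
    Some(RepoLocation {
        bucket: bucket.to_string(),
        prefix: prefix.to_string(),
    })
}
```

With this rule, `icechunk repo info prod` would use the configured `prod` entry, while `icechunk repo create s3://bucket/path` would work with no configuration file at all.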

## Python packaging

The packaging follows the [Python entrypoint](https://www.maturin.rs/bindings#both-binary-and-library) approach.

- CLI implemented in `icechunk/src/cli/`
- CLI exposed to Rust in `icechunk/src/bin/icechunk/`
- CLI exposed to Python through an entrypoint function, registered in `pyproject.toml`

```toml
[project.scripts]
icechunk = "icechunk._icechunk_python:cli_entrypoint"
```

The disadvantage is that Python users go through the Python interpreter on every CLI invocation, which adds hundreds of milliseconds of latency.

The user can also install the Rust binary directly through `cargo install`.

## Implementation details

The CLI is implemented with:

- `clap` for the CLI
- `clap_complete` for shell completion
- `anyhow` for error handling
- `serde_yaml_ng` for configuration

## Optional features

- Structured output option (e.g. JSON)
- Short version of the command (e.g. `ic`)
- Support for tab completion
2 changes: 1 addition & 1 deletion icechunk/src/config.rs
@@ -203,7 +203,7 @@ impl ManifestConfig {

#[derive(Clone, Debug, Serialize, Deserialize, PartialEq, Eq, Default)]
pub struct RepositoryConfig {
- /// Chunks smaller than this will be stored inline in the manifst
+ /// Chunks smaller than this will be stored inline in the manifest
pub inline_chunk_threshold_bytes: Option<u16>,
/// Unsafely overwrite refs on write. This is not recommended, users should only use it at their
/// own risk in object stores for which we don't support write-object-if-not-exists. There is
2 changes: 1 addition & 1 deletion icechunk/src/storage/mod.rs
@@ -244,7 +244,7 @@ pub trait Storage: fmt::Debug + private::Sealed + Sync + Send {
) -> StorageResult<Box<dyn AsyncRead + Unpin + Send>>;
/// Returns whatever reader is more efficient.
///
- /// For example, if processesed with multiple requests, it will return a synchronous `Buf`
+ /// For example, if processed with multiple requests, it will return a synchronous `Buf`
/// instance pointing the different parts. If it was executed in a single request, it's more
/// efficient to return the network `AsyncRead` directly
async fn fetch_manifest_known_size(