Icechunk could use a command line interface #461

Open
paraseba opened this issue Dec 9, 2024 · 12 comments
Labels
good first issue 🐣 Good for newcomers

Comments

@paraseba
Collaborator

paraseba commented Dec 9, 2024

Examples:

  • create a new repo
  • list history of a repo
  • list refs
  • check configuration
  • invoke administrative tasks (garbage collection, compaction, etc)
  • get statistics
  • potentially do simple reads and writes
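
As a rough illustration of how the operations listed above could map onto subcommands, here is a minimal sketch using Python's argparse. It is only meant to show the shape of the interface, not an implementation: every command name, the `--repository` flag, and the `--branch` option with its `main` default are hypothetical placeholders.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # All command and flag names below are hypothetical placeholders.
    parser = argparse.ArgumentParser(prog="icechunk")
    parser.add_argument("--repository", required=True,
                        help="alias or location of the repo")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("create", help="create a new repo")
    history = sub.add_parser("history", help="list history of a repo")
    history.add_argument("--branch", default="main")
    sub.add_parser("refs", help="list refs")
    sub.add_parser("config", help="check configuration")
    sub.add_parser("gc", help="run garbage collection")
    sub.add_parser("stats", help="get statistics")
    return parser


# Parse an example command line instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(
    ["--repository", "foo", "history", "--branch", "dev"]
)
print(args.command, args.repository, args.branch)
```

Keeping one subcommand per operation would also fit an incremental approach where each operation lands in its own PR.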
@paraseba paraseba added the good first issue 🐣 Good for newcomers label Dec 9, 2024
@hendrikmakait
Contributor

This sounds like a fun project to work on while getting acquainted with icechunk, I'd love to work on this. Do you have any preferences on how this should be implemented (e.g., Rust vs. Python, choice of CLI toolkit)?

@paraseba
Collaborator Author

Very happy if you want to take this @hendrikmakait !

We want it in Rust. No preference of CLI toolkit. I'd say start small, with one operation in a PR that we can look at, and then we keep adding.

I think a challenge is going to be how to pass all the arguments needed to define a repo. What object store, location, credentials, etc.

Should be fun!

@mpiannucci
Contributor

clap is the best cli toolkit for rust IMO, so that may be a good place to start!

It can also start out in the bin directory of the main icechunk crate as a bundled binary.

@paraseba
Collaborator Author

It can also start out in the bin directory of the main icechunk crate as a bundled binary.

+1

We'll need to see how to distribute it easily for people who install the python library, but that seems like an easy problem.

@DahnJ
Contributor

DahnJ commented Feb 8, 2025

@hendrikmakait I would love to join this effort. Have you been able to pick it up yet? If not, I could get us started with an initial PR in the next few days and we could divide the work from there.

@paraseba
Collaborator Author

paraseba commented Feb 8, 2025

Amazing news @DahnJ ! Feel free to ping me on Slack if you want pointers. I guess I would start with a short design document explaining what functionality we want to include, how users are going to pass arguments via the command line, how it's going to be used from Python, etc.

One tricky thing, as I mentioned above, is that you need a bunch of information to "point" to an icechunk repo. You need the object store details (potentially including things like region, endpoint_url, etc.), you need credentials, you need a prefix. It sounds like a lot to pass on the command line. Maybe we should think about whether it's worth having a config file for the CLI, where you can alias this whole set of parameters with a single name. So I can say --repository foo and ic-cli pulls all the details from config.

For reference, this is the list of arguments needed to create an instance of S3 Storage:

def s3_storage(
    *,
    bucket: str,
    prefix: str | None,
    region: str | None = None,
    endpoint_url: str | None = None,
    allow_http: bool = False,
    access_key_id: str | None = None,
    secret_access_key: str | None = None,
    session_token: str | None = None,
    expires_after: datetime | None = None,
    anonymous: bool | None = None,
    from_env: bool | None = None,
    get_credentials: Callable[[], S3StaticCredentials] | None = None,
) -> Storage:

And that doesn't include the repository config.

@DahnJ
Contributor

DahnJ commented Feb 9, 2025

Thanks for the pointers @paraseba. I'm not feeling ready for a full design document yet, but I'll try to work iteratively and report here so as not to block this completely.

Hopefully this writeup helps make headway.


Packaging with Python

So far I tried to look into the question of distributing the CLI with Python. I see two ways.

Separate binary

Here we write the CLI code in (for example) icechunk/src/bin/cli and bundle it together with icechunk-python into the wheel.

I didn't actually get this to work. I attempted to use the [[bin]] section like so:

# Cargo.toml
[[bin]]
name = "icechunk"
path = "src/bin/icechunk/main.rs"

# pyproject.toml
[tool.maturin]
bindings = "bin"

This only adds the CLI binary to the wheel and not icechunk-python as a library, since bindings is no longer set to the default pyo3. I ended up on PyO3/maturin#368 and didn't find a way forward.

Advantages/disadvantages

Assuming we could get this to work:

  • Fast rust binary
  • ~Double the size of the wheel

Alternative: two wheels

As mentioned in PyO3/maturin#368 (comment), this could be "solved" by simply having two wheels that the users would install separately.

Python entrypoint

Maturin docs suggest wrapping the CLI in a Python entrypoint. To make this work, I

  • implemented the CLI in icechunk/src/cli.rs and exposed it through the library
    • exposed it in icechunk/src/bin/icechunk/main.rs to still get a rust binary
  • wrote an entrypoint function in icechunk-python and exposed it in pyproject.toml as:
[project.scripts]
icechunk = "icechunk._icechunk_python:cli_entrypoint"

This works for me, but has the major disadvantage that Python users would need to call Python to run the CLI. On my machine this is around 250ms of delay when calling the CLI.
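
For anyone who wants to check the startup cost on their own machine, a rough way to measure it is sketched below. Note the assumption: it times only a bare interpreter start, so it gives a lower bound; the delay through the entrypoint would also include importing the icechunk package. The `interpreter_startup_seconds` helper is a made-up name.

```python
import subprocess
import sys
import time


def interpreter_startup_seconds(runs: int = 5) -> float:
    """Return the fastest wall-clock time to start Python and exit immediately."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Launch a fresh interpreter that does nothing, then wait for it to exit.
        subprocess.run([sys.executable, "-c", "pass"], check=True)
        timings.append(time.perf_counter() - start)
    return min(timings)


startup = interpreter_startup_seconds()
print(f"bare interpreter startup: {startup * 1000:.0f} ms")
```

Taking the minimum over a few runs reduces noise from disk caching and other processes.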

@paraseba
Collaborator Author

paraseba commented Feb 9, 2025

@DahnJ I feel the "Python entrypoint" approach is simple and good enough. The CLI is not meant for "low latency" operations anyway. And, we can always:

  • distribute the rust binary directly
  • change how things are packaged in the future

Thank you for doing the analysis and posting the details!

@hendrikmakait
Contributor

@DahnJ, I've unfortunately gotten distracted by some more pressing stuff before being able to put together something meaningful. Great to see your work on this!

@hendrikmakait
Contributor

Maybe we should think about whether it's worth having a config file for the CLI, where you can alias this whole set of parameters with a single name. So I can say --repository foo and ic-cli pulls all the details from config.

+1, this was the first thing that I stumbled over. Having a config akin to pyiceberg would make things much more manageable.

@DahnJ
Contributor

DahnJ commented Feb 10, 2025

I started a draft PR with the design doc in #714, with a skeleton implementation in #716.

I probably won't work on this for a couple of days, so I'm putting it up for feedback.

@paraseba
Collaborator Author

One more feature idea:

  • fetch and export metadata files in human readable form (snapshots, manifests, transaction logs)


4 participants