Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

expose write options #1006

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

matko
Copy link

@matko matko commented Jan 29, 2025

Which issue does this PR close?

Closes #1005 .

This is still a draft. Todo list:

  • DataFrameWriteOptions
  • CsvOptions
  • JsonOptions
  • TableParquetOptions

What changes are included in this PR?

Generic write options

dataframe.write_csv(...), dataframe.write_json(...) and dataframe.write_parquet now take an additional optional argument write_options, corresponding with DataFrameWriteOptions in datafusion. This is a dictionary with the following optional keys:

  • "insert_operation": one of "append", "overwrite" or "replace", corresponding to InsertOp in datafusion.
  • "single_file_option": a boolean
  • "partition_by": list of strings, names of columns to hive partition on
  • "sort_by": list of sort expressions

Csv

Todo..

JSON

Todo..

Parquet

Todo..

Are there any user-facing changes?

The api changes described above. These should be backwards compatible.

#[derive(FromPyObject)]
#[pyo3(from_item_all)]
pub struct PyDataFrameWriteOptions {
insert_operation: Option<String>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you wanted, you could use an Enum here instead of a String, and then defer more of the validation to pyo3 instead of manually checking the string values below.

Copy link
Author

@matko matko Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comment. I did initially try to do this with an enum. Unfortunately, to use an enum in an object that derives FromPyObject, the enum itself has to also derive FromPyObject, but tagging a simple enum (no variants, just tags) with #[derive(FromPyObject)] results in error: cannot derive FromPyObject for empty structs and variants.

So this errors:

#[derive(FromPyObject)]
pub enum PyInsertOperation {
    Insert,
    Overwrite,
    Replace,
}

This might be where my pyo3 knowledge is lacking. Can you point me in the right direction on how to do this as an enum properly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I usually separate out the enum but then implement the FromPyObject directly on that enum. It's simpler with better separation of concerns, I think. And if you ever need to use PyInsertOperation from multiple functions, then you can reuse the same implementation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expose all write options
2 participants