Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This issue is about improving the behavior of write_csv(...), write_json(...) and write_parquet(...).
Right now datafusion-python exposes only a handful of the write options that datafusion actually supports. Notably, DataFrameWriteOptions is always initialized with its defaults, which means the following features are not available for any format:
partitioned (hive style) writes
sorted writes
insert option (though I suspect this is not actually supported for any of the three exposed formats anyway)
single file output (though in my experiments I've not been able to make this actually do anything in Rust)
Furthermore, each format has its own options, of which only some are currently exposed:
parquet: only global compression options are exposed, but the writer actually supports fairly fine-grained column options that are currently unusable from Python.
csv: only header inclusion/exclusion is supported. The underlying writer supports many more options, such as delimiters, quote style, and how to handle nulls.
json: no options are exposed at all right now. The underlying writer supports compression.
Describe the solution you'd like
Expose all supported write options in datafusion-python. I think we should just support passing in dictionaries of these options, using the names people would expect from the Rust documentation. More important options could additionally be top-level keyword arguments in their own right, much like is already the case for parquet global compression.
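To make the proposal concrete, here is a minimal sketch of how a dictionary of options and top-level keyword arguments could be combined. Everything in it is hypothetical: merge_write_options is not an existing datafusion-python function, and the option names (compression, partition_by, single_file_output) are placeholders standing in for whatever names the Rust documentation uses. It only illustrates one possible precedence rule, where an explicitly passed keyword argument overrides the same key in the dictionary.

```python
from typing import Any, Dict, Mapping, Optional


def merge_write_options(
    options: Optional[Mapping[str, Any]] = None,
    **keyword_options: Any,
) -> Dict[str, Any]:
    """Combine a dictionary of write options with top-level keyword arguments.

    Hypothetical helper, not part of datafusion-python. Keyword arguments
    that are not None override entries of the same name in the dictionary.
    """
    merged = dict(options or {})
    for name, value in keyword_options.items():
        if value is not None:
            merged[name] = value
    return merged


# Option names below are placeholders for illustration only.
merged = merge_write_options(
    {"compression": "zstd", "partition_by": ["year"]},
    compression="snappy",      # explicit keyword wins over the dictionary entry
    single_file_output=None,   # None means "not set", so it is left out
)
print(merged)
```

With a rule like this, the existing keyword arguments (such as parquet compression) could keep working unchanged while the dictionary carries the long tail of less common options.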
Describe alternatives you've considered
One alternative is bypassing datafusion and using parquet directly from an arrow stream. This means not being able to work with object stores though, and even when object stores are not needed it's not very ergonomic.
Additional context
None