Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

COPY fails on cli with Invalid statement #9927

Closed
hveiga opened this issue Apr 3, 2024 · 8 comments · Fixed by #9931
Closed

COPY fails on cli with Invalid statement #9927

hveiga opened this issue Apr 3, 2024 · 8 comments · Fixed by #9931
Labels
bug Something isn't working

Comments

@hveiga
Copy link
Contributor

hveiga commented Apr 3, 2024

Describe the bug

I downloaded https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-37.0.0-rc2/ to test the new partition_by feature. I built datafusion-cli by running cargo build --release under datafusion-cli.

The use case is simple: load a parquet file and create multiple parquet files using hive-partitioned partitions.
When I try to run the documented COPY command on https://arrow.apache.org/datafusion/user-guide/sql/write_options.html I get an error.

To Reproduce

  1. Build datafusion-cli from https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-37.0.0-rc2/.
  2. Run ./datafusion-cli.
  3. Create a table from a parquet file:
CREATE EXTERNAL TABLE t1
STORED AS PARQUET
LOCATION '/tmp/file.parquet';
  1. Execute partition_by command:
COPY t1 TO '/tmp/hive_output/' (format parquet, partition_by 'col1');
  1. Get an error: 🤔 Invalid statement: sql parser error: Unexpected token (

Expected behavior

Have the COPY statement generate the expected hive-partitioned parquet files.

Additional context

I don't know if I might be having an issue with my SQL statements or the COPY documentation is incorrect. Still, I thought it was good to report before 37.0.0 gets released. #9682

Thank you!

@hveiga hveiga added the bug Something isn't working label Apr 3, 2024
@tinfoil-knight
Copy link
Contributor

tinfoil-knight commented Apr 3, 2024

The documentation is outdated.

  • You need to specify options in this format: OPTIONS (...)
  • Also, partition_by needs to be specified separately as PARTITIONED BY (<column>) & not with the options.

This is how your query would look now:

COPY t1 TO '/tmp/hive_output/' PARTITIONED BY (col1) OPTIONS (format parquet);

@tinfoil-knight
Copy link
Contributor

Note to Maintainers:
#9905 will remove any compatibility with the old syntax entirely so IMO we should only add the new syntax in updated docs.

@hveiga
Copy link
Contributor Author

hveiga commented Apr 3, 2024

Just tested and COPY t1 TO '/tmp/hive_output/' PARTITIONED BY (col1) OPTIONS (format parquet); works as expected. Thank you!

@hveiga hveiga closed this as completed Apr 3, 2024
@tinfoil-knight
Copy link
Contributor

@hveiga I think we should keep this issue open until the documentation is updated with the latest syntax.

@hveiga
Copy link
Contributor Author

hveiga commented Apr 3, 2024

@hveiga I think we should keep this issue open until the documentation is updated with the latest syntax.

Makes sense. Re-opening.

@hveiga hveiga reopened this Apr 3, 2024
@alamb
Copy link
Contributor

alamb commented Apr 3, 2024

Looks like I missed a spot in #9754, sorry about that.

The new syntax is documented here: https://arrow.apache.org/datafusion/user-guide/sql/dml.html#copy

@alamb
Copy link
Contributor

alamb commented Apr 3, 2024

I'll make a PR

@alamb
Copy link
Contributor

alamb commented Apr 3, 2024

#9931

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants