support format in options of COPY command #9744

Merged
merged 3 commits on Mar 23, 2024

Conversation

tinfoil-knight
Contributor

Which issue does this PR close?

Closes #9713.

Rationale for this change

Please see the description of the issue.

What changes are included in this PR?

If a format key is passed in the format / write options of the COPY command, we now use it to determine the file_type, which is then used to resolve the format options for the COPY statement.
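
As a rough illustration of the lookup described above, here is a minimal sketch. It is not the code from this PR: the `FileType` enum and the `file_type_from_options` function are hypothetical names, and the case-insensitive key match is an assumption. The only detail taken from this thread is the shape of the error text (`Unknown FileType: NOTVALIDFORMAT`).

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the output formats a COPY statement might target.
/// This is NOT DataFusion's real `FileType`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileType {
    Csv,
    Json,
    Parquet,
}

/// Looks for a `format` key among the COPY options and maps its value to a
/// `FileType`. Matching the key case-insensitively is an assumption of this
/// sketch.
///
/// Returns `Ok(None)` when no `format` key is present, and `Err` when the
/// value does not name a format this sketch knows about.
pub fn file_type_from_options(
    options: &HashMap<String, String>,
) -> Result<Option<FileType>, String> {
    let value = options
        .iter()
        .find(|(key, _)| key.eq_ignore_ascii_case("format"))
        .map(|(_, value)| value);

    // No `format` key: the caller has to determine the file type another way.
    let Some(value) = value else {
        return Ok(None);
    };

    // Upper-case the value so `parquet` and `PARQUET` are treated the same.
    match value.to_ascii_uppercase().as_str() {
        "CSV" => Ok(Some(FileType::Csv)),
        "JSON" => Ok(Some(FileType::Json)),
        "PARQUET" => Ok(Some(FileType::Parquet)),
        other => Err(format!("Unknown FileType: {other}")),
    }
}
```

With this sketch, an options map containing `format = parquet` resolves to `FileType::Parquet`, an empty map resolves to `None`, and an unrecognized value yields an error string.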

Are these changes tested?

Yes. New tests have been added.

Are there any user-facing changes?

This PR improves backwards compatibility before the next release, so there are no new changes from the user's point of view.

@github-actions github-actions bot added the sql (SQL Planner) and sqllogictest (SQL Logic Tests (.slt)) labels Mar 22, 2024
Contributor Author

@tinfoil-knight tinfoil-knight left a comment

Left some notes for the reviewer.

----
1

query error DataFusion error: Invalid or Unsupported Configuration: This feature is not implemented: Unknown FileType: NOTVALIDFORMAT
Contributor Author

The error feels a bit long to me compared to others. We can reduce it to:
DataFusion error: This feature is not implemented: Unknown FileType: NOTVALIDFORMAT

I didn't do this because all the other branches wrap downstream errors in DataFusionError::Configuration and then return them.

WDYT?

Contributor

I agree the error could be nicer, but in this case I think we should keep the same basic pattern as the rest of the code (and perhaps update the pattern in a follow-on PR).
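
To make the trade-off in this thread concrete, the following sketch shows how wrapping one error in another produces the longer message. `SketchError` is a simplified, hypothetical stand-in and not DataFusion's real `DataFusionError`; the two message prefixes are copied from the error text quoted above.

```rust
use std::fmt;

/// Simplified, hypothetical error type used only for this illustration.
#[derive(Debug)]
pub enum SketchError {
    NotImplemented(String),
    Configuration(String),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::NotImplemented(msg) => {
                write!(f, "This feature is not implemented: {msg}")
            }
            SketchError::Configuration(msg) => {
                write!(f, "Invalid or Unsupported Configuration: {msg}")
            }
        }
    }
}

/// The downstream error on its own (the shorter message).
pub fn unknown_file_type(name: &str) -> SketchError {
    SketchError::NotImplemented(format!("Unknown FileType: {name}"))
}

/// Wraps a downstream error in a configuration error. The inner message is
/// kept in full, so the two prefixes stack up in the final text.
pub fn wrap_as_configuration(inner: SketchError) -> SketchError {
    SketchError::Configuration(inner.to_string())
}
```

Returning `unknown_file_type(..)` directly would give the shorter text proposed above, while `wrap_as_configuration(unknown_file_type(..))` gives the longer, nested text that matches the existing pattern.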

@tinfoil-knight tinfoil-knight requested a review from alamb March 22, 2024 22:09
Contributor

@devinjdangelo devinjdangelo left a comment

This looks good to me, thank you @tinfoil-knight!

We could perhaps add a test to validate the behavior when both stored as and format are specified in the same query (i.e. stored as should take precedence).

@tinfoil-knight
Contributor Author

> We could perhaps add a test to validate the behavior when both stored as and format are specified in the same query (i.e. stored as should take precedence).

@devinjdangelo I've added the test. Thank you for the review.
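
The precedence rule discussed here can be sketched in a few lines. The function name `effective_format` and its parameters are hypothetical and do not come from this PR; the sketch only encodes the stated rule that a stored as clause wins over a format option.

```rust
/// Picks the format name to use when a COPY statement may carry both a
/// STORED AS clause and a `format` key in its options.
///
/// `stored_as` is the format named by STORED AS, if any; `format_option` is
/// the value of the `format` option, if any. STORED AS takes precedence and
/// the option is used only as a fallback.
pub fn effective_format<'a>(
    stored_as: Option<&'a str>,
    format_option: Option<&'a str>,
) -> Option<&'a str> {
    stored_as.or(format_option)
}
```

When both are given, the STORED AS value is returned; when only the option is given, the option is returned; when neither is given, the result is `None`.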

Contributor

@alamb alamb left a comment

Thank you @tinfoil-knight and @devinjdangelo -- the code looks good to me

I also tested it out locally and it worked great 🙏

DataFusion CLI v36.0.0
❯ COPY (select * from (values (1))) to '/tmp/copy.dat' OPTIONS (format parquet);
+-------+
| count |
+-------+
| 1     |
+-------+
1 row in set. Query took 0.030 seconds.
❯ \q
(venv) andrewlamb@Andrews-MacBook-Pro:~/Software/arrow-datafusion/datafusion-cli$ file /tmp/copy.dat
/tmp/copy.dat: Apache Parquet

While reviewing the PR I think it would be good to verify that the files created are actually the specified format as well as update the documentation. I'll make a follow on PR

Update: #9753
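
One cheap way to check that an output file is actually Parquet, similar in spirit to the `file` check above, is to look at the magic bytes. The sketch below is only a heuristic and is not the approach taken in the follow-on PR; the function names are hypothetical. It relies on the Parquet layout in which an unencrypted file starts and ends with the 4-byte magic `PAR1`, with a 4-byte footer length just before the trailing magic.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Magic bytes found at both the start and the end of an unencrypted
/// Parquet file.
const PARQUET_MAGIC: &[u8] = b"PAR1";

/// Heuristic check on raw bytes: leading magic, trailing magic, and at least
/// enough room for both plus the 4-byte footer length (4 + 4 + 4 = 12).
/// It does not validate the footer metadata itself.
pub fn looks_like_parquet(bytes: &[u8]) -> bool {
    bytes.len() >= 12
        && bytes.starts_with(PARQUET_MAGIC)
        && bytes.ends_with(PARQUET_MAGIC)
}

/// Convenience wrapper that reads the whole file into memory first, which is
/// fine for small test outputs but wasteful for large files.
pub fn file_looks_like_parquet(path: &Path) -> io::Result<bool> {
    Ok(looks_like_parquet(&fs::read(path)?))
}
```

A stronger test would read the file back with a Parquet reader and compare the rows, since the magic bytes alone say nothing about the contents.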

@alamb alamb merged commit 40fb1b8 into apache:main Mar 23, 2024
23 checks passed
@tinfoil-knight tinfoil-knight deleted the 9713-format-copy branch March 23, 2024 11:31
tinfoil-knight added a commit to tinfoil-knight/arrow-datafusion that referenced this pull request Apr 7, 2024
alamb pushed a commit that referenced this pull request Apr 8, 2024
* Revert "Add test for reading back file created with FORMAT options (#9753)"

This reverts commit b50f3aa.

* Revert "support format in options of COPY command (#9744)"

This reverts commit 40fb1b8.

* update docs and example to remove old syntax
Development

Successfully merging this pull request may close these issues.

Regression: Can no longer use FORMAT PARQUET in COPY command
3 participants