fix(parquet): handle nested data types correctly #20156
Generally LGTM. @chenzl25 PTAL
LGTM
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Previously, the columns in the Parquet file metadata were the flattened columns, while the fields of the schema returned by parquet_to_arrow_schema were unflattened. When a Parquet file contains nested columns, such as structs, the two lists therefore differ in length, which can lead to an index-out-of-bounds panic. This PR fixes the issue by skipping column trimming for Parquet files that contain nested columns, and by replacing the index-based data type lookup with a lookup by name.
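To illustrate the mismatch, here is a minimal, self-contained sketch. It does not use the real arrow or parquet crates: DataType, Field, flatten_leaves, has_nested, can_trim_columns, and type_by_name are hypothetical stand-ins invented for this example, and the sample schema is made up. It only shows why a flattened leaf index can run past the end of the top-level field list, and how a name-based lookup avoids that.

```rust
/// Hypothetical stand-in for an Arrow data type (not `arrow_schema::DataType`).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
    Struct(Vec<Field>),
}

/// Hypothetical stand-in for an Arrow field.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

fn field(name: &str, data_type: DataType) -> Field {
    Field { name: name.to_string(), data_type }
}

/// Collect leaf column paths, mimicking the flattened column list
/// that Parquet file metadata exposes.
fn flatten_leaves(fields: &[Field], prefix: &str, out: &mut Vec<String>) {
    for f in fields {
        let path = if prefix.is_empty() {
            f.name.clone()
        } else {
            format!("{}.{}", prefix, f.name)
        };
        match &f.data_type {
            DataType::Struct(children) => flatten_leaves(children, &path, out),
            _ => out.push(path),
        }
    }
}

/// True if any top-level field is a struct.
fn has_nested(fields: &[Field]) -> bool {
    fields.iter().any(|f| matches!(f.data_type, DataType::Struct(_)))
}

/// Assumed policy for this sketch: only trim columns when the schema is flat.
fn can_trim_columns(fields: &[Field]) -> bool {
    !has_nested(fields)
}

/// Look up a top-level field's type by name instead of by position.
fn type_by_name<'a>(fields: &'a [Field], name: &str) -> Option<&'a DataType> {
    fields.iter().find(|f| f.name == name).map(|f| &f.data_type)
}

/// Made-up schema: two scalar columns around one struct column.
fn example_schema() -> Vec<Field> {
    vec![
        field("id", DataType::Int64),
        field(
            "user",
            DataType::Struct(vec![
                field("name", DataType::Utf8),
                field("age", DataType::Int64),
            ]),
        ),
        field("note", DataType::Utf8),
    ]
}

fn main() {
    let schema = example_schema();
    let mut leaves = Vec::new();
    flatten_leaves(&schema, "", &mut leaves);

    // The flattened list is longer than the top-level field list.
    println!("leaf columns: {:?}", leaves);
    println!("top-level fields: {}", schema.len());

    // Using the last leaf index on the top-level list finds nothing;
    // direct indexing (`schema[i]`) would panic here.
    let last_leaf_index = leaves.len() - 1;
    println!("by index: {:?}", schema.get(last_leaf_index));

    // A name-based lookup is unaffected by the length mismatch.
    println!("by name: {:?}", type_by_name(&schema, "note"));
    println!("can trim columns: {}", can_trim_columns(&schema));
}
```

In this sketch the struct column contributes two leaves, so the leaf list has one more entry than the top-level field list. That is the kind of length difference described above.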
Checklist
Documentation
Release note