fix(parquet): handle nested data types correctly #20156

wcy-fdu · 2025-01-14T08:43:19Z

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Previously there was a bug: the column list in the parquet file metadata is flattened (one entry per leaf column), while parquet_to_arrow_schema.fields returns the unflattened schema (one entry per top-level field). When a parquet file contains nested columns such as Structs, the two lists have different lengths, so indexing one with a position taken from the other can cause an index out of bounds panic.

This PR fixes the issue by skipping column trimming for parquet files that contain nested columns, and by looking up data types by name instead of by index.
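
To illustrate the mismatch, here is a minimal std-only sketch. It is not the code in parquet_file_handler.rs: DataType, Field, leaf_columns, has_nested and type_by_name are hypothetical stand-ins that only model a flattened leaf list next to an unflattened field list, and show why a lookup by name is safe where a lookup by leaf index is not.

```rust
/// Simplified stand-in for an Arrow data type (hypothetical, not the arrow-rs type).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    Struct(Vec<Field>),
}

/// Simplified stand-in for a schema field (hypothetical).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

fn field(name: &str, data_type: DataType) -> Field {
    Field { name: name.to_string(), data_type }
}

/// Flattens the schema into leaf column paths, modelling the flattened
/// column list that the file metadata exposes.
fn leaf_columns(fields: &[Field]) -> Vec<String> {
    fn walk(prefix: &str, f: &Field, out: &mut Vec<String>) {
        let path = if prefix.is_empty() {
            f.name.clone()
        } else {
            format!("{}.{}", prefix, f.name)
        };
        match &f.data_type {
            DataType::Struct(children) => {
                for child in children {
                    walk(&path, child, out);
                }
            }
            _ => out.push(path),
        }
    }
    let mut out = Vec::new();
    for f in fields {
        walk("", f, &mut out);
    }
    out
}

/// True if any top-level field is nested; in that case this sketch
/// assumes column trimming is skipped entirely.
fn has_nested(fields: &[Field]) -> bool {
    fields.iter().any(|f| matches!(f.data_type, DataType::Struct(_)))
}

/// Looks up a top-level field's type by name rather than by position.
fn type_by_name<'a>(fields: &'a [Field], name: &str) -> Option<&'a DataType> {
    fields.iter().find(|f| f.name == name).map(|f| &f.data_type)
}

fn main() {
    let schema = vec![
        field("id", DataType::Int32),
        field(
            "user",
            DataType::Struct(vec![
                field("name", DataType::Utf8),
                field("age", DataType::Int32),
            ]),
        ),
        field("city", DataType::Utf8),
    ];

    let leaves = leaf_columns(&schema);
    // Four leaf columns, but only three top-level fields.
    assert_eq!(leaves.len(), 4);
    assert_eq!(schema.len(), 3);

    // Using the last leaf index against the top-level fields is out of range;
    // with direct indexing (schema[3]) this would panic.
    assert!(schema.get(leaves.len() - 1).is_none());

    // A lookup by name does not depend on the two lists having equal length.
    assert_eq!(type_by_name(&schema, "city"), Some(&DataType::Utf8));
    assert!(has_nested(&schema));
}
```

The sketch uses `get` so the out-of-range access shows up as `None` rather than a panic; direct indexing would panic as described above.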

Checklist

I have written necessary rustdoc comments.
I have added necessary unit tests and integration tests.
I have added test labels as necessary.
I have added fuzzing tests or opened an issue to track them.
My PR contains breaking changes.
My PR changes performance-critical code, so I will run (micro) benchmarks and present the results.
My PR contains critical fixes that are necessary to be merged into the latest release.

Documentation

My PR needs documentation updates.

Release note

wcy-fdu · 2025-01-22T02:28:11Z

@hzxa21 @chenzl25 PTAL❤️
Users are waiting for this fix.

hzxa21

Generally LGTM. @chenzl25 PTAL

src/connector/src/source/iceberg/parquet_file_handler.rs

gru-agent · 2025-01-26T09:20:19Z

This pull request has been modified. If you want me to regenerate unit test for any of the files related, please find the file in "Files Changed" tab and add a comment @gru-agent. (The github "Comment on this file" feature is in the upper right corner of each file in "Files Changed" tab.)

chenzl25

LGTM

parquet file source: fix nested data type

054a89f

github-actions bot added the type/fix Bug fix label Jan 14, 2025

fix file scan read nested data type

c10b72e

BugenZhao changed the title ~~fix(connector): handle nested data types correctly~~ fix(parquet): handle nested data types correctly Jan 15, 2025

add test

42169f5

wcy-fdu added ci/main-cron/run-selected ci/run-s3-source-tests labels Jan 16, 2025

fix

0de7ae2

wcy-fdu requested review from hzxa21, chenzl25 and zwang28 January 17, 2025 05:36

hzxa21 approved these changes Jan 22, 2025

chenzl25 reviewed Jan 22, 2025

src/connector/src/source/iceberg/parquet_file_handler.rs (outdated)

use projectmask roots

1fa6705

chenzl25 approved these changes Jan 26, 2025

resolve conflict

38e1899

wcy-fdu enabled auto-merge January 26, 2025 09:55

wcy-fdu added this pull request to the merge queue Jan 26, 2025

Merged via the queue into main with commit 1384d45 Jan 26, 2025
29 of 30 checks passed

wcy-fdu deleted the wcy/fix_parquet_nested_data_type.pr branch January 26, 2025 10:41

wcy-fdu added the need-cherry-pick-release-2.1 label Jan 27, 2025

github-actions bot mentioned this pull request Jan 27, 2025

cherry-pick fix(parquet): handle nested data types correctly (#20156) to branch release-2.1 #20317

Open
