feat: Allow parquet column access by field_id #6156

devinrsmith · 2024-09-30T22:28:41Z

This allows the resolution of a parquet column by field_id instead of by its "path". This is a lower-level option that will not typically be used by end-users; as such, this option has not been plumbed through to Python. This feature will be used in follow-up PRs in combination with Iceberg's field-ids to improve column mappings.

Writing support has also been added.

Fixes #6128

malhotrashivam

First level of review, can do a more detailed review tomorrow.

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReaderImpl.java

malhotrashivam · 2024-09-30T22:52:11Z

Do verify the nightlies pass before merging.

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

This also fixes a bug where `parquetColumnNameToInstructions.put(parquetColumnName, ci);` was called without setting the parquet column name on ci and the KeyDef would blow up.

devinrsmith · 2024-10-01T14:40:15Z

> Do verify the nightlies pass before merging.

Verified.

malhotrashivam

I really like the change, minor comments.

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetSchemaReader.java

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ParquetFileReader.java

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReaderImpl.java

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

devinrsmith · 2024-10-01T16:53:45Z

I couldn't find any resources to confirm, but this does feel incorrect to me, having two columns with same field ID. For example, if we get a field ID by Iceberg, it would expect a single column, right?

Iceberg probably mandates the uniqueness of field-ids.

Parquet doesn't have any mandates wrt that. And even the column names aren't guaranteed to be unique. I need to find the reference I found earlier that the parquet format "strongly recommends" unique column names, but it's not even a guarantee.

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnChunkReaderImpl.java

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/TypeInfos.java

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReaderImpl.java

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetSchemaReader.java

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

rcaudy · 2024-10-07T15:14:01Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

        private final String columnName;
        private String parquetColumnName;
        private String codecName;
        private String codecArgs;
        private boolean useDictionary;
+        private Integer fieldId;


What's the point of this? Seems like we don't get anything but a little bit of extra allocation for this.

The javadoc on OptionalInt specifically calls out that it is intended for return types.

/*
 * ...
 * @apiNote
 * {@code OptionalInt} is primarily intended for use as a method return type where
 * there is a clear need to represent "no result." A variable whose type is
 * {@code OptionalInt} should never itself be {@code null}; it should always point
 * to an {@code OptionalInt} instance.
 */

IntelliJ, and likely other editors, will complain.

Immutables will also use this pattern internally when you have an object that returns OptionalInt.

I'm very heavily in favor of preferring the Java-canonical approach, especially when it comes to configuration objects which we should not really care about in terms of performance implications.
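A minimal sketch of the pattern being defended here, as I read the thread: keep the optional value in a nullable `Integer` field and expose it through an accessor that returns `OptionalInt`. The class `ColumnInstructionsSketch` and its method names are hypothetical stand-ins for illustration, not the PR's actual code.

```java
import java.util.OptionalInt;

// Hypothetical stand-in for a per-column configuration object.
final class ColumnInstructionsSketch {
    private final String columnName;
    // Nullable storage: null means "no field id was set".
    private Integer fieldId;

    ColumnInstructionsSketch(String columnName) {
        this.columnName = columnName;
    }

    String columnName() {
        return columnName;
    }

    void setFieldId(int fieldId) {
        this.fieldId = fieldId;
    }

    // OptionalInt appears only as a return type, in line with its @apiNote.
    OptionalInt fieldId() {
        return fieldId == null ? OptionalInt.empty() : OptionalInt.of(fieldId);
    }
}
```

The extra allocation happens only when the accessor is called, which is unlikely to matter for a configuration object.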

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnPageReaderImpl.java

rcaudy · 2024-10-07T16:13:14Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetSchemaReader.java

-        for (final ColumnDescriptor column : schema.getColumns()) {
+        final Map<Integer, Long> fieldIdCount = schema.getColumns()
+                .stream()
+                .map(ColumnDescriptor::getPrimitiveType)


I find it weird that field ID is on the primitive type.

Good catch, yes; in the case of a list type, the field id is on the list and not on the primitive.

rcaudy · 2024-10-07T16:15:39Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetSchemaReader.java

+            throw new IllegalStateException(String.format(
+                    "Field count inconsistent with number of columns, schema.getFieldCount()=%d, schema.getColumns().size()=%d",
+                    schema.getFieldCount(), schema.getColumns().size()));


Do we really think this holds? I wonder, since "field" and "column" are distinct names.

Great callout - I've dug into the distinction between "field" and "column"; for nested types, there is 1 field and multiple columns (potentially recursive).

I've improved the code to iterate through each field with its respective starting column index.

For inference purposes, we'll fail saying "we can't handle nested types" #871. For reading purposes when the user provides a specific table definition, we'll skip over nested columns.

I suspect we could greatly improve inference if we wanted (potentially to give the user the option to continue failing or to skip inference of nested fields by default) to not fail on these cases. I also suspect it should be pretty easy to actually support nested fields, at least a single level deep, by flattening them out into the table definition.
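The field-versus-column distinction can be illustrated with a small, self-contained model. This is only a sketch under assumptions: `Field` is a hypothetical type that records how many leaf columns a top-level field spans, and the real code works against the Parquet schema classes instead. It walks the fields in order, tracks each field's starting column index, and skips multi-column (nested) fields, as the reading path is described to do when the user supplies a table definition.

```java
import java.util.ArrayList;
import java.util.List;

final class FieldColumnWalk {
    // Hypothetical model of a top-level field: its name and how many leaf columns it spans.
    static final class Field {
        final String name;
        final int numColumns;

        Field(String name, int numColumns) {
            this.name = name;
            this.numColumns = numColumns;
        }
    }

    // Walks the fields in order, tracking each field's starting column index.
    // Single-column fields are reported as "name@startIndex"; multi-column
    // (nested) fields are skipped but still advance the column index.
    static List<String> singleColumnFields(List<Field> fields, int totalColumns) {
        final List<String> out = new ArrayList<>();
        int columnIx = 0;
        for (Field field : fields) {
            if (field.numColumns == 1) {
                out.add(field.name + "@" + columnIx);
            }
            columnIx += field.numColumns;
        }
        if (columnIx != totalColumns) {
            throw new IllegalStateException(String.format(
                    "Fields span %d columns, but the schema has %d columns", columnIx, totalColumns));
        }
        return out;
    }
}
```

An inference path that rejects nested types would throw at the point where this sketch skips.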

rcaudy · 2024-10-07T16:36:11Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetSchemaReader.java

+                        colName = columnNames.get(0);
+                        break COL_NAME;
+                    } else if (columnNames.size() > 1) {
+                        throw new IllegalArgumentException(String.format(


I think this limitation is entirely because you didn't want to refactor the code. If you're going to argue for that, we should at least guide the user to use an updateView to achieve their goals.

Agreed. Added a comment that this could be improved with refactoring of the code.

Did you also update the error message? If we're not going to let the user do this, tell them how they can achieve the same result.

rcaudy · 2024-10-07T16:40:24Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetSchemaReader.java

+                if (mappedName != null) {
+                    colName = mappedName;
+                    break COL_NAME;
+                }
                final String legalized = legalizeColumnNameFunc.apply(


I think there may be a name legalization bug:

I think we should be using builderSupplier in the below code.

I think we should be recording any column we assign as a taken name, in order to ensure that we don't collide between a user-specified name and a legalized name.

Potentially related to #6119
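To make the concern concrete, here is a rough sketch of recording every assigned name as taken, so that a legalized name cannot collide with a user-specified one. `ColumnNameAllocator` and its legalization rule (replace anything outside `[A-Za-z0-9_]` with `_`, then append a numeric suffix) are assumptions for illustration only; Deephaven's real legalization logic is different.

```java
import java.util.HashSet;
import java.util.Set;

final class ColumnNameAllocator {
    // Every name handed out, whether user-specified or legalized, is recorded here.
    private final Set<String> takenNames = new HashSet<>();

    // Claims a user-specified (mapped) name; fails if it is already in use.
    String claimExplicit(String name) {
        if (!takenNames.add(name)) {
            throw new IllegalArgumentException("Column name already in use: " + name);
        }
        return name;
    }

    // Hypothetical legalization: sanitize the name, then add a numeric suffix until unused.
    String claimLegalized(String parquetColumnName) {
        final String base = parquetColumnName.replaceAll("[^A-Za-z0-9_]", "_");
        String candidate = base;
        for (int i = 2; !takenNames.add(candidate); i++) {
            candidate = base + i;
        }
        return candidate;
    }
}
```

If explicit names were not recorded, `claimLegalized("my col")` could return `my_col` even though a user had already mapped another column to that name.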

rcaudy · 2024-10-08T19:47:38Z

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ParquetFileReader.java

+import shaded.parquet.org.apache.thrift.protocol.TSimpleJSONProtocol;
+import shaded.parquet.org.apache.thrift.transport.TIOStreamTransport;
+import shaded.parquet.org.apache.thrift.transport.TTransport;


Feels a little questionable to depend on someone else's shaded packages.

rcaudy · 2024-10-08T19:54:52Z

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReader.java

     * @param fieldId the field_id to fetch
     * @return the accessor to a given Column Chunk, or null if the column is not present in this Row Group
     */
    @Nullable
-    ColumnChunkReader getColumnChunk(@NotNull String columnName, @NotNull List<String> path, @Nullable Integer fieldId);
+    ColumnChunkReader getColumnChunk(@NotNull String columnName, @NotNull List<String> defaultPath,


Document defaultPath. Is it a parquet path?

rcaudy · 2024-10-08T19:58:57Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

+         * In the case where both a field id mapping and a parquet colum name mapping is provided, the field id will
+         * take precedence over the parquet column name. This may happen in cases where the parquet file is managed by a
+         * higher-level schema that has the concept of a "field id"; for example, Iceberg. As <a href=


Change it. There's no precedence if both are present, we insist that they be consistent.
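The "must be consistent" behavior can be sketched as follows. This is an illustrative model, not the PR's implementation: `ConsistentLookup.resolve` and its map parameters are hypothetical, and the columns are compared by reference identity in the same spirit as the `byFieldId != byPath` check in `RowGroupReaderImpl`.

```java
import java.util.Map;

final class ConsistentLookup {
    // Resolves a column by field id and/or path. If both lookups succeed,
    // they must resolve to the very same column; neither takes precedence.
    static <T> T resolve(
            Map<Integer, T> columnsByFieldId,
            Map<String, T> columnsByPath,
            Integer fieldId,
            String path) {
        final T byFieldId = fieldId == null ? null : columnsByFieldId.get(fieldId);
        final T byPath = path == null ? null : columnsByPath.get(path);
        if (byFieldId != null && byPath != null && byFieldId != byPath) {
            throw new IllegalArgumentException(String.format(
                    "Field id (%d) and path (%s) resolve to different columns, byFieldId=[%s], byPath=[%s]",
                    fieldId, path, byFieldId, byPath));
        }
        return byFieldId != null ? byFieldId : byPath;
    }
}
```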

rcaudy · 2024-10-08T20:03:21Z

...s/parquet/table/src/main/java/io/deephaven/parquet/table/location/ParquetColumnLocation.java

     * @param columnChunkReaders The {@link ColumnChunkReader column chunk readers} for this location
     */
    ParquetColumnLocation(
            @NotNull final ParquetTableLocation tableLocation,
            @NotNull final String columnName,
-            @NotNull final String parquetColumnName,
+            @Nullable final String parquetColumnName,


We should check what happens if we're inferring and legalized a Parquet column name to get the Deephaven column name. I think in that case, this change is broken as-is.

rcaudy · 2024-10-08T20:11:19Z

...ns/parquet/table/src/main/java/io/deephaven/parquet/table/location/ParquetTableLocation.java

-                columnPath == null ? Collections.singletonList(parquetColumnName) : Arrays.asList(columnPath);
+        final List<String> defaultPath;
+        {
+            final String[] path = parquetColumnNameToPath.get(columnName);


There's something about this making me uncomfortable. I think there may be a buggy path if legalization is used.

rcaudy · 2024-10-08T20:28:50Z

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReaderImpl.java

+        for (String nonUniquePath : nonUniquePaths) {
+            byPath.remove(nonUniquePath);
        }
        for (Integer nonUniqueFieldId : nonUniqueFieldIds) {
-            chunkMapByFieldId.remove(nonUniqueFieldId);
-            schemaMapByFieldId.remove(nonUniqueFieldId);
+            byFieldId.remove(nonUniqueFieldId);
        }


Last wins or first wins is better than "pretend we had nothing, and just give nulls".

I wonder what pyarrow does.
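For comparison, the two policies under discussion can be sketched side by side. `Column` and both method names are hypothetical; the first method mirrors the approach in the diff above (remove every non-unique key), and the second shows the "first wins" alternative.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DuplicateFieldIdPolicies {
    // Hypothetical minimal column: a field id and a name.
    static final class Column {
        final int fieldId;
        final String name;

        Column(int fieldId, String name) {
            this.fieldId = fieldId;
            this.name = name;
        }
    }

    // Policy in the diff: a field id seen more than once is dropped entirely.
    static Map<Integer, Column> dropNonUnique(List<Column> columns) {
        final Map<Integer, Column> byFieldId = new HashMap<>();
        final Set<Integer> nonUniqueFieldIds = new HashSet<>();
        for (Column column : columns) {
            if (byFieldId.putIfAbsent(column.fieldId, column) != null) {
                nonUniqueFieldIds.add(column.fieldId);
            }
        }
        for (Integer nonUniqueFieldId : nonUniqueFieldIds) {
            byFieldId.remove(nonUniqueFieldId);
        }
        return byFieldId;
    }

    // Alternative: the first column with a given field id wins.
    static Map<Integer, Column> firstWins(List<Column> columns) {
        final Map<Integer, Column> byFieldId = new HashMap<>();
        for (Column column : columns) {
            byFieldId.putIfAbsent(column.fieldId, column);
        }
        return byFieldId;
    }
}
```

With columns `(1, "a")`, `(1, "b")`, `(2, "c")`, the first policy can only resolve field id 2, while the second resolves field id 1 to `a`.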

rcaudy · 2024-10-08T20:37:12Z

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReaderImpl.java

+            if (byFieldId != null && byPath != null) {
+                if (byFieldId != byPath) {
+                    throw new IllegalArgumentException(String.format(
+                            "For columnName=%s, providing an explicit parquet column name path (%s) and field id (%d) mapping, but they are resolving to different columns, byFieldId=[%s], byPath=[%s]",


Suggested change

-                            "For columnName=%s, providing an explicit parquet column name path (%s) and field id (%d) mapping, but they are resolving to different columns, byFieldId=[%s], byPath=[%s]",
+                            "For columnName=%s, instructions provided an explicit parquet column name path (%s) and field id (%d) mapping, but they are resolving to different columns, byFieldId=[%s], byPath=[%s]",

rcaudy · 2024-10-08T20:46:45Z

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReaderImpl.java

-            }
-            columnChunk = cc;
-            nonRequiredFields = schemaMap.get(key);
+            holder = byFieldId != null ? byFieldId : byPath;


If the user specified a field ID and we didn't find it, I'm not sure it's correct to fall back to name mapping.
https://iceberg.apache.org/spec/#schema-evolution specifies a set of rules, and we should be making sure our Parquet implementation will let our Iceberg implementation follow them.

For Iceberg support, it looks like we need:

A list of name mappings, which we fall back to if and only if the field ID was not found.

Some kind of handling for encountering multiple Parquet fields with names from the name mappings: first? last? exception?

Some kind of handling for finding a fallback field by name mappings, and determining that it does not match the expected field ID. Exception?
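One possible shape for these rules, purely as a sketch: look up by field ID first, fall back to the name mappings only when no column carries that ID, and answer the two open questions with exceptions. Those answers, along with the `Column` type and `resolve` signature, are assumptions made for illustration; the thread leaves them undecided.

```java
import java.util.List;

final class FieldIdThenNameResolver {
    // Hypothetical minimal column: a name and an optional (nullable) field id.
    static final class Column {
        final String name;
        final Integer fieldId;

        Column(String name, Integer fieldId) {
            this.name = name;
            this.fieldId = fieldId;
        }
    }

    // Returns the resolved column, or null if neither lookup finds one.
    static Column resolve(List<Column> columns, int fieldId, List<String> nameMappings) {
        // 1. Prefer an exact field id match.
        for (Column column : columns) {
            if (column.fieldId != null && column.fieldId == fieldId) {
                return column;
            }
        }
        // 2. Fall back to name mappings if and only if the field id was not found.
        Column match = null;
        for (Column column : columns) {
            if (!nameMappings.contains(column.name)) {
                continue;
            }
            if (column.fieldId != null) {
                // Assumed policy: a fallback column carrying a different field id is an error.
                throw new IllegalStateException(String.format(
                        "Column %s matched by name but has field id %d, expected %d",
                        column.name, column.fieldId, fieldId));
            }
            if (match != null) {
                // Assumed policy: multiple name matches are an error rather than first or last wins.
                throw new IllegalStateException("Multiple columns match name mappings " + nameMappings);
            }
            match = column;
        }
        return match;
    }
}
```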

rcaudy · 2024-10-08T21:12:02Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetSchemaReader.java

+                // TODO(deephaven-core#871): Parquet: Support repetition level >1 and multi-column fields
+                throw new UnsupportedOperationException(
+                        String.format("Encountered unsupported multi-column field %s, has %d total columns",
+                                fieldType.getName(), numColumns));


You suggested we should maybe just start skipping nested fields. I bet we could also choose to include them, with some weird default. Like "UnprocessedField" singleton POJOs.

rcaudy · 2024-10-08T21:14:03Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetSchemaReader.java

-        if (fieldIt.hasNext() || columnDescriptorIterator.hasNext()) {
-            throw new IllegalStateException("Iterators not exhausted");
+        if (columnIx != columnDescriptors.size()) {
+            throw new IllegalStateException("Not proper size");


We can do better than this.

This allows for the writing of Parquet column field_ids. This is an extraction from #6156 (which will need to expose deeper hooks to allow Iceberg to fully control field_id resolution during reading). It ensures we can correctly write Iceberg tables, which must carry field_ids; this is necessary for #5989. In addition, it was noticed that the Parquet ColumnMetaData encodings were written in a non-deterministic order due to the use of a HashSet; this has been changed to an EnumSet to support a more consistent Parquet serialization. This was necessary to test field_id writing.
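The ordering point can be shown with a small sketch. The `Encoding` enum below is a hypothetical stand-in, not the Parquet library's type. An `EnumSet` always iterates in the enum's declaration order, whereas a `HashSet` of enum constants is ordered by identity hash codes, which can differ from one JVM run to the next.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

final class EncodingOrder {
    // Hypothetical stand-in for an encoding enum.
    enum Encoding {
        PLAIN, RLE, BIT_PACKED, PLAIN_DICTIONARY
    }

    // Collects the encodings in use and returns them in a stable order,
    // regardless of the order or multiplicity in which they were recorded.
    static List<Encoding> inSerializationOrder(Encoding... used) {
        final Set<Encoding> encodings = EnumSet.noneOf(Encoding.class);
        for (Encoding encoding : used) {
            encodings.add(encoding);
        }
        return new ArrayList<>(encodings);
    }
}
```

Because the output no longer depends on insertion order or hash codes, serialized metadata can be compared reliably in tests.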

devinrsmith · 2025-01-13T18:13:58Z

There is going to be a more general follow-up to this that allows for custom logic.

devinrsmith added the parquet, NoDocumentationNeeded, and ReleaseNotesNeeded labels Sep 30, 2024

devinrsmith added this to the 0.37.0 milestone Sep 30, 2024

devinrsmith requested a review from malhotrashivam September 30, 2024 22:28

devinrsmith self-assigned this Sep 30, 2024

devinrsmith requested a review from rcaudy September 30, 2024 22:28

malhotrashivam reviewed Sep 30, 2024

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReaderImpl.java (outdated)

malhotrashivam reviewed Sep 30, 2024

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java (outdated)

Review response

65453c4

devinrsmith requested a review from malhotrashivam October 1, 2024 00:00

devinrsmith added 3 commits September 30, 2024 17:52

Cleanup ParquetInstructions.addColumnNameMapping

9fa979e

This also fixes a bug where `parquetColumnNameToInstructions.put(parquetColumnName, ci);` was called without setting the parquet column name on ci and the KeyDef would blow up.

Given statefulness we maintain around parquetColumnName, we should no…

6b58468

…t skip the logic when a user explicitly sets the parquet column name the same as the column name

Add ParquetInstructions test

3e34bfa

malhotrashivam reviewed Oct 1, 2024

Add writing support

19a6490

Review response

ce2f2b8

devinrsmith requested a review from malhotrashivam October 1, 2024 16:56

malhotrashivam reviewed Oct 1, 2024

devinrsmith added 2 commits October 1, 2024 12:18

Handle case where a parquet field has non-unique field ids

a6ed292

Ensure LIST support for field_id

35e2983

devinrsmith requested a review from malhotrashivam October 1, 2024 19:59

malhotrashivam reviewed Oct 2, 2024

extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupReaderImpl.java (outdated)

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetSchemaReader.java (outdated)

review response

1a2aa69

devinrsmith requested a review from malhotrashivam October 2, 2024 18:42

malhotrashivam previously approved these changes Oct 2, 2024

malhotrashivam mentioned this pull request Oct 2, 2024

feat: Added support to write iceberg tables #5989

Merged

rcaudy reviewed Oct 7, 2024

devinrsmith added 4 commits October 7, 2024 10:19

easy

057bd2c

Ensure getColumnChunk is consistent if both parquet column name and f…

1408d2a

…ield id mappings are provided

Merge remote-tracking branch 'upstream/main' into parquet-field-ids

83ea295

Add Nested parquet file testing

4e7a3b1

devinrsmith dismissed malhotrashivam’s stale review via 4e7a3b1 October 8, 2024 17:32

devinrsmith requested a review from rcaudy October 8, 2024 17:32

rcaudy reviewed Oct 8, 2024

devinrsmith mentioned this pull request Nov 15, 2024

feat: Add Parquet field_id writing #6381

Merged

devinrsmith closed this Jan 13, 2025

github-actions bot locked and limited conversation to collaborators Jan 13, 2025
