refactor: use parquet library for reading metadata #6189

Draft
wants to merge 2 commits into base: main

Conversation

devinrsmith (Member)

No description provided.

@devinrsmith self-assigned this on Oct 9, 2024
Comment on lines -910 to -911
- lastKey.getFileReader().getSchema(),
- lastKey.getMetadata().getFileMetaData().getKeyValueMetaData(),
devinrsmith (Member Author):

This is one of the logic changes; previously, we were getting the schema from the file instead of from the metadata file.
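For context, a minimal sketch of the distinction using parquet-mr types (`lastKey` and its accessors are assumed from the surrounding Deephaven code):

```java
import java.util.Map;

import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.schema.MessageType;

// Before: the schema came from the data file's own footer.
final MessageType schemaFromFile = lastKey.getFileReader().getSchema();

// After: both the schema and the key-value metadata come from the
// _metadata summary file's ParquetMetadata.
final ParquetMetadata metadata = lastKey.getMetadata();
final MessageType schema = metadata.getFileMetaData().getSchema();
final Map<String, String> keyValueMetadata = metadata.getFileMetaData().getKeyValueMetaData();
```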

Comment on lines -1062 to -1063
- tableLocationKey.getFileReader().getSchema(),
- tableLocationKey.getMetadata().getFileMetaData().getKeyValueMetaData(),
devinrsmith (Member Author):

Ditto.

- .sorted(Comparator.comparingInt(RowGroup::getOrdinal))
- .toArray(RowGroup[]::new);
- final long maxRowCount = Arrays.stream(rowGroups).mapToLong(RowGroup::getNum_rows).max().orElse(0L);
+ .mapToObj(rgi -> parquetMetadata.getBlocks().get(rgi))
devinrsmith (Member Author):

This is also a change; previously the block metadata was coming from the file instead of the metadata file.
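A sketch of the new sourcing (`parquetMetadata` and `rowGroupIndices` are assumed from the surrounding code): row-group metadata is now looked up in the supplied ParquetMetadata rather than re-read from the data file's footer.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

import org.apache.parquet.hadoop.metadata.BlockMetaData;

// Resolve each requested row group against the (possibly _metadata-sourced)
// ParquetMetadata, then size the regions from the largest group.
final BlockMetaData[] rowGroups = IntStream.of(rowGroupIndices)
        .mapToObj(rgi -> parquetMetadata.getBlocks().get(rgi))
        .toArray(BlockMetaData[]::new);
final long maxRowCount = Arrays.stream(rowGroups)
        .mapToLong(BlockMetaData::getRowCount)
        .max()
        .orElse(0L);
```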

rcaudy (Member):

I'm concerned about this. I think this is a merged list of blocks, so we might be including all the row groups from all the files, once per file.
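To illustrate the concern (a hypothetical guard, not code from the PR): parquet-mr's summary-file handling merges the row groups of every data file into one list, and records the originating file in BlockMetaData.getPath(). Indexing that merged list with a per-file row-group index could therefore select blocks from other files; a fix might filter by path first.

```java
import java.util.List;
import java.util.stream.Collectors;

import org.apache.parquet.hadoop.metadata.BlockMetaData;

// "relativePath" is an assumed variable identifying the current data file,
// matching the path strings recorded when the summary file was written.
final List<BlockMetaData> blocksForThisFile = parquetMetadata.getBlocks().stream()
        .filter(block -> relativePath.equals(block.getPath()))
        .collect(Collectors.toList());
```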

regionParameters = new RegionedPageStore.Parameters(
RegionedColumnSource.ROW_KEY_TO_SUB_REGION_ROW_INDEX_MASK, rowGroupCount, maxRowCount);

parquetColumnNameToPath = new HashMap<>();
- for (final ColumnDescriptor column : parquetFileReader.getSchema().getColumns()) {
+ for (final ColumnDescriptor column : parquetMetadata.getFileMetaData().getSchema().getColumns()) {
devinrsmith (Member Author):

Ditto.
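Same shape as the loop in the diff; a sketch of building the column-name-to-path map from the summary file's schema (the map's exact type and the length-greater-than-one filter are assumed from the surrounding Deephaven code):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.parquet.column.ColumnDescriptor;

final Map<String, String[]> parquetColumnNameToPath = new HashMap<>();
for (final ColumnDescriptor column : parquetMetadata.getFileMetaData().getSchema().getColumns()) {
    final String[] path = column.getPath();
    if (path.length > 1) {
        // Only nested columns need an explicit path mapping; top-level
        // columns are addressable by name alone.
        parquetColumnNameToPath.put(path[0], path);
    }
}
```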

return rowGroupReaders = IntStream.of(rowGroupIndices)
- .mapToObj(idx -> parquetFileReader.getRowGroup(idx, version))
+ .mapToObj(idx -> parquetFileReader.getRowGroup(parquetMetadata, idx, version))
devinrsmith (Member Author) · Oct 9, 2024:

Ditto; this is a place where parquetFileReader was implicitly sourcing its own metadata.
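The call shapes, as they appear in the diff (`idx` and `version` come from the surrounding code):

```java
// Before: the reader resolved the row group against the footer it had
// parsed itself.
parquetFileReader.getRowGroup(idx, version);

// After: the caller supplies the ParquetMetadata explicitly, so the same
// reader can be driven by a _metadata summary file's view of the row groups.
parquetFileReader.getRowGroup(parquetMetadata, idx, version);
```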

@@ -154,7 +148,7 @@ public synchronized int[] getRowGroupIndices() {
if (rowGroupIndices != null) {
return rowGroupIndices;
}
- final List<RowGroup> rowGroups = getFileReader().fileMetaData.getRow_groups();
+ final List<BlockMetaData> rowGroups = getFileReader().getMetadata().getBlocks();
devinrsmith (Member Author) · Oct 9, 2024:

TODO: should this actually be calling getMetadata() instead? The javadoc does specifically call out file reader...

rcaudy (Member) left a comment:

Partial review

- final SeekableByteChannel ch =
-         channelsProvider.getReadChannel(holder.get(), getURI()).position(dictionaryPageOffset)) {
+ final SeekableByteChannel ch = channelsProvider.getReadChannel(holder.get(), getURI())) {
+     ch.position(columnChunk.getStartingPos());
rcaudy (Member):

This has a new check, that the dictionary offset is before the first data page offset. Do we think that's just about rejecting malformed metadata that claims a dictionary page that cannot exist?
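For reference, the shape of such a check using parquet-mr's ColumnChunkMetaData accessors (a sketch, not necessarily the exact code in the PR):

```java
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;

// A column chunk can only plausibly have a dictionary page if the recorded
// dictionary offset is non-zero and falls before the first data page; an
// offset at or past the first data page indicates malformed metadata.
static boolean hasDictionaryPage(final ColumnChunkMetaData columnChunk) {
    final long dictionaryPageOffset = columnChunk.getDictionaryPageOffset();
    return dictionaryPageOffset > 0
            && dictionaryPageOffset < columnChunk.getFirstDataPageOffset();
}
```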

@@ -284,6 +294,7 @@ public static Pair<List<ColumnDefinition<?>>, ParquetInstructions> convertSchema
@NotNull final MessageType schema,
@NotNull final Map<String, String> keyValueMetadata,
@NotNull final ParquetInstructions readInstructionsIn) {
+ // TODO: what is this warning about?
rcaudy (Member):

Marker
