Update the iceberg_scan internals to make use of the MultiFileReader API #101

Open · wants to merge 28 commits into main from multi_file_reader

Conversation

@Tishj (Collaborator) commented on Feb 14, 2025

These changes should enable future optimizations that would not be possible (or very difficult and limited) with the old bind_replace-based method.

The duckdb submodule has been updated to point to v1.2-histrionicus.

This also includes a fix for an issue on main (git hash 6571dca3c05e7cb4b047fc81175580fff392c86a) related to reader_data.column_indexes; that fix is also relevant for the duckdb_delta extension.

Thanks to @Tmonster for working on the testing improvements; without that work I wouldn't have found many of the bugs that crept in during this rework 🚀

@Tishj force-pushed the multi_file_reader branch from 3214a96 to 59e234c on February 25, 2025 13:48
@@ -42,22 +41,22 @@ SELECT count(*) FROM ICEBERG_SCAN('data/persistent/iceberg/lineitem_iceberg', ve
# 1 = 2023-02-15 15:07:54.504
# 2 = 2023-02-15 15:08:14.73
query I
-SELECT count(*) FROM ICEBERG_SCAN('data/persistent/iceberg/lineitem_iceberg', '2023-02-15 15:07:54.504'::TIMESTAMP, ALLOW_MOVED_PATHS=TRUE);
+SELECT count(*) FROM ICEBERG_SCAN('data/persistent/iceberg/lineitem_iceberg', snapshot_from_timestamp='2023-02-15 15:07:54.504'::TIMESTAMP, ALLOW_MOVED_PATHS=TRUE);
Collaborator commented:

This could potentially break things for people; we should think about whether that's OK.

return make_uniq<NodeStatistics>(0, 0);
}

// FIXME: visit metadata to get a cardinality count
Collaborator commented:

Maybe we should check whether we can get this in ASAP; good cardinality estimates make a big difference.

@Tishj (Collaborator, Author) replied on Feb 28, 2025:

I agree. The added_rows_count and existing_rows_count fields contain this information per manifest, which we can get by just scanning the manifest list. They are optional in v1 but required by >=v2.
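
For illustration, a minimal sketch of what that estimate could look like, assuming the parsed manifest-list entries are available as a vector of structs carrying those row counts (ManifestListEntry and its fields are hypothetical names, not the extension's actual types, and the usual duckdb namespace and includes are assumed):

```cpp
// Hypothetical sketch: estimate cardinality from the snapshot's manifest-list
// entries. 'ManifestListEntry' and its fields are placeholder names mirroring
// the Iceberg spec's added_rows_count / existing_rows_count, which are
// optional in format v1 and required from v2 onwards.
struct ManifestListEntry {
	bool has_row_counts = false;
	idx_t added_rows_count = 0;
	idx_t existing_rows_count = 0;
};

static unique_ptr<NodeStatistics> EstimateCardinality(const vector<ManifestListEntry> &manifests) {
	idx_t estimated_rows = 0;
	for (auto &manifest : manifests) {
		if (!manifest.has_row_counts) {
			// v1 tables may omit the counts; report "unknown" rather than guessing.
			return make_uniq<NodeStatistics>();
		}
		estimated_rows += manifest.added_rows_count + manifest.existing_rows_count;
	}
	return make_uniq<NodeStatistics>(estimated_rows, estimated_rows);
}
```

A helper along these lines could replace the make_uniq<NodeStatistics>(0, 0) placeholder shown in the diff above once the manifest list is read at bind time.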

@Tishj (Collaborator, Author):

I would like to do that after #110 lands and we integrate those changes into this PR.

I don't really want to figure out how to extend the avro-cpp reading logic to read this information, especially if we're going to replace it anyway.
