Update the iceberg_scan internals to make use of the MultiFileReader API #101

Open · wants to merge 28 commits into main from multi_file_reader

Conversation

@Tishj (Collaborator) commented on Feb 14, 2025

These changes should enable future optimizations that would not be possible (or very difficult and limited) with the old bind_replace-based method.

The duckdb submodule has been updated to point to v1.2-histrionicus.

This also includes a fix for an issue on main (git hash 6571dca3c05e7cb4b047fc81175580fff392c86a) related to reader_data.column_indexes; that fix is also relevant for the duckdb_delta extension.

Thanks to @Tmonster for working on the testing improvements; without that work I wouldn't have found many of the bugs that crept in during this rework 🚀

@Tishj force-pushed the multi_file_reader branch from 3214a96 to 59e234c on February 25, 2025 13:48
@@ -42,22 +41,22 @@ SELECT count(*) FROM ICEBERG_SCAN('data/persistent/iceberg/lineitem_iceberg', ve
# 1 = 2023-02-15 15:07:54.504
# 2 = 2023-02-15 15:08:14.73
query I
-SELECT count(*) FROM ICEBERG_SCAN('data/persistent/iceberg/lineitem_iceberg', '2023-02-15 15:07:54.504'::TIMESTAMP, ALLOW_MOVED_PATHS=TRUE);
+SELECT count(*) FROM ICEBERG_SCAN('data/persistent/iceberg/lineitem_iceberg', snapshot_from_timestamp='2023-02-15 15:07:54.504'::TIMESTAMP, ALLOW_MOVED_PATHS=TRUE);
Collaborator commented:

This could potentially break things for people; we should think about whether that's OK.

return make_uniq<NodeStatistics>(0, 0);
}

// FIXME: visit metadata to get a cardinality count
Collaborator commented:

Maybe we should check whether we can get this in ASAP; good cardinality estimates make a big difference.

@Tishj (Collaborator, Author) replied on Feb 28, 2025:

I agree. The added_rows_count and existing_rows_count fields contain this information per manifest, which we can get by just scanning the manifest list. They are optional in v1 but required by >=v2.
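
For illustration, a minimal sketch of what that estimate could look like, assuming the parsed manifest-list entries are available as a vector of structs carrying those row counts (ManifestListEntry and its fields are hypothetical names, not the extension's actual types, and the usual duckdb namespace and includes are assumed):

```cpp
// Hypothetical sketch: estimate cardinality from the snapshot's manifest-list
// entries. 'ManifestListEntry' and its fields are placeholder names mirroring
// the Iceberg spec's added_rows_count / existing_rows_count, which are
// optional in format v1 and required from v2 onwards.
struct ManifestListEntry {
	bool has_row_counts = false;
	idx_t added_rows_count = 0;
	idx_t existing_rows_count = 0;
};

static unique_ptr<NodeStatistics> EstimateCardinality(const vector<ManifestListEntry> &manifests) {
	idx_t estimated_rows = 0;
	for (auto &manifest : manifests) {
		if (!manifest.has_row_counts) {
			// v1 tables may omit the counts; report "unknown" rather than guessing.
			return make_uniq<NodeStatistics>();
		}
		estimated_rows += manifest.added_rows_count + manifest.existing_rows_count;
	}
	return make_uniq<NodeStatistics>(estimated_rows, estimated_rows);
}
```

A helper along these lines could replace the make_uniq<NodeStatistics>(0, 0) placeholder shown in the diff above once the manifest list is read at bind time.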

@Tishj (Collaborator, Author):

I would like to do that after #110 lands and we integrate those changes into this PR.

I don't really want to figure out how to extend the avro-cpp reading logic to read this information, especially if we're going to replace it anyway.
