-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add TableChangesScan::schema
method to get logical schema
#589
Add TableChangesScan::schema
method to get logical schema
#589
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #589 +/- ##
==========================================
- Coverage 83.48% 83.44% -0.04%
==========================================
Files 74 74
Lines 16855 16858 +3
Branches 16855 16858 +3
==========================================
- Hits 14071 14067 -4
- Misses 2126 2131 +5
- Partials 658 660 +2 ☔ View full report in Codecov by Sentry. |
kernel/src/table_changes/scan.rs
Outdated
} | ||
|
||
/// Get the predicate [`ExpressionRef`] of the scan. | ||
pub fn physical_predicate(&self) -> Option<ExpressionRef> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we sure we want to expose the physical predicate? this would include column-mapped names etc. right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it would. Perhaps we want the logical predicate that they passed though the builder?
I did this because we already have this method as pub in Scan::physical_predicate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose this is useful if the reader needs to do scan_data
then read those files themselves and apply the physical predicate themselves (to physical data) prior to physical to logical translation? How about we keep private for now (no such scan_data
public API in table_changes) and we can always open it up in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit of context: #512 changed the (already public)Scan::predicate
to Scan::physical_predicate
. The in-kernel callers send it to the parquet reader to enable row group skipping. So it's no worse than exposing the physical read schema to the parquet reader IMO -- the engine only needs to know the concept exists (as in, don't pass the logical schema/predicate there), but doesn't need to interpret either one. Similar story to expression eval in kernel, actually -- engine knows how to evaluate expressions but doesn't need to understand why a given expression needs to be evaluated, nor what Delta spec details the expression relates to.
Since CDF doesn't need to do any parquet reads, we should only expose the physical predicate if we want to pass it to the json reader. There isn't any equivalent to row group skipping for json readers, so most json readers don't do predicate pushdown. But I suppose that's actually engine's decision so we probably should (allow it to) pass the predicate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since CDF doesn't need to do any parquet reads, we should only expose the physical predicate if we want to pass it to the json reader
Isn't the physical predicate useful for the data path when reading parquet data files?
But I suppose that's actually engine's decision so we probably should (allow it to) pass the predicate
For this, should we eagerly do that, or wait until we actually expose a split between metadata and data path for CDF? Currently it's all done through execute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the physical predicate useful for the data path when reading parquet data files?
Good point... I was thinking about meta-skipping which doesn't make sense in CDF context.
should we eagerly [expose/pass the predicate]?
Your call... If duckdb picked up CDF support it would likely want to keep the datapath separate, for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @scovich for the detailed reply! And generally I agree with that idea of making public. I'm curious though, if the read_json_files
already takes a predicate (and i'm not noticing maybe we should rename that to physical_predicate?) how would the API here on the scan be useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look at the examples that currently access Scan::physical_predicate
method. They pass it to the parquet reader they're using to fetch data. Which only works because that method is public.
TableChangesScan::schema
and make physical_predicate
pubTableChangesScan::schema
method to get logical schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM pls update pr desc then good to go :)
What changes are proposed in this pull request?
This PR adds a pub method to fetch the logical schema of a
TableChangesScan
How was this change tested?
Everything compiles.