Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TableChangesScan::schema method to get logical schema #589

Merged

Conversation

OussamaSaoudi-db
Copy link
Collaborator

@OussamaSaoudi-db OussamaSaoudi-db commented Dec 11, 2024

What changes are proposed in this pull request?

This PR adds a pub method to fetch the logical schema of a TableChangesScan

How was this change tested?

Everything compiles.

Copy link

codecov bot commented Dec 11, 2024

Codecov Report

Attention: Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.

Project coverage is 83.44%. Comparing base (04ccccb) to head (d057822).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/table_changes/scan.rs 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #589      +/-   ##
==========================================
- Coverage   83.48%   83.44%   -0.04%     
==========================================
  Files          74       74              
  Lines       16855    16858       +3     
  Branches    16855    16858       +3     
==========================================
- Hits        14071    14067       -4     
- Misses       2126     2131       +5     
- Partials      658      660       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

}

/// Get the predicate [`ExpressionRef`] of the scan.
pub fn physical_predicate(&self) -> Option<ExpressionRef> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we sure we want to expose the physical predicate? this would include column-mapped names etc. right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it would. Perhaps we want the logical predicate that they passed though the builder?

I did this because we already have this method as pub in Scan::physical_predicate

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this is useful if the reader needs to do scan_data then read those files themselves and apply the physical predicate themselves (to physical data) prior to physical to logical translation? How about we keep private for now (no such scan_data public API in table_changes) and we can always open it up in the future?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit of context: #512 changed the (already public)Scan::predicate to Scan::physical_predicate. The in-kernel callers send it to the parquet reader to enable row group skipping. So it's no worse than exposing the physical read schema to the parquet reader IMO -- the engine only needs to know the concept exists (as in, don't pass the logical schema/predicate there), but doesn't need to interpret either one. Similar story to expression eval in kernel, actually -- engine knows how to evaluate expressions but doesn't need to understand why a given expression needs to be evaluated, nor what Delta spec details the expression relates to.

Since CDF doesn't need to do any parquet reads, we should only expose the physical predicate if we want to pass it to the json reader. There isn't any equivalent to row group skipping for json readers, so most json readers don't do predicate pushdown. But I suppose that's actually engine's decision so we probably should (allow it to) pass the predicate?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since CDF doesn't need to do any parquet reads, we should only expose the physical predicate if we want to pass it to the json reader

Isn't the physical predicate useful for the data path when reading parquet data files?

But I suppose that's actually engine's decision so we probably should (allow it to) pass the predicate

For this, should we eagerly do that, or wait until we actually expose a split between metadata and data path for CDF? Currently it's all done through execute.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the physical predicate useful for the data path when reading parquet data files?

Good point... I was thinking about meta-skipping which doesn't make sense in CDF context.

should we eagerly [expose/pass the predicate]?

Your call... If duckdb picked up CDF support it would likely want to keep the datapath separate, for example.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @scovich for the detailed reply! And generally I agree with that idea of making public. I'm curious though, if the read_json_files already takes a predicate (and i'm not noticing maybe we should rename that to physical_predicate?) how would the API here on the scan be useful?

Copy link
Collaborator

@scovich scovich Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look at the examples that currently access Scan::physical_predicate method. They pass it to the parquet reader they're using to fetch data. Which only works because that method is public.

@OussamaSaoudi-db OussamaSaoudi-db changed the title Add TableChangesScan::schema and make physical_predicate pub Add TableChangesScan::schema method to get logical schema Dec 11, 2024
Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pls update pr desc then good to go :)

@OussamaSaoudi-db OussamaSaoudi-db merged commit be1453f into delta-io:main Dec 11, 2024
18 of 20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants