Helper methods for CDF Physical to Logical Transformation #579

Merged
merged 58 commits into from
Dec 10, 2024

Conversation

OussamaSaoudi-db
Collaborator

@OussamaSaoudi-db OussamaSaoudi-db commented Dec 9, 2024

What changes are proposed in this pull request?

This PR introduces helper methods for reading CDF data and transforming it from its physical form into the logical schema. The methods are:

  • get_cdf_columns: Generates a map from cdf column name to expression that fills that column
  • physical_to_logical_expression: Generates the physical to logical expression used to transform the engine data
  • scan_file_read_schema: Gets the physical schema. This depends on the cdf scan file type
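As a rough illustration of the first helper, here is a minimal sketch of a map from CDF column name to the expression that fills it. The `Expr`, `CdfScanFileType`, and `CdfScanFile` types below are simplified stand-ins invented for this example, not the kernel's real types, and the function signature is an assumption. The "insert" and "delete" literals follow the Delta protocol's convention for add and remove files, while cdc files already carry a `_change_type` column.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the kernel's expression type (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// Read the named column from the physical data.
    Column(String),
    /// A constant string value.
    StringLiteral(String),
    /// A constant 64-bit integer value.
    LongLiteral(i64),
    /// A constant timestamp given in milliseconds since the epoch.
    TimestampMillis(i64),
}

/// The kind of file a CDF scan reads (stand-in).
#[derive(Debug, Clone, Copy, PartialEq)]
enum CdfScanFileType {
    Add,
    Remove,
    Cdc,
}

/// Stand-in for a CDF scan file; only the fields this sketch needs.
struct CdfScanFile {
    scan_type: CdfScanFileType,
    commit_version: i64,
    commit_timestamp: i64,
}

/// Build a map from CDF column name to the expression that fills it.
fn get_cdf_columns(file: &CdfScanFile) -> HashMap<&'static str, Expr> {
    // Add and remove files have no `_change_type` column, so a literal is used.
    // Cdc files store the change type in the data itself.
    let change_type = match file.scan_type {
        CdfScanFileType::Add => Expr::StringLiteral("insert".to_string()),
        CdfScanFileType::Remove => Expr::StringLiteral("delete".to_string()),
        CdfScanFileType::Cdc => Expr::Column("_change_type".to_string()),
    };
    HashMap::from([
        ("_change_type", change_type),
        ("_commit_version", Expr::LongLiteral(file.commit_version)),
        ("_commit_timestamp", Expr::TimestampMillis(file.commit_timestamp)),
    ])
}
```

In this sketch the commit version and timestamp are always literals taken from the commit, so only `_change_type` depends on the scan file type.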

We also introduce the method Scalar::timestamp_ntz_from_millis which converts from an i64 millisecond value to a Scalar::TimestampNtz.
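A minimal sketch of what such a conversion could look like, assuming `TimestampNtz` stores microseconds since the Unix epoch. The one-variant `Scalar` enum, the `Result<_, String>` return type, and the overflow handling are all assumptions made for this example and may differ from the kernel's actual implementation.

```rust
/// Stand-in for the kernel's `Scalar`; only the variant needed here.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    /// Microseconds since the Unix epoch, with no time zone (assumed unit).
    TimestampNtz(i64),
}

impl Scalar {
    /// Convert milliseconds since the epoch into a `TimestampNtz` scalar.
    /// Returns an error if the value does not fit in microseconds.
    fn timestamp_ntz_from_millis(millis: i64) -> Result<Scalar, String> {
        millis
            .checked_mul(1_000) // milliseconds -> microseconds, guarding overflow
            .map(Scalar::TimestampNtz)
            .ok_or_else(|| format!("timestamp {millis} ms overflows i64 microseconds"))
    }
}
```

Using `checked_mul` rather than a plain multiplication keeps an out-of-range commit timestamp from silently wrapping.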

How was this change tested?

We test that physical_to_logical_expression generates the correct expression for a CdfScanFile with:

  • partition columns
  • normal selected columns
  • a generated expression for the _change_type column, for each of the add, cdc, and remove CdfScanFile types.

@OussamaSaoudi-db
Collaborator Author

Tests are currently failing because they depend on the FileMeta fix.

Collaborator

@zachschuermann zachschuermann left a comment

added a few more comments (should be quick) but still LGTM

Comment on lines 237 to 239
ColumnType::Selected(CHANGE_TYPE_COL_NAME.to_string()),
ColumnType::Selected(COMMIT_VERSION_COL_NAME.to_string()),
ColumnType::Selected(COMMIT_TIMESTAMP_COL_NAME.to_string()),
Collaborator

actually (and sorry for the churn) I think I like the literal strings here: this lets us check that they are the actual strings we expect in the protocol (the failure case we prevent is someone changing the const, after which this test would change along with it and still pass)

Collaborator Author

aha that makes sense. So we get to assert more about the code through this test 👍
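To make the point in this thread concrete, here is a small sketch. The constants mirror the names discussed above, but `cdf_column_names` is a hypothetical helper invented for this example. The test compares against literal strings, so an accidental edit to a constant makes the test fail instead of silently following the change.

```rust
// Constants as they might appear in the crate (values are the CDF column
// names from the Delta protocol).
const CHANGE_TYPE_COL_NAME: &str = "_change_type";
const COMMIT_VERSION_COL_NAME: &str = "_commit_version";
const COMMIT_TIMESTAMP_COL_NAME: &str = "_commit_timestamp";

/// Hypothetical helper that lists the CDF metadata columns.
fn cdf_column_names() -> Vec<String> {
    vec![
        CHANGE_TYPE_COL_NAME.to_string(),
        COMMIT_VERSION_COL_NAME.to_string(),
        COMMIT_TIMESTAMP_COL_NAME.to_string(),
    ]
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn cdf_columns_match_protocol_names() {
        // Literals pin the protocol strings. Had the expected values been
        // written with the constants, renaming a constant's value would
        // change both sides of the assertion and the test would still pass.
        assert_eq!(
            cdf_column_names(),
            vec!["_change_type", "_commit_version", "_commit_timestamp"]
        );
    }
}
```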

scan_file: &CdfScanFile,
global_state: &GlobalScanState,
logical_schema: &StructType,
Collaborator

nice! i like taking the more constrained data

@@ -198,6 +213,7 @@ mod tests {
use crate::expressions::{column_expr, Scalar};
use crate::scan::ColumnType;
use crate::schema::{DataType, StructField, StructType};
use crate::table_changes::COMMIT_VERSION_COL_NAME;
Collaborator

can just do use super::*?

Collaborator Author

Doesn't seem like it. super would point to the scan module, but we need its parent.
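A small sketch of the module layout being described, with hypothetical contents: a constant defined in `table_changes`, and a `tests` module nested inside `table_changes::scan`. `use super::*` only brings in what `scan` itself has in scope, so the parent's constant needs its own import. The sketch uses `super::super::` to stay self-contained; inside the real crate the absolute `crate::table_changes::...` path from the diff would name the same item.

```rust
mod table_changes {
    pub const COMMIT_VERSION_COL_NAME: &str = "_commit_version";

    pub mod scan {
        /// Hypothetical function living in the `scan` module. It refers to
        /// the parent's constant by path without importing it.
        pub fn selected_columns() -> Vec<String> {
            vec![super::COMMIT_VERSION_COL_NAME.to_string()]
        }

        #[cfg(test)]
        mod tests {
            // Brings in `selected_columns`: `super` here is `scan`.
            use super::*;
            // The constant lives one level further up, in `table_changes`,
            // and `scan` never imported it, so the glob above misses it.
            use super::super::COMMIT_VERSION_COL_NAME;

            #[test]
            fn selects_commit_version() {
                assert_eq!(
                    selected_columns(),
                    vec![COMMIT_VERSION_COL_NAME.to_string()]
                );
            }
        }
    }
}
```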

@OussamaSaoudi-db OussamaSaoudi-db changed the title from "Perform Physical to Logical Transformation for CDF" to "Helper methods for CDF Physical to Logical Transformation" Dec 10, 2024
@OussamaSaoudi-db OussamaSaoudi-db merged commit c1b202a into delta-io:main Dec 10, 2024
18 of 20 checks passed