Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should pruningpredicate coerce? #14944

Open
ion-elgreco opened this issue Feb 28, 2025 · 4 comments
Open

Should pruningpredicate coerce? #14944

ion-elgreco opened this issue Feb 28, 2025 · 4 comments
Labels
bug Something isn't working

Comments

@ion-elgreco
Copy link

ion-elgreco commented Feb 28, 2025

Describe the bug

Currently when you pass a pruning predicate where the predicate has a different type as the targeted column it will not prune it, even though in theory the value is castable to the target column type.

The predicate looks like this: "month_id = '202502' AND date_id = '20250226'"

However the columns are int column not utf8. So in theory these string values can be casted to int but I don't believe this is happening. Is this something that should be added

This is our rust code btw

// delta-rs/crates/core/src/delta_datafusion/mod.rs
let pruning_predicate = PruningPredicate::try_new(predicate.clone(), logical_schema.clone())?;
let files_to_prune = pruning_predicate.prune(self.snapshot)?;

To Reproduce

Related delta-rs issue: delta-io/delta-rs#3278 (comment)

Expected behavior

Allow literal value coercion during pruning

Additional context

No response

@ion-elgreco ion-elgreco added the bug Something isn't working label Feb 28, 2025
@alamb
Copy link
Contributor

alamb commented Feb 28, 2025

I think the conclusion is that PruningPredicaate expects a properly coerced expression already

You can use SessionContext::create_physical_expr to automatically perform coercsion as shown in this example

// 1. Type coercion with `SessionContext::create_physical_expr` which implicitly applies type coercion before constructing the physical expr.
let physical_expr =
SessionContext::new().create_physical_expr(expr.clone(), &df_schema)?;
assert!(physical_expr.evaluate(&batch).is_ok());

There are a bunch of other examples there about how to apply coercion

@ion-elgreco
Copy link
Author

ion-elgreco commented Feb 28, 2025

@alamb this is actually already done like that, but it still doesn't prune properly:

.map(|expr| context.create_physical_expr(expr, &df_schema).unwrap());

https://github.com/delta-io/delta-rs/blob/main/crates%2Fcore%2Fsrc%2Fdelta_datafusion%2Fmod.rs#L547-L550

@alamb
Copy link
Contributor

alamb commented Feb 28, 2025

🤔 What are the coerced expressions?

If it is like this:

cast(month_id, 'utf8') = '202502'

That is not going to prune because the cast is happening on month_id

The predicate needs to be

month_id = cast('202502')

This is done in this analyzer pass: https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/unwrap_cast_in_comparison.rs

Maybe we should do the same thing in create_physical_expr too 🤔

@ion-elgreco
Copy link
Author

It's like cast(month_id, 'utf8') = '202502', see below:

[crates/core/src/delta_datafusion/mod.rs:551:9] self.filter.clone() = Some(
    BinaryExpr(
        BinaryExpr {
            left: BinaryExpr(
                BinaryExpr {
                    left: Cast(
                        Cast {
                            expr: Column(
                                Column {
                                    relation: None,
                                    name: "month_id",
                                },
                            ),
                            data_type: Utf8,
                        },
                    ),
                    op: Eq,
                    right: Literal(
                        Utf8("202502"),
                    ),
                },
            ),
            op: And,
            right: BinaryExpr(
                BinaryExpr {
                    left: Cast(
                        Cast {
                            expr: Column(
                                Column {
                                    relation: None,
                                    name: "date_id",
                                },
                            ),
                            data_type: Utf8,
                        },
                    ),
                    op: Eq,
                    right: Literal(
                        Utf8("20250226"),
                    ),
                },
            ),
        },
    ),
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants