Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: interval analysis unit tests #14189

Merged
merged 1 commit into from
Jan 22, 2025

Conversation

hiltontj
Copy link
Contributor

Added unit tests to interval analysis method which converts Expr tree to a set of Intervals for columns in a given schema.

Which issue does this PR close?

I did not have an issue for this, but I was experimenting with the analyze method for converting Expr trees into sets of Intervals and pushed the resulting tests up, since I did not see any that directly tested the analyze method.

Rationale for this change

I did not see any unit tests for the analyze method. In addition to this example, these provide a bit more of a sense of how the method works for different boundary expressions in queries.

What changes are included in this PR?

Two new unit tests in the datafusion/physical-expr/src/analysis.rs module.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the physical-expr Physical Expressions label Jan 18, 2025
@@ -57,6 +57,7 @@ petgraph = "0.7.1"
[dev-dependencies]
arrow = { workspace = true, features = ["test_utils"] }
criterion = "0.5"
datafusion = { workspace = true }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect this is the line that caused the circular dependencies check to fail. I added it so I could get the SessionContext in my test for parsing the SQL expressions - perhaps there is a better way to do that...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the way to do so would be to put the test into one of the core integration suites: https://github.com/apache/datafusion/blob/main/datafusion/core/tests/core_integration.rs

Then you run it like

cargo test --test core_integration

Basically SessionContext is in a different crate that depends on this crate (but not the other way around)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hiltontj -- this looks quite cool

Maybe @berkaysynnada knows if there are existing tests and/or where the tests could go

@@ -57,6 +57,7 @@ petgraph = "0.7.1"
[dev-dependencies]
arrow = { workspace = true, features = ["test_utils"] }
criterion = "0.5"
datafusion = { workspace = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the way to do so would be to put the test into one of the core integration suites: https://github.com/apache/datafusion/blob/main/datafusion/core/tests/core_integration.rs

Then you run it like

cargo test --test core_integration

Basically SessionContext is in a different crate that depends on this crate (but not the other way around)

let schema = Arc::new(Schema::new(vec![make_field("a", DataType::Int64)]));
type TestCase = (&'static str, Option<i64>, Option<i64>);
let test_cases: Vec<TestCase> = vec![
("a > 10", Some(11), None),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another approach is to avoid parsing SQL and insteadl build these expressions programatically

Like

Suggested change
("a > 10", Some(11), None),
(col(a).gt(lit(10), Some(11), None),

There are some other examples here: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html#column-references-and-literals

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I think it makes more sense to use the expression helpers for two reasons:

  • It brings these tests closer to unit tests and further from integration tests, i.e., no need for SessionContext and SQL parsing
  • It removes the need for the circular dependency

I don't mind changing them over to use the helpers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did just that in a43bf00

@hiltontj hiltontj force-pushed the hiltontj/analyze-interval-tests branch from 656f4fd to a43bf00 Compare January 19, 2025 14:01
Copy link
Contributor

@berkaysynnada berkaysynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hiltontj increasing the coverage.

There are many interval_arithmetic test in

but since these tests are from the scope of AnalysisContext, I think it's fine to keep them here. This PR also has shown something I don't like; infinite bounds and empty-set results are both result with None

fn test_analyze_invalid_boundary_exprs() {
let schema = Arc::new(Schema::new(vec![make_field("a", DataType::Int32)]));
type TestCase = (Expr, &'static str);
let test_cases: Vec<TestCase> = vec![(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you plan to extend this vec?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally I set it up this way because I thought the empty-set result would be an error, e.g., the test case in the other test in this PR for:

(a > 10 AND a < 20) AND (a > 20 AND a < 30)

Results in None/None instead of an error. (I believe that is what you are referring to by empty-set?)

There are other things I have noticed while using this analyze method in my work at Influx (see influxdata/influxdb#25866, and specifically here). The function produces an error for various expression types. For example IsNotNull.

So, I could extend this test to check a wider range of expression types if that would be helpful - either on this or a following PR. If not, I can trim this test down to that singular case.

I'm open to your recommendation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid any misunderstanding, I'd like to elaborate what is in my mind: There are 3 types of analyze result:

  1. The analysis result will shrink the bounds (a successful evaluation)
  2. The analysis result will be successful again, but the bounds could not be shrunk ,e.g. (a<0 || a>=0). That will be represented as [None, None], and means range of [-∞, ∞].
  3. The analysis results will be [None,None], just like in this example, which is an empty-set actually. If you evaluate a<0 AND a>0 in interval arithmetic tools, you will get the result of [false, false] (CERTAINLY_FALSE). So, there is nothing wrong there. However, in the scope of AnalysisContext, these are mixed up. The solution would be an another Option<> wrapper, representing the evaluation comes up with a valid range, or the result is an empty-set.
  4. As you mention, there are unsupported types. We are (or should) giving internal errors for them.

As you're increasing the test coverage, and there is not a bug which should be resolved immediately, we can continue with the current state of this PR. However, if you are willing to solve that possible confusion, I can provide further recommendation in detail (related with the 2nd and 3rd items).

BTW, What I wanted to emphasize in this thread, you have defined a vector of single element

    let test_cases: Vec<TestCase> = vec![(
         col("a").lt(lit(10)).or(col("a").gt(lit(20))),
             "Interval arithmetic does not support the operator OR",
         )];

I just wanted to point out that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The analysis results will be [None,None], just like in this example, which is an empty-set actually. If you evaluate a<0 AND a>0 in interval arithmetic tools, you will get the result of [false, false] (CERTAINLY_FALSE). So, there is nothing wrong there. However, in the scope of AnalysisContext, these are mixed up. The solution would be an another Option<> wrapper, representing the evaluation comes up with a valid range, or the result is an empty-set.

Perhaps it would be best to log an issue for this. I don't know if this is something I can focus on at the moment, but perhaps another contributor will like to pick it up.


Otherwise, I can simplify that test case to remove the vector wrapping the single element as that is confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@berkaysynnada - I have made an attempt at summarizing your above comment into an issue here: #14226

Otherwise, I simplified that test case in the PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you :)

Added unit tests to interval analysis method which converts Expr tree
to a set of Intervals for columns in a given schema.
Copy link
Contributor

@berkaysynnada berkaysynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you @hiltontj

@berkaysynnada berkaysynnada merged commit 5edb276 into apache:main Jan 22, 2025
25 checks passed
@hiltontj hiltontj deleted the hiltontj/analyze-interval-tests branch January 22, 2025 10:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants