test: interval analysis unit tests #14189
Conversation
datafusion/physical-expr/Cargo.toml (Outdated)

@@ -57,6 +57,7 @@ petgraph = "0.7.1"
 [dev-dependencies]
 arrow = { workspace = true, features = ["test_utils"] }
 criterion = "0.5"
+datafusion = { workspace = true }
I suspect this is the line that caused the circular dependencies check to fail. I added it so I could get the SessionContext in my test for parsing the SQL expressions - perhaps there is a better way to do that...
I think the way to do so would be to put the test into one of the core integration suites: https://github.com/apache/datafusion/blob/main/datafusion/core/tests/core_integration.rs

Then you run it like:

cargo test --test core_integration

Basically, SessionContext is in a different crate that depends on this crate (but not the other way around).
Thanks @hiltontj -- this looks quite cool
Maybe @berkaysynnada knows if there are existing tests and/or where the tests could go
let schema = Arc::new(Schema::new(vec![make_field("a", DataType::Int64)]));
type TestCase = (&'static str, Option<i64>, Option<i64>);
let test_cases: Vec<TestCase> = vec![
    ("a > 10", Some(11), None),
Another approach is to avoid parsing SQL and instead build these expressions programmatically.

Like:

- ("a > 10", Some(11), None),
+ (col("a").gt(lit(10)), Some(11), None),

There are some other examples here: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html#column-references-and-literals
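To make the expected bounds in the test cases above concrete, here is a toy sketch (using a hypothetical `Bounds` alias, not DataFusion's actual `Interval` type) of why the strict predicate `a > 10` on an `Int64` column is expected to produce a lower bound of `Some(11)`: strict integer inequalities are tightened by one unit.

```rust
// Toy model of interval shrinking for integer predicates.
// `(Option<i64>, Option<i64>)` is (lower, upper); `None` means unbounded.
type Bounds = (Option<i64>, Option<i64>);

/// Apply a strict `col > k` predicate: the lower bound becomes `k + 1`
/// for integers (or stays put if it was already tighter); the upper
/// bound is left untouched.
fn apply_gt(bounds: Bounds, k: i64) -> Bounds {
    let (lo, hi) = bounds;
    let new_lo = Some(lo.map_or(k + 1, |l| l.max(k + 1)));
    (new_lo, hi)
}

fn main() {
    // Start fully unbounded, like an unconstrained column `a`.
    let unconstrained: Bounds = (None, None);
    // `a > 10` shrinks the range to [11, +inf), i.e. (Some(11), None).
    let shrunk = apply_gt(unconstrained, 10);
    println!("{:?}", shrunk); // (Some(11), None)
}
```

This mirrors the `("a > 10", Some(11), None)` row of the test-case table, whichever way the expression itself is constructed (SQL string or `col`/`lit` builders).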
I agree, I think it makes more sense to use the expression helpers for two reasons:
- It brings these tests closer to unit tests and further from integration tests, i.e., no need for SessionContext and SQL parsing
- It removes the need for the circular dependency

I don't mind changing them over to use the helpers.
I did just that in a43bf00
Force-pushed from 656f4fd to a43bf00.
Thanks @hiltontj for increasing the coverage.

There are many interval_arithmetic tests in mod tests { }, but since these tests are within the scope of AnalysisContext, I think it's fine to keep them here. This PR has also shown something I don't like: infinite bounds and empty-set results both result in None.
fn test_analyze_invalid_boundary_exprs() {
    let schema = Arc::new(Schema::new(vec![make_field("a", DataType::Int32)]));
    type TestCase = (Expr, &'static str);
    let test_cases: Vec<TestCase> = vec![(
Do you plan to extend this vec?
Originally I set it up this way because I thought the empty-set result would be an error, e.g., the test case in the other test in this PR for:

(a > 10 AND a < 20) AND (a > 20 AND a < 30)

results in None/None instead of an error. (I believe that is what you are referring to by empty-set?)

There are other things I have noticed while using this analyze method in my work at Influx (see influxdata/influxdb#25866, and specifically here). The function produces an error for various expression types, for example IsNotNull.

So, I could extend this test to check a wider range of expression types if that would be helpful, either in this or a follow-up PR. If not, I can trim this test down to that singular case.

I'm open to your recommendation.
To avoid any misunderstanding, I'd like to elaborate on what is in my mind. There are 3 types of analyze results:

- The analysis result shrinks the bounds (a successful evaluation).
- The analysis is again successful, but the bounds could not be shrunk, e.g. (a<0 || a>=0). That will be represented as [None, None], and means a range of [-∞, ∞].
- The analysis result is [None, None], just like in this example, which is actually an empty set. If you evaluate a<0 AND a>0 in interval arithmetic tools, you will get the result of [false, false] (CERTAINLY_FALSE). So, there is nothing wrong there. However, in the scope of AnalysisContext, these are mixed up. The solution would be another Option<> wrapper, representing whether the evaluation comes up with a valid range or the result is an empty set.
- As you mention, there are unsupported types. We are (or should be) giving internal errors for them.

Since you're increasing the test coverage, and there is no bug which should be resolved immediately, we can continue with the current state of this PR. However, if you are willing to solve that possible confusion, I can provide further recommendations in detail (related to the 2nd and 3rd items).
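As a toy illustration of the distinction above (using a hypothetical `AnalysisOutcome` enum and a simplified integer-range intersection, not `AnalysisContext`'s actual types): intersecting `a < 0` with `a > 0` yields an empty set, while an unconstrained column stays unbounded, and the two should not collapse into the same `[None, None]` representation.

```rust
// Toy three-way analysis outcome that keeps "unbounded" distinct
// from "empty set", the ambiguity described in the comment above.
#[derive(Debug, PartialEq)]
enum AnalysisOutcome {
    Shrunk(i64, i64), // a finite range [lo, hi]
    Unbounded,        // a valid range of (-inf, +inf)
    EmptySet,         // the predicates are contradictory
}

/// Intersect two integer ranges, where `None` means unbounded on that side.
fn intersect(
    a: (Option<i64>, Option<i64>),
    b: (Option<i64>, Option<i64>),
) -> AnalysisOutcome {
    // Lower bound: take the tighter (larger) of the two, if any.
    let lo = match (a.0, b.0) {
        (Some(x), Some(y)) => Some(x.max(y)),
        (x, y) => x.or(y),
    };
    // Upper bound: take the tighter (smaller) of the two, if any.
    let hi = match (a.1, b.1) {
        (Some(x), Some(y)) => Some(x.min(y)),
        (x, y) => x.or(y),
    };
    match (lo, hi) {
        (Some(l), Some(h)) if l > h => AnalysisOutcome::EmptySet,
        (Some(l), Some(h)) => AnalysisOutcome::Shrunk(l, h),
        (None, None) => AnalysisOutcome::Unbounded,
        // One-sided ranges are treated coarsely in this sketch.
        _ => AnalysisOutcome::Unbounded,
    }
}

fn main() {
    // `a < 0 AND a > 0` -> [-inf, -1] ∩ [1, +inf] -> empty set, not unbounded.
    println!("{:?}", intersect((None, Some(-1)), (Some(1), None)));
}
```

The extra enum variant plays the role of the suggested additional `Option<>` wrapper: callers can tell a contradictory predicate apart from a predicate that simply did not constrain the column.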
BTW, what I wanted to emphasize in this thread is that you have defined a vector of a single element:

let test_cases: Vec<TestCase> = vec![(
    col("a").lt(lit(10)).or(col("a").gt(lit(20))),
    "Interval arithmetic does not support the operator OR",
)];

I just wanted to point that out.
> The analysis result is [None, None], just like in this example, which is actually an empty set. If you evaluate a<0 AND a>0 in interval arithmetic tools, you will get the result of [false, false] (CERTAINLY_FALSE). So, there is nothing wrong there. However, in the scope of AnalysisContext, these are mixed up. The solution would be another Option<> wrapper, representing whether the evaluation comes up with a valid range or the result is an empty set.

Perhaps it would be best to log an issue for this. I don't know if this is something I can focus on at the moment, but perhaps another contributor would like to pick it up.

Otherwise, I can simplify that test case to remove the vector wrapping the single element, as that is confusing.
@berkaysynnada - I have made an attempt at summarizing your above comment into an issue here: #14226
Otherwise, I simplified that test case in the PR.
Thank you :)
Added unit tests to interval analysis method which converts Expr tree to a set of Intervals for columns in a given schema.
Force-pushed from a43bf00 to 8e4f193.
LGTM, thank you @hiltontj
Added unit tests to interval analysis method which converts Expr tree to a set of Intervals for columns in a given schema.

Which issue does this PR close?

I did not have an issue for this, but I was experimenting with the analyze method for converting Expr trees into sets of Intervals and pushed the resulting tests up, since I did not see any that directly tested the analyze method.

Rationale for this change

I did not see any unit tests for the analyze method. In addition to this example, these provide a bit more of a sense of how the method works for different boundary expressions in queries.

What changes are included in this PR?

Two new unit tests in the datafusion/physical-expr/src/analysis.rs module.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.