-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Coalesce
casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion
#10268
Changes from 23 commits
c79156f
a36e6b2
bf16c92
407e3c7
03b9162
4abf29d
4965e8d
81f0235
bae996c
ddf9b1c
c2799ea
2574896
6a17e57
4cba8c5
f1cfb8d
d2e83d3
3a88ad7
03880a3
481f548
dfc4176
46a9060
d656645
5683447
b949fae
5aaeb5b
a968c0e
cf679c5
15471ab
e5cc46b
a810e85
cb16cda
53bedda
a37da2d
70239e0
be116f8
8f4e991
6a8fe6f
20e618e
4153593
030a519
5b797d5
b954479
829b5a2
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -92,14 +92,22 @@ pub enum TypeSignature { | |||||||||||||||||
/// A function such as `concat` is `Variadic(vec![DataType::Utf8, DataType::LargeUtf8])` | ||||||||||||||||||
Variadic(Vec<DataType>), | ||||||||||||||||||
/// One or more arguments of an arbitrary but equal type. | ||||||||||||||||||
/// DataFusion attempts to coerce all argument types to match the first argument's type | ||||||||||||||||||
/// DataFusion attempts to coerce all argument types to match to the common type with comparision coercion. | ||||||||||||||||||
/// | ||||||||||||||||||
/// # Examples | ||||||||||||||||||
/// Given types in signature should be coercible to the same final type. | ||||||||||||||||||
/// A function such as `make_array` is `VariadicEqual`. | ||||||||||||||||||
/// | ||||||||||||||||||
/// `make_array(i32, i64) -> make_array(i64, i64)` | ||||||||||||||||||
VariadicEqual, | ||||||||||||||||||
This comment was marked as outdated.
Sorry, something went wrong. |
||||||||||||||||||
/// One or more arguments of an arbitrary but equal type or Null. | ||||||||||||||||||
/// Non-comparison coercion is attempted to match the signatures. | ||||||||||||||||||
/// | ||||||||||||||||||
/// Functions like `coalesce` is `VariadicEqual`. | ||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am a little confused about what "Non-comparison coercion" means in this situation. Specifically how comparison coercion and non comparision coercion differ 🤔 Does non comparison coercion mean "type union resolution" (aka Also, I think
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes. Actually, I think Ideally, we could shape At least I can tell the dict-coercion is different between these two now. |
||||||||||||||||||
// TODO: Temporary Signature, to differentiate existing VariadicEqual. | ||||||||||||||||||
// After we swtich `make_array` to VariadicEqualOrNull, | ||||||||||||||||||
// we can reuse VariadicEqual. | ||||||||||||||||||
VariadicEqualOrNull, | ||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we need a similar signature but an exact args number for There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe we could do something like this (basically to flavor the type signature) 🤔 pub enum TypeSignature {
...
/// Rather than the usual coercion rules, special type union rules are applied
Union(Box<TypeSignature>)
} There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nice idea! |
||||||||||||||||||
/// One or more arguments with arbitrary types | ||||||||||||||||||
VariadicAny, | ||||||||||||||||||
/// Fixed number of arguments of an arbitrary but equal type out of a list of valid types. | ||||||||||||||||||
|
@@ -193,6 +201,9 @@ impl TypeSignature { | |||||||||||||||||
TypeSignature::VariadicEqual => { | ||||||||||||||||||
vec!["CoercibleT, .., CoercibleT".to_string()] | ||||||||||||||||||
} | ||||||||||||||||||
TypeSignature::VariadicEqualOrNull => { | ||||||||||||||||||
vec!["CoercibleT or NULL, .., CoercibleT or NULL".to_string()] | ||||||||||||||||||
} | ||||||||||||||||||
TypeSignature::VariadicAny => vec!["Any, .., Any".to_string()], | ||||||||||||||||||
TypeSignature::OneOf(sigs) => { | ||||||||||||||||||
sigs.iter().flat_map(|s| s.to_string_repr()).collect() | ||||||||||||||||||
|
@@ -255,13 +266,20 @@ impl Signature { | |||||||||||||||||
volatility, | ||||||||||||||||||
} | ||||||||||||||||||
} | ||||||||||||||||||
/// An arbitrary number of arguments of the same type. | ||||||||||||||||||
/// One or more number of arguments to the same type. | ||||||||||||||||||
pub fn variadic_equal(volatility: Volatility) -> Self { | ||||||||||||||||||
Self { | ||||||||||||||||||
type_signature: TypeSignature::VariadicEqual, | ||||||||||||||||||
volatility, | ||||||||||||||||||
} | ||||||||||||||||||
} | ||||||||||||||||||
/// One or more number of arguments of the same type. | ||||||||||||||||||
pub fn variadic_equal_or_null(volatility: Volatility) -> Self { | ||||||||||||||||||
Self { | ||||||||||||||||||
type_signature: TypeSignature::VariadicEqualOrNull, | ||||||||||||||||||
volatility, | ||||||||||||||||||
} | ||||||||||||||||||
} | ||||||||||||||||||
/// An arbitrary number of arguments of any type. | ||||||||||||||||||
pub fn variadic_any(volatility: Volatility) -> Self { | ||||||||||||||||||
Self { | ||||||||||||||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -17,6 +17,7 @@ | |
|
||
//! Coercion rules for matching argument types for binary operators | ||
|
||
use std::collections::HashSet; | ||
use std::sync::Arc; | ||
|
||
use crate::Operator; | ||
|
@@ -289,15 +290,164 @@ fn bitwise_coercion(left_type: &DataType, right_type: &DataType) -> Option<DataT | |
} | ||
} | ||
|
||
#[derive(Debug, PartialEq, Eq, Hash, Clone)] | ||
enum TypeCategory { | ||
Array, | ||
Boolean, | ||
Numeric, | ||
// String, well-defined type, but are considered as unknown type. | ||
DateTime, | ||
Composite, | ||
Unknown, | ||
NotSupported, | ||
} | ||
|
||
fn data_type_category(data_type: &DataType) -> TypeCategory { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This could also be something like impl From<&DataType> for TypeCategory {
... And then you create a let category = TypeCategory::from(&type) |
||
if data_type.is_numeric() { | ||
return TypeCategory::Numeric; | ||
} | ||
|
||
if matches!(data_type, DataType::Boolean) { | ||
return TypeCategory::Boolean; | ||
} | ||
|
||
if matches!( | ||
data_type, | ||
DataType::List(_) | DataType::FixedSizeList(_, _) | DataType::LargeList(_) | ||
) { | ||
return TypeCategory::Array; | ||
} | ||
|
||
// String literal is possible to cast to many other types like numeric or datetime, | ||
// therefore, it is categorized as a unknown type | ||
if matches!( | ||
data_type, | ||
DataType::Utf8 | DataType::LargeUtf8 | DataType::Null | ||
) { | ||
return TypeCategory::Unknown; | ||
} | ||
|
||
if matches!( | ||
data_type, | ||
DataType::Date32 | ||
| DataType::Date64 | ||
| DataType::Time32(_) | ||
| DataType::Time64(_) | ||
| DataType::Timestamp(_, _) | ||
| DataType::Interval(_) | ||
| DataType::Duration(_) | ||
) { | ||
return TypeCategory::DateTime; | ||
} | ||
|
||
if matches!( | ||
data_type, | ||
DataType::Dictionary(_, _) | DataType::Struct(_) | DataType::Union(_, _) | ||
) { | ||
return TypeCategory::Composite; | ||
} | ||
|
||
TypeCategory::NotSupported | ||
} | ||
|
||
/// Coerce `lhs_type` and `rhs_type` to a common type for the purposes of constructs including | ||
/// CASE, ARRAY, VALUES, and the GREATEST and LEAST functions. | ||
/// See <https://www.postgresql.org/docs/current/typeconv-union-case.html> for more information. | ||
/// The actual rules follows the behavior of Postgres and DuckDB | ||
pub fn type_resolution(data_types: &[DataType]) -> Option<DataType> { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the biggest difference between comparison coercion is that we categorize types. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The doc string is a bit confusing here because there are no lhs_type and rhs_type. I assume that case would be Also, maybe this function name could reflect that it's finding a union type to satisfy a set of input types, for example |
||
if data_types.is_empty() { | ||
return None; | ||
} | ||
|
||
// if all the data_types is the same return first one | ||
if data_types.iter().all(|t| t == &data_types[0]) { | ||
return Some(data_types[0].clone()); | ||
} | ||
|
||
// if all the data_types are null, return string | ||
if data_types.iter().all(|t| t == &DataType::Null) { | ||
return Some(DataType::Utf8); | ||
} | ||
|
||
// Ignore Nulls, if any data_type category is not the same, return None | ||
let data_types_category: Vec<TypeCategory> = data_types | ||
.iter() | ||
.filter(|&t| t != &DataType::Null) | ||
.map(data_type_category) | ||
.collect(); | ||
|
||
if data_types_category | ||
.iter() | ||
.any(|t| t == &TypeCategory::NotSupported) | ||
{ | ||
return None; | ||
} | ||
|
||
// check if there is only one category excluding Unknown | ||
let categories: HashSet<TypeCategory> = HashSet::from_iter( | ||
data_types_category | ||
.iter() | ||
.filter(|&c| c != &TypeCategory::Unknown) | ||
.cloned(), | ||
); | ||
if categories.len() > 1 { | ||
return None; | ||
} | ||
|
||
// Ignore Nulls | ||
let mut candidate_type: Option<DataType> = None; | ||
for data_type in data_types.iter() { | ||
if data_type == &DataType::Null { | ||
continue; | ||
} | ||
if let Some(ref candidate_t) = candidate_type { | ||
// Find candidate type that all the data types can be coerced to | ||
// Follows the behavior of Postgres and DuckDB | ||
// Coerced type may be different from the candidate and current data type | ||
// For example, | ||
// i64 and decimal(7, 2) are expect to get coerced type decimal(22, 2) | ||
// numeric string ('1') and numeric (2) are expect to get coerced type numeric (1, 2) | ||
if let Some(t) = type_resolution_coercion(data_type, candidate_t) { | ||
candidate_type = Some(t); | ||
} else { | ||
return None; | ||
} | ||
} else { | ||
candidate_type = Some(data_type.clone()); | ||
} | ||
} | ||
|
||
candidate_type | ||
} | ||
|
||
/// See [type_resolution] for more information. | ||
fn type_resolution_coercion( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same point here about function naming. I think we should highlight that we're creating a union / unifying input types into a common result type |
||
lhs_type: &DataType, | ||
rhs_type: &DataType, | ||
) -> Option<DataType> { | ||
if lhs_type == rhs_type { | ||
return Some(lhs_type.clone()); | ||
} | ||
|
||
// numeric coercion is the same as comparison coercion, both find the narrowest type | ||
// that can accommodate both types | ||
binary_numeric_coercion(lhs_type, rhs_type) | ||
.or_else(|| pure_string_coercion(lhs_type, rhs_type)) | ||
.or_else(|| numeric_string_coercion(lhs_type, rhs_type)) | ||
} | ||
|
||
/// Coerce `lhs_type` and `rhs_type` to a common type for the purposes of a comparison operation | ||
/// Unlike `coerced_from`, usually the coerced type is for comparison only. | ||
/// For example, compare with Dictionary and Dictionary, only value type is what we care about | ||
pub fn comparison_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> { | ||
if lhs_type == rhs_type { | ||
// same type => equality is possible | ||
return Some(lhs_type.clone()); | ||
} | ||
comparison_binary_numeric_coercion(lhs_type, rhs_type) | ||
binary_numeric_coercion(lhs_type, rhs_type) | ||
.or_else(|| dictionary_coercion(lhs_type, rhs_type, true)) | ||
.or_else(|| temporal_coercion(lhs_type, rhs_type)) | ||
.or_else(|| pure_string_coercion(lhs_type, rhs_type)) | ||
jayzhan211 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
.or_else(|| string_coercion(lhs_type, rhs_type)) | ||
.or_else(|| null_coercion(lhs_type, rhs_type)) | ||
.or_else(|| string_numeric_coercion(lhs_type, rhs_type)) | ||
|
@@ -359,9 +509,8 @@ fn string_temporal_coercion( | |
match_rule(lhs_type, rhs_type).or_else(|| match_rule(rhs_type, lhs_type)) | ||
} | ||
|
||
/// Coerce `lhs_type` and `rhs_type` to a common type for the purposes of a comparison operation | ||
/// where one both are numeric | ||
pub(crate) fn comparison_binary_numeric_coercion( | ||
/// Coerce `lhs_type` and `rhs_type` to a common type where both are numeric | ||
pub(crate) fn binary_numeric_coercion( | ||
lhs_type: &DataType, | ||
rhs_type: &DataType, | ||
) -> Option<DataType> { | ||
|
@@ -375,27 +524,25 @@ pub(crate) fn comparison_binary_numeric_coercion( | |
return Some(lhs_type.clone()); | ||
} | ||
|
||
if let Some(t) = decimal_coercion(lhs_type, rhs_type) { | ||
return Some(t); | ||
} | ||
|
||
// these are ordered from most informative to least informative so | ||
// that the coercion does not lose information via truncation | ||
match (lhs_type, rhs_type) { | ||
// Prefer decimal data type over floating point for comparison operation | ||
(Decimal128(_, _), Decimal128(_, _)) => { | ||
get_wider_decimal_type(lhs_type, rhs_type) | ||
} | ||
(Decimal128(_, _), _) => get_comparison_common_decimal_type(lhs_type, rhs_type), | ||
(_, Decimal128(_, _)) => get_comparison_common_decimal_type(rhs_type, lhs_type), | ||
(Decimal256(_, _), Decimal256(_, _)) => { | ||
get_wider_decimal_type(lhs_type, rhs_type) | ||
} | ||
(Decimal256(_, _), _) => get_comparison_common_decimal_type(lhs_type, rhs_type), | ||
(_, Decimal256(_, _)) => get_comparison_common_decimal_type(rhs_type, lhs_type), | ||
(Float64, _) | (_, Float64) => Some(Float64), | ||
(_, Float32) | (Float32, _) => Some(Float32), | ||
// The following match arms encode the following logic: Given the two | ||
// integral types, we choose the narrowest possible integral type that | ||
// accommodates all values of both types. Note that some information | ||
// loss is inevitable when we have a signed type and a `UInt64`, in | ||
// which case we use `Int64`;i.e. the widest signed integral type. | ||
|
||
// TODO: For i64 and u64, we can use decimal or float64 | ||
// Postgres has no unsigned type :( | ||
// DuckDB v.0.10.0 has double (double precision floating-point number (8 bytes)) | ||
// for largest signed (signed sixteen-byte integer) and unsigned integer (unsigned sixteen-byte integer) | ||
(Int64, _) | ||
| (_, Int64) | ||
| (UInt64, Int8) | ||
|
@@ -426,9 +573,31 @@ pub(crate) fn comparison_binary_numeric_coercion( | |
} | ||
} | ||
|
||
/// Coerce `lhs_type` and `rhs_type` to a common type for the purposes of | ||
/// a comparison operation where one is a decimal | ||
fn get_comparison_common_decimal_type( | ||
/// Decimal coercion rules. | ||
pub(crate) fn decimal_coercion( | ||
lhs_type: &DataType, | ||
rhs_type: &DataType, | ||
) -> Option<DataType> { | ||
use arrow::datatypes::DataType::*; | ||
|
||
match (lhs_type, rhs_type) { | ||
// Prefer decimal data type over floating point for comparison operation | ||
(Decimal128(_, _), Decimal128(_, _)) => { | ||
get_wider_decimal_type(lhs_type, rhs_type) | ||
} | ||
(Decimal128(_, _), _) => get_common_decimal_type(lhs_type, rhs_type), | ||
(_, Decimal128(_, _)) => get_common_decimal_type(rhs_type, lhs_type), | ||
(Decimal256(_, _), Decimal256(_, _)) => { | ||
get_wider_decimal_type(lhs_type, rhs_type) | ||
} | ||
(Decimal256(_, _), _) => get_common_decimal_type(lhs_type, rhs_type), | ||
(_, Decimal256(_, _)) => get_common_decimal_type(rhs_type, lhs_type), | ||
(_, _) => None, | ||
} | ||
} | ||
|
||
/// Coerce `lhs_type` and `rhs_type` to a common type. | ||
fn get_common_decimal_type( | ||
decimal_type: &DataType, | ||
other_type: &DataType, | ||
) -> Option<DataType> { | ||
|
@@ -701,11 +870,12 @@ fn string_concat_internal_coercion( | |
/// to string type. | ||
fn string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> { | ||
use arrow::datatypes::DataType::*; | ||
|
||
if let Some(t) = pure_string_coercion(lhs_type, rhs_type) { | ||
return Some(t); | ||
} | ||
|
||
match (lhs_type, rhs_type) { | ||
(Utf8, Utf8) => Some(Utf8), | ||
(LargeUtf8, Utf8) => Some(LargeUtf8), | ||
(Utf8, LargeUtf8) => Some(LargeUtf8), | ||
(LargeUtf8, LargeUtf8) => Some(LargeUtf8), | ||
// TODO: cast between array elements (#6558) | ||
(List(_), List(_)) => Some(lhs_type.clone()), | ||
(List(_), _) => Some(lhs_type.clone()), | ||
|
@@ -714,6 +884,29 @@ fn string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> | |
} | ||
} | ||
|
||
fn pure_string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> { | ||
use arrow::datatypes::DataType::*; | ||
match (lhs_type, rhs_type) { | ||
(Utf8, Utf8) => Some(Utf8), | ||
(LargeUtf8, Utf8) => Some(LargeUtf8), | ||
(Utf8, LargeUtf8) => Some(LargeUtf8), | ||
(LargeUtf8, LargeUtf8) => Some(LargeUtf8), | ||
_ => None, | ||
} | ||
} | ||
|
||
fn numeric_string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> { | ||
use arrow::datatypes::DataType::*; | ||
match (lhs_type, rhs_type) { | ||
(Utf8 | LargeUtf8, other_type) | (other_type, Utf8 | LargeUtf8) | ||
if other_type.is_numeric() => | ||
{ | ||
Some(other_type.clone()) | ||
} | ||
_ => None, | ||
} | ||
} | ||
|
||
/// Coercion rules for binary (Binary/LargeBinary) to string (Utf8/LargeUtf8): | ||
/// If one argument is binary and the other is a string then coerce to string | ||
/// (e.g. for `like`) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please link to https://docs.rs/datafusion/latest/datafusion/logical_expr/type_coercion/binary/fn.comparison_coercion.html that explains (however limited) what comparison coercion is?