Introduce TypeSignature::Comparable and update NullIf signature #13356

Merged
merged 8 commits into from
Nov 22, 2024
Changes from 4 commits
34 changes: 28 additions & 6 deletions datafusion/expr-common/src/signature.rs
@@ -18,7 +18,7 @@
//! Signature module contains foundational types that are used to represent signatures, types,
//! and return types of functions in DataFusion.

use crate::type_coercion::aggregates::{NUMERICS, STRINGS};
use crate::type_coercion::aggregates::NUMERICS;
use arrow::datatypes::DataType;
use datafusion_common::types::{LogicalTypeRef, NativeType};
use itertools::Itertools;
@@ -134,10 +134,19 @@ pub enum TypeSignature {
/// Null is considered as `Utf8` by default
/// Dictionary with string value type is also handled.
String(usize),
/// Fixed number of arguments of boolean types.
Boolean(usize),
/// Zero argument
NullAry,
}

impl TypeSignature {
#[inline]
pub fn is_one_of(&self) -> bool {
matches!(self, TypeSignature::OneOf(_))
}
}

#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Hash)]
pub enum ArrayFunctionSignature {
/// Specialized Signature for ArrayAppend and similar functions
@@ -204,6 +213,9 @@ impl TypeSignature {
.collect::<Vec<String>>()
.join(", ")]
}
TypeSignature::Boolean(num) => {
vec![format!("Boolean({num})")]
}
TypeSignature::String(num) => {
vec![format!("String({num})")]
}
@@ -284,11 +296,13 @@ impl TypeSignature {
.cloned()
.map(|numeric_type| vec![numeric_type; *arg_count])
.collect(),
TypeSignature::String(arg_count) => STRINGS
.iter()
.cloned()
.map(|string_type| vec![string_type; *arg_count])
.collect(),
TypeSignature::String(arg_count) => get_data_types(&NativeType::String)
.into_iter()
.map(|dt| vec![dt; *arg_count])
.collect::<Vec<_>>(),
Comment on lines +306 to +309
Member

👍

TypeSignature::Boolean(arg_count) => {
vec![vec![DataType::Boolean; *arg_count]]
}
// TODO: Implement for other types
TypeSignature::Any(_)
| TypeSignature::NullAry
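
The `String` and `Boolean` arms above both build one argument list per candidate type, with that type repeated `arg_count` times. A minimal standalone sketch of that expansion follows. It uses plain strings as stand-ins for Arrow `DataType` values, the helper name `expand` is hypothetical, and the string candidates are an assumption taken from the list this PR removes from `nullif.rs`, not the confirmed output of `get_data_types`.

```rust
/// Repeats each candidate type `arg_count` times, producing one
/// argument list per candidate. Type names are plain strings here,
/// standing in for Arrow `DataType` values.
fn expand(candidates: &[&'static str], arg_count: usize) -> Vec<Vec<&'static str>> {
    candidates.iter().map(|dt| vec![*dt; arg_count]).collect()
}

fn main() {
    // Assumed string candidates (hypothetical; see the note above).
    let strings = ["Utf8View", "Utf8", "LargeUtf8"];
    println!("{:?}", expand(&strings, 2));

    // Boolean has a single physical type, so there is exactly one list.
    println!("{:?}", expand(&["Boolean"], 2));
}
```

With a single candidate, as in the `Boolean` arm, the result is always a one-element outer vector.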
@@ -379,6 +393,14 @@ impl Signature {
}
}

/// A specified number of boolean arguments
pub fn boolean(arg_count: usize, volatility: Volatility) -> Self {
Self {
type_signature: TypeSignature::Boolean(arg_count),
volatility,
}
}

/// An arbitrary number of arguments of any type.
pub fn variadic_any(volatility: Volatility) -> Self {
Self {
37 changes: 35 additions & 2 deletions datafusion/expr/src/type_coercion/functions.rs
@@ -180,6 +180,7 @@ fn is_well_supported_signature(type_signature: &TypeSignature) -> bool {
| TypeSignature::Numeric(_)
| TypeSignature::String(_)
| TypeSignature::Coercible(_)
| TypeSignature::Boolean(_)
| TypeSignature::Any(_)
| TypeSignature::NullAry
)
@@ -194,13 +195,18 @@ fn try_coerce_types(

// Well-supported signature that returns exact valid types.
if !valid_types.is_empty() && is_well_supported_signature(type_signature) {
// exact valid types
assert_eq!(valid_types.len(), 1);
// There may be many valid types if valid signature is OneOf
// Otherwise, there should be only one valid type
if !type_signature.is_one_of() {
assert_eq!(valid_types.len(), 1);
}

let valid_types = valid_types.swap_remove(0);
if let Some(t) = maybe_data_types_without_coercion(&valid_types, current_types) {
return Ok(t);
}
} else {
// TODO: Deprecate this branch after all signatures are well-supported (aka coercion are happend already)
Contributor

Suggested change
// TODO: Deprecate this branch after all signatures are well-supported (aka coercion are happend already)
// TODO: Deprecate this branch after all signatures are well-supported (aka coercion has happened already)

// Try and coerce the argument types to match the signature, returning the
// coerced types from the first matching signature.
for valid_types in valid_types {
@@ -477,6 +483,33 @@ fn get_valid_types(

vec![vec![base_type_or_default_type(&coerced_type); *number]]
}
TypeSignature::Boolean(number) => {
function_length_check(current_types.len(), *number)?;

// Find common boolean type amongs given types
Contributor

Suggested change
// Find common boolean type amongs given types
// Find common boolean type amongst the given types

let mut valid_type = current_types.first().unwrap().to_owned();
Contributor

Unless I'm mistaken there is only one possible valid type here - Boolean doesn't have multiple types, does it? If so, I don't see the need for this variable nor the code below the for loop. valid_type must be DataType::Boolean, no?

Contributor Author

This ensures that the given types are all Boolean; if any of them is not, we should return an error here. If we get Null, we return Boolean.

for t in current_types.iter().skip(1) {
let logical_data_type: NativeType = t.into();
if logical_data_type == NativeType::Null {
continue;
}

if logical_data_type != NativeType::Boolean {
return plan_err!(
"The signature expected NativeType::Boolean but received {logical_data_type}"
);
}

valid_type = t.to_owned();
}

let logical_data_type: NativeType = valid_type.clone().into();
if logical_data_type == NativeType::Null {
valid_type = DataType::Boolean;
}

vec![vec![valid_type; *number]]
}
TypeSignature::Numeric(number) => {
function_length_check(current_types.len(), *number)?;
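
The intent stated in the review thread for the `Boolean` arm (every argument must be Boolean or Null, and Null resolves to Boolean) can be sketched standalone as below. This is an illustration under assumptions, not the PR's code: `Ty` is a stand-in for `NativeType`, `boolean_valid_types` is a hypothetical name, and the sketch checks every argument including the first, whereas the loop in the diff starts at the second argument.

```rust
// Stand-in for the logical types used in the diff; NOT the real
// `datafusion_common::types::NativeType`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Ty {
    Null,
    Boolean,
    Int64,
}

/// Returns the coerced argument types for a hypothetical `Boolean(n)`
/// signature: the argument count must match, every argument must be
/// Boolean or Null, and the result is always all-Boolean.
fn boolean_valid_types(current: &[Ty], expected_len: usize) -> Result<Vec<Ty>, String> {
    if current.len() != expected_len {
        return Err(format!(
            "expected {expected_len} arguments, got {}",
            current.len()
        ));
    }
    for t in current {
        if !matches!(t, Ty::Null | Ty::Boolean) {
            return Err(format!("expected Boolean but received {t:?}"));
        }
    }
    Ok(vec![Ty::Boolean; expected_len])
}

fn main() {
    // Null is accepted and promoted to Boolean.
    println!("{:?}", boolean_valid_types(&[Ty::Null, Ty::Boolean], 2));
    // A non-Boolean argument is rejected.
    println!("{:?}", boolean_valid_types(&[Ty::Boolean, Ty::Int64], 2));
}
```

Because Boolean has only one physical type, the successful result does not depend on which argument was inspected last, which is the point the reviewer raises above.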

41 changes: 9 additions & 32 deletions datafusion/functions/src/core/nullif.rs
@@ -17,7 +17,7 @@

use arrow::datatypes::DataType;
use datafusion_common::{exec_err, Result};
use datafusion_expr::{ColumnarValue, Documentation};
use datafusion_expr::{ColumnarValue, Documentation, TypeSignature};

use arrow::compute::kernels::cmp::eq;
use arrow::compute::kernels::nullif::nullif;
@@ -32,26 +32,6 @@ pub struct NullIfFunc {
signature: Signature,
}

/// Currently supported types by the nullif function.
/// The order of these types correspond to the order on which coercion applies
/// This should thus be from least informative to most informative
static SUPPORTED_NULLIF_TYPES: &[DataType] = &[
DataType::Boolean,
DataType::UInt8,
DataType::UInt16,
DataType::UInt32,
DataType::UInt64,
DataType::Int8,
DataType::Int16,
DataType::Int32,
DataType::Int64,
DataType::Float32,
DataType::Float64,
DataType::Utf8View,
DataType::Utf8,
DataType::LargeUtf8,
];

impl Default for NullIfFunc {
fn default() -> Self {
Self::new()
@@ -61,9 +41,13 @@ impl Default for NullIfFunc {
impl NullIfFunc {
pub fn new() -> Self {
Self {
signature: Signature::uniform(
2,
SUPPORTED_NULLIF_TYPES.to_vec(),
signature: Signature::one_of(
// Hack: String is at the beginning so the return type is String if both args are Nulls
Contributor

@Omega359, Nov 13, 2024

Not sure I'd call that a hack tbh

vec![
TypeSignature::String(2),
TypeSignature::Numeric(2),
TypeSignature::Boolean(2),
],
Volatility::Immutable,
),
}
@@ -83,14 +67,7 @@ impl ScalarUDFImpl for NullIfFunc {
}

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
// NULLIF has two args and they might get coerced, get a preview of this
let coerced_types = datafusion_expr::type_coercion::functions::data_types(
arg_types,
&self.signature,
);
coerced_types
.map(|typs| typs[0].clone())
.map_err(|e| e.context("Failed to coerce arguments for NULLIF"))
Ok(arg_types[0].to_owned())
}

fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
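
The comment in the new signature says `String` is listed first so that `NULLIF(NULL, NULL)` resolves to a string type. The sketch below illustrates why order matters when alternatives are tried in sequence and the first match wins. It is a simplified model under assumptions: `Ty`, `Sig`, `accepted`, `try_match`, and `one_of` are hypothetical stand-ins, and the rule that an all-Null call takes the first accepted type of a signature is an assumption made for illustration, not DataFusion's confirmed behavior.

```rust
// Stand-ins for illustration only; the real types live in DataFusion/Arrow.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Ty {
    Null,
    Utf8,
    Int64,
    Float64,
    Boolean,
}

#[derive(Debug, Clone, Copy)]
enum Sig {
    String,
    Numeric,
    Boolean,
}

/// Types each hypothetical signature accepts, ordered narrowest to widest.
fn accepted(sig: Sig) -> &'static [Ty] {
    match sig {
        Sig::String => &[Ty::Utf8],
        Sig::Numeric => &[Ty::Int64, Ty::Float64],
        Sig::Boolean => &[Ty::Boolean],
    }
}

/// Returns the common type for both arguments, or None if the signature
/// does not match. Null matches anything; an all-Null call falls back to
/// the first accepted type (an assumption of this sketch).
fn try_match(sig: Sig, args: [Ty; 2]) -> Option<Ty> {
    let types = accepted(sig);
    let mut widest: Option<usize> = None;
    for t in args {
        if t == Ty::Null {
            continue;
        }
        let rank = types.iter().position(|a| *a == t)?;
        widest = Some(widest.map_or(rank, |w| w.max(rank)));
    }
    Some(types[widest.unwrap_or(0)])
}

/// Tries each alternative in order and returns the first match.
fn one_of(sigs: &[Sig], args: [Ty; 2]) -> Option<Ty> {
    sigs.iter().find_map(|s| try_match(*s, args))
}

fn main() {
    let order = [Sig::String, Sig::Numeric, Sig::Boolean];
    // Both arguments Null: the first alternative (String) wins.
    assert_eq!(one_of(&order, [Ty::Null, Ty::Null]), Some(Ty::Utf8));
    // Mixed numerics widen within the Numeric alternative.
    assert_eq!(one_of(&order, [Ty::Int64, Ty::Float64]), Some(Ty::Float64));
    // An integer and a string match no alternative.
    assert_eq!(one_of(&order, [Ty::Int64, Ty::Utf8]), None);
}
```

In this model, putting `Numeric` first would make the all-Null call resolve to a numeric type instead, which is the ordering sensitivity the inline comment describes.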
28 changes: 26 additions & 2 deletions datafusion/sqllogictest/test_files/nullif.slt
@@ -97,11 +97,35 @@ SELECT NULLIF(1, 3);
----
1

query I
query T
SELECT NULLIF(NULL, NULL);
----
NULL

query R
select nullif(1, 1.2);
----
1

query R
select nullif(1.0, 2);
----
1

query error DataFusion error: Error during planning: Internal error: Failed to match any signature, errors: Error during planning: The signature expected NativeType::String but received NativeType::Int64
Contributor

This used to work, I just ran this locally against v43. I can't see a reason why this should no longer be supported.

Contributor Author

select nullif(2, 'a');

This query fails in Postgres, so I think we should return an error for it.

select nullif(2, 'a');


query T
select nullif('2', '3');
----
2

# TODO: support numeric string
# This query succeeds in Postgres and DuckDB
query error DataFusion error: Error during planning: Internal error: Failed to match any signature, errors: Error during planning: The signature expected NativeType::String but received NativeType::Int64
select nullif(2, '1');
Contributor

@Omega359, Nov 13, 2024

This also used to work.

query T
select nullif(2, '1');
----
2

Interesting that the type is text vs number but still, it did work.

Contributor Author

@jayzhan211, Nov 13, 2024

This is quite hard to fix, since we currently can't tell the difference between the Unknown type and the String type.

This query passes on main too, but it should not.

query T
select nullif('1'::varchar, 2);
----
1

The change does cause a regression on nullif(2, '1'), but it also fixes nullif('1'::varchar, 2).

Contributor Author

Similar issue in #13240 (comment)

Member

I don't have strong opinions against this PR because it fixes nullif('1'::varchar, 2) and refines the TypeSignature. We could file a ticket for it and address it in a follow-up.

Contributor Author

Tracked in #13285

Contributor Author

DuckDB can run this query:

D select nullif('2'::varchar, 2);
┌─────────────────────────────────┐
│ nullif(CAST('2' AS VARCHAR), 2) │
│             varchar             │
├─────────────────────────────────┤
│                                 │
└─────────────────────────────────┘

😕

Contributor

To me what is important is not strictly what other systems support - those may/could be a guide to what datafusion will support - but whether the provided arguments to a signature can be losslessly converted to what the signature accepts and whether it logically makes sense to do so.

I personally would rather be lenient for what is accepted and do casting/coercion as required than to be strict and push the onus onto the user to do that. That's just me though, I don't know if that is the general consensus of the community. Perhaps we should file a discussion ticket with the options and decide?

Contributor

I just double checked on this branch and it seems good to me (supports things reasonably)

     Running `target/debug/datafusion-cli`
DataFusion CLI v43.0.0
> select nullif(2, '1');
+----------------------------+
| nullif(Int64(2),Utf8("1")) |
+----------------------------+
| 2                          |
+----------------------------+
1 row(s) fetched.
Elapsed 0.027 seconds.

> select nullif('1'::varchar, 2);
+----------------------------+
| nullif(Utf8("1"),Int64(2)) |
+----------------------------+
| 1                          |
+----------------------------+
1 row(s) fetched.
Elapsed 0.007 seconds.


query T
SELECT NULLIF(arrow_cast('a', 'Utf8View'), 'a');
----
@@ -130,4 +154,4 @@ NULL
query T
SELECT NULLIF(arrow_cast('a', 'Utf8View'), null);
----
a
a
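
The expected results in these tests follow the usual SQL `NULLIF` semantics: the result is NULL when both arguments are non-NULL and equal, and the first argument otherwise. A minimal sketch over `Option`, with `None` standing in for SQL NULL, is shown below. It models only the value semantics once both arguments share a type, not DataFusion's type coercion or its Arrow kernel.

```rust
/// NULLIF over nullable values, with `None` standing in for SQL NULL.
/// Returns None when both inputs are present and equal; otherwise
/// returns the first input unchanged.
fn nullif<T: PartialEq>(a: Option<T>, b: Option<T>) -> Option<T> {
    let equal = matches!((&a, &b), (Some(x), Some(y)) if x == y);
    if equal {
        None
    } else {
        a
    }
}

fn main() {
    // Mirrors `SELECT NULLIF(1, 3)`.
    assert_eq!(nullif(Some(1), Some(3)), Some(1));
    // Equal arguments yield NULL.
    assert_eq!(nullif(Some("a"), Some("a")), None);
    // A NULL second argument leaves the first argument untouched.
    assert_eq!(nullif(Some("a"), None), Some("a"));
    // Mirrors `SELECT NULLIF(NULL, NULL)`.
    assert_eq!(nullif(None::<i32>, None), None);
}
```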