Decimal multiply kernel should not cause precision loss #5980

Merged (17 commits) on Apr 20, 2023

Conversation

viirya (Member) commented Apr 12, 2023

Which issue does this PR close?

Closes #5674.
Closes #3387.
Closes #4024.

Rationale for this change

Currently, decimal multiplication in DataFusion silently truncates the precision of the result. This happens for ordinary decimal multiplication that doesn't overflow. DataFusion appears to use an incomplete version of Spark's decimal precision coercion rule to coerce the two sides of a decimal multiplication (and of other arithmetic operators). The coerced type of the two sides is not the result type of the multiplication. This, together with how we compute decimal multiplication in the kernels, leads to truncated precision in the result decimal type.
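As a rough illustration of the rationale above (a sketch, not DataFusion's or arrow's actual kernel code): under the Spark-style rule, `Decimal(p1, s1) * Decimal(p2, s2)` yields `Decimal(p1 + p2 + 1, s1 + s2)`, capped here at the 38-digit maximum of a 128-bit decimal. All helper names below are hypothetical, and the truncating variant only approximates what happens when the product is forced back into the operands' scale.

```rust
/// Maximum precision of a 128-bit decimal.
const MAX_PRECISION: u8 = 38;

/// Spark-style result type of `Decimal(p1, s1) * Decimal(p2, s2)`.
/// Hypothetical helper for illustration only.
fn multiply_result_type(p1: u8, s1: u8, p2: u8, s2: u8) -> (u8, u8) {
    let precision = (p1 + p2 + 1).min(MAX_PRECISION);
    let scale = (s1 + s2).min(MAX_PRECISION);
    (precision, scale)
}

/// Multiplies two unscaled values and keeps the operands' scale by
/// dividing the extra `10^scale` factor out. Integer division drops
/// the low-order digits, which is the silent truncation.
fn multiply_keep_operand_scale(a: i128, b: i128, scale: u32) -> i128 {
    (a * b) / 10_i128.pow(scale)
}

/// Multiplies two unscaled values; the result is read at scale `s1 + s2`,
/// so no digits are dropped (overflow is ignored in this sketch).
fn multiply_full_scale(a: i128, b: i128) -> i128 {
    a * b
}
```

For example, `1.23 * 4.56` with both operands as `Decimal(10, 2)` is stored as `123 * 456 = 56088`. Read at scale 4 this is `5.6088`; forced back to scale 2 it becomes `560`, that is `5.60`.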

What changes are included in this PR?

  • Moved decimal type coercion for math binary operators from TypeCoercion to physical binary operator
  • Fixed type coercion rule for decimal
    • Produced correct coerced types
    • Separated result type from coerced type

Are these changes tested?

Are there any user-facing changes?

@viirya viirya marked this pull request as draft April 12, 2023 20:01
@github-actions github-actions bot added the logical-expr (Logical plan and expressions), optimizer (Optimizer rules), and physical-expr (Physical Expressions) labels Apr 12, 2023
viirya (Member Author) commented Apr 12, 2023

Unlike #5675, this doesn't add a new expression node, PromotePrecision; instead it defers decimal type coercion to the evaluation phase of math expressions. This approach is closer to how Spark handles decimal math coercion nowadays.

@github-actions github-actions bot added the core (Core DataFusion crate) and sqllogictest (SQL Logic Tests (.slt)) labels Apr 12, 2023
@viirya viirya force-pushed the fix_decimal_multiply_precision_loss4 branch 2 times, most recently from 54397f9 to 343ca79 Compare April 13, 2023 21:34
viirya (Member Author) commented Apr 16, 2023

There is a compilation error. I'm going to fix it in #6029.

@viirya viirya force-pushed the fix_decimal_multiply_precision_loss4 branch from cb7e326 to 0a88516 Compare April 17, 2023 01:18
Comment on lines +3313 to +3320
Some(99193548387), // 0.99193548387
None,
None,
Some(100813008130), // 1.0081300813
Some(100000000000), // 1.0
],
21,
11,
viirya (Member Author) commented

Previously, this division lost precision. Now we get it back.
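The `21, 11` in the expected array above is consistent with the Spark-style division rule, `scale = max(6, s1 + p2 + 1)` and `precision = p1 - s1 + s2 + scale`. The sketch below assumes (this is not stated in the thread) that both operands were coerced to `Decimal128(10, 0)`. The helper names are hypothetical, and capping at 38 is a simplification of how oversized results are really adjusted.

```rust
/// Maximum precision of a 128-bit decimal.
const MAX_PRECISION: u8 = 38;

/// Spark-style result type of `Decimal(p1, s1) / Decimal(p2, s2)`.
/// Hypothetical helper; assumes valid inputs with `s1 <= p1`.
fn divide_result_type(p1: u8, s1: u8, p2: u8, s2: u8) -> (u8, u8) {
    let scale = (s1 + p2 + 1).max(6);
    let precision = p1 - s1 + s2 + scale;
    (precision.min(MAX_PRECISION), scale.min(MAX_PRECISION))
}

/// Renders an unscaled `i128` at the given scale, e.g. (12345, 2) -> "123.45".
fn format_unscaled(value: i128, scale: u32) -> String {
    if scale == 0 {
        return value.to_string();
    }
    let factor = 10_i128.pow(scale);
    let sign = if value < 0 { "-" } else { "" };
    let abs = value.abs();
    // Zero-pad the fractional part to exactly `scale` digits.
    format!(
        "{}{}.{:0width$}",
        sign,
        abs / factor,
        abs % factor,
        width = scale as usize
    )
}
```

With these assumptions, `divide_result_type(10, 0, 10, 0)` gives `(21, 11)`, and the unscaled value `99193548387` at scale 11 reads as `0.99193548387`, matching the inline comments in the test data.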

@@ -4801,18 +4878,18 @@ mod tests {

// subtract: decimal array subtract int32 array
let schema = Arc::new(Schema::new(vec![
Field::new("b", DataType::Int32, true),
Field::new("a", DataType::Decimal128(10, 2), true),
viirya (Member Author) commented

Previously, the field order was incorrect. But because we coerced the types on both sides of the op anyway, it still worked. Now we no longer coerce the decimal field (which was wrongly bound to an Int32Array) before it goes into the binary expression, so the wrong field causes an error.

query TTRRRRRRRI
select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
- sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
+ sum(cast(l_extendedprice as decimal(12,2)) * (1 - l_discount) * (1 + l_tax)) as sum_charge,

Comment on lines +3 to +6
cast(cast(sum(case
when nation = 'BRAZIL' then volume
else 0
end) as decimal(12,2)) / cast(sum(volume) as decimal(12,2)) as decimal(15,2)) as mkt_share

big_decimal_to_str(
- BigDecimal::from_str(&Decimal::from_i128_with_scale(value, scale).to_string())
+ BigDecimal::from_str(&Decimal128Type::format_decimal(value, *precision, *scale))

@viirya viirya marked this pull request as ready for review April 17, 2023 21:35
viirya (Member Author) commented Apr 17, 2023

This deals with the decimal precision issue without an additional PromotePrecision node (#5675).

cc @alamb @liukun4515

Dandandan (Contributor) commented

I wonder if this already fixes #4024

viirya (Member Author) commented Apr 18, 2023

> I wonder if this already fixes #4024

Yea, just verified locally that this can pass verify_q6.

Dandandan (Contributor) left a comment

Looks great!

viirya (Member Author) commented Apr 18, 2023

Thanks @Dandandan

Dandandan (Contributor) commented

Let's wait ~24hrs so other reviewers can have a chance.

Dandandan (Contributor) commented

FYI @mingmwang @andygrove this PR also has some effect on performance, as casting is changed (mostly reduced).

Dandandan (Contributor) commented

Ran the benchmarks for TPCH(SF=1) in memory.

Performance is mostly the same, except a ~30% improvement for q1 compared to main 🚀

@Dandandan Dandandan merged commit e81f54b into apache:main Apr 20, 2023
alamb (Contributor) commented Apr 24, 2023

🎉
