Significant performance downgrade to tpch-q1 #6278
Will investigate asap.
I agree the tpch queries got much worse. Here are the results comparing 23.0.0 to
I'll try and run some profiling and figure out what is going on.
I believe the slowdown affects the 24.0.0 RC1 as well. Here is the comparison between 23.0.0 and the 24.0.0-rc1 tag:

Comparing heads_23.0.0 and HEAD
--------------------
Benchmark tpch_mem.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃    23.0.0 ┃ 24.0.0-rc1 ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │  855.34ms │  3413.56ms │  3.99x slower │
│ QQuery 2  │  236.64ms │   283.88ms │  1.20x slower │
│ QQuery 3  │  146.80ms │   152.87ms │     no change │
│ QQuery 4  │   96.49ms │    91.61ms │ +1.05x faster │
│ QQuery 5  │  376.97ms │   446.78ms │  1.19x slower │
│ QQuery 6  │   35.81ms │    34.47ms │     no change │
│ QQuery 7  │  870.67ms │  1084.90ms │  1.25x slower │
│ QQuery 8  │  225.08ms │   238.15ms │  1.06x slower │
│ QQuery 9  │  492.54ms │   553.52ms │  1.12x slower │
│ QQuery 10 │  274.24ms │   315.50ms │  1.15x slower │
│ QQuery 11 │  243.34ms │   294.34ms │  1.21x slower │
│ QQuery 12 │  145.80ms │   152.11ms │     no change │
│ QQuery 13 │  611.00ms │   657.87ms │  1.08x slower │
│ QQuery 14 │   43.63ms │    46.52ms │  1.07x slower │
│ QQuery 15 │   83.76ms │    97.48ms │  1.16x slower │
│ QQuery 16 │  174.79ms │   248.43ms │  1.42x slower │
│ QQuery 17 │ 2500.32ms │  2912.47ms │  1.16x slower │
│ QQuery 18 │ 2643.99ms │  2884.11ms │  1.09x slower │
│ QQuery 19 │  142.35ms │   159.65ms │  1.12x slower │
│ QQuery 20 │  844.41ms │   864.69ms │     no change │
│ QQuery 21 │ 1275.64ms │  1311.02ms │     no change │
│ QQuery 22 │  134.58ms │   135.93ms │     no change │
└───────────┴───────────┴────────────┴───────────────┘
@viirya is this something you can look into?

Reproducer

You can reproduce the results as follows.

Make the data:

cd benchmarks
./bench.sh data all

Install datafusion-cli locally (or run it however else you want):

cargo install --path datafusion-cli

Run the query:

❯ -- load lineitem table into memory
create table lineitem as select * from 'data/lineitem';
❯ select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem
where
l_shipdate <= date '1998-09-02'
group by
l_returnflag,
l_linestatus
order by
l_returnflag,
l_linestatus;
+--------------+--------------+-------------+-----------------+-------------------+---------------------+-----------+--------------+----------+-------------+
| l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order |
+--------------+--------------+-------------+-----------------+-------------------+---------------------+-----------+--------------+----------+-------------+
| A | F | 37734107.00 | 56586554400.73 | 53758257134.8700 | 55909065222.827692 | 25.522005 | 38273.129734 | 0.049985 | 1478493 |
| N | F | 991417.00 | 1487504710.38 | 1413082168.0541 | 1469649223.194375 | 25.516471 | 38284.467760 | 0.050093 | 38854 |
| N | O | 74476040.00 | 111701729697.74 | 106118230307.6056 | 110367043872.497010 | 25.502226 | 38249.117988 | 0.049996 | 2920374 |
| R | F | 37719753.00 | 56568041380.90 | 53741292684.6040 | 55889619119.831932 | 25.505793 | 38250.854626 | 0.050009 | 1478870 |
+--------------+--------------+-------------+-----------------+-------------------+---------------------+-----------+--------------+----------+-------------+
4 rows in set. Query took 6.966 seconds.

Explain Plans

The explain plans are identical between the two versions.
Interestingly, the explain analyze output shows all the time being spent in the AggregateExec.
My next plan (for tomorrow) is to isolate which change actually caused the regression.
From the Instruments profiling, it looks like
I will also take a look today.
@viirya @alamb And the logic in the method multiply_fixed_point is:
pub fn multiply_fixed_point(
    left: &PrimitiveArray<Decimal128Type>,
    right: &PrimitiveArray<Decimal128Type>,
    required_scale: i8,
) -> Result<PrimitiveArray<Decimal128Type>, ArrowError> {
    let product_scale = left.scale() + right.scale();
    let precision = min(
        left.precision() + right.precision() + 1,
        DECIMAL128_MAX_PRECISION,
    );

    // Fast path: the caller wants the natural product scale, so the plain
    // Decimal128 multiply kernel can be used directly.
    if required_scale == product_scale {
        return multiply(left, right)?
            .with_precision_and_scale(precision, required_scale);
    }

    if required_scale > product_scale {
        return Err(ArrowError::ComputeError(format!(
            "Required scale {} is greater than product scale {}",
            required_scale, product_scale
        )));
    }

    // Slow path: widen each value to 256 bits, multiply, then divide (with
    // rounding) back down to the required scale before narrowing to i128.
    let divisor =
        i256::from_i128(10).pow_wrapping((product_scale - required_scale) as u32);

    binary::<_, _, _, Decimal128Type>(left, right, |a, b| {
        let a = i256::from_i128(a);
        let b = i256::from_i128(b);
        let mut mul = a.wrapping_mul(b);
        mul = divide_and_round::<Decimal256Type>(mul, divisor);
        mul.as_i128()
    })
    .and_then(|a| a.with_precision_and_scale(precision, required_scale))
}
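For reference, here is a usage sketch of that kernel. This is an illustration, not code from the thread: the arrow_arith::arithmetic module path and the example values are assumptions, and the behavior follows the branches shown above.

use arrow_arith::arithmetic::multiply_fixed_point;
use arrow_array::{types::Decimal128Type, Decimal128Array, PrimitiveArray};
use arrow_schema::ArrowError;

fn main() -> Result<(), ArrowError> {
    // Two Decimal(38, 10) arrays; their product scale is 10 + 10 = 20.
    let a: PrimitiveArray<Decimal128Type> =
        Decimal128Array::from(vec![123_456_789_012_345_i128])
            .with_precision_and_scale(38, 10)?;
    let b = Decimal128Array::from(vec![987_654_321_098_765_i128])
        .with_precision_and_scale(38, 10)?;

    // required_scale equals the product scale: fast i128 multiply.
    let fast = multiply_fixed_point(&a, &b, 20)?;
    // required_scale below the product scale: the widening i256 path.
    let slow = multiply_fixed_point(&a, &b, 10)?;
    println!("{:?}\n{:?}", fast, slow);
    Ok(())
}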
And there is another issue related to the default scale and precision of literal values: DataFusion over-estimates the default precision. I will try to fix this issue.
I believe the slowdown began in #6103 (191af8d), as suspected by @mingmwang. Here is the data (I built datafusion-cli at different revisions):
The precision for literal values was not over-estimated. When we coerce a numeric type to decimal, the precision is determined by the numeric type; for Int64, the precision is 20.
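To make that arithmetic concrete, here is a small self-contained sketch (plain Rust, no dependencies) of the multiplication type rule quoted in multiply_fixed_point above: precision = p1 + p2 + 1 capped at 38, scale = s1 + s2. The Decimal(23, 2) operand is an illustration of 1 - l_discount, assuming TPC-H's Decimal(15, 2) columns and the Int64 literal coerced to Decimal(20, 0); the exact intermediate types are an assumption, not taken from this thread.

const DECIMAL128_MAX_PRECISION: u8 = 38;

// Result type of a Decimal128 multiplication, per the rule used in
// multiply_fixed_point above: p = p1 + p2 + 1 (capped), s = s1 + s2.
fn multiply_result_type(p1: u8, s1: i8, p2: u8, s2: i8) -> (u8, i8) {
    ((p1 + p2 + 1).min(DECIMAL128_MAX_PRECISION), s1 + s2)
}

fn main() {
    // l_extendedprice: Decimal(15, 2).
    // (1 - l_discount): illustrated as Decimal(23, 2), because the Int64
    // literal coerces to Decimal(20, 0) and dominates the integer digits.
    let (p, s) = multiply_result_type(15, 2, 23, 2);
    // 15 + 23 + 1 = 39 overflows the cap, so this prints Decimal(38, 4).
    println!("Decimal({p}, {s})");
}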
As I mentioned above (#6278 (comment)), implementing native decimal kernels should help. I will look into it and try to implement them.
It is not inconsistent. It is for allowing precision-loss decimal multiplication. Our kernels basically allow overflow. If
I mean the literal
Could you please explain a bit why, if the
If two decimals each have scale 38, the product scale will be 76. How can you multiply two i128 integers, each with up to 38 digits, and get a full-precision result of up to 76 digits?
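A minimal sketch of that point, using only calls that already appear in the kernel quoted above (i256 comes from the arrow-buffer crate); the values are illustrative:

use arrow_buffer::i256;

fn main() {
    // A 38-digit value: 10^37. Its square, 10^74, cannot fit in i128
    // (i128::MAX is about 1.7 × 10^38), so checked_mul returns None.
    let a: i128 = 10i128.pow(37);
    assert!(a.checked_mul(a).is_none());

    // Widening both operands to 256 bits keeps the full product:
    // i256::MAX is about 5.8 × 10^76, so 10^74 fits comfortably.
    let wide = i256::from_i128(a).wrapping_mul(i256::from_i128(a));
    assert_eq!(wide, i256::from_i128(10).pow_wrapping(74));
    println!("full 75-digit product preserved in i256");
}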
Yeah, for this case you may treat the literal as Int8 to avoid the large precision, but the root cause is not resolved.
I understand now, thanks for the explanation.
Describe the bug
Main branch:
Running benchmarks with the following options: DataFusionBenchmarkOpt { query: Some(1), debug: false, iterations: 3, partitions: 1, batch_size: 8192, path: "./parquet_data", file_format: "parquet", mem_table: false, output_path: None, disable_statistics: true }
Query 1 iteration 0 took 1716.3 ms and returned 4 rows
Query 1 iteration 1 took 1697.0 ms and returned 4 rows
Query 1 iteration 2 took 1694.3 ms and returned 4 rows
Query 1 avg time: 1702.52 ms
Branch 23 (Tag 23.0.0):
Running benchmarks with the following options: DataFusionBenchmarkOpt { query: Some(1), debug: false, iterations: 3, partitions: 1, batch_size: 8192, path: "./parquet_data", file_format: "parquet", mem_table: false, output_path: None, disable_statistics: true, enable_scheduler: false }
Query 1 iteration 0 took 864.2 ms and returned 4 rows
Query 1 iteration 1 took 842.0 ms and returned 4 rows
Query 1 iteration 2 took 838.7 ms and returned 4 rows
Query 1 avg time: 848.29 ms
To Reproduce
No response
Expected behavior
No response
Additional context
No response