Significant performance downgrade to tpch-q1 #6278
Will investigate asap.
I agree the tpch queries got much worse. Here are the results comparing 23.0.0 to
I'll try and run some profiling and figure out what is going on.
I believe the slowdown affects the 24.0.0 RC1 as well. Here is the comparison between 23.0.0 and the 24.0.0-rc1 tag:

Comparing heads_23.0.0 and HEAD
--------------------
Benchmark tpch_mem.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃    23.0.0 ┃ 24.0.0-rc1 ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │  855.34ms │  3413.56ms │  3.99x slower │
│ QQuery 2  │  236.64ms │   283.88ms │  1.20x slower │
│ QQuery 3  │  146.80ms │   152.87ms │     no change │
│ QQuery 4  │   96.49ms │    91.61ms │ +1.05x faster │
│ QQuery 5  │  376.97ms │   446.78ms │  1.19x slower │
│ QQuery 6  │   35.81ms │    34.47ms │     no change │
│ QQuery 7  │  870.67ms │  1084.90ms │  1.25x slower │
│ QQuery 8  │  225.08ms │   238.15ms │  1.06x slower │
│ QQuery 9  │  492.54ms │   553.52ms │  1.12x slower │
│ QQuery 10 │  274.24ms │   315.50ms │  1.15x slower │
│ QQuery 11 │  243.34ms │   294.34ms │  1.21x slower │
│ QQuery 12 │  145.80ms │   152.11ms │     no change │
│ QQuery 13 │  611.00ms │   657.87ms │  1.08x slower │
│ QQuery 14 │   43.63ms │    46.52ms │  1.07x slower │
│ QQuery 15 │   83.76ms │    97.48ms │  1.16x slower │
│ QQuery 16 │  174.79ms │   248.43ms │  1.42x slower │
│ QQuery 17 │ 2500.32ms │  2912.47ms │  1.16x slower │
│ QQuery 18 │ 2643.99ms │  2884.11ms │  1.09x slower │
│ QQuery 19 │  142.35ms │   159.65ms │  1.12x slower │
│ QQuery 20 │  844.41ms │   864.69ms │     no change │
│ QQuery 21 │ 1275.64ms │  1311.02ms │     no change │
│ QQuery 22 │  134.58ms │   135.93ms │     no change │
└───────────┴───────────┴────────────┴───────────────┘
@viirya is this something you can look into?

Reproducer

You can reproduce the results as follows.

Make the data:

cd benchmarks
./bench.sh data all

Install datafusion-cli locally (or run it however else you want):

cargo install --path datafusion-cli

Run the query:

❯ -- load lineitem table into memory
create table lineitem as select * from 'data/lineitem';
❯ select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem
where
l_shipdate <= date '1998-09-02'
group by
l_returnflag,
l_linestatus
order by
l_returnflag,
l_linestatus;
+--------------+--------------+-------------+-----------------+-------------------+---------------------+-----------+--------------+----------+-------------+
| l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order |
+--------------+--------------+-------------+-----------------+-------------------+---------------------+-----------+--------------+----------+-------------+
| A | F | 37734107.00 | 56586554400.73 | 53758257134.8700 | 55909065222.827692 | 25.522005 | 38273.129734 | 0.049985 | 1478493 |
| N | F | 991417.00 | 1487504710.38 | 1413082168.0541 | 1469649223.194375 | 25.516471 | 38284.467760 | 0.050093 | 38854 |
| N | O | 74476040.00 | 111701729697.74 | 106118230307.6056 | 110367043872.497010 | 25.502226 | 38249.117988 | 0.049996 | 2920374 |
| R | F | 37719753.00 | 56568041380.90 | 53741292684.6040 | 55889619119.831932 | 25.505793 | 38250.854626 | 0.050009 | 1478870 |
+--------------+--------------+-------------+-----------------+-------------------+---------------------+-----------+--------------+----------+-------------+
4 rows in set. Query took 6.966 seconds.

Explain Plans

The explain plans are identical between the two versions.
Interestingly, the explain analyze output shows all the time being spent in the AggregateExec.
My next plan (for tomorrow) is to isolate which change actually caused the regression.
From the Instruments profiling, it looks like
I will also take a look today.
@viirya @alamb And the logic in the method multiply_fixed_point is:
pub fn multiply_fixed_point(
    left: &PrimitiveArray<Decimal128Type>,
    right: &PrimitiveArray<Decimal128Type>,
    required_scale: i8,
) -> Result<PrimitiveArray<Decimal128Type>, ArrowError> {
    let product_scale = left.scale() + right.scale();
    let precision = min(
        left.precision() + right.precision() + 1,
        DECIMAL128_MAX_PRECISION,
    );

    // Fast path: the caller wants the natural product scale, so the plain
    // Decimal128 multiply kernel can be used directly.
    if required_scale == product_scale {
        return multiply(left, right)?
            .with_precision_and_scale(precision, required_scale);
    }

    if required_scale > product_scale {
        return Err(ArrowError::ComputeError(format!(
            "Required scale {} is greater than product scale {}",
            required_scale, product_scale
        )));
    }

    // Slow path: widen each value to 256 bits, multiply, then divide (with
    // rounding) back down to the required scale before narrowing to i128.
    let divisor =
        i256::from_i128(10).pow_wrapping((product_scale - required_scale) as u32);

    binary::<_, _, _, Decimal128Type>(left, right, |a, b| {
        let a = i256::from_i128(a);
        let b = i256::from_i128(b);
        let mut mul = a.wrapping_mul(b);
        mul = divide_and_round::<Decimal256Type>(mul, divisor);
        mul.as_i128()
    })
    .and_then(|a| a.with_precision_and_scale(precision, required_scale))
}
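For reference, here is a usage sketch of that kernel. This is an illustration, not code from the thread: the arrow_arith::arithmetic module path and the example values are assumptions, and the behavior follows the branches shown above.

use arrow_arith::arithmetic::multiply_fixed_point;
use arrow_array::{types::Decimal128Type, Decimal128Array, PrimitiveArray};
use arrow_schema::ArrowError;

fn main() -> Result<(), ArrowError> {
    // Two Decimal(38, 10) arrays; their product scale is 10 + 10 = 20.
    let a: PrimitiveArray<Decimal128Type> =
        Decimal128Array::from(vec![123_456_789_012_345_i128])
            .with_precision_and_scale(38, 10)?;
    let b = Decimal128Array::from(vec![987_654_321_098_765_i128])
        .with_precision_and_scale(38, 10)?;

    // required_scale equals the product scale: fast i128 multiply.
    let fast = multiply_fixed_point(&a, &b, 20)?;
    // required_scale below the product scale: the widening i256 path.
    let slow = multiply_fixed_point(&a, &b, 10)?;
    println!("{:?}\n{:?}", fast, slow);
    Ok(())
}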
And there is another issue related to the default scale and precision of literal values: DataFusion over-estimates the default precision. I will try to fix this issue.
I believe the slowdown began in #6103 (191af8d), as suspected by @mingmwang. Here is the data (I built datafusion-cli at different revisions):
The precision for literal values was not over-estimated. When we coerce a numeric type to decimal, the precision is determined by the numeric type; for Int64, the precision is 20.
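To make that arithmetic concrete, here is a small self-contained sketch (plain Rust, no dependencies) of the multiplication type rule quoted in multiply_fixed_point above: precision = p1 + p2 + 1 capped at 38, scale = s1 + s2. The Decimal(23, 2) operand is an illustration of 1 - l_discount, assuming TPC-H's Decimal(15, 2) columns and the Int64 literal coerced to Decimal(20, 0); the exact intermediate types are an assumption, not taken from this thread.

const DECIMAL128_MAX_PRECISION: u8 = 38;

// Result type of a Decimal128 multiplication, per the rule used in
// multiply_fixed_point above: p = p1 + p2 + 1 (capped), s = s1 + s2.
fn multiply_result_type(p1: u8, s1: i8, p2: u8, s2: i8) -> (u8, i8) {
    ((p1 + p2 + 1).min(DECIMAL128_MAX_PRECISION), s1 + s2)
}

fn main() {
    // l_extendedprice: Decimal(15, 2).
    // (1 - l_discount): illustrated as Decimal(23, 2), because the Int64
    // literal coerces to Decimal(20, 0) and dominates the integer digits.
    let (p, s) = multiply_result_type(15, 2, 23, 2);
    // 15 + 23 + 1 = 39 overflows the cap, so this prints Decimal(38, 4).
    println!("Decimal({p}, {s})");
}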
As I mentioned above (#6278 (comment)), implementing native decimal kernels should help. I will look into it and try to implement them.
It is not inconsistent. It is for allowing precision-loss decimal multiplication. Our kernels basically allow overflow. If
I mean the literal
Could you please explain a bit why, if the
If two decimals each have scale 38, the product scale will be 76. How can you multiply two i128 integers, each with up to 38 digits, and get a full-precision result of up to 76 digits?
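A minimal sketch of that point, using only calls that already appear in the kernel quoted above (i256 comes from the arrow-buffer crate); the values are illustrative:

use arrow_buffer::i256;

fn main() {
    // A 38-digit value: 10^37. Its square, 10^74, cannot fit in i128
    // (i128::MAX is about 1.7 × 10^38), so checked_mul returns None.
    let a: i128 = 10i128.pow(37);
    assert!(a.checked_mul(a).is_none());

    // Widening both operands to 256 bits keeps the full product:
    // i256::MAX is about 5.8 × 10^76, so 10^74 fits comfortably.
    let wide = i256::from_i128(a).wrapping_mul(i256::from_i128(a));
    assert_eq!(wide, i256::from_i128(10).pow_wrapping(74));
    println!("full 75-digit product preserved in i256");
}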
Yeah, for this case you may treat the literal as Int8 to avoid the large precision, but the root cause is not resolved.
I understand now, thanks for the explanation.
Describe the bug
Main branch:
Running benchmarks with the following options: DataFusionBenchmarkOpt { query: Some(1), debug: false, iterations: 3, partitions: 1, batch_size: 8192, path: "./parquet_data", file_format: "parquet", mem_table: false, output_path: None, disable_statistics: true }
Query 1 iteration 0 took 1716.3 ms and returned 4 rows
Query 1 iteration 1 took 1697.0 ms and returned 4 rows
Query 1 iteration 2 took 1694.3 ms and returned 4 rows
Query 1 avg time: 1702.52 ms
Branch 23 (Tag 23.0.0):
Running benchmarks with the following options: DataFusionBenchmarkOpt { query: Some(1), debug: false, iterations: 3, partitions: 1, batch_size: 8192, path: "./parquet_data", file_format: "parquet", mem_table: false, output_path: None, disable_statistics: true, enable_scheduler: false }
Query 1 iteration 0 took 864.2 ms and returned 4 rows
Query 1 iteration 1 took 842.0 ms and returned 4 rows
Query 1 iteration 2 took 838.7 ms and returned 4 rows
Query 1 avg time: 848.29 ms
To Reproduce
No response
Expected behavior
No response
Additional context
No response