Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Significant performance downgrade to tpch-q1 #6278

Closed
mingmwang opened this issue May 8, 2023 · 22 comments · Fixed by #6331
Closed

Significant performance downgrade to tpch-q1 #6278

mingmwang opened this issue May 8, 2023 · 22 comments · Fixed by #6331
Labels
bug Something isn't working

Comments

@mingmwang
Copy link
Contributor

mingmwang commented May 8, 2023

Describe the bug

Main branch:

Running benchmarks with the following options: DataFusionBenchmarkOpt { query: Some(1), debug: false, iterations: 3, partitions: 1, batch_size: 8192, path: "./parquet_data", file_format: "parquet", mem_table: false, output_path: None, disable_statistics: true }
Query 1 iteration 0 took 1716.3 ms and returned 4 rows
Query 1 iteration 1 took 1697.0 ms and returned 4 rows
Query 1 iteration 2 took 1694.3 ms and returned 4 rows
Query 1 avg time: 1702.52 ms

Branch 23 (Tag 23.0.0):

Running benchmarks with the following options: DataFusionBenchmarkOpt { query: Some(1), debug: false, iterations: 3, partitions: 1, batch_size: 8192, path: "./parquet_data", file_format: "parquet", mem_table: false, output_path: None, disable_statistics: true, enable_scheduler: false }
Query 1 iteration 0 took 864.2 ms and returned 4 rows
Query 1 iteration 1 took 842.0 ms and returned 4 rows
Query 1 iteration 2 took 838.7 ms and returned 4 rows
Query 1 avg time: 848.29 ms

To Reproduce

No response

Expected behavior

No response

Additional context

No response

@mingmwang mingmwang added the bug Something isn't working label May 8, 2023
@mingmwang
Copy link
Contributor Author

mingmwang commented May 8, 2023

@viirya

Should be caused by this change:

#6103

@mingmwang
Copy link
Contributor Author

@alamb

@alamb
Copy link
Contributor

alamb commented May 8, 2023

Will investingate asap

@alamb
Copy link
Contributor

alamb commented May 8, 2023

I agree the tpch queries got much worse

here are the results comparing 23.0.0 to main at 2e9beeb

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  tpch_mem ┃  tpch_mem ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  855.34ms │ 2974.32ms │  3.48x slower │
│ QQuery 2     │  236.64ms │  227.33ms │     no change │
│ QQuery 3     │  146.80ms │  156.11ms │  1.06x slower │
│ QQuery 4     │   96.49ms │   92.63ms │     no change │
│ QQuery 5     │  376.97ms │  389.73ms │     no change │
│ QQuery 6     │   35.81ms │   35.52ms │     no change │
│ QQuery 7     │  870.67ms │  863.31ms │     no change │
│ QQuery 8     │  225.08ms │  212.54ms │ +1.06x faster │
│ QQuery 9     │  492.54ms │  499.20ms │     no change │
│ QQuery 10    │  274.24ms │  249.58ms │ +1.10x faster │
│ QQuery 11    │  243.34ms │  250.72ms │     no change │
│ QQuery 12    │  145.80ms │  154.45ms │  1.06x slower │
│ QQuery 13    │  611.00ms │  650.75ms │  1.07x slower │
│ QQuery 14    │   43.63ms │   40.05ms │ +1.09x faster │
│ QQuery 15    │   83.76ms │   81.30ms │     no change │
│ QQuery 16    │  174.79ms │  177.44ms │     no change │
│ QQuery 17    │ 2500.32ms │ 2412.97ms │     no change │
│ QQuery 18    │ 2643.99ms │ 2450.26ms │ +1.08x faster │
│ QQuery 19    │  142.35ms │  155.07ms │  1.09x slower │
│ QQuery 20    │  844.41ms │  761.31ms │ +1.11x faster │
│ QQuery 21    │ 1275.64ms │ 1229.27ms │     no change │
│ QQuery 22    │  134.58ms │  127.93ms │     no change │
└──────────────┴───────────┴───────────┴───────────────┘

I'll try and run some profiling and figure out what is going on

@alamb
Copy link
Contributor

alamb commented May 8, 2023

A local profile with Instruments shows that 77% of the time is spent doing decimal multiplication:

Screenshot 2023-05-08 at 4 59 33 PM

@alamb
Copy link
Contributor

alamb commented May 8, 2023

I believe the slowdown affects the 24.0.0 RC1 as well

Here is the comparison between 23.0.0 and 24.0.0-rc1 tag:

Comparing heads_23.0.0 and HEAD
--------------------
Benchmark tpch_mem.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  tpch_mem ┃  tpch_mem ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  855.34ms │ 3413.56ms │  3.99x slower │
│ QQuery 2     │  236.64ms │  283.88ms │  1.20x slower │
│ QQuery 3     │  146.80ms │  152.87ms │     no change │
│ QQuery 4     │   96.49ms │   91.61ms │ +1.05x faster │
│ QQuery 5     │  376.97ms │  446.78ms │  1.19x slower │
│ QQuery 6     │   35.81ms │   34.47ms │     no change │
│ QQuery 7     │  870.67ms │ 1084.90ms │  1.25x slower │
│ QQuery 8     │  225.08ms │  238.15ms │  1.06x slower │
│ QQuery 9     │  492.54ms │  553.52ms │  1.12x slower │
│ QQuery 10    │  274.24ms │  315.50ms │  1.15x slower │
│ QQuery 11    │  243.34ms │  294.34ms │  1.21x slower │
│ QQuery 12    │  145.80ms │  152.11ms │     no change │
│ QQuery 13    │  611.00ms │  657.87ms │  1.08x slower │
│ QQuery 14    │   43.63ms │   46.52ms │  1.07x slower │
│ QQuery 15    │   83.76ms │   97.48ms │  1.16x slower │
│ QQuery 16    │  174.79ms │  248.43ms │  1.42x slower │
│ QQuery 17    │ 2500.32ms │ 2912.47ms │  1.16x slower │
│ QQuery 18    │ 2643.99ms │ 2884.11ms │  1.09x slower │
│ QQuery 19    │  142.35ms │  159.65ms │  1.12x slower │
│ QQuery 20    │  844.41ms │  864.69ms │     no change │
│ QQuery 21    │ 1275.64ms │ 1311.02ms │     no change │
│ QQuery 22    │  134.58ms │  135.93ms │     no change │
└──────────────┴───────────┴───────────┴───────────────┘

@alamb
Copy link
Contributor

alamb commented May 8, 2023

@viirya is this something you can look into?

Reproducer

You can reproduce the results using datafusion-cli:

Make data:

cd benchmarks
./bench.sh data all

Install datafusion-cli locally (or run it however else you want):

cargo install --path datafusion-cli

Run the query:

-- load lineitem table into memory
create table lineitem as select * from 'data/lineitem';


❯ select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
        l_shipdate <= date '1998-09-02'
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus;
+--------------+--------------+-------------+-----------------+-------------------+---------------------+-----------+--------------+----------+-------------+
| l_returnflag | l_linestatus | sum_qty     | sum_base_price  | sum_disc_price    | sum_charge          | avg_qty   | avg_price    | avg_disc | count_order |
+--------------+--------------+-------------+-----------------+-------------------+---------------------+-----------+--------------+----------+-------------+
| A            | F            | 37734107.00 | 56586554400.73  | 53758257134.8700  | 55909065222.827692  | 25.522005 | 38273.129734 | 0.049985 | 1478493     |
| N            | F            | 991417.00   | 1487504710.38   | 1413082168.0541   | 1469649223.194375   | 25.516471 | 38284.467760 | 0.050093 | 38854       |
| N            | O            | 74476040.00 | 111701729697.74 | 106118230307.6056 | 110367043872.497010 | 25.502226 | 38249.117988 | 0.049996 | 2920374     |
| R            | F            | 37719753.00 | 56568041380.90  | 53741292684.6040  | 55889619119.831932  | 25.505793 | 38250.854626 | 0.050009 | 1478870     |
+--------------+--------------+-------------+-----------------+-------------------+---------------------+-----------+--------------+----------+-------------+
4 rows in set. Query took 6.966 seconds.

Explain Plans

The explain plans are identical between 23.0.0 and main:


❯ explain select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
        l_shipdate <= date '1998-09-02'
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus;
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Sort: lineitem.l_returnflag ASC NULLS LAST, lineitem.l_linestatus ASC NULLS LAST                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
|               |   Projection: lineitem.l_returnflag, lineitem.l_linestatus, SUM(lineitem.l_quantity) AS sum_qty, SUM(lineitem.l_extendedprice) AS sum_base_price, SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount) AS sum_disc_price, SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax) AS sum_charge, AVG(lineitem.l_quantity) AS avg_qty, AVG(lineitem.l_extendedprice) AS avg_price, AVG(lineitem.l_discount) AS avg_disc, COUNT(UInt8(1)) AS count_order                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|               |     Aggregate: groupBy=[[lineitem.l_returnflag, lineitem.l_linestatus]], aggr=[[SUM(lineitem.l_quantity), SUM(lineitem.l_extendedprice), SUM(lineitem.l_extendedprice * (Decimal128(Some(1),20,0) - lineitem.l_discount)Decimal128(Some(1),20,0) - lineitem.l_discountlineitem.l_discountDecimal128(Some(1),20,0)lineitem.l_extendedprice AS lineitem.l_extendedprice * Decimal128(Some(1),20,0) - lineitem.l_discount) AS SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), SUM(lineitem.l_extendedprice * (Decimal128(Some(1),20,0) - lineitem.l_discount)Decimal128(Some(1),20,0) - lineitem.l_discountlineitem.l_discountDecimal128(Some(1),20,0)lineitem.l_extendedprice AS lineitem.l_extendedprice * Decimal128(Some(1),20,0) - lineitem.l_discount * (Decimal128(Some(1),20,0) + lineitem.l_tax)) AS SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax), AVG(lineitem.l_quantity), AVG(lineitem.l_extendedprice), AVG(lineitem.l_discount), COUNT(UInt8(1))]] |
|               |       Projection: lineitem.l_extendedprice * (Decimal128(Some(1),20,0) - lineitem.l_discount) AS lineitem.l_extendedprice * (Decimal128(Some(1),20,0) - lineitem.l_discount)Decimal128(Some(1),20,0) - lineitem.l_discountlineitem.l_discountDecimal128(Some(1),20,0)lineitem.l_extendedprice, lineitem.l_quantity, lineitem.l_extendedprice, lineitem.l_discount, lineitem.l_tax, lineitem.l_returnflag, lineitem.l_linestatus                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
|               |         Filter: lineitem.l_shipdate <= Date32("10471")                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|               |           TableScan: lineitem projection=[l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| physical_plan | SortPreservingMergeExec: [l_returnflag@0 ASC NULLS LAST,l_linestatus@1 ASC NULLS LAST]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|               |   SortExec: expr=[l_returnflag@0 ASC NULLS LAST,l_linestatus@1 ASC NULLS LAST]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|               |     ProjectionExec: expr=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus, SUM(lineitem.l_quantity)@2 as sum_qty, SUM(lineitem.l_extendedprice)@3 as sum_base_price, SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)@4 as sum_disc_price, SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax)@5 as sum_charge, AVG(lineitem.l_quantity)@6 as avg_qty, AVG(lineitem.l_extendedprice)@7 as avg_price, AVG(lineitem.l_discount)@8 as avg_disc, COUNT(UInt8(1))@9 as count_order]                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|               |       AggregateExec: mode=FinalPartitioned, gby=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus], aggr=[SUM(lineitem.l_quantity), SUM(lineitem.l_extendedprice), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax), AVG(lineitem.l_quantity), AVG(lineitem.l_extendedprice), AVG(lineitem.l_discount), COUNT(UInt8(1))]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
|               |         CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|               |           RepartitionExec: partitioning=Hash([Column { name: "l_returnflag", index: 0 }, Column { name: "l_linestatus", index: 1 }], 16), input_partitions=16                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|               |             AggregateExec: mode=Partial, gby=[l_returnflag@5 as l_returnflag, l_linestatus@6 as l_linestatus], aggr=[SUM(lineitem.l_quantity), SUM(lineitem.l_extendedprice), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax), AVG(lineitem.l_quantity), AVG(lineitem.l_extendedprice), AVG(lineitem.l_discount), COUNT(UInt8(1))]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|               |               ProjectionExec: expr=[l_extendedprice@1 * (Some(1),20,0 - l_discount@2) as lineitem.l_extendedprice * (Decimal128(Some(1),20,0) - lineitem.l_discount)Decimal128(Some(1),20,0) - lineitem.l_discountlineitem.l_discountDecimal128(Some(1),20,0)lineitem.l_extendedprice, l_quantity@0 as l_quantity, l_extendedprice@1 as l_extendedprice, l_discount@2 as l_discount, l_tax@3 as l_tax, l_returnflag@4 as l_returnflag, l_linestatus@5 as l_linestatus]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|               |                 CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|               |                   FilterExec: l_shipdate@6 <= 10471                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|               |                     RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
|               |                       MemoryExec: partitions=1, partition_sizes=[733]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.011 seconds.

explain_22.0.0.txt
explain_main.txt

@alamb
Copy link
Contributor

alamb commented May 8, 2023

Interestingly, the explain analyze output shows all the time being spent in the AggregateExec elapsed_compute=11.714486088s -- maybe the aggregate exec is re-calculating its input exprs incorrectly.

❯ explain analyze select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
        l_shipdate <= date '1998-09-02'
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus;
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | CoalescePartitionsExec, metrics=[output_rows=4, elapsed_compute=21.427µs, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
|                   |   ProjectionExec: expr=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus, SUM(lineitem.l_quantity)@2 as sum_qty, SUM(lineitem.l_extendedprice)@3 as sum_base_price, SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)@4 as sum_disc_price, SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax)@5 as sum_charge, AVG(lineitem.l_quantity)@6 as avg_qty, AVG(lineitem.l_extendedprice)@7 as avg_price, AVG(lineitem.l_discount)@8 as avg_disc, COUNT(UInt8(1))@9 as count_order], metrics=[output_rows=4, elapsed_compute=114.909µs, spill_count=0, spilled_bytes=0, mem_used=0] |
|                   |     AggregateExec: mode=FinalPartitioned, gby=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus], aggr=[SUM(lineitem.l_quantity), SUM(lineitem.l_extendedprice), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax), AVG(lineitem.l_quantity), AVG(lineitem.l_extendedprice), AVG(lineitem.l_discount), COUNT(UInt8(1))], metrics=[output_rows=4, elapsed_compute=3.558233ms, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                  |
|                   |       CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=64, elapsed_compute=1.368734ms, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|                   |         RepartitionExec: partitioning=Hash([Column { name: "l_returnflag", index: 0 }, Column { name: "l_linestatus", index: 1 }], 16), input_partitions=16, metrics=[fetch_time=199.828628049s, repart_time=1.858216ms, send_time=227.016µs]                                                                                                                                                                                                                                                                                                                                                                                                    |
|                   |           AggregateExec: mode=Partial, gby=[l_returnflag@5 as l_returnflag, l_linestatus@6 as l_linestatus], aggr=[SUM(lineitem.l_quantity), SUM(lineitem.l_extendedprice), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax), AVG(lineitem.l_quantity), AVG(lineitem.l_extendedprice), AVG(lineitem.l_discount), COUNT(UInt8(1))], metrics=[output_rows=64, elapsed_compute=11.714486088s, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                 |
|                   |             ProjectionExec: expr=[l_extendedprice@1 * (Some(1),20,0 - l_discount@2) as lineitem.l_extendedprice * (Decimal128(Some(1),20,0) - lineitem.l_discount)Decimal128(Some(1),20,0) - lineitem.l_discountlineitem.l_discountDecimal128(Some(1),20,0)lineitem.l_extendedprice, l_quantity@0 as l_quantity, l_extendedprice@1 as l_extendedprice, l_discount@2 as l_discount, l_tax@3 as l_tax, l_returnflag@4 as l_returnflag, l_linestatus@5 as l_linestatus], metrics=[output_rows=5916591, elapsed_compute=377.249296ms, spill_count=0, spilled_bytes=0, mem_used=0]                                                                    |
|                   |               CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=5916591, elapsed_compute=222.551791ms, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|                   |                 FilterExec: l_shipdate@6 <= 10471, metrics=[output_rows=5916591, elapsed_compute=202.848245ms, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|                   |                   RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1, metrics=[fetch_time=6.940083ms, repart_time=1ns, send_time=1.861309ms]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|                   |                     MemoryExec: partitions=1, partition_sizes=[733], metrics=[]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set. Query took 12.538 seconds.

@alamb
Copy link
Contributor

alamb commented May 8, 2023

My next plan (for tomorrow) is to isolate which change actually caused the regression

@viirya
Copy link
Member

viirya commented May 9, 2023

From the Instruments profiling, looks like divide_and_round took too much time where most time was spent on i256 operations. I will take a look into it.

@mingmwang
Copy link
Contributor Author

I will also take a look today.

@mingmwang
Copy link
Contributor Author

mingmwang commented May 9, 2023

@viirya @alamb
I think the major problem is that the i256 math operation is not efficient.

And the logic in the method multiply_fixed_point() in arrow-rs is not consistent.

If required_scale == product_scale, it still uses i128 for multiply(can still overflow)
if product_scale > required_scale, use i256 for multiply and div, why ??

pub fn multiply_fixed_point(
    left: &PrimitiveArray<Decimal128Type>,
    right: &PrimitiveArray<Decimal128Type>,
    required_scale: i8,
) -> Result<PrimitiveArray<Decimal128Type>, ArrowError> {
    let product_scale = left.scale() + right.scale();
    let precision = min(
        left.precision() + right.precision() + 1,
        DECIMAL128_MAX_PRECISION,
    );

    if required_scale == product_scale {
        return multiply(left, right)?
            .with_precision_and_scale(precision, required_scale);
    }

    if required_scale > product_scale {
        return Err(ArrowError::ComputeError(format!(
            "Required scale {} is greater than product scale {}",
            required_scale, product_scale
        )));
    }

    let divisor =
        i256::from_i128(10).pow_wrapping((product_scale - required_scale) as u32);

    binary::<_, _, _, Decimal128Type>(left, right, |a, b| {
        let a = i256::from_i128(a);
        let b = i256::from_i128(b);

        let mut mul = a.wrapping_mul(b);
        mul = divide_and_round::<Decimal256Type>(mul, divisor);
        mul.as_i128()
    })
    .and_then(|a| a.with_precision_and_scale(precision, required_scale))
}

@mingmwang
Copy link
Contributor Author

And there is another issue related to the default scale and precision of literal value, DataFusion over estimates the default
precision for a simple literal value, which will cause the final result type's precision is over estimated, and has to round the scale, and then lead to different required_scale and product_scale.

I will try to fix this issue.

@mingmwang
Copy link
Contributor Author

mingmwang commented May 9, 2023

l_extendedprice * (1 - l_discount) * (1 + l_tax)

1 was treated as Int64 by the planner, then TypeCoercion rule cast Int64(1) to decimal(20, 0)

@alamb
Copy link
Contributor

alamb commented May 9, 2023

I believe the slowdown began in #6103 191af8d, as suspected by @mingmwang

Specifically:

  • 87a67d4 the query takes 1.85s for me
  • 191af8d (the next commit) the query takes 6.23s

Here is the data: (I built datafusion-cli at different revisions)

+ /home/alamb/2023-05-09-tpch-perf-drop/datafusion-cli-87a67d4b190d5849969ead52fe75f4d6fccbbdec -f q1.sql
DataFusion CLI v23.0.0
+--------------+--------------+-------------+-----------------+-------------------+---------------------+-----------+--------------+----------+-------------+
| l_returnflag | l_linestatus | sum_qty     | sum_base_price  | sum_disc_price    | sum_charge          | avg_qty   | avg_price    | avg_disc | count_order |
+--------------+--------------+-------------+-----------------+-------------------+---------------------+-----------+--------------+----------+-------------+
| A            | F            | 37734107.00 | 56586554400.73  | 53758257134.8700  | 55909065222.827692  | 25.522005 | 38273.129734 | 0.049985 | 1478493     |
| N            | F            | 991417.00   | 1487504710.38   | 1413082168.0541   | 1469649223.194375   | 25.516471 | 38284.467760 | 0.050093 | 38854       |
| N            | O            | 74476040.00 | 111701729697.74 | 106118230307.6056 | 110367043872.497010 | 25.502226 | 38249.117988 | 0.049996 | 2920374     |
| R            | F            | 37719753.00 | 56568041380.90  | 53741292684.6040  | 55889619119.831932  | 25.505793 | 38250.854626 | 0.050009 | 1478870     |
+--------------+--------------+-------------+-----------------+-------------------+---------------------+-----------+--------------+----------+-------------+
4 rows in set. Query took 1.858 seconds.

SHA: 191af8dcd7c4ee756b7d98e06bd7b40dc23202b7
+ /home/alamb/2023-05-09-tpch-perf-drop/datafusion-cli-191af8dcd7c4ee756b7d98e06bd7b40dc23202b7 -f q1.sql
DataFusion CLI v23.0.0
+--------------+--------------+-------------+-----------------+-------------------+---------------------+-----------+--------------+----------+-------------+
| l_returnflag | l_linestatus | sum_qty     | sum_base_price  | sum_disc_price    | sum_charge          | avg_qty   | avg_price    | avg_disc | count_order |
+--------------+--------------+-------------+-----------------+-------------------+---------------------+-----------+--------------+----------+-------------+
| A            | F            | 37734107.00 | 56586554400.73  | 53758257134.8700  | 55909065222.827692  | 25.522005 | 38273.129734 | 0.049985 | 1478493     |
| N            | F            | 991417.00   | 1487504710.38   | 1413082168.0541   | 1469649223.194375   | 25.516471 | 38284.467760 | 0.050093 | 38854       |
| N            | O            | 74476040.00 | 111701729697.74 | 106118230307.6056 | 110367043872.497010 | 25.502226 | 38249.117988 | 0.049996 | 2920374     |
| R            | F            | 37719753.00 | 56568041380.90  | 53741292684.6040  | 55889619119.831932  | 25.505793 | 38250.854626 | 0.050009 | 1478870     |
+--------------+--------------+-------------+-----------------+-------------------+---------------------+-----------+--------------+----------+-------------+
4 rows in set. Query took 6.233 seconds.

@viirya
Copy link
Member

viirya commented May 9, 2023

And there is another issue related to the default scale and precision of literal value, DataFusion over estimates the default precision for a simple literal value, which will cause the final result type's precision is over estimated, and has to round the scale, and then lead to different required_scale and product_scale.
1 was treated as Int64 by the planner, then TypeCoercion rule cast Int64(1) to decimal(20, 0)

It was not over estimated the precision for literal value. When we coerce numeric type to decimal, the precision is determined by the numeric type. For Int64, the precision is 20.

@viirya
Copy link
Member

viirya commented May 9, 2023

I think the major problem is that the i256 math operation is not efficient.

As I mentioned above (#6278 (comment)), divide_and_round spent too much time on i256 operations, especially wrapping_div and wrapping_rem. We currently round back to BigInt for these math operations which are slow.

Implementing native wrapping_div and wrapping_rem will help speeding up these operations.

I will look into it and try to implement them.

And the logic in the method multiply_fixed_point() in arrow-rs is not consistent.

If required_scale == product_scale, it still uses i128 for multiply(can still overflow) if product_scale > required_scale, use i256 for multiply and div, why ??

It is not inconsistent. It is for allowing precision-loss decimal multiplication. Our kernels basically allow overflow.

If required_scale equals to product_scale, it won't lose precision, so we can use i128. If product_scale is larger than required_scale, the multiplication will lose precision, we need to multiply in i256 and divide the result to get the result with less precision.

@mingmwang
Copy link
Contributor Author

And there is another issue related to the default scale and precision of literal value, DataFusion over estimates the default precision for a simple literal value, which will cause the final result type's precision is over estimated, and has to round the scale, and then lead to different required_scale and product_scale.
1 was treated as Int64 by the planner, then TypeCoercion rule cast Int64(1) to decimal(20, 0)

It was not over estimated the precision for literal value. When we coerce numeric type to decimal, the precision is determined by the numeric type. For Int64, the precision is 20.

I mean the literal 1 can be evaluated as Int8(1) at the very beginning.

@mingmwang
Copy link
Contributor Author

I think the major problem is that the i256 math operation is not efficient.

As I mentioned above (#6278 (comment)), divide_and_round spent too much time on i256 operations, especially wrapping_div and wrapping_rem. We currently round back to BigInt for these math operations which are slow.

Implementing native wrapping_div and wrapping_rem will help speeding up these operations.

I will look into it and try to implement them.

And the logic in the method multiply_fixed_point() in arrow-rs is not consistent.
If required_scale == product_scale, it still uses i128 for multiply(can still overflow) if product_scale > required_scale, use i256 for multiply and div, why ??

It is not inconsistent. It is for allowing precision-loss decimal multiplication. Our kernels basically allow overflow.

If required_scale equals to product_scale, it won't lose precision, so we can use i128. If product_scale is larger than required_scale, the multiplication will lose precision, we need to multiply in i256 and divide the result to get the result with less precision.

Could you please explain a bit, why if the product_scale is larger than required_scale, we need to multiply in i256 ?
Can we multiply in i128 and divide the gap back ?

@viirya
Copy link
Member

viirya commented May 10, 2023

Could you please explain a bit, why if the product_scale is larger than required_scale, we need to multiply in i256 ?
Can we multiply in i128 and divide the gap back ?

If two decimals with scale 38, the product scale will be 76. How do you multiply two i128 integers with more than 38 digits and get full precision result with more than 76 digits?

@viirya
Copy link
Member

viirya commented May 10, 2023

I mean the literal 1 can be evaluated as Int8(1) at the very beginning.

Yea, for this case, you may treat the literal with Int8 to avoid long precision, but the root cause is not resolved.

@mingmwang
Copy link
Contributor Author

Could you please explain a bit, why if the product_scale is larger than required_scale, we need to multiply in i256 ?
Can we multiply in i128 and divide the gap back ?

If two decimals with scale 38, the product scale will be 76. How do you multiply two i128 integers with more than 38 digits and get full precision result with more than 76 digits?

I understand now, thanks for the explanation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants