Fix sort in the query to keep linear index per asset in the storage constraint #1010

datejada · 2025-02-03T12:10:33Z

Related issues

Checklist

I am following the contributing guidelines
Tests are passing
Lint workflow is passing
Docs were updated and workflow is passing

github-actions · 2025-02-03T12:10:54Z

Benchmark Results

| Benchmark (time) | `19dd03f`... | `df233a7`... | `19dd03f`.../df233a771f3e63... |
|---|---|---|---|
| energy_problem/create_model | 28.5 ± 1.8 s | 28 ± 1.7 s | 1.02 |
| energy_problem/input_and_constructor | 37.7 ± 0.52 s | 37.5 ± 0.73 s | 1 |
| time_to_load | 3.98 ± 0.021 s | 4.03 ± 0.19 s | 0.987 |

| Benchmark (memory) | `19dd03f`... | `df233a7`... | `19dd03f`.../df233a771f3e63... |
|---|---|---|---|
| energy_problem/create_model | 0.257 G allocs: 12.9 GB | 0.257 G allocs: 12.9 GB | 1 |
| energy_problem/input_and_constructor | 0.0436 G allocs: 1.67 GB | 0.0436 G allocs: 1.67 GB | 1 |
| time_to_load | 0.159 k allocs: 11.2 kB | 0.159 k allocs: 11.2 kB | 1 |

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).

codecov · 2025-02-03T12:20:50Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.22%. Comparing base (8ee7150) to head (df233a7).
Report is 2 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1010   +/-   ##
=======================================
  Coverage   95.22%   95.22%           
=======================================
  Files          29       29           
  Lines        1151     1151           
=======================================
  Hits         1096     1096           
  Misses         55       55


datejada · 2025-02-03T14:08:05Z

@abelsiqueira and @suvayu This bug was difficult to reproduce since it only occurred with a large number of constraints. Empirically, with more than 10000 rows, the order of the linear index was not guaranteed (I think it is a SQL thing) in the storage constraint, where we need s and s-1. This caused infeasibilities, since the constraint was taking the previous storage variable from another asset.

This code ensures the order stays as we want in the constraint, and the benchmark seems to show that there is not much impact on the query.
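To make the failure mode concrete, here is a minimal Python sketch. It is not the TulipaEnergyModel code; the asset names `A` and `B`, the row layout, and the helper names are made up for illustration. It numbers rows in arrival order, as a sequence would, and then uses `index - 1` to find the previous storage level.

```python
# Hypothetical rows: (asset, period). Names are illustrative only.
sorted_rows = [("A", 1), ("A", 2), ("B", 1), ("B", 2)]
interleaved_rows = [("A", 1), ("B", 1), ("A", 2), ("B", 2)]


def assign_index(rows):
    """Number rows in arrival order, starting at 1 (as a sequence would)."""
    return {i: row for i, row in enumerate(rows, start=1)}


def crossovers(indexed):
    """Return (row, previous_row) pairs where index - 1 points at another asset."""
    bad = []
    for i, (asset, period) in indexed.items():
        if period == 1:
            continue  # the first period has no predecessor in this toy example
        prev_asset, prev_period = indexed[i - 1]
        if prev_asset != asset:
            bad.append(((asset, period), (prev_asset, prev_period)))
    return bad


print(crossovers(assign_index(sorted_rows)))       # rows grouped by asset
print(crossovers(assign_index(interleaved_rows)))  # rows interleaved across assets
```

With the rows grouped by asset, no lookup crosses over; with interleaved rows, the balance for one asset picks up the storage level of the other, which is the kind of mismatch described above.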

Anyway, if you have any suggestions on the SQL side to make it more efficient, please don't hesitate to add them.

Thanks!

Diego

datejada · 2025-02-04T08:54:45Z

@abelsiqueira, here are the files to reproduce the infeasibility in version 0.11.0

#1007 (comment)

The code in this PR solves it :)

datejada · 2025-02-04T09:25:04Z

@abelsiqueira I tried ordering at the end of the SQL query (using the files here #1007 (comment)), but it is still infeasible since we get the previous storage level using row.index - 1 and the index is assigned during the LEFT JOIN, which does not respect the order of t_low or attr tables. So, even if we order the table after the JOIN, the index is retrieving the wrong storage variable:

QUERIES

    DuckDB.query(
        connection,
        "CREATE OR REPLACE TEMP SEQUENCE id START 1;
        CREATE OR REPLACE TABLE var_storage_level_rep_period AS
        SELECT
            nextval('id') as index,
            t_low.asset,
            t_low.year,
            t_low.rep_period,
            t_low.time_block_start,
            t_low.time_block_end
        FROM t_lowest_all AS t_low
        LEFT JOIN asset
            ON t_low.asset = asset.asset
        WHERE
            asset.type = 'storage'
            AND asset.is_seasonal = false
        ORDER BY
            t_low.asset,
            t_low.year,
            t_low.rep_period,
            t_low.time_block_start;
        ",
    )

    DuckDB.query(
        connection,
        "CREATE OR REPLACE TEMP SEQUENCE id START 1;
        CREATE OR REPLACE TABLE var_storage_level_over_clustered_year AS
        SELECT
            nextval('id') as index,
            attr.asset,
            attr.year,
            attr.period_block_start,
            attr.period_block_end,
        FROM asset_timeframe_time_resolution AS attr
        LEFT JOIN asset
            ON attr.asset = asset.asset
        WHERE
            asset.type = 'storage'
        ORDER BY
            attr.asset,
            attr.year,
            attr.period_block_start;
        ",
    )

INFEASIBILITY

balancing for FR_Pump_Hydro_Open and getting the storage levels of the BG_Pump_Hydro_Open

 balance_storage_over_clustered_year[FR_Pump_Hydro_Open,2050,224:224] : 27.428571428571427 flow[(FR_Pump_Hydro_Open, FR_E_Balance), 2050, 2, 1:24] - 21 flow[(FR_E_Balance, FR_Pump_Hydro_Open), 2050, 2, 1:24] - storage_level_over_clustered_year[BG_Pump_Hydro_Open,2050,223:223] + storage_level_over_clustered_year[BG_Pump_Hydro_Open,2050,224:224] == 0

If you run the files with the code in v0.11.0, then you get the infeasibility that is in the description of issue #1007

So, my PR solves the issue, but please double-check with the files if this is the best way or if there is a better way to achieve the same result as what we get with these changes.

Thanks!

suvayu

Something to keep in mind with SQL when you are doing "cross-row" operations: SQL doesn't guarantee any order. Think of the results as sets. So if a certain order is necessary, you should always explicitly ask for it.

src/variables/create.jl

suvayu · 2025-02-04T14:20:15Z

> I tried ordering at the end of the SQL query (using the files here #1007 (comment)), but it is still infeasible since we get the previous storage level using row.index - 1 and the index is assigned during the LEFT JOIN, which does not respect the order of t_low or attr tables. So, even if we order the table after the JOIN, the index is retrieving the wrong storage variable

Damn, didn't see this comment before adding my review. Let me think, I don't quite understand it.

abelsiqueira

Hi @datejada, thanks for catching this. I've checked and indeed the index is created before the ordering so you're right in ordering separately. The way to do it is also correct, so I'm approving, but below I talk about alternatives.

I didn't know about the WITH clause, so I looked around and I found 3 ways to do what we need. I share them here so we get on the same page.

Option 1 is what you did, which is called a Common Table Expression (https://duckdb.org/docs/sql/query_syntax/with.html), created with WITH. It is essentially a temporary table with bounded scope.
Option 2 is to create an actual TEMP TABLE to be reused in other places, i.e., you have a first "query" to create the ordered table, then a second "query" to create the final table.
Option 3 is a subquery (https://duckdb.org/docs/sql/expressions/subqueries), which is something like SELECT ... FROM (subquery). The subquery here would be the first SELECT.

Since we only use the intermediary table once, I think there are no performance implications. According to the internet, only benchmarking these would reveal possible differences, and I think the run time is dominated by slower parts elsewhere.

I've used the second method a few times where probably the CTE would be a better idea, so I'll try to keep it in mind to clean things up later.

For this PR, the 3rd option could also fit, and it would read something like

CREATE OR REPLACE TABLE var_... AS
SELECT
    nextval('id') AS index,
    asset,
    year,
    period_block_start,
    period_block_end,
FROM ( -- Start of the subquery
    SELECT ...
    ...
    ORDER BY ...
)

i.e., the order of the two SELECTs is the opposite of the CTE version.

I don't have a preference for either, because in this case it's essentially the whole query that is a sub/CTE/temp, and we just add the index afterward - which I haven't found a nicer way to do.
Thanks again.
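
One way to check that the CTE and subquery forms produce the same table is to run both on toy data. The sketch below is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` instead of DuckDB, and because SQLite has no sequences it swaps `nextval('id')` for the window function `ROW_NUMBER() OVER (ORDER BY ...)`, so it is not the query from this PR. The table `toy_levels` and the alias `idx` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE toy_levels (asset TEXT, year INTEGER, period_block_start INTEGER);
    INSERT INTO toy_levels VALUES
        ('B', 2050, 2), ('A', 2050, 1), ('B', 2050, 1), ('A', 2050, 2);
    """
)

# CTE form: name the intermediate result first, then select from it.
cte_sql = """
WITH filtered AS (
    SELECT asset, year, period_block_start FROM toy_levels
)
SELECT
    ROW_NUMBER() OVER (ORDER BY asset, year, period_block_start) AS idx,
    filtered.*
FROM filtered
ORDER BY idx
"""

# Subquery form: the same intermediate result, written inline in FROM.
subquery_sql = """
SELECT
    ROW_NUMBER() OVER (ORDER BY asset, year, period_block_start) AS idx,
    sub.*
FROM (
    SELECT asset, year, period_block_start FROM toy_levels
) AS sub
ORDER BY idx
"""

cte_rows = conn.execute(cte_sql).fetchall()
subquery_rows = conn.execute(subquery_sql).fetchall()
print(cte_rows)
print(cte_rows == subquery_rows)
```

Here the numbering order is stated inside the window function itself, so it does not rely on the order in which the inner query returns rows. Whether that fits the DuckDB queries in this PR would need to be checked against the real data.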

abelsiqueira · 2025-02-04T14:38:23Z

@suvayu, the issue is that we want the indices to be in the same order, and apparently the nextval('id') happens before ORDER BY, so just ordering is not enough - it was also my first instinct, by the way.

suvayu · 2025-02-04T14:39:55Z

> balancing for FR_Pump_Hydro_Open and getting the storage levels of the BG_Pump_Hydro_Open
>  balance_storage_over_clustered_year[FR_Pump_Hydro_Open,2050,224:224] : 27.428571428571427 flow[(FR_Pump_Hydro_Open, FR_E_Balance), 2050, 2, 1:24] - 21 flow[(FR_E_Balance, FR_Pump_Hydro_Open), 2050, 2, 1:24] - storage_level_over_clustered_year[BG_Pump_Hydro_Open,2050,223:223] + storage_level_over_clustered_year[BG_Pump_Hydro_Open,2050,224:224] == 0
> If you run the files with the code in v0.11.0, then you get the infeasibility that is in the description of issue #1007

Based on this comment, my hunch is the query that uses row.index - 1 has the bug. Basically, you have a table with multiple assets, and you are doing a cross-row calculation, for a given asset. So you would need an equivalent of WHERE asset = 'bla' to avoid exactly this kind of crossover. I wouldn't actually expect row.index - 1 to work at all, since it is undefined for WHERE index = 1. I would expect a CASE statement to handle this case.

Does that make sense, or did I misunderstand the intention of that cross-row operation?

abelsiqueira · 2025-02-04T14:41:45Z

PS. I'm checking whether the ordering is the desired one with visual inspection of the following query:

SELECT * FROM the_relevant_table
WHERE period_block_start=1 OR period_block_end=365
ORDER BY index

The rows should be something like

94×5 DataFrame
 Row │ index   asset                 year    period_block_start  period_block_end
     │ Int64?  String?               Int32?  Int32?              Int32?
─────┼────────────────────────────────────────────────────────────────────────────
   1 │      1  AT_Hydro_Reservoir      2050                   1                 1
   2 │    365  AT_Hydro_Reservoir      2050                 365               365
   3 │    366  AT_Pump_Hydro_Closed    2050                   1                 1
   4 │    730  AT_Pump_Hydro_Closed    2050                 365               365
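
The visual inspection above could also be automated. The following is a minimal Python sketch, assuming the table has been fetched as `(index, asset, year, period_block_start)` tuples; the function name and the tuple layout are hypothetical and not part of the package.

```python
def index_is_linear_per_asset(rows):
    """rows: iterable of (index, asset, year, period_block_start) tuples.

    Returns True when, walking the rows by index, the indices are consecutive,
    each (asset, year) group is contiguous, and period_block_start increases
    within each group.
    """
    ordered = sorted(rows, key=lambda r: r[0])
    seen_groups = set()
    previous = None
    for index, asset, year, start in ordered:
        group = (asset, year)
        if previous is not None:
            prev_index, prev_group, prev_start = previous
            if index != prev_index + 1:
                return False  # gap or duplicate in the index
            if group == prev_group:
                if start <= prev_start:
                    return False  # periods out of order within a group
            elif group in seen_groups:
                return False  # group was left and re-entered: interleaving
        seen_groups.add(group)
        previous = (index, group, start)
    return True


good = [(1, "A", 2050, 1), (2, "A", 2050, 2), (3, "B", 2050, 1), (4, "B", 2050, 2)]
bad = [(1, "A", 2050, 1), (2, "B", 2050, 1), (3, "A", 2050, 2), (4, "B", 2050, 2)]
print(index_is_linear_per_asset(good), index_is_linear_per_asset(bad))
```

A check like this covers every row, whereas filtering on the first and last period only inspects the boundaries of each asset's block.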

suvayu · 2025-02-04T14:43:21Z

> apparently the nextval('id') happens before ORDER BY

Somehow this makes sense :-p. I would still look at that cross-row query for a bug, because it should not cross over in the first place.

suvayu · 2025-02-04T14:49:12Z

BTW, I think you can simplify the second SELECT inside the WITH to:

SELECT nextval('id') AS index, * FROM filtered_data;

abelsiqueira · 2025-02-04T14:51:37Z

> Based on this comment, my hunch is the query that uses row.index - 1 has the bug.

It's completely possible. This is the constraint where this (or a similar) table is used: https://github.com/TulipaEnergy/TulipaEnergyModel.jl/blob/main/src/constraints/storage.jl#L23-L75

> Basically, you have a table with multiple assets, and you are doing a cross-row calculation, for a given asset. So you would need an equivalent of WHERE asset = 'bla' to avoid exactly this kind of crossover. I wouldn't actually expect row.index - 1 to work at all, since it is undefined for WHERE index = 1. I would expect a CASE statement to handle this case.
>
> Does that make sense, or did I misunderstand the intention of that cross-row operation?

Initially we were doing that (fixing an asset), but it's all a single vector of constraints now, which is faster to create and aligned with the idea of attaching the container to the table in a one-to-one correspondence.
I haven't looked into CASE, but at constraint creation time we do checks on the Julia side. For the index=1 case, we loop around to get the last index (for the same asset, year, rp).
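
The wrap-around described here can be sketched in Python as follows. This is an illustration only, not the Julia code in src/constraints/storage.jl: the function name and the `(index, asset, year, rep_period, time_block_start)` tuple layout are assumptions, and the sketch groups rows explicitly instead of relying on the linear index order.

```python
from collections import defaultdict


def previous_index_map(rows):
    """rows: iterable of (index, asset, year, rep_period, time_block_start).

    Returns {index: previous_index}, where the first row of each
    (asset, year, rep_period) group wraps around to the group's last row.
    """
    groups = defaultdict(list)
    for index, asset, year, rep_period, start in rows:
        groups[(asset, year, rep_period)].append((start, index))

    previous = {}
    for members in groups.values():
        members.sort()  # order by time_block_start within the group
        indices = [index for _, index in members]
        for position, index in enumerate(indices):
            # position - 1 is -1 for the first row, which selects the last one
            previous[index] = indices[position - 1]
    return previous


rows = [
    (1, "A", 2050, 1, 1), (2, "A", 2050, 1, 2), (3, "A", 2050, 1, 3),
    (4, "B", 2050, 1, 1), (5, "B", 2050, 1, 2),
]
print(previous_index_map(rows))
```

Because the lookup is keyed by group, a previous index can never come from another asset, at the cost of building the map up front.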

datejada · 2025-02-04T15:00:13Z

Hi @abelsiqueira and @suvayu thank you both for the good insights! I am learning a lot.

So, Suvayu, I resolved the comments since the next comment from Abel answered them. But before I merge the PR, let me know if you still think that something can be improved here.

Thanks!

suvayu · 2025-02-04T15:00:22Z

I'll look at the constraint.

About the 3 alternatives, (1) & (3) are equivalent. People tend to prefer (1) when you want to refer to the result of that query in a few different places. It's kind of syntactic sugar. Both options run the query, so both cost compute. If you are using it in many places, (1) gives you the option to materialize, and then you can actually gain some performance. For a single reference, though, materialisation would make it slower. I hope that made sense :-p

suvayu · 2025-02-04T15:01:20Z

> if you still think that something can be improved here.

Only that you can use * instead of typing all the column names again.

datejada · 2025-02-04T20:29:44Z

> > if you still think that something can be improved here.
>
> Only that you can use * instead of typing all the column names again.

@suvayu thanks! I changed it to use * 😄 It is ready for you to review again before merging. Note that I used filtered_assets.* to name the table explicitly (to keep the style of the other SQL queries in the code).

suvayu

LGTM

Fix sort in the query to keep linear index

5c1f9ac

datejada added the benchmark PR only - Run benchmark on PR label Feb 3, 2025

datejada marked this pull request as ready for review February 3, 2025 13:56

datejada requested a review from abelsiqueira February 3, 2025 13:57

datejada changed the title ~~Fix sort in the query to keep linear index~~ Fix sort in the query to keep linear index per asset in the storage constraint Feb 3, 2025

suvayu requested changes Feb 4, 2025

abelsiqueira approved these changes Feb 4, 2025

Apply suggestions from code review

df233a7

datejada requested a review from suvayu February 4, 2025 20:27

suvayu approved these changes Feb 5, 2025

datejada merged commit d4f73da into main Feb 5, 2025
7 checks passed

datejada deleted the 1007-bug-storage-constraints branch February 5, 2025 11:35
