Fix sort in the query to keep linear index per asset in the storage constraint #1010

Merged (2 commits) on Feb 5, 2025

Conversation

datejada
Member

@datejada datejada commented Feb 3, 2025

Related issues

Closes #1007

Checklist

  • I am following the contributing guidelines

  • Tests are passing

  • Lint workflow is passing

  • Docs were updated and workflow is passing

@datejada datejada added the benchmark PR only - Run benchmark on PR label Feb 3, 2025
Contributor

github-actions bot commented Feb 3, 2025

Benchmark Results

                                       19dd03f...         df233a7...         19dd03f.../df233a771f3e63...
energy_problem/create_model            28.5 ± 1.8 s       28 ± 1.7 s         1.02
energy_problem/input_and_constructor   37.7 ± 0.52 s      37.5 ± 0.73 s      1
time_to_load                           3.98 ± 0.021 s     4.03 ± 0.19 s      0.987

                                       19dd03f...                  df233a7...                  19dd03f.../df233a771f3e63...
energy_problem/create_model            0.257 G allocs: 12.9 GB     0.257 G allocs: 12.9 GB     1
energy_problem/input_and_constructor   0.0436 G allocs: 1.67 GB    0.0436 G allocs: 1.67 GB    1
time_to_load                           0.159 k allocs: 11.2 kB     0.159 k allocs: 11.2 kB     1

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).

codecov bot commented Feb 3, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.22%. Comparing base (8ee7150) to head (df233a7).
Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1010   +/-   ##
=======================================
  Coverage   95.22%   95.22%           
=======================================
  Files          29       29           
  Lines        1151     1151           
=======================================
  Hits         1096     1096           
  Misses         55       55           

@datejada datejada marked this pull request as ready for review February 3, 2025 13:56
@datejada datejada requested a review from abelsiqueira February 3, 2025 13:57
@datejada
Member Author

datejada commented Feb 3, 2025

@abelsiqueira and @suvayu, this bug was difficult to reproduce since it only occurred for a large number of constraints. Empirically, with more than 10000 rows, the linear index was not guaranteed to follow the row order (I think it is a SQL thing) in the storage constraint, where we need s and s-1. This was causing infeasibilities, since the constraint was taking the previous storage variable from another asset.

This code ensures the order stays as we want in the constraint, and the benchmark suggests that there is not much impact on the query.

Anyway, if you have any suggestions on the SQL side to make it more efficient, please don't hesitate to add them.

Thanks!

Diego
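
To make the crossover described above concrete, here is a minimal sketch in plain Python (not the project's Julia/DuckDB code). The asset names `A` and `B`, the interleaved row order, and the helpers `assign_index` and `crossovers` are all assumptions made up for illustration. The sketch mimics a lookup of the previous storage level through `index - 1` and reports where that lookup lands on a different asset.

```python
# Hypothetical rows: (asset, time_block). Names are made up for illustration.
sorted_rows = [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 2), ("B", 3)]
# One possible order an engine might produce when no ordering is applied
# before the index is assigned (this interleaving is an assumption).
interleaved_rows = [("A", 1), ("B", 1), ("A", 2), ("B", 2), ("A", 3), ("B", 3)]


def assign_index(rows):
    """Number the rows 1..n in arrival order, like a sequence would."""
    return {i: row for i, row in enumerate(rows, start=1)}


def crossovers(indexed):
    """Return the indices whose `index - 1` neighbour belongs to another asset.

    The first time block of each asset is skipped because it has no
    predecessor inside the same asset.
    """
    bad = []
    for i, (asset, block) in indexed.items():
        if block == 1:
            continue
        prev_asset, _ = indexed[i - 1]
        if prev_asset != asset:
            bad.append(i)
    return bad


print(crossovers(assign_index(sorted_rows)))
print(crossovers(assign_index(interleaved_rows)))
```

With the sorted input every `index - 1` lookup stays inside one asset; with the interleaved input every non-initial block picks up a neighbour from the other asset, which is the kind of mix-up reported in the issue.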

@datejada datejada changed the title Fix sort in the query to keep linear index Fix sort in the query to keep linear index per asset in the storage constraint Feb 3, 2025
@datejada
Member Author

datejada commented Feb 4, 2025

@abelsiqueira, here are the files to reproduce the infeasibility in version 0.11.0

#1007 (comment)

The code in this PR solves it :)

@datejada
Member Author

datejada commented Feb 4, 2025

@abelsiqueira I tried ordering at the end of the SQL query (using the files here #1007 (comment)), but it is still infeasible since we get the previous storage level using row.index - 1 and the index is assigned during the LEFT JOIN, which does not respect the order of t_low or attr tables. So, even if we order the table after the JOIN, the index is retrieving the wrong storage variable:

QUERIES

    DuckDB.query(
        connection,
        "CREATE OR REPLACE TEMP SEQUENCE id START 1;
        CREATE OR REPLACE TABLE var_storage_level_rep_period AS
        SELECT
            nextval('id') as index,
            t_low.asset,
            t_low.year,
            t_low.rep_period,
            t_low.time_block_start,
            t_low.time_block_end
        FROM t_lowest_all AS t_low
        LEFT JOIN asset
            ON t_low.asset = asset.asset
        WHERE
            asset.type = 'storage'
            AND asset.is_seasonal = false
        ORDER BY
            t_low.asset,
            t_low.year,
            t_low.rep_period,
            t_low.time_block_start;
        ",
    )

    DuckDB.query(
        connection,
        "CREATE OR REPLACE TEMP SEQUENCE id START 1;
        CREATE OR REPLACE TABLE var_storage_level_over_clustered_year AS
        SELECT
            nextval('id') as index,
            attr.asset,
            attr.year,
            attr.period_block_start,
            attr.period_block_end,
        FROM asset_timeframe_time_resolution AS attr
        LEFT JOIN asset
            ON attr.asset = asset.asset
        WHERE
            asset.type = 'storage'
        ORDER BY
            attr.asset,
            attr.year,
            attr.period_block_start;
        ",
    )

INFEASIBILITY

balancing for FR_Pump_Hydro_Open and getting the storage levels of the BG_Pump_Hydro_Open

 balance_storage_over_clustered_year[FR_Pump_Hydro_Open,2050,224:224] : 27.428571428571427 flow[(FR_Pump_Hydro_Open, FR_E_Balance), 2050, 2, 1:24] - 21 flow[(FR_E_Balance, FR_Pump_Hydro_Open), 2050, 2, 1:24] - storage_level_over_clustered_year[BG_Pump_Hydro_Open,2050,223:223] + storage_level_over_clustered_year[BG_Pump_Hydro_Open,2050,224:224] == 0

If you run the files with the code in v0.11.0, then you get the infeasibility that is in the description of issue #1007

So, my PR solves the issue, but please double-check with the files if this is the best way or if there is a better way to achieve the same result as what we get with these changes.

Thanks!

Member

@suvayu suvayu left a comment

Something to keep in mind when you are doing "cross row" operations: SQL doesn't guarantee any order. Think of the results as sets. So if a certain order is necessary, you should always explicitly ask for it.

Two review threads on src/variables/create.jl (both outdated and resolved).
@suvayu
Member

suvayu commented Feb 4, 2025

I tried ordering at the end of the SQL query (using the files here #1007 (comment)), but it is still infeasible since we get the previous storage level using row.index - 1 and the index is assigned during the LEFT JOIN, which does not respect the order of t_low or attr tables. So, even if we order the table after the JOIN, the index is retrieving the wrong storage variable

Damn, didn't see this comment before adding my review. Let me think, I don't quite understand it.

Member

@abelsiqueira abelsiqueira left a comment

Hi @datejada, thanks for catching this. I've checked and indeed the index is created before the ordering so you're right in ordering separately. The way to do it is also correct, so I'm approving, but below I talk about alternatives.

I didn't know about the WITH clause, so I looked around and I found 3 ways to do what we need. I share them here so we get on the same page.

  1. The first way is what you did, which is called a Common Table Expression (https://duckdb.org/docs/sql/query_syntax/with.html), created with WITH. It is essentially a temporary table with bounded scope
  2. Option 2 is to create an actual TEMP TABLE to be reused in other places. I.e., you have the first "query" to create the ordered table, then a second "query" to create the final table.
  3. Option 3 is a subquery (https://duckdb.org/docs/sql/expressions/subqueries), which is something like SELECT ... FROM (subquery). The subquery here would be the first SELECT.

Since we only use the intermediary table once, I think there are no performance implications, and according to the internet, only benchmarking these would reveal possible differences. In any case, I think the run time is dominated by other, slower parts.

I've used the second method a few times where probably the CTE would be a better idea, so I'll try to keep it in mind to clean things up later.

For this PR, the 3rd option could also fit, and it would read something like

CREATE OR REPLACE TABLE var_... AS
SELECT
    nextval('id') AS index,
    asset,
    year,
    period_block_start,
    period_block_end,
FROM ( -- Start of the subquery
    SELECT ...
    ...
    ORDER BY ...
)

i.e., the order of the SELECT is the opposite of the CTEs.

I don't have a preference for either, because in this case it's essentially the whole query that is a sub/CTE/temp, and we just add the index afterward - which I haven't found a nicer way to do.
Thanks again.
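
As a rough illustration of options 1 and 3, the sketch below uses Python's built-in `sqlite3` module instead of DuckDB, and `row_number() OVER (ORDER BY ...)` instead of `nextval('id')`. It is therefore a swapped-in technique, not the query in this PR. The table `demo`, its rows, and the column alias `idx` are made up, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE demo (asset TEXT, year INTEGER, period_block_start INTEGER)"
)
# Insert rows deliberately out of order (made-up data).
con.executemany(
    "INSERT INTO demo VALUES (?, ?, ?)",
    [("B", 2050, 2), ("A", 2050, 1), ("B", 2050, 1), ("A", 2050, 2)],
)

# The numbering carries its own ORDER BY, so it does not depend on the
# order in which the inner query happens to return its rows.
numbering = "row_number() OVER (ORDER BY asset, year, period_block_start)"

# Option 1: common table expression (WITH); filter first, then number.
cte_sql = f"""
WITH filtered AS (
    SELECT asset, year, period_block_start FROM demo WHERE year = 2050
)
SELECT {numbering} AS idx, filtered.* FROM filtered
ORDER BY idx
"""

# Option 3: the same inner SELECT written as a subquery.
subquery_sql = f"""
SELECT {numbering} AS idx, sub.* FROM (
    SELECT asset, year, period_block_start FROM demo WHERE year = 2050
) AS sub
ORDER BY idx
"""

cte_rows = con.execute(cte_sql).fetchall()
subquery_rows = con.execute(subquery_sql).fetchall()
print(cte_rows)
print(cte_rows == subquery_rows)
```

In this sketch both forms return the same numbered rows; the only difference is whether the inner SELECT is written before or inside the outer one.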

@abelsiqueira
Member

@suvayu, the issue is that we want the indices to be in the same order, and apparently the nextval('id') happens before ORDER BY, so just ordering is not enough - it was also my first instinct, by the way.

@suvayu
Member

suvayu commented Feb 4, 2025

balancing for FR_Pump_Hydro_Open and getting the storage levels of the BG_Pump_Hydro_Open

 balance_storage_over_clustered_year[FR_Pump_Hydro_Open,2050,224:224] : 27.428571428571427 flow[(FR_Pump_Hydro_Open, FR_E_Balance), 2050, 2, 1:24] - 21 flow[(FR_E_Balance, FR_Pump_Hydro_Open), 2050, 2, 1:24] - storage_level_over_clustered_year[BG_Pump_Hydro_Open,2050,223:223] + storage_level_over_clustered_year[BG_Pump_Hydro_Open,2050,224:224] == 0

If you run the files with the code in v0.11.0, then you get the infeasibility that is in the description of issue #1007

Based on this comment, my hunch is the query that uses row.index - 1 has the bug. Basically, you have a table with multiple assets, and you are doing a cross-row calculation, for a given asset. So you would need an equivalent of WHERE asset = 'bla' to avoid exactly this kind of crossover. I wouldn't actually expect row.index - 1 to work at all, since it is undefined for WHERE index = 1. I would expect a CASE statement to handle this case.

Does that make sense, or did I misunderstand the intention of that cross-row operation?

@abelsiqueira
Member

PS. I'm checking whether the ordering is the desired one with visual inspection of the following query:

SELECT * FROM the_relevant_table
WHERE period_block_start=1 OR period_block_end=365
ORDER BY index

The rows should be something like

94×5 DataFrame
 Row │ index   asset                 year    period_block_start  period_block_end
     │ Int64?  String?               Int32?  Int32?              Int32?
─────┼────────────────────────────────────────────────────────────────────────────
   1 │      1  AT_Hydro_Reservoir      2050                   1                 1
   2 │    365  AT_Hydro_Reservoir      2050                 365               365
   3 │    366  AT_Pump_Hydro_Closed    2050                   1                 1
   4 │    730  AT_Pump_Hydro_Closed    2050                 365               365
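
The visual inspection above could also be turned into an automated check. The following is a minimal sketch in plain Python, assuming the rows have already been fetched as `(index, asset)` tuples; the helper name `assets_are_contiguous` and the test data are hypothetical and only mirror the layout of the expected table.

```python
def assets_are_contiguous(rows):
    """rows: iterable of (index, asset) tuples.

    Returns True when, walking the rows in index order, each asset
    appears in exactly one uninterrupted block of indices.
    """
    seen = set()
    current = None
    for _, asset in sorted(rows):
        if asset != current:
            if asset in seen:
                # The asset shows up again after another one: interleaved.
                return False
            seen.add(asset)
            current = asset
    return True


# Made-up rows mirroring the expected layout (365 rows per asset).
good = [(i, "AT_Hydro_Reservoir") for i in range(1, 366)]
good += [(i, "AT_Pump_Hydro_Closed") for i in range(366, 731)]

# Same rows, but with indices 365 and 366 swapped across the two assets.
bad = [(i, a) for i, a in good if i not in (365, 366)]
bad += [(365, "AT_Pump_Hydro_Closed"), (366, "AT_Hydro_Reservoir")]

print(assets_are_contiguous(good), assets_are_contiguous(bad))
```

A check like this only confirms that assets are not interleaved; it does not verify the order of the time blocks inside each asset.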

@suvayu
Member

suvayu commented Feb 4, 2025

apparently the nextval('id') happens before ORDER BY

Somehow this makes sense :-p. I would still look at that cross-row query for a bug, because it should not crossover in the first place.

@suvayu
Member

suvayu commented Feb 4, 2025

BTW, I think you can simplify the second SELECT inside the WITH to:

SELECT nextval('id') AS index, * FROM filtered_data;

@abelsiqueira
Member

Based on this comment, my hunch is the query that uses row.index - 1 has the bug.

It's completely possible. This is the constraint where this (or a similar) table is used: https://github.com/TulipaEnergy/TulipaEnergyModel.jl/blob/main/src/constraints/storage.jl#L23-L75

Basically, you have a table with multiple assets, and you are doing a cross-row calculation, for a given asset. So you would need an equivalent of WHERE asset = 'bla' to avoid exactly this kind of crossover. I wouldn't actually expect row.index - 1 to work at all, since it is undefined for WHERE index = 1. I would expect a CASE statement to handle this case.

Does that make sense, or did I misunderstand the intention of that cross-row operation?

Initially we were doing that (fixing an asset), but it's all a single vector of constraints now, which is faster to create, and aligned with the idea of attaching the container to the table on a one-to-one correspondence.
I haven't looked into CASE, but at constraint creation time we do the checks on the Julia side. For the index=1 case, we loop around to get the last index (for the same asset, year, rp).
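
For comparison, the loop-around lookup can be sketched with explicit grouping. This is a hypothetical alternative in plain Python, not the code in src/constraints/storage.jl, which uses `row.index - 1` on an ordered table. The helper `previous_index` and the row layout (dicts with `index`, `asset`, `year`, `rep_period`, `time_block_start`) are assumptions for illustration.

```python
from collections import defaultdict


def previous_index(rows):
    """Map each row index to the index of the previous time block.

    Rows are grouped by (asset, year, rep_period); the first block of a
    group wraps around to the last block of the same group.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["asset"], row["year"], row["rep_period"])].append(row)

    prev = {}
    for members in groups.values():
        ordered = sorted(members, key=lambda r: r["time_block_start"])
        for pos, row in enumerate(ordered):
            # pos - 1 is -1 for the first block, which selects the last one.
            prev[row["index"]] = ordered[pos - 1]["index"]
    return prev


# Made-up rows whose indices are interleaved across two assets.
rows = [
    {"index": 1, "asset": "A", "year": 2050, "rep_period": 1, "time_block_start": 1},
    {"index": 2, "asset": "B", "year": 2050, "rep_period": 1, "time_block_start": 1},
    {"index": 3, "asset": "A", "year": 2050, "rep_period": 1, "time_block_start": 2},
    {"index": 4, "asset": "B", "year": 2050, "rep_period": 1, "time_block_start": 2},
]
print(previous_index(rows))
```

Grouping makes the lookup independent of how the indices were assigned, at the cost of an extra pass over the rows. The approach in this thread instead keeps a single vector of constraints and relies on the table being ordered.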

@datejada
Member Author

datejada commented Feb 4, 2025

Hi @abelsiqueira and @suvayu thank you both for the good insights! I am learning a lot.

So, Suvayu, I resolved the comments since the next comment from Abel answered them. But before I merge the PR, let me know if you still think that something can be improved here.

Thanks!

@suvayu
Member

suvayu commented Feb 4, 2025

I'll look at the constraint.

About the 3 alternatives, (1) and (3) are equivalent. People tend to prefer (1) when they want to refer to the result of that query in a few different places; it's kind of syntactic sugar. Both options run the query, so both cost compute. If you are using it in many places, (1) gives you the option to materialize, and then you can actually gain some performance. For a single reference though, materialisation would make it slower. I hope that made sense :-p

@suvayu
Member

suvayu commented Feb 4, 2025

if you still think that something can be improved here.

Only that you can use * instead of typing all the column names again.

@datejada datejada requested a review from suvayu February 4, 2025 20:27
@datejada
Member Author

datejada commented Feb 4, 2025

if you still think that something can be improved here.

Only that you can use * instead of typing all the column names again.

@suvayu thanks! I changed it to use * 😄 and it is ready for you to review again before merging. I used filtered_assets.* though, to be explicit about the table name (keeping the style of the other SQL queries in the code).

Member

@suvayu suvayu left a comment

LGTM

@datejada datejada merged commit d4f73da into main Feb 5, 2025
7 checks passed
@datejada datejada deleted the 1007-bug-storage-constraints branch February 5, 2025 11:35
Labels
benchmark PR only - Run benchmark on PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] The storage balance constraints are not correctly created
3 participants