-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix sort in the query to keep linear index per asset in the storage constraint #1010
Conversation
Benchmark Results
Benchmark PlotsA plot of the benchmark results have been uploaded as an artifact to the workflow run for this PR. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1010 +/- ##
=======================================
Coverage 95.22% 95.22%
=======================================
Files 29 29
Lines 1151 1151
=======================================
Hits 1096 1096
Misses 55 55 ☔ View full report in Codecov by Sentry. |
@abelsiqueira and @suvayu This error/bug was difficult to reproduce since it only occurred for a large number of constraints. Empirically, with more than 10000 rows, the linear index was not guaranteed (I think it is a SQL thing) for the storage constraint where we need This code ensures the order stays as we want in the constraint, and the benchmark seems to show that there is no much impact in query. Anyway, any suggestion on the SQL side, to make it more efficient, please don't hesitate to add it. Thanks! Diego |
@abelsiqueira, here are the files to reproduce the infeasibility in version 0.11.0 The code in this PR solves it :) |
@abelsiqueira I tried ordering at the end of the SQL query (using the files here #1007 (comment)), but it is still infeasible since we get the previous storage level using QUERIES DuckDB.query(
connection,
"CREATE OR REPLACE TEMP SEQUENCE id START 1;
CREATE OR REPLACE TABLE var_storage_level_rep_period AS
SELECT
nextval('id') as index,
t_low.asset,
t_low.year,
t_low.rep_period,
t_low.time_block_start,
t_low.time_block_end
FROM t_lowest_all AS t_low
LEFT JOIN asset
ON t_low.asset = asset.asset
WHERE
asset.type = 'storage'
AND asset.is_seasonal = false
ORDER BY
t_low.asset,
t_low.year,
t_low.rep_period,
t_low.time_block_start;
",
)
DuckDB.query(
connection,
"CREATE OR REPLACE TEMP SEQUENCE id START 1;
CREATE OR REPLACE TABLE var_storage_level_over_clustered_year AS
SELECT
nextval('id') as index,
attr.asset,
attr.year,
attr.period_block_start,
attr.period_block_end,
FROM asset_timeframe_time_resolution AS attr
LEFT JOIN asset
ON attr.asset = asset.asset
WHERE
asset.type = 'storage'
ORDER BY
attr.asset,
attr.year,
attr.period_block_start;
",
) INFEASIBILITY balancing for
If you run the files with the code in v0.11.0, then you get the infeasibility that is in the description of issue #1007 So, my PR solves the issue, but please double-check with the files if this is the best way or if there is a better way to achieve the same result as what we get with these changes. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something to keep in mind with SQL when you are doing "cross row" operations, SQL doesn't guarantee any order. Think of the results as sets
. So if a certain order is necessary, you should always explicitly ask for it.
Damn, didn't see this comment before adding my review. Let me think, I don't quite understand it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @datejada, thanks for catching this. I've checked and indeed the index
is created before the ordering so you're right in ordering separately. The way to do it is also correct, so I'm approving, but below I talk about alternatives.
I didn't know about the WITH
clause, so I looked around and I found 3 ways to do what we need. I share them here so we get on the same page.
- The first way is what you did, which is called a Common Table Expression (https://duckdb.org/docs/sql/query_syntax/with.html), created with
WITH
. It is essentially a temporary table with bounded scope - Option 2 is to create an actual
TEMP TABLE
to be reused in other places. I.e., you have the first "query" to create the ordered table, then a second "query" to create the final table. - Option 3 is a subquery (https://duckdb.org/docs/sql/expressions/subqueries), which is something like
SELECT ... FROM (subquery)
. Thesubquery
here would be the firstSELECT
.
Since we only use the intermediary table once, I think there are no performance implications, and according to the internet, only benchmarking these would reveal possible differences - and I think the time is dominated in worse parts.
I've used the second method a few times where probably the CTE would be a better idea, so I'll try to keep it in mind to clean things up later.
For this PR, the 3rd option could also fit, and it would read something like
CREATE OR REPLACE TABLE var_... AS
SELECT
nextval('id') AS index,
asset,
year,
period_block_start,
period_block_end,
FROM ( -- Start of the subquery
SELECT ...
...
ORDER BY ...
)
i.e., the order of the SELECT
is the opposite of the CTEs.
I don't have a preference for either, because in this case it's essentially the whole query that is a sub/CTE/temp, and we just add the index afterward - which I haven't found a nicer way to do.
Thanks again.
@suvayu, the issue is that we want the indices to be in the same order, and apparently the |
Based on this comment, my hunch is the query that uses Does that make sense, or did I misunderstand the intention of that cross-row operation? |
PS. I'm checking whether the ordering is the desired one with visual inspection of the following query:
The rows should be something like
|
Somehow this makes sense :-p. I would still look at that cross-row query for a bug, because it should not crossover in the first place. |
BTW, I think you can simplify the second SELECT nextval('id') AS index, * FROM filtered_data; |
It's completely possible. This is the constraint where this (or a similar) table is used: https://github.com/TulipaEnergy/TulipaEnergyModel.jl/blob/main/src/constraints/storage.jl#L23-L75
Initially we were doing that (fixing an asset), but it's all a single vector of constraints now, which is faster to create, and aligned with the idea of attaching the container to the table on a one-to-one correspondence. |
Hi @abelsiqueira and @suvayu thank you both for the good insights! I am learning a lot. So, Suvayu, I resolved the comments since the next comment from Abel answered them. But before I merge the PR, let me know if you still think that something can be improved here. Thanks! |
I'll look at the constraint. About the 3 alternatives, (1) & (3) are equivalent. People tend to prefer (1) when you want to refer to the result of that query in a few different places. It's kind of syntactic sugar. Both options run the query, so costs compute. If you are using it in many places (1) gives you the option to materialize. Then you can actually gain some performance. For a single reference though, materialisation would make it slower. I hope that made sense :-p |
Only that you can use |
@suvayu thanks! I changed it with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Related issues
Closes #1007
Checklist
I am following the contributing guidelines
Tests are passing
Lint workflow is passing
Docs were updated and workflow is passing