Optimize 'Leaderboard - FIL Earned' query: add migration for participant_id and update query using ROW_NUMBER() for faster performance #324

Hany-Almnaem · 2025-02-14T13:38:37Z

This PR optimizes the SQL query used for the "Leaderboard - FIL Earned" panel. The changes include:

Adding a new migration script (023.do.add-participant-id.sql) that introduces a participant_id column and corresponding indexes to replace text-based joins.
Updating the query to use ROW_NUMBER() to efficiently select the latest scheduled rewards per participant.

Performance Improvements:

Execution time is reduced by ~41%.
Sorting is ~40% faster and consumes ~48% less disk space.
Aggregation is ~42% faster, using ~40% less memory.
Joins are streamlined, avoiding unnecessary hashing, and leveraging ROW_NUMBER() for better efficiency.
Query planning time is cut by ~92%.

Performance Comparison

After Optimization:

Testing:
I have tested these changes in my local environment with backfilled test data fetch from the API, and the results confirm a significant performance improvement.

Please review and let me know if any further changes are needed.

…ant_id and update query using ROW_NUMBER() for faster performance

bajtos

Thank you for your contribution, @Hany-Almnaem! ❤️

bajtos · 2025-02-17T09:18:53Z

db/migrations/023.do.add-participant-id.sql

+-- Backfill participant_id in daily_scheduled_rewards based on participant_address
+WITH address_mapping AS (
+  SELECT DISTINCT participant_address AS address,
+         ROW_NUMBER() OVER (ORDER BY participant_address) AS generated_id
+  FROM daily_scheduled_rewards
+)


I am concerned about using ROW_NUMBER() as the generated id:

Is it guaranteed that we will always get the same generated id for the same participant address? For example, in the following query from your pull request, were are limiting which rows we look at:

SELECT participant_id, scheduled_rewards, participant_address, ROW_NUMBER() OVER (PARTITION BY participant_id ORDER BY day DESC) AS rn FROM daily_scheduled_rewards WHERE day <= $2

In this particular case, since the WHERE condition removes recent deals, I think the query should work. However, if we need to write a different query in the future, one that will look at recent rows only (e.g. WHERE day >= $2) - would we still get the same participant id?

What is the performance cost of running ROW_NUMBER() OVER (ORDER BY participant_address)?

Have you considered creating a new table mapping participant addresses to IDs and introducing a foreign key constraint/relation?

We have had a good experience with such schema design in spark-evaluate, see e.g. here:

https://github.com/CheckerNetwork/spark-evaluate/blob/733c5b974426027f7b0d100f2f7bf427ee1a1868/migrations/003.do.daily-participants.sql#L1-L4

My knowledge of advanced PostgreSQL is limited. It's possible your solution is correct and I just don't understand enough details to see that.

Let's discuss!

@juliangruber do you have any thoughts on this?

When I was giving this a quick look, I was also concerned about using row number here. But I was not sure whether I understood enough about the situation to leave a comment about it.

My understanding is that this is a one-time migration, so we don't need to worry about the query behaving consistently - as long as it takes care of assigning unique ids once, and as long as any future rows will have ids not conflicting with these ones.

Would https://stackoverflow.com/a/78355281 be safer?

In the previous solution, I aimed to minimize schema changes. However, if future queries rely on filtering conditions like WHERE day >= $2, using ROW_NUMBER() becomes risky because it dynamically assigns IDs based on available addresses. those IDs could shift around if the dataset changes in some cases like the filtering condition focuses on recent records only, which isn't great for reliability.

I think a more robust solution as you suggest is to create a separate mapping table that assigns stable and efficiently retrievable IDs to participant_address values and introduce a foreign key constraint to enforce consistency.

This approach would mean updating both daily_scheduled_rewards and daily_reward_transfers tables to work with these foreign keys, but it's worth it.

From a performance angle, there's a clear win here too. ROW_NUMBER() has to sort everything each time (O(n log n) worst-case scenarios), while looking up IDs in a mapping table is faster with proper indexing (O(1)).

Happy to walk through the migration steps if you want to give it a shot!
@juliangruber, @bajtos

The proposed solution sounds great to me, but let's wait for a confirmation from @bajtos

ping @bajtos

Optimize 'Leaderboard - FIL Earned' query: add migration for particip…

512d094

…ant_id and update query using ROW_NUMBER() for faster performance

Hany-Almnaem requested review from bajtos, juliangruber, pyropy and NikolasHaimerl as code owners February 14, 2025 13:38

Merge branch 'CheckerNetwork:main' into optimize-query-performance

0925932

bajtos reviewed Feb 17, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize 'Leaderboard - FIL Earned' query: add migration for participant_id and update query using ROW_NUMBER() for faster performance #324

Optimize 'Leaderboard - FIL Earned' query: add migration for participant_id and update query using ROW_NUMBER() for faster performance #324

Hany-Almnaem commented Feb 14, 2025

bajtos left a comment

bajtos Feb 17, 2025

juliangruber Feb 17, 2025

Hany-Almnaem Feb 17, 2025

juliangruber Feb 17, 2025

juliangruber Feb 25, 2025

Optimize 'Leaderboard - FIL Earned' query: add migration for participant_id and update query using ROW_NUMBER() for faster performance #324

Are you sure you want to change the base?

Optimize 'Leaderboard - FIL Earned' query: add migration for participant_id and update query using ROW_NUMBER() for faster performance #324

Conversation

Hany-Almnaem commented Feb 14, 2025

Performance Comparison

After Optimization:

bajtos left a comment

Choose a reason for hiding this comment

bajtos Feb 17, 2025

Choose a reason for hiding this comment

juliangruber Feb 17, 2025

Choose a reason for hiding this comment

Hany-Almnaem Feb 17, 2025

Choose a reason for hiding this comment

juliangruber Feb 17, 2025

Choose a reason for hiding this comment

juliangruber Feb 25, 2025

Choose a reason for hiding this comment