
Optimize 'Leaderboard - FIL Earned' query: add migration for participant_id and update query using ROW_NUMBER() for faster performance #324

Open · wants to merge 2 commits into main

Conversation

@Hany-Almnaem (Author):

This PR optimizes the SQL query used for the "Leaderboard - FIL Earned" panel. The changes include:

  • Adding a new migration script (023.do.add-participant-id.sql) that introduces a participant_id column and corresponding indexes to replace text-based joins.
  • Updating the query to use ROW_NUMBER() to efficiently select the latest scheduled rewards per participant (see the sketch below).
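
For context, the per-participant selection has roughly this shape (a sketch only, based on the column names quoted later in the review; the actual query in this PR may differ in its column list and final ordering):

-- Sketch: keep only the most recent scheduled-rewards row per participant.
WITH ranked AS (
  SELECT participant_id,
         participant_address,
         scheduled_rewards,
         ROW_NUMBER() OVER (
           PARTITION BY participant_id
           ORDER BY day DESC
         ) AS rn
  FROM daily_scheduled_rewards
  WHERE day <= $2
)
SELECT participant_address, scheduled_rewards
FROM ranked
WHERE rn = 1
ORDER BY scheduled_rewards DESC;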

Performance Improvements:

  1. Execution time is reduced by ~41%.
  2. Sorting is ~40% faster and consumes ~48% less disk space.
  3. Aggregation is ~42% faster, using ~40% less memory.
  4. Joins are streamlined, avoiding unnecessary hashing, and leveraging ROW_NUMBER() for better efficiency.
  5. Query planning time is cut by ~92%.

Performance Comparison

Before optimization: [query plan screenshot "1Q"]

After optimization: [query plan screenshot "2Q"]

Testing:
I have tested these changes in my local environment with backfilled test data fetched from the API, and the results confirm a significant performance improvement.

Please review and let me know if any further changes are needed.

@bajtos (Member) left a comment:


Thank you for your contribution, @Hany-Almnaem! ❤️

Comment on lines +5 to +10
-- Backfill participant_id in daily_scheduled_rewards based on participant_address
WITH address_mapping AS (
  SELECT DISTINCT participant_address AS address,
         ROW_NUMBER() OVER (ORDER BY participant_address) AS generated_id
  FROM daily_scheduled_rewards
)

I am concerned about using ROW_NUMBER() as the generated id:

Is it guaranteed that we will always get the same generated id for the same participant address? For example, in the following query from your pull request, we are limiting which rows we look at:

        SELECT participant_id, scheduled_rewards, participant_address,
               ROW_NUMBER() OVER (PARTITION BY participant_id ORDER BY day DESC) AS rn
        FROM daily_scheduled_rewards
        WHERE day <= $2

In this particular case, since the WHERE condition removes recent deals, I think the query should work. However, if we need to write a different query in the future, one that will look at recent rows only (e.g. WHERE day >= $2) - would we still get the same participant id?
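To make the concern concrete, here is a hypothetical illustration (addresses and filters are made up): if a filter happens to exclude every row for the alphabetically first address, every remaining address shifts up by one.

-- Hypothetical: suppose the table contains addresses f1aaa, f1bbb, f1ccc.
-- Numbering the distinct addresses over the full table gives
--   f1aaa → 1, f1bbb → 2, f1ccc → 3.
SELECT address,
       ROW_NUMBER() OVER (ORDER BY address) AS generated_id
FROM (SELECT DISTINCT participant_address AS address
      FROM daily_scheduled_rewards) AS a;

-- A query that only looks at recent rows might not see f1aaa at all,
-- so the same expression would yield f1bbb → 1, f1ccc → 2 instead.
SELECT address,
       ROW_NUMBER() OVER (ORDER BY address) AS generated_id
FROM (SELECT DISTINCT participant_address AS address
      FROM daily_scheduled_rewards
      WHERE day >= $2) AS a;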

What is the performance cost of running ROW_NUMBER() OVER (ORDER BY participant_address)?

Have you considered creating a new table mapping participant addresses to IDs and introducing a foreign key constraint/relation?

We have had a good experience with such schema design in spark-evaluate, see e.g. here:

https://github.com/CheckerNetwork/spark-evaluate/blob/733c5b974426027f7b0d100f2f7bf427ee1a1868/migrations/003.do.daily-participants.sql#L1-L4
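
For illustration, such a mapping table could look roughly like this (table and column names are illustrative, not necessarily what spark-evaluate uses):

-- Illustrative sketch of a dedicated address-to-id mapping.
CREATE TABLE participants (
  id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  participant_address TEXT NOT NULL UNIQUE
);

-- Fact tables then reference the stable id instead of joining on text.
ALTER TABLE daily_scheduled_rewards
  ADD COLUMN participant_id BIGINT REFERENCES participants (id);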

My knowledge of advanced PostgreSQL is limited. It's possible your solution is correct and I just don't understand enough details to see that.

Let's discuss!

@juliangruber do you have any thoughts on this?

@juliangruber (Member):

When I was giving this a quick look, I was also concerned about using row number here. But I was not sure whether I understood enough about the situation to leave a comment about it.

My understanding is that this is a one-time migration, so we don't need to worry about the query behaving consistently - as long as it takes care of assigning unique ids once, and as long as any future rows will have ids not conflicting with these ones.

Would https://stackoverflow.com/a/78355281 be safer?

@Hany-Almnaem (Author):

In the previous solution, I aimed to minimize schema changes. However, if future queries rely on filtering conditions like WHERE day >= $2, using ROW_NUMBER() becomes risky because it assigns IDs dynamically based on the addresses it sees. Those IDs could shift when the dataset changes, for example when a filtering condition keeps only recent records, which isn't great for reliability.

I think the more robust solution, as you suggest, is to create a separate mapping table that assigns stable, efficiently retrievable IDs to participant_address values and to introduce a foreign key constraint to enforce consistency.

This approach means updating both the daily_scheduled_rewards and daily_reward_transfers tables to work with these foreign keys, but it's worth it.

From a performance angle, there's a clear win here too: ROW_NUMBER() has to sort everything on each run (O(n log n) in the worst case), while looking up IDs in a properly indexed mapping table is effectively O(1).
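
Building on the mapping table sketched above, the backfill could look roughly like this (a sketch of what I have in mind, not the final migration script):

-- Sketch of the backfill, assuming the participants mapping table above exists.
INSERT INTO participants (participant_address)
SELECT DISTINCT participant_address
FROM daily_scheduled_rewards
ON CONFLICT (participant_address) DO NOTHING;

-- Point daily_scheduled_rewards rows at the stable id.
UPDATE daily_scheduled_rewards dsr
SET participant_id = p.id
FROM participants p
WHERE p.participant_address = dsr.participant_address;

-- Index the new column so per-participant lookups are a single index probe.
CREATE INDEX IF NOT EXISTS daily_scheduled_rewards_participant_id_day
  ON daily_scheduled_rewards (participant_id, day DESC);

-- The same backfill would be repeated for daily_reward_transfers.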

Happy to walk through the migration steps if you want to give it a shot!
@juliangruber, @bajtos

@juliangruber (Member):

The proposed solution sounds great to me, but let's wait for a confirmation from @bajtos

@juliangruber (Member):

ping @bajtos
