-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize 'Leaderboard - FIL Earned' query: add migration for participant_id and update query using ROW_NUMBER() for faster performance #324
base: main
Are you sure you want to change the base?
Conversation
…ant_id and update query using ROW_NUMBER() for faster performance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution, @Hany-Almnaem! ❤️
-- Backfill participant_id in daily_scheduled_rewards based on participant_address | ||
WITH address_mapping AS ( | ||
SELECT DISTINCT participant_address AS address, | ||
ROW_NUMBER() OVER (ORDER BY participant_address) AS generated_id | ||
FROM daily_scheduled_rewards | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am concerned about using ROW_NUMBER()
as the generated id:
Is it guaranteed that we will always get the same generated id for the same participant address? For example, in the following query from your pull request, were are limiting which rows we look at:
SELECT participant_id, scheduled_rewards, participant_address,
ROW_NUMBER() OVER (PARTITION BY participant_id ORDER BY day DESC) AS rn
FROM daily_scheduled_rewards
WHERE day <= $2
In this particular case, since the WHERE
condition removes recent deals, I think the query should work. However, if we need to write a different query in the future, one that will look at recent rows only (e.g. WHERE day >= $2
) - would we still get the same participant id?
What is the performance cost of running ROW_NUMBER() OVER (ORDER BY participant_address)
?
Have you considered creating a new table mapping participant addresses to IDs and introducing a foreign key constraint/relation?
We have had a good experience with such schema design in spark-evaluate, see e.g. here:
My knowledge of advanced PostgreSQL is limited. It's possible your solution is correct and I just don't understand enough details to see that.
Let's discuss!
@juliangruber do you have any thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I was giving this a quick look, I was also concerned about using row number here. But I was not sure whether I understood enough about the situation to leave a comment about it.
My understanding is that this is a one-time migration, so we don't need to worry about the query behaving consistently - as long as it takes care of assigning unique ids once, and as long as any future rows will have ids not conflicting with these ones.
Would https://stackoverflow.com/a/78355281 be safer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the previous solution, I aimed to minimize schema changes. However, if future queries rely on filtering conditions like WHERE day >= $2, using ROW_NUMBER() becomes risky because it dynamically assigns IDs based on available addresses. those IDs could shift around if the dataset changes in some cases like the filtering condition focuses on recent records only, which isn't great for reliability.
I think a more robust solution as you suggest is to create a separate mapping table that assigns stable and efficiently retrievable IDs to participant_address values and introduce a foreign key constraint to enforce consistency.
This approach would mean updating both daily_scheduled_rewards and daily_reward_transfers tables to work with these foreign keys, but it's worth it.
From a performance angle, there's a clear win here too. ROW_NUMBER() has to sort everything each time (O(n log n) worst-case scenarios), while looking up IDs in a mapping table is faster with proper indexing (O(1)).
Happy to walk through the migration steps if you want to give it a shot!
@juliangruber, @bajtos
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposed solution sounds great to me, but let's wait for a confirmation from @bajtos
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping @bajtos
This PR optimizes the SQL query used for the "Leaderboard - FIL Earned" panel. The changes include:
Performance Improvements:
Performance Comparison
After Optimization:
Testing:
I have tested these changes in my local environment with backfilled test data fetch from the API, and the results confirm a significant performance improvement.
Please review and let me know if any further changes are needed.