perf: optimise SQL query in observeRetrievalResultCodes #316
Conversation
Rework `observeRetrievalResultCodes` to execute a single SQL query that updates all daily codes in one go. Before this change, we would run ~1k individual queries, consume a lot of CPU, and trigger a CPU throttling alert. As part of this change, I am also adding a console log to tell us how many rows, i.e. (day, code) tuples, we are updating in each loop iteration.

Signed-off-by: Miroslav Bajtoš <[email protected]>
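The batched upsert described above can be sketched roughly as follows, assuming a node-postgres style client. The table and column names (`daily_retrieval_result_codes`, `day`, `code`, `rate`) are illustrative, not taken from the repository:

```javascript
// Sketch: replace ~1k per-row upserts with one parameterized query
// that unnests three parallel arrays into rows.
// Table/column names below are hypothetical.
function buildBatchedUpsert (rows) {
  // rows: Array<{ day: string, code: string, rate: number }>
  const days = rows.map(r => r.day)
  const codes = rows.map(r => r.code)
  const rates = rows.map(r => r.rate)
  const sql = `
    INSERT INTO daily_retrieval_result_codes (day, code, rate)
    SELECT * FROM unnest($1::date[], $2::text[], $3::float[])
    ON CONFLICT (day, code) DO UPDATE SET rate = EXCLUDED.rate
  `
  // The PR also adds a log line reporting how many (day, code)
  // tuples are updated per loop iteration:
  console.log('Updating %s (day, code) rows', rows.length)
  return { sql, values: [days, codes, rates] }
}
```

With a `pg` client, this would be executed as one round trip, e.g. `await client.query(sql, values)`, instead of one query per (day, code) tuple.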
Cross-posting from https://space-meridian.slack.com/archives/C06S76341B8/p1739288083931459?thread_ts=1739273894.720459&cid=C06S76341B8

The CPU usage spike happened between 10:36:15 and 10:36:45 UTC. According to Papertrail logs, Fly.io restarted the service at 10:36 UTC. Here are the logs from that time:
The log continues with hundreds of "Scheduled rewards for 0x..." messages.

I suspect that when spark-observer starts and executes the first iteration of the "Retrieval result codes" loop or the "Transfer events" loop, too much CPU is needed to process the data.

spark-observer is currently running on shared-cpu-4x ($7.78/month). The next step up is shared-cpu-8x ($15.55/month). The extra cost is negligible, but I am not confident that a beefier machine would prevent the CPU throttling alert. If the VM upgrade does not prevent the alert, we can also stay on the current VM size.

I checked the source code of the loops and see some optimisation opportunities, but I am not convinced it's worth our effort right now.
Great job on optimizing this! 👏🏻
😍
Suspect Issues

This pull request was deployed and Sentry observed the following issues:
The function `observeRetrievalResultCodes()` includes the following statement in the InfluxDB query fetching `(day, code, rate)` data:

```
|> aggregateWindow(every: 1d, fn: mean, createEmpty: false)
```

Such a query produces a list of values like this:

```
2024-11-15T00:00:00Z,CONNECTION_REFUSED,0.0022313570194142725
2024-11-16T00:00:00Z,CONNECTION_REFUSED,0.002153071995819862
(...)
2025-02-12T00:00:00Z,CONNECTION_REFUSED,0.021266890041248942
2025-02-12T13:08:20.817239423Z,CONNECTION_REFUSED,0.02153170594662248
```

Notice there are two rows for today (2025-02-12). One row contains data from yesterday (a full day) and the other contains partial data from today. In this commit, I am fixing the query to correctly assign data points from yesterday to yesterday's date:

```
|> aggregateWindow(every: 1d, fn: mean, createEmpty: false, timeSrc: "_start")
```

The new query produces a list like this:

```
2024-11-14T00:00:00Z,CONNECTION_REFUSED,0.0022313570194142725
2024-11-15T00:00:00Z,CONNECTION_REFUSED,0.002153071995819862
(...)
2025-02-11T00:00:00Z,CONNECTION_REFUSED,0.021266890041248942
2025-02-12T00:00:00Z,CONNECTION_REFUSED,0.02153170594662248
```

This fixes the error introduced by cbb3bf1 (#316), where the SQL query fails with the following message:

```
ON CONFLICT DO UPDATE command cannot affect row a second time. Ensure that no rows proposed for insertion within the same command have duplicate constrained values.
```

See also the InfluxDB documentation for `aggregateWindow()`: https://docs.influxdata.com/flux/v0/stdlib/universe/aggregatewindow/#timesrc

Signed-off-by: Miroslav Bajtoš <[email protected]>
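The connection between the two-rows-for-today output and the `ON CONFLICT DO UPDATE` failure can be illustrated with a small sketch: once timestamps are truncated to calendar days, both of the 2025-02-12 rows collapse onto the same (day, code) key, which is exactly what the batched upsert cannot handle. The helper below is illustrative, not code from the repository:

```javascript
// Truncate an ISO timestamp to its calendar day (YYYY-MM-DD).
const toDay = (isoTimestamp) => isoTimestamp.slice(0, 10)

// Return the (day, code) keys that appear more than once in the input.
// Any duplicate here would trigger PostgreSQL's
// "ON CONFLICT DO UPDATE command cannot affect row a second time" error.
function findDuplicateKeys (rows) {
  // rows: Array<{ time: string, code: string }>
  const seen = new Set()
  const duplicates = []
  for (const { time, code } of rows) {
    const key = `${toDay(time)}/${code}`
    if (seen.has(key)) duplicates.push(key)
    seen.add(key)
  }
  return duplicates
}
```

With the old query's output, the rows `2025-02-12T00:00:00Z` and `2025-02-12T13:08:20.817239423Z` (both `CONNECTION_REFUSED`) map to the duplicate key `2025-02-12/CONNECTION_REFUSED`; with the `timeSrc: "_start"` fix, every row lands on a distinct day and the set of duplicates is empty.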
Interesting! After a short investigation, I found the cause of the bug. I think we discovered a flaw in the InfluxDB query. Instead of reverting this PR, I opened a new one to fix that flaw: #319
* fix: add `observeRetrievalResultCodes` to dry-run
* fix: day returned by InfluxDB query (use `timeSrc: "_start"` in `aggregateWindow()` to assign data points from yesterday to yesterday's date; fixes the `ON CONFLICT DO UPDATE` error introduced by cbb3bf1 in #316)
* fixup! set INFLUXDB_TOKEN for dry-run

Signed-off-by: Miroslav Bajtoš <[email protected]>
Co-authored-by: Srdjan <[email protected]>