
perf: optimise SQL query in observeRetrievalResultCodes #316

Merged

merged 2 commits into main on Feb 12, 2025

Conversation

bajtos
Member

@bajtos bajtos commented Feb 11, 2025

Rework observeRetrievalResultCodes to execute a single SQL query to update all daily codes in one go.

Before this change, we would run ~1,000 individual queries, consume a lot of CPU, and trigger a CPU-throttling alert.

As part of this change, I am also adding a console log to tell us how many rows (tuples of `(day, code)`) we are updating in each loop iteration.
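The batched approach can be sketched roughly like this. Note the table and column names (`daily_retrieval_result_codes`, `day`, `code`, `rate`) and the helper name are illustrative assumptions, not the actual schema or implementation:

```javascript
// Sketch: collapse ~1k per-row upserts into a single query by passing
// the values as parallel arrays and expanding them with unnest().
// Names below are illustrative assumptions, not the real schema.

// Turn the InfluxDB rows into three parallel arrays for unnest().
function buildBatchParams (rows) {
  const days = []
  const codes = []
  const rates = []
  for (const { day, code, rate } of rows) {
    days.push(day)
    codes.push(code)
    rates.push(rate)
  }
  return [days, codes, rates]
}

// One round trip instead of rows.length round trips, e.g.:
//
// await pgClient.query(`
//   INSERT INTO daily_retrieval_result_codes (day, code, rate)
//   SELECT * FROM unnest($1::date[], $2::text[], $3::float[])
//   ON CONFLICT (day, code) DO UPDATE SET rate = EXCLUDED.rate
// `, buildBatchParams(rows))
```

The PG client then exchanges one statement and one parameter payload over the wire, instead of ~1k statements.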

@bajtos
Member Author

bajtos commented Feb 11, 2025

Cross-posting from https://space-meridian.slack.com/archives/C06S76341B8/p1739288083931459?thread_ts=1739273894.720459&cid=C06S76341B8

The CPU usage spike happened between 10:36:15-10:36:45 UTC. According to papertrail logs, Fly.io restarted the service at 10:36 UTC. Here are the logs from that time:

Feb 11 10:36:17Z  > start
Feb 11 10:36:17Z  > node bin/spark-observer.js
Feb 11 10:36:18Z  Selected JSON-RPC endpoint https://api.node.glif.io/rpc/v0
Feb 11 10:36:18Z  Querying scheduled rewards from impact evaluator
Feb 11 10:36:21Z  Block number too old, GLIF only provides last 2000 blocks, querying from -1900
Feb 11 10:36:21Z  Querying impact evaluator Transfer events after block 4696890
Feb 11 10:36:21Z  Scheduled rewards for 0x39656B13B4D7009C270E2CC6D6698A3C97Fc02D6 88143932255125704n
Feb 11 10:36:22Z  Loop "Retrieval result codes" took 3985ms
Feb 11 10:36:23Z  Found 0 Transfer events
Feb 11 10:36:23Z  Loop "Transfer events" took 4605ms
Feb 11 10:36:23Z  Scheduled rewards for 0xc57450fa57748a09b437A3E3116673f4C64e64Dd 6701134242610932n
Feb 11 10:36:23Z  Scheduled rewards for 0x6E5Afb67b5978b1F7e56358577C6bf93789fe572 81390879695710545n
Feb 11 10:36:23Z  Scheduled rewards for 0xCE53e6dEfbD1f0810BFcd691660ad311d0DB24f7 77135747466396775n

The log continues with hundreds of "Scheduled rewards for 0x..." messages.

[Screenshot 2025-02-11 at 16:37:33]

I suspect that when spark-observer starts and executes the first iteration of the "Retrieval result codes" or "Transfer events" loop, processing the data consumes too much CPU.

spark-observer is currently running on shared-cpu-4x ($7.78/month). The next step up is shared-cpu-8x ($15.55/month). The extra cost is negligible, but I am not confident that a beefier machine would prevent the CPU-throttling alert. If the VM upgrade does not help, we can always revert to the current VM size.

I checked the source code of the loops and see some optimisation opportunities, but I am not convinced it's worth our effort right now.

  • observeTransferEvents - we update `last_checked_block` only when a transfer happens (I think that's about once per week), which means that most of the time `last_checked_block` is 1-7 days old and we need to query 1900 blocks into the past. I believe this is only one JSON-RPC API call, and it apparently returns 0 events, so this loop should not be CPU intensive.
  • observeRetrievalResultCodes - it executes one SQL query for each tuple `(day, result_code)` found in the InfluxDB data. When I executed the query myself, I got 962 rows back. That means 962 SQL queries. This can be CPU intensive, as the PG client has to process a lot of data exchanged over the wire.
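For illustration, the block-range clamping implied by the log line "Block number too old, GLIF only provides last 2000 blocks, querying from -1900" could be sketched like this. The function name and signature are assumptions; only the 1900-block constant comes from the logs:

```javascript
// Sketch: pick the starting block for the Transfer-events query.
// The provider (GLIF) retains only a limited window of recent blocks,
// so we cannot query further back than ~1900 blocks. Names are
// illustrative, not the actual implementation.
const MAX_PAST_BLOCKS = 1900

function getQueryFromBlock (lastCheckedBlock, latestBlock) {
  // Resume right after the last block we processed...
  const wanted = lastCheckedBlock + 1
  // ...but never further back than the provider retains.
  const oldestAvailable = latestBlock - MAX_PAST_BLOCKS
  return Math.max(wanted, oldestAvailable)
}
```

Persisting `last_checked_block` on every iteration (not only when a transfer happens) would keep `wanted` inside the retained window most of the time.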

Contributor

@pyropy pyropy left a comment


Great job on optimizing this! 👏🏻

Member

@juliangruber juliangruber left a comment


😍

@bajtos bajtos enabled auto-merge (squash) February 12, 2025 12:29
@bajtos bajtos merged commit cbb3bf1 into main Feb 12, 2025
10 checks passed
@bajtos bajtos deleted the perf-update-result-codes branch February 12, 2025 12:30

sentry-io bot commented Feb 12, 2025

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ error: ON CONFLICT DO UPDATE command cannot affect row a second time (in observeRetrievalResultCodes(observer))
  • ‼️ Error: Connection terminated unexpectedly (in observeRetrievalResultCodes(observer))


@bajtos
Member Author

bajtos commented Feb 12, 2025

Unfortunately, this does not work.

  1. The SQL query fails, as Sentry pointed out above.
  2. The service startup is still triggering CPU throttling.
[Screenshot 2025-02-12 at 13:50:18]

I am going to revert this pull request.

bajtos added a commit that referenced this pull request Feb 12, 2025
The function `observeRetrievalResultCodes()` includes the following
statement in the InfluxDB query fetching `(day, code, rate)` data:

    |> aggregateWindow(every: 1d, fn: mean, createEmpty: false)

Such query produces a list of values like this:

```
2024-11-15T00:00:00Z,CONNECTION_REFUSED,0.0022313570194142725
2024-11-16T00:00:00Z,CONNECTION_REFUSED,0.002153071995819862
(...)
2025-02-12T00:00:00Z,CONNECTION_REFUSED,0.021266890041248942
2025-02-12T13:08:20.817239423Z,CONNECTION_REFUSED,0.02153170594662248
```

Notice there are two rows for today (2025-02-12). One row contains data from yesterday (the full day) and the other row contains partial data from today.

In this commit, I am fixing the query to correctly assign data points
from yesterday to yesterday's date:

      |> aggregateWindow(every: 1d, fn: mean, createEmpty: false, timeSrc: "_start")

The new query produces a list like this:

```
2024-11-14T00:00:00Z,CONNECTION_REFUSED,0.0022313570194142725
2024-11-15T00:00:00Z,CONNECTION_REFUSED,0.002153071995819862
(...)
2025-02-11T00:00:00Z,CONNECTION_REFUSED,0.021266890041248942
2025-02-12T00:00:00Z,CONNECTION_REFUSED,0.02153170594662248
```

This fixes the error introduced by cbb3bf1 (#316), where the SQL query fails with the following message:

```
ON CONFLICT DO UPDATE command cannot affect row a second time.
Ensure that no rows proposed for insertion within the same command have duplicate constrained values.
```

See also InfluxDB documentation for `aggregateWindow()`:
https://docs.influxdata.com/flux/v0/stdlib/universe/aggregatewindow/#timesrc

Signed-off-by: Miroslav Bajtoš <[email protected]>
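The failure mode described in the commit message above, namely two InfluxDB rows truncating to the same `(day, code)` key, can be checked for in a few lines before running a batched upsert. This is a minimal sketch; the function name and row shape are illustrative assumptions:

```javascript
// Sketch: detect rows that would map to the same (day, code) key.
// Postgres rejects a single INSERT ... ON CONFLICT DO UPDATE that
// tries to affect the same row twice, so duplicates like the two
// 2025-02-12 rows in the example output would make the query fail.
// Names and row shape are illustrative assumptions.
function findDuplicateKeys (rows) {
  const seen = new Set()
  const duplicates = new Set()
  for (const { time, code } of rows) {
    const day = time.slice(0, 10) // e.g. '2025-02-12'
    const key = `${day}/${code}`
    if (seen.has(key)) duplicates.add(key)
    seen.add(key)
  }
  return [...duplicates]
}
```

With the `timeSrc: "_start"` fix, each window is reported at its start date, so no two rows truncate to the same day and this check comes back empty.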
@bajtos bajtos mentioned this pull request Feb 12, 2025
@bajtos
Copy link
Member Author

bajtos commented Feb 12, 2025

Interesting! After a short investigation, I found the cause of the bug: I think we discovered a flaw in the InfluxDB query.

Instead of reverting this PR, I opened a new one to fix that flaw: #319

bajtos added a commit that referenced this pull request Feb 13, 2025
* fix: add `observeRetrievalResultCodes` to dry-run

Signed-off-by: Miroslav Bajtoš <[email protected]>

* fix: day returned by InfluxDB query

* fixup! set INFLUXDB_TOKEN for dry-run

Signed-off-by: Miroslav Bajtoš <[email protected]>

---------

Signed-off-by: Miroslav Bajtoš <[email protected]>
Co-authored-by: Srdjan <[email protected]>