
perf: optimise SQL query in observeRetrievalResultCodes #316

Merged

merged 2 commits into main on Feb 12, 2025

Conversation

bajtos
Member

@bajtos bajtos commented Feb 11, 2025

Rework observeRetrievalResultCodes to execute a single SQL query to update all daily codes in one go.

Before this change, we would run ~1,000 individual queries, consume a lot of CPU, and trigger a CPU-throttling alert.

As part of this change, I am also adding a console log to tell us how many rows (tuples of `(day, code)`) we are updating in each loop iteration.
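The batched approach can be sketched roughly like this. Note the table and column names (`daily_retrieval_result_codes`, `day`, `code`, `rate`) and the helper name are illustrative assumptions, not the actual schema or implementation:

```javascript
// Sketch: collapse ~1k per-row upserts into a single query by passing
// the values as parallel arrays and expanding them with unnest().
// Names below are illustrative assumptions, not the real schema.

// Turn the InfluxDB rows into three parallel arrays for unnest().
function buildBatchParams (rows) {
  const days = []
  const codes = []
  const rates = []
  for (const { day, code, rate } of rows) {
    days.push(day)
    codes.push(code)
    rates.push(rate)
  }
  return [days, codes, rates]
}

// One round trip instead of rows.length round trips, e.g.:
//
// await pgClient.query(`
//   INSERT INTO daily_retrieval_result_codes (day, code, rate)
//   SELECT * FROM unnest($1::date[], $2::text[], $3::float[])
//   ON CONFLICT (day, code) DO UPDATE SET rate = EXCLUDED.rate
// `, buildBatchParams(rows))
```

The PG client then exchanges one statement and one parameter payload over the wire, instead of ~1k statements.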

@bajtos
Member Author

bajtos commented Feb 11, 2025

Cross-posting from https://space-meridian.slack.com/archives/C06S76341B8/p1739288083931459?thread_ts=1739273894.720459&cid=C06S76341B8

The CPU usage spike happened between 10:36:15-10:36:45 UTC. According to papertrail logs, Fly.io restarted the service at 10:36 UTC. Here are the logs from that time:

Feb 11 10:36:17Z  > start
Feb 11 10:36:17Z  > node bin/spark-observer.js
Feb 11 10:36:18Z  Selected JSON-RPC endpoint https://api.node.glif.io/rpc/v0
Feb 11 10:36:18Z  Querying scheduled rewards from impact evaluator
Feb 11 10:36:21Z  Block number too old, GLIF only provides last 2000 blocks, querying from -1900
Feb 11 10:36:21Z  Querying impact evaluator Transfer events after block 4696890
Feb 11 10:36:21Z  Scheduled rewards for 0x39656B13B4D7009C270E2CC6D6698A3C97Fc02D6 88143932255125704n
Feb 11 10:36:22Z  Loop "Retrieval result codes" took 3985ms
Feb 11 10:36:23Z  Found 0 Transfer events
Feb 11 10:36:23Z  Loop "Transfer events" took 4605ms
Feb 11 10:36:23Z  Scheduled rewards for 0xc57450fa57748a09b437A3E3116673f4C64e64Dd 6701134242610932n
Feb 11 10:36:23Z  Scheduled rewards for 0x6E5Afb67b5978b1F7e56358577C6bf93789fe572 81390879695710545n
Feb 11 10:36:23Z  Scheduled rewards for 0xCE53e6dEfbD1f0810BFcd691660ad311d0DB24f7 77135747466396775n

The log continues with hundreds of "Scheduled rewards for 0x..." messages.

[Screenshot 2025-02-11 at 16:37:33]

I suspect that when spark-observer starts and executes the first iteration of the "Retrieval result codes" or "Transfer events" loop, processing the data consumes too much CPU.

spark-observer is currently running on shared-cpu-4x ($7.78/month). The next step up is shared-cpu-8x ($15.55/month). The extra cost is negligible, but I am not confident that a beefier machine would prevent the CPU-throttling alert. If the VM upgrade does not help, we can always revert to the current VM size.

I checked the source code of the loops and see some optimisation opportunities, but I am not convinced it's worth our effort right now.

  • observeTransferEvents - we update `last_checked_block` only when a transfer happens (I think that's about once per week), which means that most of the time `last_checked_block` is 1-7 days old and we need to query 1900 blocks into the past. I believe this is only one JSON-RPC API call, and it apparently returns 0 events, so this loop should not be CPU intensive.
  • observeRetrievalResultCodes - it executes one SQL query for each tuple `(day, result_code)` found in the InfluxDB data. When I executed the query myself, I got 962 rows back. That means 962 SQL queries. This can be CPU intensive, as the PG client has to process a lot of data exchanged over the wire.
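For illustration, the block-range clamping implied by the log line "Block number too old, GLIF only provides last 2000 blocks, querying from -1900" could be sketched like this. The function name and signature are assumptions; only the 1900-block constant comes from the logs:

```javascript
// Sketch: pick the starting block for the Transfer-events query.
// The provider (GLIF) retains only a limited window of recent blocks,
// so we cannot query further back than ~1900 blocks. Names are
// illustrative, not the actual implementation.
const MAX_PAST_BLOCKS = 1900

function getQueryFromBlock (lastCheckedBlock, latestBlock) {
  // Resume right after the last block we processed...
  const wanted = lastCheckedBlock + 1
  // ...but never further back than the provider retains.
  const oldestAvailable = latestBlock - MAX_PAST_BLOCKS
  return Math.max(wanted, oldestAvailable)
}
```

Persisting `last_checked_block` on every iteration (not only when a transfer happens) would keep `wanted` inside the retained window most of the time.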

Contributor

@pyropy pyropy left a comment


Great job on optimizing this! 👏🏻

Member

@juliangruber juliangruber left a comment


😍

@bajtos bajtos enabled auto-merge (squash) February 12, 2025 12:29
@bajtos bajtos merged commit cbb3bf1 into main Feb 12, 2025
10 checks passed
@bajtos bajtos deleted the perf-update-result-codes branch February 12, 2025 12:30

sentry-io bot commented Feb 12, 2025

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ error: ON CONFLICT DO UPDATE command cannot affect row a second time (in observeRetrievalResultCodes(observer))
  • ‼️ Error: Connection terminated unexpectedly (in observeRetrievalResultCodes(observer))


@bajtos
Member Author

bajtos commented Feb 12, 2025

Unfortunately, this does not work.

  1. The SQL query fails, as Sentry pointed out above.
  2. The service startup is still triggering CPU throttling.
[Screenshot 2025-02-12 at 13:50:18]

I am going to revert this pull request.

bajtos added a commit that referenced this pull request Feb 12, 2025
The function `observeRetrievalResultCodes()` includes the following
statement in the InfluxDB query fetching `(day, code, rate)` data:

    |> aggregateWindow(every: 1d, fn: mean, createEmpty: false)

Such query produces a list of values like this:

```
2024-11-15T00:00:00Z,CONNECTION_REFUSED,0.0022313570194142725
2024-11-16T00:00:00Z,CONNECTION_REFUSED,0.002153071995819862
(...)
2025-02-12T00:00:00Z,CONNECTION_REFUSED,0.021266890041248942
2025-02-12T13:08:20.817239423Z,CONNECTION_REFUSED,0.02153170594662248
```

Notice there are two rows for today (2025-02-12). One row contains data from yesterday (the full day) and the other row contains partial data from today.

In this commit, I am fixing the query to correctly assign data points
from yesterday to yesterday's date:

      |> aggregateWindow(every: 1d, fn: mean, createEmpty: false, timeSrc: "_start")

The new query produces a list like this:

```
2024-11-14T00:00:00Z,CONNECTION_REFUSED,0.0022313570194142725
2024-11-15T00:00:00Z,CONNECTION_REFUSED,0.002153071995819862
(...)
2025-02-11T00:00:00Z,CONNECTION_REFUSED,0.021266890041248942
2025-02-12T00:00:00Z,CONNECTION_REFUSED,0.02153170594662248
```

This fixes the error introduced by cbb3bf1 (#316), where the SQL query fails with the following message:

```
ON CONFLICT DO UPDATE command cannot affect row a second time.
Ensure that no rows proposed for insertion within the same command have duplicate constrained values.
```

See also InfluxDB documentation for `aggregateWindow()`:
https://docs.influxdata.com/flux/v0/stdlib/universe/aggregatewindow/#timesrc

Signed-off-by: Miroslav Bajtoš <[email protected]>
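The failure mode described in the commit message above, namely two InfluxDB rows truncating to the same `(day, code)` key, can be checked for in a few lines before running a batched upsert. This is a minimal sketch; the function name and row shape are illustrative assumptions:

```javascript
// Sketch: detect rows that would map to the same (day, code) key.
// Postgres rejects a single INSERT ... ON CONFLICT DO UPDATE that
// tries to affect the same row twice, so duplicates like the two
// 2025-02-12 rows in the example output would make the query fail.
// Names and row shape are illustrative assumptions.
function findDuplicateKeys (rows) {
  const seen = new Set()
  const duplicates = new Set()
  for (const { time, code } of rows) {
    const day = time.slice(0, 10) // e.g. '2025-02-12'
    const key = `${day}/${code}`
    if (seen.has(key)) duplicates.add(key)
    seen.add(key)
  }
  return [...duplicates]
}
```

With the `timeSrc: "_start"` fix, each window is reported at its start date, so no two rows truncate to the same day and this check comes back empty.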
@bajtos bajtos mentioned this pull request Feb 12, 2025
@bajtos
Copy link
Member Author

bajtos commented Feb 12, 2025

Interesting! After a short investigation, I found the cause of the bug: I think we discovered a flaw in the InfluxDB query.

Instead of reverting this PR, I opened a new one to fix that flaw: #319

bajtos added a commit that referenced this pull request Feb 13, 2025
* fix: add `observeRetrievalResultCodes` to dry-run

Signed-off-by: Miroslav Bajtoš <[email protected]>

* fix: day returned by InfluxDB query

* fixup! set INFLUXDB_TOKEN for dry-run

Signed-off-by: Miroslav Bajtoš <[email protected]>

---------

Signed-off-by: Miroslav Bajtoš <[email protected]>
Co-authored-by: Srdjan <[email protected]>