Solver participation guard #3257

squadgazzz · 2025-01-29T15:05:13Z

Description

From the original issue:

When a solver repeatedly wins consecutive auctions but fails to settle its solutions on-chain, it can lead to system downtime. To prevent this, the autopilot must have the capability to temporarily exclude such solvers from participating in competitions. This ensures no single solver can disrupt the system's operations.

This PR implements it by introducing a new struct, which checks whether the solver is allowed to participate in the next competition by using two different approaches:

- Moved the existing Authenticator's is_solver on-chain call into the new struct.
- Introduced a new strategy that finds non-settling solvers using a SQL query. It selects the last 3 auctions (configurable) with a deadline up to the current block, to avoid selecting pending settlements, and checks whether all of them were won by the same solver or solvers (in the case of multiple winners) without being settled. This strategy caches the results to avoid redundant DB queries. The query relies on the auction_id column of the settlements table, which is updated separately by the Observer struct, so the cache is updated only once the Observer has a result.
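The selection rule of the DB-based strategy can be sketched in memory as follows. This is only an illustration, not the SQL query from this PR: `AuctionRecord` and its fields are hypothetical, and treating a deadline equal to the current block as "finished" is an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory view of one auction (not the real DB schema).
#[derive(Debug, Clone)]
pub struct AuctionRecord {
    pub id: u64,
    /// Last block at which a winning solution may still settle.
    pub deadline_block: u64,
    /// Solvers that won this auction (several in case of multiple winners).
    pub winners: Vec<String>,
    /// Winners whose settlement was observed on-chain.
    pub settled_by: Vec<String>,
}

/// Returns the solvers that won every one of the last `last_n` finished
/// auctions and settled none of them.
pub fn find_non_settling_solvers(
    auctions: &[AuctionRecord],
    last_n: usize,
    current_block: u64,
) -> HashSet<String> {
    // Skip auctions that may still have a pending settlement.
    let mut finished: Vec<&AuctionRecord> = auctions
        .iter()
        .filter(|a| a.deadline_block <= current_block)
        .collect();
    // Newest auctions first, then keep only the last `last_n`.
    finished.sort_by_key(|a| std::cmp::Reverse(a.id));
    finished.truncate(last_n);

    // Not enough history to judge anyone.
    if last_n == 0 || finished.len() < last_n {
        return HashSet::new();
    }

    // Start from the winners of the newest auction and keep only those that
    // also won, and failed to settle, every other selected auction.
    let mut candidates: HashSet<String> = finished[0].winners.iter().cloned().collect();
    for auction in &finished {
        candidates.retain(|s| auction.winners.contains(s) && !auction.settled_by.contains(s));
    }
    candidates
}
```

A solver that settled even one of the selected auctions drops out of the result, which matches the "consecutive failures" intent of the original issue.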

These validators are called sequentially to avoid redundant RPC calls to Authenticator: the DB-based validator's cache is checked first, and the RPC call is sent only afterwards.

Once one of the strategies reports that the solver is not allowed to participate, the solver is deny-listed for 5 minutes (configurable).

Each validator can be enabled/disabled separately in case of any issue.
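A minimal, synchronous sketch of the sequential validators with a time-limited deny-list is shown below. The names `Validator` and `ParticipationGuard`, the closure-based validators, and passing `now` explicitly are assumptions made for illustration; the real guard in this PR is async and structured differently.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// One participation check (hypothetical trait, synchronous for simplicity).
pub trait Validator {
    fn is_allowed(&self, solver: &str) -> bool;
}

/// Lets plain closures act as validators in this sketch.
impl<F> Validator for F
where
    F: Fn(&str) -> bool,
{
    fn is_allowed(&self, solver: &str) -> bool {
        self(solver)
    }
}

pub struct ParticipationGuard {
    /// Checked in order, cheapest first (e.g. DB cache before the RPC call).
    validators: Vec<Box<dyn Validator>>,
    /// Solver -> instant at which its ban expires.
    banned_until: HashMap<String, Instant>,
    ban_ttl: Duration,
}

impl ParticipationGuard {
    pub fn new(validators: Vec<Box<dyn Validator>>, ban_ttl: Duration) -> Self {
        Self { validators, banned_until: HashMap::new(), ban_ttl }
    }

    /// `now` is passed in so that the expiry logic is deterministic.
    pub fn can_participate(&mut self, solver: &str, now: Instant) -> bool {
        if let Some(&expiry) = self.banned_until.get(solver) {
            if now < expiry {
                return false; // still deny-listed, no validator is consulted
            }
            self.banned_until.remove(solver);
        }
        // `all` short-circuits, so later (more expensive) validators are
        // skipped as soon as one of them rejects the solver.
        let allowed = self.validators.iter().all(|v| v.is_allowed(solver));
        if !allowed {
            self.banned_until.insert(solver.to_string(), now + self.ban_ttl);
        }
        allowed
    }
}
```

Disabling a validator then amounts to leaving it out of the list passed to `new`.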

Metrics

Added a metric that is populated by the DB-based validator once a solver is marked as banned. The idea is to create an alert that fires if there are more than 4 such occurrences within the last 30 minutes for the same solver, in which case disabling the solver should be considered.
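The alert condition itself would normally live in the monitoring stack rather than in the autopilot. Purely to make the threshold concrete, here is a sliding-window sketch of "more than 4 bans in 30 minutes for the same solver"; `BanAlert` and its methods are hypothetical names.

```rust
use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

/// Hypothetical sliding-window counter of bans per solver.
pub struct BanAlert {
    window: Duration,
    threshold: usize,
    events: HashMap<String, VecDeque<Instant>>,
}

impl BanAlert {
    pub fn new(window: Duration, threshold: usize) -> Self {
        Self { window, threshold, events: HashMap::new() }
    }

    /// Records one ban and returns true when the solver has been banned more
    /// than `threshold` times within `window`.
    pub fn record_ban(&mut self, solver: &str, now: Instant) -> bool {
        let events = self.events.entry(solver.to_string()).or_default();
        events.push_back(now);
        // Drop bans that fell out of the window.
        while let Some(&oldest) = events.front() {
            if now.duration_since(oldest) > self.window {
                events.pop_front();
            } else {
                break;
            }
        }
        events.len() > self.threshold
    }
}
```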

Open discussions

- Since the current SQL query filters out auctions whose deadline has not been reached, the following case is possible: the solver gets banned while it still has a pending settlement. If that settlement then lands on-chain, the solver remains banned. While this is a niche case, it would be better to unblock the solver before the cache TTL expires. This has not been implemented in the current PR since it requires some refactoring of the Observer struct. If this is approved, it can be implemented quickly.
- Whether it makes sense to introduce a metrics-based strategy, similar to the bad token detector's, where the solver gets banned if more than 95% (or similar) of its settlements fail.

How to test

A new SQL query test. Existing e2e tests.

Related Issues

Fixes #3221

Summary by CodeRabbit

New Features
- Introduced advanced solver participation controls with configurable eligibility checks, integrating both on-chain and database validations.
- Enabled asynchronous real-time notifications for settlement updates, enhancing system responsiveness.
- Added metrics tracking to monitor auction participation and performance.
Chores
- Updated internal dependencies and restructured driver configuration.
- Reorganized the database schema to support improved auction and settlement processing.

crates/autopilot/src/database/competition.rs

crates/autopilot/src/domain/competition/solver_participation_guard.rs

crates/autopilot/src/run_loop.rs

crates/autopilot/src/arguments.rs

squadgazzz · 2025-02-07T17:25:57Z

Any issue with this approach?

If I got it correctly, you mean some kind of FIFO cache. There is still the settlements data that needs to be fetched from the DB. I thought about this initially, but it adds more complexity to the already non-trivial solution. With the current approach, all the data is received from one source.

sunce86 · 2025-02-10T08:57:46Z

There is still the settlements data that needs to be fetched from the DB.

Maybe Validator can be signalled from two sources:

Inform Validator when each competition/auction is saved.
Inform Validator when each settlement is observed onchain.

Then combine that data internally in Validator to match each settlement to each competition/auction and deduce which competitions ended up without a settlement (using the auction deadline block).
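A rough sketch of this two-source idea follows: one call when a competition is saved, one when a settlement is observed, and a deadline-based sweep to find what never settled. `SettlementTracker` and all of its method names are hypothetical and exist only to illustrate the proposal.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical tracker fed from two sources: saved competitions and
/// settlements observed on-chain.
#[derive(Default)]
pub struct SettlementTracker {
    /// auction id -> (deadline block, winners that have not settled yet)
    pending: BTreeMap<u64, (u64, HashSet<String>)>,
}

impl SettlementTracker {
    /// Source 1: a competition/auction was saved.
    pub fn on_competition_saved(&mut self, auction_id: u64, deadline_block: u64, winners: &[&str]) {
        let winners = winners.iter().map(|s| s.to_string()).collect();
        self.pending.insert(auction_id, (deadline_block, winners));
    }

    /// Source 2: a settlement for the auction was observed on-chain.
    pub fn on_settlement_observed(&mut self, auction_id: u64, solver: &str) {
        if let Some((_, winners)) = self.pending.get_mut(&auction_id) {
            winners.remove(solver);
        }
    }

    /// Removes auctions whose deadline has passed and returns the
    /// (auction id, solver) pairs that never settled.
    pub fn drain_expired(&mut self, current_block: u64) -> Vec<(u64, String)> {
        let expired: Vec<u64> = self
            .pending
            .iter()
            .filter(|(_, (deadline, _))| *deadline < current_block)
            .map(|(id, _)| *id)
            .collect();
        let mut unsettled = Vec::new();
        for id in expired {
            if let Some((_, winners)) = self.pending.remove(&id) {
                let mut winners: Vec<String> = winners.into_iter().collect();
                winners.sort(); // deterministic output order
                unsettled.extend(winners.into_iter().map(|s| (id, s)));
            }
        }
        unsettled
    }
}
```

Draining expired entries also bounds the memory used, but the state is lost on restart, so it would have to be rebuilt from the DB or accumulated again.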

squadgazzz · 2025-02-10T19:52:47Z

Then combine that data internally in Validator to match each settlement to each competition/auction and deduce which competitions ended up without a settlement (using the auction deadline block).

Why I initially didn't go with this approach (mostly because of the second point):

- At first glance, it seems more complex than SQL queries: retrieve two different types of data from different sources, update the data accordingly, and maintain a reasonable cache size.
- On each restart, the statistics-based validators would have to either execute an SQL query to populate the initial data or wait for N auctions to accumulate enough data before they can start blocking solvers.

@sunce86 Does it make sense? I am not against implementing a more complex and probably more efficient approach, but the only benefit I see is the reduction of DB queries.

sunce86

Mostly nits.

Do you plan to write an e2e test for this? Otherwise I'm afraid we won't have high conviction that it's working properly, and we would be hoping for the best in prod 🙏

crates/autopilot/src/domain/competition/participation_guard/db.rs

crates/autopilot/src/domain/competition/participation_guard/mod.rs

squadgazzz · 2025-02-12T14:24:01Z

Do you plan to write an e2e test for this?

Yep, I will open another PR since 600+ lines of code is already too much.

sunce86

LG.

This new logic will always return at least one non-settling solver right? The one that is currently being settled. Not sure what to do with this information but you might want to skip logging/alerting for that special one.

crates/autopilot/src/domain/competition/participation_guard/db.rs

squadgazzz · 2025-02-13T20:38:00Z

This new logic will always return at least one non-settling solver right? The one that is currently being settled.

I don't see why that would be the case. The currently settling auction gets filtered out in the query because of this: https://github.com/cowprotocol/services/pull/3257/files#diff-ecc7354b24bcc39d93bfb90181abe577203cc25d8f94c9886b2f5f3f1b7894d5R112

# Conflicts: # crates/database/src/lib.rs

squadgazzz · 2025-02-17T09:51:26Z

Thinking more about this approach, it currently relies a lot on the RPC node connection since the settlements table gets updated only once a tx is observed onchain. This might lead to false-positive bans. Maybe it makes sense to switch to settlement error metrics, such as only counting reverts/simulation reverts and ignoring expiration events completely.

MartinquaXD

Running a DB query on every auction makes the code easier but I'm a bit worried about the performance. Do we already have the right queries to make this fast?

MartinquaXD · 2025-02-17T08:05:24Z

crates/autopilot/src/arguments.rs

+                    accepts_unsettled_blocking = value
+                        .parse()
+                        .context("failed to parse solver's third arg param")?


This really highlights how painful configuring non-trivial things via ENV variables is. Sneakily interpreting the 3rd argument as something else when the first attempt to parse it fails is not great. OTOH forcing people to provide some value for the fairness threshold is not great either.
I think it's okay in this PR, but we should consider switching to config files for the autopilot and api sooner rather than later so the parsing doesn't become ever more complex.

I hope this will be a temporary value until CIP is approved.

I agree with @MartinquaXD, hopefully we will do that soon 🙏

crates/autopilot/src/domain/competition/participation_guard/db.rs

crates/autopilot/src/arguments.rs

crates/autopilot/src/infra/solvers/mod.rs

crates/autopilot/src/run.rs

MartinquaXD · 2025-02-17T10:27:41Z

crates/database/src/solver_competition.rs

@@ -97,6 +97,57 @@ GROUP BY sc.id
    sqlx::query_as(QUERY).bind(tx_hash).fetch_optional(ex).await
 }

+pub async fn find_non_settling_solvers(


This query is quite big and gets executed during every auction if I'm not mistaken. Did you benchmark how well it performs?
Also a comment explaining what this query is supposed to do would be good.

It gets executed in a background task outside the runloop. The execution is triggered based on the updates of the proposed_solutions and proposed_trade_executions tables.

The query has been tested and completes within 2s, since the number of auctions it considers is limited to a small value.
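The background-task arrangement described in this reply can be sketched with a worker thread that re-runs the query whenever a notification arrives and swaps the result into a shared cache. This is an assumption-laden illustration: the `mpsc` channel stands in for however table-update notifications are actually delivered, and `spawn_cache_updater` and `is_banned` are hypothetical names.

```rust
use std::collections::HashSet;
use std::sync::mpsc::Receiver;
use std::sync::{Arc, RwLock};
use std::thread::{self, JoinHandle};

/// Shared cache read by the run loop and written by the background worker.
pub type BannedSolvers = Arc<RwLock<HashSet<String>>>;

/// Spawns a worker that re-runs `query` for every notification and replaces
/// the cached set with the fresh result. The worker stops once every sender
/// of the channel has been dropped.
pub fn spawn_cache_updater<Q>(
    notifications: Receiver<()>,
    cache: BannedSolvers,
    query: Q,
) -> JoinHandle<()>
where
    Q: Fn() -> HashSet<String> + Send + 'static,
{
    thread::spawn(move || {
        for _ in notifications {
            // The slow query runs here, off the run loop's hot path.
            let fresh = query();
            *cache.write().unwrap() = fresh;
        }
    })
}

/// Cheap lookup for the run loop: only takes a read lock on the cache.
pub fn is_banned(cache: &BannedSolvers, solver: &str) -> bool {
    cache.read().unwrap().contains(solver)
}
```

With this split, a query taking a second or two only delays how fresh the cache is, not the auction itself.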

MartinquaXD · 2025-02-17T10:32:22Z

crates/autopilot/src/run_loop.rs

-                    ?err,
-                    "failed to check if solver is deny listed"
-                );
+        let can_participate = self.solver_participation_guard.can_participate(&driver.submission_address).await.map_err(|err| {


Given that we fight for every little bit of time it would probably be better to already send the request to the solver and in the meantime run the query. If it turns out the solver should be deny-listed we just discard the solution.
Given that this should be the exception not the regular case this seems like the better approach to me.
Also this drastically reduces the performance requirements for the query. If we run it before sending a request to the solver it should probably finish in a few ms but if we run it while the solver is already computing a solution it wouldn't be a problem if it takes a second or 2.

Also this drastically reduces the performance requirements for the query.

All the DB queries get executed in a background task outside the runloop.
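For reference, the suggestion in this thread of overlapping the participation check with the solve request could look roughly like the sketch below. It uses scoped threads only to stay dependency-free; in the async run loop the same idea would be expressed by joining two futures. `solve_if_allowed` is a hypothetical helper, not code from this PR.

```rust
use std::thread;

/// Runs the solve request and the participation check at the same time and
/// keeps the solution only if the solver turns out to be allowed.
pub fn solve_if_allowed<S, C, T>(solve: S, can_participate: C) -> Option<T>
where
    S: FnOnce() -> T + Send,
    C: FnOnce() -> bool + Send,
    T: Send,
{
    thread::scope(|scope| {
        let solution = scope.spawn(solve);
        let allowed = scope.spawn(can_participate);
        // Both closures are already running; joining only collects results.
        let allowed = allowed.join().expect("participation check panicked");
        let solution = solution.join().expect("solve request panicked");
        // Discard the solution if the solver should have been deny-listed.
        allowed.then_some(solution)
    })
}
```

The trade-off is that a deny-listed solver still receives the request and its work is thrown away, which is acceptable only if bans are the exception.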

crates/autopilot/src/arguments.rs

m-lord-renkse · 2025-02-17T12:09:01Z

crates/autopilot/src/arguments.rs

+                    accepts_unsettled_blocking = value
+                        .parse()
+                        .context("failed to parse solver's third arg param")?


I agree with @MartinquaXD, hopefully we will do that soon 🙏

crates/autopilot/src/domain/competition/participation_guard/db.rs

m-lord-renkse

Nice!

MartinquaXD

Change looks okay to me. But let's hold off with merging until the e2e test has been reviewed/approved as well.

squadgazzz added 5 commits January 29, 2025 12:27

Solver participation validator

5fe0dd6

Test

5319945

Avoid rpc calls every time

e65c328

Typo

fc3321b

Docs

0fbd61c

squadgazzz changed the title ~~Solver participation validator~~ Solver participation gate Jan 29, 2025

squadgazzz changed the title ~~Solver participation gate~~ Solver participation guard Jan 29, 2025

Metrics

b1abfa0

squadgazzz marked this pull request as ready for review January 29, 2025 17:00

squadgazzz requested a review from a team as a code owner January 29, 2025 17:00

squadgazzz marked this pull request as draft January 29, 2025 17:01

squadgazzz added 2 commits January 29, 2025 17:42

Configurable validators

292dcff

Fixed clap config

fe9ef5b

squadgazzz marked this pull request as ready for review January 29, 2025 18:00

MartinquaXD reviewed Jan 30, 2025

squadgazzz added 5 commits January 30, 2025 12:36

Refactoring

c5e3502

Config per solver

a9e6a3f

Start using the new config

9a55fe2

Simplify to hashset

f9bdafd

Nit

5fc831e

squadgazzz force-pushed the blacklist-failing-solvers branch 2 times, most recently from f69e174 to 5fc831e January 30, 2025 20:11

squadgazzz mentioned this pull request Jan 30, 2025

Notify banned solvers #3262

Open

squadgazzz added 5 commits January 31, 2025 15:18

Use driver's name in metrics

3154cd0

Nit

47007c1

Send metrics about each found solver

bb9059e

Cache only accepted solvers

6787d34

Refactoring

a2710c6

squadgazzz mentioned this pull request Jan 31, 2025

Ban solvers based on the settlements success rate #3263

Open

3 tasks

Fix the tests

1f43009

squadgazzz added 3 commits February 11, 2025 20:16

Trigger updates on the proposed_solution table insert

e9a70f5

Nit

e220eaf

Formatting

51832d4

squadgazzz requested a review from sunce86 February 11, 2025 21:06

sunce86 reviewed Feb 12, 2025

squadgazzz added 2 commits February 12, 2025 14:20

infra::Persistence

17ee52c

Naming

cba693a

sunce86 approved these changes Feb 13, 2025

crates/autopilot/src/domain/competition/participation_guard/db.rs (outdated, resolved)

squadgazzz mentioned this pull request Feb 13, 2025

Solver participation guard e2e test #3280

Open

squadgazzz added 3 commits February 14, 2025 09:41

Comment

c3c9433

Merge branch 'main' into blacklist-failing-solvers

58d7de1

Merge branch 'main' into blacklist-failing-solvers

fdc3afe

# Conflicts: # crates/database/src/lib.rs

MartinquaXD reviewed Feb 17, 2025

squadgazzz added 2 commits February 17, 2025 11:58

Comments

4f6cd1d

Simplify the code

bdd33d0

m-lord-renkse reviewed Feb 17, 2025

squadgazzz added 4 commits February 17, 2025 12:15

Nits

fd0fc27

Solver names in the log

e5250a5

Naming

051bd50

Fixes unit tests

4bb8640

m-lord-renkse reviewed Feb 18, 2025

crates/autopilot/src/domain/competition/participation_guard/db.rs (outdated, resolved)

Nit

de31f1e

m-lord-renkse approved these changes Feb 18, 2025

MartinquaXD approved these changes Feb 18, 2025
