Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Solver participation guard #3257

Open
wants to merge 36 commits into
base: main
Choose a base branch
from
Open

Solver participation guard #3257

wants to merge 36 commits into from

Conversation

squadgazzz
Copy link
Contributor

@squadgazzz squadgazzz commented Jan 29, 2025

Description

From the original issue:

When a solver repeatedly wins consecutive auctions but fails to settle its solutions on-chain, it can lead to system downtime. To prevent this, the autopilot must have the capability to temporarily exclude such solvers from participating in competitions. This ensures no single solver can disrupt the system's operations.

This PR implements it by introducing a new struct, which checks whether the solver is allowed to participate in the next competition by using two different approaches:

  1. Moved the existing Authenticator's is_solver on-chain call into the new struct.
  2. Introduced a new strategy, which finds a non-settling solver using a SQL query. It selects 3 last auctions(configurable) with a deadline until the current block to avoid selecting pending settlements and checks if all of the auctions were settled by the same solver/solvers(in case of multiple winners). This strategy caches the results to avoid redundant DB queries. This query relies on the auction_id column from the settlements table, which gets updated separately by the Observer struct, so the cache gets updated only once the Observer has some result.

These validators are called sequentially to avoid redundant RPC calls to Authenticator. So it first checks for the DB-based validator cache and, only then, sends the RPC call.

Once one of the strategies says the solver is not allowed to participate, it gets deny-listed for 5m(configurable).

Each validator can be enabled/disabled separately in case of any issue.

Metrics

Added a metric that gets populated by the DB-based validator once a solver is marked as banned. The idea is to create an alert that is sent if there are more than 4 such occurrences for the last 30 minutes for the same solver, meaning it should be considered disabling the solver.

Open discussions

  1. Since the current SQL query filters out auctions where a deadline has not been reached, the following case is possible:
    The solver gets banned, while the same solver has a pending settlement. In case this gets settled, the solver remains banned. While this is a niche case, it would be better to unblock the solver before the cache TTL deadline is reached. This has not been implemented in the current PR since some refactoring is required in the Observer struct. If this is approved, it can be implemented quickly.

  2. Whether it makes sense to introduce a metrics-based strategy similar to the bad token detector's where the solver gets banned in case >95%(or similar) of settlements fail.

How to test

A new SQL query test. Existing e2e tests.

Related Issues

Fixes #3221

Summary by CodeRabbit

  • New Features
     - Introduced advanced solver participation controls with configurable eligibility checks, integrating both on-chain and database validations.
     - Enabled asynchronous real-time notifications for settlement updates, enhancing system responsiveness.
     - Added metrics tracking to monitor auction participation and performance.

  • Chores
     - Updated internal dependencies and restructured driver configuration.
     - Reorganized the database schema to support improved auction and settlement processing.

@squadgazzz squadgazzz changed the title Solver participation validator Solver participation gate Jan 29, 2025
@squadgazzz squadgazzz changed the title Solver participation gate Solver participation guard Jan 29, 2025
@squadgazzz squadgazzz marked this pull request as ready for review January 29, 2025 17:00
@squadgazzz squadgazzz requested a review from a team as a code owner January 29, 2025 17:00
@squadgazzz squadgazzz marked this pull request as draft January 29, 2025 17:01
@squadgazzz squadgazzz marked this pull request as ready for review January 29, 2025 18:00
crates/autopilot/src/database/competition.rs Outdated Show resolved Hide resolved
crates/autopilot/src/run_loop.rs Show resolved Hide resolved
crates/autopilot/src/run_loop.rs Show resolved Hide resolved
crates/autopilot/src/arguments.rs Outdated Show resolved Hide resolved
@squadgazzz squadgazzz force-pushed the blacklist-failing-solvers branch 2 times, most recently from f69e174 to 5fc831e Compare January 30, 2025 20:11
@squadgazzz
Copy link
Contributor Author

Any issue with this approuch?

If I got it correctly, you mean some kind of FIFO cache. There is still the settlements data that needs to be fetched from the DB. I thought about this initially, but it adds more complexity to the already non-trivial solution. With the current approach, all the data is received from one source.

@sunce86
Copy link
Contributor

sunce86 commented Feb 10, 2025

There is still the settlements data that needs to be fetched from the DB.

Maybe Validator can be signalled from two sources:

  1. Inform Validator when each competition/auction is saved.
  2. Inform Validator when each settlement is observed onchain.

Then combine those data internally in Validator to match each settlement to each competition/auction and deduct which competitions ended up without settlement (using auction deadline block).

@squadgazzz
Copy link
Contributor Author

Then combine those data internally in Validator to match each settlement to each competition/auction and deduct which competitions ended up without settlement (using auction deadline block).

Why I initially didn't go with this approach(mostly because of the second point):

  1. At first glance, it seems more complex than SQL queries. Retrieve two different types of data from different sources. Update the data accordingly. Maintain a reasonable cache size.
  2. On each restart, for the statistic-based validators, it would require either executing an SQL query to populate the initial data or waiting for N auctions to accumulate enough data to start blocking the solvers.

@sunce86 Does it make sense? I am not against implementing a more complex and probably more efficient approach, but the only benefit I see is the reduction of DB queries.

@squadgazzz squadgazzz requested a review from sunce86 February 11, 2025 21:06
Copy link
Contributor

@sunce86 sunce86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly nits.

Do you have in plan writing an e2e test for this? Otherwise I'm afraid we won't have a high conviction it's working properly and we would hope for the best in prod 🙏

@squadgazzz
Copy link
Contributor Author

Do you have in plan writing an e2e test for this?

Yep, I will open another PR since 600+ lines of the code are already too much.

Copy link
Contributor

@sunce86 sunce86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG.

This new logic will always return at least one non-settling solver right? The one that is currently being settled. Not sure what to do with this information but you might want to skip logging/alerting for that special one.

@squadgazzz
Copy link
Contributor Author

This new logic will always return at least one non-settling solver right? The one that is currently being settled.

I didn't get why is that. The currently settling auction gets filtered out in the query because of this: https://github.com/cowprotocol/services/pull/3257/files#diff-ecc7354b24bcc39d93bfb90181abe577203cc25d8f94c9886b2f5f3f1b7894d5R112

@squadgazzz
Copy link
Contributor Author

Thinking more about this approach, it currently relies a lot on the RPC node connection since the settlements table gets updated only once a tx is observed onchain. This might lead to false-positive bans. Maybe it makes sense to switch to settlement error metrics, such as only counting reverts/simulation reverts and ignoring expiration events completely.

Copy link
Contributor

@MartinquaXD MartinquaXD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running a DB query on every auction makes the code easier but I'm a bit worried about the performance. Do we already have the right queries to make this fast?

Comment on lines 475 to 477
accepts_unsettled_blocking = value
.parse()
.context("failed to parse solver's third arg param")?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This really highlights how painful configuring non-trivial things via ENV variables is. Sneakily interpreting the 3rd argument for something else when the first attempt to parse something is not great. OTOH forcing people to provide some value for the fairness threshold is not great either.
I think it's okay in this PR but we should consider to switch to config files for the autopilot and api sooner rather than later to not make the parsing ever more complex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope this will be a temporary value until CIP is approved.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @MartinquaXD , hopefully we will do that soon 🙏

crates/autopilot/src/arguments.rs Outdated Show resolved Hide resolved
crates/autopilot/src/infra/solvers/mod.rs Outdated Show resolved Hide resolved
crates/autopilot/src/run.rs Outdated Show resolved Hide resolved
@@ -97,6 +97,57 @@ GROUP BY sc.id
sqlx::query_as(QUERY).bind(tx_hash).fetch_optional(ex).await
}

pub async fn find_non_settling_solvers(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This query is quite big an gets executed during every auction if I'm not mistaken. Did you benchmark how well it performs?
Also a comment explaining what this query is supposed to do would be good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It gets executed in a background task outside the runloop. The execution is triggered based on the updates of the proposed_solutions and proposed_trade_executions tables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The execution of the query is tested and it completes within 2s since the auction amount is limited by a small number.

?err,
"failed to check if solver is deny listed"
);
let can_participate = self.solver_participation_guard.can_participate(&driver.submission_address).await.map_err(|err| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we fight for every little bit of time it would probably be better to already send the request to the solver and in the meantime run the query. If it turns out the solver should be deny-listed we just discard the solution.
Given that this should be the exception not the regular case this seems like the better approach to me.
Also this drastically reduces the performance requirements for the query. If we run it before sending a request to the solver it should probably finish in a few ms but if we run it while the solver is already computing a solution it wouldn't be a problem if it takes a second or 2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this drastically reduces the performance requirements for the query.

All the DB queries get executed in a background task outside the runloop.

crates/autopilot/src/arguments.rs Outdated Show resolved Hide resolved
Comment on lines 475 to 477
accepts_unsettled_blocking = value
.parse()
.context("failed to parse solver's third arg param")?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @MartinquaXD , hopefully we will do that soon 🙏

Copy link
Contributor

@m-lord-renkse m-lord-renkse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Copy link
Contributor

@MartinquaXD MartinquaXD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change looks okay to me. But let's hold off with merging until the e2e test has been reviewed/approved as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: Kick out solver from competition if not settling
5 participants