-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Solver participation guard #3257
base: main
Are you sure you want to change the base?
Conversation
crates/autopilot/src/domain/competition/solver_participation_guard.rs
Outdated
Show resolved
Hide resolved
f69e174
to
5fc831e
Compare
If I got it correctly, you mean some kind of FIFO cache. There is still the settlements data that needs to be fetched from the DB. I thought about this initially, but it adds more complexity to the already non-trivial solution. With the current approach, all the data is received from one source. |
Maybe Validator can be signalled from two sources:
Then combine those data internally in Validator to match each settlement to each competition/auction and deduct which competitions ended up without settlement (using auction deadline block). |
Why I initially didn't go with this approach(mostly because of the second point):
@sunce86 Does it make sense? I am not against implementing a more complex and probably more efficient approach, but the only benefit I see is the reduction of DB queries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly nits.
Do you have in plan writing an e2e test for this? Otherwise I'm afraid we won't have a high conviction it's working properly and we would hope for the best in prod 🙏
crates/autopilot/src/domain/competition/participation_guard/db.rs
Outdated
Show resolved
Hide resolved
crates/autopilot/src/domain/competition/participation_guard/db.rs
Outdated
Show resolved
Hide resolved
crates/autopilot/src/domain/competition/participation_guard/db.rs
Outdated
Show resolved
Hide resolved
crates/autopilot/src/domain/competition/participation_guard/db.rs
Outdated
Show resolved
Hide resolved
crates/autopilot/src/domain/competition/participation_guard/mod.rs
Outdated
Show resolved
Hide resolved
Yep, I will open another PR since 600+ lines of the code are already too much. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG.
This new logic will always return at least one non-settling solver right? The one that is currently being settled. Not sure what to do with this information but you might want to skip logging/alerting for that special one.
crates/autopilot/src/domain/competition/participation_guard/db.rs
Outdated
Show resolved
Hide resolved
I didn't get why is that. The currently settling auction gets filtered out in the query because of this: https://github.com/cowprotocol/services/pull/3257/files#diff-ecc7354b24bcc39d93bfb90181abe577203cc25d8f94c9886b2f5f3f1b7894d5R112 |
# Conflicts: # crates/database/src/lib.rs
Thinking more about this approach, it currently relies a lot on the RPC node connection since the settlements table gets updated only once a tx is observed onchain. This might lead to false-positive bans. Maybe it makes sense to switch to settlement error metrics, such as only counting reverts/simulation reverts and ignoring expiration events completely. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running a DB query on every auction makes the code easier but I'm a bit worried about the performance. Do we already have the right queries to make this fast?
crates/autopilot/src/arguments.rs
Outdated
accepts_unsettled_blocking = value | ||
.parse() | ||
.context("failed to parse solver's third arg param")? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This really highlights how painful configuring non-trivial things via ENV variables is. Sneakily interpreting the 3rd argument for something else when the first attempt to parse something is not great. OTOH forcing people to provide some value for the fairness threshold is not great either.
I think it's okay in this PR but we should consider to switch to config files for the autopilot and api sooner rather than later to not make the parsing ever more complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope this will be a temporary value until CIP is approved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @MartinquaXD , hopefully we will do that soon 🙏
crates/autopilot/src/domain/competition/participation_guard/db.rs
Outdated
Show resolved
Hide resolved
crates/autopilot/src/domain/competition/participation_guard/db.rs
Outdated
Show resolved
Hide resolved
crates/autopilot/src/domain/competition/participation_guard/db.rs
Outdated
Show resolved
Hide resolved
@@ -97,6 +97,57 @@ GROUP BY sc.id | |||
sqlx::query_as(QUERY).bind(tx_hash).fetch_optional(ex).await | |||
} | |||
|
|||
pub async fn find_non_settling_solvers( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This query is quite big an gets executed during every auction if I'm not mistaken. Did you benchmark how well it performs?
Also a comment explaining what this query is supposed to do would be good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It gets executed in a background task outside the runloop. The execution is triggered based on the updates of the proposed_solutions
and proposed_trade_executions
tables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The execution of the query is tested and it completes within 2s since the auction amount is limited by a small number.
?err, | ||
"failed to check if solver is deny listed" | ||
); | ||
let can_participate = self.solver_participation_guard.can_participate(&driver.submission_address).await.map_err(|err| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we fight for every little bit of time it would probably be better to already send the request to the solver and in the meantime run the query. If it turns out the solver should be deny-listed we just discard the solution.
Given that this should be the exception not the regular case this seems like the better approach to me.
Also this drastically reduces the performance requirements for the query. If we run it before sending a request to the solver it should probably finish in a few ms but if we run it while the solver is already computing a solution it wouldn't be a problem if it takes a second or 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this drastically reduces the performance requirements for the query.
All the DB queries get executed in a background task outside the runloop.
crates/autopilot/src/arguments.rs
Outdated
accepts_unsettled_blocking = value | ||
.parse() | ||
.context("failed to parse solver's third arg param")? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @MartinquaXD , hopefully we will do that soon 🙏
crates/autopilot/src/domain/competition/participation_guard/db.rs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change looks okay to me. But let's hold off with merging until the e2e test has been reviewed/approved as well.
Description
From the original issue:
This PR implements it by introducing a new struct, which checks whether the solver is allowed to participate in the next competition by using two different approaches:
Authenticator
'sis_solver
on-chain call into the new struct.Observer
struct, so the cache gets updated only once theObserver
has some result.These validators are called sequentially to avoid redundant RPC calls to
Authenticator
. So it first checks for the DB-based validator cache and, only then, sends the RPC call.Once one of the strategies says the solver is not allowed to participate, it gets deny-listed for 5m(configurable).
Each validator can be enabled/disabled separately in case of any issue.
Metrics
Added a metric that gets populated by the DB-based validator once a solver is marked as banned. The idea is to create an alert that is sent if there are more than 4 such occurrences for the last 30 minutes for the same solver, meaning it should be considered disabling the solver.
Open discussions
Since the current SQL query filters out auctions where a deadline has not been reached, the following case is possible:
The solver gets banned, while the same solver has a pending settlement. In case this gets settled, the solver remains banned. While this is a niche case, it would be better to unblock the solver before the cache TTL deadline is reached. This has not been implemented in the current PR since some refactoring is required in the Observer struct. If this is approved, it can be implemented quickly.
Whether it makes sense to introduce a metrics-based strategy similar to the bad token detector's where the solver gets banned in case >95%(or similar) of settlements fail.
How to test
A new SQL query test. Existing e2e tests.
Related Issues
Fixes #3221
Summary by CodeRabbit
New Features
- Introduced advanced solver participation controls with configurable eligibility checks, integrating both on-chain and database validations.
- Enabled asynchronous real-time notifications for settlement updates, enhancing system responsiveness.
- Added metrics tracking to monitor auction participation and performance.
Chores
- Updated internal dependencies and restructured driver configuration.
- Reorganized the database schema to support improved auction and settlement processing.