[DESIGN/SPECS] Widdn protection #236

Open
aimxhaisse opened this issue Jan 15, 2025 · 2 comments
aimxhaisse commented Jan 15, 2025

Widdn

When in doubt, do nothing.

This is a design discussion around a potential protection for stateful pre-confirmation modules. It assumes that missing a block is not slashable by the pre-confirmation system, and that after 2 epochs the module holds the full state of the pre-confirmations expected for the next block proposal. The goal is to be able to operate pre-confirmations without risk:

  • be able to restart/upgrade commit boost
  • be able to restart/upgrade your beacon/validator/exec
  • handle bugs where, say, commit boost or a module has an OOM issue or is crash-looping

During those events, pre-confirmations can't be met properly: the in-memory state of the module is lost, or the validator key switches to another beacon/instance of the module which doesn't have the same state (if the operator uses a fallback). The resulting proposed block would miss some pre-confirmations and thus be slashed by the pre-confirmation system. This will happen at scale: if 50% of the network uses pre-confirmations, that's 3600 block proposals/day, so on days where upgrades are required (Pectra?) and multiple components of the stack have to be upgraded, those partial-state conditions will be met.
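As a back-of-the-envelope check of the 3600 figure, here is a minimal sketch assuming 12-second slots and one proposal per slot. The function name is only illustrative.

```python
SECONDS_PER_SLOT = 12          # Ethereum mainnet slot time
SECONDS_PER_DAY = 24 * 60 * 60


def preconf_proposals_per_day(preconf_share: float) -> int:
    """Number of daily block proposals made by validators using pre-confirmations."""
    slots_per_day = SECONDS_PER_DAY // SECONDS_PER_SLOT  # 7200 slots
    return round(slots_per_day * preconf_share)


print(preconf_proposals_per_day(0.5))
```

With a 50% share this gives 3600 proposals per day, each of which is exposed if it lands during an upgrade or restart window.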

The problems were described in the second community call @ https://www.youtube.com/watch?v=PPWwpPx4it0&t=1367s

Overview

The general idea of the protection is to voluntarily miss a block proposal whenever there is doubt that the block satisfies all the constraints required by the pre-confirmation system (when in doubt, do nothing). In widdn mode, the pre-confirmation module:

  • responds to /validators, /headers & /status calls properly
  • returns an error in the /blinded_blocks call
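The two bullets above could be sketched roughly as follows. This is a hypothetical dispatcher, not commit-boost code: the `Response` type, the `handle_request` name, the shortened paths and the 503 status code are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: int
    body: str


def handle_request(path: str, widdn: bool) -> Response:
    """Serve every call normally, except /blinded_blocks while in widdn mode."""
    if widdn and path.rstrip("/").endswith("/blinded_blocks"):
        # Refusing here means the signed header is never unblinded,
        # so the slot is missed instead of risking a partial block.
        return Response(503, "widdn: state may be incomplete, not proposing")
    # /validators, /headers and /status keep working so registrations
    # continue to flow while the module rebuilds its state.
    return Response(200, "forwarded")
```

Keeping the registration and status paths alive matters because the exit condition from widdn mode depends on seeing registration calls.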

Two cases need to be protected against: the module crashes and restarts (case A), or the validator falls back to another beacon due to a validator/beacon/exec crash/restart/update ... (case B). To handle those cases:

  • The module is in widdn mode if its uptime is less than 2 epochs.
  • At start, all keys handled by the module are in widdn mode. Once the module has received PBS registration calls for a key in two consecutive epochs, that key exits widdn mode and can safely propose blocks via the module.
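A minimal sketch of these two rules, under stated assumptions: the `WiddnGuard` class and its method names are hypothetical, time is measured in slots, and a key is put back in widdn mode if its registrations stop for more than one epoch (the description above does not specify this last point).

```python
from typing import Dict

SLOTS_PER_EPOCH = 32
WIDDN_EPOCHS = 2  # threshold used for both uptime and registrations


class WiddnGuard:
    def __init__(self, start_slot: int) -> None:
        self.start_slot = start_slot
        self.last_epoch: Dict[str, int] = {}  # pubkey -> last epoch with a registration
        self.streak: Dict[str, int] = {}      # pubkey -> consecutive epochs registered

    def on_registration(self, pubkey: str, slot: int) -> None:
        epoch = slot // SLOTS_PER_EPOCH
        last = self.last_epoch.get(pubkey)
        if last == epoch:
            return  # several registrations in one epoch count once
        if last is not None and epoch == last + 1:
            self.streak[pubkey] += 1
        else:
            self.streak[pubkey] = 1  # first registration, or a gap: start over
        self.last_epoch[pubkey] = epoch

    def module_in_widdn(self, slot: int) -> bool:
        return slot - self.start_slot < WIDDN_EPOCHS * SLOTS_PER_EPOCH

    def key_in_widdn(self, pubkey: str, slot: int) -> bool:
        last = self.last_epoch.get(pubkey)
        epoch = slot // SLOTS_PER_EPOCH
        if last is None or epoch - last > 1:
            return True  # never registered, or registrations went stale (assumption)
        return self.streak[pubkey] < WIDDN_EPOCHS

    def may_propose(self, pubkey: str, slot: int) -> bool:
        return not self.module_in_widdn(slot) and not self.key_in_widdn(pubkey, slot)
```

The uptime check covers case A and the per-key registration streak covers case B; a proposal goes through only when both agree.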

Case A

In this scenario, the commit-boost module restarts for whatever reason (upgrade, crash, ...). If the module previously accepted pre-confirmations/constraints for the next block proposal, it has lost that state, so it is not safe to propose a block with partial pre-confirmations.

With widdn, proposals are missed while the commit-boost instance has less than 2 epochs of uptime. During those two epochs, the module is expected to have enough time to gather all pre-confirmations for the next block proposal (either the TXs expire after 2 epochs, or the module fetches them from the gateway at start).

Case B

In this scenario, the validator switches from one commit-boost instance to another, possibly with a different state. After switching, the validator/beacon registers with the second instance on the next epoch. It can propose a block at any time, but doing so is not safe: the pre-confirmations seen by the second instance may not match the ones accepted by the prior instance.

With widdn, the module waits for register calls in two consecutive epochs before marking the validator as OK; it will potentially miss blocks during this period.

Implementation

There can likely be a helper/wrapper in the common codebase of commit-boost to facilitate this, which modules can then use.

irfanshaik11 commented Jan 16, 2025

Interesting idea!

the assumption is: "with the assumption that missing a block is non-slashable by the pre-confirmation system"

From a UX perspective, it doesn't matter whether the proposer missed the slot or missed the constraint, so asking the proposer to take the in-protocol penalty of missing the slot is an unneeded extra penalty.

Also, the drop rate might rise above 0.5% once we add in the % of times the commit-boost module fails.

It might be simpler for the gateway to do periodic health checks and disqualify spotty commit-boost instances. Good gateways should do health checks anyway, and avoid sending to sidecars that have not been alive for a while.

Widdn might make sense in the gateway itself.
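A rough sketch of what such a gateway-side filter could look like. Everything here is hypothetical: the function name, the idea that the gateway learns each sidecar's uptime from a health check, and the 2-epoch threshold borrowed from the proposal above.

```python
from typing import Dict, List

SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32
MIN_UPTIME_S = 2 * SLOTS_PER_EPOCH * SECONDS_PER_SLOT  # 2 epochs = 768 s


def eligible_sidecars(uptimes_s: Dict[str, float],
                      min_uptime_s: float = MIN_UPTIME_S) -> List[str]:
    """Return the sidecars that have been alive long enough to receive constraints.

    `uptimes_s` maps a sidecar identifier to the uptime (in seconds) reported
    by its latest health check; unreachable sidecars are simply absent.
    """
    return sorted(name for name, up in uptimes_s.items() if up >= min_uptime_s)


print(eligible_sidecars({"sidecar-a": 5000.0, "sidecar-b": 120.0}))
```

This only prevents new pre-confirmations from being routed to a freshly restarted sidecar; it does not help with constraints that were accepted before the restart.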


aimxhaisse commented Jan 16, 2025

from ux perspective it doesn't matter whether the proposer missed the slot or missed the constraint

From a UX perspective I think there is a difference: a malicious operator that fiddles with blocks differs from an honest validator that misses a block. If I send a pre-conf and the block is missed, nobody "stole" my opportunity in the block: there is no block. I can re-send it for the next block and have it included with the same effect, making the same "profit". It's as good as it can be. If I send a pre-conf and it's not included in the block, and the block is there, the validator could have stolen the opportunity. I can't resend it for the next block.

From an operational perspective, it boils down to what is actionable and within the control of the operator. Missed blocks are part of the protocol by design: the next block proposer may not have seen your block, build on the prior one instead, and get more traction on it because the network is struggling. You then get pre-conf slashed even though you had everything running perfectly.

Here is another example where penalizing missed blocks has deep consequences. Say there is suddenly an inactivity leak of 20 epochs because of a Nethermind bug, and there are two branches: the Nethermind branch (40% of the network) and the rest (60%). Operators realize there is a Nethermind bug, decide to upgrade, switch to the 60% branch, and the network continues. In those 20 epochs (640 blocks), all blocks produced on the wrong branch are now considered missed. They had the pre-confs in them, but sadly they weren't on the right side of history. Assuming half of the blocks carry pre-confs, that's about 50% of 40% of 640: you slash 128 validators. Ethereum is designed so that these scenarios recover automatically and leave room for operators to upgrade properly, and slashing for missed blocks goes against this.
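The 128 figure can be reproduced with a short calculation, assuming 32 slots per epoch and one proposal per slot. The function and parameter names are only illustrative.

```python
SLOTS_PER_EPOCH = 32


def slashed_on_minority_branch(epochs: int,
                               minority_share: float,
                               preconf_share: float) -> int:
    """Proposals with pre-confs that end up on the abandoned branch."""
    blocks = epochs * SLOTS_PER_EPOCH  # 20 epochs -> 640 blocks
    return round(blocks * minority_share * preconf_share)


print(slashed_on_minority_branch(epochs=20, minority_share=0.4, preconf_share=0.5))
```

With the numbers from the example (20 epochs, 40% minority branch, 50% pre-conf usage) this prints 128.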

Given this, I think any solution that reduces the probability of missing a block is a positive move, as it improves the overall situation for everyone. But it overlooks the way Ethereum is designed: ultimately, there will always be cases where you slash, by design, a validator that did the right thing. Allowing for block misses and working around them is, I think, a better approach. It's not in the interest of validators to miss blocks, so they'll minimize these events as best they can.
