[DESIGN/SPECS] Widdn protection #236

aimxhaisse · 2025-01-15T16:07:22Z

Widdn

When in doubt, do nothing.

This is a design discussion around a potential protection for stateful pre-confirmation modules. It assumes that missing a block is not slashable by the pre-confirmation system, and that the module holds the full state of expected pre-confirmations for the next block proposal after 2 epochs. The goal is to be able to operate pre-confirmations without risk:

be able to restart/upgrade commit boost
be able to restart/upgrade your beacon/validator/exec
handle bugs where, say, commit boost or a module has an OOM issue or is crash-looping

During those events, pre-confirmations can't be met properly: the in-memory state of the module is lost, or the validator key can switch to another beacon/instance of the module which doesn't have the same state (if the operator uses a fallback). The resulting proposed block would miss some pre-confirmations and thus be slashed by the pre-confirmation system. This will happen at scale: if 50% of the network uses pre-confirmations, that's 3600 block proposals/day, so on days where upgrades are required (Pectra?) and multiple components of the stack have to be upgraded, those partial-state conditions will be met.
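As a rough check on the 3600 figure, here is a small sketch assuming 12-second slots and one proposal per slot (the function name is made up for illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_SLOT = 12  # mainnet slot time, assumed here


def preconf_proposals_per_day(preconf_share_pct: int) -> int:
    """Block proposals per day made by validators using pre-confirmations."""
    slots_per_day = SECONDS_PER_DAY // SECONDS_PER_SLOT  # 7200 slots
    return slots_per_day * preconf_share_pct // 100


print(preconf_proposals_per_day(50))  # 3600
```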

The problems were described in the second community call @ https://www.youtube.com/watch?v=PPWwpPx4it0&t=1367s

Overview

The general idea of the protection is to voluntarily miss a block proposal if there is any doubt that the block has all the constraints required by the pre-confirmation system (when in doubt, do nothing). In widdn mode, the pre-confirmation module:

responds to /validators, /headers & /status calls properly
returns an error in the /blinded_blocks call

Two cases need to be protected against: the module crashes and restarts (case A), or the validator falls back to another beacon due to a validator/beacon/exec crash/restart/update ... (case B). To handle those cases:

The module is in widdn mode if it has an uptime of less than 2 epochs.
At start, all keys handled by the module are in widdn mode. Once the module receives PBS registration calls for a key in two consecutive epochs, that key exits widdn mode and can safely propose blocks via the module.

Case A

In this scenario, the commit-boost module restarts for whatever reason (upgrade, crash, ...). If the module previously accepted pre-confirmations/constraints for the next block proposal, it has lost that state, so it is not safe to propose a block with partial pre-confirmations.

With widdn, proposals are missed while the commit-boost instance has less than 2 epochs of uptime. During those two epochs, the module is expected to have enough time to gather all pre-confirmations for the next block proposal (either the transactions expire after 2 epochs, or the module fetches them from the gateway at start).
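A minimal sketch of the uptime check for case A, assuming mainnet timing (12-second slots, 32 slots per epoch, so 2 epochs is 768 seconds). The `UptimeGate` class and its method names are hypothetical and not part of commit-boost, and Python is used only for brevity. The clock is injected so the behaviour can be checked without waiting two epochs.

```python
import time
from typing import Callable

SECONDS_PER_SLOT = 12  # mainnet value, assumed here
SLOTS_PER_EPOCH = 32   # mainnet value, assumed here
WIDDN_EPOCHS = 2       # warm-up period proposed in this issue
WIDDN_SECONDS = WIDDN_EPOCHS * SLOTS_PER_EPOCH * SECONDS_PER_SLOT  # 768


class UptimeGate:
    """Hypothetical helper: in widdn mode until the module has 2 epochs of uptime."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._started_at = clock()  # taken once, when the module starts

    def uptime_seconds(self) -> float:
        return self._clock() - self._started_at

    def in_widdn(self) -> bool:
        return self.uptime_seconds() < WIDDN_SECONDS


# Illustration with a fake clock standing in for real time.
now = [1000.0]
gate = UptimeGate(clock=lambda: now[0])
print(gate.in_widdn())  # True: the module has just (re)started
now[0] += WIDDN_SECONDS
print(gate.in_widdn())  # False: two epochs have elapsed
```

Using a monotonic clock by default avoids the uptime jumping if the wall clock is adjusted while the module is running.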

Case B

In this scenario, the validator switches from one commit-boost instance to another, possibly with a different state. After switching, the validator/beacon registers with the second instance in the next epoch. It can propose a block at any time, but that is not safe: the pre-confirmations seen by the second instance may not match the ones accepted by the prior instance.

With widdn, the module waits for registration calls in two consecutive epochs before marking the validator as OK; it will potentially miss blocks during this period.
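A sketch of the per-key tracking for case B. The `RegistrationTracker` name and its methods are hypothetical, and Python is used only for brevity. One assumption goes beyond the text above: a key whose last registration is more than one epoch old is treated as being back in widdn mode, which is one reading of "consecutive".

```python
from typing import Dict

REQUIRED_CONSECUTIVE_EPOCHS = 2  # threshold proposed in this issue


class RegistrationTracker:
    """Hypothetical per-key tracker: a key leaves widdn mode once it has
    registered in two consecutive epochs on this instance."""

    def __init__(self) -> None:
        self._last_epoch: Dict[str, int] = {}  # pubkey -> last epoch with a registration
        self._streak: Dict[str, int] = {}      # pubkey -> consecutive epochs registered

    def on_registration(self, pubkey: str, epoch: int) -> None:
        last = self._last_epoch.get(pubkey)
        if last == epoch:
            return  # repeated call within the same epoch, nothing changes
        if last is not None and epoch == last + 1:
            self._streak[pubkey] += 1
        else:
            self._streak[pubkey] = 1  # first sighting, or a gap: restart the streak
        self._last_epoch[pubkey] = epoch

    def in_widdn(self, pubkey: str, current_epoch: int) -> bool:
        last = self._last_epoch.get(pubkey)
        if last is None or current_epoch - last > 1:
            return True  # never registered here, or stale (assumption, see above)
        return self._streak[pubkey] < REQUIRED_CONSECUTIVE_EPOCHS


tracker = RegistrationTracker()
tracker.on_registration("0xabc", epoch=100)
print(tracker.in_widdn("0xabc", current_epoch=100))  # True: only one epoch seen
tracker.on_registration("0xabc", epoch=101)
print(tracker.in_widdn("0xabc", current_epoch=101))  # False: two consecutive epochs
```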

Implementation

There can likely be a helper/wrapper in the common codebase of commit-boost to facilitate this, which modules can then use.
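To make the shape of such a wrapper concrete, here is a rough sketch. It is written in Python for brevity (the actual helper would live in the commit-boost codebase), and every name in it (`widdn_protected`, `WiddnError`, `submit_blinded_block`) is hypothetical. The wrapper only guards the /blinded_blocks handler; the other calls are left untouched, as described in the overview.

```python
import functools
from typing import Any, Callable


class WiddnError(Exception):
    """Raised instead of serving /blinded_blocks while a key is in widdn mode."""


def widdn_protected(in_widdn: Callable[[str], bool]) -> Callable:
    """Hypothetical wrapper: refuse the wrapped handler while in_widdn(pubkey) is true."""

    def decorator(handler: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(handler)
        def wrapper(pubkey: str, *args: Any, **kwargs: Any) -> Any:
            if in_widdn(pubkey):
                # When in doubt, do nothing: the proposal is voluntarily missed.
                raise WiddnError(f"widdn: refusing blinded block for {pubkey}")
            return handler(pubkey, *args, **kwargs)

        return wrapper

    return decorator


# Illustration only: a set stands in for the module's real widdn state.
keys_in_widdn = {"0xabc"}


@widdn_protected(lambda pubkey: pubkey in keys_in_widdn)
def submit_blinded_block(pubkey: str, signed_block: dict) -> str:
    return "submitted"  # placeholder for the real /blinded_blocks logic
```

The predicate is passed in rather than hard-coded so that a module can combine whatever checks it needs (uptime, per-key registrations) without the wrapper knowing about them.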

irfanshaik11 · 2025-01-16T11:11:05Z

Interesting idea!

the assumption is: "with the assumption that missing a block is non-slashable by the pre-confirmation system"

from ux perspective it doesn't matter whether the proposer missed the slot or missed the constraint, so asking the proposer to take the in-protocol penalty of missing the slot is an unneeded extra penalty.

Also, the drop rate might rise above 0.5% if we add in the % of times the commit-boost module fails.

It might be simpler for the gateway to do periodic healthchecks and disqualify spotty commit-boost instances. Good gateways should do healthchecks anyway, and avoid sending to sidecars that have not been alive for a while.

widdn might make sense in the gateway itself

aimxhaisse · 2025-01-16T16:29:31Z

from ux perspective it doesn't matter whether the proposer missed the slot or missed the constraint

From a UX perspective, I think there is a difference: a malicious operator that fiddles with blocks differs from an honest validator that misses a block. If I send a pre-conf and the block is missed, nobody "stole" my opportunity in the block: there is no block. I can re-send it for the next block and have it included with the same effect, making the same "profit". It's as good as it can be. If I send a pre-conf and it's not included in the block, and the block is there, the validator could have stolen the opportunity, and I can't resend it on the next block.

From an operational perspective, it boils down to what is actionable and within the control of the operator. Missed blocks are part of the protocol by design: the next block proposer might not have seen your block, build on the prior one instead, and get more traction on it because the network is struggling. You would then get pre-conf slashed even though you had everything running perfectly.

Another example where focusing on missing blocks has deep consequences: let's say there is suddenly an inactivity leak of 20 epochs because of a Nethermind bug. There are two branches: the Nethermind branch (40% of the network) and the rest (60%). Operators realize there is a Nethermind bug, decide to upgrade, switch to the 60% branch, and the network continues. Over those 20 epochs (640 blocks), all blocks produced on the wrong branch are now considered missed: they had the pre-confs in them, but sadly they weren't on the right side of history. Assuming half of the blocks use pre-confs, that's about 50% of 40% of 640: you slash 128 validators. Ethereum is designed to recover automatically from these scenarios and to leave room for operators to upgrade properly, and slashing for missed blocks goes against this.
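The arithmetic behind the 128 figure, as a small sketch assuming 32 slots per epoch and that proposals are spread evenly across both branches (the function name is made up for illustration):

```python
SLOTS_PER_EPOCH = 32  # mainnet value, assumed here


def slashed_on_minority_branch(epochs: int, minority_pct: int, preconf_pct: int) -> int:
    """Blocks carrying pre-confs that end up on the abandoned branch."""
    total_blocks = epochs * SLOTS_PER_EPOCH             # 20 epochs -> 640 blocks
    minority_blocks = total_blocks * minority_pct // 100  # 40% -> 256 blocks
    return minority_blocks * preconf_pct // 100           # 50% -> 128 blocks


print(slashed_on_minority_branch(20, 40, 50))  # 128
```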

Given this, I think any solution that reduces the probability of missing a block is a positive move, as it improves the overall situation for everyone. But it misses the way Ethereum is designed: ultimately, there will always be cases where you slash, by design, a validator that did the right thing. Allowing for block misses and working around them is, I think, a better approach. It's not in the interest of validators to miss blocks; they'll minimize these events as best they can.
