
Shorten the ClusterInfoVoteListener leader lookahead window #2377

Conversation

@steviez steviez commented Jul 31, 2024

Problem

The ClusterInfoVoteListener is responsible for forwarding votes to BankingStage. The service only forwards votes when the node will be leader soon, and it determines this by checking the PohRecorder.

As currently written, the service will begin forwarding when the node will be leader within slot_hashes::MAX_ENTRIES * 3 slots. This comes out to 512 * 3 = 1536 slots. However, any transaction whose blockhash is more than MAX_PROCESSING_AGE (150) non-skipped slots old is expired and invalid.
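For reference, the gate looks roughly like this (a sketch: `PohRecorder::would_be_leader(within_next_n_ticks)` is the real API, but the surrounding call-site details, including the `ticks_per_slot` variable, are approximated):

```rust
use solana_sdk::slot_hashes;

// Sketch of the current gate in ClusterInfoVoteListener: only forward
// votes once we will be leader within 3 * MAX_ENTRIES (= 1536) slots.
// ticks_per_slot is an assumed local; would_be_leader takes ticks.
let would_be_leader = poh_recorder
    .read()
    .unwrap()
    .would_be_leader(3 * slot_hashes::MAX_ENTRIES as u64 * ticks_per_slot);
```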

Summary of Changes

Shorten the lookahead window to MAX_PROCESSING_AGE (150) slots for forwarding votes to BankingStage

Thus, there is no point in sending votes to BankingStage until we are within MAX_PROCESSING_AGE (150) slots of our leader window, so shorten the lookahead window to MAX_PROCESSING_AGE slots.
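Concretely, the change amounts to swapping the constant that sizes the window (same sketch assumptions as above):

```rust
use solana_sdk::clock::MAX_PROCESSING_AGE;

// Before: .would_be_leader(3 * slot_hashes::MAX_ENTRIES as u64 * ticks_per_slot)
// After: any vote with a blockhash older than MAX_PROCESSING_AGE (150)
// non-skipped slots is expired anyway, so look ahead no further than that.
let would_be_leader = poh_recorder
    .read()
    .unwrap()
    .would_be_leader(MAX_PROCESSING_AGE as u64 * ticks_per_slot);
```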
@steviez steviez (Author) commented Aug 1, 2024

Made a PR for this to facilitate conversation. Tracing back with git blame, the value was changed from 20 to 1536 in solana-labs#20873. Moreover, this constant came up for discussion in https://github.com/solana-labs/solana/pull/20873/files#r742437132:

as mentioned in the PR description the window was a bit short. Seeing as we only keep around MAX_ENTRIES votes anyways, this allows for about 3 forks of size MAX_ENTRIES each

Carl's comment mentions a hypothetical worst-case scenario of 3 forks of length MAX_ENTRIES each, and wanting to start listening for votes early enough to observe those. This makes sense given that the SlotHashes sysvar and the BlockHashQueue are updated only for non-skipped slots. So, if we have 3 forks of equal length and 1500 slots have progressed overall since a common ancestor, each fork will only extend about 500 slots beyond the common ancestor.

That being said, it is still the case that any transaction with a blockhash older than MAX_PROCESSING_AGE (150) non-skipped slots is invalid. So, using the same three-forks-as-worst-case example, 3 * MAX_PROCESSING_AGE might be more appropriate. With the current 3 * 512, we really have space for ~10 forks of equal length carrying unexpired vote transactions. Moreover, logic such as nodes refreshing their votes is another argument against needing to start sending votes to a leader so early.
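To make the arithmetic concrete (constants are from solana_sdk; the fork counts are back-of-the-envelope, not exact):

```rust
use solana_sdk::{clock::MAX_PROCESSING_AGE, slot_hashes::MAX_ENTRIES};

fn main() {
    // Current window: sized for 3 forks of MAX_ENTRIES slots each.
    let current = 3 * MAX_ENTRIES; // 3 * 512 = 1536 slots
    // Votes expire after MAX_PROCESSING_AGE non-skipped slots, so sizing
    // for 3 forks of *unexpired* votes would need far less.
    let proposed = 3 * MAX_PROCESSING_AGE; // 3 * 150 = 450 slots
    // The current window actually covers ~10 equal-length forks of
    // unexpired votes: 1536 / 150 ≈ 10.
    println!("{current} vs {proposed}; forks covered: {}", current / MAX_PROCESSING_AGE);
}
```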

Curious to hear some thoughts from @AshwinSekar & @carllin. I realize that votes over gossip are a bit of a "last line of defense" when the cluster is struggling, but at the very least, still an interesting piece to discuss IMO

@AshwinSekar AshwinSekar left a comment

I'm in favor of this change; it has been on my list of things to fix in the vote processing pipeline.
The 3 * 512 window was designed in a pre-VoteStateUpdate world, where vote transactions were incremental. Along with the giant buffer in VerifiedVotePackets, it made sense to hold votes from multiple forks, as it could be necessary to apply these votes in order for each fork's VoteState to be accurate.
With VoteStateUpdate we switched to only storing the latest vote per validator, as this captures all necessary info. In this case, we do not need to store any votes older than blockhash expiration, as you've pointed out here. 150 is a conservative maximum, as under normal cluster conditions we would expect validators to vote close to once per slot.
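In other words, the post-VoteStateUpdate buffer reduces to a last-writer-wins map keyed by validator identity. A minimal sketch (the VoteEntry type and slot comparison are illustrative, not the actual VerifiedVotePackets internals):

```rust
use std::collections::HashMap;
use solana_sdk::pubkey::Pubkey;

// Illustrative stand-in for a verified gossip vote; the real entries
// also carry the packet and the full vote state update.
struct VoteEntry {
    latest_slot: u64,
}

#[derive(Default)]
struct LatestVotes {
    votes: HashMap<Pubkey, VoteEntry>,
}

impl LatestVotes {
    // Keep only the most recent vote per validator: with VoteStateUpdate,
    // the latest vote carries all the information the older ones did.
    fn insert(&mut self, validator: Pubkey, entry: VoteEntry) {
        match self.votes.get(&validator) {
            Some(existing) if existing.latest_slot >= entry.latest_slot => {}
            _ => {
                self.votes.insert(validator, entry);
            }
        }
    }
}
```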

@AshwinSekar

Also linking #2183 as it is somewhat related to this gossip -> banking stage vote pipeline.

@steviez steviez requested a review from carllin August 1, 2024 15:49
@steviez steviez (Author) commented Aug 1, 2024

In this case, we do not need to store any votes older than blockhash expiration as you've pointed out here. 150 is a conservative maximum, as under normal cluster conditions we would expect validators to vote close to once per slot.

Just to confirm:

  • When the cluster is running well, I think we are in agreement that 150 is quite conservative
  • When the cluster is NOT running well, do you think 150 is sufficient? If there is forking, we technically might be ignoring votes before the 150 slot lookahead window that would still be valid when we become leader. BUT, 150 slots (60 seconds of wall clock time) still seems quite generous for a node to get a vote out

@AshwinSekar AshwinSekar left a comment

Sorry, you make a good point; I think we can have a better solution. I will retract my previous approval; I was confusing this with the old buffer in VerifiedVotePackets.

  • When the cluster is NOT running well, do you think 150 is sufficient? If there is forking, we technically might be ignoring votes before the 150 slot lookahead window that would still be valid when we become leader. BUT, 150 slots (60 seconds of wall clock time) still seems quite generous for a node to get a vote out

When the cluster is not running well, 150 is not sufficient, and neither is 3 * 512. Since the leader can choose to build off any block, we can place no lower bound on the block height of the fork, so the necessary vote for that fork could have been sent any number of slots ago. We can make a probabilistic assumption, but I believe the only true solution for these situations is the complete fix suggested in #2183: not holding gossip votes at all.

Essentially, every single vote observed through gossip, regardless of how close we are to building our next block, should be passed to banking_stage to be held in the latest_unprocessed_votes buffer. That way we can ensure that when it comes time to build our block, regardless of which parent we choose, we have captured any votes that could land.

As a temporary measure until I get to #2183, we can remove would_be_leader entirely and send every vote to VerifiedVotePackets. VerifiedVotePackets accomplishes what latest_unprocessed_votes does for gossip votes by storing the latest vote for each pubkey. Then, for the complete fix, we can get rid of VerifiedVotePackets entirely and pass every vote through to latest_unprocessed_votes.
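A sketch of that temporary measure (the receiver and flag shapes approximate the real bank_send_loop internals and should be treated as assumptions):

```rust
// Before (sketch): only buffer gossip votes when leadership is imminent.
// let would_be_leader = poh_recorder.read().unwrap().would_be_leader(lookahead_ticks);
// verified_vote_packets.receive_and_process_vote_packets(&receiver, would_be_leader)?;

// After (sketch): buffer unconditionally. VerifiedVotePackets keeps only
// the latest vote per pubkey, so memory stays bounded by validator count,
// and any vote that could still land is on hand whichever parent we pick
// when it is time to build our block.
verified_vote_packets.receive_and_process_vote_packets(&receiver, true)?;
```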

@steviez steviez (Author) commented Aug 2, 2024

Per the above discussion, going to close this in favor of a more proper fix in #2183.

@steviez steviez closed this Aug 2, 2024
@steviez steviez deleted the civl_shorten_would_be_leader_lookahead branch August 2, 2024 21:02