
Shorten the ClusterInfoVoteListener leader lookahead window #2377

Conversation

@steviez steviez commented Jul 31, 2024

Problem

The ClusterInfoVoteListener is responsible for forwarding votes to BankingStage. The service only forwards votes when the node will be leader soon, and it determines this by checking the PohRecorder.

As currently written, the service will begin forwarding when the node will be leader within slot_hashes::MAX_ENTRIES * 3 slots. This comes out to 512 * 3 = 1536 slots. However, any transaction whose blockhash is more than MAX_PROCESSING_AGE (150) non-skipped slots old is expired and invalid.
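For reference, the gate looks roughly like this (a sketch: `PohRecorder::would_be_leader(within_next_n_ticks)` is the real API, but the surrounding call-site details, including the `ticks_per_slot` variable, are approximated):

```rust
use solana_sdk::slot_hashes;

// Sketch of the current gate in ClusterInfoVoteListener: only forward
// votes once we will be leader within 3 * MAX_ENTRIES (= 1536) slots.
// ticks_per_slot is an assumed local; would_be_leader takes ticks.
let would_be_leader = poh_recorder
    .read()
    .unwrap()
    .would_be_leader(3 * slot_hashes::MAX_ENTRIES as u64 * ticks_per_slot);
```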

Summary of Changes

Shorten the lookahead window to MAX_PROCESSING_AGE (150) slots for forwarding votes to BankingStage

Thus, there is no point in sending votes to BankingStage until we are within MAX_PROCESSING_AGE (150) slots of our leader window, so shorten the lookahead window to MAX_PROCESSING_AGE slots.
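Concretely, the change amounts to swapping the constant that sizes the window (same sketch assumptions as above):

```rust
use solana_sdk::clock::MAX_PROCESSING_AGE;

// Before: .would_be_leader(3 * slot_hashes::MAX_ENTRIES as u64 * ticks_per_slot)
// After: any vote with a blockhash older than MAX_PROCESSING_AGE (150)
// non-skipped slots is expired anyway, so look ahead no further than that.
let would_be_leader = poh_recorder
    .read()
    .unwrap()
    .would_be_leader(MAX_PROCESSING_AGE as u64 * ticks_per_slot);
```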
@steviez steviez (Author) commented Aug 1, 2024

Made a PR for this to facilitate conversation. Tracing back with git blame, the value was changed from 20 to 1536 in solana-labs#20873. Moreover, this constant came up for discussion in https://github.com/solana-labs/solana/pull/20873/files#r742437132:

as mentioned in the PR description the window was a bit short. Seeing as we only keep around MAX_ENTRIES votes anyways, this allows for about 3 forks of size MAX_ENTRIES each

Carl's comment mentions a hypothetical worst-case scenario of 3 forks of length MAX_ENTRIES each, and wanting to start listening for votes early enough to observe those. This makes sense given that the SlotHashes sysvar and the BlockHashQueue are updated only for non-skipped slots. So, if we have 3 forks of equal length and 1500 slots have progressed overall since a common ancestor, each fork will only extend about 500 slots beyond the common ancestor.

That being said, it is still the case that any transaction with a blockhash older than MAX_PROCESSING_AGE (150) non-skipped slots is invalid. So, using the same three-forks-as-worst-case example, 3 * MAX_PROCESSING_AGE might be more appropriate. With the current 3 * 512, we really have space for ~10 forks of equal length carrying unexpired vote transactions. Moreover, logic such as nodes refreshing their votes is another argument against needing to start sending votes to a leader so early.
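To make the arithmetic concrete (constants are from solana_sdk; the fork counts are back-of-the-envelope, not exact):

```rust
use solana_sdk::{clock::MAX_PROCESSING_AGE, slot_hashes::MAX_ENTRIES};

fn main() {
    // Current window: sized for 3 forks of MAX_ENTRIES slots each.
    let current = 3 * MAX_ENTRIES; // 3 * 512 = 1536 slots
    // Votes expire after MAX_PROCESSING_AGE non-skipped slots, so sizing
    // for 3 forks of *unexpired* votes would need far less.
    let proposed = 3 * MAX_PROCESSING_AGE; // 3 * 150 = 450 slots
    // The current window actually covers ~10 equal-length forks of
    // unexpired votes: 1536 / 150 ≈ 10.
    println!("{current} vs {proposed}; forks covered: {}", current / MAX_PROCESSING_AGE);
}
```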

Curious to hear some thoughts from @AshwinSekar & @carllin. I realize that votes over gossip are a bit of a "last line of defense" when the cluster is struggling, but at the very least, still an interesting piece to discuss IMO

@AshwinSekar AshwinSekar left a comment

I'm in favor of this change; it has been on my list of things to fix in the vote processing pipeline.
The 3 * 512 window was designed in a pre-VoteStateUpdate world, where vote transactions were incremental. Along with the giant buffer in VerifiedVotePackets, it made sense to hold votes from multiple forks, as it could be necessary to apply these votes in order for each fork's VoteState to be accurate.
With VoteStateUpdate we switched to only storing the latest vote per validator, as this captures all necessary info. In this case, we do not need to store any votes older than blockhash expiration, as you've pointed out here. 150 is a conservative maximum, as under normal cluster conditions we would expect validators to vote close to once per slot.
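In other words, the post-VoteStateUpdate buffer reduces to a last-writer-wins map keyed by validator identity. A minimal sketch (the VoteEntry type and slot comparison are illustrative, not the actual VerifiedVotePackets internals):

```rust
use std::collections::HashMap;
use solana_sdk::pubkey::Pubkey;

// Illustrative stand-in for a verified gossip vote; the real entries
// also carry the packet and the full vote state update.
struct VoteEntry {
    latest_slot: u64,
}

#[derive(Default)]
struct LatestVotes {
    votes: HashMap<Pubkey, VoteEntry>,
}

impl LatestVotes {
    // Keep only the most recent vote per validator: with VoteStateUpdate,
    // the latest vote carries all the information the older ones did.
    fn insert(&mut self, validator: Pubkey, entry: VoteEntry) {
        match self.votes.get(&validator) {
            Some(existing) if existing.latest_slot >= entry.latest_slot => {}
            _ => {
                self.votes.insert(validator, entry);
            }
        }
    }
}
```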

@AshwinSekar

Also linking #2183 as it is somewhat related to this gossip -> banking stage vote pipeline.

@steviez steviez requested a review from carllin August 1, 2024 15:49
@steviez steviez (Author) commented Aug 1, 2024

In this case, we do not need to store any votes older than blockhash expiration as you've pointed out here. 150 is a conservative maximum, as under normal cluster conditions we would expect validators to vote close to once per slot.

Just to confirm:

  • When the cluster is running well, I think we are in agreement that 150 is quite conservative
  • When the cluster is NOT running well, do you think 150 is sufficient? If there is forking, we technically might be ignoring votes before the 150 slot lookahead window that would still be valid when we become leader. BUT, 150 slots (60 seconds of wall clock time) still seems quite generous for a node to get a vote out

@AshwinSekar AshwinSekar left a comment

Sorry, you make a good point; I think we can have a better solution. I will retract my previous approval; I was confusing this with the old buffer in VerifiedVotePackets.

  • When the cluster is NOT running well, do you think 150 is sufficient? If there is forking, we technically might be ignoring votes before the 150 slot lookahead window that would still be valid when we become leader. BUT, 150 slots (60 seconds of wall clock time) still seems quite generous for a node to get a vote out

When the cluster is not running well, 150 is not sufficient, and neither is 3 * 512. Since the leader can choose to build off any block, we can place no lower bound on the block height of the fork, so the necessary vote for that fork could have been sent any number of slots ago. We can make a probabilistic assumption, but I believe the only true solution for these situations is the complete fix suggested in #2183: not holding gossip votes at all.

Essentially, every single vote observed through gossip, regardless of how close we are to building our next block, should be passed to banking_stage to be held in the latest_unprocessed_votes buffer. That way we can ensure that when it comes time to build our block, regardless of which parent we choose, we have captured any votes that could land.

As a temporary measure until I get to #2183, we can remove would_be_leader entirely and send every vote to VerifiedVotePackets. VerifiedVotePackets accomplishes what latest_unprocessed_votes does for gossip votes by storing the latest vote for each pubkey. Then, for the complete fix, we can get rid of VerifiedVotePackets entirely and pass every vote through to latest_unprocessed_votes.
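A sketch of that temporary measure (the receiver and flag shapes approximate the real bank_send_loop internals and should be treated as assumptions):

```rust
// Before (sketch): only buffer gossip votes when leadership is imminent.
// let would_be_leader = poh_recorder.read().unwrap().would_be_leader(lookahead_ticks);
// verified_vote_packets.receive_and_process_vote_packets(&receiver, would_be_leader)?;

// After (sketch): buffer unconditionally. VerifiedVotePackets keeps only
// the latest vote per pubkey, so memory stays bounded by validator count,
// and any vote that could still land is on hand whichever parent we pick
// when it is time to build our block.
verified_vote_packets.receive_and_process_vote_packets(&receiver, true)?;
```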

@steviez steviez (Author) commented Aug 2, 2024

Per the above discussion, going to close this in favor of a more proper fix in #2183.

@steviez steviez closed this Aug 2, 2024
@steviez steviez deleted the civl_shorten_would_be_leader_lookahead branch August 2, 2024 21:02