
Validators should not serve BlockRequests #1878

Open
DrZoltanFazekas opened this issue Nov 22, 2024 · 2 comments · May be fixed by #2299
Assignees
Labels
Agate Required for mainnet launch

Comments


DrZoltanFazekas commented Nov 22, 2024

Syncing nodes can't distinguish the current validators from other nodes, since their latest block contains an outdated validator set. However, validators can advertise that they have no blocks, so that syncing nodes don't send them requests for the blocks they want to download.

@DrZoltanFazekas DrZoltanFazekas added the Agate Required for mainnet launch label Nov 22, 2024
@shawn-zil shawn-zil self-assigned this Dec 12, 2024

shawn-zil commented Dec 19, 2024

When we alter the Validators so that they always respond with a 'null' response (to indicate that they have no blocks), the requesting node retries the range with a different peer. This is the desired behaviour.
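The retry behaviour above can be sketched as follows. This is a minimal illustration with hypothetical, simplified types (`Node`, `BlockRequest`, `BlockResponse` here are stand-ins, not the actual zq2 structs): a Validator answers every request with an empty response, and the requester falls back to the next peer.

```rust
// Hypothetical simplified types, not the real zq2 message structs.
#[derive(Debug, PartialEq)]
struct BlockRequest {
    from_view: u64,
    to_view: u64,
}

#[derive(Debug, PartialEq)]
struct BlockResponse {
    blocks: Vec<u64>, // block views only, for illustration
}

struct Node {
    is_validator: bool,
    blocks: Vec<u64>,
}

impl Node {
    /// Validators advertise that they have no blocks ('null' response);
    /// other nodes return whatever they hold in the requested range.
    fn handle_block_request(&self, req: &BlockRequest) -> BlockResponse {
        if self.is_validator {
            return BlockResponse { blocks: Vec::new() };
        }
        let blocks = self
            .blocks
            .iter()
            .copied()
            .filter(|v| (req.from_view..req.to_view).contains(v))
            .collect();
        BlockResponse { blocks }
    }
}

/// Requester side: retry the same range against each peer in turn,
/// skipping peers that answer with an empty ('null') response.
fn request_blocks(peers: &[Node], req: &BlockRequest) -> Option<BlockResponse> {
    for peer in peers {
        let resp = peer.handle_block_request(req);
        if !resp.blocks.is_empty() {
            return Some(resp);
        }
        // Empty response: move on to the next peer.
    }
    None // no peer could serve the range -> potential deadlock case
}
```

Note that when every peer holding the requested range is a Validator, `request_blocks` returns `None` on every attempt, which is exactly the deadlock described below.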

However, it creates a potential deadlock. By definition, the non-Validators are generally 'behind' the Validators. In the rare case where restarted Validators are trying to sync, they may be unable to if the only nodes holding the latest blocks are other Validators, because those blocks have not yet propagated to the non-Validators. This can happen, e.g., when the network is restarted before the latest blocks reach the non-Validators.

As seen in this example, the Validator requests blocks [770..772], but the non-Validators only have blocks below 771.

zilliqanode2  | 2024-12-19T03:08:42.714937Z  WARN zilliqa::node: 292: block_store::BlockRequest : sending empty block response.
zilliqanode2  | 2024-12-19T03:08:42.716387Z  WARN zilliqa::block_store: 1014: block_store::request_blocks() : Could not find peers for views RangeMap { ranges: [770..772] }
zilliqanode2  | 2024-12-19T03:08:43.067254Z  INFO zilliqa::node: 967: block response from: 12D3KooWPXw2dXBRH1bT4vcNos9f6W2KoFTiarqptBuTzxaXg7zu len: 0 availability: None
zilliqanode3  | 2024-12-19T03:08:43.069067Z  INFO zilliqa::node: 967: block response from: 12D3KooWNtUCTB4J5MzXc3wawYic4m4pRwTLSL4BzTQc2Csk3VfL len: 1 availability: Some([Latest(10), CachedViewRange(0..771, None)])

Possible solutions:

  1. Allow Validators to serve BlockRequests from other Validators only, and deny BlockRequests from non-Validators; or
  2. Ignore this edge case and assume that some subset of nodes will always be running and propagating the latest blocks; or
  3. Work out some other mechanism.

I've implemented the first solution for now, unless we think that a different solution makes more sense.
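Solution (1) amounts to a membership check on the serving side. The sketch below uses hypothetical, simplified types (`ValidatorNode`, a string `PeerId`, a local `committee` set are assumptions; zq2 itself uses libp2p peer IDs and its own committee representation): a Validator serves the request only when the requester is in the current validator set, and otherwise returns the empty response.

```rust
use std::collections::HashSet;
use std::ops::Range;

// Hypothetical peer identifier; the real node uses libp2p PeerIds.
type PeerId = &'static str;

struct ValidatorNode {
    committee: HashSet<PeerId>, // current validator set, assumed known locally
    blocks: Vec<u64>,           // block views this node holds
}

impl ValidatorNode {
    /// Serve a BlockRequest only for fellow Validators; answer everyone
    /// else with the empty ('null') response so they retry another peer.
    fn serve_block_request(&self, requester: PeerId, range: Range<u64>) -> Vec<u64> {
        if !self.committee.contains(requester) {
            return Vec::new(); // deny non-Validators
        }
        self.blocks
            .iter()
            .copied()
            .filter(|v| range.contains(v))
            .collect()
    }
}
```

One caveat with this design, echoed in the follow-up comment: if the committee check is the only path to the latest blocks, any node outside the committee (including a node trying to rejoin it) can never catch up from Validators alone.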

@shawn-zil

Putting this on hold for now, since implementing it would stall consensus in the case where the only nodes holding the latest Proposals are the Validators and other nodes are unable to sync. Also, the motivation for it is performance, which might be better addressed elsewhere.

github-merge-queue bot pushed a commit that referenced this issue Feb 3, 2025
* feat: added new ExternalMessage::RequestFromHeight and ExternalMessage::RequestFromHash message types.

* feat: initial blockstore.rs skeleton.

* feat: added request/response skeleton.

* feat: hook up initial wiring of blockstore with consensus.

* feat: added blockstore.rs.

* feat: added in-flight check.

* feat: added debug/warn/trace messages.

* feat: initial requests firing.

* feat: convert config value to usize for simplicity.

* feat: added blockstore::handle_response_from_height().

* feat: checkpoint, successful RequestFromHeight-ResponseFromHeight.

* feat: direct insert into DB, without receipts/touched/state.

* feat: successfully injecting blocks/state_trie

* feat: small refactor blockstore.rs

* sec: make RequestId random, to mitigate response injections.

* feat: minor reorg.

* feat: disable speculative requests for now, until we have a better way to limit it.

* feat: re-enabled speculative fetch.

* feat: use InjectedProposals instead of ProcessProposals.

* chore: minor cleanups.

* feat: avoid single source of truth.

* fix: insufficient peers in GCP.

* feat: only inject blocks sourced from two peers - impossible to sync, atm.

* feat: sort-of working sync with multiple sources of truth.

* feat: pre-allocate enough capacity; corrected block_gap check.

* feat: replace non-corroborated blocks in cache.

* chore: clippy. checkpoint. [corroborate proposals]

* feat: [speculative fetch]

* feat: remove peer check, which allows it to proceed under circumstances where there is only 1 peer with the blocks.

* chore: clippy.

* feat: added handle_metadata_request/response().

* feat: [checkpoint - retrieve chain metadata].

* feat: added handle_multiblock_request/response().

* feat: [checkpoint - multi_block_request/response; never quite catching up.]

* chore: clippy.

* feat: sync phase#3 - zip it up. works for syncing new nodes.

* feat: rename blockstore.rs to sync.rs - makes clear that its job is to sync.

* nit: minor refactor - removing previous strategy.

* feat: request multi-blocks from original meta-data peer.

* feat: validates the chain metadata as it is retrieved.

* chore: minor cleanup.

* feat: perform checks to ensure multi-block response matches multi-block request.

* feat: allow retries of request_missing_blocks().

* feat: added ability to retry phase 1, during phase 2 error.

* feat: combined p1_cursor/p2_cursor into a self.state value.

* feat: restructure sync_proposal() to make it legible.

* checkpoint: working sync with state machine.

* Revert "sec: make RequestId random, to mitigate response injections."

This reverts commit 33d45f6.

* feat: make fixed-sized queue size, configurable.

* feat: v1 sync compatibility.

* feat: use ChainMetaData as the main state variable structure.

* feat: make sync compatible with older nodes.

* feat: default to V1 peer; upgrade to V2 peer upon getting invalid response.

* feat: filter V1 responses for gaps and forks.

* feat: working phase 1 with protomainnet.

* feat: removed sending BlockRequest from block_store.rs

* chore: comments, cleanup.

* fix: correct Phase 2 range, the stored value is accurate.

* feat: ensure unique peers.

* feat: output rate stats in block/s

* feat: minor reorg, logging.

* feat: added saving of metadata/segments to DB - allows continuation.

* feat: added stateful sync algorithm feature - can continue after restart.

* feat: rebuilt the algorithm to use DB for state, instead of in-memory.

* feat: added PeerVer info to DB.

* chore: post-rebase.

* feat: removed block_store.rs

* feat: added sync from timeout, not just proposals.

* feat: made the batch_size dynamic, so that it can get past a larger range.

* feat: added dynamic_batch_sizing() which is reactive, not pro-active.

* feat: make dynamic_batch_sizing() work per-peer, not per sync.

* fix: wire up peers in test Network.

* fix: handle when V2 BlockResponse is late.

* feat: sync batch_size should depend on the current request_range, not peer specific.

* feat: simplified the request_missing_metadata() match selector.

* fix: improve test sync, added Network::run_until_synced();

* fix: fixed unreliable::blocks_are_produced_while_a_node_restarts() test.

* fix: staking::validators_can_join_and_become_proposer() test.

* fix: tests.

* nit: use Db::contains_block() instead of Db::get_block_by_hash().

* fix: tests.

* feat: retry sync against upgraded Peer, immediately.

* fix: checkpoints_test(), randomized add_peers() for tests.

* fix: handle_forking(), validators_can_join() test.

* fix: experiment.

* fix: checkpoints_test().

* feat: check for checkpoint block, not just history.

* fix: undoing some test timeouts.

* feat: replace ChainMetaData with BlockHeader.

* feat: changed BlockRequestV2 from 'hash'-based to 'height'-based.

* feat: simplify checkpointed check.

* nit: make sync_data temporary.

* fix: better error handling for committee_for_hash().

* feat: sets starting_block during Sync::new().

* feat: store gas_used as a proxy for block size.

* feat: reordered handle_metadata_response() to allow for micro-segmentation.

* fix: removed dynamic_batch_sizing() as it should be unnecessary until block 1.0M in protomainnet.

* feat: shifts txn verification from server side to client side.

* fix: validators_can_join..()

* fix PR comments:
- https://github.com/Zilliqa/zq2/pull/2089/files#r1926920199
- https://github.com/Zilliqa/zq2/pull/2089/files#r1926894987
- https://github.com/Zilliqa/zq2/pull/2089/files#r1927243328
- https://github.com/Zilliqa/zq2/pull/2089/files#r1927261156

* feat: moved sync db layer from sync.rs to db.rs

* feat: moved internal sync-peers to SyncPeers shared-state.
https://github.com/Zilliqa/zq2/pull/2089/files#r1927206432

* fix: tests with shared-state SyncPeers.

* fix: issue of upgraded node, encountered in protomainnet where a node was recorded as V1 in Phase1, but was updated to V2 in Phase2, causing the sync to be stuck in a loop.

* nit: increase deposit_v3 boundary to 24.

* feat: use libp2p timeout instead of internal sync timeout.

* nit: change sync_data to sync_metadata table name; misc nits.

* feat: added place-holder for active/passive sync.

* fix #2227; and remove txn.verify() during Phase 2 - it is checked during Injection.

* feat: place-holder to store old ZIL txn blocks.

* nit: simplify run_until_synced();

* fix: flaw in get_next_peer().

* feat: early prototype for issue #1878.

* Delete all non-finalized blocks from database at startup

Previously we only deleted 'canonical' blocks.

* Don't fail benchmark workflows on alert

* Remove redundant config

* Hide listen addrs

* feat: store raw header blob in sync_metadata().

* feat: minor log changes; remove redundant check in handle_metadata_response().

---------

Co-authored-by: James Hinshelwood <[email protected]>
@shawn-zil shawn-zil linked a pull request Feb 6, 2025 that will close this issue