
Validators should not serve BlockRequests #1878

Open
DrZoltanFazekas opened this issue Nov 22, 2024 · 2 comments · May be fixed by #2299
Assignees
Labels
Agate Required for mainnet launch

Comments


DrZoltanFazekas commented Nov 22, 2024

Syncing nodes can't distinguish the current validators from other nodes, since their latest block contains an outdated validator set. However, validators can advertise that they have no blocks, so that syncing nodes don't send them requests for the blocks they want to download.

@DrZoltanFazekas DrZoltanFazekas added the Agate Required for mainnet launch label Nov 22, 2024
@shawn-zil shawn-zil self-assigned this Dec 12, 2024

shawn-zil commented Dec 19, 2024

When we alter the Validators so that they always respond with a 'null' response (to indicate that they have no blocks), the requesting node retries the range with a different peer. This is the desired behaviour.
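The retry behaviour above can be sketched as follows. This is a minimal illustration with hypothetical, simplified types (`Node`, `BlockRequest`, `BlockResponse` here are stand-ins, not the actual zq2 structs): a Validator answers every request with an empty response, and the requester falls back to the next peer.

```rust
// Hypothetical simplified types, not the real zq2 message structs.
#[derive(Debug, PartialEq)]
struct BlockRequest {
    from_view: u64,
    to_view: u64,
}

#[derive(Debug, PartialEq)]
struct BlockResponse {
    blocks: Vec<u64>, // block views only, for illustration
}

struct Node {
    is_validator: bool,
    blocks: Vec<u64>,
}

impl Node {
    /// Validators advertise that they have no blocks ('null' response);
    /// other nodes return whatever they hold in the requested range.
    fn handle_block_request(&self, req: &BlockRequest) -> BlockResponse {
        if self.is_validator {
            return BlockResponse { blocks: Vec::new() };
        }
        let blocks = self
            .blocks
            .iter()
            .copied()
            .filter(|v| (req.from_view..req.to_view).contains(v))
            .collect();
        BlockResponse { blocks }
    }
}

/// Requester side: retry the same range against each peer in turn,
/// skipping peers that answer with an empty ('null') response.
fn request_blocks(peers: &[Node], req: &BlockRequest) -> Option<BlockResponse> {
    for peer in peers {
        let resp = peer.handle_block_request(req);
        if !resp.blocks.is_empty() {
            return Some(resp);
        }
        // Empty response: move on to the next peer.
    }
    None // no peer could serve the range -> potential deadlock case
}
```

Note that when every peer holding the requested range is a Validator, `request_blocks` returns `None` on every attempt, which is exactly the deadlock described below.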

However, it creates a potential deadlock. By definition, the non-Validators are generally 'behind' the Validators. In the rare case where restarted Validators are trying to sync, they may be unable to if the only nodes holding the latest blocks are other Validators, because those blocks have not yet propagated to the non-Validators. This can happen, e.g., when the network is restarted before the latest blocks reach the non-Validators.

As seen in this example, the Validator requests blocks [770..772], but the non-Validators only have blocks below 771.

zilliqanode2  | 2024-12-19T03:08:42.714937Z  WARN zilliqa::node: 292: block_store::BlockRequest : sending empty block response.
zilliqanode2  | 2024-12-19T03:08:42.716387Z  WARN zilliqa::block_store: 1014: block_store::request_blocks() : Could not find peers for views RangeMap { ranges: [770..772] }
zilliqanode2  | 2024-12-19T03:08:43.067254Z  INFO zilliqa::node: 967: block response from: 12D3KooWPXw2dXBRH1bT4vcNos9f6W2KoFTiarqptBuTzxaXg7zu len: 0 availability: None
zilliqanode3  | 2024-12-19T03:08:43.069067Z  INFO zilliqa::node: 967: block response from: 12D3KooWNtUCTB4J5MzXc3wawYic4m4pRwTLSL4BzTQc2Csk3VfL len: 1 availability: Some([Latest(10), CachedViewRange(0..771, None)])

Possible solutions:

  1. Allow Validators to serve BlockRequests from other Validators only, and deny BlockRequests from non-Validators; or
  2. Ignore this edge case and assume that some subset of nodes will always be running and propagating the latest blocks; or
  3. Work out some other mechanism.

I've implemented the first solution for now, unless we think that a different solution makes more sense.
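Solution (1) amounts to a membership check on the serving side. The sketch below uses hypothetical, simplified types (`ValidatorNode`, a string `PeerId`, a local `committee` set are assumptions; zq2 itself uses libp2p peer IDs and its own committee representation): a Validator serves the request only when the requester is in the current validator set, and otherwise returns the empty response.

```rust
use std::collections::HashSet;
use std::ops::Range;

// Hypothetical peer identifier; the real node uses libp2p PeerIds.
type PeerId = &'static str;

struct ValidatorNode {
    committee: HashSet<PeerId>, // current validator set, assumed known locally
    blocks: Vec<u64>,           // block views this node holds
}

impl ValidatorNode {
    /// Serve a BlockRequest only for fellow Validators; answer everyone
    /// else with the empty ('null') response so they retry another peer.
    fn serve_block_request(&self, requester: PeerId, range: Range<u64>) -> Vec<u64> {
        if !self.committee.contains(requester) {
            return Vec::new(); // deny non-Validators
        }
        self.blocks
            .iter()
            .copied()
            .filter(|v| range.contains(v))
            .collect()
    }
}
```

One caveat with this design, echoed in the follow-up comment: if the committee check is the only path to the latest blocks, any node outside the committee (including a node trying to rejoin it) can never catch up from Validators alone.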

@shawn-zil

Putting this on hold for now, since implementing it would stall consensus in the case where the only nodes holding the latest Proposals are the Validators and other nodes are unable to sync. Also, the motivation for it is performance, which might be better addressed elsewhere.

github-merge-queue bot pushed a commit that referenced this issue Feb 3, 2025
* feat: added new ExternalMessage::RequestFromHeight and ExternalMessage::RequestFromHash message types.

* feat: initial blockstore.rs skeleton.

* feat: added request/response skeleton.

* feat: hook up initial wiring of blockstore with consensus.

* feat: added blockstore.rs.

* feat: added in-flight check.

* feat: added debug/warn/trace messages.

* feat: initial requests firing.

* feat: convert config value to usize for simplicity.

* feat: added blockstore::handle_response_from_height().

* feat: checkpoint, successful RequestFromHeight-ResponseFromHeight.

* feat: direct insert into DB, without receipts/touched/state.

* feat: successfully injecting blocks/state_trie

* feat: small refactor blockstore.rs

* sec: make RequestId random, to mitigate response injections.

* feat: minor reorg.

* feat: disable speculative requests for now, until we have a better way to limit it.

* feat: re-enabled speculative fetch.

* feat: use InjectedProposals instead of ProcessProposals.

* chore: minor cleanups.

* feat: avoid single source of truth.

* fix: insufficient peers in GCP.

* feat: only inject blocks sourced from two peers - impossible to sync, atm.

* feat: sort-of working sync with multiple sources of truth.

* feat: pre-allocate enough capacity; corrected block_gap check.

* feat: replace non-corroborated blocks in cache.

* chore: clippy. checkpoint. [corroborate proposals]

* feat: [speculative fetch]

* feat: remove peer check, which allows it to proceed under circumstances where there is only 1 peer with the blocks.

* chore: clippy.

* feat: added handle_metadata_request/response().

* feat: [checkpoint - retrieve chain metadata].

* feat: added handle_multiblock_request/response().

* feat: [checkpoint - multi_block_request/response; never quite catching up.]

* chore: clippy.

* feat: sync phase#3 - zip it up. works for syncing new nodes.

* feat: rename blockstore.rs to sync.rs - makes clear that its job is to sync.

* nit: minor refactor - removing previous strategy.

* feat: request multi-blocks from original meta-data peer.

* feat: validates the chain metadata as it is retrieved.

* chore: minor cleanup.

* feat: perform checks to ensure multi-block response matches multi-block request.

* feat: allow retries of request_missing_blocks().

* feat: added ability to retry phase 1, during phase 2 error.

* feat: combined p1_cursor/p2_cursor into a self.state value.

* feat: restructure sync_proposal() to make it legible.

* checkpoint: working sync with state machine.

* Revert "sec: make RequestId random, to mitigate response injections."

This reverts commit 33d45f6.

* feat: make fixed-sized queue size, configurable.

* feat: v1 sync compatibility.

* feat: use ChainMetaData as the main state variable structure.

* feat: make sync compatible with older nodes.

* feat: default to V1 peer; upgrade to V2 peer upon getting invalid response.

* feat: filter V1 responses for gaps and forks.

* feat: working phase 1 with protomainnet.

* feat: removed sending BlockRequest from block_store.rs

* chore: comments, cleanup.

* fix: correct Phase 2 range, the stored value is accurate.

* feat: ensure unique peers.

* feat: output rate stats in block/s

* feat: minor reorg, logging.

* feat: added saving of metadata/segments to DB - allows continuation.

* feat: added stateful sync algorithm feature - can continue after restart.

* feat: rebuilt the algorithm to use DB for state, instead of in-memory.

* feat: added PeerVer info to DB.

* chore: post-rebase.

* feat: removed block_store.rs

* feat: added sync from timeout, not just proposals.

* feat: made the batch_size dynamic, so that it can get past a larger range.

* feat: added dynamic_batch_sizing() which is reactive, not pro-active.

* feat: make dynamic_batch_sizing() work per-peer, not per sync.

* fix: wire up peers in test Network.

* fix: handle when V2 BlockResponse is late.

* feat: sync batch_size should depend on the current request_range, not peer specific.

* feat: simplified the request_missing_metadata() match selector.

* fix: improve test sync, added Network::run_until_synced();

* fix: fixed unreliable::blocks_are_produced_while_a_node_restarts() test.

* fix: staking::validators_can_join_and_become_proposer() test.

* fix: tests.

* nit: use Db::contains_block() instead of Db::get_block_by_hash().

* fix: tests.

* feat: retry sync against upgraded Peer, immediately.

* fix: checkpoints_test(), randomized add_peers() for tests.

* fix: handle_forking(), validators_can_join() test.

* fix: experiment.

* fix: checkpoints_test().

* feat: check for checkpoint block, not just history.

* fix: undoing some test timeouts.

* feat: replace ChainMetaData with BlockHeader.

* feat: changed BlockRequestV2 from 'hash'-based to 'height'-based.

* feat: simplify checkpointed check.

* nit: make sync_data temporary.

* fix: better error handling for committee_for_hash().

* feat: sets starting_block during Sync::new().

* feat: store gas_used as a proxy for block size.

* feat: reordered handle_metadata_response() to allow for micro-segmentation.

* fix: removed dynamic_batch_sizing() as it should be unnecessary until block 1.0M in protomainnet.

* feat: shifts txn verification from server side to client side.

* fix: validators_can_join..()

* fix PR comments:
- https://github.com/Zilliqa/zq2/pull/2089/files#r1926920199
- https://github.com/Zilliqa/zq2/pull/2089/files#r1926894987
- https://github.com/Zilliqa/zq2/pull/2089/files#r1927243328
- https://github.com/Zilliqa/zq2/pull/2089/files#r1927261156

* feat: moved sync db layer from sync.rs to db.rs

* feat: moved internal sync-peers to SyncPeers shared-state.
https://github.com/Zilliqa/zq2/pull/2089/files#r1927206432

* fix: tests with shared-state SyncPeers.

* fix: issue of upgraded node, encountered in protomainnet where a node was recorded as V1 in Phase1, but was updated to V2 in Phase2, causing the sync to be stuck in a loop.

* nit: increase deposit_v3 boundary to 24.

* feat: use libp2p timeout instead of internal sync timeout.

* nit: change sync_data to sync_metadata table name; misc nits.

* feat: added place-holder for active/passive sync.

* fix #2227; and remove txn.verify() during Phase 2 - it is checked during Injection.

* feat: place-holder to store old ZIL txn blocks.

* nit: simplify run_until_synced();

* fix: flaw in get_next_peer().

* feat: early prototype for issue #1878.

* Delete all non-finalized blocks from database at startup

Previously we only deleted 'canonical' blocks.

* Don't fail benchmark workflows on alert

* Remove redundant config

* Hide listen addrs

* feat: store raw header blob in sync_metadata().

* feat: minor log changes; remove redundant check in handle_metadata_response().

---------

Co-authored-by: James Hinshelwood <[email protected]>
@shawn-zil shawn-zil linked a pull request Feb 6, 2025 that will close this issue