Trusted node sync #3209

Merged: arnetheduck merged 3 commits into unstable from trusted-node-sync on Jan 17, 2022

Conversation

@arnetheduck (Member) commented Dec 17, 2021:

Trusted node sync, aka checkpoint sync, allows syncing the chain from a
trusted node instead of relying on a full sync from genesis.

Features include:

  • sync from any slot, including the latest finalized slot
  • backfill blocks either from the REST API (default) or p2p (#3263)

Future improvements:

  • top up blocks between the head in the database and some other node - this makes for an efficient backup tool
  • recreate historical state to enable historical queries
  • load genesis from network metadata
  • check the checkpoint block root against the state
  • fix invalid block root in REST JSON decoding
  • odds and ends

@@ -533,7 +533,7 @@ proc push*[T](sq: SyncQueue[T], sr: SyncRequest[T],
         some(sq.readyQueue.pop())
     of SyncQueueKind.Backward:
       let maxSlot = sq.readyQueue[0].request.slot +
-                    (sq.readyQueue[0].request.count - 1'u64)
+                    sq.readyQueue[0].request.count

@cheatfate (Contributor) commented Dec 17, 2021:

Request slot points to the first slot and request count to the number of slots, so maxSlot is definitely not request.slot + request.count. For example, with:

request.slot = 0
request.count = 2

slots [0, 1] will be downloaded, so maxSlot should be 1, not 2.

outSlot points to the next slot that should be processed, not to the last slot processed.
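
To spell out the arithmetic (a minimal sketch using the request fields described above, not the PR's actual code):

  # a request covering `count` slots starting at `slot` spans [slot, slot + count - 1]
  let
    slot = 0'u64    # request.slot: first slot of the request
    count = 2'u64   # request.count: number of slots requested
    maxSlot = slot + count - 1   # == 1, since only slots [0, 1] are downloaded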

github-actions bot commented Dec 17, 2021

Unit Test Results

12 files ±0 · 794 suites ±0 · 41m 56s ⏱️ +2m 21s
1 601 tests ±0 · 1 555 ✔️ ±0 · 46 💤 ±0 · 0 ±0
9 453 runs ±0 · 9 357 ✔️ ±0 · 96 💤 ±0 · 0 ±0

Results for commit 50faee6. ± Comparison against base commit 7e1cdce.

♻️ This comment has been updated with latest results.

@jclapis (Contributor) commented Jan 1, 2022:

Hi guys, first of all thank you for doing this. I can't overstate how often people ask for checkpoint syncing, especially when they've had a catastrophic node failure and need to get their validators back up immediately.

I just gave this branch a quick test. Some initial feedback:

  • My ideal mode of operation would be to add both --trusted-node-url and --backfill as command line arguments to the normal nimbus_beacon_node command (i.e. "no action"), and include them if the user specified that they would like to checkpoint sync. I would then set --trusted-node-url to whatever they specified, and --backfill to false so Nimbus pulls the historical blocks from gossip once checkpoint syncing has completed.
    • This is the behavior that both Lighthouse and Teku currently provide.
    • With the separate trustedNodeSync action, I have to manually determine whether Nimbus has synced before starting it up, and whether to run the special action first and then restart into the conventional mode afterwards. I'd strongly suggest removing the explicit action and building support for this into the "no action" mode so the client itself can make this determination and behave accordingly.
  • I tried syncing from Infura, and it failed with the following: Unable to download genesis state error="Downloading states via JSON not supported" restUrl=https://[email protected]. Support for checkpoint syncing via a provider like Infura is likely going to be a requirement for use with Rocket Pool, so I encourage adding JSON support.
  • I was able to successfully checkpoint sync by pointing it at my other Nimbus node's REST endpoint. Setting --backfill=false and then restarting Nimbus into normal operation works as intended; it pulls historical blocks from gossip (sync="backfill: 14d01h34m (0.03%) 1.6831slots/s (QQDDQQQQDQ:2045409)"), which is excellent.
  • If I disable my eth1 client, Nimbus will fail with the following after a few minutes:
eth2_1           | WRN 2022-01-01 18:33:17.143+00:00 Eth1 chain monitoring failure, restarting  topics="eth1" err="net_version(web3.provider) failed 3 times"
eth2_1           | Traceback (most recent call last, using override)
eth2_1           | /home/user/nimbus-eth2/vendor/nim-json-rpc/json_rpc/client.nim(372) main
eth2_1           | /home/user/nimbus-eth2/vendor/nim-json-rpc/json_rpc/client.nim(365) NimMain
eth2_1           | /home/user/nimbus-eth2/beacon_chain/nimbus_beacon_node.nim(1913) main
eth2_1           | /home/user/nimbus-eth2/beacon_chain/nimbus_beacon_node.nim(1570) doRunBeaconNode
eth2_1           | /home/user/nimbus-eth2/beacon_chain/nimbus_beacon_node.nim(1232) start
eth2_1           | /home/user/nimbus-eth2/beacon_chain/nimbus_beacon_node.nim(1176) run
eth2_1           | /home/user/nimbus-eth2/vendor/nim-chronos/chronos/asyncloop.nim(279) poll
eth2_1           | /home/user/nimbus-eth2/beacon_chain/eth1/eth1_monitor.nim(1215) colonanonymous
eth2_1           | /home/user/nimbus-eth2/beacon_chain/eth1/eth1_monitor.nim(1202) start
eth2_1           | /home/user/nimbus-eth2/beacon_chain/eth1/eth1_monitor.nim(1074) startEth1Syncing
eth2_1           | /home/user/nimbus-eth2/vendor/nim-chronos/chronos/asyncfutures2.nim(365) futureContinue
eth2_1           | /home/user/nimbus-eth2/beacon_chain/eth1/eth1_monitor.nim(1076) startEth1Syncing
eth2_1           | /home/user/nimbus-eth2/beacon_chain/eth1/eth1_monitor.nim(872) resetState
eth2_1           | /home/user/nimbus-eth2/vendor/nim-chronos/chronos/asyncfutures2.nim(365) futureContinue
eth2_1           | /home/user/nimbus-eth2/beacon_chain/eth1/eth1_monitor.nim(879) resetState
eth2_1           | /home/user/nimbus-eth2/beacon_chain/eth1/eth1_monitor.nim(397) close
eth2_1           | /home/user/nimbus-eth2/vendor/nim-chronos/chronos/asyncfutures2.nim(365) futureContinue
eth2_1           | /home/user/nimbus-eth2/beacon_chain/eth1/eth1_monitor.nim(404) close
eth2_1           | /home/user/nimbus-eth2/vendor/nim-web3/web3.nim(92) close
eth2_1           | /home/user/nimbus-eth2/vendor/nim-json-rpc/json_rpc/client.nim(34) close
eth2_1           | /home/user/nimbus-eth2/vendor/nim-json-rpc/json_rpc/clients/websocketclient.nim(135) close
eth2_1           | /home/user/nimbus-eth2/vendor/nim-chronos/chronos/asyncfutures2.nim(365) futureContinue
eth2_1           | /home/user/nimbus-eth2/vendor/nim-chronos/chronos/asyncmacro2.nim(113) close
eth2_1           | /home/user/nimbus-eth2/vendor/nim-json-rpc/json_rpc/clients/websocketclient.nim(136) close
eth2_1           | /home/user/nimbus-eth2/vendor/nim-chronos/chronos/asyncfutures2.nim(839) cancelAndWait
eth2_1           | /home/user/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(610) signalHandler
eth2_1           | SIGSEGV: Illegal storage access. (Attempt to read from nil?)

Looks like it's failing when restarting the eth1 monitor. I suspect that isn't related to checkpoint syncing, but it should be looked at anyway.

@arnetheduck (Member, Author) replied:

> Support for checkpoint syncing via a provider like Infura is likely going to be a requirement in order to use it with Rocket Pool, so I encourage adding JSON support.

#3232

@arnetheduck force-pushed the unstable branch 2 times, most recently from 657f9d5 to a4667d1 on January 6, 2022 16:14
@arnetheduck mentioned this pull request Jan 9, 2022
@arnetheduck marked this pull request as ready for review January 9, 2022 21:18
# The slots mapping stores one linear block history - we construct it by
# starting from a given root/slot and walking the known parents as far back
# as possible - this ensures that
if cache.slots.len() < slot.int + 1:

A contributor commented:

lenu64 would allow this to be somewhat type-safer

@arnetheduck (Member, Author) replied:

...and crash on the next line where setLen(slot.int + 1) happens :/

The contributor replied:

Yeah, I don't see a great approach to this. Maybe better to at least be consistent within these two lines, leave it as is.
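
For reference, the two lines in question would look roughly like this (a sketch, not the PR's actual code; the names follow the snippet above, and lenu64 is assumed to be the uint64-returning length helper the reviewer mentions):

  if cache.slots.lenu64 < uint64(slot) + 1:
    # the length check stays in uint64...
    cache.slots.setLen(slot.int + 1)
    # ...but setLen takes an int, so the conversion to int happens here anyway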

checkpointSlot, checkpointRoot = shortLog(checkpointBlock.root), headSlot
quit 1

if checkpointSlot.uint64 mod SLOTS_PER_EPOCH != 0:

A contributor commented:

isEpoch handles/encapsulates the type conversion aspect
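
Presumably the check would then read along these lines (a sketch, assuming isEpoch is the existing helper that reports whether a Slot is the first slot of its epoch; the body stands in for the original error handling):

  # same intent as `checkpointSlot.uint64 mod SLOTS_PER_EPOCH != 0`,
  # without the explicit uint64 conversion
  if not checkpointSlot.isEpoch:
    quit 1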

@mratsim (Contributor) left a comment:

Question: what if the user starts forward syncing first, then stops, then checkpoint syncs?

beacon_chain/conf.nim (review thread resolved)
info "Database fully backfilled"
elif backfill:
notice "Backfilling historical blocks",
checkpointSlot, missingSlots

A contributor commented:

Will this part be shared with the deferred backfill in #3263?

@arnetheduck (Member, Author) replied:

The backfill of #3263 picks up wherever this backfill is interrupted.

slot
else:
# When we don't have a head, we'll use the given checkpoint as head
FAR_FUTURE_SLOT

A contributor commented:

So let's say a user:

  1. starts Nimbus
  2. realizes sync will take some time and looks into trusted node sync
  3. stops their node after 30 min
  4. restarts with trusted node sync

The head would not be the chain head but the forward-sync head, so the user would have to delete the DB and start over?

@arnetheduck (Member, Author) replied:

Yes - this condition is detected further down, when we find out the checkpoint is newer: https://github.com/status-im/nimbus-eth2/pull/3209/files/6160c7a91eff88801a9a07dbf9a79b1feaa41b40#diff-415a96547dcab28dacb7bd1503b5c191600e71925f475a85c8bd616fb584a962R190

It's possible to handle this case as well by simply moving the head to the new position and backfilling to the old head instead of all the way to genesis, but that's for a separate future PR, I think (in particular, the backfiller would need more work to support this case, cc @cheatfate).

A contributor replied, quoting the scenario above:

> So let's say a user: starts Nimbus, realizes sync will take some time and looks into trusted node sync, stops their node after 30 min, and restarts with trusted node sync. The head would not be the chain head but the forward-sync head, so the user would have to delete the DB and start over?

Based on the fact that checkpoint syncing is a separate action right now, this is likely the workflow we will end up using in Rocket Pool: check if the database file exists; if not, do a checkpoint sync first (if enabled), otherwise start normally. If the database already exists and the user wants to checkpoint sync because syncing is taking too long, we will just have them run a command to delete the database and start over using this logic.
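
For illustration, a minimal sketch of that wrapper logic (the data directory, database path check, and URL are placeholders, not Rocket Pool's actual configuration; the trustedNodeSync action and the --trusted-node-url/--backfill flags are used as described earlier in this thread):

  #!/bin/sh
  DATA_DIR=/data/nimbus                        # placeholder data directory
  CHECKPOINT_URL=https://checkpoint.example    # placeholder trusted node URL

  # No database yet: fetch the checkpoint state first and skip REST backfill,
  # so historical blocks are pulled over gossip after the normal start below.
  if [ ! -e "$DATA_DIR/db" ]; then             # placeholder existence check
    nimbus_beacon_node trustedNodeSync \
      --data-dir="$DATA_DIR" \
      --trusted-node-url="$CHECKPOINT_URL" \
      --backfill=false
  fi

  # Normal start; the node resumes from the checkpoint (or syncs from genesis).
  exec nimbus_beacon_node --data-dir="$DATA_DIR"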

@arnetheduck (Member, Author) replied:

Hey, I grabbed your feedback and put it in #3285: with trusted node sync in place, it's easier to consider a checkpoint URL for the "main" command as well, but there are some things we want to consider a bit from a UX point of view before moving in that direction.

What you propose for Rocket Pool is a good starting point regardless, but let's continue the discussion in that issue - as usual, the detailed user story is much appreciated.

Trusted node sync, aka checkpoint sync, allows syncing the chain from a
trusted node instead of relying on a full sync from genesis.

Features include:

* sync from any slot, including the latest finalized slot
* backfill blocks either from the REST api (default) or p2p (#3263)

Future improvements:

* top up blocks between head in database and some other node - this
makes for an efficient backup tool
* recreate historical state to enable historical queries
* load genesis from network metadata
* check checkpoint block root against state
* fix invalid block root in rest json decoding
* odds and ends

## Caveats

A node synced using trusted node sync will not be able to serve historical requests from before the checkpoint. Future versions will resolve this issue.
@mratsim (Contributor) commented Jan 14, 2022:

What's the impact on sync committee duties?

@arnetheduck (Member, Author) replied:

None, really - the duties are read from the head state, which is always available.

@arnetheduck arnetheduck merged commit 68247f8 into unstable Jan 17, 2022
@arnetheduck arnetheduck deleted the trusted-node-sync branch January 17, 2022 09:27