Trusted node sync #3209

Merged: arnetheduck merged 3 commits into unstable from trusted-node-sync on Jan 17, 2022

Conversation

@arnetheduck (Member) commented Dec 17, 2021:

Trusted node sync, aka checkpoint sync, allows syncing the chain from a
trusted node instead of relying on a full sync from genesis.

Features include:

  • sync from any slot, including the latest finalized slot
  • backfill blocks either from the REST API (default) or p2p (#3263)

Future improvements:

  • top up blocks between the head in the database and some other node - this makes for an efficient backup tool
  • recreate historical state to enable historical queries
  • load genesis from network metadata
  • check the checkpoint block root against the state
  • fix invalid block root in REST JSON decoding
  • odds and ends

@@ -533,7 +533,7 @@ proc push*[T](sq: SyncQueue[T], sr: SyncRequest[T],
         some(sq.readyQueue.pop())
     of SyncQueueKind.Backward:
       let maxSlot = sq.readyQueue[0].request.slot +
-                    (sq.readyQueue[0].request.count - 1'u64)
+                    sq.readyQueue[0].request.count

@cheatfate (Contributor) commented Dec 17, 2021:

Request slot points to the first slot and request count to the number of slots, so maxSlot is definitely not request.slot + request.count. For example, with:

request.slot = 0
request.count = 2

slots [0, 1] will be downloaded, so maxSlot should be 1, not 2.

outSlot points to the next slot that should be processed, not to the last slot processed.
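
To spell out the arithmetic (a minimal sketch using the request fields described above, not the PR's actual code):

  # a request covering `count` slots starting at `slot` spans [slot, slot + count - 1]
  let
    slot = 0'u64    # request.slot: first slot of the request
    count = 2'u64   # request.count: number of slots requested
    maxSlot = slot + count - 1   # == 1, since only slots [0, 1] are downloaded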

github-actions bot commented Dec 17, 2021

Unit Test Results

12 files ±0 · 794 suites ±0 · 41m 56s ⏱️ +2m 21s
1 601 tests ±0 · 1 555 ✔️ ±0 · 46 💤 ±0 · 0 ±0
9 453 runs ±0 · 9 357 ✔️ ±0 · 96 💤 ±0 · 0 ±0

Results for commit 50faee6. ± Comparison against base commit 7e1cdce.

♻️ This comment has been updated with latest results.

@jclapis (Contributor) commented Jan 1, 2022:

Hi guys, first of all thank you for doing this. I can't overstate how often people ask for checkpoint syncing, especially when they've had a catastrophic node failure and need to get their validators back up immediately.

I just gave this branch a quick test. Some initial feedback:

  • My ideal mode of operation would be to add both --trusted-node-url and --backfill as command line arguments to the normal nimbus_beacon_node command (i.e. "no action"), and include them if the user specified that they would like to checkpoint sync. I would then set --trusted-node-url to whatever they specified, and --backfill to false so Nimbus pulls the historical blocks from gossip once checkpoint syncing has completed.
    • This is the behavior that both Lighthouse and Teku currently provide.
    • With the separate trustedNodeSync action, I have to manually determine whether Nimbus has synced before starting it up, and whether to run the special action first and then restart into the conventional mode afterwards. I'd strongly suggest removing the explicit action and building support for this into the "no action" mode so the client itself can make this determination and behave accordingly.
  • I tried syncing from Infura, and it failed with the following: Unable to download genesis state error="Downloading states via JSON not supported" restUrl=https://[email protected]. Support for checkpoint syncing via a provider like Infura is likely going to be a requirement for use with Rocket Pool, so I encourage adding JSON support.
  • I was able to successfully checkpoint sync by pointing it at my other Nimbus node's REST endpoint. Setting --backfill=false and then restarting Nimbus into normal operation works as intended; it pulls historical blocks from gossip (sync="backfill: 14d01h34m (0.03%) 1.6831slots/s (QQDDQQQQDQ:2045409)"), which is excellent.
  • If I disable my eth1 client, Nimbus will fail with the following after a few minutes:
eth2_1           | WRN 2022-01-01 18:33:17.143+00:00 Eth1 chain monitoring failure, restarting  topics="eth1" err="net_version(web3.provider) failed 3 times"
eth2_1           | Traceback (most recent call last, using override)
eth2_1           | /home/user/nimbus-eth2/vendor/nim-json-rpc/json_rpc/client.nim(372) main
eth2_1           | /home/user/nimbus-eth2/vendor/nim-json-rpc/json_rpc/client.nim(365) NimMain
eth2_1           | /home/user/nimbus-eth2/beacon_chain/nimbus_beacon_node.nim(1913) main
eth2_1           | /home/user/nimbus-eth2/beacon_chain/nimbus_beacon_node.nim(1570) doRunBeaconNode
eth2_1           | /home/user/nimbus-eth2/beacon_chain/nimbus_beacon_node.nim(1232) start
eth2_1           | /home/user/nimbus-eth2/beacon_chain/nimbus_beacon_node.nim(1176) run
eth2_1           | /home/user/nimbus-eth2/vendor/nim-chronos/chronos/asyncloop.nim(279) poll
eth2_1           | /home/user/nimbus-eth2/beacon_chain/eth1/eth1_monitor.nim(1215) colonanonymous
eth2_1           | /home/user/nimbus-eth2/beacon_chain/eth1/eth1_monitor.nim(1202) start
eth2_1           | /home/user/nimbus-eth2/beacon_chain/eth1/eth1_monitor.nim(1074) startEth1Syncing
eth2_1           | /home/user/nimbus-eth2/vendor/nim-chronos/chronos/asyncfutures2.nim(365) futureContinue
eth2_1           | /home/user/nimbus-eth2/beacon_chain/eth1/eth1_monitor.nim(1076) startEth1Syncing
eth2_1           | /home/user/nimbus-eth2/beacon_chain/eth1/eth1_monitor.nim(872) resetState
eth2_1           | /home/user/nimbus-eth2/vendor/nim-chronos/chronos/asyncfutures2.nim(365) futureContinue
eth2_1           | /home/user/nimbus-eth2/beacon_chain/eth1/eth1_monitor.nim(879) resetState
eth2_1           | /home/user/nimbus-eth2/beacon_chain/eth1/eth1_monitor.nim(397) close
eth2_1           | /home/user/nimbus-eth2/vendor/nim-chronos/chronos/asyncfutures2.nim(365) futureContinue
eth2_1           | /home/user/nimbus-eth2/beacon_chain/eth1/eth1_monitor.nim(404) close
eth2_1           | /home/user/nimbus-eth2/vendor/nim-web3/web3.nim(92) close
eth2_1           | /home/user/nimbus-eth2/vendor/nim-json-rpc/json_rpc/client.nim(34) close
eth2_1           | /home/user/nimbus-eth2/vendor/nim-json-rpc/json_rpc/clients/websocketclient.nim(135) close
eth2_1           | /home/user/nimbus-eth2/vendor/nim-chronos/chronos/asyncfutures2.nim(365) futureContinue
eth2_1           | /home/user/nimbus-eth2/vendor/nim-chronos/chronos/asyncmacro2.nim(113) close
eth2_1           | /home/user/nimbus-eth2/vendor/nim-json-rpc/json_rpc/clients/websocketclient.nim(136) close
eth2_1           | /home/user/nimbus-eth2/vendor/nim-chronos/chronos/asyncfutures2.nim(839) cancelAndWait
eth2_1           | /home/user/nimbus-eth2/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(610) signalHandler
eth2_1           | SIGSEGV: Illegal storage access. (Attempt to read from nil?)

Looks like it's failing when restarting the eth1 monitor. I suspect that isn't related to checkpoint syncing, but it should be looked at anyway.

@arnetheduck (Member, Author) replied:

> Support for checkpoint syncing via a provider like Infura is likely going to be a requirement in order to use it with Rocket Pool, so I encourage adding JSON support.

#3232

@arnetheduck force-pushed the unstable branch 2 times, most recently from 657f9d5 to a4667d1 on January 6, 2022 16:14
@arnetheduck mentioned this pull request Jan 9, 2022
@arnetheduck marked this pull request as ready for review January 9, 2022 21:18
# The slots mapping stores one linear block history - we construct it by
# starting from a given root/slot and walking the known parents as far back
# as possible - this ensures that
if cache.slots.len() < slot.int + 1:

A contributor commented:

lenu64 would allow this to be somewhat type-safer

@arnetheduck (Member, Author) replied:

...and crash on the next line where setLen(slot.int + 1) happens :/

The contributor replied:

Yeah, I don't see a great approach to this. Maybe better to at least be consistent within these two lines, leave it as is.
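
For reference, the two lines in question would look roughly like this (a sketch, not the PR's actual code; the names follow the snippet above, and lenu64 is assumed to be the uint64-returning length helper the reviewer mentions):

  if cache.slots.lenu64 < uint64(slot) + 1:
    # the length check stays in uint64...
    cache.slots.setLen(slot.int + 1)
    # ...but setLen takes an int, so the conversion to int happens here anyway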

checkpointSlot, checkpointRoot = shortLog(checkpointBlock.root), headSlot
quit 1

if checkpointSlot.uint64 mod SLOTS_PER_EPOCH != 0:

A contributor commented:

isEpoch handles/encapsulates the type conversion aspect
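
Presumably the check would then read along these lines (a sketch, assuming isEpoch is the existing helper that reports whether a Slot is the first slot of its epoch; the body stands in for the original error handling):

  # same intent as `checkpointSlot.uint64 mod SLOTS_PER_EPOCH != 0`,
  # without the explicit uint64 conversion
  if not checkpointSlot.isEpoch:
    quit 1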

@mratsim (Contributor) left a comment:

Question: what if the user starts forward syncing first, then stops, then checkpoint syncs?

beacon_chain/conf.nim (review thread resolved)
info "Database fully backfilled"
elif backfill:
notice "Backfilling historical blocks",
checkpointSlot, missingSlots

A contributor commented:

Will this part be shared with the deferred backfill in #3263?

@arnetheduck (Member, Author) replied:

The backfill of #3263 picks up wherever this backfill is interrupted.

slot
else:
# When we don't have a head, we'll use the given checkpoint as head
FAR_FUTURE_SLOT

A contributor commented:

So let's say a user:

  1. starts Nimbus
  2. realizes sync will take some time and looks into trusted node sync
  3. stops their node after 30 min
  4. restarts with trusted node sync

The head would not be the chain head but the forward-sync head, so the user would have to delete the DB and start over?

@arnetheduck (Member, Author) replied:

Yes - this condition is detected further down, when we find out the checkpoint is newer: https://github.com/status-im/nimbus-eth2/pull/3209/files/6160c7a91eff88801a9a07dbf9a79b1feaa41b40#diff-415a96547dcab28dacb7bd1503b5c191600e71925f475a85c8bd616fb584a962R190

It's possible to handle this case as well by simply moving the head to the new position and backfilling to the old head instead of all the way to genesis, but that's for a separate future PR, I think (in particular, the backfiller would need more work to support this case, cc @cheatfate).

A contributor replied, quoting the scenario above:

> So let's say a user: starts Nimbus, realizes sync will take some time and looks into trusted node sync, stops their node after 30 min, and restarts with trusted node sync. The head would not be the chain head but the forward-sync head, so the user would have to delete the DB and start over?

Based on the fact that checkpoint syncing is a separate action right now, this is likely the workflow we will end up using in Rocket Pool: check if the database file exists; if not, do a checkpoint sync first (if enabled), otherwise start normally. If the database already exists and the user wants to checkpoint sync because syncing is taking too long, we will just have them run a command to delete the database and start over using this logic.
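
For illustration, a minimal sketch of that wrapper logic (the data directory, database path check, and URL are placeholders, not Rocket Pool's actual configuration; the trustedNodeSync action and the --trusted-node-url/--backfill flags are used as described earlier in this thread):

  #!/bin/sh
  DATA_DIR=/data/nimbus                        # placeholder data directory
  CHECKPOINT_URL=https://checkpoint.example    # placeholder trusted node URL

  # No database yet: fetch the checkpoint state first and skip REST backfill,
  # so historical blocks are pulled over gossip after the normal start below.
  if [ ! -e "$DATA_DIR/db" ]; then             # placeholder existence check
    nimbus_beacon_node trustedNodeSync \
      --data-dir="$DATA_DIR" \
      --trusted-node-url="$CHECKPOINT_URL" \
      --backfill=false
  fi

  # Normal start; the node resumes from the checkpoint (or syncs from genesis).
  exec nimbus_beacon_node --data-dir="$DATA_DIR"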

@arnetheduck (Member, Author) replied:

Hey, I grabbed your feedback and put it in #3285: with trusted node sync in place, it's easier to consider a checkpoint URL for the "main" command as well, but there are some things we want to consider a bit from a UX point of view before moving in that direction.

What you propose for Rocket Pool is a good starting point regardless, but let's continue the discussion in that issue - as usual, the detailed user story is much appreciated.

Trusted node sync, aka checkpoint sync, allows syncing the chain from a
trusted node instead of relying on a full sync from genesis.

Features include:

* sync from any slot, including the latest finalized slot
* backfill blocks either from the REST api (default) or p2p (#3263)

Future improvements:

* top up blocks between head in database and some other node - this
makes for an efficient backup tool
* recreate historical state to enable historical queries
* load genesis from network metadata
* check checkpoint block root against state
* fix invalid block root in rest json decoding
* odds and ends

## Caveats

A node synced using trusted node sync will not be able to serve historical requests from before the checkpoint. Future versions will resolve this issue.
@mratsim (Contributor) commented Jan 14, 2022:

What's the impact on sync committee duties?

@arnetheduck (Member, Author) replied:

None, really - the duties are read from the head state, which is always available.

@arnetheduck arnetheduck merged commit 68247f8 into unstable Jan 17, 2022
@arnetheduck arnetheduck deleted the trusted-node-sync branch January 17, 2022 09:27