Trusted node sync #3209
Conversation
Force-pushed from b67cf5a to d1540fd
beacon_chain/sync/sync_queue.nim (Outdated)
@@ -533,7 +533,7 @@ proc push*[T](sq: SyncQueue[T], sr: SyncRequest[T],
       some(sq.readyQueue.pop())
     of SyncQueueKind.Backward:
       let maxSlot = sq.readyQueue[0].request.slot +
-       (sq.readyQueue[0].request.count - 1'u64)
+       sq.readyQueue[0].request.count
Request slot points to the first slot and request count to the number of slots, so maxSlot is definitely not request.slot + request.count. For example, with request.slot = 0 and request.count = 2, slots [0, 1] will be downloaded, so maxSlot should be 1, not 2. Also, outSlot points to the next slot which should be processed, not to the last slot processed.
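The arithmetic in this comment can be sanity-checked with a short sketch (written in Python for clarity; the code under review is Nim, and the function and parameter names here are illustrative only):

```python
# Mirrors the reviewer's reasoning: a request starting at `slot` and
# covering `count` slots downloads [slot, slot + count - 1], so the
# last (max) slot is slot + count - 1, not slot + count.
def max_slot(request_slot: int, request_count: int) -> int:
    return request_slot + request_count - 1

print(max_slot(0, 2))  # 1 - slots [0, 1] are downloaded, so maxSlot is 1
```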
This comment has been minimized.
Force-pushed from d1540fd to e3e6958
Force-pushed from e3e6958 to 6597155
Hi guys, first of all thank you for doing this. I cannot tell you how requested checkpoint syncing is from people; especially if they had a catastrophic node failure and need to get their validators back up immediately. I just gave this branch a quick test. Some initial feedback:
Looks like it's failing when restarting the eth1 monitor. I suspect that isn't related to checkpoint syncing, but it should be looked at anyway.
Force-pushed from 6597155 to ffc60fc
Force-pushed from 657f9d5 to a4667d1
Force-pushed from ffc60fc to 6501fed
Force-pushed from 6501fed to 6160c7a
# The slots mapping stores one linear block history - we construct it by
# starting from a given root/slot and walking the known parents as far back
# as possible - this ensures that
if cache.slots.len() < slot.int + 1:
lenu64 would allow this to be somewhat type-safer
...and crash on the next line where setLen(slot.int + 1) happens :/
Yeah, I don't see a great approach to this. Maybe better to at least be consistent within these two lines, leave it as is.
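The construction described in the code comment above (one linear block history, built by walking known parents backwards) can be sketched as follows. This is a Python illustration with an assumed `blocks` mapping of root -> (slot, parent_root); the real code is Nim and reads from the beacon chain database:

```python
# Minimal sketch: build one linear slot -> block-root history by
# starting from a given head root and walking known parents as far
# back as possible. Empty slots stay None, mirroring setLen growth.
def build_slots(blocks, head_root):
    head_slot, _ = blocks[head_root]
    slots = [None] * (head_slot + 1)  # grown up front, like setLen
    root = head_root
    while root is not None:
        slot, parent = blocks[root]
        slots[slot] = root
        # stop when the parent is unknown - the known history ends there
        root = parent if parent in blocks else None
    return slots

# Example: slot 2 has no block, so it remains None in the mapping
blocks = {"a": (0, None), "b": (1, "a"), "d": (3, "b")}
print(build_slots(blocks, "d"))  # ['a', 'b', None, 'd']
```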
beacon_chain/trusted_node_sync.nim (Outdated)
    checkpointSlot, checkpointRoot = shortLog(checkpointBlock.root), headSlot
  quit 1

if checkpointSlot.uint64 mod SLOTS_PER_EPOCH != 0:
isEpoch handles/encapsulates the type conversion aspect
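The check being discussed (a checkpoint must sit on an epoch boundary) can be sketched in Python; in the Nim codebase, isEpoch also encapsulates the Slot/uint64 conversion, which plain Python ints do not need:

```python
SLOTS_PER_EPOCH = 32  # mainnet value

def is_epoch(slot: int) -> bool:
    # True when the slot is the first slot of an epoch (an epoch boundary)
    return slot % SLOTS_PER_EPOCH == 0

# A checkpoint slot must sit on an epoch boundary:
print(is_epoch(64))  # True
print(is_epoch(65))  # False
```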
Question: what happens if the user starts forward syncing first, then stops, then checkpoint syncs?
    info "Database fully backfilled"
  elif backfill:
    notice "Backfilling historical blocks",
      checkpointSlot, missingSlots
Will this part be shared with the deferred backfill in #3263?
The backfill in #3263 picks up wherever this backfill is interrupted.
    slot
  else:
    # When we don't have a head, we'll use the given checkpoint as head
    FAR_FUTURE_SLOT
So let's say a user:
- starts Nimbus
- realizes sync will take some time and looks for trusted node sync
- stops their node after 30 min
- restarts with trusted node sync.
The head would not be the chain head but the forward sync head, hence the user would have to delete the DB and start over?
Yes - this condition is detected further down, when we find out the checkpoint is newer: https://github.com/status-im/nimbus-eth2/pull/3209/files/6160c7a91eff88801a9a07dbf9a79b1feaa41b40#diff-415a96547dcab28dacb7bd1503b5c191600e71925f475a85c8bd616fb584a962R190
It's possible to handle this case as well by simply moving the head to the new position and backfill to the old head instead of all the way to genesis, but that's for a separate future PR I think (in particular, the backfiller would need more work to support this case cc @cheatfate )
So let's say a user:
- starts Nimbus
- realizes sync will take some time and looks for trusted node sync
- stops their node after 30 min
- restarts with trusted node sync.
The head would not be the chain head but the forward sync head, hence the user would have to delete the DB and start over?
Based on the fact that checkpoint syncing is a separate action right now, this is likely the workflow we will end up using in Rocket Pool. Check if the database file exists, if not then do a checkpoint sync first (if enabled), else start normally. If the database already exists and the user wants to checkpoint sync because it's taking too long, we will just have them run a command to delete the database and start over using this logic.
Hey, I grabbed your feedback and put it in #3285: with trusted node sync in place, it's easier to consider a checkpoint URL for the "main" command as well, but there are some things we want to consider a bit from a UX point of view before moving in that direction.
What you propose for Rocket Pool is a good starting point regardless, but let's continue the discussion in that issue - as usual, the detailed user story is much appreciated.
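The Rocket Pool workflow described above (checkpoint sync only when no database exists yet, otherwise start normally) can be sketched as a small decision helper. Everything here is hypothetical for illustration: the function name, the path, and the returned action labels are not part of Nimbus or Rocket Pool:

```python
import os

def choose_startup_action(db_path, checkpoint_url):
    # No database yet and a checkpoint URL configured: do a one-off
    # trusted node sync before the first normal start.
    if checkpoint_url and not os.path.exists(db_path):
        return "trusted-node-sync"
    # Otherwise start normally; a user who wants to checkpoint sync an
    # existing database deletes it first, landing in the branch above.
    return "start-normally"
```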
Force-pushed from 2d471b9 to 29e809a
Trusted node sync, aka checkpoint sync, allows syncing the chain from a trusted node instead of relying on a full sync from genesis.

Features include:
* sync from any slot, including the latest finalized slot
* backfill blocks either from the REST api (default) or p2p (#3263)

Future improvements:
* top up blocks between head in database and some other node - this makes for an efficient backup tool
* recreate historical state to enable historical queries
Force-pushed from 22a7f5e to 50faee6
## Caveats

A node synced using trusted node sync will not be able to serve historical requests from before the checkpoint. Future versions will resolve this issue.
What's the impact on sync committee duties?
None, really - the duties are read from the head state which is always available
Trusted node sync, aka checkpoint sync, allows syncing the chain from a trusted node instead of relying on a full sync from genesis.

Features include:
* sync from any slot, including the latest finalized slot
* backfill blocks either from the REST api (default) or p2p (#3263)

Future improvements:
* top up blocks between head in database and some other node - this makes for an efficient backup tool
* recreate historical state to enable historical queries
* use common forked block/state reader in REST API (REST JSON support improvements #3232)
* disable JSON state reader to avoid the risk of stack overflows (REST JSON support improvements #3232)
* backfilling moved to Backfiller #3263