Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snap sync is stuck on last 65 blocks #30886

Closed
stevemilk opened this issue Dec 10, 2024 · 7 comments
Closed

Snap sync is stuck on last 65 blocks #30886

stevemilk opened this issue Dec 10, 2024 · 7 comments
Labels

Comments

@stevemilk
Copy link
Contributor

stevemilk commented Dec 10, 2024

System information

Geth version: a fork from v1.14.7, add some non-critical changes
CL client & version: https://github.com/piplabs/story v0.13.0.
OS & Version: OSX
Commit hash : (if develop)

Expected behaviour

Our testnet contains about 100 nodes.
All nodes use a fork of geth as EL, starting with flag --syncmode=full.
One difference between our CL and prysm is that our CL request fcu to EL for snap sync with the same block hash for head, safe, and finalized block. Will this affect the snap sync?

Then I start an EL with flag --syncmode=snap.
The expected behaviour is that the node completes snap sync within 2 days as full sync takes about 2 days in our testnet and there're less than 2 million blocks in total.

Actual behaviour

EL completes header sync quickly and starts fetching batch of receipts and block bodies, but is stuck at 99.99% progress for 2 days.
It keeps logging:
Syncing: chain download in progress synced=99.99% chain=8.85GiB headers=1,050,[email protected] bodies=1,049,[email protected] receipts=1,049,[email protected] eta=1.844s

Steps to reproduce the behaviour

Backtrace

flags:
geth_config.txt

Logs:

@holiman this is the issue I asked yesterday. please take a look

@rjl493456442
Copy link
Member

Looks like the state syncer is not running, or it exits with some unknown reasons.

Please try to run with DEBUG level log and attach the entire log file here.

@stevemilk
Copy link
Contributor Author

stevemilk commented Dec 10, 2024

find that EL node is not transmitting data after reaching 99.99% progress.
Screenshot 2024-12-10 at 5 22 52 PM

here's the normal speed before 99.99%:
Screenshot 2024-12-10 at 5 01 59 PM

@stevemilk
Copy link
Contributor Author

stevemilk commented Dec 10, 2024

Looks like the state syncer is not running, or it exits with some unknown reasons.

Please try to run with DEBUG level log and attach the entire log file here.

debugLog.txt
Here's the entire log

@stevemilk
Copy link
Contributor Author

stevemilk commented Dec 11, 2024

Add some logs here:

P, beforeP, afterP := splitAroundPivot(pivot.Number.Uint64(), results)
log.Debug("pivot split", "P", P, "beforeP", len(beforeP), "afterP", len(afterP), "pivot", pivot.Number)
if err := d.commitSnapSyncData(beforeP, sync); err != nil {
	return err
}
if P != nil {
	// If new pivot block found, cancel old state retrieval and restart
	if oldPivot != P {
		log.Debug("Snap sync cancelled 3")
		log.Debug("New pivot block found", "number", P.Header.Number, "hash", P.Header.Hash())
		sync.Cancel()
		sync = d.syncState(P.Header.Root)

		go closeOnErr(sync)
		oldPivot = P
	}
	...
}

And find that the last round of data fetched contains the pivot block data:

  • skeleton boundry 1 - 1,189,000
  • pivot number: 1,188,936
  • last round range: 1,188,853 - 1,189,000

see detailed log: geth_DEBUG_extra.txt

Then, due to oldPivot ==nil, state sync is canceled and restarted.
But the new state sync never completes for some reason.
Still investigating

@rjl493456442
Copy link
Member

Peer rejected account range request

It seems that all the peers connected to your node are refusing to respond to snap sync requests.

// Response is valid, but check if peer is signalling that it does not have
// the requested data. For account range queries that means the state being
// retrieved was either already pruned remotely, or the peer is not yet
// synced to our head.

For some reasons the requested state is not reachable at the peers and you might need to investigate
the reasons at the remote side.

@holiman
Copy link
Contributor

holiman commented Jan 9, 2025

Seems stale (and not actually mainnet ethereum)

@holiman holiman closed this as completed Jan 9, 2025
@stevemilk
Copy link
Contributor Author

stevemilk commented Jan 9, 2025

Thanks for help guys.
I just found that the root case is our CL client not sending FCU calls with latest head. No issue found in geth side, but I'll add some logs for debugging in geth and you can review it. Here's it: #31009

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants