-
Notifications
You must be signed in to change notification settings - Fork 789
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Node can't sync with the network #493
Comments
Clarification: it does seem to download all the blocks after all and even tries to import them (multiple times), but fails to find parent block that according to logs it seems to receive without issues. This happens over and over again:
Explorer (currently live) for this version of the network: https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Frpc.devnet.subspace.network%2Fws#/explorer |
Parent is block 9718, which also happens to be the best block number the node is at, but from a different fork. Basically node stopped at non-canonical block and can't get back onto canonical chain. The common ancestor is just 3 blocks deep and not finalized. |
Yeah I agree, that is probably the core issue. If you do a sync from genesis with a clean DB does it work then? Could possibly be related to this paritytech/substrate#12830 |
Yes, just synced a fresh node from genesis successfully. There are also multiple nodes that were live all the time and work fine, but logs are from one of the two nodes so far that got stuck at different heights.
Yes, I also saw similar logs in paritytech/substrate#13295 which might be related/have a similar root cause. |
Before you continue with the sync refactoring, could you take a look at this problem? |
Hey folks, how high is it on your priority list? |
@dmitry-markin is looking into it, I'll give it a go as well if there is no definite answer soonish but there are other quite high priority tasks in the pipeline as well. Did bisect reveal paritytech/substrate@319a7bd to be the offending commit because it sounds a little suspicious? If this started failing for you recently, have you had a chance to bisect the history of Subspace, in case there is a commit in your history that triggers this behavior in Substrate? How easy is this to reproduce? Just running a Subspace node with default settings will do? |
It happens eventually, likely related to restarts (at least once happened on node that working fine prior to restart), not yet sure how to trigger it reliably, but it does happen quite often given tiny size of this test network. We have been doing many major changes lately, so bisecting will be trickier than to debug directly with local Substrate fork. I have uploaded zip archive with broken node using which I was generating above logs: https://nextcloud.mokrynskyi.com/s/qCoWgZ6Td49xwHf Should be 100% reproducible with above database and https://github.com/subspace/subspace/releases/tag/snapshot-2023-aug-07 when running like this:
It is already at a certain height and will never move any higher than that, it is stuck. Above release uses our Substrate fork at https://github.com/subspace/substrate/tree/subspace-v5, |
To share my findings so far — I was able to reproduce in tests what looks quite similar to the problem described in paritytech/substrate#12830. If it's the underlying issue, we now should at least have an easy way to reproduce and debug it. |
P.S. Here is the failing test if you want to look into: paritytech/substrate#14735. |
It's not completely clear yet, but may be what's happening is a weird combination of syncing logic and peer disconnection due to reputation system. At least with your subspace snapshot/database it looks like this (copying only the relevant lines):
And we never finish filling the gap. What I also find quite annoying is that the block chunks are requested by block number and not by block hash — while this should not cause issues with downloading of a canonical chain, it can interfere with downloading forks. Also, it looks like the issue is not directly related to paritytech/substrate#12830, because in that case the problem was connected with inability to fill fork gaps > 128 blocks with one request, and in our case we have multiple successful "gap" requests. What I would try doing next is disabling @nazar-pc By the way, how do you patch subspace to build against local substrate source? I tried naive patching with diener but hit the issue of building against multiple version of substrate crates. |
@dmitry-markin diener is what I use as well, just need to update |
It turned out the trick was to patch both It looks like the block download continues normally in a descending manner towards block 9715 at one block per response. Unfortunately, as soon as the client downloads block 9719, the sync state machine decides it's ready to import blocks (as it already has 9718), and tries to import all downloaded blocks. Unfortunately, they are on a different fork so the import fails, the blocks are discarded, and the process is restarted. So, IMO, the problem is that block import is triggered by block number (incorrectly, as the parent is different) instead of the block hash with respect to the parent hash. At the same time the block download request is seemingly correct, just the block import is triggered too soonish before all the blocks in the range are downloaded. What likely makes the issue manifest here is that the blocks are big in this part of the chain and only fit one block per response, so the range is not fulfilled completely via one response. |
Yep, that sounds exactly the way I expected it to be |
We clearly have issues with forks downloading. Here is the original range request:
There is no message And due to the logic of downloading of the canonical chain, block import is triggered (as I wrote in the previous message) before we actually have all the blocks from the range. @nazar-pc I will look into the issue more today, but I'm afraid there is no quick fix available. Unfortunately, I'm not available next week, so we'll either need to postpone the fix, or I'll need to ask you too look into it. |
Understandable, thanks for all the help so far, I appreciate it! We have an alternative sync in our protocol, so node will get un-stuck eventually (though might take a long time). I'll post an update here if I get to fixing this now that we have a better understanding why it happens. |
I have a unit test that reproduces the issue. To summarize the problem:
There are several ways to address this. But it seems the issue is because this is not treated as a fork target upon completion of ancestry search (because of the condition: Can you please share some context around why this check is enforced? Is this to use the fork target path only for smaller branches? |
To be honest, I don't know the reason behind this check. Did you try removing it and checking if it helps to unstall the syncing? |
Context: paritytech#493 (comment) Problem: When downloading from a fork, the peer may send a partial response that extends the current chain. Even though the response starts at a block number after the current best number, the parent block from the fork is not yet downloaded/imported. We currently try to import the partial response in this scenario, which causes import to fail as parent is not found in the backend. This causes sync to be aborted/restarted which may not converge. Fix: when we receive a partial response in this scenario, hold on to the partial blocks and try to download the missing blocks. And import when the full range is downloaded.
I have #1686, please take a look. I went down the path of trying to reuse the fork targets. But discovered it may not be suitable, as it does not handle cases like partial responses, etc and requires more extensive changes |
@arkpar To supplement our discussion, this is the branch that is triggered when selecting which blocks to download in |
This makes me curious if just switching the direction of the range requested to ascending will fix the issue not breaking syncing somewhere else. (The obvious issue we might run into is that downloading backwards from a given hash makes clear what fork we are requesting, while downloading forward says nothing about the fork preference.) |
I believe I have a simpler fix for the issue: #1812, PTAL |
Cool, looks very nice! (Switching the range direction in fact broke 5 tests in syncing.) |
When retrieving the ready blocks, verify that the parent of the first ready block is on chain. If the parent is not on chain, we are downloading from a fork. In this case, keep downloading until we have a parent on chain (common ancestor). Resolves #493. --------- Co-authored-by: Aaro Altonen <[email protected]>
When retrieving the ready blocks, verify that the parent of the first ready block is on chain. If the parent is not on chain, we are downloading from a fork. In this case, keep downloading until we have a parent on chain (common ancestor). Resolves paritytech#493. --------- Co-authored-by: Aaro Altonen <[email protected]> (cherry picked from commit 2b4b33d)
I think #1812 that fixed this issue actually resulted in a regression: autonomys/subspace#2122 I have attached some additional logs downstream if you want to look at it, I think the fix should probably be reverted until it is corrected. |
I was able to isolate the issue from the logs, have opened #1999 that adds a unit test for the failure scenario. The basic failure scenario is described in the PR. This also shows up in different ways, for instance:
This out of order scenario could be potentially mitigated by including time stamps in block announcements, so that older announcements can be ignored. Overall, the issue is the same that was addressed by 1812: relying on block numbers instead of hashes. There could be several solutions to this |
The obvious fix is to check if the announced block extends peer's best block. But there may be legit cases that may get filtered by this check. Also, if the announcement is for a block not immediately after the current best block, this check is not possible |
@rahulksnv thanks for investigation. But why does the import work correctly without the check for the parent block being on chain? Shouldn't it fail as we are importing only part of the fork? |
We don't know yet if the import works (we haven't deployed it after 1812 was rolled back). But I guess this is what would happen: import would fail, sync would get reset as before 1812. Then depending on which peers the blocks are downloaded from next time around, import of the whole chain may complete successfully. 1812 was basically preventing import happening in the first place. We had a head of line blocking kind of scenario: 1001 was from wrong fork, and we get stuck as the peer |
Can confirm, it happened few times after 1812 was rolled back as well. So it ends up discarding blocks and restarting the sync, eventually somehow completes. Here was an instance where it ends up with block 20715 from a branch, 20716 from different branch after a re-org. There is a burst of import failures and sync restarts
|
#1999 - adds unit test to demonstrate the issue |
The change adds a test to show the failure scenario that caused #1812 to be rolled back (more context: #493 (comment)) Summary of the scenario: 1. Node has finished downloading up to block 1000 from the peers, from the canonical chain. 2. Peers are undergoing re-org around this time. One of the peers has switched to a non-canonical chain, announces block 1001 from that chain 3. Node downloads 1001 from the peer, and tries to import which would fail (as we don't have the parent block 1000 from the other chain) --------- Co-authored-by: Dmitry Markin <[email protected]>
When retrieving the ready blocks, verify that the parent of the first ready block is on chain. If the parent is not on chain, we are downloading from a fork. In this case, keep downloading until we have a parent on chain (common ancestor). Resolves paritytech#493. --------- Co-authored-by: Aaro Altonen <[email protected]>
The change adds a test to show the failure scenario that caused paritytech#1812 to be rolled back (more context: paritytech#493 (comment)) Summary of the scenario: 1. Node has finished downloading up to block 1000 from the peers, from the canonical chain. 2. Peers are undergoing re-org around this time. One of the peers has switched to a non-canonical chain, announces block 1001 from that chain 3. Node downloads 1001 from the peer, and tries to import which would fail (as we don't have the parent block 1000 from the other chain) --------- Co-authored-by: Dmitry Markin <[email protected]>
* Expose two nodes publicly through brucke.link * Use flags for GNU `sed` instead of BSD `sed` * Update Substrate relay entrypoint scripts to initialize bridge * Add Rialto to Millau relay to Compose deployment * Stop initializing Rialto chain through chainspec * Include logging for Substrate pallet * Make Rialto to Millau entrypoint executable * Use YAML references for relay components * Use published Substrate Relay image * Relay messages from Millau to Rialto * Use Bob nodes to serve message lane * Fix some port number issues for PoA-Rialto deployment * Stop directly referencing `environment` anchor in nodes * Add probable cause to relayer error message * Edit monitoring config file in-place * Add some sleep time between bridge init call and starting relays * Expose grafana. * Use Root key as bridge pallet owner In our case that's going to be Alice since she's Root for our Dev and Local chains. Co-authored-by: Tomasz Drwięga <[email protected]>
Is there an existing issue?
Experiencing problems? Have you tried our Stack Exchange first?
Description of bug
We have discovered with somewhat recent Susbtrate (319a7bd) nodes sometimes fail to keep up with the network and even after restart they are unable to sync.
We have some heavy blocks on the network and we decreased block response size in our Substrate fork (autonomys/substrate@7fceb6c), but all blocks are strictly smaller than this lower limit anyway, so that is not an issue.
Here is logs of a node that was previously unable to sync with
-l sync=trace
:dariia-node-log.txt
I can see that it tries to download blocks backwards, but stops before reaching correctly identified common ancestor block 9715.
Steps to reproduce
Requires "broken" state of the node (that I have access to) in order to reliably reproduce the issue.
The text was updated successfully, but these errors were encountered: