Persistent peer identity during restart makes it unable to publish messages with gossipsub #5097
Comments
This is very interesting. I don't have any resources to debug this but it is definitely something that shouldn't happen! :)
From a little bit more tracing, these are the logs for a successful connection:
After reboot:
Does …?
Both peers subscribe to the topic on startup, and these are the logs of the peer that restarted. I'm just using the chat example, which does … The node that did not restart handles the subscription normally (note the time, which corresponds to the restarted connection):
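For context, the subscription both peers perform on startup is the one from the stock `chat` example; abridged, it looks roughly like this (the `gossipsub` field is the behaviour field the example uses):

```rust
// What the chat example does on startup (abridged): create a topic
// and subscribe the local peer to it.
let topic = libp2p::gossipsub::IdentTopic::new("test-net");
swarm.behaviour_mut().gossipsub.subscribe(&topic)?;
```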
To ease understanding: in my setup, …
A little bit more observation:
But even after this, …
I noticed that the errored endpoint is the listen port of …

At this point, … But a few seconds later, a …

After that, … So with a static listen port and address, the connection will eventually resume, because we will receive a …

While tracing it further, my educated guess is that when a QUIC connection is not closed on … I observed that …
Sounds almost related to #4917 (since the timeout was reduced in #4965), but this seems much deeper. I noticed something similar back then while connected with just QUIC, but assumed it was due to the connection not having timed out on the other end. EDIT: I can reproduce this easily as well, which is odd behaviour.
To further isolate the issue, could somebody try reproducing this with just …?
Has this ever been reproduced upstream yet? It would be interesting to know whether this is something within libp2p-quic or quinn.
I have had what I believe may be the same issue when connecting a WebRTC peer (A) to a rust-libp2p relay node (B). In my case, once A restarts, it is unable to receive new gossipsub messages from B, even though B appears in A's list of connected peers.
@julienmalard could you provide some logs? |
Here's an example log. I could also rerun with debug logging enabled, but the log size becomes very large.
I think the solution for now (at least for me and my use case) would be to use the following settings (see the sketch below), so that when PeerB disconnects, PeerA drops the connection within a short time span; then, when PeerB comes back online and connects using the same identity (and even the same address), both peers will be able to send messages to each other. Not ideal, but it is the workaround I think I may have to use as a default in my project for now (after first testing for any ill effects of a short duration; so far there don't seem to be any, but I haven't tried it at a larger scale yet). Long term, this should be looked into: PeerA does drop the old connection when PeerB comes online before the timeout kicks in, and even sends an event to the new handler to join the mesh (after the old connection drops due to the timeout), but PeerB refuses to receive any RPC messages from PeerA, even though PeerB does receive packets from PeerA, so that part works.
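The settings block itself did not survive scraping. As a hedged sketch of the kind of knobs meant here, assuming libp2p 0.53's `SwarmBuilder` with the `tokio` and `quic` features, with the 10-second values being illustrative placeholders rather than the commenter's actual numbers, and with a `dummy` behaviour standing in for the real gossipsub setup:

```rust
use std::time::Duration;
use libp2p::swarm::dummy; // stand-in behaviour; the real setup uses gossipsub

fn build_swarm(
    local_key: libp2p::identity::Keypair,
) -> Result<libp2p::Swarm<dummy::Behaviour>, Box<dyn std::error::Error>> {
    let swarm = libp2p::SwarmBuilder::with_existing_identity(local_key)
        .with_tokio()
        .with_quic_config(|mut cfg| {
            // Drop a silent QUIC connection after ~10s (value in milliseconds).
            cfg.max_idle_timeout = 10_000;
            cfg
        })
        .with_behaviour(|_key| dummy::Behaviour)?
        .with_swarm_config(|cfg| {
            // Close idle connections quickly so the stale connection to a
            // restarted peer is torn down before it reconnects.
            cfg.with_idle_connection_timeout(Duration::from_secs(10))
        })
        .build();
    Ok(swarm)
}
```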
@dariusc93 Would this work if the rust node is a long-running relay node, and the same peer connects, disconnects, and then reconnects many hours (or even a few days) later? |
(Cross-referenced by an external project's commit: "Work around libp2p bug where a node cannot publish to a topic if restarted too quickly. See libp2p/rust-libp2p#5097.")
Summary
When using QUIC transport exclusively with a persistent identity, restarting one node makes that node unable to publish messages: every publish fails with an `InsufficientPeers` error, even though both `swarm` and `gossipsub` report a connected peer.

For example: PeerA and PeerB each have a static identity (and a peer ID derived from that identity) and are connected to each other using QUIC transport exclusively; gossipsub is able to send messages in both directions. Now PeerA restarts and connects to PeerB with the same identity. PeerA will be unable to publish any message and will always get `InsufficientPeers`, yet `swarm.connected_peers()` and `gossipsub.all_peers()` still return PeerB as a connected peer. PeerB, meanwhile, can publish messages to PeerA.

To reproduce this, we can simply modify the `chat` example. First we add `ed25519_dalek` as a direct dependency, then add the following to the `chat` example:
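The original snippet did not survive scraping; a minimal sketch of what it plausibly looked like, assuming ed25519-dalek 2.x with its `pkcs8` and `pem` features enabled (the `load_identity` helper name is introduced here for illustration; `private_key.pem` is the file referred to below):

```rust
use ed25519_dalek::{pkcs8::DecodePrivateKey, SigningKey};
use libp2p::identity::Keypair;

/// Load a fixed Ed25519 key from a PEM file so the peer ID stays
/// stable across restarts.
fn load_identity(path: &str) -> Result<Keypair, Box<dyn std::error::Error>> {
    let signing_key = SigningKey::read_pkcs8_pem_file(path)?;
    // Rebuild a libp2p keypair from the raw 32-byte secret.
    Ok(Keypair::ed25519_from_bytes(signing_key.to_bytes())?)
}

// In the chat example, swap `SwarmBuilder::with_new_identity()` for:
// let swarm = libp2p::SwarmBuilder::with_existing_identity(load_identity("private_key.pem")?)
//     .with_tokio()
//     ...
```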
Then change the `private_key.pem` for PeerA and PeerB; we can use openssl to generate the PEM:
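For instance, with a command along these lines (assuming OpenSSL 1.1.1 or newer, which supports Ed25519 key generation):

```sh
# Generate an Ed25519 private key in PEM form; run once per peer so
# PeerA and PeerB end up with different keys.
openssl genpkey -algorithm ed25519 -out private_key.pem
```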
Make sure PeerA and PeerB are using different private keys, then restart one of the clients; you will see `Publish error: InsufficientPeers` on the restarted node.

Expected behavior
The peer should be able to publish messages even after a restart.
Actual behavior
The restarted peer will report `InsufficientPeers`; the peer that did not restart is able to publish messages to the restarted peer.

Relevant log output
```
Enter messages via STDIN and they will be sent to connected peers using Gossipsub
Local node is listening on /ip4/127.0.0.1/udp/62392/quic-v1
Local node is listening on /ip4/192.168.125.145/udp/62392/quic-v1
mDNS discovered a new peer: 12D3KooWC1tB5J6W7NWwzXVFHZBggNLU1T5etuTtmpnf9WBQaJCm
mDNS discovered a new peer: 12D3KooWC1tB5J6W7NWwzXVFHZBggNLU1T5etuTtmpnf9WBQaJCm
test
Publish error: InsufficientPeers
```
Possible Solution
Current workarounds include:
- Wait for the `swarm::ConnectionError::IO` event, then start the node (a sketch of detecting this event follows below). Starting the node after the remote has closed the connection will not trigger `InsufficientPeers`.
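A hedged sketch of detecting that event, assuming the async event loop style of the chat example (the surrounding loop and the `swarm` value come from that example):

```rust
use futures::StreamExt;
use libp2p::swarm::{ConnectionError, SwarmEvent};

// Inside the event loop: watch for the stale connection to the restarted
// peer being closed with an I/O error before (re)starting the other node.
match swarm.select_next_some().await {
    SwarmEvent::ConnectionClosed { peer_id, cause: Some(ConnectionError::IO(e)), .. } => {
        println!("Connection to {peer_id} closed with I/O error: {e}");
        // Per this workaround, only now is it safe to start the other node.
    }
    _ => {}
}
```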
Version
I tried "0.53.2" as well as running the chat example directly on `master` at 3e57cf494792031459d7f831408af24dbedc1a0b.

Would you like to work on fixing this bug?
Yes