fix(request-response): cleanup connected on dial/listen failures #4777
Conversation
Thanks for tackling this!
Two comments :)
This pull request has merge conflicts. Could you please resolve them @nathanielc? 🙏
Force-pushed from fbbea4e to 1f39ce9.
Thank you! I've left some more comments.
fn on_dial_failure(&mut self, DialFailure { peer_id, .. }: DialFailure) {
// Removes the connection if it exists.
fn remove_connection(&mut self, connection_id: ConnectionId, peer: Option<PeerId>) {
if let Some(peer) = peer {
I don't understand the benefit of doing this. I'd much rather always remove by ConnectionId and simply ignore the PeerId.
Perhaps we should change our internal data structure to index by ConnectionId always?
Perhaps we should change our internal data structure to index by ConnectionId always?
Generally, yes, that makes sense. I took a look at the API that request-response exposes, and there are a few methods that need to look up connections by peer id and do not have access to the connection id. They are:
- is_connected
- is_pending_outbound
- is_pending_inbound
- try_send_request
I am not too familiar with the access patterns of the behavior and am not sure whether the trade-offs make sense. If we change the internal structure to be indexed by connection id, the methods above would be O(n) instead of O(1), while removing the state for a failed connection would be O(1) instead of O(n).
My hunch is that the above methods are a more common and more critical code path than removing connections for failed dials/listens, so we should leave the internal index by PeerId.
Or I could change it to keep two indexes, but that seems more complicated than we might want.
Thoughts? Happy to go in either direction.
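For concreteness, the trade-off could look roughly like this (a minimal sketch with placeholder types and illustrative names, not the actual libp2p-request-response internals): keeping the map keyed by PeerId keeps the per-peer queries above O(1), while removing a failed connection by its ConnectionId becomes a scan over all peers.

```rust
use std::collections::HashMap;

// Placeholders standing in for libp2p::PeerId and libp2p::swarm::ConnectionId.
type PeerId = u64;
type ConnectionId = u64;

struct Connection {
    id: ConnectionId,
}

struct ConnectedState {
    connected: HashMap<PeerId, Vec<Connection>>,
}

impl ConnectedState {
    /// O(1): is there any live connection to `peer`?
    fn is_connected(&self, peer: &PeerId) -> bool {
        self.connected.get(peer).is_some_and(|c| !c.is_empty())
    }

    /// O(n) over peers: drop the connection with `connection_id`, whichever
    /// peer it belongs to, and forget the peer if no connections remain.
    /// Acceptable because this only runs on the (rare) failure path.
    fn remove_connection(&mut self, connection_id: ConnectionId) {
        self.connected.retain(|_peer, connections| {
            connections.retain(|c| c.id != connection_id);
            !connections.is_empty()
        });
    }
}
```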
Let's go with O(n) for removing connections. I think there is an opportunity to build a data structure which can be used across all protocols for indexing some state T by both IDs, but we don't have to build that in this PR.
Let me know if you are interested though :)
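For the record, a structure indexed by both IDs could look roughly like this (a hypothetical sketch, not an existing libp2p type; u64 stands in for the real ID types): the state is owned by a ConnectionId-keyed map, and a second map keeps the PeerId-to-ConnectionId fan-out so per-peer lookups and per-connection removal both stay O(1).

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical dual-index container; u64 stands in for PeerId / ConnectionId.
struct DualIndex<T> {
    by_connection: HashMap<u64, (u64, T)>, // ConnectionId -> (PeerId, state)
    by_peer: HashMap<u64, HashSet<u64>>,   // PeerId -> ConnectionIds
}

impl<T> DualIndex<T> {
    fn insert(&mut self, peer: u64, connection: u64, state: T) {
        self.by_connection.insert(connection, (peer, state));
        self.by_peer.entry(peer).or_default().insert(connection);
    }

    /// O(1) removal by ConnectionId; also cleans up the per-peer index.
    fn remove(&mut self, connection: u64) -> Option<T> {
        let (peer, state) = self.by_connection.remove(&connection)?;
        if let Some(ids) = self.by_peer.get_mut(&peer) {
            ids.remove(&connection);
            if ids.is_empty() {
                self.by_peer.remove(&peer);
            }
        }
        Some(state)
    }

    /// O(1): does `peer` currently have any connections?
    fn is_connected(&self, peer: u64) -> bool {
        self.by_peer.contains_key(&peer)
    }
}
```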
For this PR I agree, let's keep it simple. However, a shared data structure across protocols sounds very useful.
Do you mind opening an issue so we can track the idea of the shared data structure?
Force-pushed from 8792361 to 99b818d.
It's possible for dial and listen failures to be sent even after handle_established_*_connection methods are called. Therefore we need to clean up any state about those failed connections. Prior to this change the state would get out of sync and cause a debug_assert_eq panic.
Co-authored-by: Thomas Eizinger <[email protected]>
Force-pushed from 99b818d to ea1ff0b.
I just had a realisation that this is a much bigger problem than I originally thought :(
## 0.26.1 - unreleased

- Correctly update internal state for failed connections.
  See [PR 4777](https://github.com/libp2p/rust-libp2p/pull/4777)
Suggested change:
- See [PR 4777](https://github.com/libp2p/rust-libp2p/pull/4777)
+ See [PR 4777](https://github.com/libp2p/rust-libp2p/pull/4777).
@@ -3,7 +3,7 @@ name = "libp2p-request-response"
edition = "2021"
rust-version = { workspace = true }
description = "Generic Request/Response Protocols"
-version = "0.26.0"
+version = "0.26.1"
Missing this bump in the main Cargo.toml
// Its possible that this listen failure is for an existing connection.
// If so we need to remove the connection.
//
// TODO: Once https://github.com/libp2p/rust-libp2p/pull/4818 merges we should pass in the
Stale comment. #4818 is unrelated now that we agreed to just always work based off ConnectionId.
@@ -696,6 +723,17 @@ where
}
}
}
// Its possible that this dial failure is for an existing connection.
This comment is somewhat unnecessary because there is no such thing as a "non-existing" connection.
If we wanted to be precise, we could explain that, depending on where this behaviour is in the tree, we could have already altered our state for this ID but another behaviour "denied" the connection after us.
Now that I am writing this, I had the idea that we can also fix this by only altering the state in the ConnectionEstablished event.
Perhaps that might be the better solution after all?
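For illustration, the alternative could look roughly like this (simplified stand-in types, not the real NetworkBehaviour trait): if per-connection state is only created once ConnectionEstablished is dispatched, a behaviour further down the tree denying the connection can never leave stale entries behind, so no dial/listen-failure cleanup would be needed.

```rust
use std::collections::HashMap;

// Stand-in for libp2p's ConnectionEstablished swarm event.
struct ConnectionEstablished {
    peer: u64,
    connection: u64,
}

struct Behaviour {
    connected: HashMap<u64, Vec<u64>>, // peer -> connection ids
}

impl Behaviour {
    // The only place where per-connection state is added: by the time this
    // event fires, every behaviour has already accepted the connection.
    fn on_connection_established(&mut self, ev: ConnectionEstablished) {
        self.connected.entry(ev.peer).or_default().push(ev.connection);
    }
}
```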
Perhaps we should extend the docs of NetworkBehaviour to explain that.
cc @mxinden This is actually a bit of a problem that didn't occur to me before. Our strategy of "preloading" handlers can lead to us losing state if a NetworkBehaviour "after" us denies a connection.
To be safe, these "connection management" plugins should always come first in the behaviour tree. But also, we currently don't call poll_close on the handler when the connection is denied, which means a handler doesn't have a way of giving the events back to the behaviour. That is not an elegant solution anyway because it would mean each protocol has to deal with these lifecycle bugs.
I have to think about this some more, but the better solution might be to change the handle_established callbacks to take &self and dispatch the ConnectionEstablished event only after the handlers have been fully created.
That would avoid this partial altering of state at compile-time.
I don't know enough of the details of how preloading handlers works; however, it was a very confusing issue to debug that a connection that was "established" could have a listen or dial failure. I had wrongly assumed that once established, a connection could still fail, but that the failure would be communicated as a connection error/closed message rather than as a listen/dial failure.
So anything that makes that relationship clear would be good.
Yeah, I didn't realise this issue until I thought deeper about this PR :(
It is unfortunately somewhat of a design flaw. Because NetworkBehaviours are composed into a tree AND the handle_ functions take &mut self, you might end up modifying your local state while a behaviour that is deeper down in the tree ends up rejecting the connection, which essentially discards the data you just moved into the handler.
To avoid this, make sure any "connection management" behaviour like libp2p-connection-limits is listed first in your Behaviour that has #[derive(NetworkBehaviour)].
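As a concrete illustration (a sketch assuming the macros, ping, and connection-limits features of the libp2p crate; the struct and field names are made up), the ordering would look like this:

```rust
use libp2p::swarm::NetworkBehaviour;
use libp2p::{connection_limits, ping};

// Field order is the point: behaviours are consulted in declaration order,
// so `limits` can deny a connection before the behaviours listed below it
// get to create any per-connection state.
#[derive(NetworkBehaviour)]
struct Behaviour {
    limits: connection_limits::Behaviour,
    ping: ping::Behaviour,
    // request_response and other protocol behaviours go after the
    // connection-management ones.
}
```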
To avoid this, make sure any "connection management" behaviour like libp2p-connection-limits are listed first in your Behaviour that has #[derive(NetworkBehaviour)].
Thanks, I'll make this change internally for now.
I am going to close this in favor of having to order the behaviours in a particular way (for now). Perhaps at a later point, we can fix the design issue and make those handle functions take &self.
I am wondering what is better:
I've recorded the issue in #4870.
In dev the debug assert failure breaks the process, because it kills the running task and no further progress is made on the protocol. That's mildly annoying in dev because I have to restart the process, but it's not a deal breaker, and it made me aware of the issue. In release mode it is just a memory leak. However, the conditions that trigger this bug are, in my experience, much more common in dev than in release mode. For example, using mdns you get many concurrent dials, and it's more common to use mdns in a dev setting than in a production setting. TL;DR: leaving the debug_assert in place is fine.
Now that you have re-ordered the behaviours, you should actually not hit this at all.
It's possible in certain failure modes to know the peer_id of a failed incoming connection. For example when an inbound connection has been negotiated/upgraded but then rejected locally for a connection limit or similar reason. In these cases it makes sense to communicate the peer_id to behaviours in case they have created any internal state about the peer. Related #4777 Pull-Request: #4818.
Description
It is possible for dial and listen failures to be sent even after handle_established_*_connection methods are called on behaviours. Therefore we need to clean up any state about those failed connections. Prior to this change the state would get out of sync and cause a debug_assert_eq panic.
Fixes: #4773.
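Conceptually, the fix amounts to something like the following (a hedged sketch with simplified types, not the actual libp2p-request-response code): state is recorded eagerly when the handler is created, so a later dial/listen failure for the same ConnectionId has to remove it again, otherwise the bookkeeping drifts and the debug_assert_eq fires.

```rust
use std::collections::HashMap;

type PeerId = u64;       // placeholder for libp2p::PeerId
type ConnectionId = u64; // placeholder for libp2p::swarm::ConnectionId

#[derive(Default)]
struct Behaviour {
    connected: HashMap<PeerId, Vec<ConnectionId>>,
}

impl Behaviour {
    // Mirrors handle_established_{inbound,outbound}_connection: state is
    // created before the connection is guaranteed to survive.
    fn handle_established_connection(&mut self, peer: PeerId, id: ConnectionId) {
        self.connected.entry(peer).or_default().push(id);
    }

    // Mirrors the new handling of dial/listen failures in on_swarm_event:
    // forget the connection again if it was already recorded.
    fn on_connection_failure(&mut self, id: ConnectionId) {
        self.connected.retain(|_peer, ids| {
            ids.retain(|c| *c != id);
            !ids.is_empty()
        });
    }
}
```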
Notes & open questions
The tests do not pass locally yet. But I would like feedback on the approach before tracking down the failing test.
Change checklist