fix(request-response): cleanup connected on dial/listen failures #4777
Conversation
Thanks for tackling this!
Two comments :)
This pull request has merge conflicts. Could you please resolve them @nathanielc? 🙏
Force-pushed from fbbea4e to 1f39ce9.
Thank you! I've left some more comments.
fn on_dial_failure(&mut self, DialFailure { peer_id, .. }: DialFailure) {
// Removes the connection if it exists.
fn remove_connection(&mut self, connection_id: ConnectionId, peer: Option<PeerId>) {
if let Some(peer) = peer {
I don't understand the benefit of doing this. I'd much rather always remove by ConnectionId and simply ignore the PeerId.
Perhaps we should change our internal data structure to index by ConnectionId always?
Perhaps we should change our internal data structure to index by ConnectionId always?
Generally, yes, that makes sense. I took a look at the API that request-response exposes, and there are a few methods that need to look up connections by peer id and do not have access to the connection id. They are:
- is_connected
- is_pending_outbound
- is_pending_inbound
- try_send_request
I am not too familiar with the access patterns of the behavior and am not sure whether the trade-offs make sense. If we change the internal structure to be indexed by connection id, the methods above would be O(n) instead of O(1), while removing the state for a failed connection would be O(1) instead of O(n).
My hunch is that the above methods are a more common and more critical code path than removing connections for failed dials/listens, so we should leave the internal index by PeerId.
Or I could change it to keep two indexes, but that seems more complicated than we might want.
Thoughts? Happy to go in either direction.
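For concreteness, the trade-off could look roughly like this (a minimal sketch with placeholder types and illustrative names, not the actual libp2p-request-response internals): keeping the map keyed by PeerId keeps the per-peer queries above O(1), while removing a failed connection by its ConnectionId becomes a scan over all peers.

```rust
use std::collections::HashMap;

// Placeholders standing in for libp2p::PeerId and libp2p::swarm::ConnectionId.
type PeerId = u64;
type ConnectionId = u64;

struct Connection {
    id: ConnectionId,
}

struct ConnectedState {
    connected: HashMap<PeerId, Vec<Connection>>,
}

impl ConnectedState {
    /// O(1): is there any live connection to `peer`?
    fn is_connected(&self, peer: &PeerId) -> bool {
        self.connected.get(peer).is_some_and(|c| !c.is_empty())
    }

    /// O(n) over peers: drop the connection with `connection_id`, whichever
    /// peer it belongs to, and forget the peer if no connections remain.
    /// Acceptable because this only runs on the (rare) failure path.
    fn remove_connection(&mut self, connection_id: ConnectionId) {
        self.connected.retain(|_peer, connections| {
            connections.retain(|c| c.id != connection_id);
            !connections.is_empty()
        });
    }
}
```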
Let's go with O(n) for removing connections. I think there is an opportunity to build a data structure which can be used across all protocols for indexing some state T by both IDs, but we don't have to build that in this PR.
Let me know if you are interested though :)
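For the record, a structure indexed by both IDs could look roughly like this (a hypothetical sketch, not an existing libp2p type; u64 stands in for the real ID types): the state is owned by a ConnectionId-keyed map, and a second map keeps the PeerId-to-ConnectionId fan-out so per-peer lookups and per-connection removal both stay O(1).

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical dual-index container; u64 stands in for PeerId / ConnectionId.
struct DualIndex<T> {
    by_connection: HashMap<u64, (u64, T)>, // ConnectionId -> (PeerId, state)
    by_peer: HashMap<u64, HashSet<u64>>,   // PeerId -> ConnectionIds
}

impl<T> DualIndex<T> {
    fn insert(&mut self, peer: u64, connection: u64, state: T) {
        self.by_connection.insert(connection, (peer, state));
        self.by_peer.entry(peer).or_default().insert(connection);
    }

    /// O(1) removal by ConnectionId; also cleans up the per-peer index.
    fn remove(&mut self, connection: u64) -> Option<T> {
        let (peer, state) = self.by_connection.remove(&connection)?;
        if let Some(ids) = self.by_peer.get_mut(&peer) {
            ids.remove(&connection);
            if ids.is_empty() {
                self.by_peer.remove(&peer);
            }
        }
        Some(state)
    }

    /// O(1): does `peer` currently have any connections?
    fn is_connected(&self, peer: u64) -> bool {
        self.by_peer.contains_key(&peer)
    }
}
```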
For this PR I agree, let's keep it simple. However, a shared data structure across protocols sounds very useful.
Do you mind opening an issue so we can track the idea of the shared data structure?
Force-pushed from 8792361 to 99b818d.
It's possible for dial and listen failures to be sent even after handle_established_*_connection methods are called. Therefore we need to clean up any state about those failed connections. Prior to this change the state would get out of sync and cause a debug_assert_eq panic.
Co-authored-by: Thomas Eizinger <[email protected]>
Force-pushed from 99b818d to ea1ff0b.
I just had a realisation that this is a much bigger problem than I originally thought :(
## 0.26.1 - unreleased

- Correctly update internal state for failed connections.
  See [PR 4777](https://github.com/libp2p/rust-libp2p/pull/4777)
Suggested change:
- See [PR 4777](https://github.com/libp2p/rust-libp2p/pull/4777)
+ See [PR 4777](https://github.com/libp2p/rust-libp2p/pull/4777).
@@ -3,7 +3,7 @@ name = "libp2p-request-response"
edition = "2021"
rust-version = { workspace = true }
description = "Generic Request/Response Protocols"
-version = "0.26.0"
+version = "0.26.1"
Missing this bump in the main Cargo.toml
// Its possible that this listen failure is for an existing connection.
// If so we need to remove the connection.
//
// TODO: Once https://github.com/libp2p/rust-libp2p/pull/4818 merges we should pass in the
Stale comment. #4818 is unrelated now that we agreed to just always work based off ConnectionId.
@@ -696,6 +723,17 @@ where
}
}
}
// Its possible that this dial failure is for an existing connection.
This comment is somewhat unnecessary because there is no such thing as a "non-existing" connection.
If we wanted to be precise, we could explain that, depending on where this behaviour is in the tree, we could have already altered our state for this ID but another behaviour "denied" the connection after us.
Now that I am writing this, I had the idea that we can also fix this by only altering the state in the ConnectionEstablished event.
Perhaps that might be the better solution after all?
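For illustration, the alternative could look roughly like this (simplified stand-in types, not the real NetworkBehaviour trait): if per-connection state is only created once ConnectionEstablished is dispatched, a behaviour further down the tree denying the connection can never leave stale entries behind, so no dial/listen-failure cleanup would be needed.

```rust
use std::collections::HashMap;

// Stand-in for libp2p's ConnectionEstablished swarm event.
struct ConnectionEstablished {
    peer: u64,
    connection: u64,
}

struct Behaviour {
    connected: HashMap<u64, Vec<u64>>, // peer -> connection ids
}

impl Behaviour {
    // The only place where per-connection state is added: by the time this
    // event fires, every behaviour has already accepted the connection.
    fn on_connection_established(&mut self, ev: ConnectionEstablished) {
        self.connected.entry(ev.peer).or_default().push(ev.connection);
    }
}
```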
Perhaps we should extend the docs of NetworkBehaviour to explain that.
cc @mxinden This is actually a bit of a problem that didn't occur to me before. Our strategy of "preloading" handlers can lead to us losing state if a NetworkBehaviour "after" us denies a connection.
To be safe, these "connection management" plugins should always come first in the behaviour tree. But also, we currently don't call poll_close on the handler when the connection is denied, which means a handler doesn't have a way of giving the events back to the behaviour. That is not an elegant solution anyway because it would mean each protocol has to deal with these lifecycle bugs.
I have to think about this some more, but the better solution might be to change the handle_established callbacks to take &self and dispatch the ConnectionEstablished event only after the handlers have been fully created.
That would avoid this partial altering of state at compile-time.
I don't know enough of the details of how preloading handlers works; however, it was a very confusing issue to debug that a connection that was "established" could have a listen or dial failure. I had wrongly assumed that once established, a connection could still fail, but that the failure would be communicated as a connection error/closed message rather than as a listen/dial failure.
So anything that makes that relationship clear would be good.
Yeah, I didn't realise this issue until I thought deeper about this PR :(
It is unfortunately somewhat of a design flaw. Because NetworkBehaviours are composed into a tree AND the handle_ functions take &mut self, you might end up modifying your local state while a behaviour that is deeper down in the tree ends up rejecting the connection, which essentially discards the data you just moved into the handler.
To avoid this, make sure any "connection management" behaviour like libp2p-connection-limits is listed first in your Behaviour that has #[derive(NetworkBehaviour)].
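As a concrete illustration (a sketch assuming the macros, ping, and connection-limits features of the libp2p crate; the struct and field names are made up), the ordering would look like this:

```rust
use libp2p::swarm::NetworkBehaviour;
use libp2p::{connection_limits, ping};

// Field order is the point: behaviours are consulted in declaration order,
// so `limits` can deny a connection before the behaviours listed below it
// get to create any per-connection state.
#[derive(NetworkBehaviour)]
struct Behaviour {
    limits: connection_limits::Behaviour,
    ping: ping::Behaviour,
    // request_response and other protocol behaviours go after the
    // connection-management ones.
}
```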
To avoid this, make sure any "connection management" behaviour like libp2p-connection-limits are listed first in your Behaviour that has #[derive(NetworkBehaviour)].
Thanks, I'll make this change internally for now.
I am going to close this in favor of having to order the behaviours in a particular way (for now). Perhaps at a later point, we can fix the design issue and make those handle functions take &self.
I am wondering what is better:
I've recorded the issue in #4870.
In dev the debug assert failure breaks the process, because it kills the running task and no further progress is made on the protocol. That's mildly annoying in dev because I have to restart the process, but it's not a deal breaker, and it made me aware of the issue. In release mode it is just a memory leak. However, the conditions that trigger this bug are, in my experience, much more common in dev than in release mode. For example, using mdns you get many concurrent dials, and it's more common to use mdns in a dev setting than in a production setting. TL;DR: leaving the debug_assert in place is fine.
Now that you have re-ordered the behaviours, you should actually not hit this at all.
It's possible in certain failure modes to know the peer_id of a failed incoming connection. For example when an inbound connection has been negotiated/upgraded but then rejected locally for a connection limit or similar reason. In these cases it makes sense to communicate the peer_id to behaviours in case they have created any internal state about the peer. Related #4777 Pull-Request: #4818.
Description
It is possible for dial and listen failures to be sent even after handle_established_*_connection methods are called on behaviours. Therefore we need to clean up any state about those failed connections. Prior to this change the state would get out of sync and cause a debug_assert_eq panic.
Fixes: #4773.
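Conceptually, the fix amounts to something like the following (a hedged sketch with simplified types, not the actual libp2p-request-response code): state is recorded eagerly when the handler is created, so a later dial/listen failure for the same ConnectionId has to remove it again, otherwise the bookkeeping drifts and the debug_assert_eq fires.

```rust
use std::collections::HashMap;

type PeerId = u64;       // placeholder for libp2p::PeerId
type ConnectionId = u64; // placeholder for libp2p::swarm::ConnectionId

#[derive(Default)]
struct Behaviour {
    connected: HashMap<PeerId, Vec<ConnectionId>>,
}

impl Behaviour {
    // Mirrors handle_established_{inbound,outbound}_connection: state is
    // created before the connection is guaranteed to survive.
    fn handle_established_connection(&mut self, peer: PeerId, id: ConnectionId) {
        self.connected.entry(peer).or_default().push(id);
    }

    // Mirrors the new handling of dial/listen failures in on_swarm_event:
    // forget the connection again if it was already recorded.
    fn on_connection_failure(&mut self, id: ConnectionId) {
        self.connected.retain(|_peer, ids| {
            ids.retain(|c| *c != id);
            !ids.is_empty()
        });
    }
}
```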
Notes & open questions
The tests do not pass locally yet. But I would like feedback on the approach before tracking down the failing test.
Change checklist