
fix(request-response): cleanup connected on dial/listen failures #4777

Closed · wants to merge 3 commits

Conversation

@nathanielc (Contributor) commented Nov 1, 2023:

Description

It is possible for dial and listen failures to be sent even after the `handle_established_*_connection` methods have been called on behaviours. Therefore we need to clean up any state about those failed connections.

Prior to this change, the state would get out of sync and cause a `debug_assert_eq` panic.

Fixes: #4773.
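
The shape of the cleanup is roughly the following (a minimal sketch, not the exact PR diff; it assumes a `connected: HashMap<PeerId, SmallVec<[Connection; 2]>>` map like the one `libp2p-request-response` keeps internally, with approximate field names):

```rust
fn on_dial_failure(&mut self, DialFailure { peer_id, connection_id, .. }: DialFailure) {
    let Some(peer) = peer_id else { return };

    if let Some(connections) = self.connected.get_mut(&peer) {
        // Drop any state created when the handler was built: the dial
        // ultimately failed, so this connection will never be established.
        connections.retain(|c| c.id != connection_id);
        if connections.is_empty() {
            self.connected.remove(&peer);
        }
    }
}
```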

Notes & open questions

The tests do not pass locally yet, but I would like feedback on the approach before tracking down the failing test.

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

@thomaseizinger (Contributor) left a comment:

Thanks for tackling this!

Two comments :)

protocols/request-response/src/lib.rs (outdated, resolved)
swarm/src/behaviour.rs (outdated, resolved)
mergify bot commented Nov 2, 2023:

This pull request has merge conflicts. Could you please resolve them @nathanielc? 🙏

@thomaseizinger (Contributor) left a comment:

Thank you! I've left some more comments.

```rust
fn on_dial_failure(&mut self, DialFailure { peer_id, .. }: DialFailure) {
```

```rust
// Removes the connection if it exists.
fn remove_connection(&mut self, connection_id: ConnectionId, peer: Option<PeerId>) {
    if let Some(peer) = peer {
```
Contributor:

I don't understand the benefit of doing this. I'd much rather always remove by ConnectionId and simply ignore the PeerId.

Perhaps we should change our internal data structure to index by ConnectionId always?

Contributor (Author):

> Perhaps we should change our internal data structure to index by ConnectionId always?

Generally, yes, that makes sense. I took a look at the API the request-response behaviour exposes, and there are a few methods that want to look up connections by peer ID and do not have access to the connection ID. They are:

  • is_connected
  • is_pending_outbound
  • is_pending_inbound
  • try_send_request

I am not too familiar with the access patterns of the behaviour, so I am not sure how the trade-offs balance out. If we change the internal structure to be indexed by connection ID, the methods above become O(n) instead of O(1), while removing the state for a failed connection becomes O(1) instead of O(n).

My hunch is that the methods above are a more common and more critical code path than removing connections for failed dials/listens, so we should keep the internal index by PeerId.

Or I could keep two indexes, but that seems more complicated than we might want.

Thoughts? Happy to go in either direction.

Contributor:

Let's go with O(n) for removing connections. I think there is an opportunity to build a data structure that can be used across all protocols for indexing some state `T` by both IDs, but we don't have to build that in this PR.

Let me know if you are interested though :)
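
Such a shared structure might look roughly like this (a hypothetical sketch, not an existing libp2p type; all names are made up):

```rust
use std::collections::HashMap;

use libp2p_identity::PeerId;
use libp2p_swarm::ConnectionId;

/// Hypothetical dual-index map: O(1) hash lookups by either key.
#[derive(Default)]
struct ConnectionMap<T> {
    by_connection: HashMap<ConnectionId, (PeerId, T)>,
    by_peer: HashMap<PeerId, Vec<ConnectionId>>,
}

impl<T> ConnectionMap<T> {
    fn insert(&mut self, peer: PeerId, id: ConnectionId, state: T) {
        self.by_peer.entry(peer.clone()).or_default().push(id);
        self.by_connection.insert(id, (peer, state));
    }

    /// Lookup by peer, as needed by e.g. `is_connected` or `try_send_request`.
    fn peer_connections<'a>(&'a self, peer: &PeerId) -> impl Iterator<Item = &'a T> + 'a {
        self.by_peer
            .get(peer)
            .into_iter()
            .flatten()
            .filter_map(|id| self.by_connection.get(id).map(|(_, state)| state))
    }

    /// Removal by `ConnectionId`, as needed when cleaning up failed connections.
    fn remove(&mut self, id: ConnectionId) -> Option<T> {
        let (peer, state) = self.by_connection.remove(&id)?;
        if let Some(ids) = self.by_peer.get_mut(&peer) {
            ids.retain(|other| *other != id);
            if ids.is_empty() {
                self.by_peer.remove(&peer);
            }
        }
        Some(state)
    }
}
```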

Contributor (Author):

For this PR I agree, let's keep it simple. However, a shared data structure across protocols sounds very useful.

@thomaseizinger (Contributor) commented Nov 10, 2023:

Do you mind opening an issue so we can track the idea of the shared data structure?

protocols/request-response/CHANGELOG.md (two outdated, resolved threads)
nathanielc and others added 3 commits November 10, 2023 13:40
It's possible for dial and listen failures to be sent even after the handle_established_*_connection methods are called. Therefore we need to clean up any state about those failed connections.

Prior to this change, the state would get out of sync and cause a debug_assert_eq panic.
@thomaseizinger (Contributor) left a comment:

I just had a realisation that this is a much bigger problem than I originally thought :(

```markdown
## 0.26.1 - unreleased

- Correctly update internal state for failed connections.
  See [PR 4777](https://github.com/libp2p/rust-libp2p/pull/4777)
```
Contributor:

Suggested change:

```diff
-See [PR 4777](https://github.com/libp2p/rust-libp2p/pull/4777)
+See [PR 4777](https://github.com/libp2p/rust-libp2p/pull/4777).
```

```diff
@@ -3,7 +3,7 @@ name = "libp2p-request-response"
 edition = "2021"
 rust-version = { workspace = true }
 description = "Generic Request/Response Protocols"
-version = "0.26.0"
+version = "0.26.1"
```
Contributor:

Missing this bump in the main Cargo.toml

```rust
// It's possible that this listen failure is for an existing connection.
// If so we need to remove the connection.
//
// TODO: Once https://github.com/libp2p/rust-libp2p/pull/4818 merges we should pass in the
```
Contributor:

Stale comment. #4818 is unrelated now that we agreed to just always work based off ConnectionId.

```diff
@@ -696,6 +723,17 @@ where
         }
     }
 }
+// It's possible that this dial failure is for an existing connection.
```
Contributor:

This comment is somewhat unnecessary because there is no such thing as a "non-existing" connection.

If we wanted to be precise, we could explain that, depending on where this behaviour is in the tree, we may have already altered our state for this ID while another behaviour "denied" the connection after us.

Now that I am writing this, I had the idea that we could also fix this by only altering the state in the ConnectionEstablished event.

Perhaps that might be the better solution after all?

Contributor:

Perhaps we should extend the docs of NetworkBehaviour to explain that.

cc @mxinden This can actually be a bit of a problem that didn't occur to me before. Our strategy of "preloading" handlers can lead to us losing state if a NetworkBehaviour "after" us denies a connection.

To be safe, these "connection management" plugins should always come first in the behaviour tree. But also, we currently don't call poll_close on the handler when the connection is denied, which means a handler doesn't have a way of giving events back to the behaviour. That is not an elegant solution anyway, because it would mean each protocol has to deal with these lifecycle bugs.

I have to think about this some more, but the better solution might be to change the handle_established callbacks to take &self and dispatch the ConnectionEstablished event only after the handlers have been fully created.

That would avoid this partial altering of state at compile time.
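
For illustration only, a hypothetical shape of that change (the real NetworkBehaviour trait in libp2p-swarm currently takes &mut self; this is not an agreed-upon API, and the trait name here is made up):

```rust
use libp2p_core::{Endpoint, Multiaddr};
use libp2p_identity::PeerId;
use libp2p_swarm::{ConnectionDenied, ConnectionHandler, ConnectionId};

/// Hypothetical variant: `&self` makes it impossible for a behaviour to
/// mutate its state while the connection may still be denied further down
/// the tree; state changes move to `FromSwarm::ConnectionEstablished`,
/// which would be dispatched only after every handler was created.
trait HandleEstablished {
    type Handler: ConnectionHandler;

    fn handle_established_outbound_connection(
        &self, // the real trait method takes `&mut self`
        connection_id: ConnectionId,
        peer: PeerId,
        addr: &Multiaddr,
        role_override: Endpoint,
    ) -> Result<Self::Handler, ConnectionDenied>;
}
```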

Contributor (Author):

I don't know enough about the details of how preloading handlers works; however, it was a very confusing issue to debug that a connection that was "established" could still get a listen or dial failure. I had wrongly assumed that once established, a connection could fail, but that the failure would be communicated as a connection error/closed message rather than as a listen/dial failure.

So anything that makes that relationship clear would be good.

Contributor:

Yeah, I didn't realise this issue until I thought more deeply about this PR :(

It is unfortunately somewhat of a design flaw. Because NetworkBehaviours are composed into a tree AND the handle_ functions take &mut self, you might end up modifying your local state, only for a behaviour deeper down in the tree to reject the connection, which essentially discards the data you just moved into the handler.

To avoid this, make sure any "connection management" behaviour like libp2p-connection-limits is listed first in your Behaviour that has #[derive(NetworkBehaviour)], as in the sketch below.
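
A minimal sketch of that ordering (assuming the libp2p facade crate with its macros, cbor and connection-limits features; MyBehaviour, Req and Resp are made-up names):

```rust
use libp2p::swarm::NetworkBehaviour;
use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
struct Req(Vec<u8>);

#[derive(Debug, Serialize, Deserialize)]
struct Resp(Vec<u8>);

/// Field order matters: the derive invokes the handle_* callbacks in
/// declaration order, so the connection limiter can deny a connection
/// before `request_response` has created any state for it.
#[derive(NetworkBehaviour)]
struct MyBehaviour {
    // First: may deny the connection.
    limits: libp2p::connection_limits::Behaviour,
    // After: only ever sees connections the limiter accepted.
    request_response: libp2p::request_response::cbor::Behaviour<Req, Resp>,
}
```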

Contributor (Author):

> To avoid this, make sure any "connection management" behaviour like libp2p-connection-limits is listed first in your Behaviour that has #[derive(NetworkBehaviour)].

Thanks, I'll make this change internally for now.

@thomaseizinger (Contributor):

I am going to close this in favor of having to order the behaviours in a particular way (for now). Perhaps at a later point, we can fix the design issue and make those handle functions take &self.

@thomaseizinger (Contributor):

I am wondering what is better:

  • Keeping this as is, where the debug-assert makes people aware that there is inconsistent state
  • Fixing this but silently losing state

@thomaseizinger (Contributor):

I've recorded the issue in #4870.

@nathanielc (Contributor, Author):

> I am wondering what is better:
>
> - Keeping this as is, where the debug-assert makes people aware that there is inconsistent state
> - Fixing this but silently losing state

In dev, the debug assert failure breaks the process because it kills the running task and no further progress is made on the protocol. That's mildly annoying because I have to restart the process, but it's not a deal breaker, and it made me aware of the issue.

In release mode it is just a memory leak. However, the conditions that trigger this bug are, in my experience, much more common in dev than in release mode, for example using mDNS, which produces many concurrent dials. It's more common to use mDNS in a dev setting than in production.

TL;DR: leaving the debug_assert in place is fine.
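
For context, this is the standard debug_assert_eq! behaviour: the check only exists in builds with debug assertions enabled (the default for dev profiles) and is compiled out in release, where the stale entry merely leaks. A trivial illustration:

```rust
fn main() {
    let tracked_connections = 1; // what the behaviour thinks it has
    let actual_connections = 2; // what the swarm actually has

    // Panics in dev builds (debug-assertions = true); a no-op in release,
    // leaving the inconsistent entry behind as a leak.
    debug_assert_eq!(
        tracked_connections, actual_connections,
        "request-response connection state out of sync"
    );
}
```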

@thomaseizinger (Contributor):

Now that you have re-ordered the behaviours, you should actually not hit this at all.

mergify bot pushed a commit that referenced this pull request Jun 13, 2024
It's possible in certain failure modes to know the peer_id of a failed incoming connection. For example when an inbound connection has been negotiated/upgraded but then rejected locally for a connection limit or similar reason. In these cases it makes sense to communicate the peer_id to behaviours in case they have created any internal state about the peer.

Related #4777

Pull-Request: #4818.
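
Sketch of the resulting event shape (approximate; see FromSwarm::ListenFailure in libp2p-swarm for the authoritative definition):

```rust
use libp2p_core::Multiaddr;
use libp2p_identity::PeerId;
use libp2p_swarm::{ConnectionId, ListenError};

/// Approximation of the `ListenFailure` event after #4818: `peer_id` is
/// populated when the peer was identified before the connection was
/// rejected, letting behaviours clean up any per-peer state they created.
pub struct ListenFailure<'a> {
    pub local_addr: &'a Multiaddr,
    pub send_back_addr: &'a Multiaddr,
    pub error: &'a ListenError,
    pub connection_id: ConnectionId,
    pub peer_id: Option<PeerId>,
}
```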
TimTinkers pushed a commit to unattended-backpack/rust-libp2p that referenced this pull request Sep 14, 2024
Development

Successfully merging this pull request may close these issues:

- Failing debug_assert in Request-Response protocol