-
Notifications
You must be signed in to change notification settings - Fork 434
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use_intra_process_comms does not work with LifecycleNode transitions for BondCpp #2721
Comments
I think the way I read this now is that there's a message that gets delivered after the object is destroyed:
@fujitatomoya is the plausible that the intraprocess subscription management is delivering messages to subscriptions that have been reset / destroyed? |
@SteveMacenski i am not sure with a glance what went wrong here, i am not familiar with bond code. i will take a look. |
I'm not familiar with bond either, but I regularly use IPC with lifecycle nodes without issues. |
@alsora I'm curious - do you have higher-rate subscriptions, lets say 20+hz, reset in the If that's not how either of you read that backtrace, then I should also take a 3rd look at bond, experiment, and see if I can find more specific thoughts -- that might involve migrating this ticket from a possible |
I usually do it in the
there should be a mutex preventing this from happening. Looking at the backtrace it seems to me that the subscription is fine and alive, and that the crash happens inside the bond callback. I wonder if this is caused by IPC having different latency/timing than non-IPC and exercising this race condition |
@SteveMacenski we discussed a bug that could be the cause of this problem in the client library WG. |
This issue has been mentioned on ROS Discourse. There might be relevant details there: https://discourse.ros.org/t/nav2-speedups-in-mppi-smac-planner/41667/1 |
Bug report
Required Info:
Expected behavior
IPC enabled has the system able to function correctly with bond, lifecycle nodes.
Actual behavior
IPC enabled does not work with lifecycle and bond nodes. Servers crash when destroying bond connections in
on_deactivate
lifecycle transitions with intra-process related traces.Additional information
I'm working on migrating Nav2 over to using IPC and ran into another issue after #2704 and #2705. I now generally have data flowing around, but we fail to be able to cleanly shutdown any lifecycle nodes when IPC is enabled, both when composed into the same container and also when the nodes are in separate processes. This causes all of our system-level tests to fail, as well as a couple of unit tests that are nice for reproducibility. We have never seen this issue before or without using IPC, so I'm reasonably convinced now its an issue with the IPC manager on shutdown related tasks.
I have backtraces and I think there's an issue with intra-process subscription handling in the lifecycle node case, where we're destroying the bond connection in the
on_deactivate
callback. I'm not 100% sure the root cause (yet, will comment below if it becomes more clear to me), but what is clear to me from the backtrace is that it goes down into the IPC pipeline, as I see manyintra-
prefixed methods deeper into the trace. See stage 15, where it starts ondispatch_intra_process
- in 22, we seerclcpp::Experimental::SubscriptionIntraProcess
.Steps to reproduce issue
We actually have a nice unit test that is pretty concise and relies almost exclusively on standard rclcpp / bond APIs in Nav2: https://github.com/ros-navigation/navigation2/blob/main/nav2_lifecycle_manager/test/test_bond.cpp#L128-L169. Pick any of the
nav2_system_tests/src/system
as well to reproduce. Or simply launch Nav2 and try to control+C out of it.We have our
on_deactivate
method reset the bond connection https://github.com/ros-navigation/navigation2/blob/main/nav2_docking/opennav_docking/src/docking_server.cpp#L126-L144 who is implemented in Nav2's lifecycle node wrapper that is shared by all nodes https://github.com/ros-navigation/navigation2/blob/main/nav2_util/src/lifecycle_node.cpp#L146-L155The text was updated successfully, but these errors were encountered: