Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use_intra_process_comms does not work with LifecycleNode transitions for BondCpp #2721

Open
SteveMacenski opened this issue Jan 7, 2025 · 7 comments

Comments

@SteveMacenski
Copy link
Collaborator

SteveMacenski commented Jan 7, 2025

Bug report

Required Info:

  • Operating System: Ubuntu 22.04 (using OSRF Ubuntu 24.04 docker)
  • Installation type: Binaries
  • Version or commit hash: Latest Rolling 29.3.0
  • DDS implementation: Fast-DDS
  • Client library (if applicable): rclcpp

Expected behavior

IPC enabled has the system able to function correctly with bond, lifecycle nodes.

Actual behavior

IPC enabled does not work with lifecycle and bond nodes. Servers crash when destroying bond connections in on_deactivate lifecycle transitions with intra-process related traces.

Additional information

I'm working on migrating Nav2 over to using IPC and ran into another issue after #2704 and #2705. I now generally have data flowing around, but we fail to be able to cleanly shutdown any lifecycle nodes when IPC is enabled, both when composed into the same container and also when the nodes are in separate processes. This causes all of our system-level tests to fail, as well as a couple of unit tests that are nice for reproducibility. We have never seen this issue before or without using IPC, so I'm reasonably convinced now its an issue with the IPC manager on shutdown related tasks.

I have backtraces and I think there's an issue with intra-process subscription handling in the lifecycle node case, where we're destroying the bond connection in the on_deactivate callback. I'm not 100% sure the root cause (yet, will comment below if it becomes more clear to me), but what is clear to me from the backtrace is that it goes down into the IPC pipeline, as I see many intra- prefixed methods deeper into the trace. See stage 15, where it starts on dispatch_intra_process - in 22, we see rclcpp::Experimental::SubscriptionIntraProcess.

Screenshot from 2025-01-06 16-23-19

Screenshot from 2025-01-06 16-31-07

Screenshot from 2025-01-06 16-31-16

Steps to reproduce issue

We actually have a nice unit test that is pretty concise and relies almost exclusively on standard rclcpp / bond APIs in Nav2: https://github.com/ros-navigation/navigation2/blob/main/nav2_lifecycle_manager/test/test_bond.cpp#L128-L169. Pick any of the nav2_system_tests/src/system as well to reproduce. Or simply launch Nav2 and try to control+C out of it.

We have our on_deactivate method reset the bond connection https://github.com/ros-navigation/navigation2/blob/main/nav2_docking/opennav_docking/src/docking_server.cpp#L126-L144 who is implemented in Nav2's lifecycle node wrapper that is shared by all nodes https://github.com/ros-navigation/navigation2/blob/main/nav2_util/src/lifecycle_node.cpp#L146-L155

@SteveMacenski SteveMacenski changed the title use_intra_process_comms does not work with BondCpp use_intra_process_comms does not work with LifecycleNode transitions for BondCpp Jan 7, 2025
@SteveMacenski
Copy link
Collaborator Author

I think the way I read this now is that there's a message that gets delivered after the object is destroyed:

@fujitatomoya is the plausible that the intraprocess subscription management is delivering messages to subscriptions that have been reset / destroyed?

@fujitatomoya
Copy link
Collaborator

@SteveMacenski i am not sure with a glance what went wrong here, i am not familiar with bond code. i will take a look.

@alsora
Copy link
Collaborator

alsora commented Jan 8, 2025

I'm not familiar with bond either, but I regularly use IPC with lifecycle nodes without issues.

@SteveMacenski
Copy link
Collaborator Author

SteveMacenski commented Jan 9, 2025

@alsora I'm curious - do you have higher-rate subscriptions, lets say 20+hz, reset in the deactivate() state? I'm interested if this is unique to bond or a general subscriber issues. My (possibly wrong) gut instinct was that the intraprocess manager was attempting to deliver a message after the subscription was reset -- and in the bond case, the object being destroyed causing the effects seen.

If that's not how either of you read that backtrace, then I should also take a 3rd look at bond, experiment, and see if I can find more specific thoughts -- that might involve migrating this ticket from a possible rclcpp issue to a bond_core one.

@alsora
Copy link
Collaborator

alsora commented Jan 10, 2025

do you have higher-rate subscriptions, lets say 20+hz, reset in the deactivate() state?

I usually do it in the on_cleanup() transition, but it should be equivalent.

the intraprocess manager was attempting to deliver a message after the subscription was reset

there should be a mutex preventing this from happening.

Looking at the backtrace it seems to me that the subscription is fine and alive, and that the crash happens inside the bond callback.
At this point the IPC mechanism should not be involved anymore (the message is already delivered and is being processed).

I wonder if this is caused by IPC having different latency/timing than non-IPC and exercising this race condition

@alsora
Copy link
Collaborator

alsora commented Jan 10, 2025

@SteveMacenski we discussed a bug that could be the cause of this problem in the client library WG.
See details #2678 (comment)

@ros-discourse
Copy link

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/nav2-speedups-in-mppi-smac-planner/41667/1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants