Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Waitset not triggering timer when initialized in lifecycle node #2652

Open
luca-della-vedova opened this issue Oct 16, 2024 · 10 comments
Open
Assignees

Comments

@luca-della-vedova
Copy link
Contributor

luca-della-vedova commented Oct 16, 2024

Bug report

Required Info:

  • Operating System:
    • Ubuntu 24.04
  • Installation type:
    • Jazzy binaries (but reproducible from source, bisected to the commit linked below which introduced the issue)
  • Version or commit hash:
    • I bisected and found the exact commit that introduced this behavior (it was OK before): 5632a09
  • DDS implementation:
    • Cyclone
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

ros2 lifecycle set /minimal_timer configure
ros2 lifecycle set /minimal_timer activate
ros2 run examples_rclcpp_minimal_service service_main

Expected behavior

Since the service callback destroys only timer_, timer2_ should keep running and Hello, world! should keep printing.

Actual behavior

All timers are stopped and no Hello, world! is printed.

Additional information

The following might shed some light into this.

  • If the timer that is erronously not triggered is initialized in the constructor, rather than in a lifecycle transition, the issue does not arise.
  • If the action server is ran before the lifecycle transition (which means timer2_ is only created after timer_ is destroyed) the issue does not arise.
  • If I manually add a || true here making this condition always evaluate to true, all timers are executed, however I added some printing when timers are cleaned up / added and I can see that every iteration a timer gets added and removed, so something odd is going on there.

So I suspect there is something about how timers (and maybe other entities?) are added to wait sets inside lifecycle node transitions that creates this issue.

@mjcarroll
Copy link
Member

This definitely sounds like a bug.

If I manually add a || true here making this condition always evaluate to true,

This is going to force the waitset to always be rebuilt, which is not good for performance. It's likely that the bug is somewhere else (in how timers are added?)

I probably won't get a chance to look at this until after roscon, but thank you for the full repro example and detailed steps.

@alsora
Copy link
Collaborator

alsora commented Oct 16, 2024

Yes, I agree that this sounds like a bug.

So I suspect there is something about how timers (and maybe other entities?) are added to wait sets inside lifecycle node transitions that creates this issue.

I wouldn't exclude that this is "timers-specific" since timers are handled in a slightly different ways from other entities.
Moreover, lifecycle nodes do not do anything special here, so the same problem should be reproducible in a regular node (executors don't really care about nodes, they only care about callback groups)

@mjcarroll
Copy link
Member

lifecycle nodes do not do anything special here

My assumption as well is that something about the execution order from lifecycle nodes is surfacing this, but it's not something that would be unique to lifecycle nodes.

@fmrico
Copy link
Contributor

fmrico commented Oct 17, 2024

Hi,

I discovered this bug five days ago at fmrico/cascade_lifecycle#14, and now I am solving a problem that is directly related at PlanSys2 (ros/rosdistro#43212 (comment)). I am solving this now by changing the SingleThreadedExecutors to EventsExecutors in some cases and removing the node before destroying it in other cases.

Basically, if I have two nodes in an executor, each one with a timer, if I destroy one of the nodes, the timer callback in the other node is no longer called.

I can investigate more if you want.

@ros-discourse
Copy link

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/next-client-library-wg-meeting-friday-6th-december-2024-8am-pt/40954/2

@jmachowinski
Copy link
Contributor

I found the bug:

for (; ii < wait_set.size_of_timers(); ++ii) {
if (rcl_wait_set.timers[ii] != nullptr) {
ret = wait_set.timers(ii);

We got a mismatch here between the sizeof the rcl waitset and the rclcpp waitset. Therefore the ready timer in the rcl waitset is never checked. The fix is simple, but it is in exposed template code...

@clalancette Are we breaking ABI if we are patching the templates ?

@jmachowinski
Copy link
Contributor

My initial analysis was not correct. We got a mismatch between the index in the rcl_waitset and the index in the rclcpp waitset. Thats the bug.

@clalancette
Copy link
Contributor

@clalancette Are we breaking ABI if we are patching the templates ?

It depends on what we are changing, I think. What I'm going to say here is that we open the PR to fix things against rolling, and get that in. After that we can figure out what we need to do to backport it.

@jmachowinski
Copy link
Contributor

@wjwwood @mjcarroll This bug was introduced by the rewrite of the executor. It looks like there is a prune missing here :

There is a variable that indicates, that we need pruning, but it is never used. Also the interface is not available at this place...

@ros-discourse
Copy link

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/preparing-for-jazzy-sync-and-patch-release-2024-12-23/41213/5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants