Running a subscriber on Humble with publishing from Jazzy / Rolling uses up all memory #797

Open
urfeex opened this issue Jan 15, 2025 · 5 comments

Comments

urfeex commented Jan 15, 2025

Bug report

While I am aware that inter-distribution traffic isn't supported, I would at least expect systems not to crash when it occurs. However, we noticed that running both Humble and Jazzy nodes on the same network can cause the machines running the Humble nodes to run out of memory, probably because of discovery traffic. It is sufficient to have a Humble subscriber and run ros2 topic list from Jazzy to (sometimes) trigger the issue.
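To make that concrete, the minimal trigger looks roughly like this (just a sketch; the two commands run on two machines or containers that share the same network and ROS_DOMAIN_ID):

# machine / container A: ROS 2 Humble subscriber
ros2 topic echo /chatter std_msgs/String

# machine / container B: ROS 2 Jazzy or Rolling, same network and ROS_DOMAIN_ID
ros2 topic list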

Required Info:

  • Operating System: Ubuntu 22.04 / 24.04
  • Installation type: binary
  • Version or commit hash: latest
  • DDS implementation:
    • default -> fastrtps
  • Client library (if applicable):
    • Checked using rclpy

Steps to reproduce issue

The following docker-compose file illustrates the issue. Running this will make your system run out of memory!!!

---
version: '2'

networks:
  rosdocker:
    driver: bridge

services:
  lister:  # Jazzy container that only runs ros2 topic list
    image: ros:jazzy
    container_name: lister
    hostname: lister
    networks:
      - rosdocker
    environment:
      - "ROS_DOMAIN_ID=13"
    command: "ros2 topic list"
    restart: always  # It seems not to happen every time, hence the restart

  listener:  # Humble container subscribing to /chatter
    image: ros:humble
    container_name: listener
    hostname: listener
    networks:
      - rosdocker
    environment:
      - "ROS_DOMAIN_ID=13"
    command: "ros2 topic echo /chatter std_msgs/String"

Expected behavior

Error messages, silent ignoring, or things magically just working. Note: As said earlier, I do not expect cross-distro communication to simply work, but I would expect stable behavior.

Actual behavior

The ros2 topic echo seems to cause unbounded memory consumption and makes the system run out of memory (OOM) within seconds.

Additional information

  • We haven't investigated further down into the RMW code, but we do see the error output added by Capture std::bad_alloc on deserializeROSmessage. (backport #665) #737 on the Humble node.
  • I have tried the same thing with rmw_cyclonedds_cpp, which throws warnings on the list and errors on the echo but doesn't crash or run out of memory. So, this seems to be an rmw_fastrtps_cpp issue (a sketch of how to switch the RMW implementation follows below).
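For reference, switching the RMW implementation for such a test works roughly like this (a sketch; it assumes the CycloneDDS RMW package, e.g. ros-humble-rmw-cyclonedds-cpp, is installed):

# select CycloneDDS instead of the default Fast DDS for this shell
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
ros2 topic echo /chatter std_msgs/String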
@fujitatomoya (Collaborator)

@urfeex thanks for creating the issue.

Note: As said earlier, I do not expect cross-distro communication to simply work, but I would expect stable behavior.

as you already mentioned, cross-distro communication is not supported (interfaces are not guaranteed to be compatible, including breaking ABI/API changes).
that said, unfortunately we do not really expect stable behavior in that case...

you can keep this open, but i will not investigate this issue any further.

urfeex commented Jan 17, 2025

As I said earlier, this issue is not about cross-distro communication. It's about taking down an entire computer as soon as some random jazzy / rolling node shows up on the network sending discovery messages.

In my opinion it would be fine to silently ignore those, print errors all over the place, whatever. But having a PC use up all its memory does not seem like something that should be considered "just not supported". It was my impression that #737 was also created with the motivation to prevent crashes in that scenario.

@fujitatomoya (Collaborator)

As I said earlier, this issue is not about cross-distro communication.

i think what you mean is the application data plane.

as soon as some random jazzy / rolling node shows up on the network sending discovery messages.

ROS 2 nodes already communicate during discovery, so cross-distro communication is already taking place at the discovery level to establish endpoint connectivity.

In my opinion it would be fine to silently ignore those, print errors all over the place, whatever. But having a PC use up all its memory does not seem like something that should be considered "just not supported".

good point, i totally agree with this.

one thing i would like to ask you about as a possible workaround: can you set a different ROS_DOMAIN_ID for jazzy and rolling? https://docs.ros.org/en/eloquent/Tutorials/Configuring-ROS2-Environment.html#the-ros-domain-id-variable

this should provide a logical partition for the discovery process, which means no discovery between jazzy and rolling at all.
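for example, something like this (just a sketch, the concrete values do not matter as long as they differ):

# humble machine / container
export ROS_DOMAIN_ID=13
ros2 topic echo /chatter std_msgs/String

# jazzy / rolling machine / container: different domain id, so no mutual discovery
export ROS_DOMAIN_ID=42
ros2 topic list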

urfeex commented Jan 17, 2025

ROS 2 nodes already communicate during discovery, so cross-distro communication is already taking place at the discovery level to establish endpoint connectivity.

Yes, that is clear to me. What I wanted to say is: We do not actively try to do any cross-distro communication or expect any cross-distro communication to work. We just want systems not to go down because a participant with the same domain ID becomes active on the same network. But I think that has become clear by now :-)

one thing i would like to ask you about as a possible workaround: can you set a different ROS_DOMAIN_ID for jazzy and rolling?

Yes, setting the ROS_DOMAIN_ID has already been identified as a workaround; I should have mentioned that. Unfortunately, this only makes it less likely to happen.

In my opinion it would be fine to silently ignore those, print errors all over the place, whatever. But having a PC use up all its memory does not seem like something that should be considered "just not supported".

good point, i totally agree with this.

Does that mean you think searching for a solution for this might be the way to go? Can we support this in any way? I cannot promise any resources at the moment, though.

@fujitatomoya (Collaborator)

What I wanted to say is: We do not actively try to do any cross-distro communication or expect any cross-distro communication to work. We just want systems not to go down because a participant with the same domain ID becomes active on the same network.

yeah, this is not a good user experience, silently causing the problem. if that is not supported, disallowing it or a warning notification would be much better for the user.

Yes, setting the ROS_DOMAIN_ID has already been identified as a workaround; I should have mentioned that.

no worries, good to know that works.

Does that mean you think searching for a solution for this might be the way to go?

i do not think so; as far as i know, nobody is planning to work on that.
