Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Addition: Port EventsExecutor concept to Python #1389

Open
bmartin427 opened this issue Dec 18, 2024 · 8 comments · May be fixed by #1391
Open

Feature Addition: Port EventsExecutor concept to Python #1389

bmartin427 opened this issue Dec 18, 2024 · 8 comments · May be fixed by #1391

Comments

@bmartin427
Copy link

Feature request

Feature description

TL;DR: Port the rclcpp EventsExecutor concept to Python for use in rclpy.

Motivation: My employer, Merlin Labs, has recently completed the porting of a large existing codebase from ROS1 to ROS2. We found this effort to be much more difficult than we anticipated, and the end result to be much less performant than we expected. After analysis, I found that the single biggest contribution to this issue was the fact that every Python node in our system was using something like five times the CPU that it had with ROS1! As a small toy example, I made a simple test node with a 100Hz timer, which published a small fixed-content message in the timer callback. In ROS1 Noetic, this node consumed 3% of a CPU core (Ubuntu on x86-64); in ROS2 Iron, this jumped to 16%!

I looked for existing discussions of this problem, and while there seem to be a few, there didn't seem to be much interest in doing anything about it. It seems the prevailing opinion is just "don't use Python if you care about CPU usage", which is maybe understandable in the context of new development, but was an unsuitable answer for us, since we were porting an existing project consisting of a mix of Python and C++ nodes and high rate publications, which all worked fine with ROS1.

The solution I ended up developing at Merlin was a port of the EventsExecutor concept to Python. To cite my earlier simple example case again, this implementation dropped the usage from 16% to 6%, which while still greater than ROS1's 3%, was much more tolerable for us. Furthermore, I have gotten approval from within Merlin to publish my work on this for upstream use.

This is not a request for development effort or code review, at least not yet. For now, I am just posting publicly that internal code for which I have approval to do so. This code is not suitable for merging to rclpy as-is, because it was designed to exist outside of rclpy and not require us to maintain a modified version of rclpy for it to function. As the next step, I myself in my personal capacity intend to rework this code into a suitable PR against rclpy itself. I do however have a couple design-level questions I'll get into below.

Implementation considerations

In addition to all of the arguments relating to wait sets that iRobot made in favor of their implementation over rclcpp's SingleThreadedExecutor, rclpy executors suffer the additional overhead of actually being implemented in Python. I chose to implement my executor in C++ instead to mitigate this additional overhead.

My implementation of this concept ended up somewhat simpler than iRobot's, because I based it on boost::asio instead of recreating their EventsQueue. I think I understand ROS prefers to avoid use of Boost at almost any cost, is this accurate? If this is the case, I believe I can probably preserve much of the simplicity I got out of using bound callbacks instead of their ExecutorEvent objects, even if I have to implement my own callback dispatch queue.

One improvement I've been able to make in this implementation relative to rclcpp is that the TimersManager component now supports debug time. It is however potentially more limited in terms of the number of simultaneous timers it can support in a performant manner, because I eschewed their 'heap' approach to timer maintenance in favor of just iterating through the entire list each update. (None of Merlin's nodes contain a large number of timers, so the simplicity of this approach outweighed the complexity of the heap for us.)

This implementation is single-threaded only, and completely ignores all CallbackGroups. Adding multi-threading would be significant additional work that we at Merlin didn't find a need to do.

This implementation does not yet properly support async callbacks. It was on my TODO list to work on this, but unfortunately I never got to it. Would this be a blocker for upstream merging? I can make sure to properly implement and test this first if so (and may do so anyway).

This code was implemented and deployed against Iron. I'll forward port to Rolling before I stage a PR.

merlin_python_events_executor.tar.gz

@mjcarroll
Copy link
Member

Wow! Very interesting stuff. I think two things to make this conversation easier.

  1. Can add the code as PR (even marked draft) just so we can read it in the webui?

  2. Can you join the client library working group meeting? This would be very interesting to discuss there.

@bmartin427
Copy link
Author

  1. So the tarball I attached is a selection of files I pulled out of Merlin's internal codebase. It won't build as-is and in fact this excerpt contains no CMake build scripts at all; furthermore, not all of these files should even end up in the same directory. I can however take it as an action item to make a draft PR with such things resolved, before I then continue with any other rework.
  2. Yes, I'll keep an eye out for the next meeting announcement.

@mjcarroll
Copy link
Member

It won't build as-is and in fact this excerpt contains no CMake build scripts at all; furthermore, not all of these files should even end up in the same directory. I can however take it as an action item to make a draft PR with such things resolved,

I probably wouldn't even worry about getting it into pristine (or even buildable shape). It's just a lot easier to read and talk about when it's already in git.

@clalancette
Copy link
Contributor

Yes, thanks for opening this, this is great stuff.

I think I understand ROS prefers to avoid use of Boost at almost any cost, is this accurate?

Yes, that is accurate. It's just too big of a dependency.

@fujitatomoya
Copy link
Collaborator

@bmartin427 this is really interesting! please consider to open the draft PR 👍

bmartin427 pushed a commit to bmartin427/rclpy that referenced this issue Dec 31, 2024
@bmartin427
Copy link
Author

Sorry for the delay on this, I've been busy (and sick) around the holidays.

I finally have that draft PR up, and I've done a little bit of beating the code into shape, though it doesn't quite work yet.
#1391

@bmartin427 bmartin427 linked a pull request Jan 1, 2025 that will close this issue
bmartin427 pushed a commit to bmartin427/rclpy that referenced this issue Jan 4, 2025
@bmartin427
Copy link
Author

I think I understand ROS prefers to avoid use of Boost at almost any cost, is this accurate?

Yes, that is accurate. It's just too big of a dependency.

I've switched the code in the PR to use the non-Boost version of asio, is this acceptable? I see that this already seems to get used from a number of places, so I'm hoping so.

@bmartin427
Copy link
Author

As of today, I think the current state of the PR is in pretty decent shape, provided the use of non-Boost asio is acceptable. That's not to say there aren't a few rough edges still, but those can probably be addressed either as a follow-up effort, or potentially during this PR's review if necessary.

Also fwiw, I created a quick benchmark script that allows to compare the performance of this executor with the existing SingleThreadedExecutor. This benchmark by default creates three nodes, each of which has a 100Hz timer that triggers one publication, and then each node subscribes to every nodes' topics. I get the following output running it:

$ ./test_rclpy_performance.py --executor single-threaded; ./test_rclpy_performance.py --executor events
Running benchmark for 30.0 seconds with 'single-threaded' executor...
Wall time elapsed: 30.055s
User CPU time: 24.130s (80.3%)
System CPU time: 0.030s (0.1%)
Running benchmark for 30.0 seconds with 'events' executor...
Wall time elapsed: 30.061s
User CPU time: 2.110s (7.0%)
System CPU time: 0.000s (0.0%)

(Any more than three test nodes caused SingleThreadedExecutor to max out a CPU core and presumably drop subscribed messages.)
test_rclpy_performance.tar.gz

Absent any further feedback, I'll probably move the PR to 'ready for review' soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants