Skip to content

Conversation

@bmartin427
Copy link
Contributor

@bmartin427 bmartin427 commented Dec 31, 2024

Closes #1389

@bmartin427
Copy link
Contributor Author

bmartin427 commented Dec 31, 2024

Draft form of this feature to drive discussion only. I've done some minimal work on this relative to the original code to get things in roughly the right shape; however the functional test currently seems to explode, and a bunch of the linters are unhappy. Even once this nominally works, there's a fair bit more work to be done on it, starting with the removal of the boost dependency.

@bmartin427 bmartin427 changed the title Introduce EventsExecutor implementation (#1389) Introduce EventsExecutor implementation Jan 7, 2025
// objects you're working with. We'll try to figure out what kind of stuff is hiding behind the
// abstraction by having the Waitable add itself to a wait set, then take stock of what all ended
// up there. We'll also have to hope that no Waitable implementations ever change their component
// entities over their lifetimes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, the rclcpp layer explicitly forces the waitables, to implement events handling. I don't know much about the rclpy waitable internals, but the approach taken here, would break the rclcpp waitables, so we should check this... A good candidate to have a look at should be the action interfaces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are test cases in test_events_executor.py specifically for actions, so I know that rclpy Waitable type works, unless you think there's something specific missing from that test coverage.

Can you clarify what it is you're looking for here? Are you wanting me to modify the Waitable interface to better support callbacks directly? That's probably a bit more ambitious and invasive than I was hoping to be, but I can look into it if you think it's necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expire mechanism of the action server is strange in the way, that the timer callback does not expire the goals. Instead this happens if the timer is expired / ready in the waitable mechanism. We had problems in that area in the rclcpp implementation. Therefore I would check if this still works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_events_executor.py has coverage for this, unless I'm misunderstanding what you mean.
https://github.com/bmartin427/rclpy/blob/events_executor/rclpy/test/test_events_executor.py#L643-L650

@bmartin427 bmartin427 force-pushed the events_executor branch 3 times, most recently from 648b153 to 168c5cc Compare January 26, 2025 19:35
@bmartin427
Copy link
Contributor Author

asio has now been purged from this PR, and existing review comments have been addressed. I'm going to take this PR out of draft.

FYI: on the 30th I will be leaving on vacation for a couple of weeks, so don't be concerned if I don't respond to review comments right away.

@bmartin427 bmartin427 marked this pull request as ready for review January 29, 2025 04:26
@jmachowinski
Copy link
Contributor

I'll try to have a look on it on Friday.

@Yadunund
Copy link
Member

Yadunund commented Feb 6, 2025

@alsora / @jmachowinski friendly reminder to take another look at this PR. 🧇

waitables.attr("update")(py::set(node.attr("waitables")));

// It doesn't seem to be possible to support guard conditions with a callback-based (as
// opposed to waitset-based) API. Fortunately we don't seem to need to.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

guard conditions can be supported, and you need them for wakeup for the spin_until_future_complete (At least this the case for rclcpp).
Normally you would just enqueue an empty event if one is triggered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rcl doesn't seem to provide any way to know when a guard condition is triggered though, other than rcl_wait()? There's potential to hack a callback into rclpy.GuardCondition.trigger(), though some of the guard conditions I might want to know about are triggered at the rcl level and not through rclpy so that doesn't help.

I also wasn't able to identify any particular need for guard conditions? spin_until_future_complete() works as-is by hooking the Future done callback. Signal handling and wake-up upon new entities added are both done in different ways. I'm not sure there's anything else? Also as far as I can tell, rclcpp EventsExecutor doesn't support them either.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rclcpp guard conditions support the events executor, but it currently requires that they are triggered from the rclcpp layer and not from the rcl layer: https://github.com/ros2/rclcpp/blob/rolling/rclcpp/src/rclcpp/guard_condition.cpp

As you pointed out however, they are not hooked up in the executor.

The spin_until_future_complete relies on the executor base class implementation, where spin_once_impl is used and the future is handled separately

// objects you're working with. We'll try to figure out what kind of stuff is hiding behind the
// abstraction by having the Waitable add itself to a wait set, then take stock of what all ended
// up there. We'll also have to hope that no Waitable implementations ever change their component
// entities over their lifetimes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expire mechanism of the action server is strange in the way, that the timer callback does not expire the goals. Instead this happens if the timer is expired / ready in the waitable mechanism. We had problems in that area in the rclcpp implementation. Therefore I would check if this still works.

@jmachowinski
Copy link
Contributor

I looked a bit more over the code, but don't have the time right now to finish the review. Overall the code looks good and seems to be of high quality.
The naming convention of the function names is off and might need changes.

Something I don't understand right now, is how the liefetime tracking of python objects works, and how the executer ensures that they are alive over the period of the call. @bmartin427 can you give some information on that ?
Actually, how do you destroy / remove a subscriber in the python world ?

@jmachowinski
Copy link
Contributor

@Flova you seem to work on a similar thing, would you mind taking a look at this ?

@Flova
Copy link
Contributor

Flova commented Feb 13, 2025

Yeah this is a quite funny timing. I also did a WIP prototype of a python events executor, that greatly increased the performance. It currently is missing timers, a graceful shutdown, tests and proper formatting. It orients itself on the C++ one, but I also added support for callback groups in combination with spin until future complete, allowing to properly await e.g. a service result with a single thread. Fighting the GIL was quite challenging at time, leading to a lot of race conditions and dead locks, but it seems to be working now. It currently also supports stuff like adding subscriptions at runtime. You can find my current (not very clean) state here: iron...Flova:rclpy:feature/events_executor

I didn't have the time to look at this implementation here, but maybe we can combine both efforts in some way or another.

@bmartin427
Copy link
Contributor Author

The naming convention of the function names is off and might need changes.

You're talking about snake_case vs CamelCase? I thought I understood that either was acceptable, but I can switch if needed.

Something I don't understand right now, is how the liefetime tracking of python objects works, and how the executer ensures that they are alive over the period of the call. @bmartin427 can you give some information on that ?

rclpy seems to have a convention where the python object owns the C++ _rclpy object, which supports context management; any sections during which the C++ object must not be destroyed must take place within a with section on the C++ object. I created the C++ object ScopedWith so that I can do similar things like:

{
  ScopedWith with(object);
  // object can't be destroyed
}
// object may be destroyed

But then in practice, in most cases I found the need to protect the _rclpy object for as long as the executor is working with it (i.e. until HandleRemoveFoo()); so for that reason I keep the ScopedWith object around for longer than a single function scope by stashing it somewhere in the executor itself.

Actually, how do you destroy / remove a subscriber in the python world ?

In Python you can't just let a subscription object go out of scope, you have to actually call Node.destroy_subscription(), which removes the subscription from the node and sends a wakeup to the executor, which will notice the removal and clean up the callbacks and such. Looking through test_events_executor.py now, although I test deleting a whole subscriber node, it looks like I don't have coverage for an existing node deleting one subscription, though I have no reason to think that won't work. I can add that coverage for completeness' sake.

Brad Martin added 8 commits February 21, 2025 10:50
The previous version worked on 22.04+Iron, but fails on 24.04+Jazzy or Rolling.  It
seems like the behavior of std::bind() may have changed when asked to bind a py::object
to a callback taking a py::handle.

Signed-off-by: Brad Martin <[email protected]>
Signed-off-by: Brad Martin <[email protected]>
clang-format fixes things that uncrustify tends to leave alone, but then uncrustify seems slightly
unhappy with that result.  Also reflow comments at 100 columns.

This uses the .clang-format file from the ament-clang-format package, with the exception of
SortIncludes being set to false, because I didn't like what it tried to do with includes without
that override.

Signed-off-by: Brad Martin <[email protected]>
@jmachowinski
Copy link
Contributor

@bmartin427 what about adding a second test, just running one executor as c&p, this would at least solve the question of which executor is the culprit.

@bmartin427
Copy link
Contributor Author

I completed a run of the entire Windows CI (run_ros2_batch.py from ros2/ci), and got 9 errors, none of which were in rclpy:

==> env.bat C:\dev\.pixi\envs\default\Scripts\colcon.EXE test-result --test-result-base "build"
**********************************************************************
** Visual Studio 2019 Developer Command Prompt v16.11.44
** Copyright (c) 2021 Microsoft Corporation
**********************************************************************
[vcvarsall.bat] Environment initialized for: 'x86_x64'
build\demo_nodes_cpp\Testing\20250318-2048\Test.xml: 1 test, 0 errors, 1 failure, 0 skipped
build\demo_nodes_cpp\test_results\demo_nodes_cpp\test_tutorial_add_two_ints_server_add_two_ints_client_async__rmw_fastrtps_cpp.xunit.xml: 2 tests, 0 errors, 2 failures, 0 skipped
build\ros2bag\pytest.xml: 32 tests, 0 errors, 1 failure, 0 skipped
build\sros2\pytest.xml: 35 tests, 18 errors, 4 failures, 0 skipped
build\test_security\Testing\20250318-2120\Test.xml: 47 tests, 0 errors, 1 failure, 46 skipped

Summary: 45574 tests, 18 errors, 9 failures, 9539 skipped

II> colcon test-result returned: '1'

I also was still not entirely successful at reproducing the environment, because connextdds stuff denied me the permission to check it out.

@bmartin427 what about adding a second test, just running one executor as c&p, this would at least solve the question of which executor is the culprit.

I can maybe look at reformulating the test yet again to get better debug output. However, this has become a huge time sink for me with nothing to show for it so far. I'm reaching the limit of the time I'm able to devote to trying to fix a problem in a build environment that's impossible to reproduce accurately.

@bmartin427
Copy link
Contributor Author

Would it be possible for someone who is a Windows developer to try to reproduce this locally please?

time.monotonic() has a resolution of 16ms, which is way too coarse for the intervals
this test is trying to measure.

Signed-off-by: Brad Martin <[email protected]>
@bmartin427
Copy link
Contributor Author

bmartin427 commented Mar 19, 2025

I think I know the problem (although not necessarily why it only happens in CI). The test is looking at the elapsed time as measured by time.monotonic(). On Linux, time.get_clock_info("monotonic").resolution returns 1e9, or nanosecond precision. On Windows, the same expression returns 0.016, or 16 milliseconds. When this test fails, it does so by less than that amount. I think the actual elapsed time is correct, but the read back time is just being rounded down to this large increment.

Internally, EventsExecutor is using std::chrono::steady_clock, which seems to have at least somewhat higher precision. Invoking std::chrono::duration_cast<std::chrono::duration<double>>(std::chrono::steady_clock::duration(1)).count() returns 1e-9, also implying nanosecond resolution, although I'm not certain if that accurately describes the underlying clock or just the representation.

I have replaced all of the uses of time.monotonic() with time.perf_counter() in test_executor.py. get_clock_info() says that clock is monotonic as well, so there shouldn't be any problems with time warping. Interestingly, it seems monotonic() uses GetTickCount64() as its implementation, whereas perf_counter() uses QueryPerformanceCounter() as advertised; but on Python 3.13, it appears monotonic() has changed to use QueryPerformanceCounter() as well.

Anyway, @alsora could you kick off one more Windows build please?

@fujitatomoya
Copy link
Collaborator

@bmartin427

  • Windows Build Status

@jmachowinski
Copy link
Contributor

@jmachowinski
Copy link
Contributor

rclpy.Clock with ClockType.STEADY_TIME seems like the more reasonable time source. This one is is known to return the correct time.

@jmachowinski
Copy link
Contributor

CI looks good, I don't think the rosbag test is related.

@alsora
Copy link

alsora commented Mar 21, 2025

Running a full CI again, just in case

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@alsora
Copy link

alsora commented Mar 21, 2025

Merging this, as discussed today in the client library WG.
The Windows failure in the rosbag test is common to also other rclpy PRs

@alsora alsora merged commit 4c05d92 into ros2:rolling Mar 21, 2025
2 checks passed
@Flova
Copy link
Contributor

Flova commented Mar 22, 2025

Any chance this will be back ported into jazzy?

@bmartin427
Copy link
Contributor Author

I originally implemented it for Iron and then forward-ported it during the PR process, so there's no big reason why it couldn't work with Jazzy. The biggest compatibility changes I ran into when porting was the addition of TimerInfo as an argument to timer callbacks, and a new rcl API for getting a timer's absolute expiration time instead of forcing me to deal with relative times. I'm unsure if either of those things appeared in Jazzy or were new for Rolling.

That said, I myself am not planning to do any more porting work, but anyone else is welcome to. 🙂

@Flova
Copy link
Contributor

Flova commented Mar 22, 2025

Okay. I could try opening a backport PR, but would this be aligned with the backporting policy of new features (I am not familiar with it)? @alsora

I am really looking forward to it. Performance was a major pain point when we migrated to ROS2 a few years ago, as many parts needed a cpp rewrite. Do you have any numbers on the performance gains of the new executor. What is the maximum throughput of this implementation? I also did a python port of the events executor concept (you were faster and more polished in the end, so I discontinued it) and the maximum throughput of a node with 10 subscriptions increased from ~1khz to >10khz both running at 100% of one core iirc..

@alsora
Copy link

alsora commented Mar 22, 2025

@Flova that's great!
The main requirement for a backport is to not break API/ABI.
Looking at the changes here, that are almost entirely just adding new files, it should be doable!

@bmartin427
Copy link
Contributor Author

Do you have any numbers on the performance gains of the new executor. What is the maximum throughput of this implementation?

Have a look at #1389 , I posted a simple benchmark script and some results.

@Flova
Copy link
Contributor

Flova commented Mar 23, 2025

Btw, thanks a lot for implementing this very much needed feature @bmartin427 !

@ros-discourse
Copy link

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-pmc-minutes-for-march-25-2025/42807/1

@mjcarroll
Copy link
Member

@bmartin427 Just as one final follow-up, would you like to announce this feature to discourse? I would also like to include this in the feature list for Kilted as many people will be interested in it.

ahcorde pushed a commit that referenced this pull request Mar 28, 2025
* Introduce EventsExecutor implementation (#1389)

Signed-off-by: Brad Martin <[email protected]>
Signed-off-by: Brad Martin <[email protected]>
Signed-off-by: Florian Vahl <[email protected]>
Co-authored-by: Brad Martin <[email protected]>
Co-authored-by: Brad Martin <[email protected]>
Co-authored-by: Janosch Machowinski <[email protected]>
n8hui pushed a commit to n8hui/rclpy that referenced this pull request Apr 2, 2025
* Introduce EventsExecutor implementation (ros2#1389)

Signed-off-by: Brad Martin <[email protected]>

* Fix explosion for reference count updates without GIL

The previous version worked on 22.04+Iron, but fails on 24.04+Jazzy or Rolling.  It
seems like the behavior of std::bind() may have changed when asked to bind a py::object
to a callback taking a py::handle.

Signed-off-by: Brad Martin <[email protected]>

* Fix ament linter complaints

Signed-off-by: Brad Martin <[email protected]>

* Switch to non-boost asio library

Signed-off-by: Brad Martin <[email protected]>

* Use internal _rclpy C++ types instead of jumping through Python hoops

Signed-off-by: Brad Martin <[email protected]>

* Reformat with clang-format, then uncrustify again

clang-format fixes things that uncrustify tends to leave alone, but then uncrustify seems slightly
unhappy with that result.  Also reflow comments at 100 columns.

This uses the .clang-format file from the ament-clang-format package, with the exception of
SortIncludes being set to false, because I didn't like what it tried to do with includes without
that override.

Signed-off-by: Brad Martin <[email protected]>

* Respect init signal handling options

Signed-off-by: Brad Martin <[email protected]>

* Reconcile signal handling differences with SingleThreadedExecutor

Signed-off-by: Brad Martin <[email protected]>

* test_executor.py: Add coverage for EventsExecutor

Tests that currently work are enabled, and those that don't are annotated what needs to be done before they can be enabled

Signed-off-by: Brad Martin <[email protected]>

* Make spin_once() only return after a user-visible callback

Signed-off-by: Brad Martin <[email protected]>

* Add get_nodes() method

Signed-off-by: Brad Martin <[email protected]>

* Add context management support

Signed-off-by: Brad Martin <[email protected]>

* Remove mutex protection on nodes_ member access

It was being used only inconsistently, and on reflection it wasn't adding any protection beyond what
the GIL already provides.

Signed-off-by: Brad Martin <[email protected]>

* Fix deadlock during shutdown()

Needed to release the GIL while blocking on another lock.

Signed-off-by: Brad Martin <[email protected]>

* A little further on using C++ _rclpy types instead of Python interface

Signed-off-by: Brad Martin <[email protected]>

* Fix issue with iterating through incomplete Tasks

Never bool-test a py::object::attr() return value without an explicit py::cast<bool>.

Signed-off-by: Brad Martin <[email protected]>

* Add support for coroutines to timer handling

Signed-off-by: Brad Martin <[email protected]>

* Fix test_not_lose_callback() test to destroy entities properly

EventsExecutor was exploding because the test was leaving destroyed entities in the node by using an
incorrect API to destroy them.

Signed-off-by: Brad Martin <[email protected]>

* Correct test that wasn't running at all, and remove EventsExecutor from it

* Test methods need to be prefixed with 'test_' in order to function.  This had been entirely dead
  code, and when I enabled it EventsExecutor deadlocked.

* On reflection, it seems like a deadlock is exactly what should be expected from what this test is
  doing.  Remove EventsExecutor from the test and document the problem.

Signed-off-by: Brad Martin <[email protected]>

* Warn on every timer added over the threshold, not just the first

Co-authored-by: Janosch Machowinski <[email protected]>
Signed-off-by: Brad Martin <[email protected]>

* Keep rcl_timer_call() tightly coupled with user callback dispatch

This both prevents an issue where user callbacks could potentially queue up if they didn't dispatch
fast enough, and ensures that a timer that has passed its expiration without being dispatched yet
can still be canceled without the user callback being delivered later.

This commit also makes use of the new rcl API for querying the absolute timer expiration time,
instead of the relative number of nanoseconds remaining until it expires.  This should both make
things more accurate, and potentially reduce overhead as we don't have to re-query the current time
for each outstanding timer.

Signed-off-by: Brad Martin <[email protected]>

* Protect against deferred method calls happening against a deleted ClockManager

Signed-off-by: Brad Martin <[email protected]>

* Add support for new TimerInfo data passed to timer handlers

Signed-off-by: Brad Martin <[email protected]>

* Simplify spin_once() implementation

This both reduces duplicate code now, and simplifies the asio interface used which would need
replacing.

Signed-off-by: Brad Martin <[email protected]>

* Fix stale Future done callbacks with spin_until_future_complete()

This method can't be allowed to leave its Future done callback outstanding in case the method is
returning for a reason other than the Future being done.

Signed-off-by: Brad Martin <[email protected]>

* Use existing rclpy signal handling instead of asio

Signed-off-by: Brad Martin <[email protected]>

* Replace asio timers with a dedicated timer wait thread

This is dumb on its own, but it helps me move towards eliminating asio.

Signed-off-by: Brad Martin <[email protected]>

* Correct busy-looping in async callback handling

This isn't ideal because there are some ways async callbacks could become unblocked which wouldn't
get noticed right away (if at all); however this seems to match the behavior of
SingleThreadedExecutor.

Signed-off-by: Brad Martin <[email protected]>

* Replace asio::io_context with a new EventsQueue object

Signed-off-by: Brad Martin <[email protected]>

* Add EventsExecutor to new test_executor test from merge

Signed-off-by: Brad Martin <[email protected]>

* Swap 'pragma once' for ifndef include guards

Signed-off-by: Brad Martin <[email protected]>

* Add test coverage for Node.destroy_subscription()

Signed-off-by: Brad Martin <[email protected]>

* Replace '|' type markup with typing.Optional and typing.Union

Python 3.9, still used by RHEL 9, doesn't seem to understand '|' syntax, and
Optional/Union seems to be what gets used throughout the rest of the codebase.

Additionally, fix a couple other mypy nits:
 * mypy is mad that I haven't explicitly annotated every __init__ as returning None
 * mypy wants generic arguments to service and action clients now

Signed-off-by: Brad Martin <[email protected]>

* Use 'auto' in place of explicit return type on hash

pybind11::hash() is documented as returning ssize_t, but this seems to be a lie because MSVC
doesn't understand that type; so, let's just return whatever we do get.

Signed-off-by: Brad Martin <[email protected]>

* Change initialization of struct members

MSVC didn't like the more concise method.

Signed-off-by: Brad Martin <[email protected]>

* Use subTest in test_executor to distinguish which executor type failed

Signed-off-by: Brad Martin <[email protected]>

* Use time.perf_counter() instead of time.monotonic() in executor test

time.monotonic() has a resolution of 16ms, which is way too coarse for the intervals
this test is trying to measure.

Signed-off-by: Brad Martin <[email protected]>

---------

Signed-off-by: Brad Martin <[email protected]>
Signed-off-by: Brad Martin <[email protected]>
Co-authored-by: Brad Martin <[email protected]>
Co-authored-by: Janosch Machowinski <[email protected]>
@n8hui
Copy link

n8hui commented May 3, 2025

Hi all, I've made an attempt to backport this to Humble (#1454) (ros2/rcl#1237). Please take a look if you have time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Addition: Port EventsExecutor concept to Python

9 participants