-
Notifications
You must be signed in to change notification settings - Fork 259
Introduce EventsExecutor implementation #1391
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Draft form of this feature to drive discussion only. I've done some minimal work on this relative to the original code to get things in roughly the right shape; however the functional test currently seems to explode, and a bunch of the linters are unhappy. Even once this nominally works, there's a fair bit more work to be done on it, starting with the removal of the boost dependency. |
039907a to
18036c4
Compare
6a7c1e7 to
68ecfb4
Compare
| // objects you're working with. We'll try to figure out what kind of stuff is hiding behind the | ||
| // abstraction by having the Waitable add itself to a wait set, then take stock of what all ended | ||
| // up there. We'll also have to hope that no Waitable implementations ever change their component | ||
| // entities over their lifetimes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, the rclcpp layer explicitly forces the waitables, to implement events handling. I don't know much about the rclpy waitable internals, but the approach taken here, would break the rclcpp waitables, so we should check this... A good candidate to have a look at should be the action interfaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are test cases in test_events_executor.py specifically for actions, so I know that rclpy Waitable type works, unless you think there's something specific missing from that test coverage.
Can you clarify what it is you're looking for here? Are you wanting me to modify the Waitable interface to better support callbacks directly? That's probably a bit more ambitious and invasive than I was hoping to be, but I can look into it if you think it's necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expire mechanism of the action server is strange in the way, that the timer callback does not expire the goals. Instead this happens if the timer is expired / ready in the waitable mechanism. We had problems in that area in the rclcpp implementation. Therefore I would check if this still works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_events_executor.py has coverage for this, unless I'm misunderstanding what you mean.
https://github.com/bmartin427/rclpy/blob/events_executor/rclpy/test/test_events_executor.py#L643-L650
648b153 to
168c5cc
Compare
|
asio has now been purged from this PR, and existing review comments have been addressed. I'm going to take this PR out of draft. FYI: on the 30th I will be leaving on vacation for a couple of weeks, so don't be concerned if I don't respond to review comments right away. |
|
I'll try to have a look on it on Friday. |
|
@alsora / @jmachowinski friendly reminder to take another look at this PR. 🧇 |
| waitables.attr("update")(py::set(node.attr("waitables"))); | ||
|
|
||
| // It doesn't seem to be possible to support guard conditions with a callback-based (as | ||
| // opposed to waitset-based) API. Fortunately we don't seem to need to. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
guard conditions can be supported, and you need them for wakeup for the spin_until_future_complete (At least this the case for rclcpp).
Normally you would just enqueue an empty event if one is triggered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rcl doesn't seem to provide any way to know when a guard condition is triggered though, other than rcl_wait()? There's potential to hack a callback into rclpy.GuardCondition.trigger(), though some of the guard conditions I might want to know about are triggered at the rcl level and not through rclpy so that doesn't help.
I also wasn't able to identify any particular need for guard conditions? spin_until_future_complete() works as-is by hooking the Future done callback. Signal handling and wake-up upon new entities added are both done in different ways. I'm not sure there's anything else? Also as far as I can tell, rclcpp EventsExecutor doesn't support them either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rclcpp guard conditions support the events executor, but it currently requires that they are triggered from the rclcpp layer and not from the rcl layer: https://github.com/ros2/rclcpp/blob/rolling/rclcpp/src/rclcpp/guard_condition.cpp
As you pointed out however, they are not hooked up in the executor.
The spin_until_future_complete relies on the executor base class implementation, where spin_once_impl is used and the future is handled separately
| // objects you're working with. We'll try to figure out what kind of stuff is hiding behind the | ||
| // abstraction by having the Waitable add itself to a wait set, then take stock of what all ended | ||
| // up there. We'll also have to hope that no Waitable implementations ever change their component | ||
| // entities over their lifetimes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expire mechanism of the action server is strange in the way, that the timer callback does not expire the goals. Instead this happens if the timer is expired / ready in the waitable mechanism. We had problems in that area in the rclcpp implementation. Therefore I would check if this still works.
|
I looked a bit more over the code, but don't have the time right now to finish the review. Overall the code looks good and seems to be of high quality. Something I don't understand right now, is how the liefetime tracking of python objects works, and how the executer ensures that they are alive over the period of the call. @bmartin427 can you give some information on that ? |
|
@Flova you seem to work on a similar thing, would you mind taking a look at this ? |
|
Yeah this is a quite funny timing. I also did a WIP prototype of a python events executor, that greatly increased the performance. It currently is missing timers, a graceful shutdown, tests and proper formatting. It orients itself on the C++ one, but I also added support for callback groups in combination with spin until future complete, allowing to properly await e.g. a service result with a single thread. Fighting the GIL was quite challenging at time, leading to a lot of race conditions and dead locks, but it seems to be working now. It currently also supports stuff like adding subscriptions at runtime. You can find my current (not very clean) state here: iron...Flova:rclpy:feature/events_executor I didn't have the time to look at this implementation here, but maybe we can combine both efforts in some way or another. |
You're talking about
rclpy seems to have a convention where the python object owns the C++ But then in practice, in most cases I found the need to protect the
In Python you can't just let a subscription object go out of scope, you have to actually call |
Signed-off-by: Brad Martin <[email protected]>
The previous version worked on 22.04+Iron, but fails on 24.04+Jazzy or Rolling. It seems like the behavior of std::bind() may have changed when asked to bind a py::object to a callback taking a py::handle. Signed-off-by: Brad Martin <[email protected]>
Signed-off-by: Brad Martin <[email protected]>
Signed-off-by: Brad Martin <[email protected]>
Signed-off-by: Brad Martin <[email protected]>
clang-format fixes things that uncrustify tends to leave alone, but then uncrustify seems slightly unhappy with that result. Also reflow comments at 100 columns. This uses the .clang-format file from the ament-clang-format package, with the exception of SortIncludes being set to false, because I didn't like what it tried to do with includes without that override. Signed-off-by: Brad Martin <[email protected]>
Signed-off-by: Brad Martin <[email protected]>
Signed-off-by: Brad Martin <[email protected]>
|
@bmartin427 what about adding a second test, just running one executor as c&p, this would at least solve the question of which executor is the culprit. |
|
I completed a run of the entire Windows CI (run_ros2_batch.py from ros2/ci), and got 9 errors, none of which were in rclpy: I also was still not entirely successful at reproducing the environment, because connextdds stuff denied me the permission to check it out.
I can maybe look at reformulating the test yet again to get better debug output. However, this has become a huge time sink for me with nothing to show for it so far. I'm reaching the limit of the time I'm able to devote to trying to fix a problem in a build environment that's impossible to reproduce accurately. |
|
Would it be possible for someone who is a Windows developer to try to reproduce this locally please? |
time.monotonic() has a resolution of 16ms, which is way too coarse for the intervals this test is trying to measure. Signed-off-by: Brad Martin <[email protected]>
|
I think I know the problem (although not necessarily why it only happens in CI). The test is looking at the elapsed time as measured by Internally, EventsExecutor is using I have replaced all of the uses of Anyway, @alsora could you kick off one more Windows build please? |
|
rclpy.Clock with ClockType.STEADY_TIME seems like the more reasonable time source. This one is is known to return the correct time. |
|
CI looks good, I don't think the rosbag test is related. |
|
Merging this, as discussed today in the client library WG. |
|
Any chance this will be back ported into jazzy? |
|
I originally implemented it for Iron and then forward-ported it during the PR process, so there's no big reason why it couldn't work with Jazzy. The biggest compatibility changes I ran into when porting was the addition of That said, I myself am not planning to do any more porting work, but anyone else is welcome to. 🙂 |
|
Okay. I could try opening a backport PR, but would this be aligned with the backporting policy of new features (I am not familiar with it)? @alsora I am really looking forward to it. Performance was a major pain point when we migrated to ROS2 a few years ago, as many parts needed a cpp rewrite. Do you have any numbers on the performance gains of the new executor. What is the maximum throughput of this implementation? I also did a python port of the events executor concept (you were faster and more polished in the end, so I discontinued it) and the maximum throughput of a node with 10 subscriptions increased from ~1khz to >10khz both running at 100% of one core iirc.. |
|
@Flova that's great! |
Have a look at #1389 , I posted a simple benchmark script and some results. |
|
Btw, thanks a lot for implementing this very much needed feature @bmartin427 ! |
|
This pull request has been mentioned on ROS Discourse. There might be relevant details there: https://discourse.ros.org/t/ros-pmc-minutes-for-march-25-2025/42807/1 |
|
@bmartin427 Just as one final follow-up, would you like to announce this feature to discourse? I would also like to include this in the feature list for Kilted as many people will be interested in it. |
* Introduce EventsExecutor implementation (#1389) Signed-off-by: Brad Martin <[email protected]> Signed-off-by: Brad Martin <[email protected]> Signed-off-by: Florian Vahl <[email protected]> Co-authored-by: Brad Martin <[email protected]> Co-authored-by: Brad Martin <[email protected]> Co-authored-by: Janosch Machowinski <[email protected]>
* Introduce EventsExecutor implementation (ros2#1389) Signed-off-by: Brad Martin <[email protected]> * Fix explosion for reference count updates without GIL The previous version worked on 22.04+Iron, but fails on 24.04+Jazzy or Rolling. It seems like the behavior of std::bind() may have changed when asked to bind a py::object to a callback taking a py::handle. Signed-off-by: Brad Martin <[email protected]> * Fix ament linter complaints Signed-off-by: Brad Martin <[email protected]> * Switch to non-boost asio library Signed-off-by: Brad Martin <[email protected]> * Use internal _rclpy C++ types instead of jumping through Python hoops Signed-off-by: Brad Martin <[email protected]> * Reformat with clang-format, then uncrustify again clang-format fixes things that uncrustify tends to leave alone, but then uncrustify seems slightly unhappy with that result. Also reflow comments at 100 columns. This uses the .clang-format file from the ament-clang-format package, with the exception of SortIncludes being set to false, because I didn't like what it tried to do with includes without that override. Signed-off-by: Brad Martin <[email protected]> * Respect init signal handling options Signed-off-by: Brad Martin <[email protected]> * Reconcile signal handling differences with SingleThreadedExecutor Signed-off-by: Brad Martin <[email protected]> * test_executor.py: Add coverage for EventsExecutor Tests that currently work are enabled, and those that don't are annotated what needs to be done before they can be enabled Signed-off-by: Brad Martin <[email protected]> * Make spin_once() only return after a user-visible callback Signed-off-by: Brad Martin <[email protected]> * Add get_nodes() method Signed-off-by: Brad Martin <[email protected]> * Add context management support Signed-off-by: Brad Martin <[email protected]> * Remove mutex protection on nodes_ member access It was being used only inconsistently, and on reflection it wasn't adding any protection beyond what the GIL already provides. Signed-off-by: Brad Martin <[email protected]> * Fix deadlock during shutdown() Needed to release the GIL while blocking on another lock. Signed-off-by: Brad Martin <[email protected]> * A little further on using C++ _rclpy types instead of Python interface Signed-off-by: Brad Martin <[email protected]> * Fix issue with iterating through incomplete Tasks Never bool-test a py::object::attr() return value without an explicit py::cast<bool>. Signed-off-by: Brad Martin <[email protected]> * Add support for coroutines to timer handling Signed-off-by: Brad Martin <[email protected]> * Fix test_not_lose_callback() test to destroy entities properly EventsExecutor was exploding because the test was leaving destroyed entities in the node by using an incorrect API to destroy them. Signed-off-by: Brad Martin <[email protected]> * Correct test that wasn't running at all, and remove EventsExecutor from it * Test methods need to be prefixed with 'test_' in order to function. This had been entirely dead code, and when I enabled it EventsExecutor deadlocked. * On reflection, it seems like a deadlock is exactly what should be expected from what this test is doing. Remove EventsExecutor from the test and document the problem. Signed-off-by: Brad Martin <[email protected]> * Warn on every timer added over the threshold, not just the first Co-authored-by: Janosch Machowinski <[email protected]> Signed-off-by: Brad Martin <[email protected]> * Keep rcl_timer_call() tightly coupled with user callback dispatch This both prevents an issue where user callbacks could potentially queue up if they didn't dispatch fast enough, and ensures that a timer that has passed its expiration without being dispatched yet can still be canceled without the user callback being delivered later. This commit also makes use of the new rcl API for querying the absolute timer expiration time, instead of the relative number of nanoseconds remaining until it expires. This should both make things more accurate, and potentially reduce overhead as we don't have to re-query the current time for each outstanding timer. Signed-off-by: Brad Martin <[email protected]> * Protect against deferred method calls happening against a deleted ClockManager Signed-off-by: Brad Martin <[email protected]> * Add support for new TimerInfo data passed to timer handlers Signed-off-by: Brad Martin <[email protected]> * Simplify spin_once() implementation This both reduces duplicate code now, and simplifies the asio interface used which would need replacing. Signed-off-by: Brad Martin <[email protected]> * Fix stale Future done callbacks with spin_until_future_complete() This method can't be allowed to leave its Future done callback outstanding in case the method is returning for a reason other than the Future being done. Signed-off-by: Brad Martin <[email protected]> * Use existing rclpy signal handling instead of asio Signed-off-by: Brad Martin <[email protected]> * Replace asio timers with a dedicated timer wait thread This is dumb on its own, but it helps me move towards eliminating asio. Signed-off-by: Brad Martin <[email protected]> * Correct busy-looping in async callback handling This isn't ideal because there are some ways async callbacks could become unblocked which wouldn't get noticed right away (if at all); however this seems to match the behavior of SingleThreadedExecutor. Signed-off-by: Brad Martin <[email protected]> * Replace asio::io_context with a new EventsQueue object Signed-off-by: Brad Martin <[email protected]> * Add EventsExecutor to new test_executor test from merge Signed-off-by: Brad Martin <[email protected]> * Swap 'pragma once' for ifndef include guards Signed-off-by: Brad Martin <[email protected]> * Add test coverage for Node.destroy_subscription() Signed-off-by: Brad Martin <[email protected]> * Replace '|' type markup with typing.Optional and typing.Union Python 3.9, still used by RHEL 9, doesn't seem to understand '|' syntax, and Optional/Union seems to be what gets used throughout the rest of the codebase. Additionally, fix a couple other mypy nits: * mypy is mad that I haven't explicitly annotated every __init__ as returning None * mypy wants generic arguments to service and action clients now Signed-off-by: Brad Martin <[email protected]> * Use 'auto' in place of explicit return type on hash pybind11::hash() is documented as returning ssize_t, but this seems to be a lie because MSVC doesn't understand that type; so, let's just return whatever we do get. Signed-off-by: Brad Martin <[email protected]> * Change initialization of struct members MSVC didn't like the more concise method. Signed-off-by: Brad Martin <[email protected]> * Use subTest in test_executor to distinguish which executor type failed Signed-off-by: Brad Martin <[email protected]> * Use time.perf_counter() instead of time.monotonic() in executor test time.monotonic() has a resolution of 16ms, which is way too coarse for the intervals this test is trying to measure. Signed-off-by: Brad Martin <[email protected]> --------- Signed-off-by: Brad Martin <[email protected]> Signed-off-by: Brad Martin <[email protected]> Co-authored-by: Brad Martin <[email protected]> Co-authored-by: Janosch Machowinski <[email protected]>
|
Hi all, I've made an attempt to backport this to Humble (#1454) (ros2/rcl#1237). Please take a look if you have time! |
Closes #1389