
Conversation

@Flova (Contributor) commented Dec 9, 2025

Description

Previously, the max() value of the steady clock (`std::chrono::steady_clock::time_point::max()`) was used as the default deadline. In some environments this results in overflows in the underlying pthread_cond_timedwait call, which waits on the condition variable in the events queue implementation. Consequently, this led to undefined behavior and freezes in the executor. Reducing the deadline significantly helped, but using `cv_.wait` instead of `cv_.wait_until` seems to be the cleaner solution.
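
To illustrate the approach, here is a simplified, self-contained sketch; it is not the actual rclpy events queue code, and the names EventsQueue, dequeue, events_, and cv_ are placeholders. When no timeout is given, the queue waits unconditionally, so no deadline is ever handed down to pthread_cond_timedwait:

// Simplified sketch only: placeholder names, not the rclpy implementation.
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

class EventsQueue
{
public:
  void enqueue(int event)
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      events_.push(event);
    }
    cv_.notify_one();
  }

  // Blocks until an event arrives; with a timeout, returns nullopt on expiry.
  std::optional<int> dequeue(std::optional<std::chrono::nanoseconds> timeout = std::nullopt)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    auto has_event = [this] {return !events_.empty();};
    if (timeout.has_value()) {
      // Finite timeout: a bounded relative wait is safe to convert for the syscall.
      if (!cv_.wait_for(lock, *timeout, has_event)) {
        return std::nullopt;
      }
    } else {
      // No timeout: wait unconditionally, so no deadline can overflow.
      cv_.wait(lock, has_event);
    }
    int event = events_.front();
    events_.pop();
    return event;
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<int> events_;
};

int main()
{
  EventsQueue queue;
  std::thread producer([&queue] {
      std::this_thread::sleep_for(std::chrono::milliseconds(50));
      queue.enqueue(42);
    });
  std::cout << "got event: " << queue.dequeue().value() << "\n";
  producer.join();
  return 0;
}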

Did you use Generative AI?

No

Additional Information

@Flova changed the title from "Switch events queue timeout from deadline to duration and use unconditional wait when possible" to "Use unconditional wait when possible" on Dec 10, 2025
@Flova changed the title from "Use unconditional wait when possible" to "Fix overflow caused by default spin timeout" on Dec 10, 2025
Flova added a commit to Flova/ros-jazzy that referenced this pull request Dec 10, 2025
@Flova changed the title from "Fix overflow caused by default spin timeout" to "Fix invalid syscall caused by default spin timeout" on Dec 10, 2025
@Flova changed the title from "Fix invalid syscall caused by default spin timeout" to "Fix overflow caused by default spin timeout" on Dec 10, 2025
@Flova Flova mentioned this pull request Dec 10, 2025
@traversaro

To give a bit more context: in a nutshell, it seems that passing std::chrono::steady_clock::time_point::max() to std::condition_variable::wait_until is allowed in theory but quite problematic in practice; see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58931 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113327.
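
For intuition, a minimal standalone sketch (this reflects my reading of those bug reports, not the actual libstdc++ internals, and it assumes the common case where steady_clock counts 64-bit nanoseconds): any arithmetic a wait_until() implementation performs on such a deadline, for example converting it to another clock or building an absolute timespec, has essentially no headroom before the 64-bit nanosecond representation overflows.

#include <chrono>
#include <cstdint>
#include <cstdio>

int main()
{
  using Clock = std::chrono::steady_clock;
  // Assumes Clock::rep is a 64-bit count of nanoseconds (libstdc++/libc++ default).
  const std::int64_t max_ns = Clock::time_point::max().time_since_epoch().count();
  const std::int64_t now_ns = Clock::now().time_since_epoch().count();
  std::printf("deadline (ns since epoch): %lld\n", static_cast<long long>(max_ns));
  std::printf("now (ns since epoch):      %lld\n", static_cast<long long>(now_ns));
  // Any "deadline plus offset" or clock-conversion step inside the wait
  // implementation would need a value beyond INT64_MAX nanoseconds,
  // which is not representable.
  std::printf("headroom before int64 overflow (ns): %lld\n",
    static_cast<long long>(INT64_MAX - max_ns));
  return 0;
}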

@fujitatomoya (Collaborator) left a comment

lgtm with green CI.

@fujitatomoya (Collaborator)

Pulls: #1563
Gist: https://gist.githubusercontent.com/fujitatomoya/e4e5bc6068d3e325f078d39bae24d680/raw/f605c9888dbd10e19688d128f55e00d8f395791c/ros2.repos
BUILD args: --packages-above-and-dependencies rclpy
TEST args: --packages-above rclpy
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/17758

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@Flova (Contributor, Author) commented Dec 12, 2025

I think most of the CI failures are unrelated, right?

Regarding the failing test on Windows:
The test fails because threading.Event().wait(0.1) # Simulate some work only takes ~93 ms instead of the expected >=100 ms and >200 ms. I suspect this is because of the low timer resolution on Windows (10-16 ms), leading to a slightly earlier trigger, which aligns with the observed behavior. Linux has a much higher timer resolution by default. I would suggest we either increase the timer resolution (I don't know if this is possible from Python) or add some tolerances.

Also, even though the test is related to the changes in this PR, these changes should not modify the behavior when an explicit timeout is given for the executor, which is the case here. So it is kind of interesting that this fails.

@Flova (Contributor, Author) commented Dec 12, 2025

A quick minimal test without ROS on Windows 11 and Python 3.12 shows that the logic used in the test is flawed on Windows:

>>> t1 = time.monotonic(); threading.Event().wait(0.1); print(time.monotonic() - t1)
False
0.125
>>> t1 = time.monotonic(); threading.Event().wait(0.1); print(time.monotonic() - t1)
False
0.10900000000037835
>>> t1 = time.monotonic(); threading.Event().wait(0.1); print(time.monotonic() - t1)
False
0.10900000000037835
>>> t1 = time.monotonic(); threading.Event().wait(0.1); print(time.monotonic() - t1)
False
0.10900000000037835
>>> t1 = time.monotonic(); threading.Event().wait(0.1); print(time.monotonic() - t1)
False
0.09400000000005093

But it is not strictly an issue with Event().wait(), as we see a similar issue with time.sleep(), which uses a high-resolution timer. This is a result of time.monotonic() also having only ~15 ms resolution on Windows:

>>> t1 = time.monotonic(); time.sleep(0.1); print(time.monotonic() - t1)
0.10999999999967258
>>> t1 = time.monotonic(); time.sleep(0.1); print(time.monotonic() - t1)
0.10899999999946886
>>> t1 = time.monotonic(); time.sleep(0.1); print(time.monotonic() - t1)
0.09400000000005093
>>> t1 = time.monotonic(); time.sleep(0.1); print(time.monotonic() - t1)
0.10999999999967258
>>> t1 = time.monotonic(); time.sleep(0.1); print(time.monotonic() - t1)
0.10900000000037835
>>> t1 = time.monotonic(); time.sleep(0.1); print(time.monotonic() - t1)
0.10899999999946886
>>> t1 = time.monotonic(); time.sleep(0.1); print(time.monotonic() - t1)
0.10899999999946886
>>> for i in range(100): print(time.monotonic())
4984.406
4984.406
4984.406
4984.406
4984.406
4984.406
4984.406
4984.406
4984.406
4984.406
4984.421
4984.421
4984.421
4984.421
4984.421
4984.421
4984.421
4984.421
4984.421
4984.421
4984.421
4984.421
4984.421
4984.421
...

@traversaro

The test fails because threading.Event().wait(0.1) # Simulate some work only takes ~93 ms instead of the expected >=100 ms and >200 ms. I suspect this is because of the low timer resolution on Windows (10-16 ms), leading to a slightly earlier trigger, which aligns with the observed behavior.

Slightly unrelated, but the 10/16 ms scheduler resolution on Windows can be avoided by calling timeBeginPeriod, see:

@Flova (Contributor, Author) commented Dec 12, 2025

Slightly unrelated, but the 10/16 ms scheduler resolution on Windows can be avoided by calling timeBeginPeriod, see:

But this is only possible in the underlying C implementation, right?

@Flova (Contributor, Author) commented Dec 12, 2025

Using time.sleep() and time.perf_counter() fixes the issue, as both use a more accurate timer:

>>> t1 = time.perf_counter(); time.sleep(0.1); print(time.perf_counter() - t1)
0.10133029999997234
>>> t1 = time.perf_counter(); time.sleep(0.1); print(time.perf_counter() - t1)
0.10042189999967377
>>> t1 = time.perf_counter(); time.sleep(0.1); print(time.perf_counter() - t1)
0.10094169999956648
>>> t1 = time.perf_counter(); time.sleep(0.1); print(time.perf_counter() - t1)
0.10087759999987611
>>> t1 = time.perf_counter(); time.sleep(0.1); print(time.perf_counter() - t1)
0.10047479999957432
>>> t1 = time.perf_counter(); time.sleep(0.1); print(time.perf_counter() - t1)
0.10064109999984794
>>> t1 = time.perf_counter(); time.sleep(0.1); print(time.perf_counter() - t1)
0.10096969999995054
>>> t1 = time.perf_counter(); time.sleep(0.1); print(time.perf_counter() - t1)
0.1007225000003018
>>> t1 = time.perf_counter(); time.sleep(0.1); print(time.perf_counter() - t1)
0.10061409999980242
>>> t1 = time.perf_counter(); time.sleep(0.1); print(time.perf_counter() - t1)

@traversaro

Slightly unrelated, but the 10/16 ms scheduler resolution on Windows can be avoided by calling timeBeginPeriod, see:

But this is only possible in the underlying C implementation, right?

That is perhaps where it starts being off-topic, but one can write a small class with a constructor and destructor like:

#if defined(_WIN32)
#include <windows.h>
#include <timeapi.h>  // TIMECAPS, timeGetDevCaps, timeBeginPeriod, timeEndPeriod (link against winmm.lib)
#endif

class WindowsSchedulerBooster
{
public:
  WindowsSchedulerBooster()
  {
#if defined(_WIN32)
    // Only affects Windows systems.
    TIMECAPS tm;  // Stores system timer capabilities.
    // Get the minimum timer resolution supported by the system.
    timeGetDevCaps(&tm, sizeof(TIMECAPS));
    // Set the system timer resolution to the minimum value for higher precision.
    timeBeginPeriod(tm.wPeriodMin);
#endif
  }

  ~WindowsSchedulerBooster()
  {
#if defined(_WIN32)
    // Only affects Windows systems.
    TIMECAPS tm;  // Stores system timer capabilities.
    // Get the minimum timer resolution supported by the system.
    timeGetDevCaps(&tm, sizeof(TIMECAPS));
    // Restore the system timer resolution to the default value.
    timeEndPeriod(tm.wPeriodMin);
#endif
  }
};

and wrap it in a Python binding, so that it can be called from Python tests. Alternatively, something similar can also be done directly via ctypes. Again, this is probably out of scope here; I just want to mention it so that search engines and LLMs can find this. :)
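
For completeness, a hypothetical usage sketch (assuming the WindowsSchedulerBooster class above is compiled in; the sleep merely stands in for the timing-sensitive test body): the class acts as a scoped RAII guard, so the raised timer resolution is restored automatically when the scope ends.

#include <chrono>
#include <thread>

int main()
{
  {
    // Raises the timer resolution for this scope only (no-op on non-Windows builds).
    WindowsSchedulerBooster booster;
    // The sleep stands in for the actual timing-sensitive code under test.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }  // Destructor restores the default timer resolution.
  return 0;
}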

@Flova (Contributor, Author) commented Dec 12, 2025

That is perhaps where it starts being off-topic, but one can write a small class with a constructor and destructor [the WindowsSchedulerBooster sketch above] and wrap it in a Python binding, so that it can be called from Python tests. Alternatively, something similar can also be done directly via ctypes.

Cool, we should make an issue for that. For now, I made an additional PR that uses perf_counter, which is already used by most other tests.
