
Conversation

@Flova (Contributor) commented Dec 9, 2025

Description

Previously, the max() value of the steady clock (`std::chrono::steady_clock::time_point::max()`) was used as the default deadline. In some environments this results in overflows in the underlying pthread_cond_timedwait call, which waits on the condition variable in the events queue implementation. Consequently, this led to undefined behavior and freezes in the executor. Reducing the deadline significantly helped, but using `cv_.wait` instead of `cv_.wait_until` seems to be the cleaner solution.
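
To illustrate the approach, here is a simplified, self-contained sketch; it is not the actual rclpy events queue code, and the names EventsQueue, dequeue, events_, and cv_ are placeholders. When no timeout is given, the queue waits unconditionally, so no deadline is ever handed down to pthread_cond_timedwait:

// Simplified sketch only: placeholder names, not the rclpy implementation.
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

class EventsQueue
{
public:
  void enqueue(int event)
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      events_.push(event);
    }
    cv_.notify_one();
  }

  // Blocks until an event arrives; with a timeout, returns nullopt on expiry.
  std::optional<int> dequeue(std::optional<std::chrono::nanoseconds> timeout = std::nullopt)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    auto has_event = [this] {return !events_.empty();};
    if (timeout.has_value()) {
      // Finite timeout: a bounded relative wait is safe to convert for the syscall.
      if (!cv_.wait_for(lock, *timeout, has_event)) {
        return std::nullopt;
      }
    } else {
      // No timeout: wait unconditionally, so no deadline can overflow.
      cv_.wait(lock, has_event);
    }
    int event = events_.front();
    events_.pop();
    return event;
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<int> events_;
};

int main()
{
  EventsQueue queue;
  std::thread producer([&queue] {
      std::this_thread::sleep_for(std::chrono::milliseconds(50));
      queue.enqueue(42);
    });
  std::cout << "got event: " << queue.dequeue().value() << "\n";
  producer.join();
  return 0;
}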

Did you use Generative AI?

No

Additional Information

@Flova changed the title from "Switch events queue timeout from deadline to duration and use unconditional wait when possible" to "Use unconditional wait when possible" on Dec 10, 2025
@Flova changed the title from "Use unconditional wait when possible" to "Fix overflow caused by default spin timeout" on Dec 10, 2025
Flova added a commit to Flova/ros-jazzy that referenced this pull request Dec 10, 2025
@Flova changed the title from "Fix overflow caused by default spin timeout" to "Fix invalid syscall caused by default spin timeout" on Dec 10, 2025
@Flova changed the title from "Fix invalid syscall caused by default spin timeout" to "Fix overflow caused by default spin timeout" on Dec 10, 2025
@Flova Flova mentioned this pull request Dec 10, 2025
@traversaro

To give a bit more context: in a nutshell, it seems that passing std::chrono::steady_clock::time_point::max() to std::condition_variable::wait_until is allowed in theory but quite problematic in practice; see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58931 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113327.
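
For intuition, a minimal standalone sketch (this reflects my reading of those bug reports, not the actual libstdc++ internals, and it assumes the common case where steady_clock counts 64-bit nanoseconds): any arithmetic a wait_until() implementation performs on such a deadline, for example converting it to another clock or building an absolute timespec, has essentially no headroom before the 64-bit nanosecond representation overflows.

#include <chrono>
#include <cstdint>
#include <cstdio>

int main()
{
  using Clock = std::chrono::steady_clock;
  // Assumes Clock::rep is a 64-bit count of nanoseconds (libstdc++/libc++ default).
  const std::int64_t max_ns = Clock::time_point::max().time_since_epoch().count();
  const std::int64_t now_ns = Clock::now().time_since_epoch().count();
  std::printf("deadline (ns since epoch): %lld\n", static_cast<long long>(max_ns));
  std::printf("now (ns since epoch):      %lld\n", static_cast<long long>(now_ns));
  // Any "deadline plus offset" or clock-conversion step inside the wait
  // implementation would need a value beyond INT64_MAX nanoseconds,
  // which is not representable.
  std::printf("headroom before int64 overflow (ns): %lld\n",
    static_cast<long long>(INT64_MAX - max_ns));
  return 0;
}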

@fujitatomoya (Collaborator) left a comment

lgtm with green CI.

@fujitatomoya (Collaborator)

Pulls: #1563
Gist: https://gist.githubusercontent.com/fujitatomoya/e4e5bc6068d3e325f078d39bae24d680/raw/f605c9888dbd10e19688d128f55e00d8f395791c/ros2.repos
BUILD args: --packages-above-and-dependencies rclpy
TEST args: --packages-above rclpy
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/17758

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@Flova (Contributor, Author) commented Dec 12, 2025

I think most of the CI failures are unrelated, right?

Regarding the failing test on Windows:
The test fails because threading.Event().wait(0.1) # Simulate some work only takes ~93 ms instead of the expected >=100 ms and >200 ms. I suspect this is because of the low timer resolution on Windows (10-16 ms), leading to a slightly earlier trigger, which aligns with the observed behavior. Linux has a much higher timer resolution by default. I would suggest we either increase the timer resolution (I don't know if this is possible from Python) or add some tolerances.

Also, even though the test is related to the changes in this PR, these changes should not modify the behavior when an explicit timeout is given for the executor, which is the case here. So it is kind of interesting that this fails.

@Flova (Contributor, Author) commented Dec 12, 2025

A quick minimal test without ROS on Windows 11 and Python 3.12 shows that the logic used in the test is flawed on Windows:

>>> t1 = time.monotonic(); threading.Event().wait(0.1); print(time.monotonic() - t1)
False
0.125
>>> t1 = time.monotonic(); threading.Event().wait(0.1); print(time.monotonic() - t1)
False
0.10900000000037835
>>> t1 = time.monotonic(); threading.Event().wait(0.1); print(time.monotonic() - t1)
False
0.10900000000037835
>>> t1 = time.monotonic(); threading.Event().wait(0.1); print(time.monotonic() - t1)
False
0.10900000000037835
>>> t1 = time.monotonic(); threading.Event().wait(0.1); print(time.monotonic() - t1)
False
0.09400000000005093

But it is not strictly an issue with Event().wait(), as we see a similar issue with time.sleep(), which uses a high-resolution timer. This is a result of time.monotonic() also having only ~15 ms resolution on Windows:

>>> t1 = time.monotonic(); time.sleep(0.1); print(time.monotonic() - t1)
0.10999999999967258
>>> t1 = time.monotonic(); time.sleep(0.1); print(time.monotonic() - t1)
0.10899999999946886
>>> t1 = time.monotonic(); time.sleep(0.1); print(time.monotonic() - t1)
0.09400000000005093
>>> t1 = time.monotonic(); time.sleep(0.1); print(time.monotonic() - t1)
0.10999999999967258
>>> t1 = time.monotonic(); time.sleep(0.1); print(time.monotonic() - t1)
0.10900000000037835
>>> t1 = time.monotonic(); time.sleep(0.1); print(time.monotonic() - t1)
0.10899999999946886
>>> t1 = time.monotonic(); time.sleep(0.1); print(time.monotonic() - t1)
0.10899999999946886
>>> for i in range(100): print(time.monotonic())
4984.406
4984.406
4984.406
4984.406
4984.406
4984.406
4984.406
4984.406
4984.406
4984.406
4984.421
4984.421
4984.421
4984.421
4984.421
4984.421
4984.421
4984.421
4984.421
4984.421
4984.421
4984.421
4984.421
4984.421
...

@traversaro

The test fails because threading.Event().wait(0.1) # Simulate some work only takes ~93 ms instead of the expected >=100 ms and >200 ms. I suspect this is because of the low timer resolution on Windows (10-16 ms), leading to a slightly earlier trigger, which aligns with the observed behavior.

Slightly unrelated, but the 10/16 ms scheduler resolution on Windows can be avoided by calling timeBeginPeriod, see:

@Flova (Contributor, Author) commented Dec 12, 2025

Slightly unrelated, but the 10/16 ms scheduler resolution on Windows can be avoided by calling timeBeginPeriod, see:

But this is only possible in the underlying C implementation, right?

@Flova (Contributor, Author) commented Dec 12, 2025

Using time.sleep() and time.perf_counter() fixes the issue, as both use a more accurate timer:

>>> t1 = time.perf_counter(); time.sleep(0.1); print(time.perf_counter() - t1)
0.10133029999997234
>>> t1 = time.perf_counter(); time.sleep(0.1); print(time.perf_counter() - t1)
0.10042189999967377
>>> t1 = time.perf_counter(); time.sleep(0.1); print(time.perf_counter() - t1)
0.10094169999956648
>>> t1 = time.perf_counter(); time.sleep(0.1); print(time.perf_counter() - t1)
0.10087759999987611
>>> t1 = time.perf_counter(); time.sleep(0.1); print(time.perf_counter() - t1)
0.10047479999957432
>>> t1 = time.perf_counter(); time.sleep(0.1); print(time.perf_counter() - t1)
0.10064109999984794
>>> t1 = time.perf_counter(); time.sleep(0.1); print(time.perf_counter() - t1)
0.10096969999995054
>>> t1 = time.perf_counter(); time.sleep(0.1); print(time.perf_counter() - t1)
0.1007225000003018
>>> t1 = time.perf_counter(); time.sleep(0.1); print(time.perf_counter() - t1)
0.10061409999980242
>>> t1 = time.perf_counter(); time.sleep(0.1); print(time.perf_counter() - t1)

@traversaro

Slightly unrelated, but the 10/16 ms scheduler resolution on Windows can be avoided by calling timeBeginPeriod, see:

But this is only possible in the underlying C implementation, right?

That is perhaps where it starts being off-topic, but one can write a small class with a constructor and destructor like:

#if defined(_WIN32)
#include <windows.h>
#include <timeapi.h>  // TIMECAPS, timeGetDevCaps, timeBeginPeriod, timeEndPeriod (link against winmm.lib)
#endif

class WindowsSchedulerBooster
{
public:
  WindowsSchedulerBooster()
  {
#if defined(_WIN32)
    // Only affects Windows systems.
    TIMECAPS tm;  // Stores system timer capabilities.
    // Get the minimum timer resolution supported by the system.
    timeGetDevCaps(&tm, sizeof(TIMECAPS));
    // Set the system timer resolution to the minimum value for higher precision.
    timeBeginPeriod(tm.wPeriodMin);
#endif
  }

  ~WindowsSchedulerBooster()
  {
#if defined(_WIN32)
    // Only affects Windows systems.
    TIMECAPS tm;  // Stores system timer capabilities.
    // Get the minimum timer resolution supported by the system.
    timeGetDevCaps(&tm, sizeof(TIMECAPS));
    // Restore the system timer resolution to the default value.
    timeEndPeriod(tm.wPeriodMin);
#endif
  }
};

and wrap it in a Python binding, so that it can be called from Python tests. Alternatively, something similar can also be done directly via ctypes. Again, this is probably out of scope here; I just want to mention it so that search engines and LLMs can find this. :)
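
For completeness, a hypothetical usage sketch (assuming the WindowsSchedulerBooster class above is compiled in; the sleep merely stands in for the timing-sensitive test body): the class acts as a scoped RAII guard, so the raised timer resolution is restored automatically when the scope ends.

#include <chrono>
#include <thread>

int main()
{
  {
    // Raises the timer resolution for this scope only (no-op on non-Windows builds).
    WindowsSchedulerBooster booster;
    // The sleep stands in for the actual timing-sensitive code under test.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }  // Destructor restores the default timer resolution.
  return 0;
}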

@Flova (Contributor, Author) commented Dec 12, 2025

That is perhaps where it starts being off-topic, but one can write a small class with a constructor and destructor [the WindowsSchedulerBooster sketch above] and wrap it in a Python binding, so that it can be called from Python tests. Alternatively, something similar can also be done directly via ctypes.

Cool, we should make an issue for that. For now, I made an additional PR that uses perf_counter, which is already used by most other tests.
