Is your enhancement proposal related to a problem? Please describe.
Some benchmarks show Zephyr behind ThreadX in context swap performance.
Describe the solution you'd like
Avoid any branch-with-link (bl, i.e. function call) operations in PendSV handling on Arm. The same idea could likely be applied to other architectures.
Every bl op is a potential pipeline flush and certainly loses some context. In PendSV handling (used for Arm context swap) we call out to a C function handler almost immediately, and several other bl ops are involved depending on which options are enabled.
ThreadX avoids almost all bl ops; the only exception is an opt-in hook for swap in/swap out. Otherwise its PendSV handling is ~80 asm instructions. This clearly has performance implications somewhere, maybe partially due to the branch out of inline asm. There may be other causes too; this needs investigating.
https://github.com/eclipse-threadx/threadx/blob/master/ports_arch/ARMv7-M/threadx/gnu/src/tx_thread_schedule.S#L131
https://github.com/zephyrproject-rtos/zephyr/blob/main/arch/arm/core/cortex_m/swap_helper.S#L56
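To make the opt-in hook idea concrete, here is a minimal host-side C sketch. It is not Zephyr or ThreadX code: every name in it (`sketch_thread`, `sketch_kernel`, `sketch_swap`, `SKETCH_SWAP_HOOK`) is hypothetical, and it only models the control flow, not the register save/restore that the real PendSV handler does in assembly. The assumption is that the scheduler has already picked the next thread, so the swap path only reads a pointer and calls out only when the hook is compiled in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout: a host-side model, not Zephyr or ThreadX code. */

struct sketch_thread {
    int id;
};

struct sketch_kernel {
    struct sketch_thread *current; /* thread that is running now */
    struct sketch_thread *next;    /* picked ahead of time by the scheduler */
};

/*
 * Opt-in hook: only when SKETCH_SWAP_HOOK is defined does the swap path
 * contain a function call. Otherwise the macro expands to nothing.
 */
#ifdef SKETCH_SWAP_HOOK
void sketch_swap_hook(struct sketch_thread *out, struct sketch_thread *in);
#define SKETCH_CALL_HOOK(out, in) sketch_swap_hook((out), (in))
#else
#define SKETCH_CALL_HOOK(out, in) ((void)0)
#endif

/* Switch to the pre-selected thread and return the one switched out. */
static inline struct sketch_thread *sketch_swap(struct sketch_kernel *k)
{
    struct sketch_thread *out = k->current;

    assert(k->next != NULL);
    SKETCH_CALL_HOOK(out, k->next);
    k->current = k->next;
    return out;
}
```

With the hook left undefined, a compiler is free to inline `sketch_swap` completely, which is the property the proposal is after for the real handler. Whether that translates into a measurable gain on a Cortex-M4 is exactly the part that still needs investigating.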
Describe alternatives you've considered
Not doing anything
Additional context
Benchmark report showing the difference in context swap performance on a Cortex-M4:
https://www.dropbox.com/scl/fi/opimwfbvkd9coeprc7d5h/Beningo_RtosPerformance_2024_Report.pdf?rlkey=s3n007s6hgubnj37ovto88bs2&e=3&dl=0
In large part the difference is due to our default use of the MPU for hardware stack protection, but that isn't the only thing playing a part.