Kernel: Use word-sized writes in the generic memset implementation #26434
This decreases the boot time on my x86-64 host system from 15 s to 13 s for AArch64 QEMU TCG and from 33 s to 30 s for RISC-V QEMU TCG!

I additionally measured the performance of this new implementation with this simple benchmark:
https://gist.github.com/spholz/b06ea737b435ecc181069cf0d911faa4

Based on this benchmark, an unroll level of 8 seems like a good choice for all tested systems.

Here are the speedups for n=0x10000:
- Raspberry Pi 5: 7.6x (81984 ns -> 10748 ns)
- Raspberry Pi 4: 3.3x (131197 ns -> 39704 ns)
- StarFive VisionFive 2: 5.5x (279107 ns -> 50650 ns)
- AArch64 QEMU TCG: 6.8x (374287 ns -> 54847 ns)
- RISC-V QEMU TCG: 6.7x (354195 ns -> 52615 ns)
- x86-64 QEMU KVM: 3.8x (32443 ns -> 8542 ns)
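For readers who have not opened the diff, here is a minimal sketch of what "word-sized writes with an unroll level of 8" can look like. It is not the code from this PR: the names `Word`, `store_word` and `word_memset_sketch` are hypothetical, and the fixed-size `std::memcpy` store is a substitution made to keep the sketch free of aliasing problems in portable C++. A kernel implementation would more likely store through a word pointer directly.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical names for illustration; none of them come from the PR.
using Word = std::uintptr_t;

static inline void store_word(unsigned char* p, Word w)
{
    // A fixed-size memcpy is typically compiled to a single word store.
    // It is used here only to keep the sketch well-defined in portable C++.
    std::memcpy(p, &w, sizeof(Word));
}

void* word_memset_sketch(void* dest, int value, std::size_t n)
{
    auto* p = static_cast<unsigned char*>(dest);
    auto const byte = static_cast<unsigned char>(value);

    // Head: byte-sized writes until p is word-aligned (or n runs out).
    while (n > 0 && reinterpret_cast<std::uintptr_t>(p) % sizeof(Word) != 0) {
        *p++ = byte;
        --n;
    }

    // Replicate the byte into every byte of a word (0xAB -> 0xABAB...AB).
    Word const pattern = static_cast<Word>(-1) / 0xFF * byte;

    // Main loop: 8 word stores per iteration. The inner loop has a
    // constant trip count, so compilers typically unroll it fully.
    constexpr std::size_t unroll = 8;
    while (n >= unroll * sizeof(Word)) {
        for (std::size_t i = 0; i < unroll; ++i)
            store_word(p + i * sizeof(Word), pattern);
        p += unroll * sizeof(Word);
        n -= unroll * sizeof(Word);
    }

    // Remaining whole words, one at a time.
    while (n >= sizeof(Word)) {
        store_word(p, pattern);
        p += sizeof(Word);
        n -= sizeof(Word);
    }

    // Tail: leftover bytes that do not fill a word.
    while (n > 0) {
        *p++ = byte;
        --n;
    }

    return dest;
}
```

The split into an unaligned head, an unrolled aligned body and a byte tail is one common way to structure such a routine. The best unroll factor depends on the hardware, which is presumably why the benchmark above was run on several systems.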
The generic LibC memset also just uses byte-sized writes currently. Maybe we can deduplicate these two implementations and therefore also use this new implementation in userland as a follow-up.

(note on clang you can use the

Clang doesn't perform this "optimization", only GCC does. So this attribute doesn't help.
On x86 we already special case on whether the fast

That's why I said generic memset. We can still use this implementation in AArch64 and RISC-V userland. (We also have a

Playing around with godbolt I have found: