Kernel: Use word-sized writes in the generic memset implementation #26434

spholz · 2025-11-27T12:03:27Z

This decreases the boot time on my x86-64 host system from 15 s to 13 s for AArch64 QEMU TCG and 33 s to 30 s for RISC-V QEMU TCG!

I additionally measured the performance of this new implementation with this simple benchmark:
https://gist.github.com/spholz/b06ea737b435ecc181069cf0d911faa4

Based on this benchmark, an unroll level of 8 seems like a good choice for all tested systems.

Here are the speedups for n=0x10000:

System	Speedup	Old runtime	New runtime
Raspberry Pi 5	7.6	81984 ns	10748 ns
Raspberry Pi 4	3.3	131197 ns	39704 ns
StarFive VisionFive 2	5.5	279107 ns	50650 ns
AArch64 QEMU TCG	6.8	374287 ns	54847 ns
RISC-V QEMU TCG	6.7	354195 ns	52615 ns
x86-64 QEMU KVM	3.8	32443 ns	8542 ns
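
For readers who don't have the diff open, the general technique can be sketched as follows. This is a minimal illustration, not the code from this PR: the name sketch_memset is hypothetical, the may_alias typedef is a GCC/Clang extension, and the head/body/tail structure is an assumption about how such an implementation is typically laid out.

```cpp
#include <cstddef>
#include <cstdint>

// GCC/Clang extension: a word type that is allowed to alias any other type.
typedef std::size_t __attribute__((__may_alias__)) word_t;

// Hypothetical name; a sketch of a word-sized fill, not the PR's implementation.
void* sketch_memset(void* dest, int c, std::size_t n)
{
    auto* bytes = static_cast<unsigned char*>(dest);
    auto const value = static_cast<unsigned char>(c);

    // Head: byte writes until the pointer is word-aligned (or n runs out).
    while (n > 0 && reinterpret_cast<std::uintptr_t>(bytes) % sizeof(word_t) != 0) {
        *bytes++ = value;
        --n;
    }

    // Broadcast the byte into every byte of a word (0xAB -> 0xABAB...AB).
    std::size_t const word = (~static_cast<std::size_t>(0) / 0xff) * value;
    auto* words = reinterpret_cast<word_t*>(bytes);

    // Body: 8 word-sized writes per iteration (the unroll level chosen above).
    constexpr std::size_t unroll = 8;
    while (n >= unroll * sizeof(word_t)) {
        words[0] = word;
        words[1] = word;
        words[2] = word;
        words[3] = word;
        words[4] = word;
        words[5] = word;
        words[6] = word;
        words[7] = word;
        words += unroll;
        n -= unroll * sizeof(word_t);
    }

    // Remaining whole words, one at a time.
    while (n >= sizeof(word_t)) {
        *words++ = word;
        n -= sizeof(word_t);
    }

    // Tail: leftover bytes.
    bytes = reinterpret_cast<unsigned char*>(words);
    while (n > 0) {
        *bytes++ = value;
        --n;
    }
    return dest;
}
```

Because this sketch is not itself named memset, it is harmless if the compiler turns one of its loops into a memset call; for the real memset that transformation has to be prevented.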

spholz · 2025-11-27T12:13:43Z

The generic LibC memset also just uses byte-sized writes currently. Maybe we can deduplicate these two implementations and therefore also use this new implementation in userland as a follow-up.

Hendiadyoin1 · 2025-11-27T12:14:39Z

(Note: on Clang you can use the no_builtin attribute instead of the other options, or the more fine-grained no_builtin("memset").)

spholz · 2025-11-27T12:16:37Z

Clang doesn't perform this "optimization", only GCC does. So this attribute doesn't help.
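
As a rough illustration of the two per-function mechanisms being discussed, here is a sketch. It assumes the GCC transformation in question is loop-pattern distribution (-ftree-loop-distribute-patterns); the thread does not show which options the PR actually uses. The macro NO_MEMSET_IDIOM and the function byte_fill are hypothetical names.

```cpp
#include <cstddef>

// Clang also defines __GNUC__, so it has to be checked first.
#if defined(__clang__)
// Clang: don't treat memset as a builtin inside the annotated function.
#    define NO_MEMSET_IDIOM __attribute__((no_builtin("memset")))
#elif defined(__GNUC__)
// GCC: turn off the pass that can replace store loops with library calls
// (assumed here to be the relevant one).
#    define NO_MEMSET_IDIOM __attribute__((__optimize__("-fno-tree-loop-distribute-patterns")))
#else
#    define NO_MEMSET_IDIOM
#endif

// Hypothetical stand-in for a memset implementation that must not be
// compiled into a call to memset.
NO_MEMSET_IDIOM void* byte_fill(void* dest, int c, std::size_t n)
{
    auto* bytes = static_cast<unsigned char*>(dest);
    for (std::size_t i = 0; i < n; ++i)
        bytes[i] = static_cast<unsigned char>(c);
    return dest;
}
```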

Hendiadyoin1 · 2025-11-27T12:20:29Z

> The generic LibC memset also just uses byte-sized writes currently. Maybe we can deduplicate these two implementations and therefore also use this new implementation in userland as a follow-up.

On x86 we already special-case on whether the fast stosb microcode is available, and use a handwritten SSE alternative if it isn't.

spholz · 2025-11-27T12:26:51Z

That's why I said generic memset. We can still use this implementation in AArch64 and RISC-V userland.

(We also have a rep stos* impl in the kernel)
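
For context, a rep stos-based fill looks roughly like the sketch below. It is not the kernel's or LibC's actual code: stos_memset is a hypothetical name, the CPU feature check mentioned above is omitted because the thread doesn't show it, and the non-x86 branch exists only so the example builds everywhere.

```cpp
#include <cstddef>

// Hypothetical name; a sketch of a rep stosb fill using GCC/Clang inline asm.
void* stos_memset(void* dest, int c, std::size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    void* cursor = dest;
    // rep stosb stores AL to [RDI/EDI] and advances it, RCX/ECX times.
    // The direction flag is assumed to be clear, as the ABI requires.
    asm volatile("rep stosb"
                 : "+D"(cursor), "+c"(n)
                 : "a"(c)
                 : "memory");
#else
    // Portable fallback for other architectures.
    auto* bytes = static_cast<unsigned char*>(dest);
    for (std::size_t i = 0; i < n; ++i)
        bytes[i] = static_cast<unsigned char>(c);
#endif
    return dest;
}
```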

Hendiadyoin1 · 2025-11-27T12:56:36Z

Playing around with godbolt, I found:

- Clang on x86 also wants to unroll 8 times, without needing to be asked to do so.
- Making the alignment portion of the loop use a precalculated length gives mixed results, so maybe looking at perf is worth it (I hope my math is right):
  - x86 gets longer, possibly worse, on both GCC and Clang
  - not sure about RISC-V (also same-y on Clang)
  - AArch64 is also odd
- Tested on https://godbolt.org/z/enzb7a3nW with Clang/GCC trunk on x86_64 + -mno-sse, riscv-64, and arm64/armv8a + -mgeneral-regs-only
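
The "precalculated length" idea can be sketched like this: instead of testing the pointer's alignment on every byte of the head loop, compute once how many bytes are needed to reach alignment. This is only an illustration of the arithmetic, not the code behind the godbolt link; head_length is a hypothetical helper, and the alignment is assumed to be a power of two.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of byte writes needed before `p` reaches
// `alignment` (must be a power of two), capped at the total length n.
std::size_t head_length(void const* p, std::size_t n, std::size_t alignment = sizeof(std::size_t))
{
    auto const address = reinterpret_cast<std::uintptr_t>(p);
    // (-address) mod alignment; unsigned wraparound makes this well-defined,
    // and a power-of-two alignment turns the modulo into a mask.
    auto const misalignment = static_cast<std::size_t>((0 - address) & (alignment - 1));
    return misalignment < n ? misalignment : n;
}
```

A memset would then write head_length(dest, n) bytes one at a time, followed by word-sized writes for the rest. Whether that beats checking alignment in the loop condition is the open question raised above.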

github-actions bot added the 👀 pr-needs-review PR needs review from a maintainer or community member label Nov 27, 2025
