@spholz
Member

spholz commented Nov 27, 2025

This decreases the boot time on my x86-64 host system from 15 s to 13 s for AArch64 QEMU TCG and 33 s to 30 s for RISC-V QEMU TCG!

I additionally measured the performance of this new implementation with this simple benchmark:
https://gist.github.com/spholz/b06ea737b435ecc181069cf0d911faa4

Based on this benchmark, an unroll level of 8 seems like a good choice for all tested systems.

Here are the speedups for n=0x10000:

| System | Speedup | Old runtime | New runtime |
| --- | --- | --- | --- |
| Raspberry Pi 5 | 7.6 | 81984 ns | 10748 ns |
| Raspberry Pi 4 | 3.3 | 131197 ns | 39704 ns |
| StarFive VisionFive 2 | 5.5 | 279107 ns | 50650 ns |
| AArch64 QEMU TCG | 6.8 | 374287 ns | 54847 ns |
| RISC-V QEMU TCG | 6.7 | 354195 ns | 52615 ns |
| x86-64 QEMU KVM | 3.8 | 32443 ns | 8542 ns |
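
For readers without the diff at hand, here is a minimal sketch of the general technique described above: filling memory with word-sized stores in a loop unrolled 8 times. It is not the code from this PR. The name `wordwise_fill`, the 64-bit word size, and the `may_alias` typedef (a GCC/Clang extension) are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: a 64-bit word that is allowed to alias any object
// (GCC/Clang extension), so the wide stores below do not violate strict aliasing.
typedef uint64_t __attribute__((may_alias)) AliasingWord;

// Illustrative sketch only; not the implementation from the PR.
void* wordwise_fill(void* dest, int value, size_t count)
{
    auto* out = static_cast<unsigned char*>(dest);
    auto const byte = static_cast<unsigned char>(value);

    // Head: byte-sized stores until the pointer is word-aligned.
    while (count > 0 && reinterpret_cast<uintptr_t>(out) % sizeof(AliasingWord) != 0) {
        *out++ = byte;
        --count;
    }

    // Broadcast the fill byte into every byte of a 64-bit word.
    uint64_t const pattern = 0x0101010101010101ull * byte;
    auto* words = reinterpret_cast<AliasingWord*>(out);

    // Main loop: 8 word-sized stores per iteration (unroll level 8).
    while (count >= 8 * sizeof(AliasingWord)) {
        words[0] = pattern;
        words[1] = pattern;
        words[2] = pattern;
        words[3] = pattern;
        words[4] = pattern;
        words[5] = pattern;
        words[6] = pattern;
        words[7] = pattern;
        words += 8;
        count -= 8 * sizeof(AliasingWord);
    }

    // Remaining whole words, one at a time.
    while (count >= sizeof(AliasingWord)) {
        *words++ = pattern;
        count -= sizeof(AliasingWord);
    }

    // Tail: leftover bytes.
    out = reinterpret_cast<unsigned char*>(words);
    while (count > 0) {
        *out++ = byte;
        --count;
    }

    return dest;
}
```

With n=0x10000, a loop like this issues 0x10000 / 8 = 8192 word stores instead of 65536 byte stores, which is the kind of reduction the measured speedups would be consistent with.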

@github-actions github-actions bot added the 👀 pr-needs-review PR needs review from a maintainer or community member label Nov 27, 2025
@spholz
Member Author

spholz commented Nov 27, 2025

The generic LibC memset also just uses byte-sized writes currently. Maybe we can deduplicate these two implementations and therefore also use this new implementation in userland as a follow-up.

@Hendiadyoin1
Contributor

(Note: on Clang you can use the `no_builtin` attribute instead of the other options, or the more fine-grained `no_builtin("memset")`.)

@spholz
Member Author

spholz commented Nov 27, 2025

Clang doesn't perform this "optimization", only GCC does. So this attribute doesn't help.
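
To make the exchange above concrete, here is a hedged sketch of how a fill loop can be annotated so the compiler does not replace it with a call to `memset`. The Clang branch uses the `no_builtin("memset")` function attribute. The GCC branch uses the `optimize` attribute with `no-tree-loop-distribute-patterns`, which is an assumption about what "the other options" refers to; the PR text does not say. The macro `NO_MEMSET_IDIOM` and the function `byte_fill` are hypothetical names.

```cpp
#include <cstddef>

// Hypothetical macro, chosen per compiler. Both attributes exist in the
// respective compilers; which one the PR actually uses is not stated here.
#if defined(__clang__)
#    if __has_attribute(no_builtin)
#        define NO_MEMSET_IDIOM __attribute__((no_builtin("memset")))
#    else
#        define NO_MEMSET_IDIOM
#    endif
#elif defined(__GNUC__)
#    define NO_MEMSET_IDIOM __attribute__((optimize("no-tree-loop-distribute-patterns")))
#else
#    define NO_MEMSET_IDIOM
#endif

// A plain byte-wise fill loop. Without the annotation, an optimizer may
// recognize the loop and emit a call to memset, which is a problem when
// this function *is* the memset implementation.
NO_MEMSET_IDIOM void* byte_fill(void* dest, int value, size_t count)
{
    auto* bytes = static_cast<unsigned char*>(dest);
    for (size_t i = 0; i < count; ++i)
        bytes[i] = static_cast<unsigned char>(value);
    return dest;
}
```

According to the reply above, only GCC performed the transformation in the author's testing, so the Clang branch may be unnecessary in practice.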

@Hendiadyoin1
Contributor

> The generic LibC memset also just uses byte-sized writes currently. Maybe we can deduplicate these two implementations and therefore also use this new implementation in userland as a follow-up.

On x86 we already special-case on whether the fast stosb microcode is available, and use a handwritten SSE alternative if it is not.

@spholz
Member Author

spholz commented Nov 27, 2025

That's why I said generic memset. We can still use this implementation in AArch64 and RISC-V userland.

(We also have a rep stos* impl in the kernel)

@Hendiadyoin1
Contributor

Hendiadyoin1 commented Nov 27, 2025

Playing around with godbolt I have found:

  • Clang on x86 also wants to unroll 8 times, without needing to be asked to do so.
  • Making the alignment portion of the loop use a pre-calculated length gives mixed results, so looking at perf may be worth it (I hope my math is right):
    • x86 gets longer, and possibly worse, on both GCC and Clang
    • not sure about RISC-V (also same-y on Clang)
    • AArch64 is also odd
    • tested on https://godbolt.org/z/enzb7a3nW with Clang/GCC trunk on x86_64 + `-mno-sse`, riscv-64, and arm64/armv8a + `-mgeneral-regs-only`
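
As a reference for the "pre-calculated length" idea in the list above, here is a small sketch of one common way to compute how many leading bytes must be written before a pointer becomes aligned. It is not taken from the linked Godbolt snippet; the function name `bytes_until_aligned` and its parameters are hypothetical, and the alignment is assumed to be a power of two.

```cpp
#include <cstddef>
#include <cstdint>

// Number of bytes to write one at a time before `address` reaches the next
// multiple of `alignment`, capped at `count` so short fills never overrun.
// Assumes `alignment` is a power of two.
constexpr size_t bytes_until_aligned(uintptr_t address, size_t alignment, size_t count)
{
    size_t const misalignment = address & (alignment - 1);
    // The second mask maps the already-aligned case (alignment - 0) back to 0.
    size_t const head = (alignment - misalignment) & (alignment - 1);
    return head < count ? head : count;
}

// Worked examples for 8-byte alignment.
static_assert(bytes_until_aligned(0x1000, 8, 100) == 0, "already aligned");
static_assert(bytes_until_aligned(0x1003, 8, 100) == 5, "3 bytes past a boundary");
static_assert(bytes_until_aligned(0x1007, 8, 100) == 1, "1 byte before a boundary");
static_assert(bytes_until_aligned(0x1003, 8, 2) == 2, "capped by count");
```

The head loop can then run a fixed number of iterations instead of re-testing the pointer's alignment on every byte. Whether that is faster is exactly the open question raised above.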
