Optimize stackalloc zeroing on arm64 via STORE_BLK #121986
Conversation
@EgorBot -arm

```cs
using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public class Benchmarks
{
    [Benchmark] public void Stackalloc64() => Consume(stackalloc byte[64]);
    [Benchmark] public void Stackalloc128() => Consume(stackalloc byte[128]);
    [Benchmark] public void Stackalloc256() => Consume(stackalloc byte[256]);
    [Benchmark] public void Stackalloc512() => Consume(stackalloc byte[512]);
    [Benchmark] public void Stackalloc1024() => Consume(stackalloc byte[1024]);
    [Benchmark] public void Stackalloc16384() => Consume(stackalloc byte[16384]);

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Consume(Span<byte> x) { }
}
```
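For local runs outside of EgorBot, a minimal BenchmarkDotNet runner could look like the sketch below. This is a hypothetical setup (the Program class and project wiring are assumptions, not part of the PR); it simply dispatches to the Benchmarks class from the command above.

```cs
using BenchmarkDotNet.Running;

public static class Program
{
    // Dispatches to the [Benchmark] methods in the Benchmarks class above.
    // Run on an arm64 machine against a baseline runtime and a PR-built
    // runtime to compare the stackalloc zeroing behavior.
    public static void Main(string[] args) =>
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
}
```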
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Pull request overview
This PR optimizes stackalloc zeroing on ARM64 by enabling the same STORE_BLK optimization that already exists for X64. When the allocation size is a constant, the lowering phase now takes responsibility for clearing memory via an unrolled STORE_BLK node, allowing the backend to skip loop-based zeroing and use more efficient SIMD instructions.
Key changes:
- Enables Lower's STORE_BLK optimization for constant-sized stackalloc on ARM64
- Introduces a clearMemory local variable to track whether the backend should clear memory
- Updates register allocation and code generation to skip clearing when Lower handles it
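To make the "constant size" condition concrete, here is a minimal, hypothetical C# sketch (names are illustrative and not from the PR) contrasting a stackalloc that Lower can hand off to an unrolled STORE_BLK with a variable-sized one that it cannot:

```cs
using System;
using System.Runtime.CompilerServices;

public static class StackallocZeroingExamples
{
    // Constant-size stackalloc: per this PR, Lower can take over the zeroing
    // via an unrolled STORE_BLK, which the arm64 backend can emit as wide
    // SIMD stores. (Illustrative only; actual codegen depends on the size and
    // JIT heuristics.)
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int ConstantSize()
    {
        Span<byte> buffer = stackalloc byte[256];
        return buffer[0];
    }

    // Variable-size stackalloc: the size is not a JIT-time constant, so this
    // path is not covered by the STORE_BLK optimization.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int VariableSize(int n)
    {
        Span<byte> buffer = stackalloc byte[n];
        return buffer.Length > 0 ? buffer[0] : 0;
    }
}
```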
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/coreclr/jit/lower.cpp | Extends the constant-sized LCLHEAP optimization to TARGET_ARM64 |
| src/coreclr/jit/lsraarm64.cpp | Updates register allocation to track when Lower handles memory clearing |
| src/coreclr/jit/codegenarm64.cpp | Updates code generation to skip clearing when Lower took responsibility |
The superpmi-replay asserts look related.
@jakobbotsch @dotnet/jit-contrib PTAL

So today, if the Size is a constant and it's contained, it means it's either already cleared by GT_STORE_BLK or initMem is false. It may not be contained if it's too big (GT_STORE_BLK is effectively limited to 4GB, while LCLHEAP accepts larger sizes).

For all sizes it seems to be a clear win (for 32 bytes and less we don't emit LCLHEAP at all and convert it to locals instead).
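As background for the "initMem is false" case: when a method is marked with SkipLocalsInit, the .locals init flag is dropped, so the JIT does not zero the stackalloc'd memory at all and none of this zeroing codegen applies. A minimal sketch (hypothetical example; the attribute requires AllowUnsafeBlocks in the project):

```cs
using System;
using System.Runtime.CompilerServices;

public static class SkipInitExample
{
    // With SkipLocalsInit the method loses the `.locals init` flag, so the
    // JIT treats initMem as false for the LCLHEAP and emits no zeroing.
    // The buffer contents are undefined until explicitly written.
    [SkipLocalsInit]
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void FillWithoutZeroing()
    {
        Span<byte> buffer = stackalloc byte[128];
        buffer.Fill(0xFF); // caller is responsible for initializing the memory
    }
}
```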
jakobbotsch left a comment
LGTM beyond the nits
Co-authored-by: Jakob Botsch Nielsen <[email protected]>
Enable X64's optimization, where we clear LCLHEAP via a STORE_BLK inserted in Lower, on arm64.

Also, for larger sizes (e.g. 1024 bytes) the previous logic used to emit a slow zeroing loop; now it will emit a call to CORINFO_HELP_MEMZERO.
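A hypothetical illustration of that larger-size case: with a 1024-byte constant allocation, the zeroing is now expected to go through the CORINFO_HELP_MEMZERO helper rather than an inline loop (the method name and the inspection hint are assumptions, not from the PR):

```cs
using System;
using System.Runtime.CompilerServices;

public static class LargeStackalloc
{
    // Per the PR description, for a constant allocation of this size the
    // zeroing is emitted as a call to the CORINFO_HELP_MEMZERO helper instead
    // of an inline loop. (Illustration only; to inspect the actual codegen,
    // one option is the JIT disassembly switch, e.g. DOTNET_JitDisasm=Zero1024.)
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static byte Zero1024()
    {
        Span<byte> buffer = stackalloc byte[1024];
        return buffer[512]; // always 0: the allocation is zero-initialized
    }
}
```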
Benchmarks: