Optimize stackalloc zeroing on arm64 via STORE_BLK #121986

EgorBo · 2025-11-26T12:17:00Z

Enable on arm64 the x64 optimization where LCLHEAP is cleared via a STORE_BLK inserted in Lower.

static void Test128() => Consume(stackalloc char[128]);

was:

            stp     xzr, xzr, [sp, #-0x10]!
            stp     xzr, xzr, [sp, #-0xF0]!
            stp     xzr, xzr, [sp, #0x10]
            stp     xzr, xzr, [sp, #0x20]
            stp     xzr, xzr, [sp, #0x30]
            stp     xzr, xzr, [sp, #0x40]
            stp     xzr, xzr, [sp, #0x50]
            stp     xzr, xzr, [sp, #0x60]
            stp     xzr, xzr, [sp, #0x70]
            stp     xzr, xzr, [sp, #0x80]
            stp     xzr, xzr, [sp, #0x90]
            stp     xzr, xzr, [sp, #0xA0]
            stp     xzr, xzr, [sp, #0xB0]
            stp     xzr, xzr, [sp, #0xC0]
            stp     xzr, xzr, [sp, #0xD0]
            stp     xzr, xzr, [sp, #0xE0]

now:

            movi    v16.16b, #0
            stp     q16, q16, [x0]
            stp     q16, q16, [x0, #0x20]
            stp     q16, q16, [x0, #0x40]
            stp     q16, q16, [x0, #0x60]
            stp     q16, q16, [x0, #0x80]
            stp     q16, q16, [x0, #0xA0]
            stp     q16, q16, [x0, #0xC0]
            stp     q16, q16, [x0, #0xE0]
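
To put numbers on the two listings above, here is a small sketch that counts the stores needed to zero the 256-byte buffer. The helper name StoresNeeded and the constants are illustrative and not taken from the JIT; the per-instruction byte counts follow from the register widths (two 8-byte xzr stores versus two 16-byte q16 stores).

```csharp
using System;

// Bytes written per instruction, read off the two listings above.
const int BytesPerStpX = 16; // stp xzr, xzr: two 8-byte general-purpose registers
const int BytesPerStpQ = 32; // stp q16, q16: two 16-byte SIMD registers

// stackalloc char[128]: char is 2 bytes in .NET, so 256 bytes are zeroed.
int totalBytes = 128 * sizeof(char);

// Hypothetical helper: number of stores of a given width needed to cover a buffer.
static int StoresNeeded(int bytes, int bytesPerStore) =>
    (bytes + bytesPerStore - 1) / bytesPerStore;

Console.WriteLine(
    $"{totalBytes} bytes: {StoresNeeded(totalBytes, BytesPerStpX)} x stp xzr " +
    $"vs {StoresNeeded(totalBytes, BytesPerStpQ)} x stp q16");
// prints "256 bytes: 16 x stp xzr vs 8 x stp q16"
```

This matches the listings: 16 stp instructions before, 8 stp instructions plus one movi after.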

Also, for larger sizes (e.g. 1024 bytes) the previous logic emitted a slow loop:

            mov     w0, #0x400
G_M30953_IG03:
            stp     xzr, xzr, [sp, #-0x10]!
            subs    x0, x0, #16
            bne     G_M30953_IG03

Now it emits a call to CORINFO_HELP_MEMZERO instead.
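
For context, the zeroing discussed here is observable from C#. The following is a minimal sketch, assuming default compiler settings (no [SkipLocalsInit], so the method keeps its localsinit flag and the runtime zeroes stackalloc memory). AllZero is a hypothetical helper, and the two sizes are simply the ones used in the examples above; which code path the JIT picks for each depends on the platform and JIT version.

```csharp
using System;

// Hypothetical helper: true when every byte in the span is zero.
static bool AllZero(ReadOnlySpan<byte> span)
{
    foreach (byte b in span)
    {
        if (b != 0)
        {
            return false;
        }
    }
    return true;
}

// Assumption: compiled without [SkipLocalsInit], so both buffers are zero-initialized.
Span<byte> small = stackalloc byte[128];
Span<byte> large = stackalloc byte[1024];

Console.WriteLine($"small zeroed: {AllZero(small)}, large zeroed: {AllZero(large)}");
```

With [SkipLocalsInit] applied, the contents would be unspecified and no zeroing code would be needed, which is the initMem == false case.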

Benchmarks.

using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public class Benchmarks
{
    [Benchmark] public void Stackalloc64() => Consume(stackalloc byte[64]);
    [Benchmark] public void Stackalloc128() => Consume(stackalloc byte[128]);
    [Benchmark] public void Stackalloc256() => Consume(stackalloc byte[256]);
    [Benchmark] public void Stackalloc512() => Consume(stackalloc byte[512]);
    [Benchmark] public void Stackalloc1024() => Consume(stackalloc byte[1024]);
    [Benchmark] public void Stackalloc16384() => Consume(stackalloc byte[16384]);

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Consume(Span<byte> x){}
}

Method	Toolchain	Mean	Error	Ratio
Stackalloc64	Main	3.425 ns	0.0004 ns	1.00
Stackalloc64	PR	2.559 ns	0.0008 ns	0.75

Stackalloc128	Main	3.999 ns	0.0002 ns	1.00
Stackalloc128	PR	2.404 ns	0.0003 ns	0.60

Stackalloc256	Main	5.431 ns	0.0005 ns	1.00
Stackalloc256	PR	2.754 ns	0.0003 ns	0.51

Stackalloc512	Main	12.661 ns	0.2744 ns	1.00
Stackalloc512	PR	7.423 ns	0.0008 ns	0.59

Stackalloc1024	Main	24.958 ns	0.5326 ns	1.00
Stackalloc1024	PR	14.031 ns	0.0040 ns	0.56

Stackalloc16384	Main	374.899 ns	0.0130 ns	1.00
Stackalloc16384	PR	111.029 ns	1.2123 ns	0.30
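
As a sanity check on the table above, the Ratio column is the PR mean divided by the Main mean. A short sketch that recomputes it from the reported means; the tuple layout and the Ratio helper are illustrative only, and the numbers are copied from the table.

```csharp
using System;

// (benchmark, Main mean in ns, PR mean in ns, reported ratio), copied from the table above.
var rows = new (string Name, double Main, double Pr, double Reported)[]
{
    ("Stackalloc64",      3.425,   2.559, 0.75),
    ("Stackalloc128",     3.999,   2.404, 0.60),
    ("Stackalloc256",     5.431,   2.754, 0.51),
    ("Stackalloc512",    12.661,   7.423, 0.59),
    ("Stackalloc1024",   24.958,  14.031, 0.56),
    ("Stackalloc16384", 374.899, 111.029, 0.30),
};

// Ratio as reported by the table: PR mean relative to Main mean, two decimals.
static double Ratio(double main, double pr) => Math.Round(pr / main, 2);

foreach (var r in rows)
{
    Console.WriteLine(
        $"{r.Name}: ratio {Ratio(r.Main, r.Pr):F2} (reported {r.Reported:F2}), " +
        $"speedup {r.Main / r.Pr:F2}x");
}
```

The recomputed ratios agree with the reported ones; the largest gain is at 16384 bytes, where the PR takes roughly 30% of the original time.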

EgorBo · 2025-11-26T12:24:37Z

@EgorBot -arm

using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public class Benchmarks
{
    [Benchmark] public void Stackalloc64() => Consume(stackalloc byte[64]);
    [Benchmark] public void Stackalloc128() => Consume(stackalloc byte[128]);
    [Benchmark] public void Stackalloc256() => Consume(stackalloc byte[256]);
    [Benchmark] public void Stackalloc512() => Consume(stackalloc byte[512]);
    [Benchmark] public void Stackalloc1024() => Consume(stackalloc byte[1024]);
    [Benchmark] public void Stackalloc16384() => Consume(stackalloc byte[16384]);

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Consume(Span<byte> x){}
}

dotnet-policy-service · 2025-11-26T13:02:30Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Copilot

Pull request overview

This PR optimizes stackalloc zeroing on ARM64 by enabling the same STORE_BLK optimization that already exists for X64. When the allocation size is a constant, the lowering phase now takes responsibility for clearing memory via an unrolled STORE_BLK node, allowing the backend to skip loop-based zeroing and use more efficient SIMD instructions.

Key changes:

Enables Lower's STORE_BLK optimization for constant-sized stackalloc on ARM64
Introduces clearMemory local variable to track whether backend should clear memory
Updates register allocation and code generation to skip clearing when Lower handles it

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
src/coreclr/jit/lower.cpp	Extends the constant-sized LCLHEAP optimization to TARGET_ARM64
src/coreclr/jit/lsraarm64.cpp	Updates register allocation to track when Lower handles memory clearing
src/coreclr/jit/codegenarm64.cpp	Updates code generation to skip clearing when Lower took responsibility

jakobbotsch · 2025-11-26T15:31:48Z

The superpmi-replay asserts look related

EgorBo · 2025-11-27T10:56:58Z

@jakobbotsch @dotnet/jit-contrib PTAL

So today, if the size is a constant and it is contained, the memory has either already been cleared by GT_STORE_BLK or initMem is false. The size may be left uncontained if it is too big (GT_STORE_BLK is effectively limited to 4GB, while LCLHEAP accepts a size_t length) or if the LCLHEAP is unused (this could be handled by removing unused LCLHEAP nodes in Lower, but that is a separate issue).
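
One way to read the invariant above, as a sketch rather than actual JIT code: the backend only needs to zero the allocation itself when the size was not contained and initMem is set. BackendMustClear and both parameter names are hypothetical; the real logic lives in the JIT's C++ sources (codegenarm64.cpp, lsraarm64.cpp) and may differ in detail.

```csharp
using System;

// Hypothetical model of the invariant described above; not JIT source code.
// Returns true when codegen still has to zero the LCLHEAP memory itself.
static bool BackendMustClear(bool initMem, bool sizeIsContainedConstant)
{
    // Contained constant size: Lower either already inserted a zeroing
    // STORE_BLK, or zeroing was never required (initMem is false).
    if (sizeIsContainedConstant)
    {
        return false;
    }

    // Not contained (non-constant, too big, or unused): follow initMem.
    return initMem;
}

foreach (bool initMem in new[] { false, true })
{
    foreach (bool contained in new[] { false, true })
    {
        Console.WriteLine(
            $"initMem={initMem}, containedConst={contained} -> " +
            $"backend clears: {BackendMustClear(initMem, contained)}");
    }
}
```

Under this reading, only the combination initMem == true with an uncontained size leaves the zeroing to the backend.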

For all sizes it seems to be a clear win (for 32 bytes and less we don't emit LCLHEAP and convert it to locals instead).

jakobbotsch

LGTM beyond the nits

Co-authored-by: Jakob Botsch Nielsen <[email protected]>

github-actions bot added the needs-area-label label Nov 26, 2025

dotnet-policy-service bot assigned EgorBo Nov 26, 2025

EgorBot mentioned this pull request Nov 26, 2025

Benchmarks for #121986 (EgorBo) EgorBot/runtime-utils#553 (open)

EgorBo marked this pull request as ready for review November 26, 2025 13:01

Copilot AI review requested due to automatic review settings November 26, 2025 13:01

EgorBo added the area-CodeGen-coreclr label and removed the needs-area-label label Nov 26, 2025

Copilot started reviewing on behalf of EgorBo November 26, 2025 13:02

Copilot finished reviewing on behalf of EgorBo November 26, 2025 13:03

Copilot AI reviewed Nov 26, 2025

src/coreclr/jit/lsraarm64.cpp (outdated, resolved)

src/coreclr/jit/codegenarm64.cpp (outdated, resolved)

EgorBo force-pushed the optimize-stackalloc-zeroing-arm64 branch from c61e795 to 8746f45 Compare November 26, 2025 20:41

fix issues

006fe15

EgorBo force-pushed the optimize-stackalloc-zeroing-arm64 branch from 8746f45 to 006fe15 Compare November 27, 2025 00:11

build-analysis bot mentioned this pull request Nov 27, 2025

[android] Android.Device_Emulator.JIT.Test failing on emulators with CoreCLR #112633 (open)

EgorBo requested a review from jakobbotsch November 27, 2025 10:57

jakobbotsch reviewed Nov 27, 2025

src/coreclr/jit/codegenarm64.cpp (outdated, resolved)

jakobbotsch reviewed Nov 27, 2025

src/coreclr/jit/codegenarm64.cpp (outdated, resolved)

jakobbotsch approved these changes Nov 27, 2025

EgorBo and others added 2 commits November 27, 2025 13:20

Apply suggestions from code review

5ee2091

Co-authored-by: Jakob Botsch Nielsen <[email protected]>

Update lsraarm64.cpp

984d942

EgorBo enabled auto-merge (squash) November 27, 2025 13:30

EgorBo merged commit ffb52e9 into dotnet:main Nov 27, 2025
110 of 117 checks passed

build-analysis bot mentioned this pull request Nov 27, 2025

AF: *(_UNCHECKED_OBJECTREF *)handle == NULL (HndCreateHandle called by getJitHandleForObject) #117138 (open)

EgorBo deleted the optimize-stackalloc-zeroing-arm64 branch November 27, 2025 15:48

dotnet-maestro bot mentioned this pull request Nov 28, 2025

[main] Source code updates from dotnet/runtime dotnet/dotnet#3448 (merged)

github-actions bot locked and limited conversation to collaborators Dec 28, 2025

EgorBo added the reduce-unsafe label Jan 5, 2026
