Conversation

@EgorBo
Member

@EgorBo EgorBo commented Nov 26, 2025

Enable on arm64 the x64 optimization where LCLHEAP is cleared via a STORE_BLK inserted in Lower.

static void Test128() => Consume(stackalloc char[128]);

was:

            stp     xzr, xzr, [sp, #-0x10]!
            stp     xzr, xzr, [sp, #-0xF0]!
            stp     xzr, xzr, [sp, #0x10]
            stp     xzr, xzr, [sp, #0x20]
            stp     xzr, xzr, [sp, #0x30]
            stp     xzr, xzr, [sp, #0x40]
            stp     xzr, xzr, [sp, #0x50]
            stp     xzr, xzr, [sp, #0x60]
            stp     xzr, xzr, [sp, #0x70]
            stp     xzr, xzr, [sp, #0x80]
            stp     xzr, xzr, [sp, #0x90]
            stp     xzr, xzr, [sp, #0xA0]
            stp     xzr, xzr, [sp, #0xB0]
            stp     xzr, xzr, [sp, #0xC0]
            stp     xzr, xzr, [sp, #0xD0]
            stp     xzr, xzr, [sp, #0xE0]

now:

            movi    v16.16b, #0
            stp     q16, q16, [x0]
            stp     q16, q16, [x0, #0x20]
            stp     q16, q16, [x0, #0x40]
            stp     q16, q16, [x0, #0x60]
            stp     q16, q16, [x0, #0x80]
            stp     q16, q16, [x0, #0xA0]
            stp     q16, q16, [x0, #0xC0]
            stp     q16, q16, [x0, #0xE0]

Also, for larger sizes the previous logic used to emit a slow loop (e.g. 1024 bytes):

            mov     w0, #0x400
G_M30953_IG03:
            stp     xzr, xzr, [sp, #-0x10]!
            subs    x0, x0, #16
            bne     G_M30953_IG03

Now it emits a call to CORINFO_HELP_MEMZERO instead.

Benchmarks.

using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public class Benchmarks
{
    [Benchmark] public void Stackalloc64() => Consume(stackalloc byte[64]);
    [Benchmark] public void Stackalloc128() => Consume(stackalloc byte[128]);
    [Benchmark] public void Stackalloc256() => Consume(stackalloc byte[256]);
    [Benchmark] public void Stackalloc512() => Consume(stackalloc byte[512]);
    [Benchmark] public void Stackalloc1024() => Consume(stackalloc byte[1024]);
    [Benchmark] public void Stackalloc16384() => Consume(stackalloc byte[16384]);

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Consume(Span<byte> x){}
}
Method           Toolchain        Mean       Error   Ratio
Stackalloc64     Main         3.425 ns   0.0004 ns    1.00
Stackalloc64     PR           2.559 ns   0.0008 ns    0.75
Stackalloc128    Main         3.999 ns   0.0002 ns    1.00
Stackalloc128    PR           2.404 ns   0.0003 ns    0.60
Stackalloc256    Main         5.431 ns   0.0005 ns    1.00
Stackalloc256    PR           2.754 ns   0.0003 ns    0.51
Stackalloc512    Main        12.661 ns   0.2744 ns    1.00
Stackalloc512    PR           7.423 ns   0.0008 ns    0.59
Stackalloc1024   Main        24.958 ns   0.5326 ns    1.00
Stackalloc1024   PR          14.031 ns   0.0040 ns    0.56
Stackalloc16384  Main       374.899 ns   0.0130 ns    1.00
Stackalloc16384  PR         111.029 ns   1.2123 ns    0.30

@github-actions github-actions bot added the needs-area-label label Nov 26, 2025
@EgorBo
Member Author

EgorBo commented Nov 26, 2025

@EgorBot -arm

using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public class Benchmarks
{
    [Benchmark] public void Stackalloc64() => Consume(stackalloc byte[64]);
    [Benchmark] public void Stackalloc128() => Consume(stackalloc byte[128]);
    [Benchmark] public void Stackalloc256() => Consume(stackalloc byte[256]);
    [Benchmark] public void Stackalloc512() => Consume(stackalloc byte[512]);
    [Benchmark] public void Stackalloc1024() => Consume(stackalloc byte[1024]);
    [Benchmark] public void Stackalloc16384() => Consume(stackalloc byte[16384]);

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Consume(Span<byte> x){}
}

@EgorBo EgorBo marked this pull request as ready for review November 26, 2025 13:01
Copilot AI review requested due to automatic review settings November 26, 2025 13:01
@EgorBo EgorBo added the area-CodeGen-coreclr label and removed the needs-area-label label Nov 26, 2025
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Contributor

Copilot AI left a comment

Pull request overview

This PR optimizes stackalloc zeroing on ARM64 by enabling the same STORE_BLK optimization that already exists for X64. When the allocation size is a constant, the lowering phase now takes responsibility for clearing memory via an unrolled STORE_BLK node, allowing the backend to skip loop-based zeroing and use more efficient SIMD instructions.

Key changes:

  • Enables Lower's STORE_BLK optimization for constant-sized stackalloc on ARM64
  • Introduces clearMemory local variable to track whether backend should clear memory
  • Updates register allocation and code generation to skip clearing when Lower handles it

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File                               Description
src/coreclr/jit/lower.cpp          Extends the constant-sized LCLHEAP optimization to TARGET_ARM64
src/coreclr/jit/lsraarm64.cpp      Updates register allocation to track when Lower handles memory clearing
src/coreclr/jit/codegenarm64.cpp   Updates code generation to skip clearing when Lower took responsibility

@jakobbotsch
Member

The superpmi-replay asserts look related.

@EgorBo EgorBo force-pushed the optimize-stackalloc-zeroing-arm64 branch from c61e795 to 8746f45 Compare November 26, 2025 20:41
@EgorBo
Member Author

EgorBo commented Nov 27, 2025

@jakobbotsch @dotnet/jit-contrib PTAL

Today, if the size is a constant and it is contained, then either the memory has already been cleared by GT_STORE_BLK or initMem is false. The size may be left uncontained if it is too big (GT_STORE_BLK is effectively limited to 4GB, while LCLHEAP accepts a size_t length) or if the LCLHEAP is unused (this could be handled by removing unused LCLHEAP nodes in Lower, but that is a separate issue).

It seems to be a clear win for all sizes (for 32 bytes or less we don't emit LCLHEAP at all and convert the allocation to locals instead).

@EgorBo EgorBo requested a review from jakobbotsch November 27, 2025 10:57
Member

@jakobbotsch jakobbotsch left a comment

LGTM beyond the nits

@EgorBo EgorBo enabled auto-merge (squash) November 27, 2025 13:30
@EgorBo EgorBo merged commit ffb52e9 into dotnet:main Nov 27, 2025
110 of 117 checks passed
@EgorBo EgorBo deleted the optimize-stackalloc-zeroing-arm64 branch November 27, 2025 15:48
@github-actions github-actions bot locked and limited conversation to collaborators Dec 28, 2025

Labels

area-CodeGen-coreclr, reduce-unsafe


2 participants