[Arm64] Planned JIT work in .NET 6 #43629

@echesakov


Background

In .NET 5, the .NET team made a non-trivial effort to bring Arm64 support to parity with x86. For example, we added 384 methods to System.Runtime.Intrinsics.Arm that let our customers use Advanced SIMD instructions on Arm64, optimized library code using these intrinsics, and made Arm64-targeted performance improvements in code generation.

In .NET 6 we will continue this effort. In particular, as part of .NET 6 planning, the JIT team identified the following items as our next short-term goals:

Conditional instructions/branch elimination

One example of such a transformation can be found in LLVM, which can transform cbz/cbnz/tbz/tbnz instructions into a conditional branch (b.cond). To see its effect, compare the output of the latest clang for the following C++ snippet

void TransformsIntoCondBr(int& op1, int& op2) {
    if (op1 & op2) {
        op1 = op2;
    } else {
        op2 = op1;
    }
}

with the optimization disabled
-O2 -mllvm -aarch64-enable-cond-br-tune=false

TransformsIntoCondBr(int&, int&):           // @TransformsIntoCondBr(int&, int&)
        ldr     w8, [x0]
        ldr     w9, [x1]
        and     w10, w9, w8
        cbz     w10, .LBB0_2
        str     w9, [x0]
        ret
.LBB0_2:
        str     w8, [x1]
        ret

and with the optimization enabled
-O2 -mllvm -aarch64-enable-cond-br-tune=true

TransformsIntoCondBr(int&, int&):           // @TransformsIntoCondBr(int&, int&)
        ldr     w8, [x0]
        ldr     w9, [x1]
        tst     w9, w8
        b.eq    .LBB0_2
        str     w9, [x0]
        ret
.LBB0_2:
        str     w8, [x1]
        ret

The sequence `and w10, w9, w8; cbz w10, .LBB0_2` has been replaced with `tst w9, w8; b.eq .LBB0_2`. Because tst only sets the condition flags and writes no destination register, register w10 is freed.
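The rewrite can be sketched as a peephole over a toy instruction list. This is a minimal illustration under stated assumptions: Op, Instr, isReadElsewhere and tuneCondBranches are hypothetical names, not LLVM's or the JIT's data structures, and the "is the register read anywhere else" check is a deliberately conservative stand-in for real liveness analysis.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model of an Arm64 instruction stream (hypothetical; not LLVM's or the JIT's IR).
// Convention: rn and rm are always source registers, rd is the destination.
enum class Op { Ldr, Str, And, Tst, Cbz, Cbnz, BEq, BNe, Ret, Label };

struct Instr {
    Op op;
    int rd = -1;        // destination register, -1 if none
    int rn = -1;        // source: base for ldr, value for str, tested reg for cbz
    int rm = -1;        // source: base for str, second operand for and/tst
    std::string label;  // branch target or label name
};

// Conservative stand-in for a liveness query: true if `reg` is read by any
// instruction other than the and/cbz pair starting at index `first`.
bool isReadElsewhere(const std::vector<Instr>& code, int reg, std::size_t first) {
    for (std::size_t i = 0; i < code.size(); ++i) {
        if (i == first || i == first + 1) continue;
        if (code[i].rn == reg || code[i].rm == reg) return true;
    }
    return false;
}

// Rewrites `and rd, rn, rm; cbz rd, L` into `tst rn, rm; b.eq L` (cbnz becomes b.ne).
// tst sets the flags without writing a register, so rd is freed. A real pass must
// also prove the condition flags are dead here, because tst overwrites NZCV.
int tuneCondBranches(std::vector<Instr>& code) {
    int rewrites = 0;
    for (std::size_t i = 0; i + 1 < code.size(); ++i) {
        Instr& a = code[i];
        Instr& b = code[i + 1];
        bool isCb = b.op == Op::Cbz || b.op == Op::Cbnz;
        if (a.op != Op::And || !isCb || b.rn != a.rd) continue;
        assert(!b.label.empty() && "cbz/cbnz needs a target label");
        if (isReadElsewhere(code, a.rd, i)) continue;  // rd may still be needed
        Op branch = (b.op == Op::Cbz) ? Op::BEq : Op::BNe;
        a = Instr{Op::Tst, -1, a.rn, a.rm, ""};
        b = Instr{branch, -1, -1, -1, b.label};
        ++rewrites;
    }
    return rewrites;
}
```

Fed the unoptimized sequence from the example above (with w8, w9, w10 numbered 8, 9, 10), tuneCondBranches turns the and/cbz pair into tst/b.eq, matching the optimized clang output; if w10 were read later, the pair would be left alone.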

The JIT team will research this optimization area and decide which optimizations can be implemented in .NET 6.

Some related issues:

Presumably, some parts of the analysis can be implemented in a platform-agnostic way and benefit both Arm64 and x86.
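A platform-agnostic piece of this could live in the IR rather than in emitted instructions. The sketch below is hypothetical: Kind, Node, leaf, binary and foldTestOfAnd are made-up names, not RyuJIT's GenTree API. It turns NE(AND(x, y), 0) into a single TEST_NE(x, y) node that each backend could then lower to a flag-setting test (tst on Arm64, test on x86) without materializing the AND result in a register.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical target-independent IR (not RyuJIT's GenTree).
enum class Kind { Local, Const, And, Eq, Ne, TestEq, TestNe };

struct Node {
    Kind kind = Kind::Local;
    long value = 0;  // local number or constant value
    std::unique_ptr<Node> op1;
    std::unique_ptr<Node> op2;
};
using NodePtr = std::unique_ptr<Node>;

NodePtr leaf(Kind k, long v) {
    NodePtr n = std::make_unique<Node>();
    n->kind = k;
    n->value = v;
    return n;
}

NodePtr binary(Kind k, NodePtr a, NodePtr b) {
    assert(a && b);
    NodePtr n = std::make_unique<Node>();
    n->kind = k;
    n->op1 = std::move(a);
    n->op2 = std::move(b);
    return n;
}

// EQ/NE(AND(x, y), 0)  =>  TEST_EQ/TEST_NE(x, y).
// Only the constant-on-the-right form is handled; a real pass would also
// canonicalize commuted operands.
bool foldTestOfAnd(Node& n) {
    if (n.kind != Kind::Eq && n.kind != Kind::Ne) return false;
    if (!n.op1 || !n.op2 || n.op1->kind != Kind::And) return false;
    if (n.op2->kind != Kind::Const || n.op2->value != 0) return false;
    NodePtr andNode = std::move(n.op1);
    n.kind = (n.kind == Kind::Eq) ? Kind::TestEq : Kind::TestNe;
    n.op1 = std::move(andNode->op1);
    n.op2 = std::move(andNode->op2);  // also frees the zero constant
    return true;
}
```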

Next steps:

  • Identify the optimizations and estimate their potential impact
  • Determine what could be implemented in a platform-agnostic way and implement that next
  • Implement Arm64 specific optimizations

Hardware Intrinsics on Arm64

  1. We need to address the known inefficiencies and suboptimal code generation:
  2. Implementing new APIs is also on the table. The following are some examples of the proposed work:

Atomic instructions

Currently, the JIT emits ARMv8.1-LSE atomic instructions in the following cases:

Another potential work item is supporting ARMv8.4-LSE atomic instructions in the JIT.
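For readers less familiar with LSE, the difference can be sketched in C++, used here only for illustration (in .NET the corresponding operations are Interlocked.Add and Interlocked.CompareExchange). Without LSE, an atomic add must be built from a load-exclusive/store-exclusive retry loop; LSE adds single read-modify-write instructions such as ldadd. The function names below are hypothetical, and the instruction sequences mentioned in the comments depend on the compiler version and flags (for example, some GCC configurations instead call a helper that selects LSE at run time).

```cpp
#include <atomic>
#include <cassert>

// Read-modify-write written as an explicit compare-and-swap retry loop. This has the
// same shape as an ARMv8.0 lowering, which has no single-instruction atomic add and
// must retry an exclusive load/store pair until the store succeeds.
unsigned addWithCasLoop(std::atomic<unsigned>& a, unsigned delta) {
    unsigned expected = a.load();
    while (!a.compare_exchange_weak(expected, expected + delta)) {
        // On failure, compare_exchange_weak reloaded the current value into `expected`.
    }
    return expected;  // the old value, same contract as fetch_add
}

// The same operation as a single read-modify-write. When compiling for a target with
// LSE (e.g. -march=armv8.1-a), clang and GCC typically emit one ldaddal instruction.
unsigned addWithFetchAdd(std::atomic<unsigned>& a, unsigned delta) {
    return a.fetch_add(delta);
}
```

Both functions return the previous value and leave the same result in memory; the LSE form simply removes the retry loop.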

Examples of Arm64-specific JIT backlog issues

Stretch goal

Note: All of the above peephole work items share a prerequisite: the codegen must be able to update a previously emitted instruction. There is no separate tracking issue for it; whichever optimization we implement first will have to do that infrastructure work.
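As a rough illustration of that prerequisite, the sketch below keeps emitted instructions in a buffer so the most recent one can still be patched in place. The peephole shown, merging two adjacent 64-bit ldr instructions into one ldp, is a hypothetical example chosen for illustration; MemOp, MemInstr and Emitter are made-up names, not the JIT's emitter API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical instruction model (not the JIT's real emitter data structures).
enum class MemOp { Ldr, Ldp, Other };

struct MemInstr {
    MemOp op;
    int rt1;              // first destination register
    int rt2;              // second destination register (ldp only, otherwise -1)
    int base;             // base address register
    std::int64_t offset;  // byte offset from base
};

class Emitter {
public:
    // Emits a 64-bit `ldr xRt, [xBase, #offset]`. If the previous instruction loads the
    // 8-byte slot just below it, that instruction is rewritten into an ldp instead.
    void emitLdr(int rt, int base, std::int64_t offset) {
        assert(rt >= 0 && rt < 32 && base >= 0 && base < 32);
        if (!buf_.empty()) {
            MemInstr& prev = buf_.back();
            bool pairable = prev.op == MemOp::Ldr
                && prev.base == base
                && offset == prev.offset + 8   // adjacent slots
                && prev.rt1 != base            // first load must not clobber the base
                && prev.rt1 != rt              // ldp with rt1 == rt2 is unpredictable
                && prev.offset % 8 == 0        // ldp immediate is scaled by 8
                && prev.offset >= -512 && prev.offset <= 504;
            if (pairable) {
                prev.op = MemOp::Ldp;  // update the already-emitted instruction in place
                prev.rt2 = rt;
                return;
            }
        }
        buf_.push_back(MemInstr{MemOp::Ldr, rt, -1, base, offset});
    }

    // Stands in for any other instruction or a label; acts as a barrier so that
    // loads are never paired across it.
    void emitOther() { buf_.push_back(MemInstr{MemOp::Other, -1, -1, -1, 0}); }

    const std::vector<MemInstr>& code() const { return buf_; }

private:
    std::vector<MemInstr> buf_;
};
```

The point of the sketch is the `prev` reference: the peephole only works if the previous instruction is still mutable rather than already encoded into the final code buffer.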

@dotnet/jit-contrib @TamarChristinaArm @tannergooding

category:planning
theme:planning
skill-level:expert
cost:large

Labels: Bottom Up Work, User Story, arch-arm64, area-CodeGen-coreclr

Status: Done
