Background
In .NET 5, the .NET team made a non-trivial effort to bring parity between Arm64 and x86 platform support. For example, we added 384 methods to System.Runtime.Intrinsics.Arm, allowing our customers to use Advanced SIMD instructions on Arm64, optimized libraries code using these intrinsics, and made Arm64-targeted performance improvements in the codegen.
In .NET 6 we will continue this effort. In particular, as part of .NET 6 planning, the JIT team identified the following items as our next short-term goals:
Conditional instructions/branch elimination
One example of such a code transformation can be found in LLVM, which turns cbz/cbnz/tbz/tbnz instructions into a conditional branch (b.cond). For example, you can compare the outputs of the latest clang compiling the following C++ snippet
```cpp
void TransformsIntoCondBr(int& op1, int& op2) {
    if (op1 & op2) {
        op1 = op2;
    } else {
        op2 = op1;
    }
}
```

with the optimization disabled (`-O2 -mllvm -aarch64-enable-cond-br-tune=false`):
```asm
TransformsIntoCondBr(int&, int&):        // @TransformsIntoCondBr(int&, int&)
        ldr     w8, [x0]
        ldr     w9, [x1]
        and     w10, w9, w8
        cbz     w10, .LBB0_2
        str     w9, [x0]
        ret
.LBB0_2:
        str     w8, [x1]
        ret
```

and with the optimization enabled (`-O2 -mllvm -aarch64-enable-cond-br-tune=true`):
```asm
TransformsIntoCondBr(int&, int&):        // @TransformsIntoCondBr(int&, int&)
        ldr     w8, [x0]
        ldr     w9, [x1]
        tst     w9, w8
        b.eq    .LBB0_2
        str     w9, [x0]
        ret
.LBB0_2:
        str     w8, [x1]
        ret
```

Note that `and w10, w9, w8; cbz w10, .LBB0_2` has been replaced with `tst w9, w8; b.eq .LBB0_2`, which frees up the w10 register.
The JIT team will research this optimization area and decide which optimizations can be implemented in .NET 6.
Some related issues:
- RyuJit: avoid conditional jumps using cmov and similar instructions #6749
- Branchless code generation for ternaries #32632 (see the sketch below)
- RyuJIT: Optimize "X / POW2_CNS" via cmovns #41549
- [RyuJIT][arm64] Optimize "x<0" and "x>=0" #43440
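A minimal C# sketch of the kind of pattern #32632 targets (the method name is illustrative only): today the ternary below compiles to a compare and branch on Arm64, but it could be lowered to a `cmp` followed by a single `csel`.

```csharp
// Sketch: a branchy ternary that could be emitted branchlessly, e.g.
// "cmp w0, w1; csel w0, w0, w1, gt" instead of a compare-and-branch sequence.
static int Max(int x, int y) => x > y ? x : y;
```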
Presumably, some parts of the analysis can be implemented in a platform-agnostic way and benefit both the Arm64 and x86 platforms.
Next steps:
- Identify the optimizations and estimate their potential impact
- Determine what can be implemented in a platform-agnostic way and do that as a next step
- Implement Arm64 specific optimizations
Hardware Intrinsics on Arm64
- We need to address the known inefficiencies/suboptimal code generation:
  - Consider optimizing more intrinsics that have move/copy semantics #40489, as was done in 833aaba. **Stretch Goal**
  - Investigate unnecessary vector register moves #33975 around helper calls. As one potential solution, we can implement custom helpers in assembly whose calling convention guarantees that the upper 64 bits of SIMD registers are not altered (#33975 (comment)). **Stretch Goal**
  - JIT Hardware Intrinsics low compiler throughput due to inefficient intrinsic identification algorithm during import phase #13617. **Stretch Goal**
  - Use `cmeq`, `cmge`, `cmgt` (zero) when one of the operands is `Vector64/128<T>.Zero` #33972 (see the sketch after this list). **Stretch Goal**
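To illustrate the last item, here is a small sketch against the existing `AdvSimd` surface (the helper name is illustrative): with #33972 the compare-against-zero below can use the immediate form of `cmeq` instead of materializing a zero vector in a register first.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

// With #33972 this can be emitted as "cmeq v0.4s, v0.4s, #0"
// rather than zeroing a second vector register and using the
// register-register form of cmeq.
static Vector128<int> IsZero(Vector128<int> v) =>
    AdvSimd.CompareEqual(v, Vector128<int>.Zero);
```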
- Implementation of new APIs is also on the table. The following are some instances of the proposed work:
  - API Proposal: Arm `TableVectorLookup` and `TableVectorExtension` intrinsics (multiple register) #1277. **Moved to 7.0**
  - [Arm64] `LoadPairVector64` and `LoadPairVector128` #39243 (a hypothetical shape sketch follows this list). **Moved to 7.0** [Arm64] AdvSIMD LoadPairVector64 and LoadPairVector128 #45020 superseded by [Arm64] Implement LoadPairVector64 and LoadPairVector128 #52424
  - [Arm64] `MultiplyHigh` #43106. Closed by [Arm64] Implement MultiplyHigh #47362
  - `LoadVector64`, `LoadVector128`, and `Store` with multiple registers (`ld1`-`ld4`, `st1`-`st4`). Note that implementing these and the multiple-register variants of the `TableVectorLookup` and `TableVectorExtension` intrinsics would require extensive changes to LSRA. **Moved to 7.0**
- We are considering implementing hardware intrinsics for the `ARMv8.3-CompNum`, `ARMv8.2-I8MM`, and `ARMv8.2-FP16` ISAs. **Stretch Goal**
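For reference, a purely hypothetical sketch of what a load-pair API shape could look like; the actual surface is whatever the linked proposals settle on, and the body below is just a scalar-equivalent placeholder for what a single `ldp q0, q1, [x0]` would do.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

// Hypothetical shape only; the real API is defined in #39243 / #52424.
// Intent: one "ldp q0, q1, [x0]" fills both elements of the returned tuple.
static unsafe (Vector128<byte> Value1, Vector128<byte> Value2) LoadPairVector128(byte* address) =>
    (AdvSimd.LoadVector128(address), AdvSimd.LoadVector128(address + 16));
```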
Atomic instructions
Currently, JIT emits ARMv8.1-LSE atomic instructions in the following cases:
- `CodeGen::genLockedInstructions(GenTreeOp* treeNode)` for the `Interlocked.Add` and `Interlocked.Exchange` methods
- `CodeGen::genCodeForCmpXchg(GenTreeCmpXchg* treeNode)` for the `Interlocked.CompareExchange` method
- There is a proposal to extend this functionality to other methods of the `Interlocked` class: Consider making Interlocked.And and friends into JIT intrinsics #32239. As per #32239 (comment), there is potential to generate better code by using Armv8.1 (see the example below). Closed by [RyuJIT] Implement Interlocked.And and Interlocked.Or for arm64-v8.1 #46253. Thank you @EgorBo!
Another potential work item is to support ARMv8.4-LSE atomic instructions in the JIT.
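To make the `Interlocked` item concrete, a small example using the existing API (the method name is illustrative): on Armv8.1 hardware, #46253 allows the JIT to emit a single LSE atomic instead of a load-exclusive/store-exclusive retry loop.

```csharp
using System.Threading;

// With #46253, on Armv8.1 this lowers to one LSE atomic (an ldclr-style
// instruction) rather than a ldaxr/stlxr loop; the return value is the
// original flags value, as documented for Interlocked.And.
static int ClearFlags(ref int flags, int mask) =>
    Interlocked.And(ref flags, ~mask);
```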
Examples of Arm64 specific JIT backlog issues
- In .NET 5 we implemented a stack probing procedure on all platforms except Arm64 (Implement stack probing using helpers coreclr#26807 and coreclr#27184). This solved some of the issues with stack unwinding (Cannot unwind stack when stack probing hits the stack limit on Unix #11495) and allowed us to implement Display stack trace at stack overflow #32167. In .NET 6 we should close the gap on Arm64 and address [Arm64] Implement stack probing using helper #13519. **Moved to 7.0** Blocked by [Arm64] Extend Compiler::lvaFrameAddress() and JIT to allow using SP as base register #47810
- We saw a huge improvement in .NET 5 from Ben's work in Use xmm for stack prolog zeroing rather than rep stos #32538. We should consider implementing a similar idea and using SIMD registers for prolog zeroing on Arm64. We can exploit the fact that the AdvSimd `st1` instruction can store up to four 128-bit SIMD registers to memory, effectively writing up to 64 bytes of zeroes in one instruction (a small illustration follows this list). This work is tracked in [Arm64] Use stp and str (SIMD) for stack prolog zeroing #43789. Closed by [Arm64] Use SIMD register to zero init frame #46609
- Use stp (SIMD) in `genCodeForInitBlkUnroll` and `genCodeForCpBlkUnroll` ([Arm64] Use stp (SIMD) in genCodeForInitBlkUnroll and genCodeForCpBlkUnroll #48934). **Stretch Goal**
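A small illustration of the prolog-zeroing item above, assuming the default `initlocals` behavior where the stack-allocated buffer is zeroed in the prolog: SIMD stores can clear the 64-byte frame slot in one or two instructions instead of a run of `str xzr` stores.

```csharp
using System;

// The 64-byte local below is zero-initialized in the prolog; with SIMD
// zeroing the JIT can use q-register stores (e.g. a pair of "stp q0, q0")
// instead of eight consecutive "str xzr" instructions.
static byte FirstByte()
{
    Span<byte> buffer = stackalloc byte[64];
    return buffer[0];
}
```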
Stretch goal
- Peephole optimization opportunities (a minimal motivating example follows these items):
  - ARM64: Optimize redundant memory loads with mov #35141
  - ARM64: Redundant load/stores for methods that operates/returns structs #35071
  - ARM64: Optimize pair of "str wzr, [reg]" to "str xzr" #35136
  - ARM64: Optimize pair of "str reg, [fp]" to stp #35134
  - ARM64: Optimize pair of "str reg, [reg]" to stp #35133
  - ARM64: Optimize pair of "ldr reg, [reg]" to ldp #35132
  - ARM64: Optimize pair of "ldr reg, [fp]" to ldp #35130
  - ARM64: Redundant movs done for zero extend the register #35254
- As a follow-up to [RyuJIT] Implement Interlocked.And and Interlocked.Or for arm64-v8.1 #46253, we should measure the performance impact of the 8.4 interlocked instructions going forward in .NET 7 and see if we can benefit from using them in .NET.
Note: For all the above peephole work items, there is a prerequisite work item needed to enable the codegen to update a previously emitted instruction. There is no separate tracking issue for it, and the first of these optimizations we implement will have to include that infrastructure work.
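A minimal case the `ldp`/`stp` peephole items would improve (type and method names are illustrative): copying a 16-byte struct currently emits two independent load/store pairs that the peephole could merge.

```csharp
// Copying two adjacent 8-byte fields emits, roughly,
//   ldr x2, [x1]; ldr x3, [x1, #8]; str x2, [x0]; str x3, [x0, #8]
// which the proposed peephole would merge into one ldp and one stp.
struct Pair { public long A; public long B; }

static void Copy(ref Pair dst, ref Pair src) => dst = src;
```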
@dotnet/jit-contrib @TamarChristinaArm @tannergooding
category:planning
theme:planning
skill-level:expert
cost:large