Background
In .NET 5, the .NET team made a non-trivial effort to bring parity between Arm64 and x86 platform support. For example, we added 384 methods to System.Runtime.Intrinsics.Arm, allowing our customers to use Advanced SIMD instructions on Arm64, optimized libraries code using these intrinsics, and made Arm64-targeted performance improvements in the codegen.
In .NET 6 we will continue this effort. In particular, as part of .NET 6 planning, the JIT team identified the following items as our next short-term goals:
Conditional instructions/branch elimination
One example of such a code transformation can be found in LLVM, which turns cbz/cbnz/tbz/tbnz instructions into a conditional branch (b.cond). For example, you can compare the outputs of the latest clang compiling the following C++ snippet
```cpp
void TransformsIntoCondBr(int& op1, int& op2) {
    if (op1 & op2) {
        op1 = op2;
    } else {
        op2 = op1;
    }
}
```

with the optimization disabled (`-O2 -mllvm -aarch64-enable-cond-br-tune=false`):
```asm
TransformsIntoCondBr(int&, int&):        // @TransformsIntoCondBr(int&, int&)
        ldr     w8, [x0]
        ldr     w9, [x1]
        and     w10, w9, w8
        cbz     w10, .LBB0_2
        str     w9, [x0]
        ret
.LBB0_2:
        str     w8, [x1]
        ret
```

and with the optimization enabled (`-O2 -mllvm -aarch64-enable-cond-br-tune=true`):
```asm
TransformsIntoCondBr(int&, int&):        // @TransformsIntoCondBr(int&, int&)
        ldr     w8, [x0]
        ldr     w9, [x1]
        tst     w9, w8
        b.eq    .LBB0_2
        str     w9, [x0]
        ret
.LBB0_2:
        str     w8, [x1]
        ret
```

Note that `and w10, w9, w8; cbz w10, .LBB0_2` has been replaced with `tst w9, w8; b.eq .LBB0_2`, which frees up the w10 register.
The JIT team will research this optimization area and decide which optimizations can be implemented in .NET 6.
Some related issues:
- RyuJit: avoid conditional jumps using cmov and similar instructions #6749
- Branchless code generation for ternaries #32632 (see the sketch below)
- RyuJIT: Optimize "X / POW2_CNS" via cmovns #41549
- [RyuJIT][arm64] Optimize "x<0" and "x>=0" #43440
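A minimal C# sketch of the kind of pattern #32632 targets (the method name is illustrative only): today the ternary below compiles to a compare and branch on Arm64, but it could be lowered to a `cmp` followed by a single `csel`.

```csharp
// Sketch: a branchy ternary that could be emitted branchlessly, e.g.
// "cmp w0, w1; csel w0, w0, w1, gt" instead of a compare-and-branch sequence.
static int Max(int x, int y) => x > y ? x : y;
```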
Presumably, some parts of the analysis can be implemented in a platform-agnostic way and benefit both the Arm64 and x86 platforms.
Next steps:
- Identify the optimizations and estimate their potential impact
- Determine what can be implemented in a platform-agnostic way and do that as a next step
- Implement Arm64 specific optimizations
Hardware Intrinsics on Arm64
- We need to address the known inefficiencies/suboptimal code generation:
  - Consider optimizing more intrinsics that have move/copy semantics #40489, as was done in 833aaba. **Stretch Goal**
  - Investigate unnecessary vector register moves #33975 around helper calls. As one potential solution, we can implement custom helpers in assembly whose calling convention guarantees that the upper 64 bits of SIMD registers are not altered (#33975 (comment)). **Stretch Goal**
  - JIT Hardware Intrinsics low compiler throughput due to inefficient intrinsic identification algorithm during import phase #13617. **Stretch Goal**
  - Use `cmeq`, `cmge`, `cmgt` (zero) when one of the operands is `Vector64/128<T>.Zero` #33972 (see the sketch after this list). **Stretch Goal**
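To illustrate the last item, here is a small sketch against the existing `AdvSimd` surface (the helper name is illustrative): with #33972 the compare-against-zero below can use the immediate form of `cmeq` instead of materializing a zero vector in a register first.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

// With #33972 this can be emitted as "cmeq v0.4s, v0.4s, #0"
// rather than zeroing a second vector register and using the
// register-register form of cmeq.
static Vector128<int> IsZero(Vector128<int> v) =>
    AdvSimd.CompareEqual(v, Vector128<int>.Zero);
```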
- Implementation of new APIs is also on the table. The following are some instances of the proposed work:
  - API Proposal: Arm `TableVectorLookup` and `TableVectorExtension` intrinsics (multiple register) #1277. **Moved to 7.0**
  - [Arm64] `LoadPairVector64` and `LoadPairVector128` #39243 (a hypothetical shape sketch follows this list). **Moved to 7.0** [Arm64] AdvSIMD LoadPairVector64 and LoadPairVector128 #45020 superseded by [Arm64] Implement LoadPairVector64 and LoadPairVector128 #52424
  - [Arm64] `MultiplyHigh` #43106. Closed by [Arm64] Implement MultiplyHigh #47362
  - `LoadVector64`, `LoadVector128`, and `Store` with multiple registers (`ld1`-`ld4`, `st1`-`st4`). Note that implementing these and the multiple-register variants of the `TableVectorLookup` and `TableVectorExtension` intrinsics would require extensive changes to LSRA. **Moved to 7.0**
- We are considering implementing hardware intrinsics for the `ARMv8.3-CompNum`, `ARMv8.2-I8MM`, and `ARMv8.2-FP16` ISAs. **Stretch Goal**
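For reference, a purely hypothetical sketch of what a load-pair API shape could look like; the actual surface is whatever the linked proposals settle on, and the body below is just a scalar-equivalent placeholder for what a single `ldp q0, q1, [x0]` would do.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

// Hypothetical shape only; the real API is defined in #39243 / #52424.
// Intent: one "ldp q0, q1, [x0]" fills both elements of the returned tuple.
static unsafe (Vector128<byte> Value1, Vector128<byte> Value2) LoadPairVector128(byte* address) =>
    (AdvSimd.LoadVector128(address), AdvSimd.LoadVector128(address + 16));
```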
Atomic instructions
Currently, JIT emits ARMv8.1-LSE atomic instructions in the following cases:
- `CodeGen::genLockedInstructions(GenTreeOp* treeNode)` for the `Interlocked.Add` and `Interlocked.Exchange` methods
- `CodeGen::genCodeForCmpXchg(GenTreeCmpXchg* treeNode)` for the `Interlocked.CompareExchange` method
- There is a proposal to extend this functionality to other methods of the `Interlocked` class: Consider making Interlocked.And and friends into JIT intrinsics #32239. As per #32239 (comment), there is potential to generate better code by using Armv8.1 (see the example below). Closed by [RyuJIT] Implement Interlocked.And and Interlocked.Or for arm64-v8.1 #46253. Thank you @EgorBo!
Another potential work item is to support ARMv8.4-LSE atomic instructions in the JIT.
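To make the `Interlocked` item concrete, a small example using the existing API (the method name is illustrative): on Armv8.1 hardware, #46253 allows the JIT to emit a single LSE atomic instead of a load-exclusive/store-exclusive retry loop.

```csharp
using System.Threading;

// With #46253, on Armv8.1 this lowers to one LSE atomic (an ldclr-style
// instruction) rather than a ldaxr/stlxr loop; the return value is the
// original flags value, as documented for Interlocked.And.
static int ClearFlags(ref int flags, int mask) =>
    Interlocked.And(ref flags, ~mask);
```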
Examples of Arm64 specific JIT backlog issues
- In .NET 5 we implemented a stack probing procedure on all platforms except Arm64 (Implement stack probing using helpers coreclr#26807 and coreclr#27184). This solved some of the issues with stack unwinding (Cannot unwind stack when stack probing hits the stack limit on Unix #11495) and allowed us to implement Display stack trace at stack overflow #32167. In .NET 6 we should close the gap on Arm64 and address [Arm64] Implement stack probing using helper #13519. **Moved to 7.0** Blocked by [Arm64] Extend Compiler::lvaFrameAddress() and JIT to allow using SP as base register #47810
- We saw a huge improvement in .NET 5 from Ben's work in Use xmm for stack prolog zeroing rather than rep stos #32538. We should consider implementing a similar idea and using SIMD registers for prolog zeroing on Arm64. We can exploit the fact that the AdvSimd `st1` instruction can store up to four 128-bit SIMD registers to memory, effectively writing up to 64 bytes of zeroes in one instruction (a small illustration follows this list). This work is tracked in [Arm64] Use stp and str (SIMD) for stack prolog zeroing #43789. Closed by [Arm64] Use SIMD register to zero init frame #46609
- Use stp (SIMD) in `genCodeForInitBlkUnroll` and `genCodeForCpBlkUnroll` ([Arm64] Use stp (SIMD) in genCodeForInitBlkUnroll and genCodeForCpBlkUnroll #48934). **Stretch Goal**
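A small illustration of the prolog-zeroing item above, assuming the default `initlocals` behavior where the stack-allocated buffer is zeroed in the prolog: SIMD stores can clear the 64-byte frame slot in one or two instructions instead of a run of `str xzr` stores.

```csharp
using System;

// The 64-byte local below is zero-initialized in the prolog; with SIMD
// zeroing the JIT can use q-register stores (e.g. a pair of "stp q0, q0")
// instead of eight consecutive "str xzr" instructions.
static byte FirstByte()
{
    Span<byte> buffer = stackalloc byte[64];
    return buffer[0];
}
```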
Stretch goal
- Peephole optimization opportunities (a minimal motivating example follows these items):
  - ARM64: Optimize redundant memory loads with mov #35141
  - ARM64: Redundant load/stores for methods that operates/returns structs #35071
  - ARM64: Optimize pair of "str wzr, [reg]" to "str xzr" #35136
  - ARM64: Optimize pair of "str reg, [fp]" to stp #35134
  - ARM64: Optimize pair of "str reg, [reg]" to stp #35133
  - ARM64: Optimize pair of "ldr reg, [reg]" to ldp #35132
  - ARM64: Optimize pair of "ldr reg, [fp]" to ldp #35130
  - ARM64: Redundant movs done for zero extend the register #35254
- As a follow-up to [RyuJIT] Implement Interlocked.And and Interlocked.Or for arm64-v8.1 #46253, we should measure the performance impact of the 8.4 interlocked instructions going forward in .NET 7 and see if we can benefit from using them in .NET.
Note: For all the above peephole work items, there is a prerequisite work item needed to enable the codegen to update a previously emitted instruction. There is no separate tracking issue for it, and the first of these optimizations we implement will have to include that infrastructure work.
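A minimal case the `ldp`/`stp` peephole items would improve (type and method names are illustrative): copying a 16-byte struct currently emits two independent load/store pairs that the peephole could merge.

```csharp
// Copying two adjacent 8-byte fields emits, roughly,
//   ldr x2, [x1]; ldr x3, [x1, #8]; str x2, [x0]; str x3, [x0, #8]
// which the proposed peephole would merge into one ldp and one stp.
struct Pair { public long A; public long B; }

static void Copy(ref Pair dst, ref Pair src) => dst = src;
```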
@dotnet/jit-contrib @TamarChristinaArm @tannergooding
category:planning
theme:planning
skill-level:expert
cost:large