@khushal1996
Member

@khushal1996 khushal1996 commented Feb 13, 2025

Overview

This PR tracks the optimization of x64 floating-point min/max operations using the new min/max instructions introduced in AVX10.2 (vminmaxsd appears in the diff below). We are following the spec doc to add the new instructions and optimize x64/x86 min/max code generation.

Addresses #109081

Testing


Step 1: Run superpmi.exe on the library mch files with JITLateDisasm enabled and check for errors. JITLateDisasm verifies that the emitted byte stream decodes validly through the LLVM disassembler.

For this step, a new coredistools built from the LLVM repo was used. After running superpmi with JITLateDisasm, no decoding failures were detected. Please contact me for access to the superpmi logs.


Step 2: Run superpmi and check for asmdiffs and assert errors.

Below is the summary of the superpmi run:

Top file improvements (bytes):
         -62 : 9311.dasm (-12.47% of base)

1 total files with Code Size differences (1 improved, 0 regressed), 0 unchanged.

Top method improvements (bytes):
         -62 (-12.47% of base) : 9311.dasm - System.Threading.ProcessorIdCache:ProcessorNumberSpeedCheck():ubyte (FullOpts)

Top method improvements (percentages):
         -62 (-12.47% of base) : 9311.dasm - System.Threading.ProcessorIdCache:ProcessorNumberSpeedCheck():ubyte (FullOpts)

1 total methods with Code Size differences (1 improved, 0 regressed).
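
As a sanity check on the summary above, the reported -62 byte total can be reproduced by summing the per-instruction-group `size=` annotations that changed in the diff that follows. This is only a minimal sketch: the group names and sizes are copied from the diff, while the class and method names (`CodeSizeCheck`, `TotalDelta`) are hypothetical and not part of the PR.

```csharp
using System;
using System.Linq;

public static class CodeSizeCheck
{
    // (group, base size, diff size) copied from the ";; size=" annotations
    // of the instruction groups whose size changed in the diff.
    public static readonly (string Group, int Base, int Diff)[] Groups =
    {
        ("IG01", 28, 22), // prolog: one fewer xmm callee-save spill
        ("IG04", 45, 37), // fix-up table load (vmovups xmm8) removed
        ("IG10", 61, 43), // vrangesd + 2x vfixupimmsd + vmovaps -> vminmaxsd
        ("IG15", 54, 36), // same replacement for the second accumulator
        ("IG17", 29, 23), // epilog
        ("IG19", 29, 23), // epilog
    };

    // Sum of (diff - base) over all changed groups.
    public static int TotalDelta() => Groups.Sum(g => g.Diff - g.Base);

    public static void Main()
    {
        foreach (var g in Groups)
            Console.WriteLine($"{g.Group}: {g.Diff - g.Base} bytes");
        Console.WriteLine($"Total: {TotalDelta()} bytes");
    }
}
```

The per-group deltas add up to the same -62 bytes that superpmi reports for the method.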

diff
@@ -7,8 +7,8 @@
 ; partially interruptible
 ; Final local variable assignments
 ;
-;  V00 loc0         [V00,T18] (  4,  9.50)  double  ->  mm7        
-;  V01 loc1         [V01,T19] (  4,  9.50)  double  ->  mm6        
+;  V00 loc0         [V00,T13] (  4,  9.50)  double  ->  mm7        
+;  V01 loc1         [V01,T14] (  4,  9.50)  double  ->  mm6        
 ;  V02 loc2         [V02,T04] (  3, 64.25)    long  ->  rbx        
 ;* V03 loc3         [V03,T12] (  0,  0   )     int  ->  zero-ref   
 ;  V04 loc4         [V04,T00] ( 10,264   )    long  ->  rbp        
@@ -16,23 +16,18 @@
 ;* V06 loc6         [V06,T05] (  0,  0   )     int  ->  zero-ref   
 ;* V07 loc7         [V07,T06] (  0,  0   )     int  ->  zero-ref   
 ;  V08 OutArgs      [V08    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
-;  V09 tmp1         [V09,T13] (  3, 24   )  simd16  ->  mm0         "Cloning op2 for Math.Max/Min"
-;  V10 tmp2         [V10,T14] (  3, 24   )  simd16  ->  mm7         "Cloning op1 for Math.Max/Min"
-;  V11 tmp3         [V11,T15] (  3, 24   )  simd16  ->  mm0         "Cloning op2 for Math.Max/Min"
-;  V12 tmp4         [V12,T16] (  3, 24   )  simd16  ->  mm6         "Cloning op1 for Math.Max/Min"
-;  V13 tmp5         [V13,T11] (  2,  1   )     int  ->  rax         "Inline return value spill temp"
-;  V14 tmp6         [V14,T08] (  3,  3   )     int  ->  rax         "Inlining Arg"
-;  V15 cse0         [V15,T17] (  5, 16.25)  simd16  ->  mm8         hoist "CSE #02: aggressive"
-;  V16 cse1         [V16,T20] (  3,  3   )  double  ->  mm6         "CSE #01: aggressive"
-;  V17 rat0         [V17,T07] (  4, 12.25)     int  ->  rsi         "Trip count IV"
-;  V18 rat1         [V18,T02] (  4,196   )     int  ->  r14         "Trip count IV"
-;  V19 rat2         [V19,T03] (  4,196   )     int  ->  r14         "Trip count IV"
-;  V20 rat3         [V20,T09] (  3,  1.50)    long  ->  rbx         "fgMakeTemp is creating a new local variable"
-;  V21 rat4         [V21,T10] (  3,  1.50)    long  ->  rdx         "ReplaceWithLclVar is creating a new local variable"
-;  V22 rat5         [V22,T21] (  3,  3   )  double  ->  mm0         "ReplaceWithLclVar is creating a new local variable"
-;  V23 rat6         [V23,T22] (  3,  3   )  simd16  ->  mm0         "ReplaceWithLclVar is creating a new local variable"
+;  V09 tmp1         [V09,T11] (  2,  1   )     int  ->  rax         "Inline return value spill temp"
+;  V10 tmp2         [V10,T08] (  3,  3   )     int  ->  rax         "Inlining Arg"
+;  V11 cse0         [V11,T15] (  3,  3   )  double  ->  mm6         "CSE #01: aggressive"
+;  V12 rat0         [V12,T07] (  4, 12.25)     int  ->  rsi         "Trip count IV"
+;  V13 rat1         [V13,T02] (  4,196   )     int  ->  r14         "Trip count IV"
+;  V14 rat2         [V14,T03] (  4,196   )     int  ->  r14         "Trip count IV"
+;  V15 rat3         [V15,T09] (  3,  1.50)    long  ->  rbx         "fgMakeTemp is creating a new local variable"
+;  V16 rat4         [V16,T10] (  3,  1.50)    long  ->  rdx         "ReplaceWithLclVar is creating a new local variable"
+;  V17 rat5         [V17,T16] (  3,  3   )  double  ->  mm0         "ReplaceWithLclVar is creating a new local variable"
+;  V18 rat6         [V18,T17] (  3,  3   )  simd16  ->  mm0         "ReplaceWithLclVar is creating a new local variable"
 ;
-; Lcl frame size = 80
+; Lcl frame size = 64
 
 G_M1452_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
        push     r14
@@ -40,11 +35,10 @@ G_M1452_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
        push     rsi
        push     rbp
        push     rbx
-       sub      rsp, 80
-       vmovaps  xmmword ptr [rsp+0x40], xmm6
-       vmovaps  xmmword ptr [rsp+0x30], xmm7
-       vmovaps  xmmword ptr [rsp+0x20], xmm8
-						;; size=28 bbWeight=1 PerfScore 11.25
+       sub      rsp, 64
+       vmovaps  xmmword ptr [rsp+0x30], xmm6
+       vmovaps  xmmword ptr [rsp+0x20], xmm7
+						;; size=22 bbWeight=1 PerfScore 9.25
 G_M1452_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
        vmovsd   xmm6, qword ptr [reloc @RWD00]
        vmovaps  xmm7, xmm6
@@ -68,9 +62,8 @@ G_M1452_IG04:        ; bbWeight=0.25, gcrefRegs=0000 {}, byrefRegs=0000 {}, byre
        shr      rax, 63
        sar      rdx, 18
        lea      rbx, [rdx+rax+0x01]
-       vmovups  xmm8, xmmword ptr [reloc @RWD16]
        mov      esi, 10
-						;; size=45 bbWeight=0.25 PerfScore 3.00
+						;; size=37 bbWeight=0.25 PerfScore 2.25
 G_M1452_IG05:        ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
        mov      edi, 8
 						;; size=5 bbWeight=4 PerfScore 1.00
@@ -105,17 +98,14 @@ G_M1452_IG10:        ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
        vxorps   xmm1, xmm1, xmm1
        vcvtsi2sd xmm1, xmm1, edi
        vdivsd   xmm0, xmm0, xmm1
-       vrangesd xmm1, xmm7, xmm0, 4
-       vfixupimmsd xmm7, xmm0, xmm8, 0
-       vfixupimmsd xmm1, xmm7, xmm8, 0
-       vmovaps  xmm7, xmm1
+       vminmaxsd xmm7, xmm7, xmm0, 4
        mov      eax, edi
        sar      eax, 31
        and      eax, 3
        add      eax, edi
        mov      edi, eax
        sar      edi, 2
-						;; size=61 bbWeight=4 PerfScore 143.67
+						;; size=43 bbWeight=4 PerfScore 118.67
 G_M1452_IG11:        ; bbWeight=32, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
        add      edi, edi
        call     System.Diagnostics.Stopwatch:QueryPerformanceCounter():long
@@ -147,21 +137,18 @@ G_M1452_IG15:        ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
        vxorps   xmm1, xmm1, xmm1
        vcvtsi2sd xmm1, xmm1, edi
        vdivsd   xmm0, xmm0, xmm1
-       vrangesd xmm1, xmm6, xmm0, 4
-       vfixupimmsd xmm6, xmm0, xmm8, 0
-       vfixupimmsd xmm1, xmm6, xmm8, 0
-       vmovaps  xmm6, xmm1
+       vminmaxsd xmm6, xmm6, xmm0, 4
        dec      esi
        jne      G_M1452_IG05
-						;; size=54 bbWeight=4 PerfScore 140.67
+						;; size=36 bbWeight=4 PerfScore 115.67
 G_M1452_IG16:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
-       vmulsd   xmm0, xmm7, qword ptr [reloc @RWD32]
+       vmulsd   xmm0, xmm7, qword ptr [reloc @RWD08]
        vdivsd   xmm0, xmm0, xmm6
-       vfixupimmsd xmm0, xmm0, qword ptr [reloc @RWD48], 0
-       vcmppd   k1, xmm0, xmmword ptr [reloc @RWD64], 13
+       vfixupimmsd xmm0, xmm0, qword ptr [reloc @RWD16], 0
+       vcmppd   k1, xmm0, xmmword ptr [reloc @RWD32], 13
        vcvttsd2si eax, xmm0
        vpbroadcastd xmm0, eax
-       vpblendmd xmm0 {k1}, xmm0, dword ptr [reloc @RWD80] {1to4}
+       vpblendmd xmm0 {k1}, xmm0, dword ptr [reloc @RWD48] {1to4}
        vmovd    eax, xmm0
        mov      ecx, 0x1388
        cmp      eax, 0x1388
@@ -172,49 +159,44 @@ G_M1452_IG16:        ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byre
        movzx    rax, al
 						;; size=90 bbWeight=0.50 PerfScore 21.00
 G_M1452_IG17:        ; bbWeight=0.50, epilog, nogc, extend
-       vmovaps  xmm6, xmmword ptr [rsp+0x40]
-       vmovaps  xmm7, xmmword ptr [rsp+0x30]
-       vmovaps  xmm8, xmmword ptr [rsp+0x20]
-       add      rsp, 80
+       vmovaps  xmm6, xmmword ptr [rsp+0x30]
+       vmovaps  xmm7, xmmword ptr [rsp+0x20]
+       add      rsp, 64
        pop      rbx
        pop      rbp
        pop      rsi
        pop      rdi
        pop      r14
        ret      
-						;; size=29 bbWeight=0.50 PerfScore 7.88
+						;; size=23 bbWeight=0.50 PerfScore 5.88
 G_M1452_IG18:        ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
        mov      dword ptr [(reloc)], 0xFFFF      ; static handle
        xor      eax, eax
 						;; size=12 bbWeight=0.50 PerfScore 0.62
 G_M1452_IG19:        ; bbWeight=0.50, epilog, nogc, extend
-       vmovaps  xmm6, xmmword ptr [rsp+0x40]
-       vmovaps  xmm7, xmmword ptr [rsp+0x30]
-       vmovaps  xmm8, xmmword ptr [rsp+0x20]
-       add      rsp, 80
+       vmovaps  xmm6, xmmword ptr [rsp+0x30]
+       vmovaps  xmm7, xmmword ptr [rsp+0x20]
+       add      rsp, 64
        pop      rbx
        pop      rbp
        pop      rsi
        pop      rdi
        pop      r14
        ret      
-						;; size=29 bbWeight=0.50 PerfScore 7.88
+						;; size=23 bbWeight=0.50 PerfScore 5.88
 G_M1452_IG20:        ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
        call     CORINFO_HELP_READYTORUN_NONGCSTATIC_BASE
        ; gcr arg pop 0
        jmp      G_M1452_IG04
 						;; size=10 bbWeight=0 PerfScore 0.00
 RWD00  	dq	7FEFFFFFFFFFFFFFh	; 1.79769313e+308
-RWD08  	dd	00000000h, 00000000h
-RWD16  	dq	0000000000000001h, 0000000000000000h
-RWD32  	dq	4014000000000000h	;            5
-RWD40  	dd	00000000h, 00000000h
-RWD48  	dq	0000000000000088h, 0000000000000000h
-RWD64  	dq	41DFFFFFFFC00000h, 41DFFFFFFFC00000h
-RWD80  	dd	7FFFFFFFh
+RWD08  	dq	4014000000000000h	;            5
+RWD16  	dq	0000000000000088h, 0000000000000000h
+RWD32  	dq	41DFFFFFFFC00000h, 41DFFFFFFFC00000h
+RWD48  	dd	7FFFFFFFh

Since these diffs are expected, we can conclude that the superpmi run is successful.
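
For context on why the base code uses more than a plain compare-and-select: Math.Min and Math.Max propagate NaN and order -0.0 below +0.0. The vrangesd/vfixupimmsd sequence in the base asm presumably exists to honor those rules, which a single vminmaxsd now covers. The sketch below only demonstrates the managed-level semantics; it does not depend on AVX10.2 and makes no claim about which instruction the JIT selects on a given machine. The class and helper names (`MinMaxSemantics`, `IsNegativeZero`) are hypothetical.

```csharp
using System;

public static class MinMaxSemantics
{
    // True when the value is negative zero (zero magnitude, sign bit set).
    public static bool IsNegativeZero(double d) => d == 0.0 && double.IsNegative(d);

    public static void Main()
    {
        // NaN propagates regardless of operand order.
        Console.WriteLine(double.IsNaN(Math.Min(double.NaN, 1.0)));
        Console.WriteLine(double.IsNaN(Math.Max(1.0, double.NaN)));

        // Signed zeros are ordered: -0.0 is the minimum, +0.0 is the maximum.
        Console.WriteLine(IsNegativeZero(Math.Min(0.0, -0.0)));
        Console.WriteLine(IsNegativeZero(Math.Max(0.0, -0.0)));
    }
}
```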


Step 3: Run the JIT test suite using a stable subset of tests on SDE

Results
image

Optimized ASM


Note: Below is a case-by-case comparison of the asm generated for AVX512 vs. AVX10.2. The AVX10.2 asm was collected under SDE.

Case: Math.Min

Test code

using System;
using System.Runtime.CompilerServices;

public class Program
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static float MinScalar(float val1, float val2)
    {
        return Math.Min(val1, val2);
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(MinScalar(1.2f, 3.5f));
    }
}

Left Side is base (main, AVX512F) vs Right Side is diff (this PR, AVX10.2)
image

Case: Vector128.Min

Test code

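using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public class Program
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static Vector128<float> Min128(Vector128<float> val1, Vector128<float> val2)
    {
        return Vector128.Min(val1, val2);
    }

    public static void Main(string[] args)
    {
        // Sample inputs; the original snippet did not show how val1/val2 were defined.
        Vector128<float> val1 = Vector128.Create(1.2f);
        Vector128<float> val2 = Vector128.Create(3.5f);
        Console.WriteLine(Min128(val1, val2));
    }
}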

Left Side is base (main, AVX512F) vs Right Side is diff (this PR, AVX10.2)
image

Case: Math.Max

Test code

using System;
using System.Runtime.CompilerServices;

public class Program
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static float MaxScalar(float val1, float val2)
    {
        return Math.Max(val1, val2);
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(MaxScalar(1.2f, 3.5f));
    }
}

Left Side is base (main, AVX512F) vs Right Side is diff (this PR, AVX10.2)
image

Case: Vector512.Max

Test code

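using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public class Program
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static Vector512<float> Max512(Vector512<float> val1, Vector512<float> val2)
    {
        return Vector512.Max(val1, val2);
    }

    public static void Main(string[] args)
    {
        // Sample inputs; the original snippet did not show how val3/val4 were defined.
        Vector512<float> val3 = Vector512.Create(1.2f);
        Vector512<float> val4 = Vector512.Create(3.5f);
        Console.WriteLine(Max512(val3, val4));
    }
}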

Left Side is base (main, AVX512F) vs Right Side is diff (this PR, AVX10.2)
image

Case: Math.MinMagnitude

Test code

using System;
using System.Runtime.CompilerServices;

public class Program
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static double MinMagnitudeScalar(double val1, double val2)
    {
        return Math.MinMagnitude(val1, val2);
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(MinMagnitudeScalar(1.2, -3.5));
    }
}

Left Side is base (main, AVX512F) vs Right Side is diff (this PR, AVX10.2)
image

Case: MathF.MinMagnitude

Test code

using System;
using System.Runtime.CompilerServices;

public class Program
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static float MinMagnitudeScalarFloat(float val1, float val2)
    {
        return MathF.MinMagnitude(val1, val2);
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(MinMagnitudeScalarFloat(1.2f, -3.5f));
    }
}

Left Side is base (main, AVX512F) vs Right Side is diff (this PR, AVX10.2)
image

Case: Math.MaxMagnitude

Test code

using System;
using System.Runtime.CompilerServices;

public class Program
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static double MaxMagnitudeScalar(double val1, double val2)
    {
        return Math.MaxMagnitude(val1, val2);
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(MaxMagnitudeScalar(1.2, -3.5));
    }
}

Left Side is base (main, AVX512F) vs Right Side is diff (this PR, AVX10.2)
image

Case: MathF.MaxMagnitude

Test code

using System;
using System.Runtime.CompilerServices;

public class Program
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static float MaxMagnitudeScalarFloat(float val1, float val2)
    {
        return MathF.MaxMagnitude(val1, val2);
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(MaxMagnitudeScalarFloat(1.2f, -3.5f));
    }
}

Left Side is base (main, AVX512F) vs Right Side is diff (this PR, AVX10.2)
image
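
To make the magnitude cases above easier to sanity-check without the asm, here is a small sketch of the results the managed APIs are expected to return for the same inputs, plus two edge cases (equal magnitudes and NaN). It exercises only Math/MathF and says nothing about which instructions are emitted; the class name `MagnitudeSemantics` and its wrapper methods are hypothetical.

```csharp
using System;

public static class MagnitudeSemantics
{
    // Thin wrappers so the calls mirror the test methods used in the cases above.
    public static double MinMag(double a, double b) => Math.MinMagnitude(a, b);
    public static double MaxMag(double a, double b) => Math.MaxMagnitude(a, b);

    public static void Main()
    {
        // The operand with the smaller (or larger) absolute value is returned, sign preserved.
        Console.WriteLine(MinMag(1.2, -3.5) == 1.2);
        Console.WriteLine(MaxMag(1.2, -3.5) == -3.5);

        // Same for the single-precision overloads.
        Console.WriteLine(MathF.MinMagnitude(1.2f, -3.5f) == 1.2f);
        Console.WriteLine(MathF.MaxMagnitude(1.2f, -3.5f) == -3.5f);

        // Equal magnitudes fall back to ordinary min/max of the signed values.
        Console.WriteLine(MinMag(2.0, -2.0) == -2.0);
        Console.WriteLine(MaxMag(2.0, -2.0) == 2.0);

        // NaN propagates from either operand.
        Console.WriteLine(double.IsNaN(MinMag(1.0, double.NaN)));
        Console.WriteLine(double.IsNaN(MaxMag(double.NaN, 1.0)));
    }
}
```

Every line should print True if the runtime follows the IEEE 754 minimumMagnitude/maximumMagnitude behavior.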

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 13, 2025
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Feb 13, 2025
@BruceForstall
Contributor

For some reason, I can't see any of the linked images in the PR description.

@khushal1996
Member Author

> For some reason, I can't see any of the linked images in the PR description.

Can you refresh and try again? Sometimes GitHub does this. Whenever this happens to me, I just refresh the page and I am able to open the images.

@BruceForstall
Contributor

> Can you refresh and try again? Sometimes github does this. Whenever this happens to me, I just refresh the page and I am able to open the images.

Doesn't help. For example, if I click on the first "image" link, under section 3, it is: https://github.com/user-attachments/assets/993fa944-06fd-4590-8fea-7abd5a8f3888 and I get a 404 "Page not found" error. I think this has been true for all of your PRs in the past, as well.

I notice that your profile (https://github.com/khushal1996) doesn't show you as a member of the ".NET Platform" organization, like, for example, Deepak (https://github.com/DeepakRajendrakumaran) and Anthony (https://github.com/anthonycanino). Maybe there's a permissions issue due to that.

@En3Tho
Contributor

En3Tho commented Feb 14, 2025

It's the same for me (image link above does lead to 404)

@khushal1996
Member Author

@BruceForstall @En3Tho
My bad. I think it is because the images were uploaded from Intel's internal GitHub repo. The review is first opened internally, and hence the GitHub privileges are set for internal access. I have updated the images and you should be able to review the PR now. I will rectify this on other open PRs as well.

@BruceForstall
Contributor

> @BruceForstall @En3Tho My bad. I think it is due to the images being uploaded from the Intel's internal github repo. The review is first opened internally and hence github privileges are set for internal. I have updated the images and you should be able to review the PR now. I will rectify this on other open PRs as well.

Thanks; I can see them now.

@khushal1996 khushal1996 marked this pull request as ready for review February 19, 2025 19:28
@khushal1996
Member Author

@tannergooding can you help review this PR? This PR uses the AVX10.2 instructions for min/max computations in JIT.

@tannergooding
Member

CC. @dotnet/jit-contrib for secondary review

@khushal1996
Member Author

@tannergooding @BruceForstall Can you help move this review forward? It looks like this has been stuck in the approved state for a week.

@tannergooding
Member

CC. @EgorBo, this is ready for secondary review

@EgorBo EgorBo merged commit 4fb4020 into dotnet:main Mar 4, 2025
112 checks passed
@khushal1996 khushal1996 deleted the kcm-avx102-opt2-public-pr branch March 4, 2025 01:48
@khushal1996
Member Author

Thanks @EgorBo

@github-actions github-actions bot locked and limited conversation to collaborators Apr 3, 2025