
Conversation

@valadaptive
Contributor

@valadaptive valadaptive commented Dec 1, 2025

I got somewhat nerd-sniped when looking at a Rust issue and seeing this comment about how various min/max operations are compiled on various architectures (rust-lang/rust#91079). The current minimum/maximum/minimumnum/maximumnum code is very branchy because of the signed-zero handling. Even though we emit select operations, LLVM really prefers to lower them to branches, to the point of scalarizing vector code to do so, even if blendv is supported. (Should I open a separate issue for that? It seems concerning that LLVM would rather scalarize a vector operation than emit a couple of blendv operations in a row.)

It turns out that handling signed zero operands properly can be done using a couple of bitwise operations, which is branchless and easily vectorizable, by taking advantage of the following properties (a small demonstration follows the list):

  • When you take the maximum of two floats, the output sign bit will be the bitwise AND of the input sign bits (since 0 means positive, and the maximum always prefers the positive number).
  • When you take the minimum of two floats, the output sign bit will be the bitwise OR of the input sign bits (since 1 means negative, and the minimum always prefers the negative number).
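
As a quick sanity check of the sign-bit property, here's a standalone C++ sketch (my illustration, not code from this patch; ref_maximum is a hypothetical reference that orders -0.0 below +0.0 and ignores NaNs):

#include <bit>
#include <cassert>
#include <cstdint>

static uint32_t sign(float f) { return std::bit_cast<uint32_t>(f) >> 31; }

// Reference maximum that treats -0.0 as less than +0.0 (NaNs excluded).
static float ref_maximum(float x, float y) {
  if (x == y)                // also covers +0.0 vs. -0.0, which compare equal
    return sign(x) ? y : x;  // prefer the operand whose sign bit is clear
  return x > y ? x : y;
}

int main() {
  const float vals[] = {-2.0f, -0.0f, +0.0f, 3.5f};
  for (float x : vals)
    for (float y : vals)
      assert(sign(ref_maximum(x, y)) == (sign(x) & sign(y)));
}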

We can further optimize this by taking advantage of the fact that x86's min/max instructions operate like a floating-point compare+select, returning the second operand if both are (positive or negative) zero. Altogether, the operations go as follows (a scalar sketch follows the steps):

  • For taking the minimum:

    • Call minps/minpd/etc. on the input operands. This will return the minimum, unless both are zero, in which case it will return the second operand.
    • Take the bitwise AND of the first operand and a bit pattern with only the highest bit set, so that everything is zeroed except the sign bit.
    • Finally, OR that with the minimum from earlier. The only incorrect case was when the second operand was +0.0 and the first operand was -0.0. By OR-ing the first operand's sign bit with the existing minimum, we correct this.
  • Analogously, for taking the maximum:

    • Call maxps/maxpd/etc. on the input operands. This will return the maximum, unless both are zero, in which case it will return the second operand.
    • Take the bitwise OR of the first operand and a bit pattern which is all ones except for the highest bit, so that every bit is one except the sign bit, which keeps the first operand's sign.
    • Finally, AND that with the maximum from earlier. The only incorrect case was when the second operand was -0.0 and the first operand was +0.0; AND-ing with the first operand's cleared sign bit corrects this.
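
Here's a scalar model of the whole sequence as a plain C++ sketch (x86_min/x86_max are hypothetical stand-ins for minss/maxss; nothing here is the actual SelectionDAG code):

#include <bit>
#include <cstdint>

// minss/maxss return the *second* operand when the inputs compare equal
// (both zeros) or unordered (NaN); the ternaries below model exactly that.
static float x86_min(float x, float y) { return x < y ? x : y; }
static float x86_max(float x, float y) { return x > y ? x : y; }

static float branchless_minimum(float x, float y) {
  uint32_t m = std::bit_cast<uint32_t>(x86_min(x, y));
  uint32_t sx = std::bit_cast<uint32_t>(x) & 0x80000000u; // only x's sign bit
  // If x is -0.0 and y is +0.0, minss returned +0.0; OR-ing in x's sign fixes it.
  return std::bit_cast<float>(m | sx);
}

static float branchless_maximum(float x, float y) {
  uint32_t m = std::bit_cast<uint32_t>(x86_max(x, y));
  uint32_t mask = std::bit_cast<uint32_t>(x) | 0x7FFFFFFFu; // ones everywhere but the sign bit, which is x's
  // If x is +0.0 and y is -0.0, maxss returned -0.0; the AND clears the sign
  // bit exactly when x is non-negative.
  return std::bit_cast<float>(m & mask);
}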

In the case of NaNs, this approach might change the output NaN's sign bit. We don't have to worry about this for a couple of reasons: firstly, LLVM's language reference allows NaNs to have a nondeterministic sign bit (https://llvm.org/docs/LangRef.html#floatnan); secondly, there's already a step after this that selects one of the input NaNs anyway.

Here's an Alive2 proof: https://alive2.llvm.org/ce/z/EfQZ-G. It obviously can't verify that the implementation is sound, but it shows that at least the theory is.

I believe this approach is faster than even properly-vectorized blendv operations because it eliminates a data dependency chain. Furthermore, on AVX-512, the load, AND, and OR can become a single vpternlogd. My highly-unrepresentative microbenchmarks (compiled for x86-64-v2, so SSE4.1) say ~7.5%-10% faster than blendv, which makes me confident this is at least not a regression.
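
For intuition, here's roughly how the min-side fixup (x & signmask) | minresult folds into a single ternary-logic op, as an intrinsics-level sketch assuming AVX-512F (the patch itself leaves this to instruction selection):

#include <immintrin.h>

// With the canonical truth-table constants A=0xF0, B=0xCC, C=0xAA, the
// function (A & B) | C has imm8 = (0xF0 & 0xCC) | 0xAA = 0xEA.
__m512i fixup_min(__m512i x, __m512i signmask, __m512i minresult) {
  return _mm512_ternarylogic_epi32(x, signmask, minresult, 0xEA);
}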

As another minor optimization (I could split it into a separate PR if you prefer, but I'm significantly reworking the code here anyway), I've changed the NaN selection for the "prefer numeric" operations from SDValue NaNSrc = IsNum ? MinMax : NewX; to SDValue NaNSrc = IsNum ? NewY : NewX;. This has identical behavior, as detailed in the comment I added, but eliminates the cmpunord instruction's data dependency on the min/max operation immediately preceding it. This optimization is included in the Alive2 proof.

One thing I didn't change here: if we're comparing scalar f16s, the target supports AVX-512, and we know neither is NaN, there's a different signed-zero-correction path that uses VFPCLASSS. I haven't analyzed how it works, but it might be worth just removing that and using the bitwise version in all cases.

@github-actions

github-actions bot commented Dec 1, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Dec 1, 2025

@llvm/pr-subscribers-backend-x86

Author: None (valadaptive)

Changes

Patch is 293.52 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/170069.diff

6 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+81-40)
  • (modified) llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll (+16-26)
  • (modified) llvm/test/CodeGen/X86/extractelement-fp.ll (+36-96)
  • (modified) llvm/test/CodeGen/X86/fminimum-fmaximum.ll (+500-851)
  • (modified) llvm/test/CodeGen/X86/fminimumnum-fmaximumnum.ll (+619-1056)
  • (modified) llvm/test/CodeGen/X86/vector-reduce-fmaximum.ll (+877-1430)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 1b0bf6823e390..b6db3706c5d3a 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -29480,7 +29480,6 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget,
   uint64_t SizeInBits = VT.getScalarSizeInBits();
   APInt PreferredZero = APInt::getZero(SizeInBits);
   APInt OppositeZero = PreferredZero;
-  EVT IVT = VT.changeTypeToInteger();
   X86ISD::NodeType MinMaxOp;
   if (IsMaxOp) {
     MinMaxOp = X86ISD::FMAX;
@@ -29492,8 +29491,8 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget,
   EVT SetCCType =
       TLI.getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), VT);
 
-  // The tables below show the expected result of Max in cases of NaN and
-  // signed zeros.
+  // The tables below show the expected result of Max in cases of NaN and signed
+  // zeros.
   //
   //                 Y                       Y
   //             Num   xNaN              +0     -0
@@ -29503,12 +29502,9 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget,
   //    xNaN  |   X  |  X/Y |     -0  |  +0  |  -0  |
   //          ---------------         ---------------
   //
-  // It is achieved by means of FMAX/FMIN with preliminary checks and operand
-  // reordering.
-  //
-  // We check if any of operands is NaN and return NaN. Then we check if any of
-  // operands is zero or negative zero (for fmaximum and fminimum respectively)
-  // to ensure the correct zero is returned.
+  // It is achieved by means of FMAX/FMIN with preliminary checks, operand
+  // reordering if one operand is a constant, and bitwise operations and selects
+  // to handle signed zero and NaN operands otherwise.
   auto MatchesZero = [](SDValue Op, APInt Zero) {
     Op = peekThroughBitcasts(Op);
     if (auto *CstOp = dyn_cast<ConstantFPSDNode>(Op))
@@ -29539,15 +29535,17 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget,
                           Op->getFlags().hasNoSignedZeros() ||
                           DAG.isKnownNeverZeroFloat(X) ||
                           DAG.isKnownNeverZeroFloat(Y);
-  SDValue NewX, NewY;
+  bool ShouldHandleZeros = true;
+  SDValue NewX = X;
+  SDValue NewY = Y;
   if (IgnoreSignedZero || MatchesZero(Y, PreferredZero) ||
       MatchesZero(X, OppositeZero)) {
     // Operands are already in right order or order does not matter.
-    NewX = X;
-    NewY = Y;
+    ShouldHandleZeros = false;
   } else if (MatchesZero(X, PreferredZero) || MatchesZero(Y, OppositeZero)) {
     NewX = Y;
     NewY = X;
+    ShouldHandleZeros = false;
   } else if (!VT.isVector() && (VT == MVT::f16 || Subtarget.hasDQI()) &&
              (Op->getFlags().hasNoNaNs() || IsXNeverNaN || IsYNeverNaN)) {
     if (IsXNeverNaN)
@@ -29569,33 +29567,6 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget,
     NewX = DAG.getSelect(DL, VT, NeedSwap, Y, X);
     NewY = DAG.getSelect(DL, VT, NeedSwap, X, Y);
     return DAG.getNode(MinMaxOp, DL, VT, NewX, NewY, Op->getFlags());
-  } else {
-    SDValue IsXSigned;
-    if (Subtarget.is64Bit() || VT != MVT::f64) {
-      SDValue XInt = DAG.getNode(ISD::BITCAST, DL, IVT, X);
-      SDValue ZeroCst = DAG.getConstant(0, DL, IVT);
-      IsXSigned = DAG.getSetCC(DL, SetCCType, XInt, ZeroCst, ISD::SETLT);
-    } else {
-      assert(VT == MVT::f64);
-      SDValue Ins = DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, MVT::v2f64,
-                                DAG.getConstantFP(0, DL, MVT::v2f64), X,
-                                DAG.getVectorIdxConstant(0, DL));
-      SDValue VX = DAG.getNode(ISD::BITCAST, DL, MVT::v4f32, Ins);
-      SDValue Hi = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::f32, VX,
-                               DAG.getVectorIdxConstant(1, DL));
-      Hi = DAG.getBitcast(MVT::i32, Hi);
-      SDValue ZeroCst = DAG.getConstant(0, DL, MVT::i32);
-      EVT SetCCType = TLI.getSetCCResultType(DAG.getDataLayout(),
-                                             *DAG.getContext(), MVT::i32);
-      IsXSigned = DAG.getSetCC(DL, SetCCType, Hi, ZeroCst, ISD::SETLT);
-    }
-    if (MinMaxOp == X86ISD::FMAX) {
-      NewX = DAG.getSelect(DL, VT, IsXSigned, X, Y);
-      NewY = DAG.getSelect(DL, VT, IsXSigned, Y, X);
-    } else {
-      NewX = DAG.getSelect(DL, VT, IsXSigned, Y, X);
-      NewY = DAG.getSelect(DL, VT, IsXSigned, X, Y);
-    }
   }
 
   bool IgnoreNaN = DAG.getTarget().Options.NoNaNsFPMath ||
@@ -29612,10 +29583,80 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget,
 
   SDValue MinMax = DAG.getNode(MinMaxOp, DL, VT, NewX, NewY, Op->getFlags());
 
+  // We handle signed-zero ordering by taking the larger (or smaller) sign bit.
+  if (ShouldHandleZeros) {
+    const fltSemantics &Sem = VT.getFltSemantics();
+    unsigned EltBits = VT.getScalarSizeInBits();
+    bool IsFakeVector = !VT.isVector();
+    MVT LogicVT = VT.getSimpleVT();
+    if (IsFakeVector)
+      LogicVT = (VT == MVT::f64)   ? MVT::v2f64
+                : (VT == MVT::f32) ? MVT::v4f32
+                                   : MVT::v8f16;
+
+    // We take the sign bit from the first operand and combine it with the
+    // output sign bit (see below). Right now, if ShouldHandleZeros is true, the
+    // operands will never have been swapped. If you add another optimization
+    // that swaps the input operands if one is a known value, make sure this
+    // logic stays correct!
+    SDValue LogicX = NewX;
+    SDValue LogicMinMax = MinMax;
+    if (IsFakeVector) {
+      // Promote scalars to vectors for bitwise operations.
+      LogicX = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, LogicVT, NewX);
+      LogicMinMax = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, LogicVT, MinMax);
+    }
+
+    // x86's min/max operations return the second operand if both inputs are
+    // signed zero. For the maximum operation, we want to "and" the sign bit of
+    // the output with the sign bit of the first operand--that means that if the
+    // first operand is +0.0, the output will be too. For the minimum, it's the
+    // opposite: we "or" the output sign bit with the sign bit of the first
+    // operand, ensuring that if the first operand is -0.0, the output will be
+    // too.
+    SDValue Result;
+    if (IsMaxOp) {
+      // getSignedMaxValue returns a bit pattern of all ones but the highest
+      // bit. We "or" that with the first operand, then "and" that with the max
+      // operation's result. That clears only the sign bit, and only if the
+      // first operand is positive.
+      SDValue OrMask = DAG.getConstantFP(
+          APFloat(Sem, APInt::getSignedMaxValue(EltBits)), DL, LogicVT);
+      SDValue MaskedSignBit =
+          DAG.getNode(X86ISD::FOR, DL, LogicVT, LogicX, OrMask);
+      Result =
+          DAG.getNode(X86ISD::FAND, DL, LogicVT, MaskedSignBit, LogicMinMax);
+    } else {
+      // Likewise, getSignedMinValue returns a bit pattern with only the highest
+      // bit set. This one *sets* only the sign bit, and only if the first
+      // operand is *negative*.
+      SDValue AndMask = DAG.getConstantFP(
+          APFloat(Sem, APInt::getSignMask(EltBits)), DL, LogicVT);
+      SDValue MaskedSignBit =
+          DAG.getNode(X86ISD::FAND, DL, LogicVT, LogicX, AndMask);
+      Result =
+          DAG.getNode(X86ISD::FOR, DL, LogicVT, MaskedSignBit, LogicMinMax);
+    }
+
+    // Extract scalar back from vector.
+    if (IsFakeVector)
+      MinMax = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, VT, Result,
+                           DAG.getVectorIdxConstant(0, DL));
+    else
+      MinMax = Result;
+  }
+
   if (IgnoreNaN || DAG.isKnownNeverNaN(IsNum ? NewY : NewX))
     return MinMax;
 
-  SDValue NaNSrc = IsNum ? MinMax : NewX;
+  // The x86 min/max return the second operand if either is NaN, which doesn't
+  // match the numeric or non-numeric semantics. For the non-numeric versions,
+  // we want to return NaN if either operand is NaN. To do that, we check if
+  // NewX (the first operand) is NaN, and select it if so. For the numeric
+  // versions, we want to return the non-NaN operand if there is one. So we
+  // check if NewY (the second operand) is NaN, and again select the first
+  // operand if so.
+  SDValue NaNSrc = IsNum ? NewY : NewX;
   SDValue IsNaN = DAG.getSetCC(DL, SetCCType, NaNSrc, NaNSrc, ISD::SETUO);
 
   return DAG.getSelect(DL, VT, IsNaN, NewX, MinMax);
diff --git a/llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll b/llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll
index 53ac283170a5f..59cf38f82b7c0 100644
--- a/llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll
+++ b/llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll
@@ -13,14 +13,9 @@ declare <32 x half> @llvm.maximum.v32f16(<32 x half>, <32 x half>)
 define half @test_fminimum(half %x, half %y) {
 ; CHECK-LABEL: test_fminimum:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vmovw %xmm0, %eax
-; CHECK-NEXT:    testw %ax, %ax
-; CHECK-NEXT:    sets %al
-; CHECK-NEXT:    kmovd %eax, %k1
-; CHECK-NEXT:    vmovaps %xmm1, %xmm2
-; CHECK-NEXT:    vmovsh %xmm0, %xmm0, %xmm2 {%k1}
-; CHECK-NEXT:    vmovsh %xmm1, %xmm0, %xmm0 {%k1}
-; CHECK-NEXT:    vminsh %xmm2, %xmm0, %xmm1
+; CHECK-NEXT:    vminsh %xmm1, %xmm0, %xmm2
+; CHECK-NEXT:    vpbroadcastw {{.*#+}} xmm1 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0]
+; CHECK-NEXT:    vpternlogq {{.*#+}} xmm1 = (xmm1 & xmm0) | xmm2
 ; CHECK-NEXT:    vcmpunordsh %xmm0, %xmm0, %k1
 ; CHECK-NEXT:    vmovsh %xmm0, %xmm0, %xmm1 {%k1}
 ; CHECK-NEXT:    vmovaps %xmm1, %xmm0
@@ -92,16 +87,12 @@ define half @test_fminimum_combine_cmps(half %x, half %y) {
 define half @test_fmaximum(half %x, half %y) {
 ; CHECK-LABEL: test_fmaximum:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vmovw %xmm0, %eax
-; CHECK-NEXT:    testw %ax, %ax
-; CHECK-NEXT:    sets %al
-; CHECK-NEXT:    kmovd %eax, %k1
-; CHECK-NEXT:    vmovaps %xmm0, %xmm2
-; CHECK-NEXT:    vmovsh %xmm1, %xmm0, %xmm2 {%k1}
+; CHECK-NEXT:    vmaxsh %xmm1, %xmm0, %xmm2
+; CHECK-NEXT:    vpbroadcastw {{.*#+}} xmm1 = [NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]
+; CHECK-NEXT:    vpternlogq {{.*#+}} xmm1 = xmm2 & (xmm1 | xmm0)
+; CHECK-NEXT:    vcmpunordsh %xmm0, %xmm0, %k1
 ; CHECK-NEXT:    vmovsh %xmm0, %xmm0, %xmm1 {%k1}
-; CHECK-NEXT:    vmaxsh %xmm2, %xmm1, %xmm0
-; CHECK-NEXT:    vcmpunordsh %xmm1, %xmm1, %k1
-; CHECK-NEXT:    vmovsh %xmm1, %xmm0, %xmm0 {%k1}
+; CHECK-NEXT:    vmovaps %xmm1, %xmm0
 ; CHECK-NEXT:    retq
   %r = call half @llvm.maximum.f16(half %x, half %y)
   ret half %r
@@ -196,10 +187,9 @@ define <16 x half> @test_fmaximum_v16f16_nans(<16 x half> %x, <16 x half> %y) "n
 define <32 x half> @test_fminimum_v32f16_szero(<32 x half> %x, <32 x half> %y) "no-nans-fp-math"="true" {
 ; CHECK-LABEL: test_fminimum_v32f16_szero:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vpmovw2m %zmm0, %k1
-; CHECK-NEXT:    vpblendmw %zmm0, %zmm1, %zmm2 {%k1}
-; CHECK-NEXT:    vmovdqu16 %zmm1, %zmm0 {%k1}
-; CHECK-NEXT:    vminph %zmm2, %zmm0, %zmm0
+; CHECK-NEXT:    vminph %zmm1, %zmm0, %zmm1
+; CHECK-NEXT:    vpbroadcastw {{.*#+}} zmm2 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0]
+; CHECK-NEXT:    vpternlogq {{.*#+}} zmm0 = (zmm0 & zmm2) | zmm1
 ; CHECK-NEXT:    retq
   %r = call <32 x half> @llvm.minimum.v32f16(<32 x half> %x, <32 x half> %y)
   ret <32 x half> %r
@@ -208,12 +198,12 @@ define <32 x half> @test_fminimum_v32f16_szero(<32 x half> %x, <32 x half> %y) "
 define <32 x half> @test_fmaximum_v32f16_nans_szero(<32 x half> %x, <32 x half> %y) {
 ; CHECK-LABEL: test_fmaximum_v32f16_nans_szero:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vpmovw2m %zmm0, %k1
-; CHECK-NEXT:    vpblendmw %zmm1, %zmm0, %zmm2 {%k1}
+; CHECK-NEXT:    vmaxph %zmm1, %zmm0, %zmm2
+; CHECK-NEXT:    vpbroadcastw {{.*#+}} zmm1 = [NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]
+; CHECK-NEXT:    vpternlogq {{.*#+}} zmm1 = zmm2 & (zmm1 | zmm0)
+; CHECK-NEXT:    vcmpunordph %zmm0, %zmm0, %k1
 ; CHECK-NEXT:    vmovdqu16 %zmm0, %zmm1 {%k1}
-; CHECK-NEXT:    vmaxph %zmm2, %zmm1, %zmm0
-; CHECK-NEXT:    vcmpunordph %zmm1, %zmm1, %k1
-; CHECK-NEXT:    vmovdqu16 %zmm1, %zmm0 {%k1}
+; CHECK-NEXT:    vmovdqa64 %zmm1, %zmm0
 ; CHECK-NEXT:    retq
   %r = call <32 x half> @llvm.maximum.v32f16(<32 x half> %x, <32 x half> %y)
   ret <32 x half> %r
diff --git a/llvm/test/CodeGen/X86/extractelement-fp.ll b/llvm/test/CodeGen/X86/extractelement-fp.ll
index 1706f17eac165..9ce1fd6f20976 100644
--- a/llvm/test/CodeGen/X86/extractelement-fp.ll
+++ b/llvm/test/CodeGen/X86/extractelement-fp.ll
@@ -675,37 +675,23 @@ define double @fminnum_v4f64(<4 x double> %x, <4 x double> %y) nounwind {
 define float @fmaximum_v4f32(<4 x float> %x, <4 x float> %y) nounwind {
 ; X64-LABEL: fmaximum_v4f32:
 ; X64:       # %bb.0:
-; X64-NEXT:    vmovd %xmm0, %eax
-; X64-NEXT:    testl %eax, %eax
-; X64-NEXT:    js .LBB30_1
-; X64-NEXT:  # %bb.2:
-; X64-NEXT:    vmovdqa %xmm0, %xmm2
-; X64-NEXT:    jmp .LBB30_3
-; X64-NEXT:  .LBB30_1:
-; X64-NEXT:    vmovdqa %xmm1, %xmm2
-; X64-NEXT:    vmovdqa %xmm0, %xmm1
-; X64-NEXT:  .LBB30_3:
-; X64-NEXT:    vmaxss %xmm2, %xmm1, %xmm0
-; X64-NEXT:    vcmpunordss %xmm1, %xmm1, %xmm2
-; X64-NEXT:    vblendvps %xmm2, %xmm1, %xmm0, %xmm0
+; X64-NEXT:    vbroadcastss {{.*#+}} xmm2 = [NaN,NaN,NaN,NaN]
+; X64-NEXT:    vorps %xmm2, %xmm0, %xmm2
+; X64-NEXT:    vmaxss %xmm1, %xmm0, %xmm1
+; X64-NEXT:    vandps %xmm1, %xmm2, %xmm1
+; X64-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm2
+; X64-NEXT:    vblendvps %xmm2, %xmm0, %xmm1, %xmm0
 ; X64-NEXT:    retq
 ;
 ; X86-LABEL: fmaximum_v4f32:
 ; X86:       # %bb.0:
-; X86-NEXT:    vmovd %xmm0, %eax
-; X86-NEXT:    testl %eax, %eax
-; X86-NEXT:    js .LBB30_1
-; X86-NEXT:  # %bb.2:
-; X86-NEXT:    vmovdqa %xmm0, %xmm2
-; X86-NEXT:    jmp .LBB30_3
-; X86-NEXT:  .LBB30_1:
-; X86-NEXT:    vmovdqa %xmm1, %xmm2
-; X86-NEXT:    vmovdqa %xmm0, %xmm1
-; X86-NEXT:  .LBB30_3:
 ; X86-NEXT:    pushl %eax
-; X86-NEXT:    vmaxss %xmm2, %xmm1, %xmm0
-; X86-NEXT:    vcmpunordss %xmm1, %xmm1, %xmm2
-; X86-NEXT:    vblendvps %xmm2, %xmm1, %xmm0, %xmm0
+; X86-NEXT:    vbroadcastss {{.*#+}} xmm2 = [NaN,NaN,NaN,NaN]
+; X86-NEXT:    vorps %xmm2, %xmm0, %xmm2
+; X86-NEXT:    vmaxss %xmm1, %xmm0, %xmm1
+; X86-NEXT:    vandps %xmm1, %xmm2, %xmm1
+; X86-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm2
+; X86-NEXT:    vblendvps %xmm2, %xmm0, %xmm1, %xmm0
 ; X86-NEXT:    vmovss %xmm0, (%esp)
 ; X86-NEXT:    flds (%esp)
 ; X86-NEXT:    popl %eax
@@ -718,41 +704,25 @@ define float @fmaximum_v4f32(<4 x float> %x, <4 x float> %y) nounwind {
 define double @fmaximum_v4f64(<4 x double> %x, <4 x double> %y) nounwind {
 ; X64-LABEL: fmaximum_v4f64:
 ; X64:       # %bb.0:
-; X64-NEXT:    vmovq %xmm0, %rax
-; X64-NEXT:    testq %rax, %rax
-; X64-NEXT:    js .LBB31_1
-; X64-NEXT:  # %bb.2:
-; X64-NEXT:    vmovdqa %xmm0, %xmm2
-; X64-NEXT:    jmp .LBB31_3
-; X64-NEXT:  .LBB31_1:
-; X64-NEXT:    vmovdqa %xmm1, %xmm2
-; X64-NEXT:    vmovdqa %xmm0, %xmm1
-; X64-NEXT:  .LBB31_3:
-; X64-NEXT:    vmaxsd %xmm2, %xmm1, %xmm0
-; X64-NEXT:    vcmpunordsd %xmm1, %xmm1, %xmm2
-; X64-NEXT:    vblendvpd %xmm2, %xmm1, %xmm0, %xmm0
+; X64-NEXT:    vmaxsd %xmm1, %xmm0, %xmm1
+; X64-NEXT:    vorpd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm2
+; X64-NEXT:    vandpd %xmm1, %xmm2, %xmm1
+; X64-NEXT:    vcmpunordsd %xmm0, %xmm0, %xmm2
+; X64-NEXT:    vblendvpd %xmm2, %xmm0, %xmm1, %xmm0
 ; X64-NEXT:    vzeroupper
 ; X64-NEXT:    retq
 ;
 ; X86-LABEL: fmaximum_v4f64:
 ; X86:       # %bb.0:
-; X86-NEXT:    vextractps $1, %xmm0, %eax
-; X86-NEXT:    testl %eax, %eax
-; X86-NEXT:    js .LBB31_1
-; X86-NEXT:  # %bb.2:
-; X86-NEXT:    vmovapd %xmm0, %xmm2
-; X86-NEXT:    jmp .LBB31_3
-; X86-NEXT:  .LBB31_1:
-; X86-NEXT:    vmovapd %xmm1, %xmm2
-; X86-NEXT:    vmovapd %xmm0, %xmm1
-; X86-NEXT:  .LBB31_3:
 ; X86-NEXT:    pushl %ebp
 ; X86-NEXT:    movl %esp, %ebp
 ; X86-NEXT:    andl $-8, %esp
 ; X86-NEXT:    subl $8, %esp
-; X86-NEXT:    vmaxsd %xmm2, %xmm1, %xmm0
-; X86-NEXT:    vcmpunordsd %xmm1, %xmm1, %xmm2
-; X86-NEXT:    vblendvpd %xmm2, %xmm1, %xmm0, %xmm0
+; X86-NEXT:    vmaxsd %xmm1, %xmm0, %xmm1
+; X86-NEXT:    vorpd {{\.?LCPI[0-9]+_[0-9]+}}, %xmm0, %xmm2
+; X86-NEXT:    vandpd %xmm1, %xmm2, %xmm1
+; X86-NEXT:    vcmpunordsd %xmm0, %xmm0, %xmm2
+; X86-NEXT:    vblendvpd %xmm2, %xmm0, %xmm1, %xmm0
 ; X86-NEXT:    vmovlpd %xmm0, (%esp)
 ; X86-NEXT:    fldl (%esp)
 ; X86-NEXT:    movl %ebp, %esp
@@ -767,35 +737,21 @@ define double @fmaximum_v4f64(<4 x double> %x, <4 x double> %y) nounwind {
 define float @fminimum_v4f32(<4 x float> %x, <4 x float> %y) nounwind {
 ; X64-LABEL: fminimum_v4f32:
 ; X64:       # %bb.0:
-; X64-NEXT:    vmovd %xmm0, %eax
-; X64-NEXT:    testl %eax, %eax
-; X64-NEXT:    js .LBB32_1
-; X64-NEXT:  # %bb.2:
-; X64-NEXT:    vmovdqa %xmm1, %xmm2
-; X64-NEXT:    jmp .LBB32_3
-; X64-NEXT:  .LBB32_1:
-; X64-NEXT:    vmovdqa %xmm0, %xmm2
-; X64-NEXT:    vmovdqa %xmm1, %xmm0
-; X64-NEXT:  .LBB32_3:
-; X64-NEXT:    vminss %xmm2, %xmm0, %xmm1
+; X64-NEXT:    vbroadcastss {{.*#+}} xmm2 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0]
+; X64-NEXT:    vandps %xmm2, %xmm0, %xmm2
+; X64-NEXT:    vminss %xmm1, %xmm0, %xmm1
+; X64-NEXT:    vorps %xmm1, %xmm2, %xmm1
 ; X64-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm2
 ; X64-NEXT:    vblendvps %xmm2, %xmm0, %xmm1, %xmm0
 ; X64-NEXT:    retq
 ;
 ; X86-LABEL: fminimum_v4f32:
 ; X86:       # %bb.0:
-; X86-NEXT:    vmovd %xmm0, %eax
-; X86-NEXT:    testl %eax, %eax
-; X86-NEXT:    js .LBB32_1
-; X86-NEXT:  # %bb.2:
-; X86-NEXT:    vmovdqa %xmm1, %xmm2
-; X86-NEXT:    jmp .LBB32_3
-; X86-NEXT:  .LBB32_1:
-; X86-NEXT:    vmovdqa %xmm0, %xmm2
-; X86-NEXT:    vmovdqa %xmm1, %xmm0
-; X86-NEXT:  .LBB32_3:
 ; X86-NEXT:    pushl %eax
-; X86-NEXT:    vminss %xmm2, %xmm0, %xmm1
+; X86-NEXT:    vbroadcastss {{.*#+}} xmm2 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0]
+; X86-NEXT:    vandps %xmm2, %xmm0, %xmm2
+; X86-NEXT:    vminss %xmm1, %xmm0, %xmm1
+; X86-NEXT:    vorps %xmm1, %xmm2, %xmm1
 ; X86-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm2
 ; X86-NEXT:    vblendvps %xmm2, %xmm0, %xmm1, %xmm0
 ; X86-NEXT:    vmovss %xmm0, (%esp)
@@ -810,17 +766,9 @@ define float @fminimum_v4f32(<4 x float> %x, <4 x float> %y) nounwind {
 define double @fminimum_v4f64(<4 x double> %x, <4 x double> %y) nounwind {
 ; X64-LABEL: fminimum_v4f64:
 ; X64:       # %bb.0:
-; X64-NEXT:    vmovq %xmm0, %rax
-; X64-NEXT:    testq %rax, %rax
-; X64-NEXT:    js .LBB33_1
-; X64-NEXT:  # %bb.2:
-; X64-NEXT:    vmovdqa %xmm1, %xmm2
-; X64-NEXT:    jmp .LBB33_3
-; X64-NEXT:  .LBB33_1:
-; X64-NEXT:    vmovdqa %xmm0, %xmm2
-; X64-NEXT:    vmovdqa %xmm1, %xmm0
-; X64-NEXT:  .LBB33_3:
-; X64-NEXT:    vminsd %xmm2, %xmm0, %xmm1
+; X64-NEXT:    vminsd %xmm1, %xmm0, %xmm1
+; X64-NEXT:    vandpd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm2
+; X64-NEXT:    vorpd %xmm1, %xmm2, %xmm1
 ; X64-NEXT:    vcmpunordsd %xmm0, %xmm0, %xmm2
 ; X64-NEXT:    vblendvpd %xmm2, %xmm0, %xmm1, %xmm0
 ; X64-NEXT:    vzeroupper
@@ -828,21 +776,13 @@ define double @fminimum_v4f64(<4 x double> %x, <4 x double> %y) nounwind {
 ;
 ; X86-LABEL: fminimum_v4f64:
 ; X86:       # %bb.0:
-; X86-NEXT:    vextractps $1, %xmm0, %eax
-; X86-NEXT:    testl %eax, %eax
-; X86-NEXT:    js .LBB33_1
-; X86-NEXT:  # %bb.2:
-; X86-NEXT:    vmovapd %xmm1, %xmm2
-; X86-NEXT:    jmp .LBB33_3
-; X86-NEXT:  .LBB33_1:
-; X86-NEXT:    vmovapd %xmm0, %xmm2
-; X86-NEXT:    vmovapd %xmm1, %xmm0
-; X86-NEXT:  .LBB33_3:
 ; X86-NEXT:    pushl %ebp
 ; X86-NEXT:    movl %esp, %ebp
 ; X86-NEXT:    andl $-8, %esp
 ; X86-NEXT:    subl $8, %esp
-; X86-NEXT:    vminsd %xmm2, %xmm0, %xmm1
+; X86-NEXT:    vminsd %xmm1, %xmm0, %xmm1
+; X86-NEXT:    vandpd {{\.?LCPI[0-9]+_[0-9]+}}, %xmm0, %xmm2
+; X86-NEXT:    vorpd %xmm1, %xmm2, %xmm1
 ; X86-NEXT:    vcmpunordsd %xmm0, %xmm0, %xmm2
 ; X86-NEXT:    vblendvpd %xmm2, %xmm0, %xmm1, %xmm0
 ; X86-NEXT:    vmovlpd %xmm0, (%esp)
diff --git a/llvm/test/CodeGen/X86/fminimum-fmaximum.ll b/llvm/test/CodeGen/X86/fminimum-fmaximum.ll
index 06515e4f82687..910acd6d82ae2 100644
--- a/llvm/test/CodeGen/X86/fminimum-fmaximum.ll
+++ b/llvm/test/C...
[truncated]

; SSE2-NEXT: maxss %xmm1, %xmm0
; SSE2-NEXT: andnps %xmm0, %xmm2
; SSE2-NEXT: orps %xmm3, %xmm2
; SSE2-NEXT: movaps %xmm2, %xmm0
Contributor

One more instruction here.

Contributor Author

I think this is a result of changing the NaN handling to check the source operand for NaN in minimumnum and maximumnum instead of the result operand. There's an extra movaps because we can't reuse a register; I'm not a microarchitecture expert, but I suspect it's "free" due to register renaming.

; X86-NEXT: vmaxss {{[0-9]+}}(%esp), %xmm0, %xmm1
; X86-NEXT: vcmpunordss %xmm1, %xmm1, %xmm2
; X86-NEXT: vblendvps %xmm2, %xmm0, %xmm1, %xmm0
; X86-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero
Contributor

ditto.

Contributor Author

I think the same answer as above applies here.

@@ -1782,4 +1228,5 @@ declare double @llvm.vector.reduce.fmaximum.v4f64(<4 x double>)
declare double @llvm.vector.reduce.fmaximum.v8f64(<8 x double>)
declare double @llvm.vector.reduce.fmaximum.v16f64(<16 x double>)
;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
; AVX512: {{.*}}
Contributor

Remove unused prefix.

Contributor Author

As the comment above says, this list of prefixes is autogenerated. Is there some incantation I can use to have it not generate this list?

Contributor

Yes, remove it from the RUN lines.

Contributor Author

@valadaptive valadaptive Dec 1, 2025

There are some other issues with the RUN lines; the SSE prefix is also unused and there's a warning about conflicting output for the two AVX lines (I think the +avx2 one was meant to use the AVX2 prefix.) Should I fix those in a separate PR?

Contributor

Yes, that would be great.

Contributor

@phoebewang phoebewang left a comment

LGTM.

@ActuallyaDeviloper

ActuallyaDeviloper commented Dec 9, 2025

I am also concerned about the memory operation. Is this really the best way to implement these?

Some ideas:

  • AVX-512 has VRANGEPS which, despite its name, computes "minimumnum"/"maximumnum" like IEEE 754-2008.
  • I think minimum can be implemented like minimum(x,y) = minps(x,y)|minps(y,x) (see the sketch after this list). If one of x or y is NaN, then one of the two minps results will be NaN and so will be the result. If neither is NaN or zero, then they agree and hence the "or" is the "minps". Finally, if both are zero, then minps(x,y)|minps(y,x) = y|x, which prefers the signed zero.
  • The other variants can be generated by a similar construction for maximumnum and a NaN fixup, I think.
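
A minimal SSE sketch of that construction (my illustration of the idea, not code from this PR):

#include <immintrin.h>

// minps returns its second operand when either input is NaN or both inputs
// are zero, so OR-ing the two orderings propagates NaNs and makes the
// both-zero case come out as x|y, which prefers -0.0.
__m128 minimum_via_double_min(__m128 x, __m128 y) {
  return _mm_or_ps(_mm_min_ps(x, y), _mm_min_ps(y, x));
}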

@valadaptive
Contributor Author

  • AVX-512 has VRANGEPS which, despite its name, computes "minimumnum"/"maximumnum" like IEEE 754-2008.

VRANGEPS does handle signed zeroes properly, but IEEE754-2008 semantics are not what we want here. If one argument is a signaling NaN, VRANGEPS and the IEEE754-2008 operations both return a quiet NaN, but we want to return the non-NaN operand no matter what. On AArch64, where the FMINNM and FMAXNM instructions also implement IEEE754-2008, I believe we insert some "canonicalize" operations on the inputs to quiet any NaNs. We could do something similar here, but that's best left for a future PR.
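
For illustration, such a quieting step could look like the sketch below; this is my assumption of the shape, not anything in this PR. Multiplying by 1.0 preserves signed zeros and turns sNaNs into qNaNs on x86:

#include <immintrin.h>

// Quiet any signaling NaNs without disturbing other values: x * 1.0 keeps
// -0.0 as -0.0 and quiets sNaN.
__m128 quiet_nans(__m128 v) { return _mm_mul_ps(v, _mm_set1_ps(1.0f)); }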

  • I think minimum can be implemented like minimum(x,y) = minps(x,y)|minps(y,x). If one of x or y is NaN, then one of the two will be NaN and so will be the result. If both are not NaN or zero, then they agree and hence the "or" is the "minps". Finally, if both are zero, then minps(x,y)|minps(y,x)=y|x which prefers the signed zero.

Indeed, this is what Cranelift does. I'd have to check how it performs against the bitwise approach. They insert a bunch of extra operations to ensure the result is a quiet NaN, which LLVM doesn't care about.

@ActuallyaDeviloper

ActuallyaDeviloper commented Dec 9, 2025

VRANGEPS does handle signed zeroes properly, but IEEE754-2008 semantics are not what we want here. If one argument is a signaling NaN, VRANGEPS and the IEEE754-2008 operations both return a quiet NaN, but we want to return the non-NaN operand no matter what. On AArch64, where the FMINNM and FMAXNM instructions also implement IEEE754-2008, I believe we insert some "canonicalize" operations on the inputs to quiet any NaNs. We could do something similar here, but that's best left for a future PR.

I agree, I just have also added some comments at the associated NaN discussion. I am not sure if a consensus was reached in that regard, although I personally am not against that approach.

AVX-512 can be pushed to a future update. Also, it would be really nice if we could eliminate the "canonicalize" operations when we can prove that we don't need them.

Indeed, this is what Cranelift does. I'd have to check how it performs against the bitwise approach. They insert a bunch of extra operations to ensure the result is a quiet NaN, which LLVM doesn't care about.

Ah, so the idea is not novel. Based on the instruction timings, it should be quite competitive, especially given the memory load otherwise.

@valadaptive
Contributor Author

Thinking about this a bit more, I think doing two min/max operations and combining them is a lot more specialized and tricky than what I've implemented here, and can be deferred to the future. The operations are asymmetric, and you have to make sure you follow LLVM's NaN semantics.

  • minps(x, y) | minps(y, x) for a non-numeric minimum is asymmetric with regard to the maximum. Getting the non-numeric maximum requires more operations, as seen in Cranelift.

  • minps(x, y) | minps(y, x) also violates LLVM's NaN payload semantics (emphasis mine):

    Floating-point math operations that return a NaN are an exception from the general principle that LLVM implements IEEE-754 semantics. Unless specified otherwise, the following rules apply whenever the IEEE-754 semantics say that a NaN value is returned: the result has a non-deterministic sign; the quiet bit and payload are non-deterministically chosen from the following set of options:

    • The quiet bit is set and the payload is all-zero. (“Preferred NaN” case)

    • The quiet bit is set and the payload is copied from any input operand that is a NaN. (“Quieting NaN propagation” case)

    • The quiet bit and payload are copied from any input operand that is a NaN. (“Unchanged NaN propagation” case)

    • The quiet bit is set and the payload is picked from a target-specific set of “extra” possible NaN payloads. The set can depend on the input operand values. This set is empty on x86 and ARM, but can be non-empty on other architectures. (For instance, on wasm, if any input NaN does not have the preferred all-zero payload or any input NaN is an SNaN, then this set contains all possible payloads; otherwise, it is empty. On SPARC, this set consists of the all-one payload.)

    The bitwise "or" copies the mantissa bits from the non-NaN operand, which LLVM explicitly does not allow. You can fix this with an extra fixup step afterwards like Cranelift does, but the speed tradeoff becomes even more murky.

  • Extending those operations to prefer the non-NaN operand seems tricky. Not only do you now have to check against two source operands (if you do only one minps/maxps, you only have to check if one source operand is NaN), but whatever bitwise tricks you perform must not create signaling NaN values. For LLVM's semantics, it's fine to pass through a signaling NaN input, but you may not create a signaling NaN if none of the input operands were signaling NaNs.
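
Concretely (a standalone sketch with made-up values): if x is a quiet NaN and y is an ordinary number, minps(x,y) | minps(y,x) reduces to x|y, which splices y's mantissa bits into the NaN payload:

#include <bit>
#include <cstdint>
#include <cstdio>

int main() {
  uint32_t x = 0x7FC00001u;                     // qNaN with payload 0x000001
  uint32_t y = std::bit_cast<uint32_t>(1.25f);  // 0x3FA00000
  // minps(x, y) -> y and minps(y, x) -> x, so the OR is x | y:
  std::printf("%08X\n", x | y);  // 7FE00001: the payload gained a bit from y
}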

I think loading the sign masks from memory can be avoided. LLVM already "materializes" an all-ones value using something like pcmpeqd xmm0, xmm0, and you can extend that to "materialize" any mask value (a run of ones followed by zeroes, or vice versa) by performing a left or right shift afterwards. That's a different part of the codebase, however, and it'd have much broader performance implications.
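
In intrinsics form, the standard trick looks something like this (my sketch, not this patch):

#include <immintrin.h>

// All ones without a memory load: compare a register against itself.
static __m128i all_ones() {
  __m128i z = _mm_undefined_si128();
  return _mm_cmpeq_epi32(z, z);
}

__m128i sign_mask_f32() { return _mm_slli_epi32(all_ones(), 31); } // 0x80000000
__m128i not_sign_f32()  { return _mm_srli_epi32(all_ones(), 1); }  // 0x7FFFFFFF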

@ActuallyaDeviloper

Thinking about this a bit more, I think doing two min/max operations and combining them is a lot more specialized and tricky than what I've implemented here, and can be deferred to the future. The operations are asymmetric, and you have to make sure you follow LLVM's NaN semantics.

Alright, thanks for looking into it. The NaN guarantee is unfortunate.

I hope we can improve the code generation further at a later time.

@valadaptive
Contributor Author

Ping

@phoebewang
Contributor

Ping

Do you want me to merge it or wait for someone's approval?

@valadaptive
Contributor Author

I'll see if I can get another review on this. Probably should wait until after Christmas now that I think about it.

bool IsFakeVector = !VT.isVector();
MVT LogicVT = VT.getSimpleVT();
if (IsFakeVector)
LogicVT = (VT == MVT::f64) ? MVT::v2f64
Collaborator

Should we assert that VT is f64/f32/f16 here? Is there any way that fp80/bf16 code can arrive here?

Contributor Author

I had to double-check; apparently bf16 code can call this lowering function, but I added an assert before here and it's not hit.

Just to make sure, I added tests for the scalar bf16 versions of minimum/maximum/minimumnum/maximumnum, and added CHECK lines for AVX512-BF16.

declare float @llvm.minimumnum.f32(float, float)
declare double @llvm.minimumnum.f64(double, double)
declare bfloat @llvm.minimumnum.bf16(bfloat, bfloat)
Contributor Author

I never know what to do with these. I know declarations aren't required in regression tests anymore, but it feels wrong to leave them out for only new functions, or to remove them all entirely.

@github-actions

github-actions bot commented Jan 1, 2026

✅ With the latest revision this PR passed the C/C++ code formatter.

@github-actions

github-actions bot commented Jan 1, 2026

🪟 Windows x64 Test Results

  • 129244 tests passed
  • 2852 tests skipped

✅ The build succeeded and all tests passed.

@github-actions

github-actions bot commented Jan 1, 2026

🐧 Linux x64 Test Results

  • 188237 tests passed
  • 4994 tests skipped

✅ The build succeeded and all tests passed.

Collaborator

@RKSimon RKSimon left a comment

Fix CI fminimumnum-fmaximumnum.ll failures

@valadaptive
Contributor Author

Regenerated the tests; I think #140193 changed the codegen there.

@RKSimon RKSimon self-requested a review January 8, 2026 10:27
@arsenm arsenm added the floating-point Floating-point math label Jan 8, 2026
Comment on lines 29785 to 29786
SDValue OrMask = DAG.getConstantFP(
APFloat(Sem, APInt::getSignedMaxValue(EltBits)), DL, LogicVT);
Contributor

this is getConstantFP(APFloat::getInf()) with extra steps?

Contributor Author

Unless I don't understand what getInf() does, it's different. Infinity is an all-ones exponent with an all-zeroes mantissa. This mask value is all ones in the exponent and mantissa, that is, everywhere but the sign bit.
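
For reference, the two f32 bit patterns in question (a standalone snippet, not from the patch):

#include <bit>
#include <cstdint>
#include <limits>

// +inf: all-ones exponent, zero mantissa.
static_assert(std::bit_cast<uint32_t>(
                  std::numeric_limits<float>::infinity()) == 0x7F800000u);
// APInt::getSignedMaxValue(32) as a bit pattern: everything but the sign bit.
constexpr uint32_t OrMask = 0x7FFFFFFFu;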

Contributor

You can also use getNaN() with an all-ones payload

Contributor

And we have getAllOnesValue; we could maybe add another partner with a cleared sign bit.

Contributor Author

I copied what LowerFCOPYSIGN does already. I think ~APInt::getSignMask(EltSizeInBits) would be clearest if I were to change it, but I'm not sure it's worth diverging from what LowerFCOPYSIGN does. Should I just change them both?

Contributor

I guess leave it as-is. One of my long term annoyances is the lack of useful APFloat constructors, so we end up doing complex calculations to synthesize trivial values (e.g., some places do string conversions just to get a simple constant)

Collaborator

Would it be better to do this as a bitcasted integer value? I hate the way that the APFloat(Sem, APInt) constructor doesn't make it obvious if it's bitcasting or converting the APInt....

Contributor

I think we could cover nearly all uses with a signbit mask float, and a getPowerOfTwo(Log2) constructor

Collaborator

@RKSimon RKSimon left a comment

@arsenm ping?

Contributor

@arsenm arsenm left a comment

LGTM but I don't know x86


Collaborator

@RKSimon RKSimon left a comment

LGTM - cheers

@RKSimon RKSimon enabled auto-merge (squash) January 14, 2026 18:35
@RKSimon
Collaborator

RKSimon commented Jan 14, 2026

Thanks @valadaptive - if you're still interested in looking at tweaking the NaN handling, please raise another patch

@RKSimon RKSimon merged commit 7d5d4db into llvm:main Jan 14, 2026
8 of 10 checks passed
@github-actions

@valadaptive Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

Priyanshu3820 pushed a commit to Priyanshu3820/llvm-project that referenced this pull request Jan 18, 2026
…se operations (llvm#170069)

BStott6 pushed a commit to BStott6/llvm-project that referenced this pull request Jan 22, 2026
…se operations (llvm#170069)
