
Conversation

@valadaptive
Contributor

@valadaptive valadaptive commented Dec 1, 2025

I got somewhat nerd-sniped when looking at a Rust issue and seeing this comment about how various min/max operations are compiled on various architectures (rust-lang/rust#91079). The current minimum/maximum/minimumnum/maximumnum code is very branchy because of the signed-zero handling. Even though we emit select operations, LLVM really prefers to lower them to branches, to the point of scalarizing vector code to do so, even if blendv is supported. (Should I open a separate issue for that? It seems concerning that LLVM would rather scalarize a vector operation than emit a couple of blendv operations in a row.)

It turns out that handling signed zero operands properly can be done using a couple of bitwise operations, which is branchless and easily vectorizable, by taking advantage of the following properties (a small demonstration follows the list):

  • When you take the maximum of two floats, the output sign bit will be the bitwise AND of the input sign bits (since 0 means positive, and the maximum always prefers the positive number).
  • When you take the minimum of two floats, the output sign bit will be the bitwise OR of the input sign bits (since 1 means negative, and the minimum always prefers the negative number).
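
As a quick sanity check of the sign-bit property, here's a standalone C++ sketch (my illustration, not code from this patch; ref_maximum is a hypothetical reference that orders -0.0 below +0.0 and ignores NaNs):

#include <bit>
#include <cassert>
#include <cstdint>

static uint32_t sign(float f) { return std::bit_cast<uint32_t>(f) >> 31; }

// Reference maximum that treats -0.0 as less than +0.0 (NaNs excluded).
static float ref_maximum(float x, float y) {
  if (x == y)                // also covers +0.0 vs. -0.0, which compare equal
    return sign(x) ? y : x;  // prefer the operand whose sign bit is clear
  return x > y ? x : y;
}

int main() {
  const float vals[] = {-2.0f, -0.0f, +0.0f, 3.5f};
  for (float x : vals)
    for (float y : vals)
      assert(sign(ref_maximum(x, y)) == (sign(x) & sign(y)));
}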

We can further optimize this by taking advantage of the fact that x86's min/max instructions operate like a floating-point compare+select, returning the second operand if both are (positive or negative) zero. Altogether, the operations go as follows (a scalar sketch follows the steps):

  • For taking the minimum:

    • Call minps/minpd/etc. on the input operands. This will return the minimum, unless both are zero, in which case it will return the second operand.
    • Take the bitwise AND of the first operand and a bit pattern with only the highest bit set, so that everything is zeroed except the sign bit.
    • Finally, OR that with the minimum from earlier. The only incorrect case was when the second operand was +0.0 and the first operand was -0.0. By OR-ing the first operand's sign bit with the existing minimum, we correct this.
  • Analogously, for taking the maximum:

    • Call maxps/maxpd/etc. on the input operands. This will return the maximum, unless both are zero, in which case it will return the second operand.
    • Take the bitwise OR of the first operand and a bit pattern which is all ones except for the highest bit, so that every bit is one except the sign bit, which keeps the first operand's sign.
    • Finally, AND that with the maximum from earlier. The only incorrect case was when the second operand was -0.0 and the first operand was +0.0; AND-ing with the first operand's cleared sign bit corrects this.
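
Here's a scalar model of the whole sequence as a plain C++ sketch (x86_min/x86_max are hypothetical stand-ins for minss/maxss; nothing here is the actual SelectionDAG code):

#include <bit>
#include <cstdint>

// minss/maxss return the *second* operand when the inputs compare equal
// (both zeros) or unordered (NaN); the ternaries below model exactly that.
static float x86_min(float x, float y) { return x < y ? x : y; }
static float x86_max(float x, float y) { return x > y ? x : y; }

static float branchless_minimum(float x, float y) {
  uint32_t m = std::bit_cast<uint32_t>(x86_min(x, y));
  uint32_t sx = std::bit_cast<uint32_t>(x) & 0x80000000u; // only x's sign bit
  // If x is -0.0 and y is +0.0, minss returned +0.0; OR-ing in x's sign fixes it.
  return std::bit_cast<float>(m | sx);
}

static float branchless_maximum(float x, float y) {
  uint32_t m = std::bit_cast<uint32_t>(x86_max(x, y));
  uint32_t mask = std::bit_cast<uint32_t>(x) | 0x7FFFFFFFu; // ones everywhere but the sign bit, which is x's
  // If x is +0.0 and y is -0.0, maxss returned -0.0; the AND clears the sign
  // bit exactly when x is non-negative.
  return std::bit_cast<float>(m & mask);
}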

In the case of NaNs, this approach might change the output NaN's sign bit. We don't have to worry about this for a couple of reasons: firstly, LLVM's language reference allows NaNs to have a nondeterministic sign bit (https://llvm.org/docs/LangRef.html#floatnan); secondly, there's already a step after this that selects one of the input NaNs anyway.

Here's an Alive2 proof: https://alive2.llvm.org/ce/z/EfQZ-G. It obviously can't verify that the implementation is sound, but it shows that at least the theory is.

I believe this approach is faster than even properly-vectorized blendv operations because it eliminates a data dependency chain. Furthermore, on AVX-512, the load, AND, and OR can become a single vpternlogd. My highly-unrepresentative microbenchmarks (compiled for x86-64-v2, so SSE4.1) say ~7.5%-10% faster than blendv, which makes me confident this is at least not a regression.
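
For intuition, here's roughly how the min-side fixup (x & signmask) | minresult folds into a single ternary-logic op, as an intrinsics-level sketch assuming AVX-512F (the patch itself leaves this to instruction selection):

#include <immintrin.h>

// With the canonical truth-table constants A=0xF0, B=0xCC, C=0xAA, the
// function (A & B) | C has imm8 = (0xF0 & 0xCC) | 0xAA = 0xEA.
__m512i fixup_min(__m512i x, __m512i signmask, __m512i minresult) {
  return _mm512_ternarylogic_epi32(x, signmask, minresult, 0xEA);
}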

As another minor optimization (I could split it into a separate PR if you prefer, but I'm significantly reworking the code here anyway), I've changed the NaN selection for the "prefer numeric" operations from SDValue NaNSrc = IsNum ? MinMax : NewX; to SDValue NaNSrc = IsNum ? NewY : NewX;. This has identical behavior, as detailed in the comment I added, but eliminates the cmpunord instruction's data dependency on the min/max operation immediately preceding it. This optimization is included in the Alive2 proof.

One thing I didn't change here: if we're comparing scalar f16s, the target supports AVX-512, and we know neither is NaN, there's a different signed-zero-correction path that uses VFPCLASSS. I haven't analyzed how it works, but it might be worth just removing that and using the bitwise version in all cases.

@github-actions

github-actions bot commented Dec 1, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Dec 1, 2025

@llvm/pr-subscribers-backend-x86

Author: None (valadaptive)

Changes

Patch is 293.52 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/170069.diff

6 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+81-40)
  • (modified) llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll (+16-26)
  • (modified) llvm/test/CodeGen/X86/extractelement-fp.ll (+36-96)
  • (modified) llvm/test/CodeGen/X86/fminimum-fmaximum.ll (+500-851)
  • (modified) llvm/test/CodeGen/X86/fminimumnum-fmaximumnum.ll (+619-1056)
  • (modified) llvm/test/CodeGen/X86/vector-reduce-fmaximum.ll (+877-1430)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 1b0bf6823e390..b6db3706c5d3a 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -29480,7 +29480,6 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget,
   uint64_t SizeInBits = VT.getScalarSizeInBits();
   APInt PreferredZero = APInt::getZero(SizeInBits);
   APInt OppositeZero = PreferredZero;
-  EVT IVT = VT.changeTypeToInteger();
   X86ISD::NodeType MinMaxOp;
   if (IsMaxOp) {
     MinMaxOp = X86ISD::FMAX;
@@ -29492,8 +29491,8 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget,
   EVT SetCCType =
       TLI.getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), VT);
 
-  // The tables below show the expected result of Max in cases of NaN and
-  // signed zeros.
+  // The tables below show the expected result of Max in cases of NaN and signed
+  // zeros.
   //
   //                 Y                       Y
   //             Num   xNaN              +0     -0
@@ -29503,12 +29502,9 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget,
   //    xNaN  |   X  |  X/Y |     -0  |  +0  |  -0  |
   //          ---------------         ---------------
   //
-  // It is achieved by means of FMAX/FMIN with preliminary checks and operand
-  // reordering.
-  //
-  // We check if any of operands is NaN and return NaN. Then we check if any of
-  // operands is zero or negative zero (for fmaximum and fminimum respectively)
-  // to ensure the correct zero is returned.
+  // It is achieved by means of FMAX/FMIN with preliminary checks, operand
+  // reordering if one operand is a constant, and bitwise operations and selects
+  // to handle signed zero and NaN operands otherwise.
   auto MatchesZero = [](SDValue Op, APInt Zero) {
     Op = peekThroughBitcasts(Op);
     if (auto *CstOp = dyn_cast<ConstantFPSDNode>(Op))
@@ -29539,15 +29535,17 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget,
                           Op->getFlags().hasNoSignedZeros() ||
                           DAG.isKnownNeverZeroFloat(X) ||
                           DAG.isKnownNeverZeroFloat(Y);
-  SDValue NewX, NewY;
+  bool ShouldHandleZeros = true;
+  SDValue NewX = X;
+  SDValue NewY = Y;
   if (IgnoreSignedZero || MatchesZero(Y, PreferredZero) ||
       MatchesZero(X, OppositeZero)) {
     // Operands are already in right order or order does not matter.
-    NewX = X;
-    NewY = Y;
+    ShouldHandleZeros = false;
   } else if (MatchesZero(X, PreferredZero) || MatchesZero(Y, OppositeZero)) {
     NewX = Y;
     NewY = X;
+    ShouldHandleZeros = false;
   } else if (!VT.isVector() && (VT == MVT::f16 || Subtarget.hasDQI()) &&
              (Op->getFlags().hasNoNaNs() || IsXNeverNaN || IsYNeverNaN)) {
     if (IsXNeverNaN)
@@ -29569,33 +29567,6 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget,
     NewX = DAG.getSelect(DL, VT, NeedSwap, Y, X);
     NewY = DAG.getSelect(DL, VT, NeedSwap, X, Y);
     return DAG.getNode(MinMaxOp, DL, VT, NewX, NewY, Op->getFlags());
-  } else {
-    SDValue IsXSigned;
-    if (Subtarget.is64Bit() || VT != MVT::f64) {
-      SDValue XInt = DAG.getNode(ISD::BITCAST, DL, IVT, X);
-      SDValue ZeroCst = DAG.getConstant(0, DL, IVT);
-      IsXSigned = DAG.getSetCC(DL, SetCCType, XInt, ZeroCst, ISD::SETLT);
-    } else {
-      assert(VT == MVT::f64);
-      SDValue Ins = DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, MVT::v2f64,
-                                DAG.getConstantFP(0, DL, MVT::v2f64), X,
-                                DAG.getVectorIdxConstant(0, DL));
-      SDValue VX = DAG.getNode(ISD::BITCAST, DL, MVT::v4f32, Ins);
-      SDValue Hi = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::f32, VX,
-                               DAG.getVectorIdxConstant(1, DL));
-      Hi = DAG.getBitcast(MVT::i32, Hi);
-      SDValue ZeroCst = DAG.getConstant(0, DL, MVT::i32);
-      EVT SetCCType = TLI.getSetCCResultType(DAG.getDataLayout(),
-                                             *DAG.getContext(), MVT::i32);
-      IsXSigned = DAG.getSetCC(DL, SetCCType, Hi, ZeroCst, ISD::SETLT);
-    }
-    if (MinMaxOp == X86ISD::FMAX) {
-      NewX = DAG.getSelect(DL, VT, IsXSigned, X, Y);
-      NewY = DAG.getSelect(DL, VT, IsXSigned, Y, X);
-    } else {
-      NewX = DAG.getSelect(DL, VT, IsXSigned, Y, X);
-      NewY = DAG.getSelect(DL, VT, IsXSigned, X, Y);
-    }
   }
 
   bool IgnoreNaN = DAG.getTarget().Options.NoNaNsFPMath ||
@@ -29612,10 +29583,80 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget,
 
   SDValue MinMax = DAG.getNode(MinMaxOp, DL, VT, NewX, NewY, Op->getFlags());
 
+  // We handle signed-zero ordering by taking the larger (or smaller) sign bit.
+  if (ShouldHandleZeros) {
+    const fltSemantics &Sem = VT.getFltSemantics();
+    unsigned EltBits = VT.getScalarSizeInBits();
+    bool IsFakeVector = !VT.isVector();
+    MVT LogicVT = VT.getSimpleVT();
+    if (IsFakeVector)
+      LogicVT = (VT == MVT::f64)   ? MVT::v2f64
+                : (VT == MVT::f32) ? MVT::v4f32
+                                   : MVT::v8f16;
+
+    // We take the sign bit from the first operand and combine it with the
+    // output sign bit (see below). Right now, if ShouldHandleZeros is true, the
+    // operands will never have been swapped. If you add another optimization
+    // that swaps the input operands if one is a known value, make sure this
+    // logic stays correct!
+    SDValue LogicX = NewX;
+    SDValue LogicMinMax = MinMax;
+    if (IsFakeVector) {
+      // Promote scalars to vectors for bitwise operations.
+      LogicX = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, LogicVT, NewX);
+      LogicMinMax = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, LogicVT, MinMax);
+    }
+
+    // x86's min/max operations return the second operand if both inputs are
+    // signed zero. For the maximum operation, we want to "and" the sign bit of
+    // the output with the sign bit of the first operand--that means that if the
+    // first operand is +0.0, the output will be too. For the minimum, it's the
+    // opposite: we "or" the output sign bit with the sign bit of the first
+    // operand, ensuring that if the first operand is -0.0, the output will be
+    // too.
+    SDValue Result;
+    if (IsMaxOp) {
+      // getSignedMaxValue returns a bit pattern of all ones but the highest
+      // bit. We "or" that with the first operand, then "and" that with the max
+      // operation's result. That clears only the sign bit, and only if the
+      // first operand is positive.
+      SDValue OrMask = DAG.getConstantFP(
+          APFloat(Sem, APInt::getSignedMaxValue(EltBits)), DL, LogicVT);
+      SDValue MaskedSignBit =
+          DAG.getNode(X86ISD::FOR, DL, LogicVT, LogicX, OrMask);
+      Result =
+          DAG.getNode(X86ISD::FAND, DL, LogicVT, MaskedSignBit, LogicMinMax);
+    } else {
+      // Likewise, getSignedMinValue returns a bit pattern with only the highest
+      // bit set. This one *sets* only the sign bit, and only if the first
+      // operand is *negative*.
+      SDValue AndMask = DAG.getConstantFP(
+          APFloat(Sem, APInt::getSignMask(EltBits)), DL, LogicVT);
+      SDValue MaskedSignBit =
+          DAG.getNode(X86ISD::FAND, DL, LogicVT, LogicX, AndMask);
+      Result =
+          DAG.getNode(X86ISD::FOR, DL, LogicVT, MaskedSignBit, LogicMinMax);
+    }
+
+    // Extract scalar back from vector.
+    if (IsFakeVector)
+      MinMax = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, VT, Result,
+                           DAG.getVectorIdxConstant(0, DL));
+    else
+      MinMax = Result;
+  }
+
   if (IgnoreNaN || DAG.isKnownNeverNaN(IsNum ? NewY : NewX))
     return MinMax;
 
-  SDValue NaNSrc = IsNum ? MinMax : NewX;
+  // The x86 min/max return the second operand if either is NaN, which doesn't
+  // match the numeric or non-numeric semantics. For the non-numeric versions,
+  // we want to return NaN if either operand is NaN. To do that, we check if
+  // NewX (the first operand) is NaN, and select it if so. For the numeric
+  // versions, we want to return the non-NaN operand if there is one. So we
+  // check if NewY (the second operand) is NaN, and again select the first
+  // operand if so.
+  SDValue NaNSrc = IsNum ? NewY : NewX;
   SDValue IsNaN = DAG.getSetCC(DL, SetCCType, NaNSrc, NaNSrc, ISD::SETUO);
 
   return DAG.getSelect(DL, VT, IsNaN, NewX, MinMax);
diff --git a/llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll b/llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll
index 53ac283170a5f..59cf38f82b7c0 100644
--- a/llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll
+++ b/llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll
@@ -13,14 +13,9 @@ declare <32 x half> @llvm.maximum.v32f16(<32 x half>, <32 x half>)
 define half @test_fminimum(half %x, half %y) {
 ; CHECK-LABEL: test_fminimum:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vmovw %xmm0, %eax
-; CHECK-NEXT:    testw %ax, %ax
-; CHECK-NEXT:    sets %al
-; CHECK-NEXT:    kmovd %eax, %k1
-; CHECK-NEXT:    vmovaps %xmm1, %xmm2
-; CHECK-NEXT:    vmovsh %xmm0, %xmm0, %xmm2 {%k1}
-; CHECK-NEXT:    vmovsh %xmm1, %xmm0, %xmm0 {%k1}
-; CHECK-NEXT:    vminsh %xmm2, %xmm0, %xmm1
+; CHECK-NEXT:    vminsh %xmm1, %xmm0, %xmm2
+; CHECK-NEXT:    vpbroadcastw {{.*#+}} xmm1 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0]
+; CHECK-NEXT:    vpternlogq {{.*#+}} xmm1 = (xmm1 & xmm0) | xmm2
 ; CHECK-NEXT:    vcmpunordsh %xmm0, %xmm0, %k1
 ; CHECK-NEXT:    vmovsh %xmm0, %xmm0, %xmm1 {%k1}
 ; CHECK-NEXT:    vmovaps %xmm1, %xmm0
@@ -92,16 +87,12 @@ define half @test_fminimum_combine_cmps(half %x, half %y) {
 define half @test_fmaximum(half %x, half %y) {
 ; CHECK-LABEL: test_fmaximum:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vmovw %xmm0, %eax
-; CHECK-NEXT:    testw %ax, %ax
-; CHECK-NEXT:    sets %al
-; CHECK-NEXT:    kmovd %eax, %k1
-; CHECK-NEXT:    vmovaps %xmm0, %xmm2
-; CHECK-NEXT:    vmovsh %xmm1, %xmm0, %xmm2 {%k1}
+; CHECK-NEXT:    vmaxsh %xmm1, %xmm0, %xmm2
+; CHECK-NEXT:    vpbroadcastw {{.*#+}} xmm1 = [NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]
+; CHECK-NEXT:    vpternlogq {{.*#+}} xmm1 = xmm2 & (xmm1 | xmm0)
+; CHECK-NEXT:    vcmpunordsh %xmm0, %xmm0, %k1
 ; CHECK-NEXT:    vmovsh %xmm0, %xmm0, %xmm1 {%k1}
-; CHECK-NEXT:    vmaxsh %xmm2, %xmm1, %xmm0
-; CHECK-NEXT:    vcmpunordsh %xmm1, %xmm1, %k1
-; CHECK-NEXT:    vmovsh %xmm1, %xmm0, %xmm0 {%k1}
+; CHECK-NEXT:    vmovaps %xmm1, %xmm0
 ; CHECK-NEXT:    retq
   %r = call half @llvm.maximum.f16(half %x, half %y)
   ret half %r
@@ -196,10 +187,9 @@ define <16 x half> @test_fmaximum_v16f16_nans(<16 x half> %x, <16 x half> %y) "n
 define <32 x half> @test_fminimum_v32f16_szero(<32 x half> %x, <32 x half> %y) "no-nans-fp-math"="true" {
 ; CHECK-LABEL: test_fminimum_v32f16_szero:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vpmovw2m %zmm0, %k1
-; CHECK-NEXT:    vpblendmw %zmm0, %zmm1, %zmm2 {%k1}
-; CHECK-NEXT:    vmovdqu16 %zmm1, %zmm0 {%k1}
-; CHECK-NEXT:    vminph %zmm2, %zmm0, %zmm0
+; CHECK-NEXT:    vminph %zmm1, %zmm0, %zmm1
+; CHECK-NEXT:    vpbroadcastw {{.*#+}} zmm2 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0]
+; CHECK-NEXT:    vpternlogq {{.*#+}} zmm0 = (zmm0 & zmm2) | zmm1
 ; CHECK-NEXT:    retq
   %r = call <32 x half> @llvm.minimum.v32f16(<32 x half> %x, <32 x half> %y)
   ret <32 x half> %r
@@ -208,12 +198,12 @@ define <32 x half> @test_fminimum_v32f16_szero(<32 x half> %x, <32 x half> %y) "
 define <32 x half> @test_fmaximum_v32f16_nans_szero(<32 x half> %x, <32 x half> %y) {
 ; CHECK-LABEL: test_fmaximum_v32f16_nans_szero:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vpmovw2m %zmm0, %k1
-; CHECK-NEXT:    vpblendmw %zmm1, %zmm0, %zmm2 {%k1}
+; CHECK-NEXT:    vmaxph %zmm1, %zmm0, %zmm2
+; CHECK-NEXT:    vpbroadcastw {{.*#+}} zmm1 = [NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]
+; CHECK-NEXT:    vpternlogq {{.*#+}} zmm1 = zmm2 & (zmm1 | zmm0)
+; CHECK-NEXT:    vcmpunordph %zmm0, %zmm0, %k1
 ; CHECK-NEXT:    vmovdqu16 %zmm0, %zmm1 {%k1}
-; CHECK-NEXT:    vmaxph %zmm2, %zmm1, %zmm0
-; CHECK-NEXT:    vcmpunordph %zmm1, %zmm1, %k1
-; CHECK-NEXT:    vmovdqu16 %zmm1, %zmm0 {%k1}
+; CHECK-NEXT:    vmovdqa64 %zmm1, %zmm0
 ; CHECK-NEXT:    retq
   %r = call <32 x half> @llvm.maximum.v32f16(<32 x half> %x, <32 x half> %y)
   ret <32 x half> %r
diff --git a/llvm/test/CodeGen/X86/extractelement-fp.ll b/llvm/test/CodeGen/X86/extractelement-fp.ll
index 1706f17eac165..9ce1fd6f20976 100644
--- a/llvm/test/CodeGen/X86/extractelement-fp.ll
+++ b/llvm/test/CodeGen/X86/extractelement-fp.ll
@@ -675,37 +675,23 @@ define double @fminnum_v4f64(<4 x double> %x, <4 x double> %y) nounwind {
 define float @fmaximum_v4f32(<4 x float> %x, <4 x float> %y) nounwind {
 ; X64-LABEL: fmaximum_v4f32:
 ; X64:       # %bb.0:
-; X64-NEXT:    vmovd %xmm0, %eax
-; X64-NEXT:    testl %eax, %eax
-; X64-NEXT:    js .LBB30_1
-; X64-NEXT:  # %bb.2:
-; X64-NEXT:    vmovdqa %xmm0, %xmm2
-; X64-NEXT:    jmp .LBB30_3
-; X64-NEXT:  .LBB30_1:
-; X64-NEXT:    vmovdqa %xmm1, %xmm2
-; X64-NEXT:    vmovdqa %xmm0, %xmm1
-; X64-NEXT:  .LBB30_3:
-; X64-NEXT:    vmaxss %xmm2, %xmm1, %xmm0
-; X64-NEXT:    vcmpunordss %xmm1, %xmm1, %xmm2
-; X64-NEXT:    vblendvps %xmm2, %xmm1, %xmm0, %xmm0
+; X64-NEXT:    vbroadcastss {{.*#+}} xmm2 = [NaN,NaN,NaN,NaN]
+; X64-NEXT:    vorps %xmm2, %xmm0, %xmm2
+; X64-NEXT:    vmaxss %xmm1, %xmm0, %xmm1
+; X64-NEXT:    vandps %xmm1, %xmm2, %xmm1
+; X64-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm2
+; X64-NEXT:    vblendvps %xmm2, %xmm0, %xmm1, %xmm0
 ; X64-NEXT:    retq
 ;
 ; X86-LABEL: fmaximum_v4f32:
 ; X86:       # %bb.0:
-; X86-NEXT:    vmovd %xmm0, %eax
-; X86-NEXT:    testl %eax, %eax
-; X86-NEXT:    js .LBB30_1
-; X86-NEXT:  # %bb.2:
-; X86-NEXT:    vmovdqa %xmm0, %xmm2
-; X86-NEXT:    jmp .LBB30_3
-; X86-NEXT:  .LBB30_1:
-; X86-NEXT:    vmovdqa %xmm1, %xmm2
-; X86-NEXT:    vmovdqa %xmm0, %xmm1
-; X86-NEXT:  .LBB30_3:
 ; X86-NEXT:    pushl %eax
-; X86-NEXT:    vmaxss %xmm2, %xmm1, %xmm0
-; X86-NEXT:    vcmpunordss %xmm1, %xmm1, %xmm2
-; X86-NEXT:    vblendvps %xmm2, %xmm1, %xmm0, %xmm0
+; X86-NEXT:    vbroadcastss {{.*#+}} xmm2 = [NaN,NaN,NaN,NaN]
+; X86-NEXT:    vorps %xmm2, %xmm0, %xmm2
+; X86-NEXT:    vmaxss %xmm1, %xmm0, %xmm1
+; X86-NEXT:    vandps %xmm1, %xmm2, %xmm1
+; X86-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm2
+; X86-NEXT:    vblendvps %xmm2, %xmm0, %xmm1, %xmm0
 ; X86-NEXT:    vmovss %xmm0, (%esp)
 ; X86-NEXT:    flds (%esp)
 ; X86-NEXT:    popl %eax
@@ -718,41 +704,25 @@ define float @fmaximum_v4f32(<4 x float> %x, <4 x float> %y) nounwind {
 define double @fmaximum_v4f64(<4 x double> %x, <4 x double> %y) nounwind {
 ; X64-LABEL: fmaximum_v4f64:
 ; X64:       # %bb.0:
-; X64-NEXT:    vmovq %xmm0, %rax
-; X64-NEXT:    testq %rax, %rax
-; X64-NEXT:    js .LBB31_1
-; X64-NEXT:  # %bb.2:
-; X64-NEXT:    vmovdqa %xmm0, %xmm2
-; X64-NEXT:    jmp .LBB31_3
-; X64-NEXT:  .LBB31_1:
-; X64-NEXT:    vmovdqa %xmm1, %xmm2
-; X64-NEXT:    vmovdqa %xmm0, %xmm1
-; X64-NEXT:  .LBB31_3:
-; X64-NEXT:    vmaxsd %xmm2, %xmm1, %xmm0
-; X64-NEXT:    vcmpunordsd %xmm1, %xmm1, %xmm2
-; X64-NEXT:    vblendvpd %xmm2, %xmm1, %xmm0, %xmm0
+; X64-NEXT:    vmaxsd %xmm1, %xmm0, %xmm1
+; X64-NEXT:    vorpd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm2
+; X64-NEXT:    vandpd %xmm1, %xmm2, %xmm1
+; X64-NEXT:    vcmpunordsd %xmm0, %xmm0, %xmm2
+; X64-NEXT:    vblendvpd %xmm2, %xmm0, %xmm1, %xmm0
 ; X64-NEXT:    vzeroupper
 ; X64-NEXT:    retq
 ;
 ; X86-LABEL: fmaximum_v4f64:
 ; X86:       # %bb.0:
-; X86-NEXT:    vextractps $1, %xmm0, %eax
-; X86-NEXT:    testl %eax, %eax
-; X86-NEXT:    js .LBB31_1
-; X86-NEXT:  # %bb.2:
-; X86-NEXT:    vmovapd %xmm0, %xmm2
-; X86-NEXT:    jmp .LBB31_3
-; X86-NEXT:  .LBB31_1:
-; X86-NEXT:    vmovapd %xmm1, %xmm2
-; X86-NEXT:    vmovapd %xmm0, %xmm1
-; X86-NEXT:  .LBB31_3:
 ; X86-NEXT:    pushl %ebp
 ; X86-NEXT:    movl %esp, %ebp
 ; X86-NEXT:    andl $-8, %esp
 ; X86-NEXT:    subl $8, %esp
-; X86-NEXT:    vmaxsd %xmm2, %xmm1, %xmm0
-; X86-NEXT:    vcmpunordsd %xmm1, %xmm1, %xmm2
-; X86-NEXT:    vblendvpd %xmm2, %xmm1, %xmm0, %xmm0
+; X86-NEXT:    vmaxsd %xmm1, %xmm0, %xmm1
+; X86-NEXT:    vorpd {{\.?LCPI[0-9]+_[0-9]+}}, %xmm0, %xmm2
+; X86-NEXT:    vandpd %xmm1, %xmm2, %xmm1
+; X86-NEXT:    vcmpunordsd %xmm0, %xmm0, %xmm2
+; X86-NEXT:    vblendvpd %xmm2, %xmm0, %xmm1, %xmm0
 ; X86-NEXT:    vmovlpd %xmm0, (%esp)
 ; X86-NEXT:    fldl (%esp)
 ; X86-NEXT:    movl %ebp, %esp
@@ -767,35 +737,21 @@ define double @fmaximum_v4f64(<4 x double> %x, <4 x double> %y) nounwind {
 define float @fminimum_v4f32(<4 x float> %x, <4 x float> %y) nounwind {
 ; X64-LABEL: fminimum_v4f32:
 ; X64:       # %bb.0:
-; X64-NEXT:    vmovd %xmm0, %eax
-; X64-NEXT:    testl %eax, %eax
-; X64-NEXT:    js .LBB32_1
-; X64-NEXT:  # %bb.2:
-; X64-NEXT:    vmovdqa %xmm1, %xmm2
-; X64-NEXT:    jmp .LBB32_3
-; X64-NEXT:  .LBB32_1:
-; X64-NEXT:    vmovdqa %xmm0, %xmm2
-; X64-NEXT:    vmovdqa %xmm1, %xmm0
-; X64-NEXT:  .LBB32_3:
-; X64-NEXT:    vminss %xmm2, %xmm0, %xmm1
+; X64-NEXT:    vbroadcastss {{.*#+}} xmm2 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0]
+; X64-NEXT:    vandps %xmm2, %xmm0, %xmm2
+; X64-NEXT:    vminss %xmm1, %xmm0, %xmm1
+; X64-NEXT:    vorps %xmm1, %xmm2, %xmm1
 ; X64-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm2
 ; X64-NEXT:    vblendvps %xmm2, %xmm0, %xmm1, %xmm0
 ; X64-NEXT:    retq
 ;
 ; X86-LABEL: fminimum_v4f32:
 ; X86:       # %bb.0:
-; X86-NEXT:    vmovd %xmm0, %eax
-; X86-NEXT:    testl %eax, %eax
-; X86-NEXT:    js .LBB32_1
-; X86-NEXT:  # %bb.2:
-; X86-NEXT:    vmovdqa %xmm1, %xmm2
-; X86-NEXT:    jmp .LBB32_3
-; X86-NEXT:  .LBB32_1:
-; X86-NEXT:    vmovdqa %xmm0, %xmm2
-; X86-NEXT:    vmovdqa %xmm1, %xmm0
-; X86-NEXT:  .LBB32_3:
 ; X86-NEXT:    pushl %eax
-; X86-NEXT:    vminss %xmm2, %xmm0, %xmm1
+; X86-NEXT:    vbroadcastss {{.*#+}} xmm2 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0]
+; X86-NEXT:    vandps %xmm2, %xmm0, %xmm2
+; X86-NEXT:    vminss %xmm1, %xmm0, %xmm1
+; X86-NEXT:    vorps %xmm1, %xmm2, %xmm1
 ; X86-NEXT:    vcmpunordss %xmm0, %xmm0, %xmm2
 ; X86-NEXT:    vblendvps %xmm2, %xmm0, %xmm1, %xmm0
 ; X86-NEXT:    vmovss %xmm0, (%esp)
@@ -810,17 +766,9 @@ define float @fminimum_v4f32(<4 x float> %x, <4 x float> %y) nounwind {
 define double @fminimum_v4f64(<4 x double> %x, <4 x double> %y) nounwind {
 ; X64-LABEL: fminimum_v4f64:
 ; X64:       # %bb.0:
-; X64-NEXT:    vmovq %xmm0, %rax
-; X64-NEXT:    testq %rax, %rax
-; X64-NEXT:    js .LBB33_1
-; X64-NEXT:  # %bb.2:
-; X64-NEXT:    vmovdqa %xmm1, %xmm2
-; X64-NEXT:    jmp .LBB33_3
-; X64-NEXT:  .LBB33_1:
-; X64-NEXT:    vmovdqa %xmm0, %xmm2
-; X64-NEXT:    vmovdqa %xmm1, %xmm0
-; X64-NEXT:  .LBB33_3:
-; X64-NEXT:    vminsd %xmm2, %xmm0, %xmm1
+; X64-NEXT:    vminsd %xmm1, %xmm0, %xmm1
+; X64-NEXT:    vandpd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm2
+; X64-NEXT:    vorpd %xmm1, %xmm2, %xmm1
 ; X64-NEXT:    vcmpunordsd %xmm0, %xmm0, %xmm2
 ; X64-NEXT:    vblendvpd %xmm2, %xmm0, %xmm1, %xmm0
 ; X64-NEXT:    vzeroupper
@@ -828,21 +776,13 @@ define double @fminimum_v4f64(<4 x double> %x, <4 x double> %y) nounwind {
 ;
 ; X86-LABEL: fminimum_v4f64:
 ; X86:       # %bb.0:
-; X86-NEXT:    vextractps $1, %xmm0, %eax
-; X86-NEXT:    testl %eax, %eax
-; X86-NEXT:    js .LBB33_1
-; X86-NEXT:  # %bb.2:
-; X86-NEXT:    vmovapd %xmm1, %xmm2
-; X86-NEXT:    jmp .LBB33_3
-; X86-NEXT:  .LBB33_1:
-; X86-NEXT:    vmovapd %xmm0, %xmm2
-; X86-NEXT:    vmovapd %xmm1, %xmm0
-; X86-NEXT:  .LBB33_3:
 ; X86-NEXT:    pushl %ebp
 ; X86-NEXT:    movl %esp, %ebp
 ; X86-NEXT:    andl $-8, %esp
 ; X86-NEXT:    subl $8, %esp
-; X86-NEXT:    vminsd %xmm2, %xmm0, %xmm1
+; X86-NEXT:    vminsd %xmm1, %xmm0, %xmm1
+; X86-NEXT:    vandpd {{\.?LCPI[0-9]+_[0-9]+}}, %xmm0, %xmm2
+; X86-NEXT:    vorpd %xmm1, %xmm2, %xmm1
 ; X86-NEXT:    vcmpunordsd %xmm0, %xmm0, %xmm2
 ; X86-NEXT:    vblendvpd %xmm2, %xmm0, %xmm1, %xmm0
 ; X86-NEXT:    vmovlpd %xmm0, (%esp)
diff --git a/llvm/test/CodeGen/X86/fminimum-fmaximum.ll b/llvm/test/CodeGen/X86/fminimum-fmaximum.ll
index 06515e4f82687..910acd6d82ae2 100644
--- a/llvm/test/CodeGen/X86/fminimum-fmaximum.ll
+++ b/llvm/test/C...
[truncated]

; SSE2-NEXT: maxss %xmm1, %xmm0
; SSE2-NEXT: andnps %xmm0, %xmm2
; SSE2-NEXT: orps %xmm3, %xmm2
; SSE2-NEXT: movaps %xmm2, %xmm0
Contributor

One more instruction here.

Contributor Author

I think this is a result of changing the NaN handling to check the source operand for NaN in minimumnum and maximumnum instead of the result operand. There's an extra movaps because we can't reuse a register; I'm not a microarchitecture expert, but I suspect it's "free" due to register renaming.

; X86-NEXT: vmaxss {{[0-9]+}}(%esp), %xmm0, %xmm1
; X86-NEXT: vcmpunordss %xmm1, %xmm1, %xmm2
; X86-NEXT: vblendvps %xmm2, %xmm0, %xmm1, %xmm0
; X86-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero
Contributor

ditto.

Contributor Author

I think the same answer as above applies here.

@@ -1782,4 +1228,5 @@ declare double @llvm.vector.reduce.fmaximum.v4f64(<4 x double>)
declare double @llvm.vector.reduce.fmaximum.v8f64(<8 x double>)
declare double @llvm.vector.reduce.fmaximum.v16f64(<16 x double>)
;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
; AVX512: {{.*}}
Contributor

Remove unused prefix.

Contributor Author

As the comment above says, this list of prefixes is autogenerated. Is there some incantation I can use to have it not generate this list?

Contributor

Yes, remove it from the RUN lines.

Contributor Author

@valadaptive valadaptive Dec 1, 2025

There are some other issues with the RUN lines; the SSE prefix is also unused and there's a warning about conflicting output for the two AVX lines (I think the +avx2 one was meant to use the AVX2 prefix.) Should I fix those in a separate PR?

Contributor

Yes, that would be great.

Contributor

@phoebewang phoebewang left a comment

LGTM.

@ActuallyaDeviloper

ActuallyaDeviloper commented Dec 9, 2025

I am also concerned about the memory operation. Is this really the best way to implement these?

Some ideas:

  • AVX-512 has VRANGEPS which, despite its name, computes "minimumnum"/"maximumnum" like IEEE 754-2008.
  • I think minimum can be implemented like minimum(x,y) = minps(x,y)|minps(y,x) (see the sketch after this list). If one of x or y is NaN, then one of the two minps results will be NaN and so will be the result. If neither is NaN or zero, then they agree and hence the "or" is the "minps". Finally, if both are zero, then minps(x,y)|minps(y,x) = y|x, which prefers the signed zero.
  • The other variants can be generated by a similar construction for maximumnum and a NaN fixup, I think.
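
A minimal SSE sketch of that construction (my illustration of the idea, not code from this PR):

#include <immintrin.h>

// minps returns its second operand when either input is NaN or both inputs
// are zero, so OR-ing the two orderings propagates NaNs and makes the
// both-zero case come out as x|y, which prefers -0.0.
__m128 minimum_via_double_min(__m128 x, __m128 y) {
  return _mm_or_ps(_mm_min_ps(x, y), _mm_min_ps(y, x));
}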

@valadaptive
Contributor Author

  • AVX-512 has VRANGEPS which, despite its name, computes "minimumnum"/"maximumnum" like IEEE 754-2008.

VRANGEPS does handle signed zeroes properly, but IEEE754-2008 semantics are not what we want here. If one argument is a signaling NaN, VRANGEPS and the IEEE754-2008 operations both return a quiet NaN, but we want to return the non-NaN operand no matter what. On AArch64, where the FMINNM and FMAXNM instructions also implement IEEE754-2008, I believe we insert some "canonicalize" operations on the inputs to quiet any NaNs. We could do something similar here, but that's best left for a future PR.
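
For illustration, such a quieting step could look like the sketch below; this is my assumption of the shape, not anything in this PR. Multiplying by 1.0 preserves signed zeros and turns sNaNs into qNaNs on x86:

#include <immintrin.h>

// Quiet any signaling NaNs without disturbing other values: x * 1.0 keeps
// -0.0 as -0.0 and quiets sNaN.
__m128 quiet_nans(__m128 v) { return _mm_mul_ps(v, _mm_set1_ps(1.0f)); }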

  • I think minimum can be implemented like minimum(x,y) = minps(x,y)|minps(y,x). If one of x or y is NaN, then one of the two will be NaN and so will be the result. If both are not NaN or zero, then they agree and hence the "or" is the "minps". Finally, if both are zero, then minps(x,y)|minps(y,x)=y|x which prefers the signed zero.

Indeed, this is what Cranelift does. I'd have to check how it performs against the bitwise approach. They insert a bunch of extra operations to ensure the result is a quiet NaN, which LLVM doesn't care about.

@ActuallyaDeviloper

ActuallyaDeviloper commented Dec 9, 2025

VRANGEPS does handle signed zeroes properly, but IEEE754-2008 semantics are not what we want here. If one argument is a signaling NaN, VRANGEPS and the IEEE754-2008 operations both return a quiet NaN, but we want to return the non-NaN operand no matter what. On AArch64, where the FMINNM and FMAXNM instructions also implement IEEE754-2008, I believe we insert some "canonicalize" operations on the inputs to quiet any NaNs. We could do something similar here, but that's best left for a future PR.

I agree, I just have also added some comments at the associated NaN discussion. I am not sure if a consensus was reached in that regard, although I personally am not against that approach.

AVX-512 can be pushed to a future update. Also, it would be really nice if we could eliminate the "canonicalize" operations when we can prove that we don't need them.

Indeed, this is what Cranelift does. I'd have to check how it performs against the bitwise approach. They insert a bunch of extra operations to ensure the result is a quiet NaN, which LLVM doesn't care about.

Ah, so the idea is not novel. Based on the instruction timings, it should be quite competitive, especially given the memory load otherwise.

@valadaptive
Contributor Author

Thinking about this a bit more, I think doing two min/max operations and combining them is a lot more specialized and tricky than what I've implemented here, and can be deferred to the future. The operations are asymmetric, and you have to make sure you follow LLVM's NaN semantics.

  • minps(x, y) | minps(y, x) for a non-numeric minimum is asymmetric with regard to the maximum. Getting the non-numeric maximum requires more operations, as seen in Cranelift.

  • minps(x, y) | minps(y, x) also violates LLVM's NaN payload semantics (emphasis mine):

    Floating-point math operations that return a NaN are an exception from the general principle that LLVM implements IEEE-754 semantics. Unless specified otherwise, the following rules apply whenever the IEEE-754 semantics say that a NaN value is returned: the result has a non-deterministic sign; the quiet bit and payload are non-deterministically chosen from the following set of options:

    • The quiet bit is set and the payload is all-zero. (“Preferred NaN” case)

    • The quiet bit is set and the payload is copied from any input operand that is a NaN. (“Quieting NaN propagation” case)

    • The quiet bit and payload are copied from any input operand that is a NaN. (“Unchanged NaN propagation” case)

    • The quiet bit is set and the payload is picked from a target-specific set of “extra” possible NaN payloads. The set can depend on the input operand values. This set is empty on x86 and ARM, but can be non-empty on other architectures. (For instance, on wasm, if any input NaN does not have the preferred all-zero payload or any input NaN is an SNaN, then this set contains all possible payloads; otherwise, it is empty. On SPARC, this set consists of the all-one payload.)

    The bitwise "or" copies the mantissa bits from the non-NaN operand, which LLVM explicitly does not allow. You can fix this with an extra fixup step afterwards like Cranelift does, but the speed tradeoff becomes even more murky.

  • Extending those operations to prefer the non-NaN operand seems tricky. Not only do you now have to check against two source operands (if you do only one minps/maxps, you only have to check if one source operand is NaN), but whatever bitwise tricks you perform must not create signaling NaN values. For LLVM's semantics, it's fine to pass through a signaling NaN input, but you may not create a signaling NaN if none of the input operands were signaling NaNs.
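
Concretely (a standalone sketch with made-up values): if x is a quiet NaN and y is an ordinary number, minps(x,y) | minps(y,x) reduces to x|y, which splices y's mantissa bits into the NaN payload:

#include <bit>
#include <cstdint>
#include <cstdio>

int main() {
  uint32_t x = 0x7FC00001u;                     // qNaN with payload 0x000001
  uint32_t y = std::bit_cast<uint32_t>(1.25f);  // 0x3FA00000
  // minps(x, y) -> y and minps(y, x) -> x, so the OR is x | y:
  std::printf("%08X\n", x | y);  // 7FE00001: the payload gained a bit from y
}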

I think loading the sign masks from memory can be avoided. LLVM already "materializes" an all-ones value using something like pcmpeqd xmm0, xmm0, and you can extend that to "materialize" any mask value (a run of ones followed by zeroes, or vice versa) by performing a left or right shift afterwards. That's a different part of the codebase, however, and it'd have much broader performance implications.
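
In intrinsics form, the standard trick looks something like this (my sketch, not this patch):

#include <immintrin.h>

// All ones without a memory load: compare a register against itself.
static __m128i all_ones() {
  __m128i z = _mm_undefined_si128();
  return _mm_cmpeq_epi32(z, z);
}

__m128i sign_mask_f32() { return _mm_slli_epi32(all_ones(), 31); } // 0x80000000
__m128i not_sign_f32()  { return _mm_srli_epi32(all_ones(), 1); }  // 0x7FFFFFFF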

@ActuallyaDeviloper

Thinking about this a bit more, I think doing two min/max operations and combining them is a lot more specialized and tricky than what I've implemented here, and can be deferred to the future. The operations are asymmetric, and you have to make sure you follow LLVM's NaN semantics.

Alright, thanks for looking into it. The NaN guarantee is unfortunate.

I hope we can improve the code generation further at a later time.

@valadaptive
Contributor Author

Ping

@phoebewang
Contributor

Ping

Do you want me to merge it or wait for someone's approval?

@valadaptive
Contributor Author

I'll see if I can get another review on this. Probably should wait until after Christmas now that I think about it.

bool IsFakeVector = !VT.isVector();
MVT LogicVT = VT.getSimpleVT();
if (IsFakeVector)
LogicVT = (VT == MVT::f64) ? MVT::v2f64
Collaborator

Should we assert that VT is f64/f32/f16 here? Is there any way that fp80/bf16 code can arrive here?

Contributor Author

I had to double-check; apparently bf16 code can call this lowering function, but I added an assert before here and it's not hit.

Just to make sure, I added tests for the scalar bf16 versions of minimum/maximum/minimumnum/maximumnum, and added CHECK lines for AVX512-BF16.

declare float @llvm.minimumnum.f32(float, float)
declare double @llvm.minimumnum.f64(double, double)
declare bfloat @llvm.minimumnum.bf16(bfloat, bfloat)
Contributor Author

I never know what to do with these. I know declarations aren't required in regression tests anymore, but it feels wrong to leave them out for only new functions, or to remove them all entirely.

@github-actions

github-actions bot commented Jan 1, 2026

✅ With the latest revision this PR passed the C/C++ code formatter.

@github-actions

github-actions bot commented Jan 1, 2026

🪟 Windows x64 Test Results

  • 129244 tests passed
  • 2852 tests skipped

✅ The build succeeded and all tests passed.

@github-actions

github-actions bot commented Jan 1, 2026

🐧 Linux x64 Test Results

  • 188237 tests passed
  • 4994 tests skipped

✅ The build succeeded and all tests passed.

Collaborator

@RKSimon RKSimon left a comment

Fix CI fminimumnum-fmaximumnum.ll failures

@valadaptive
Contributor Author

Regenerated the tests; I think #140193 changed the codegen there.

@RKSimon RKSimon self-requested a review January 8, 2026 10:27
@arsenm arsenm added the floating-point Floating-point math label Jan 8, 2026
Comment on lines 29785 to 29786
SDValue OrMask = DAG.getConstantFP(
APFloat(Sem, APInt::getSignedMaxValue(EltBits)), DL, LogicVT);
Contributor

this is getConstantFP(APFloat::getInf()) with extra steps?

Contributor Author

Unless I don't understand what getInf() does, it's different. Infinity is an all-ones exponent with an all-zeroes mantissa. This mask value is all ones in the exponent and mantissa, that is, everywhere but the sign bit.
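
For reference, the two f32 bit patterns in question (a standalone snippet, not from the patch):

#include <bit>
#include <cstdint>
#include <limits>

// +inf: all-ones exponent, zero mantissa.
static_assert(std::bit_cast<uint32_t>(
                  std::numeric_limits<float>::infinity()) == 0x7F800000u);
// APInt::getSignedMaxValue(32) as a bit pattern: everything but the sign bit.
constexpr uint32_t OrMask = 0x7FFFFFFFu;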

Contributor

You can also use getNaN() with an all-ones payload

Contributor

And we have getAllOnesValue; we could maybe add another partner with a cleared sign bit.

Contributor Author

I copied what LowerFCOPYSIGN does already. I think ~APInt::getSignMask(EltSizeInBits) would be clearest if I were to change it, but I'm not sure it's worth diverging from what LowerFCOPYSIGN does. Should I just change them both?

Contributor

I guess leave it as-is. One of my long term annoyances is the lack of useful APFloat constructors, so we end up doing complex calculations to synthesize trivial values (e.g., some places do string conversions just to get a simple constant)

Collaborator

Would it be better to do this as a bitcasted integer value? I hate the way that the APFloat(Sem, APInt) constructor doesn't make it obvious if it's bitcasting or converting the APInt....

Contributor

I think we could cover nearly all uses with a signbit mask float, and a getPowerOfTwo(Log2) constructor

Collaborator

@RKSimon RKSimon left a comment

@arsenm ping?

Contributor

@arsenm arsenm left a comment

LGTM but I don't know x86


Collaborator

@RKSimon RKSimon left a comment

LGTM - cheers

@RKSimon RKSimon enabled auto-merge (squash) January 14, 2026 18:35
@RKSimon
Collaborator

RKSimon commented Jan 14, 2026

Thanks @valadaptive - if you're still interested in looking at tweaking the NaN handling, please raise another patch

@RKSimon RKSimon merged commit 7d5d4db into llvm:main Jan 14, 2026
8 of 10 checks passed
@github-actions

@valadaptive Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

Priyanshu3820 pushed a commit to Priyanshu3820/llvm-project that referenced this pull request Jan 18, 2026
…se operations (llvm#170069)

BStott6 pushed a commit to BStott6/llvm-project that referenced this pull request Jan 22, 2026
…se operations (llvm#170069)
