Optimize Enumerable Min/Max final reduction with shuffles by EgorBo · Pull Request #127995 · dotnet/runtime

EgorBo · 2026-05-09T16:46:36Z

Note

This PR was filed by AI (GitHub Copilot CLI).

Inspired by a similar helper in Tensors (TensorPrimitives.HorizontalAggregate). I guess we don't want to consume Tensors in Linq due to IL size concerns. Sharing the function via a common source file doesn't look beneficial either, because the two call sites use different generics. Maybe Vector128 will get this as a public API at some point.

Benchmark

[Params(16, 64)]
public int Length;

private byte[]   _bytes;
private sbyte[]  _sbytes;
private short[]  _shorts;
private ushort[] _ushorts;
private int[]    _ints;
private uint[]   _uints;
private long[]   _longs;
private ulong[]  _ulongs;

[Benchmark] public byte   MaxByte()   => _bytes.Max();
[Benchmark] public sbyte  MaxSByte()  => _sbytes.Max();
[Benchmark] public short  MaxShort()  => _shorts.Max();
[Benchmark] public ushort MaxUShort() => _ushorts.Max();
[Benchmark] public int    MaxInt()    => _ints.Max();
[Benchmark] public uint   MaxUInt()   => _uints.Max();
[Benchmark] public long   MaxLong()   => _longs.Max();
[Benchmark] public ulong  MaxULong()  => _ulongs.Max();

Full benchmark request and raw output: EgorBot/Benchmarks#198

Results

Speed-up = main time / PR time (higher is better). Numbers below are from EgorBot.

ARM64 — Neoverse-N2 (`ubuntu24_azure_cobalt100`)

Method	Length	main	PR	Speed-up
MaxByte	16	51.54 ns	1.85 ns	27.85×
MaxSByte	16	49.58 ns	1.90 ns	26.16×
MaxShort	16	23.66 ns	0.98 ns	24.16×
MaxUShort	16	21.82 ns	1.22 ns	17.93×
MaxInt	16	1.72 ns	1.84 ns	0.93×
MaxUInt	16	1.93 ns	1.74 ns	1.11×
MaxLong	16	3.77 ns	3.91 ns	0.96×
MaxULong	16	3.81 ns	3.93 ns	0.97×
MaxByte	64	49.64 ns	2.03 ns	24.51×
MaxSByte	64	48.76 ns	2.04 ns	23.95×
MaxShort	64	22.35 ns	3.53 ns	6.33×
MaxUShort	64	22.33 ns	3.36 ns	6.64×
MaxInt	64	6.33 ns	6.21 ns	1.02×
MaxUInt	64	6.17 ns	6.51 ns	0.95×
MaxLong	64	23.55 ns	23.51 ns	1.00×
MaxULong	64	23.36 ns	23.44 ns	1.00×

Intel — Emerald Rapids / AVX-512 (`ubuntu24_azure_emeraldrapids`)

Method	Length	main	PR	Speed-up
MaxByte	16	8.88 ns	0.93 ns	9.54×
MaxSByte	16	9.86 ns	0.89 ns	11.10×
MaxShort	16	4.42 ns	1.55 ns	2.87×
MaxUShort	16	4.45 ns	1.02 ns	4.38×
MaxInt	16	1.66 ns	0.88 ns	1.89×
MaxUInt	16	1.81 ns	1.50 ns	1.20×
MaxLong	16	0.96 ns	0.89 ns	1.08×
MaxULong	16	1.00 ns	0.94 ns	1.06×
MaxByte	64	16.31 ns	1.44 ns	11.39×
MaxSByte	64	10.03 ns	1.59 ns	6.33×
MaxShort	64	6.52 ns	1.37 ns	4.77×
MaxUShort	64	6.54 ns	1.54 ns	4.26×
MaxInt	64	2.44 ns	1.68 ns	1.47×
MaxUInt	64	2.84 ns	1.94 ns	1.47×
MaxLong	64	2.72 ns	3.17 ns	0.86×
MaxULong	64	3.17 ns	2.97 ns	1.07×

AMD — EPYC 9V45 (Turin) / AVX-512 (`ubuntu24_azure_turin`)

Method	Length	main	PR	Speed-up
MaxByte	16	4.41 ns	0.70 ns	6.27×
MaxSByte	16	4.96 ns	0.83 ns	5.96×
MaxShort	16	3.01 ns	0.87 ns	3.45×
MaxUShort	16	2.09 ns	0.89 ns	2.35×
MaxInt	16	1.48 ns	0.98 ns	1.52×
MaxUInt	16	1.64 ns	0.98 ns	1.68×
MaxLong	16	0.94 ns	0.91 ns	1.03×
MaxULong	16	1.05 ns	0.91 ns	1.16×
MaxByte	64	5.86 ns	1.27 ns	4.61×
MaxSByte	64	7.57 ns	1.28 ns	5.93×
MaxShort	64	2.78 ns	1.39 ns	2.01×
MaxUShort	64	2.96 ns	1.13 ns	2.61×
MaxInt	64	2.27 ns	1.61 ns	1.41×
MaxUInt	64	1.97 ns	1.62 ns	1.22×
MaxLong	64	3.02 ns	3.00 ns	1.01×
MaxULong	64	2.87 ns	2.10 ns	1.37×

Replaces the scalar per-lane reduction loop on the final 128-bit accumulator with a shuffle-based tree reduction (log2(Vector128<T>.Count) compare steps), introducing a HorizontalMinMax helper modeled after TensorPrimitives.HorizontalAggregate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
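To make the tree reduction concrete, here is a minimal sketch for the four-lane int case. It is not the code from this PR: the class and method names (HorizontalMaxSketch, HorizontalMax, ScalarMax) are hypothetical, and the only assumption is that Vector128.Shuffle and Vector128.Max are available (.NET 7 or later). Each step pairs every lane with a partner lane and keeps the larger value, so four lanes need log2(4) = 2 steps instead of one scalar compare per lane.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical illustration only; not the helper added by the PR.
static class HorizontalMaxSketch
{
    // Reduce the four int lanes of v to their maximum in two shuffle + Max steps.
    public static int HorizontalMax(Vector128<int> v)
    {
        // Step 1: swap the two 64-bit halves, so lane i is compared with lane i ^ 2.
        v = Vector128.Max(v, Vector128.Shuffle(v, Vector128.Create(2, 3, 0, 1)));

        // Step 2: swap adjacent lanes, so lane i is compared with lane i ^ 1.
        // After this step every lane holds the overall maximum.
        v = Vector128.Max(v, Vector128.Shuffle(v, Vector128.Create(1, 0, 3, 2)));

        return v.ToScalar();
    }

    // Baseline for comparison: one scalar compare per remaining lane.
    public static int ScalarMax(Vector128<int> v)
    {
        int best = v[0];
        for (int i = 1; i < Vector128<int>.Count; i++)
        {
            best = Math.Max(best, v[i]);
        }
        return best;
    }
}
```

For example, both HorizontalMaxSketch.HorizontalMax(Vector128.Create(3, 9, -4, 7)) and HorizontalMaxSketch.ScalarMax(Vector128.Create(3, 9, -4, 7)) return 9. With byte lanes the scalar loop needs 15 compares, while the tree needs only 4 steps, which is consistent with the largest gains in the tables above showing up for the byte and sbyte cases.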

dotnet-policy-service · 2026-05-09T16:47:46Z

Tagging subscribers to this area: @dotnet/area-system-linq
See info in area-owners.md if you want to be subscribed.

EgorBo · 2026-05-09T16:48:07Z

@MihuBot

EgorBo · 2026-05-09T16:48:25Z

Note

Benchmark generated by AI (GitHub Copilot CLI).

@EgorBot -linux_amd -linux_intel -linux_arm64

using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    [Params(16, 64)]
    public int Length;

    private byte[]   _bytes   = default!;
    private sbyte[]  _sbytes  = default!;
    private short[]  _shorts  = default!;
    private ushort[] _ushorts = default!;
    private int[]    _ints    = default!;
    private uint[]   _uints   = default!;
    private long[]   _longs   = default!;
    private ulong[]  _ulongs  = default!;

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);

        _bytes = new byte[Length];
        rng.NextBytes(_bytes);

        _sbytes = new sbyte[Length];
        for (int i = 0; i < Length; i++) _sbytes[i] = (sbyte)rng.Next(sbyte.MinValue, sbyte.MaxValue + 1);

        _shorts = new short[Length];
        for (int i = 0; i < Length; i++) _shorts[i] = (short)rng.Next(short.MinValue, short.MaxValue + 1);

        _ushorts = new ushort[Length];
        for (int i = 0; i < Length; i++) _ushorts[i] = (ushort)rng.Next(0, ushort.MaxValue + 1);

        _ints = new int[Length];
        for (int i = 0; i < Length; i++) _ints[i] = rng.Next();

        _uints = new uint[Length];
        for (int i = 0; i < Length; i++) _uints[i] = (uint)rng.Next();

        _longs = new long[Length];
        for (int i = 0; i < Length; i++) _longs[i] = rng.NextInt64();

        _ulongs = new ulong[Length];
        for (int i = 0; i < Length; i++) _ulongs[i] = (ulong)rng.NextInt64();
    }

    [Benchmark] public byte   MaxByte()   => _bytes.Max();
    [Benchmark] public sbyte  MaxSByte()  => _sbytes.Max();
    [Benchmark] public short  MaxShort()  => _shorts.Max();
    [Benchmark] public ushort MaxUShort() => _ushorts.Max();
    [Benchmark] public int    MaxInt()    => _ints.Max();
    [Benchmark] public uint   MaxUInt()   => _uints.Max();
    [Benchmark] public long   MaxLong()   => _longs.Max();
    [Benchmark] public ulong  MaxULong()  => _ulongs.Max();
}

Copilot

Pull request overview

This PR changes the final reduction step in System.Linq’s vectorized integer Min/Max implementation to reduce a Vector128<T> accumulator down to a single scalar using shuffle-based horizontal reduction instead of a scalar per-lane loop.

Changes:

Replaced the per-element for loop reduction of best128 with a helper (HorizontalMinMax) that performs log2(lane-count) reductions using Vector128.Shuffle.
Added HorizontalMinMax<T, TMinMax> to centralize the reduction logic for different element sizes (byte/short/int/long lane counts).
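The shape described above, one helper that is generic over both the element type and the Min/Max operation, could look roughly like the following. This is a hedged sketch, not the PR's implementation: IMinMaxOp, MaxOp, MinOp, and HorizontalReduceSketch.Reduce are hypothetical names, and reinterpreting the vector as wider integer lanes for each swap is an assumption made here to keep the sketch short. It is intended for the integer element types the benchmark covers.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical stand-in for a TMinMax-style type parameter: selects Min or Max.
interface IMinMaxOp<T> where T : struct
{
    static abstract Vector128<T> Combine(Vector128<T> left, Vector128<T> right);
}

struct MaxOp<T> : IMinMaxOp<T> where T : struct
{
    public static Vector128<T> Combine(Vector128<T> left, Vector128<T> right) => Vector128.Max(left, right);
}

struct MinOp<T> : IMinMaxOp<T> where T : struct
{
    public static Vector128<T> Combine(Vector128<T> left, Vector128<T> right) => Vector128.Min(left, right);
}

static class HorizontalReduceSketch
{
    // Each step swaps blocks of a given width and combines, halving the distance
    // between partner lanes: log2(Vector128<T>.Count) steps in total.
    public static T Reduce<T, TOp>(Vector128<T> v)
        where T : struct
        where TOp : IMinMaxOp<T>
    {
        // Swap the two 64-bit halves (needed for every element size).
        v = TOp.Combine(v, Vector128.Shuffle(v.AsInt64(), Vector128.Create(1L, 0L)).As<long, T>());

        if (Vector128<T>.Count >= 4)
        {
            // Swap adjacent 32-bit blocks (elements of 4 bytes or smaller).
            v = TOp.Combine(v, Vector128.Shuffle(v.AsInt32(), Vector128.Create(1, 0, 3, 2)).As<int, T>());
        }

        if (Vector128<T>.Count >= 8)
        {
            // Swap adjacent 16-bit blocks (elements of 2 bytes or smaller).
            v = TOp.Combine(v, Vector128.Shuffle(v.AsInt16(),
                Vector128.Create((short)1, 0, 3, 2, 5, 4, 7, 6)).As<short, T>());
        }

        if (Vector128<T>.Count == 16)
        {
            // Swap adjacent bytes (1-byte elements only).
            v = TOp.Combine(v, Vector128.Shuffle(v.AsByte(),
                Vector128.Create((byte)1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10, 13, 12, 15, 14)).As<byte, T>());
        }

        // Every lane now holds the reduced value; return lane 0.
        return v.ToScalar();
    }
}
```

As a usage example, HorizontalReduceSketch.Reduce<short, MinOp<short>>(Vector128.Create((short)4, -7, 12, 0, 3, 9, -2, 8)) returns -7 after three steps, and the same call shape with long and MaxOp<long> takes a single step. Because Vector128<T>.Count is a constant for each instantiation, the branches that do not apply to a given T should be removable by the JIT, though that is an expectation rather than something measured here.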

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings May 9, 2026 16:46

Copilot started reviewing on behalf of EgorBo May 9, 2026 16:46

github-actions Bot added the area-System.Linq label May 9, 2026

dotnet-policy-service Bot assigned EgorBo May 9, 2026

EgorBot mentioned this pull request May 9, 2026

Benchmarks for dotnet/runtime#127995 (for @EgorBo) EgorBot/Benchmarks#198

Open

MihuBot mentioned this pull request May 9, 2026

[JitDiff X64] [EgorBo] Optimize Enumerable Min/Max final reduction with shuffles MihuBot/runtime-utils#1892

Open

Copilot AI reviewed May 9, 2026

Comment thread src/libraries/System.Linq/src/System/Linq/MaxMin.cs

Comment thread src/libraries/System.Linq/src/System/Linq/MaxMin.cs Outdated

cleanup

ec3708d

EgorBo mentioned this pull request May 9, 2026

stackalloc expression without an initializer inside SkipLocalsInit may only be used in an unsafe context #127996

Closed

Merge branch 'main' into ai/linq-minmax-shuffle-reduction

c573615

Copilot AI review requested due to automatic review settings May 9, 2026 20:45

Copilot started reviewing on behalf of EgorBo May 9, 2026 20:45

Copilot AI reviewed May 9, 2026

MihaZupan approved these changes May 9, 2026

Copilot AI mentioned this pull request May 9, 2026

Fix CS9361 stackalloc unsafe context in X25519DiffieHellmanCng #127999

Merged

Merge branch 'main' into ai/linq-minmax-shuffle-reduction

7f57c02

EgorBo enabled auto-merge (squash) May 10, 2026 12:57

EgorBo wants to merge 4 commits into dotnet:main from EgorBo:ai/linq-minmax-shuffle-reduction

3 participants
