Replies: 3 comments
Thanks for the kind words. On my MSVC profiles partial_insertion_sort() is usually just below 2%. I have tried many things in the past like unrolling, partitioning in different ways, even getting rid of sort completely and using a heap instead. All without success. In my test history https://tests.stockfishchess.org/tests/user/Fisherman the word "sort" will always appear somewhere in the description of these tests, but there is no good way to search Fishtest AFAIK. Going through by hand is tedious and likely not worth it just to find out what didn't work. I'm not an assembly expert, so I haven't tried anything register-usage related. Your fusion tests are definitely promising. Whether we are willing to impede experimenting with new architectures is above my pay grade, but if you get something you are happy enough with to open a PR, we will find out what the maintainers think. I will hold back looking at the i8 weights until threat inputs make it into master. I'm really glad you joined the team! You have fresh eyes and new ideas. I usually look at your tests because they are interesting, and if I have something (hopefully helpful) to say I'll be happy to.
Still somewhat early in the SPRT, but it seems like the shared memory patch negates some of the benefit of combining layers. Which is actually kind of nice :) Although it doesn't comport with the speedups reported by Discord folks.
One thing I noticed while profiling (especially) LTC tournaments is that
Normally I chat in the Discord, but the wonderful @mstembera isn't there, so I'm writing down some things here for his visibility/input.
### partial_insertion_sort

`partial_insertion_sort` takes about 2.5% of the runtime on my computer, so it seems like a good target to speed up, but I've found it very difficult to improve:

- `master` doesn't seem to be able to load `ExtMove` into a single register (note `sizeof(ExtMove) == 8`) and instead splits it into its constituents. I get about a 1% speedup on this branch locally (p > 99.99%), and some folks on the Discord reported even larger speedups, but it seems to cause serious regressions on clang.
- Clang does a better job than GCC here, and I'm hoping that fixing this will help close the gap between the compilers.
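To make the single-register idea concrete, here is a minimal sketch. The `ExtMove` layout is a simplified stand-in (a 16-bit move code plus an `int` score, padded to 8 bytes), and shuffling the element through a `std::uint64_t` via `memcpy` is just one way to nudge the compiler into 8-byte moves; this is not master's code.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for Stockfish's ExtMove: 16-bit move code plus an
// int score, padded so the whole thing fits in one 64-bit register.
struct ExtMove {
    std::uint16_t move;
    int           value;
};
static_assert(sizeof(ExtMove) == 8, "ExtMove should be one 64-bit payload");

// Same algorithm as master's partial_insertion_sort (sort moves with
// value >= limit in descending order, leave the rest unsorted), but the
// displaced element is carried as a single 64-bit payload so the compiler
// is nudged into one 8-byte load/store instead of per-member moves.
// (std::bit_cast would do the same job in C++20.)
void partial_insertion_sort(ExtMove* begin, ExtMove* end, int limit) {
    for (ExtMove *sortedEnd = begin, *p = begin + 1; p < end; ++p)
        if (p->value >= limit) {
            std::uint64_t tmp;
            std::memcpy(&tmp, p, sizeof tmp);  // one 8-byte load
            int tmpValue = p->value;
            *p           = *++sortedEnd;
            ExtMove* q   = sortedEnd;
            for (; q != begin && (q - 1)->value < tmpValue; --q)
                *q = *(q - 1);                 // 8-byte element moves
            std::memcpy(q, &tmp, sizeof tmp);  // one 8-byte store
        }
}
```

Whether the compiler actually keeps the payload in a single register still depends on the target and optimization level, which may explain the clang/GCC divergence.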
Ideas I'm planning to try:
- `score<QUIETS>` in a branchless manner (hypothesis for why this might help vs. the 1st idea above: the horrible branch prediction rate on the partitioning makes it run too slow)
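The branchless-partition part of that idea could look roughly like this. This is a generic Lomuto-style sketch over plain ints, not the actual movepicker code; `partition_branchless` and its signature are mine for illustration.

```cpp
#include <utility>

// Move all elements with value >= limit to the front, returning how many
// there are. The loop body has no data-dependent branch: the unconditional
// swap plus a 0/1 increment replaces the usual `if`, so a near-random
// good/bad pattern cannot thrash the branch predictor.
int partition_branchless(int* v, int n, int limit) {
    int k = 0;
    for (int i = 0; i < n; ++i) {
        int x = v[i];        // element under inspection
        v[i]  = v[k];        // unconditional swap of v[i] and v[k]
        v[k]  = x;
        k += (x >= limit);   // advance boundary only if x qualifies
    }
    return k;
}
```

The trade is a few extra unconditional moves per element against the mispredict penalty of the branchy version; whether that wins on move scores would have to be measured.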
### Fusing consecutive layers

Combining the final add/sub into the transformed features appears promising on Fishtest. On the Discord, vondele reported a 1.78% ST speedup on a Zen 2 processor (p = 1.0000), micpilar reported speedups of 2.6% (ST) and 1.2% (MT) on an Ice Lake processor, and ꪖꫀᡶꫝ (dunno their GitHub alias) reported 2.6% on Zen 3. Unfortunately, I get a 1.5% slowdown locally on Zen 5, and another Zen 5 user got a 0.4% slowdown.
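A toy scalar illustration of the fusion, with hypothetical shapes and function names (the real code is SIMD over int8/int16 vectors; `forward_unfused`/`forward_fused` are mine, not SF's):

```cpp
#include <algorithm>
#include <cstdint>

constexpr int N = 16;  // illustrative width, not the real layer size

// Unfused: materialize the transformed features in a scratch buffer, then
// make a second pass applying the final add/sub. Two loops, two trips
// through memory per element.
void forward_unfused(const std::int32_t* acc, const std::int8_t* addW,
                     const std::int8_t* subW, std::int32_t* out) {
    std::int32_t transformed[N];
    for (int i = 0; i < N; ++i)
        transformed[i] = std::clamp(acc[i], 0, 127);  // clipped ReLU
    for (int i = 0; i < N; ++i)
        out[i] = transformed[i] + addW[i] - subW[i];
}

// Fused: one pass doing clamp + add/sub together. More arithmetic per load
// (higher arithmetic intensity), no round trip through the scratch buffer,
// and the weight loads can hide behind the clamp arithmetic.
void forward_fused(const std::int32_t* acc, const std::int8_t* addW,
                   const std::int8_t* subW, std::int32_t* out) {
    for (int i = 0; i < N; ++i)
        out[i] = std::clamp(acc[i], 0, 127) + addW[i] - subW[i];
}
```

Both functions compute the same result; the point is only where the intermediate lives and how much work each loop iteration carries.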
(The motivation behind this approach is that combining the layers allows increasing the arithmetic intensity of the loop, and potentially hiding the latency of loads from `weights`.) This test indicates that it is the fusion of layers which helps, and not the straight-line memory accesses into `transformedFeatures`.

### Further investigation
- Merging `find_nnz` into the same loop might work.
- The `added.size() = removed.size() = 1` case – maybe try more cases, but this risks i-cache blowout
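The `find_nnz` merge could be sketched like this — a hypothetical scalar version (`transform_and_find_nnz` and its signature are mine; the real `find_nnz` is a separate vectorized pass over the already-written features):

```cpp
#include <cstddef>
#include <cstdint>

// Instead of writing the transformed features and then running a separate
// find_nnz pass to collect indices of nonzero outputs, record the indices
// while producing the outputs. Returns the number of nonzero entries.
std::size_t transform_and_find_nnz(const std::int32_t* acc, std::uint8_t* out,
                                   std::uint16_t* nnz, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // clipped ReLU into [0, 127]
        std::int32_t v = acc[i] < 0 ? 0 : (acc[i] > 127 ? 127 : acc[i]);
        out[i] = static_cast<std::uint8_t>(v);
        // branchless append: always write the candidate index, only advance
        // the cursor when the output is nonzero
        nnz[count] = static_cast<std::uint16_t>(i);
        count += (v != 0);
    }
    return count;
}
```

This saves a second pass over `out`, at the cost of mixing an index-gather into the transform loop — whether that pays off in the SIMD version is exactly the open question.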
### i8 weights

Yoshie from PlentyChess found great promise in an `i8` quantization of threat inputs (which aren't yet in SF, but are being explored in earnest on fishtest/Discord).

### More to follow