Replies: 3 comments
Thanks for the kind words. On my MSVC profiles partial_insertion_sort() is usually just below 2%. I have tried many things in the past like unrolling, partitioning in different ways, even getting rid of sort completely and using a heap instead. All without success. In my test history https://tests.stockfishchess.org/tests/user/Fisherman the word "sort" will always appear somewhere in the description of these tests, but there is no good way to search Fishtest AFAIK. Going through by hand is tedious and likely not worth it just to find out what didn't work. I'm not an assembly expert, so I haven't tried anything register-usage related. Your fusion tests are definitely promising. Whether we are willing to impede experimenting with new architectures is above my pay grade, but if you get something you are happy enough with to open a PR, we will find out what the maintainers think. I will hold back looking at the i8 weights until threat inputs make it into master. I'm really glad you joined the team! You have fresh eyes and new ideas. I usually look at your tests because they are interesting, and if I have something (hopefully helpful) to say I'll be happy to.
Still somewhat early in the SPRT, but it seems like the shared memory patch negates some of the benefit of combining layers. Which is actually kind of nice :) Although it doesn't comport with the speedups reported by Discord folks.
One thing I noticed while profiling (especially) LTC tournaments is that
Normally I chat in the Discord, but the wonderful @mstembera isn't there, so I'm writing down some things here for his visibility/input.
### partial_insertion_sort

`partial_insertion_sort` takes about 2.5% of the runtime on my computer, so it seems like a good target to speed up, but I've found it very difficult to improve:

- `master` doesn't seem to be able to load `ExtMove` into a single register (note `sizeof(ExtMove) == 8`) and instead splits it into its constituents. I get about a 1% speedup on this branch locally (p > 99.99%), and some folks on the Discord reported even larger speedups, but it seems to cause serious regressions on clang.
- Clang does a better job than GCC here, and I'm hoping that fixing this will help close the gap between the compilers.
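To make the single-register idea concrete, here is a minimal sketch. The `ExtMove` layout is a simplified stand-in (a 16-bit move code plus an `int` score, padded to 8 bytes), and shuffling the element through a `std::uint64_t` via `memcpy` is just one way to nudge the compiler into 8-byte moves; this is not master's code.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for Stockfish's ExtMove: 16-bit move code plus an
// int score, padded so the whole thing fits in one 64-bit register.
struct ExtMove {
    std::uint16_t move;
    int           value;
};
static_assert(sizeof(ExtMove) == 8, "ExtMove should be one 64-bit payload");

// Same algorithm as master's partial_insertion_sort (sort moves with
// value >= limit in descending order, leave the rest unsorted), but the
// displaced element is carried as a single 64-bit payload so the compiler
// is nudged into one 8-byte load/store instead of per-member moves.
// (std::bit_cast would do the same job in C++20.)
void partial_insertion_sort(ExtMove* begin, ExtMove* end, int limit) {
    for (ExtMove *sortedEnd = begin, *p = begin + 1; p < end; ++p)
        if (p->value >= limit) {
            std::uint64_t tmp;
            std::memcpy(&tmp, p, sizeof tmp);  // one 8-byte load
            int tmpValue = p->value;
            *p           = *++sortedEnd;
            ExtMove* q   = sortedEnd;
            for (; q != begin && (q - 1)->value < tmpValue; --q)
                *q = *(q - 1);                 // 8-byte element moves
            std::memcpy(q, &tmp, sizeof tmp);  // one 8-byte store
        }
}
```

Whether the compiler actually keeps the payload in a single register still depends on the target and optimization level, which may explain the clang/GCC divergence.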
Ideas I'm planning to try:
- `score<QUIETS>` in a branchless manner (hypothesis for why this might help vs. the 1st idea above: the horrible branch prediction rate on the partitioning makes it run too slow)
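The branchless-partition part of that idea could look roughly like this. This is a generic Lomuto-style sketch over plain ints, not the actual movepicker code; `partition_branchless` and its signature are mine for illustration.

```cpp
#include <utility>

// Move all elements with value >= limit to the front, returning how many
// there are. The loop body has no data-dependent branch: the unconditional
// swap plus a 0/1 increment replaces the usual `if`, so a near-random
// good/bad pattern cannot thrash the branch predictor.
int partition_branchless(int* v, int n, int limit) {
    int k = 0;
    for (int i = 0; i < n; ++i) {
        int x = v[i];        // element under inspection
        v[i]  = v[k];        // unconditional swap of v[i] and v[k]
        v[k]  = x;
        k += (x >= limit);   // advance boundary only if x qualifies
    }
    return k;
}
```

The trade is a few extra unconditional moves per element against the mispredict penalty of the branchy version; whether that wins on move scores would have to be measured.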
### Fusing consecutive layers

Combining the final add/sub into the transformed features appears promising on Fishtest. On the Discord, vondele reported a 1.78% ST speedup on a Zen 2 processor (p = 1.0000), micpilar reported speedups of 2.6% (ST) and 1.2% (MT) on an Ice Lake processor, and ꪖꫀᡶꫝ (dunno their GitHub alias) reported 2.6% on Zen 3. Unfortunately, I get a 1.5% slowdown locally on Zen 5, and another Zen 5 user got a 0.4% slowdown.
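A toy scalar illustration of the fusion, with hypothetical shapes and function names (the real code is SIMD over int8/int16 vectors; `forward_unfused`/`forward_fused` are mine, not SF's):

```cpp
#include <algorithm>
#include <cstdint>

constexpr int N = 16;  // illustrative width, not the real layer size

// Unfused: materialize the transformed features in a scratch buffer, then
// make a second pass applying the final add/sub. Two loops, two trips
// through memory per element.
void forward_unfused(const std::int32_t* acc, const std::int8_t* addW,
                     const std::int8_t* subW, std::int32_t* out) {
    std::int32_t transformed[N];
    for (int i = 0; i < N; ++i)
        transformed[i] = std::clamp(acc[i], 0, 127);  // clipped ReLU
    for (int i = 0; i < N; ++i)
        out[i] = transformed[i] + addW[i] - subW[i];
}

// Fused: one pass doing clamp + add/sub together. More arithmetic per load
// (higher arithmetic intensity), no round trip through the scratch buffer,
// and the weight loads can hide behind the clamp arithmetic.
void forward_fused(const std::int32_t* acc, const std::int8_t* addW,
                   const std::int8_t* subW, std::int32_t* out) {
    for (int i = 0; i < N; ++i)
        out[i] = std::clamp(acc[i], 0, 127) + addW[i] - subW[i];
}
```

Both functions compute the same result; the point is only where the intermediate lives and how much work each loop iteration carries.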
(The motivation behind this approach is that combining the layers allows increasing the arithmetic intensity of the loop, and potentially hiding the latency of loads from `weights`.) This test indicates that it is the fusion of layers which helps, and not the straight-line memory accesses into `transformedFeatures`.

### Further investigation
- Merging `find_nnz` into the same loop might work.
- The `added.size() = removed.size() = 1` case – maybe try more cases, but this risks i-cache blowout
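The `find_nnz` merge could be sketched like this — a hypothetical scalar version (`transform_and_find_nnz` and its signature are mine; the real `find_nnz` is a separate vectorized pass over the already-written features):

```cpp
#include <cstddef>
#include <cstdint>

// Instead of writing the transformed features and then running a separate
// find_nnz pass to collect indices of nonzero outputs, record the indices
// while producing the outputs. Returns the number of nonzero entries.
std::size_t transform_and_find_nnz(const std::int32_t* acc, std::uint8_t* out,
                                   std::uint16_t* nnz, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // clipped ReLU into [0, 127]
        std::int32_t v = acc[i] < 0 ? 0 : (acc[i] > 127 ? 127 : acc[i]);
        out[i] = static_cast<std::uint8_t>(v);
        // branchless append: always write the candidate index, only advance
        // the cursor when the output is nonzero
        nnz[count] = static_cast<std::uint16_t>(i);
        count += (v != 0);
    }
    return count;
}
```

This saves a second pass over `out`, at the cost of mixing an index-gather into the transform loop — whether that pays off in the SIMD version is exactly the open question.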
### i8 weights

Yoshie from PlentyChess found great promise in an `i8` quantization of threat inputs (which aren't yet in SF, but are being explored in earnest on fishtest/Discord).

### More to follow