Clang vs GCC performance numbers #5955

mstembera · 2025-03-29T22:06:16Z

mstembera
Mar 29, 2025

This is in reference to #5951 (comment)

Comparing Clang 20.1.1 to GCC 14.2.0 on an i9-7980XE under Windows, tested using FishBench. Results are for 200 tests per version.

AVX512 non-profile:

            Base      Test      Diff      
    Mean    609321    651516    -42195    
    StDev   38949     44258     40704     

p-value: 0.85
speedup: 0.069

AVX512 profile:

            Base      Test      Diff      
    Mean    658433    667066    -8633     
    StDev   26904     28712     42494     

p-value: 0.58
speedup: 0.013

BMI2 non-profile:

            Base      Test      Diff      
    Mean    596832    639025    -42193    
    StDev   36158     38192     46404     

p-value: 0.818
speedup: 0.071

BMI2 profile:

            Base      Test      Diff      
    Mean    635954    646450    -10496    
    StDev   29286     30562     41321     

p-value: 0.6
speedup: 0.017

So Clang shows a speedup of about 7% for non-profile builds but only about 1.5% for profile builds. Hopefully we can collect a few more submissions from other machines. It seems GCC benefits from profile builds much more than Clang does.
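
The "speedup" lines in the tables above are numerically consistent with the relative difference of the two means, Test / Base - 1. A minimal sketch of that arithmetic follows. It assumes Base is the GCC build and Test is the Clang build (the post does not state this explicitly) and that this is how the figure is derived. The function name `speedup` is made up for illustration.

```shell
#!/bin/bash
# Relative speedup of Test over Base as a fraction: test / base - 1.
# Assumption: this matches the definition behind the "speedup" lines;
# it reproduces the posted values but is not taken from FishBench itself.
speedup() {
    LC_ALL=C awk -v b="$1" -v t="$2" 'BEGIN { printf "%.3f\n", t / b - 1 }'
}
```

For example, `speedup 609321 651516` reproduces the 0.069 reported for the AVX512 non-profile run, and `speedup 658433 667066` the 0.013 reported for the AVX512 profile run.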

TheBlackPlague · 2025-03-29T22:44:03Z

TheBlackPlague
Mar 29, 2025

Is a 1.5% performance improvement statistically significant (i.e., can we trust that finding to be accurate)? I don't know how FishBench works.

1 reply

mstembera Mar 29, 2025
Author

FishBench (https://github.com/zardav/FishBench) is good, but my machine in particular is finicky when it comes to small speedups. I am hoping for more submissions.
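
One rough way to look at the significance question using only the posted Mean and StDev of the Diff column is sketched below. Two things are assumptions here, not documented FishBench behaviour: first, that the runs are independent and roughly normal, so the standard error of the mean difference is StDev / sqrt(n); second, that the reported "p-value" corresponds to Phi(|mean| / StDev), which is only an observation that this formula matches the posted numbers. The function names are hypothetical, and the normal CDF uses the classic Abramowitz and Stegun polynomial approximation because awk has no built-in error function.

```shell
#!/bin/bash
# z-score of the mean difference: |mean| / (stdev / sqrt(n)).
# Assumes independent, roughly normal runs (an assumption, not a given).
z_of_mean() {
    LC_ALL=C awk -v m="$1" -v s="$2" -v n="$3" 'BEGIN {
        if (m < 0) m = -m
        printf "%.2f\n", m / (s / sqrt(n))
    }'
}

# Phi(|mean| / stdev): under a normal model, the chance that a single
# paired run favours the faster build. Polynomial approximation of the
# standard normal CDF (Abramowitz and Stegun 26.2.17).
single_run_prob() {
    LC_ALL=C awk -v m="$1" -v s="$2" 'BEGIN {
        x = m / s
        if (x < 0) x = -x
        t = 1 / (1 + 0.2316419 * x)
        p = 1.330274429
        p = p * t - 1.821255978
        p = p * t + 1.781477937
        p = p * t - 0.356563782
        p = p * t + 0.319381530
        p = p * t
        pi = atan2(0, -1)
        pdf = exp(-x * x / 2) / sqrt(2 * pi)
        printf "%.2f\n", 1 - pdf * p
    }'
}
```

With the AVX512 profile numbers, `single_run_prob -8633 42494` prints 0.58, matching the posted p-value, while `z_of_mean -8633 42494 200` prints 2.87. Under the stated assumptions the mean difference would sit almost three standard errors from zero, but on a machine that is noisy for small speedups the independence assumption may well not hold.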

vondele · 2025-03-30T06:23:09Z

vondele
Mar 30, 2025
Maintainer

Thanks, I propose we focus on profile-build, since that's how we create the release binaries and test on fishtest.

vondele · 2025-03-30T11:20:33Z

vondele
Mar 30, 2025
Maintainer

I ran the new speedtest on my system with all the compilers I have available:

Best results speedtests (nps): 
               g++-9 :   13641860
              g++-10 :   13722195
              g++-11 :   13729394
              g++-12 :   13700318
              g++-13 :   13690746
          clang++-11 :   13569023
          clang++-12 :   13586853
          clang++-13 :   13652799
          clang++-14 :   13575178
          clang++-15 :   13587397
          clang++-16 :   13534090
          clang++-17 :   13472129
          clang++-18 :   13333670
          clang++-19 :   13668344
          clang++-20 :   13421081

This would suggest that the differences are generally small, and GCC is doing fine. The corresponding best run is:

Stockfish 17.1 by the Stockfish developers (see AUTHORS file)
info string Using 32 threads
Warmup position 3/3
Position 258/258
===========================
Version                    : Stockfish 17.1
Compiled by                : g++ (GNUC) 11.4.0 on Linux
Compilation architecture   : x86-64-avx2
Compilation settings       : 64bit AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 11.4.0
Large pages                : yes
User invocation            : speedtest 
Filled invocation          : speedtest 32 4096 150
Available processors       : 0-31
Thread count               : 32
Thread binding             : none
TT size [MiB]              : 4096
Hash max, avg [per mille]  : 
    single search          : 32, 10
    single game            : 420, 227
Total nodes searched       : 2110372721
Total search time [s]      : 153.712
Nodes/second               : 13729394

Script used for testing:

#!/bin/bash

# Compiler lists, defined once and reused by every loop below.
gcc_comps="g++-9 g++-10 g++-11 g++-12 g++-13"
clang_comps="clang++-11 clang++-12 clang++-13 clang++-14 clang++-15 clang++-16 clang++-17 clang++-18 clang++-19 clang++-20"
all_comps="$gcc_comps $clang_comps"

# Profile-build with each compiler, keeping the build log and the binary.
echo "Compiling gcc"
for comp in $gcc_comps
do
   make -j profile-build CXX=$comp COMP=gcc >& out.compile.$comp
   mv stockfish stockfish.$comp
done

echo "Compiling clang"
for comp in $clang_comps
do
   make -j profile-build CXX=$comp COMP=clang >& out.compile.$comp
   mv stockfish stockfish.$comp
done

# Print the node count recorded in each build log so it can be compared across compilers.
echo "Verify node counts: "
for comp in $all_comps
do
   nodes=$(grep "Nodes searched" out.compile.$comp | awk '{print $NF}')
   printf "%20s : %10s\n" $comp $nodes
done

# Three speedtest runs per binary.
echo "Running speedtests: "
for comp in $all_comps
do
   for iter in $(seq 1 3)
   do
      ./stockfish.$comp speedtest >& out.speedtest.$comp.$iter
   done
done

# Report the highest nps of the three runs for each compiler.
echo "Best results speedtests (nps): "
for comp in $all_comps
do
   bestnps=$(grep "Nodes/second" out.speedtest.$comp.* | sort -n -k3 | tail -n 1 | awk '{print $NF}')
   printf "%20s : %10s\n" $comp $bestnps
done

(With Clang I need a small Makefile feature to get this working; I might PR it separately.)
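
To put a number on "the differences are generally small", the best-results table can be expressed as a percentage deficit relative to the fastest entry. The following is a sketch only: the function name `rel_to_best` is hypothetical, and it assumes the input has the same `name : nps` layout that the script's final loop prints.

```shell
#!/bin/bash
# Read "name : nps" lines on stdin and print, for each entry,
# how far (in percent) it is below the best nps in the input.
# Lines whose second field is not a plain integer (e.g. the header) are skipped.
rel_to_best() {
    LC_ALL=C awk -F':' '
        NF == 2 && $2 ~ /^[ \t]*[0-9]+[ \t]*$/ {
            gsub(/[ \t]/, "", $1)
            name[++n] = $1
            nps[n] = $2 + 0
            if (nps[n] > best) best = nps[n]
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%20s : %.2f%%\n", name[i], 100 * (best - nps[i]) / best
        }'
}
```

Piping the table above into `rel_to_best` puts g++-11 at 0.00% and the slowest entry, clang++-18, at 2.88% below it.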

Torom · 2025-03-30T15:44:52Z

Torom
Mar 30, 2025

Version                    : Stockfish 17.1
Compiled by                : g++ (GNUC) 14.2.1 on Linux
Compilation architecture   : x86-64-bmi2
Compilation settings       : 64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 14.2.1 20250207
Large pages                : yes
User invocation            : speedtest 
Filled invocation          : speedtest 8 1024 150
Available processors       : 0-7
Thread count               : 8
Thread binding             : none
TT size [MiB]              : 1024
Hash max, avg [per mille]  : 
    single search          : 30, 14
    single game            : 421, 284
Total nodes searched       : 591053464
Total search time [s]      : 153.625
Nodes/second               : 3847378
...
Nodes/second               : 3856559
...
Nodes/second               : 3838619

Version                    : Stockfish 17.1
Compiled by                : clang++ 19.1.7 on Linux
Compilation architecture   : x86-64-bmi2
Compilation settings       : 64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : Clang 19.1.7
Large pages                : yes
User invocation            : speedtest 
Filled invocation          : speedtest 8 1024 150
Available processors       : 0-7
Thread count               : 8
Thread binding             : none
TT size [MiB]              : 1024
Hash max, avg [per mille]  : 
    single search          : 27, 14
    single game            : 435, 284
Total nodes searched       : 588332116
Total search time [s]      : 153.602
Nodes/second               : 3830237
...
Nodes/second               : 3805525
...
Nodes/second               : 3807555
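
Since three runs are listed per compiler here, an alternative to taking the best run is to average them. A minimal sketch, assuming the runs are available as speedtest output containing `Nodes/second` lines in the format shown above; `mean_nps` is a name chosen for illustration:

```shell
#!/bin/bash
# Average the value of every "Nodes/second" line read from stdin.
# Prints nothing if no such line is present.
mean_nps() {
    LC_ALL=C awk '/Nodes\/second/ { sum += $NF; n++ }
                  END { if (n) printf "%.0f\n", sum / n }'
}
```

Fed the three g++ lines above, this gives about 3847519 nps, and about 3814439 nps for the three clang++ lines, a gap of under 1%.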

whelanh · 2025-03-30T16:15:19Z

whelanh
Mar 30, 2025

Clang 20.1.1 vs. g++ 15.0.1: only a 0.8% speedup in the run I did (with an arbitrary choice of thread count). I used profile-build for both:

Stockfish 17.1 by the Stockfish developers (see AUTHORS file)
info string Using 32 threads
Warmup position 3/3
Position 258/258
===========================
Version                    : Stockfish 17.1
Compiled by                : clang++ 20.1.1 on Linux
Compilation architecture   : x86-64-vnni256
Compilation settings       : 64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : Clang 20.1.1 (Fedora 20.1.1-1.fc43)
Large pages                : yes
User invocation            : speedtest 32 1024 140
Filled invocation          : speedtest 32 1024 140
Available processors       : 0-63
Thread count               : 32
Thread binding             : none
TT size [MiB]              : 1024
Hash max, avg [per mille]  : 
    single search          : 251, 103
    single game            : 1000, 943
Total nodes searched       : 7086405124
Total search time [s]      : 143.26
Nodes/second               : 49465343
=======================================================================================
Stockfish 17.1 by the Stockfish developers (see AUTHORS file)
info string Using 32 threads
Warmup position 3/3
Position 258/258
===========================
Version                    : Stockfish 17.1
Compiled by                : g++ (GNUC) 15.0.1 on Linux
Compilation architecture   : x86-64-vnni256
Compilation settings       : 64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 15.0.1 20250313 (Red Hat 15.0.1-0)
Large pages                : yes
User invocation            : speedtest 32 1024 140
Filled invocation          : speedtest 32 1024 140
Available processors       : 0-63
Thread count               : 32
Thread binding             : none
TT size [MiB]              : 1024
Hash max, avg [per mille]  : 
    single search          : 256, 102
    single game            : 1000, 941
Total nodes searched       : 7027642870
Total search time [s]      : 143.26
Nodes/second               : 49055164

Clang speedup 0.836%

TheBlackPlague · 2025-03-30T23:06:38Z

TheBlackPlague
Mar 30, 2025

By the way, Clang 20 unfortunately has a slight regression when compiling with VNNI for certain CPU architectures: it ends up using a SIMD instruction with dependencies. Normally this is fine, but in a looping context (say, for NNUE) it ends up slower than using the dependency-free instructions, even though those have lower throughput.

Torom · 2025-04-02T16:56:20Z

Torom
Apr 2, 2025

Version                    : Stockfish 17.1
Compiled by                : g++ (GNUC) 14.2.0 on Linux
Compilation architecture   : armv8-dotprod
Compilation settings       : 64bit POPCNT NEON_DOTPROD
Compiler __VERSION__ macro : 14.2.0
Large pages                : yes
User invocation            : speedtest 
Filled invocation          : speedtest 4 512 150
Available processors       : 0-3
Thread count               : 4
Thread binding             : none
TT size [MiB]              : 512
Hash max, avg [per mille]  : 
    single search          : 9, 2
    single game            : 117, 65
Total nodes searched       : 59947694
Total search time [s]      : 154.492
Nodes/second               : 388031
...
Nodes/second               : 389310
...
Nodes/second               : 388441

Version                    : Stockfish 17.1
Compiled by                : clang++ 19.1.1 on Linux
Compilation architecture   : armv8-dotprod
Compilation settings       : 64bit POPCNT NEON_DOTPROD
Compiler __VERSION__ macro : Ubuntu Clang 19.1.1 (1ubuntu1)
Large pages                : yes
User invocation            : speedtest 
Filled invocation          : speedtest 4 512 150
Available processors       : 0-3
Thread count               : 4
Thread binding             : none
TT size [MiB]              : 512
Hash max, avg [per mille]  : 
    single search          : 10, 2
    single game            : 109, 66
Total nodes searched       : 59708920
Total search time [s]      : 154.507
Nodes/second               : 386447
...
Nodes/second               : 386111
...
Nodes/second               : 382989

mstembera · 2025-10-20T04:45:01Z

mstembera
Oct 20, 2025
Author

Latest update for GCC 15.2.0 vs Clang 21.1.1 AVX512 profile.
Results for 2000 tests for each version:

            Base      Test      Diff      
    Mean    669066    693485    -24419    
    StDev   32113     34946     47927     

p-value: 0.695
speedup: 0.036

So about a 3.6% speedup for Clang.
