Recently @kunalspathak asked me if I could produce a report similar to #66848 for x64 vs arm64 comparison.
I took .NET 7 Preview2 results provided by @AndyAyersMS, @kunalspathak and myself for #66848, hacked the tool a little bit (it was not designed to compare different architecture results) and compared x64 vs arm64 using the following configs:
- my 4-year-old MacBook Pro (x64): macOS Monterey 12.2.1, Intel Core i7-5557U CPU 3.10GHz (Broadwell), 1 CPU, 4 logical and 2 physical cores vs @AndyAyersMS's M1 Max (arm64): macOS Monterey 12.2.1, Apple M1 Max 2.40GHz, 1 CPU, 10 logical and 10 physical cores
- @kunalspathak Windows 10 (10.0.20348.587) Intel Xeon Platinum 8272CL CPU 2.60GHz, 2 CPU, 104 logical and 52 physical cores vs @kunalspathak Windows 11 (10.0.25058.1000) ARM64 machine with lots of cores
Of course, it was not an apples-to-apples comparison, just the best we could do right now.
Full public results (without absolute values, as I don't have the permission to share them) can be found here.
Internal MS results (with absolute values) can be found here. If you don't have the access please ping me on Teams.
As usual, I've focused on the benchmarks that take longer to execute on arm64 than on x64. If you are interested in benchmarks that take less time to execute, read the report linked above in reverse order.
Benchmarks:
- `System.Numerics.Tests.Perf_BitOperations.PopCount_ulong` is 5-8 times slower (most likely due to lack of vectorization). `PopCount_uint` is slower only on Windows.
@tannergooding @GrabYourPitchforks
- A lot of `Base64Encode` benchmarks like `System.Buffers.Text.Tests.Base64Tests.Base64Encode(NumberOfBytes: 1000)` are 6 to 16 times slower. See "Optimize System.Buffers for arm64 using cross-platform intrinsics" #35033.
- Some `RentReturnArrayPoolTests` benchmarks are up to a few times slower, but these are multi-threaded and very often multimodal benchmarks. See "Faster thread local statics" #63619.
- `System.Threading.Tests.Perf_Timer.AsynchronousContention` is 2-3 times slower.
- A lot of `SocketSendReceivePerfTest` benchmarks like `System.Net.WebSockets.Tests.SocketSendReceivePerfTest.ReceiveSend` are 2 times slower.
@dotnet/area-system-drawing
- `System.Drawing.Tests.Perf_Image_Load.Image_FromStream_NoValidation` is a few times slower on Windows. Only the `NoValidation` benchmarks seem to run slower.
- A few `RegularExpressions` benchmarks like `System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "(?i)Sher[a-z]+|Hol[a-z]+", Options: Compiled)` are 40-50% slower. This pattern uses `IndexOfAny("HOho")` to find the next possible match location. It has a 256-bit vectorization path on x64 but only a 128-bit one on ARM64.
- The `PerfLabTests.LowLevelPerf.GenericClassGenericStaticField` benchmark can be from 16% to 3 times slower. The same goes for `PerfLabTests.LowLevelPerf.GenericClassGenericStaticMethod`.
@dotnet/jit-contrib
- `System.Security.Cryptography.Tests.Perf_Hashing.Sha1` is 17-55% slower. (Potentially differences in the GDI+ code)
- `System.IO.Tests.Perf_StreamWriter.WriteString(writeLength: 100)` is 21-46% slower.
- The `System.Text.Json.Serialization.Tests.WriteJson<BinaryData>.SerializeToStream` benchmark can be from 16% to 4 times slower. See "Optimize System.Buffers for arm64 using cross-platform intrinsics" #35033.
- `SIMD.ConsoleMandel` benchmarks are 40% slower. See "Double Vector128 for SpanHelpers.IndexOf(byte,byte,int) on ARM64" #66993.
- `Burgers.Test3` is 12-59% slower. See "Double Vector128 for SpanHelpers.IndexOf(byte,byte,int) on ARM64" #66993.
- A lot of `System.Collections.Contains` benchmarks are 2-3 times slower (most likely due to lack of vectorization). The same goes for `System.Memory.Span<Char>.IndexOfValue`, `System.Memory.Span<Char>.Fill`, `System.Memory.Span<Int32>.StartsWith`, `System.Memory.Span<Byte>.IndexOfAnyTwoValues` and `System.Memory.ReadOnlySpan.IndexOfString(Ordinal)`. See "Double Vector128 for SpanHelpers.IndexOf(byte,byte,int) on ARM64" #66993.
- A lot of `SequenceCompareTo` benchmarks are 30% to 4 times slower. See "Double Vector128 for SpanHelpers.IndexOf(byte,byte,int) on ARM64" #66993.
- `System.MathBenchmarks.Double.Exp` and `System.MathBenchmarks.Single.Exp` are 35% slower. See "Optimize jump stubs on arm64" #62302.
@dotnet/area-system-globalization
- The `System.Globalization.Tests.Perf_DateTimeCultureInfo.Parse(culturestring: ja)` benchmark can be from 20% to 7 times slower (it's most likely an ICU problem). See "Initializing the "ja" culture takes 200ms when using ICU" #31273.
- Various `Perf_Interlocked` benchmarks are slower, but this is expected due to memory model differences.
- Various `Perf_Process.Start` benchmarks are slower, but only on macOS, so it's most likely a macOS issue.
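
The figures above mix two notations: "N% slower" and "N times slower". As a rough sketch of how the two relate (this is not the comparison tool's actual code, and the timings used here are made-up numbers, since the absolute values are not public), both can be derived from the ratio of the arm64 time to the x64 time:

```python
def slowdown(x64_ns: float, arm64_ns: float) -> tuple:
    """Return (ratio, percent_slower) of arm64 relative to x64.

    ratio = arm64 time / x64 time, so 3.0 means "3 times slower".
    percent_slower = (ratio - 1) * 100, so 40.0 means "40% slower".
    """
    if x64_ns <= 0 or arm64_ns <= 0:
        raise ValueError("timings must be positive")
    ratio = arm64_ns / x64_ns
    return ratio, (ratio - 1.0) * 100.0


def describe(x64_ns: float, arm64_ns: float) -> str:
    """Format a slowdown in the style used in this report (hypothetical helper)."""
    ratio, percent = slowdown(x64_ns, arm64_ns)
    if ratio >= 2.0:
        return f"{ratio:.1f} times slower"
    if ratio > 1.0:
        return f"{percent:.0f}% slower"
    return "not slower"


# Hypothetical timings in nanoseconds, for illustration only.
print(describe(100.0, 140.0))
print(describe(100.0, 300.0))
```

Read this way, "3 times slower" corresponds to a ratio of 3.0 (200% more time), while "40% slower" corresponds to a ratio of 1.4.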