x64 vs ARM64 Microbenchmarks Performance Study Report #67339

Recently @kunalspathak asked me if I could produce a report similar to https://github.com/dotnet/runtime/issues/66848 for an x64 vs arm64 comparison.

I took .NET 7 Preview2 results provided by @AndyAyersMS, @kunalspathak and myself for https://github.com/dotnet/runtime/issues/66848, hacked the tool a little bit (it was not designed to compare different architecture results) and compared x64 vs arm64 using the following configs:

* my 4-year-old MacBook Pro x64: macOS Monterey 12.2.1, Intel Core i7-5557U CPU 3.10GHz (Broadwell), 1 CPU, 4 logical and 2 physical cores vs @AndyAyersMS's M1 Max arm64: macOS Monterey 12.2.1, Apple M1 Max 2.40GHz, 1 CPU, 10 logical and 10 physical cores
* @kunalspathak's Windows 10 (10.0.20348.587) Intel Xeon Platinum 8272CL CPU 2.60GHz, 2 CPU, 104 logical and 52 physical cores vs @kunalspathak's Windows 11 (10.0.25058.1000) ARM64 machine with lots of cores

Of course, it was not an apples-to-apples comparison, just the best we could do right now.

Full public results (without absolute values, as I don't have the permission to share them) can be found [here](https://gist.github.com/adamsitnik/3df04e23d5a88806204153593bc5f420).
Internal MS results (with absolute values) can be found [here](https://microsofteur-my.sharepoint.com/:t:/g/personal/adsitnik_microsoft_com/ESIzrKQkyZdHhnrdw_utqzsBVRhvNQpxXFRTI57V2D7TxA?e=mjbwcC). If you don't have access, please ping me on Teams.

As usual, I've focused on the benchmarks that take longer to execute on arm64 compared to x64. If you are interested in benchmarks that take less to execute, you need to read the report linked above in the reverse order.
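To make the selection criterion concrete, here is a minimal sketch of how benchmarks that run slower on arm64 could be picked out of two result sets. It is not the tool used for this report: the function name, the 10% threshold, the benchmark names and all timings are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Comparison:
    name: str
    ratio: float  # arm64 time / x64 time; a value above 1.0 means arm64 is slower


def slower_on_arm64(x64_ns, arm64_ns, threshold=1.10):
    """Return benchmarks whose arm64/x64 time ratio exceeds the threshold, worst first."""
    # Only benchmarks present in both result sets can be compared.
    common = x64_ns.keys() & arm64_ns.keys()
    results = [Comparison(n, arm64_ns[n] / x64_ns[n]) for n in common if x64_ns[n] > 0]
    slower = [c for c in results if c.ratio > threshold]
    return sorted(slower, key=lambda c: c.ratio, reverse=True)


# Hypothetical timings in nanoseconds, not real measurements.
x64 = {"BenchA": 100.0, "BenchB": 50.0, "BenchC": 200.0, "OnlyOnX64": 10.0}
arm64 = {"BenchA": 600.0, "BenchB": 500.0, "BenchC": 210.0}

report = slower_on_arm64(x64, arm64)
for c in report:
    print(f"{c.name}: {c.ratio:.2f}x")
```

Sorting the same list in ascending order gives the "reverse order" reading, with the benchmarks that are faster on arm64 at the top.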

Benchmarks:


@kunalspathak
* [x] `System.Numerics.Tests.Perf_BitOperations.PopCount_ulong` is 5-8 times slower (most likely due to lack of vectorization). `PopCount_uint` is slower only on Windows.

@tannergooding @GrabYourPitchforks
* [x] A lot of `Base64Encode` benchmarks like `System.Buffers.Text.Tests.Base64Tests.Base64Encode(NumberOfBytes: 1000)` are 6 to 16 times slower. #35033

@stephentoub @kouvel
* [ ] Some `RentReturnArrayPoolTests` benchmarks are up to a few times slower, but these are multi-threaded and very often multimodal benchmarks. #63619
* [ ] `System.Threading.Tests.Perf_Timer.AsynchronousContention` is 2-3 times slower.

@wfurt @MihaZupan
* [ ] A lot of `SocketSendReceivePerfTest` benchmarks like `System.Net.WebSockets.Tests.SocketSendReceivePerfTest.ReceiveSend` are 2 times slower.

@dotnet/area-system-drawing
* [ ] `System.Drawing.Tests.Perf_Image_Load.Image_FromStream_NoValidation` is a few times slower on Windows (potentially due to differences in the GDI+ code). Only the `NoValidation` benchmarks seem to run slower.

@stephentoub
* [x] A few `RegularExpressions` benchmarks like `System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "(?i)Sher[a-z]+|Hol[a-z]+", Options: Compiled)` are 40-50% slower. This pattern uses `IndexOfAny("HOho")` to find the next possible match location. It has a 256-bit vectorization path on x64 but only a 128-bit one on ARM64.

@jkotas @AndyAyersMS
* [ ] `PerfLabTests.LowLevelPerf.GenericClassGenericStaticField` benchmark can be from 16% to 3 times slower. Same goes for `PerfLabTests.LowLevelPerf.GenericClassGenericStaticMethod`.

@dotnet/jit-contrib
* [ ] `System.Security.Cryptography.Tests.Perf_Hashing.Sha1` is 17-55% slower.
* [ ] `System.IO.Tests.Perf_StreamWriter.WriteString(writeLength: 100)` is 21-46% slower. 
* [ ] `System.Text.Json.Serialization.Tests.WriteJson<BinaryData>.SerializeToStream` benchmark can be from 16% to 4 times slower. #35033
* [ ] `SIMD.ConsoleMandel` benchmarks are 40% slower. #66993
* [ ] `Burgers.Test3` is 12-59% slower. #66993
* [ ] A lot of `System.Collections.Contains` benchmarks are 2-3 times slower (most likely due to lack of vectorization). Same goes for `System.Memory.Span<Char>.IndexOfValue`, `System.Memory.Span<Char>.Fill`, `System.Memory.Span<Int32>.StartsWith`, `System.Memory.Span<Byte>.IndexOfAnyTwoValues` and `System.Memory.ReadOnlySpan.IndexOfString(Ordinal)`. #66993
* [ ] A lot of `SequenceCompareTo` benchmarks are from 30% to 4 times slower. #66993

@tannergooding
* [ ] `System.MathBenchmarks.Double.Exp` and `System.MathBenchmarks.Single.Exp` are 35% slower. #62302

@dotnet/area-system-globalization
* [ ] `System.Globalization.Tests.Perf_DateTimeCultureInfo.Parse(culturestring: ja)` benchmark can be from 20% to 7 times slower (it's most likely an ICU problem). #31273

* [x] Various `Perf_Interlocked` benchmarks are slower, but this is expected due to memory model differences.
* [ ] Various `Perf_Process.Start` benchmarks are slower, but only on macOS so it's most likely a macOS issue.
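
For context on the `PopCount_ulong` item above: when no dedicated hardware population-count path is used, the bit count is typically computed with scalar bit tricks, which costs several dependent operations per value. The sketch below shows the well-known SWAR (SIMD within a register) technique. It is written in Python for brevity, is not taken from the .NET sources, and says nothing about what the ARM64 JIT actually emits; the function name is made up for illustration.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # Python ints are unbounded, so emulate 64-bit wraparound


def popcount64_swar(value: int) -> int:
    """Count set bits in a 64-bit value using the classic SWAR reduction."""
    value &= MASK64
    # Step 1: each 2-bit field now holds the number of set bits it contained.
    value -= (value >> 1) & 0x5555555555555555
    # Step 2: add neighbouring 2-bit counts into 4-bit fields.
    value = (value & 0x3333333333333333) + ((value >> 2) & 0x3333333333333333)
    # Step 3: add neighbouring 4-bit counts into per-byte counts.
    value = (value + (value >> 4)) & 0x0F0F0F0F0F0F0F0F
    # Step 4: the multiply sums all byte counts into the top byte.
    return ((value * 0x0101010101010101) & MASK64) >> 56


# Cross-check against a straightforward bit count.
for sample in (0, 1, 0xFF, 0x8000000000000001, MASK64):
    assert popcount64_swar(sample) == bin(sample).count("1")
```

A single hardware instruction replaces all four steps, which is why the presence or absence of such a path can show up as a multiple-times difference in a microbenchmark.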
