x64 vs ARM64 Microbenchmarks Performance Study Report #67339

@adamsitnik

Recently @kunalspathak asked me if I could produce a report similar to #66848 for x64 vs arm64 comparison.

I took .NET 7 Preview2 results provided by @AndyAyersMS, @kunalspathak and myself for #66848, hacked the tool a little bit (it was not designed to compare different architecture results) and compared x64 vs arm64 using the following configs:

  • my 4-year-old MacBook Pro (x64): macOS Monterey 12.2.1, Intel Core i7-5557U CPU 3.10GHz (Broadwell), 1 CPU, 4 logical and 2 physical cores vs @AndyAyersMS's M1 Max (arm64): macOS Monterey 12.2.1, Apple M1 Max 2.40GHz, 1 CPU, 10 logical and 10 physical cores
  • @kunalspathak's Windows 10 (10.0.20348.587) machine with Intel Xeon Platinum 8272CL CPU 2.60GHz, 2 CPU, 104 logical and 52 physical cores vs @kunalspathak's Windows 11 (10.0.25058.1000) ARM64 machine with lots of cores

Of course it was not an apples-to-apples comparison, just the best thing we could do right now.

Full public results (without absolute values, as I don't have the permission to share them) can be found here.
Internal MS results (with absolute values) can be found here. If you don't have access, please ping me on Teams.

As usual, I've focused on the benchmarks that take longer to execute on arm64 compared to x64. If you are interested in benchmarks that take less time to execute, read the report linked above in reverse order.
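The ordering described above can be sketched roughly as follows. This is only an illustration in Python, not the actual comparison tool: `slowdown_ratios` is a hypothetical helper, and the benchmark names and timings are made-up placeholders rather than real measurements.

```python
def slowdown_ratios(x64_ns, arm64_ns):
    """Return (benchmark, arm64/x64 ratio) pairs, slowest-on-arm64 first.

    Only benchmarks present in both result sets are compared.
    """
    common = x64_ns.keys() & arm64_ns.keys()
    ratios = [(name, arm64_ns[name] / x64_ns[name]) for name in common]
    # A ratio above 1 means the benchmark takes longer on arm64.
    return sorted(ratios, key=lambda pair: pair[1], reverse=True)


# Placeholder timings in nanoseconds (NOT real measurements).
x64 = {"Example.A": 10.0, "Example.B": 40.0, "Example.C": 5.0}
arm64 = {"Example.A": 50.0, "Example.B": 20.0, "Example.D": 7.0}

report = slowdown_ratios(x64, arm64)
for name, ratio in report:
    print(f"{name}: {ratio:.2f}x")
```

Reading the resulting list from the bottom up gives the benchmarks that run faster on arm64.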

Benchmarks:

@kunalspathak

  • System.Numerics.Tests.Perf_BitOperations.PopCount_ulong is 5-8 times slower (most likely due to lack of vectorization). PopCount_uint is slower only on Windows.

@tannergooding @GrabYourPitchforks

@stephentoub @kouvel

  • Some RentReturnArrayPoolTests benchmarks are up to a few times slower, but these are multi-threaded and very often multimodal benchmarks. Related: Faster thread local statics #63619
  • System.Threading.Tests.Perf_Timer.AsynchronousContention is 2-3 times slower.

@wfurt @MihaZupan

  • A lot of SocketSendReceivePerfTest benchmarks like System.Net.WebSockets.Tests.SocketSendReceivePerfTest.ReceiveSend are 2 times slower.

@dotnet/area-system-drawing

  • System.Drawing.Tests.Perf_Image_Load.Image_FromStream_NoValidation is a few times slower on Windows. Only the NoValidation benchmarks seem to run slower.

@stephentoub

  • A few RegularExpressions benchmarks like System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "(?i)Sher[a-z]+|Hol[a-z]+", Options: Compiled) are 40-50% slower. This pattern uses IndexOfAny("HOho") to find the next possible match location. It has a 256-bit vectorization path on x64 but only 128-bit on ARM64.
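To illustrate the "find the next possible match location" strategy, here is a rough scalar sketch in Python. It is not the .NET implementation (which is vectorized), and `index_of_any` and `count_matches` are hypothetical helpers written for this example. It also assumes that "HOho" is the set of possible second characters of a match ('h' in "Sher", 'o' in "Hol", in either case); that is my reading of the pattern, not something the report states.

```python
import re

# Python equivalent of the benchmark pattern with the (?i) option.
PATTERN = re.compile(r"Sher[a-z]+|Hol[a-z]+", re.IGNORECASE)

# Assumption: these are the possible SECOND characters of any match.
SECOND_CHARS = "HOho"


def index_of_any(text, chars, start):
    """Scalar stand-in for IndexOfAny: first index >= start, or -1."""
    for i in range(start, len(text)):
        if text[i] in chars:
            return i
    return -1


def count_matches(text):
    """Count non-overlapping matches, jumping between candidate positions."""
    count = 0
    pos = 0  # earliest index at which the next match may start
    while True:
        # A match starting at >= pos has its second character at >= pos + 1.
        hit = index_of_any(text, SECOND_CHARS, pos + 1)
        if hit == -1:
            return count
        match = PATTERN.match(text, hit - 1)  # full check at the candidate
        if match:
            count += 1
            pos = match.end()
        else:
            pos = hit  # no match started at hit - 1, so move on


sample = "Sherlock Holmes met sherlock; holy HOLMES, she said."
print(count_matches(sample))
```

The expensive regex check only runs at positions the character search flags, so the speed of that search (and the width of its vectorized path) dominates on inputs with few candidates.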

@jkotas @AndyAyersMS

  • The PerfLabTests.LowLevelPerf.GenericClassGenericStaticField benchmark can be from 16% to 3 times slower. The same goes for PerfLabTests.LowLevelPerf.GenericClassGenericStaticMethod.

@dotnet/jit-contrib

@tannergooding

@dotnet/area-system-globalization

  • The System.Globalization.Tests.Perf_DateTimeCultureInfo.Parse(culturestring: ja) benchmark can be from 20% to 7 times slower (it's most likely an ICU problem). Related: Initializing the "ja" culture takes 200ms when using ICU #31273

  • Various Perf_Interlocked benchmarks are slower, but this is expected due to memory model differences.

  • Various Perf_Process.Start benchmarks are slower, but only on macOS so it's most likely a macOS issue.

Labels: area-Meta, tenet-performance, tracking
