Implement NarrowUtf16ToAscii for AArch64 #70080

SwapnilGaikwad · 2022-06-01T13:41:02Z

Fixes #41292 partially.

ghost · 2022-06-01T13:41:10Z

Tagging subscribers to this area: @dotnet/area-system-text-encoding
See info in area-owners.md if you want to be subscribed.

dnfadmin · 2022-06-01T13:41:16Z

All CLA requirements met.

SwapnilGaikwad · 2022-06-01T13:44:05Z

Hi @kunalspathak, you might want to take a look at this PR.

src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs

kunalspathak · 2022-06-01T13:49:47Z

@dotnet/jit-contrib

kunalspathak · 2022-06-01T13:52:21Z

Thanks @SwapnilGaikwad for your contribution. Could you also share some performance numbers? You can start with https://github.com/dotnet/performance/blob/d7dac8a7ca12a28d099192f8a901cf8e30361384/src/benchmarks/micro/libraries/System.Text.Encoding/Perf.Encoding.cs.

SwapnilGaikwad · 2022-06-01T18:09:27Z

> Thanks @SwapnilGaikwad for your contribution. Could you also share some performance numbers? You can start with https://github.com/dotnet/performance/blob/d7dac8a7ca12a28d099192f8a901cf8e30361384/src/benchmarks/micro/libraries/System.Text.Encoding/Perf.Encoding.cs.

Hi Kunal, I did a quick test on an A72 using the GetBytes method from the System.Text.Tests.Perf_Encoding class. For strings of size 2048, the patch shows a difference of about 6% (ascii) and 3% (utf-8) in execution time relative to main.
However, I had to comment out the debug asserts to keep them out of the emitted assembly. We are inspecting the assembly further to spot suboptimal instruction sequences (I'll work on this next week because of public holidays).

kunalspathak · 2022-06-13T16:53:16Z

@SwapnilGaikwad - Did you get a chance to make any progress?

SwapnilGaikwad · 2022-06-13T17:49:40Z

Hi @kunalspathak, we made progress. We are currently benchmarking a version that combines the SSE2 and ASIMD implementations and addresses the comments above. I hope to get it ready by tomorrow.

SwapnilGaikwad · 2022-06-15T16:36:13Z

Hi @kunalspathak, the patch now makes more use of the vector API. For ASCII strings of 512 size, it executes in about 0.92x on AArch64 and 0.86x on x86 compared to the execution time for the HEAD.
The generic vector implementation (in the Vector.IsHardwareAccelerated block) performs chunked reads, so the improvement is not as significant as one would expect when moving from a scalar to a SIMD version.
Do you have any recommendations for extracting the assembly for the intrinsic? Couldn't get it using --disassm option from the docs.

kunalspathak · 2022-06-15T16:51:49Z

> For ASCII strings of 512 size, it executes in about 0.92x on AArch64 and 0.86x on x86 compared to the execution time for the HEAD.

That sounds great. Do you mind posting the actual numbers like done in #70654 (comment)?

> Couldn't get it using --disassm option from the docs.

I think it has a typo. Can you try with --disasm? You can also set COMPlus_JitDisasm=<methodName> to see the disassembly. Note that it will only work on debug/checked clrjit.

SwapnilGaikwad · 2022-06-15T17:43:35Z

Hi @kunalspathak, here are the numbers.

On AArch64 (Altra)

|   Method |        Job |                                                                                              Toolchain | size | encName |      Mean |    Error |   StdDev |    Median |       Min |       Max | Ratio | MannWhitney(2%) |  Gen 0 | Allocated | Alloc Ratio |
|--------- |----------- |------------------------------------------------------------------------------------------------------- |----- |-------- |----------:|---------:|---------:|----------:|----------:|----------:|------:|---------------- |-------:|----------:|------------:|
| GetBytes | Job-RILZMR |     /runtime_HEAD/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun   |   16 |   ascii |  30.58 ns | 0.051 ns | 0.048 ns |  30.59 ns |  30.43 ns |  30.63 ns |  1.00 |            Base | 0.0764 |      40 B |        1.00 |
| GetBytes | Job-UMNKCC | /runtime_intrinsic/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun  |   16 |   ascii |  31.76 ns | 0.023 ns | 0.019 ns |  31.77 ns |  31.74 ns |  31.80 ns |  1.04 |          Slower | 0.0765 |      40 B |        1.00 |
|          |            |                                                                                                        |      |         |           |          |          |           |           |           |       |                 |        |           |             |
| GetBytes | Job-RILZMR |     /runtime_HEAD/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun   |   16 |   utf-8 |  33.70 ns | 0.044 ns | 0.034 ns |  33.70 ns |  33.63 ns |  33.75 ns |  1.00 |            Base | 0.0764 |      40 B |        1.00 |
| GetBytes | Job-UMNKCC | /runtime_intrinsic/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun  |   16 |   utf-8 |  31.55 ns | 0.087 ns | 0.073 ns |  31.56 ns |  31.41 ns |  31.66 ns |  0.94 |          Faster | 0.0765 |      40 B |        1.00 |
|          |            |                                                                                                        |      |         |           |          |          |           |           |           |       |                 |        |           |             |
| GetBytes | Job-RILZMR |     /runtime_HEAD/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun   |  512 |   ascii | 181.52 ns | 0.199 ns | 0.186 ns | 181.42 ns | 181.25 ns | 181.97 ns |  1.00 |            Base | 1.0243 |     536 B |        1.00 |
| GetBytes | Job-UMNKCC | /runtime_intrinsic/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun  |  512 |   ascii | 164.45 ns | 2.145 ns | 2.006 ns | 163.97 ns | 162.13 ns | 168.03 ns |  0.91 |          Faster | 1.0244 |     536 B |        1.00 |
|          |            |                                                                                                        |      |         |           |          |          |           |           |           |       |                 |        |           |             |
| GetBytes | Job-RILZMR |     /runtime_HEAD/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun   |  512 |   utf-8 | 229.04 ns | 0.175 ns | 0.137 ns | 229.02 ns | 228.82 ns | 229.31 ns |  1.00 |            Base | 1.0240 |     536 B |        1.00 |
| GetBytes | Job-UMNKCC | /runtime_intrinsic/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun  |  512 |   utf-8 | 229.45 ns | 0.770 ns | 0.720 ns | 229.49 ns | 228.42 ns | 231.02 ns |  1.00 |            Same | 1.0240 |     536 B |        1.00 |

On x86 (Xeon Gold 5120T)

|   Method |        Job |                                                                                                        Toolchain | size | encName |      Mean |    Error |   StdDev |    Median |       Min |       Max | Ratio | MannWhitney(2%) |  Gen 0 | Allocated | Alloc Ratio |
|--------- |----------- |----------------------------------------------------------------------------------------------------------------- |----- |-------- |----------:|---------:|---------:|----------:|----------:|----------:|------:|---------------- |-------:|----------:|------------:|
| GetBytes | Job-HGFNCZ |         /runtime_HEAD/artifacts/bin/testhost/net7.0-Linux-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun |   16 |   ascii |  24.13 ns | 0.091 ns | 0.076 ns |  24.11 ns |  24.06 ns |  24.31 ns |  1.00 |            Base | 0.0039 |      40 B |        1.00 |
| GetBytes | Job-RVIYCK | /runtime_intrinsic/artifacts/bin/testhost/net7.0-Linux-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun    |   16 |   ascii |  25.86 ns | 0.229 ns | 0.203 ns |  25.84 ns |  25.58 ns |  26.33 ns |  1.07 |          Slower | 0.0039 |      40 B |        1.00 |
|          |            |                                                                                                                  |      |         |           |          |          |           |           |           |       |                 |        |           |             |
| GetBytes | Job-HGFNCZ |         /runtime_HEAD/artifacts/bin/testhost/net7.0-Linux-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun |   16 |   utf-8 |  25.83 ns | 0.102 ns | 0.085 ns |  25.79 ns |  25.74 ns |  26.03 ns |  1.00 |            Base | 0.0039 |      40 B |        1.00 |
| GetBytes | Job-RVIYCK | /runtime_intrinsic/artifacts/bin/testhost/net7.0-Linux-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun    |   16 |   utf-8 |  23.94 ns | 0.293 ns | 0.260 ns |  23.80 ns |  23.75 ns |  24.59 ns |  0.93 |          Faster | 0.0040 |      40 B |        1.00 |
|          |            |                                                                                                                  |      |         |           |          |          |           |           |           |       |                 |        |           |             |
| GetBytes | Job-HGFNCZ |         /runtime_HEAD/artifacts/bin/testhost/net7.0-Linux-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 |   ascii | 103.24 ns | 0.175 ns | 0.137 ns | 103.21 ns | 103.11 ns | 103.60 ns |  1.00 |            Base | 0.0532 |     536 B |        1.00 |
| GetBytes | Job-RVIYCK | /runtime_intrinsic/artifacts/bin/testhost/net7.0-Linux-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun    |  512 |   ascii |  88.31 ns | 0.355 ns | 0.296 ns |  88.25 ns |  87.88 ns |  89.00 ns |  0.86 |          Faster | 0.0531 |     536 B |        1.00 |
|          |            |                                                                                                                  |      |         |           |          |          |           |           |           |       |                 |        |           |             |
| GetBytes | Job-HGFNCZ |         /runtime_HEAD/artifacts/bin/testhost/net7.0-Linux-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 |   utf-8 | 124.10 ns | 0.469 ns | 0.392 ns | 123.95 ns | 123.82 ns | 125.06 ns |  1.00 |            Base | 0.0531 |     536 B |        1.00 |
| GetBytes | Job-RVIYCK | /runtime_intrinsic/artifacts/bin/testhost/net7.0-Linux-Release-x64/shared/Microsoft.NETCore.App/7.0.0/corerun    |  512 |   utf-8 | 119.51 ns | 0.834 ns | 0.696 ns | 119.62 ns | 118.29 ns | 120.67 ns |  0.96 |          Faster | 0.0531 |     536 B |        1.00 |

The --disasm flag reports that disassembly is not supported on AArch64; it prints an "Arm64 is not supported (Iced library limitation)" message. I used the COMPlus_JitDisasm=<methodName> option earlier, but it also dumped the assembly for the asserts, so I commented them out. To be sure about the sequence, I tried with the release version but couldn't get the assembly (I noted that this is not possible). I am not sure whether the assembly from the checked/debug version without asserts is as optimal as (or can be assumed to match) the release version. Would you recommend using it for reference?

kunalspathak · 2022-06-15T17:49:51Z

You can build everything release - coreclr/libraries and then drop a checked clrjit to make that environment variable work.

kunalspathak · 2022-06-15T17:51:06Z

> Hi @kunalspathak, here are the numbers.

Thanks for sharing the numbers. It seems the ascii-16 bytes is slightly slower. Do you know why?

kunalspathak

Changes look good overall. I would like to see the disasm code difference for SSE2 as well as the disasm for AdvSimd. @tannergooding - do you mind taking a look as well?

kunalspathak · 2022-06-16T13:53:10Z

src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs

I would be curious to see the assembly difference for SSE2.

👍. Vector64<T> is a bit odd on x86/x64 since it's treated as a regular struct { ulong _value; }, so I'd expect promotion/etc. to do the right thing here,
but it might not be as efficient as extracting the lowest UInt64 scalar (it probably could be with some tweaks, however, if it does differ).

It seems that switching to StoreUnsafe introduced a longer assembly sequence with a few unnecessary movs.
On HEAD, we get the following assembly for extract and store.

    vpackuswb xmm0, xmm0, xmm0
    vmovq     qword ptr [r15], xmm0
    mov       r12d, 8
    test      r15b, 8

With the above change, extract and store look as follows.

    mov       r12, rbx
    vpackuswb xmm0, xmm0, xmm0
    vmovapd   xmmword ptr [rbp-80H], xmm0
    mov       rax, qword ptr [rbp-80H]
    mov       qword ptr [rbp-58H], rax
    mov       rax, qword ptr [rbp-58H]
    mov       qword ptr [r12], rax
    mov       r13d, 8
    test      bl, 8

Would you suggest sticking to the Vector128 API in this case?
Full assembly dumps are available here - HEAD, PR
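
For readers following the store discussion, here is a minimal sketch of the two API shapes being compared for writing the low 8 bytes of a narrowed vector. It assumes the .NET 7 vector APIs; the class and method names are hypothetical, and the exact code in the PR or on HEAD may differ.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical illustration only; not the code from ASCIIUtility.cs.
static class StoreLowerSketch
{
    // Cross-platform shape: take the lower 64 bits as a Vector64<byte>
    // and store it without any alignment requirement.
    public static unsafe void StoreLowerPortable(Vector128<byte> v, byte* dest)
    {
        v.GetLower().StoreUnsafe(ref *dest);
    }

    // x86-specific shape: store the lowest UInt64 scalar directly (a movq-style store).
    // The caller must check Sse2.IsSupported first; otherwise this throws
    // PlatformNotSupportedException.
    public static unsafe void StoreLowerSse2(Vector128<byte> v, byte* dest)
    {
        Sse2.StoreScalar((ulong*)dest, v.AsUInt64());
    }
}
```

Both methods write the same 8 bytes; the difference under discussion is only in the instruction sequence the JIT emits for each shape. Compiling the sketch requires AllowUnsafeBlocks.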

I meant - can you share the arm64 versions? You can also turn off the address displays using set COMPlus_JitDiffableDasm=1.

Assembly dumps without Debug.Assert() for x86: HEAD, PR.
Having the asserts introduced additional stack movs. With the debug asserts commented out, the assembly may closely resemble that of a release build.

I meant - can you share the arm64 versions? You can also turn off the address displays using set COMPlus_JitDiffableDasm=1.

Here they are: HEAD, PR

They are identical.

Here is the updated one. For the PR, assembly for NarrowUtf16ToAscii_Intrinsified should have been dumped instead of NarrowUtf16ToAscii.

I believe this one uses StoreUnsafe because I don't see AV related code generated as you pointed out in #70080 (comment)?

Yup, this one uses StoreUnsafe. The aligned stores are only performed in a loop, now changing them to StoreUnsafe.

SwapnilGaikwad · 2022-06-16T16:12:15Z

> > Hi @kunalspathak, here are the numbers.
>
> Thanks for sharing the numbers. It seems the ascii-16 bytes is slightly slower. Do you know why?

It seems the loop-peeling logic that advances the destination to an 8-byte-aligned address adds a slight overhead for smaller strings.
However, for longer strings the loop peeling pays off. If we remove the 8-byte-aligned write logic, we see negligible overhead compared to the current HEAD [1].

[1] Microbenchmark results after removing the aligned write logic:
release_runtime_intrinsic: The current PR
release_without_peel: The current PR without aligned write (commenting out lines 1551 to 1570)
runtime_HEAD: Current HEAD

|   Method |        Job |                                                                                                     Toolchain | size | encName |      Mean |    Error |   StdDev |    Median |       Min |       Max | Ratio | MannWhitney(2%) |  Gen 0 | Allocated | Alloc Ratio |
|--------- |----------- |-------------------------------------------------------------------------------------------------------------- |----- |-------- |----------:|---------:|---------:|----------:|----------:|----------:|------:|---------------- |-------:|----------:|------------:|
| GetBytes | Job-TLYBOZ | /release_runtime_intrinsic/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |   16 |   ascii |  34.72 ns | 0.039 ns | 0.037 ns |  34.72 ns |  34.66 ns |  34.78 ns |  1.07 |          Slower | 0.0765 |      40 B |        1.00 |
| GetBytes | Job-NLAVLI |      /release_without_peel/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |   16 |   ascii |  32.07 ns | 0.058 ns | 0.051 ns |  32.06 ns |  31.95 ns |  32.15 ns |  0.99 |            Same | 0.0764 |      40 B |        1.00 |
| GetBytes | Job-AMSBUC |              /runtime_HEAD/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |   16 |   ascii |  32.48 ns | 0.046 ns | 0.043 ns |  32.49 ns |  32.41 ns |  32.54 ns |  1.00 |            Base | 0.0765 |      40 B |        1.00 |
|          |            |                                                                                                               |      |         |           |          |          |           |           |           |       |                 |        |           |             |
| GetBytes | Job-YCMYKR | /release_runtime_intrinsic/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 |   ascii | 158.93 ns | 0.265 ns | 0.221 ns | 158.93 ns | 158.52 ns | 159.43 ns |  0.90 |          Faster | 1.0241 |     536 B |        1.00 |
| GetBytes | Job-NLAVLI |      /release_without_peel/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 |   ascii | 168.20 ns | 1.472 ns | 1.377 ns | 168.33 ns | 166.01 ns | 170.06 ns |  0.96 |          Faster | 1.0246 |     536 B |        1.00 |
| GetBytes | Job-AMSBUC |              /runtime_HEAD/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 |   ascii | 175.96 ns | 0.719 ns | 0.672 ns | 175.88 ns | 174.48 ns | 177.05 ns |  1.00 |            Base | 1.0239 |     536 B |        1.00 |
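
To make the loop-peeling trade-off concrete, here is a minimal scalar sketch of the idea: narrow single elements until the destination is 8-byte aligned, then write 8 bytes per iteration. It is an assumption-laden illustration, not the runtime's NarrowUtf16ToAscii implementation; the class and method names are hypothetical, and it assumes a little-endian target.

```csharp
using System;

// Hypothetical scalar sketch of loop peeling; NOT the runtime's implementation.
static class NarrowSketch
{
    // Narrows UTF-16 code units to ASCII bytes. Returns the number of leading
    // elements narrowed before the first non-ASCII code unit (or count if none).
    public static unsafe nuint NarrowWithPeel(char* src, byte* dst, nuint count)
    {
        nuint i = 0;

        // Peel: narrow one element at a time until (dst + i) is 8-byte aligned.
        // For short inputs this extra loop is pure overhead.
        while (i < count && ((nuint)(dst + i) & 7) != 0)
        {
            if (src[i] > 0x7F) return i;
            dst[i] = (byte)src[i];
            i++;
        }

        // Body: eight elements per iteration, written with one aligned 8-byte store.
        while (count - i >= 8)
        {
            ulong packed = 0;
            uint seen = 0;
            for (int j = 0; j < 8; j++)
            {
                uint c = src[i + (nuint)j];
                seen |= c;                               // accumulate bits to detect non-ASCII
                packed |= (ulong)(c & 0xFF) << (8 * j);  // byte j of the 8-byte word
            }
            if (seen > 0x7F) break;                      // non-ASCII somewhere in this block
            *(ulong*)(dst + i) = packed;                 // assumes little-endian byte order
            i += 8;
        }

        // Tail: remaining elements, or the ASCII prefix of a block that was rejected above.
        while (i < count && src[i] <= 0x7F)
        {
            dst[i] = (byte)src[i];
            i++;
        }
        return i;
    }
}
```

With a 16-element input, the peel loop can consume up to 7 elements before the aligned body runs at most once, which is consistent with the small-string overhead described above. Compiling the sketch requires AllowUnsafeBlocks.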

SwapnilGaikwad · 2022-06-16T16:51:27Z

How can I dump the assembly for NarrowUtf16ToAscii_Intrinsified to ensure that the constants are not emitted? When running the micro-benchmarks, the COMPlus_JitDump option doesn't dump the assembly for the narrowing method, even with the checked+debug build. It dumps other methods. I used the following command.

COMPlus_JitDump="*" dotnet run -c Release -f net7.0 --filter "System.Text.Tests.Perf_Encoding.GetBytes" --corerun "$HEAD/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun" "$PATCH/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun" --statisticalTest 2%

Also, rebuilding with ./build.sh clr -rc checked -lc release doesn't update the corerun binaries. A clean build avoids this issue but takes much longer. Alternatively, I created a console app, extracted the NarrowUtf16ToAscii_Intrinsified method into it, and executed it using the skeleton used by ASCIIUtilityTests.cs.
The COMPlus_JitDump works fine there. However, it dumps quite suboptimal assembly including logic to throw PlatformNotSupportedException().

@kunalspathak @tannergooding Do you guys have any better ways to extract the assembly reliably? 🤔

kunalspathak · 2022-06-16T17:04:15Z

> @kunalspathak @tannergooding Do you guys have any better ways to extract the assembly reliably? 🤔

Follow #70080 (comment). To remove unnecessary asserts, you will need release version of SPC. Did you try #70080 (comment)?

SwapnilGaikwad · 2022-06-17T13:15:46Z

> > @kunalspathak @tannergooding Do you guys have any better ways to extract the assembly reliably? 🤔
>
> Follow #70080 (comment). To remove unnecessary asserts, you will need release version of SPC. Did you try #70080 (comment)?

Thanks! I can now extract the assembly. The missing piece was that the build wasn't checked.

kunalspathak

Need to update a comment before we can go ahead with the merge.

Also, I see that when the review comments are addressed, they are squashed in previous commits. It's not something that is common on this repo. As a reviewer, I have to review entire changes instead of just the updates that were done as part of review comment. A better approach would be to not squash the commit and so those can be reviewed as a standalone change, and it would make my role as reviewer a lot easier. Is there some benefit to the approach you are using?

SwapnilGaikwad · 2022-06-29T13:23:11Z

> Need to update a comment before we can go ahead with the merge.
>
> Also, I see that when the review comments are addressed, they are squashed in previous commits. It's not something that is common on this repo. As a reviewer, I have to review entire changes instead of just the updates that were done as part of review comment. A better approach would be to not squash the commit and so those can be reviewed as a standalone change, and it would make my role as reviewer a lot easier. Is there some benefit to the approach you are using?

Sure. Apologies for the inconvenience. I'll add separate commits going forward.

kunalspathak

LGTM. Thank you for your contribution!

ghost added the area-System.Text.Encoding and community-contribution labels Jun 1, 2022

stephentoub reviewed Jun 1, 2022
