-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Implement NarrowUtf16ToAscii for AArch64 #70080
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement NarrowUtf16ToAscii for AArch64 #70080
Conversation
|
Tagging subscribers to this area: @dotnet/area-system-text-encoding Issue DetailsFixes #41292 partially.
|
|
Hi @kunalspathak, you might want to take a look at this PR. |
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
|
@dotnet/jit-contrib |
|
Thanks @SwapnilGaikwad for your contribution. Could you also share some performance numbers? You can start with https://github.com/dotnet/performance/blob/d7dac8a7ca12a28d099192f8a901cf8e30361384/src/benchmarks/micro/libraries/System.Text.Encoding/Perf.Encoding.cs. |
Hi Kunal, did a quick test on A72 for the |
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
|
@SwapnilGaikwad - Did you get chance to make any progress? |
|
Hi @kunalspathak, we made progress. Currently benchmarking the version that combines SSE2 and ASIMD implementations along with the above comments. I hope to get it ready by tomorrow. |
ef51903 to
daf52d6
Compare
|
Hi @kunalspathak, now the patch makes use of the vector API more. For ASCII strings of 512 size, it executes in about 0.92x on AArch64 and 0.86x on x86 compared to the execution time for the HEAD. |
That sounds great. Do you mind posting the actual numbers like done in #70654 (comment)?
I think it has a typo. Can you try with |
|
Hi @kunalspathak, here are the numbers. On x86 (Xeon Gold 5120T) The |
|
You can build everything release - coreclr/libraries and then drop a checked clrjit to make that environment variable work. |
Thanks for sharing the numbers. It seems the ascii-16 bytes is slightly slower. Do you know why? |
kunalspathak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes looks good overall. I would like to see the disasm code difference of SSE2 as well as the disasm for AdvSimd. @tannergooding - do you mind taking a look as well?
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be curious to see the assembly difference for SSE2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍. Vector64<T> is a bit odd on x86/x64 since its treated as a regular struct { ulong _value; }, so I'd expect promotion/etc to do the right thing here
but it might not be as efficient as extracting the lowest UInt64 scalar (it probably could be with some tweaks however if it does differ)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems that by switching to StoreUnsafe introduced a bigger sequence of assembly with a few unnecessary movs.
On HEAD, we get the following assembly for extract and store.
vpackuswb xmm0, xmm0, xmm0
vmovq qword ptr [r15], xmm0
mov r12d, 8
test r15b, 8
With the above change, extract and store looks as following.
mov r12, rbx
vpackuswb xmm0, xmm0, xmm0
vmovapd xmmword ptr [rbp-80H], xmm0
mov rax, qword ptr [rbp-80H]
mov qword ptr [rbp-58H], rax
mov rax, qword ptr [rbp-58H]
mov qword ptr [r12], rax
mov r13d, 8
test bl, 8
Would you suggest to stick to Vector128 api in this case?
Full assembly dumps are available here - HEAD, PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant - can you share the arm64 versions? You can also turn off the address displays using set COMPlus_JitDiffableDasm=1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are identical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the updated one. For the PR, assembly for NarrowUtf16ToAscii_Intrinsified should have been dumped instead of NarrowUtf16ToAscii.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this one uses StoreUnsafe because I don't see AV related code generated as you pointed out in #70080 (comment)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, this one uses StoreUnsafe. The aligned stores are only performed in a loop, now changing them to StoreUnsafe.
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
It seems the loop peeling logic to get the destination pointing to 8-byte aligned address adds a slight overhead for smaller strings. [1] Microbenchmark results after removing the aligned write logic: |
|
How can I dump the assembly for
Also, re-building with @kunalspathak @tannergooding Do you guys have any better ways to extract the assembly reliably? 🤔 |
Follow #70080 (comment). To remove unnecessary asserts, you will need release version of SPC. Did you try #70080 (comment)? |
daf52d6 to
a5bb5cf
Compare
Thanks! Now can extract the assembly. The missing piece was that the build wasn't checked. |
a5bb5cf to
6968765
Compare
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
9fc032e to
5ae8594
Compare
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
5ae8594 to
3b775f3
Compare
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
3b775f3 to
e6f6cb9
Compare
kunalspathak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to update a comment before we can go ahead with the merge.
Also, I see that when the review comments are addressed, they are squashed in previous commits. It's not something that is common on this repo. As a reviewer, I have to review entire changes instead of just the updates that were done as part of review comment. A better approach would be to not squash the commit and so those can be reviewed as a standalone change, and it would make my role as reviewer a lot easier. Is there some benefit to the approach you are using?
src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs
Outdated
Show resolved
Hide resolved
Sure. Apologies for inconvenience. I'll use add separate commits going forward. |
kunalspathak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thank you for your contribution!
Fixes #41292 partially.