AVX2 enabling #164
Merged
Conversation
This involves a number of general infrastructure additions such as new versions of lemmas and tools for 256-bit and 512-bit sizes, as well as extra utility functions to handle decoding of x86 SIMD instructions.

So far, we just decode the single instruction VPXOR, but adding more instructions should now be relatively easy. The "decode_aux" function is slightly uglified by using conditional branches instead of pattern matching, to work round limitations of the evaluation machinery. We plan to replace this soon anyway with simpler code based on the HOL Light Compute module, so we won't fix it now.

Although the register model extends up to ZMMs for possible future extension with AVX512, all the practical simulation infrastructure treats the YMM registers as basic and records updates to those registers. The treatment of XMMs within YMMs within ZMMs performs zero-extension on write (analogous to the treatment of EAX within RAX on the general-purpose registers). This is in conformance with the specified behavior when the operation is VEX-encoded, which is all our decoder is handling. See the Intel combined ISA manual 15.5, e.g.:

"The lower 128 bits of a YMM register is aliased to the corresponding XMM register. Legacy SSE instructions (i.e., SIMD instructions operating on XMM state but not using the VEX prefix, also referred to non-VEX encoded SIMD instructions) will not access the upper bits (MAXVL-1:128) of the YMM registers. AVX and FMA instructions with a VEX prefix and vector length of 128-bits zeroes the upper 128 bits of the YMM register."

The formal specifications of the standard x86 ABI and the Microsoft x64 ABI (usually called the "Windows ABI") are both updated to include the possible modification of the SIMD register state. The specs for both ABIs are taken directly from Table 4 here:

https://www.agner.org/optimize/calling_conventions.pdf

For the standard ABI *all* the ZMM registers are treated as scratch.
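The zero-extension-on-write behavior can be sketched in a few lines; this is an illustrative Python model of the register semantics described above, not the HOL Light formalization (the function names here are invented for the sketch):

```python
# Illustrative model of writes to the SIMD register file. The full
# register is modeled as one wide integer; a VEX-encoded 128-bit write
# to the XMM portion zeroes all the higher bits, just as a write to EAX
# zeroes the upper 32 bits of RAX. A legacy (non-VEX) SSE write would
# instead leave the upper bits untouched.

MASK128 = (1 << 128) - 1

def write_xmm_vex(zmm_old: int, value128: int) -> int:
    """VEX-encoded 128-bit write: the result is zero-extended,
    so the old upper bits (zmm_old) are discarded entirely."""
    return value128 & MASK128

def write_xmm_legacy_sse(zmm_old: int, value128: int) -> int:
    """Legacy SSE write: only the low 128 bits change."""
    return (zmm_old & ~MASK128) | (value128 & MASK128)
```

For example, after a VEX-encoded write, any stale data above bit 127 is gone, while the legacy write merges the new low half into the old register value.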
The fact that the Windows ABI is less liberal (requiring XMM6..XMM15 to be preserved) necessitates some trivial but extensive changes to existing proofs: we can no longer simulate through subroutines that use MAYCHANGE_REGS_AND_FLAGS_PERMITTED_BY_ABI to abbreviate the possible state updates, since such a subroutine might in principle modify XMM6..XMM15 according to its spec. We work round this using a new function X86_SIMD_SHARPEN_RULE which in essence reruns the relevant subroutine wrapper boilerplate on a theorem to create a sharper one.
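The difference between the two conventions can be captured as a tiny lookup; this is a sketch based on the description above (all ZMM registers scratch under the standard ABI, XMM6..XMM15 preserved under the Windows ABI), with a made-up helper name:

```python
# Which SIMD registers a callee must preserve under each calling
# convention, per Table 4 of Agner Fog's calling-conventions document.
#   "sysv"    - standard x86-64 ABI: every vector register is scratch.
#   "windows" - Microsoft x64 ABI: the low 128 bits of XMM6..XMM15
#               are callee-saved.

def callee_saved_simd(abi: str) -> set:
    if abi == "sysv":
        return set()  # all ZMM registers are caller-saved (scratch)
    if abi == "windows":
        return {f"XMM{i}" for i in range(6, 16)}
    raise ValueError(f"unknown ABI: {abi}")
```

This is exactly why a Windows-ABI MAYCHANGE clause cannot silently include the whole SIMD state: the set on the "windows" side is nonempty.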
Contributor
Thanks for starting this! I will try and work on it over my Christmas and NY break.
This enables general pattern-matching over the VEXM type in the "evaluate" function. (It's a bit ad hoc to do it this way for the different types rather than via general code, but we will probably make more substantial changes here soon.) Taking advantage of this improvement, we rewrite decode_aux with pattern-matching instead of the nested conditional branches for the VEXM part, which seems more natural.
This extends the state considered from just the flags and integer registers (other than RSP) to include YMM0..YMM15 too. It also explicitly adds tests for the new VPXOR instruction which would otherwise be unexercised since it's not yet used in the code.
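As a reference model for such tests, VPXOR on YMM registers is just a 256-bit bitwise XOR, with the result zero-extended on write as described above (a Python sketch, not the formal specification):

```python
# Reference model of VPXOR ymm, ymm, ymm: bitwise XOR of the low
# 256 bits. Because VEX-encoded writes zero-extend, any bits of the
# destination above bit 255 become zero.

MASK256 = (1 << 256) - 1

def vpxor_ymm(ymm1: int, ymm2: int) -> int:
    return (ymm1 ^ ymm2) & MASK256
```

In particular, XOR-ing a register with itself yields zero, a handy sanity check for any simulator run.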
aqjune-aws approved these changes Jan 15, 2025

aqjune-aws (Collaborator) left a comment
I read the code and the updates looked reasonable to me. Also I ran the simulator more extensively by only leaving the VPXOR cases and running for 20 minutes, and it passed.