AVX2 enabling #164
Merged
Conversation
This involves a number of general infrastructure additions such as new versions of lemmas and tools for 256-bit and 512-bit sizes, as well as extra utility functions to handle decoding of x86 SIMD instructions.

So far, we just decode the single instruction VPXOR, but adding more instructions should now be relatively easy. The "decode_aux" function is slightly uglified by using conditional branches instead of pattern matching, to work round limitations of the evaluation machinery. We plan to replace this soon anyway with simpler code based on the HOL Light Compute module, so we won't fix it now.

Although the register model extends up to ZMMs for possible future extension with AVX512, all the practical simulation infrastructure treats the YMM registers as basic and records updates to those registers. The treatment of XMMs within YMMs within ZMMs performs zero-extension on write (analogous to the treatment of EAX within RAX on the general-purpose registers). This is in conformance with the specified behavior when the operation is VEX-encoded, which is all our decoder is handling. See the Intel combined ISA manual 15.5, e.g.:

"The lower 128 bits of a YMM register is aliased to the corresponding XMM register. Legacy SSE instructions (i.e., SIMD instructions operating on XMM state but not using the VEX prefix, also referred to non-VEX encoded SIMD instructions) will not access the upper bits (MAXVL-1:128) of the YMM registers. AVX and FMA instructions with a VEX prefix and vector length of 128-bits zeroes the upper 128 bits of the YMM register."

The formal specifications of the standard x86 ABI and the Microsoft x64 ABI (usually called the "Windows ABI") are both updated to include the possible modification of the SIMD register state. The specs for both ABIs are taken directly from Table 4 here:

https://www.agner.org/optimize/calling_conventions.pdf

For the standard ABI *all* the ZMM registers are treated as scratch.
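The zero-extension-on-write behavior can be sketched in a few lines; this is an illustrative Python model of the register semantics described above, not the HOL Light formalization (the function names here are invented for the sketch):

```python
# Illustrative model of writes to the SIMD register file. The full
# register is modeled as one wide integer; a VEX-encoded 128-bit write
# to the XMM portion zeroes all the higher bits, just as a write to EAX
# zeroes the upper 32 bits of RAX. A legacy (non-VEX) SSE write would
# instead leave the upper bits untouched.

MASK128 = (1 << 128) - 1

def write_xmm_vex(zmm_old: int, value128: int) -> int:
    """VEX-encoded 128-bit write: the result is zero-extended,
    so the old upper bits (zmm_old) are discarded entirely."""
    return value128 & MASK128

def write_xmm_legacy_sse(zmm_old: int, value128: int) -> int:
    """Legacy SSE write: only the low 128 bits change."""
    return (zmm_old & ~MASK128) | (value128 & MASK128)
```

For example, after a VEX-encoded write, any stale data above bit 127 is gone, while the legacy write merges the new low half into the old register value.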
The fact that the Windows ABI is less liberal (requiring XMM6..XMM15 to be preserved) necessitates some trivial but extensive changes to existing proofs: we can no longer simulate through subroutines that use MAYCHANGE_REGS_AND_FLAGS_PERMITTED_BY_ABI to abbreviate the possible state updates, since such a subroutine might in principle modify XMM6..XMM15 according to its spec. We work round this using a new function X86_SIMD_SHARPEN_RULE which in essence reruns the relevant subroutine wrapper boilerplate on a theorem to create a sharper one.
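The difference between the two conventions can be captured as a tiny lookup; this is a sketch based on the description above (all ZMM registers scratch under the standard ABI, XMM6..XMM15 preserved under the Windows ABI), with a made-up helper name:

```python
# Which SIMD registers a callee must preserve under each calling
# convention, per Table 4 of Agner Fog's calling-conventions document.
#   "sysv"    - standard x86-64 ABI: every vector register is scratch.
#   "windows" - Microsoft x64 ABI: the low 128 bits of XMM6..XMM15
#               are callee-saved.

def callee_saved_simd(abi: str) -> set:
    if abi == "sysv":
        return set()  # all ZMM registers are caller-saved (scratch)
    if abi == "windows":
        return {f"XMM{i}" for i in range(6, 16)}
    raise ValueError(f"unknown ABI: {abi}")
```

This is exactly why a Windows-ABI MAYCHANGE clause cannot silently include the whole SIMD state: the set on the "windows" side is nonempty.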
Contributor
Thanks for starting this! I will try and work on it over my Christmas and NY break.
This enables general pattern-matching over the VEXM type in the "evaluate" function. (It's a bit ad hoc to do it this way for the different types rather than via general code, but we will probably make more substantial changes here soon.) Taking advantage of this improvement, we rewrite decode_aux with pattern-matching instead of the nested conditional branches for the VEXM part, which seems more natural.
This extends the state considered from just the flags and integer registers (other than RSP) to include YMM0..YMM15 too. It also explicitly adds tests for the new VPXOR instruction which would otherwise be unexercised since it's not yet used in the code.
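As a reference model for such tests, VPXOR on YMM registers is just a 256-bit bitwise XOR, with the result zero-extended on write as described above (a Python sketch, not the formal specification):

```python
# Reference model of VPXOR ymm, ymm, ymm: bitwise XOR of the low
# 256 bits. Because VEX-encoded writes zero-extend, any bits of the
# destination above bit 255 become zero.

MASK256 = (1 << 256) - 1

def vpxor_ymm(ymm1: int, ymm2: int) -> int:
    return (ymm1 ^ ymm2) & MASK256
```

In particular, XOR-ing a register with itself yields zero, a handy sanity check for any simulator run.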
aqjune-aws approved these changes Jan 15, 2025

aqjune-aws (Collaborator) left a comment
I read the code and the updates looked reasonable to me. Also I ran the simulator more extensively by only leaving the VPXOR cases and running for 20 minutes, and it passed.