-
Notifications
You must be signed in to change notification settings - Fork 808
Move the Intel specific features to a separate document. Mainly: #7307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from 17 commits
Commits
Show all changes
18 commits
Select commit
Hold shift + click to select a range
eaca6bf
-Remove dynamic extent. I kept it in the open questions though of the…
dkhaldi 4bf0775
remove comment
dkhaldi cc99697
Add an open question for adding a constructor with no argument
dkhaldi e673549
address Jack's comments
dkhaldi 687ff28
Address more comments from Jack
dkhaldi 9fb5b05
More cleanup of the non unified parts
dkhaldi cc6f846
Add packed layout and store flexibility to the Intel extension
dkhaldi d92ca02
Rename sycl_ext_oneapi_intel_matrix.asciidoc to sycl_ext_intel_matrix…
dkhaldi b3c3402
Improve the query API by adding XMX of PVC and other improvements spe…
dkhaldi e3ee0ea
reorder Group and layout parameters, make get_wi_data free function a…
dkhaldi f3050a8
Remove Open Questions section that has one item already resolved:
dkhaldi 74217b9
Address Jack's comments
dkhaldi 96aaa2f
Add information about the combinations that the implementation suppor…
dkhaldi e8d1646
format
dkhaldi 88906a2
format
dkhaldi e9f5366
format
dkhaldi 10477a4
Incorporate Jinsong's corrections
dkhaldi 4582fb3
format <= sign in the supported sizes
dkhaldi File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
155 changes: 155 additions & 0 deletions
155
...c/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,155 @@ | ||
| # Additional Intel-only specifics about matrix extension for DPC++ | ||
|
|
||
| :source-highlighter: coderay | ||
| :coderay-linenums-mode: table | ||
| :dpcpp: pass:[DPC++] | ||
|
|
||
| // This section needs to be after the document title. | ||
| :doctype: book | ||
| :toc2: | ||
| :toc: left | ||
| :encoding: utf-8 | ||
| :lang: en | ||
|
|
||
| :blank: pass:[ +] | ||
|
|
||
| // Set the default source code type in this document to C++, | ||
| // for syntax highlighting purposes. This is needed because | ||
| // docbook uses c++ and html5 uses cpp. | ||
| :language: {basebackend@docbook:c++:cpp} | ||
|
|
||
|
|
||
| == Notice | ||
|
|
||
| Copyright (c) 2021-2022 Intel Corporation. All rights reserved. | ||
|
|
||
| NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are | ||
| trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. | ||
| used by permission by Khronos. | ||
|
|
||
| This extension is written against the SYCL 2020 revision 5 specification. All | ||
| references below to the "core SYCL specification" or to section numbers in the | ||
| SYCL specification refer to that revision. | ||
|
|
||
| **_NOTE:_** This document describes the extra features and details for the implementation of `joint_matrix` extension on Intel AMX and Intel XMX. | ||
| This is an initial experimental version to try out functionality | ||
| and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. | ||
|
|
||
| ## Introduction | ||
| The Intel backend implementations on both Intel AMX and Intel XMX support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query interface, as they are defined in the sycl_ext_oneapi_matrix extension. There are additional specifics about the supported layouts that enable extra performance and functionality listed in this document. | ||
| This extension presents some supplementary Intel AMX and Intel XMX features not contained within the sycl_ext_oneapi_matrix extension. The additional features are built on top of the sycl_ext_oneapi_matrix extension but are only supported by the Intel AMX and Intel XMX backends. | ||
|
|
||
| ## Feature test macro | ||
|
|
||
| This extension provides a feature-test macro as described in the core SYCL | ||
| specification section 6.3.3 "Feature test macros". Therefore, an | ||
| implementation supporting this extension must predefine the macro | ||
| `SYCL_EXT_INTEL_MATRIX` to one of the values defined in the table below. | ||
| Applications can test for the existence of this macro to determine if the | ||
| implementation supports this feature, or applications can test the macro's | ||
| value to determine which of the extension's APIs the implementation supports. | ||
|
|
||
| [frame="none",options="header"] | ||
| |====================== | ||
| |Value |Description | ||
| |1 |Introduce `packed` layout and extend `joint_matrix_store` to Matrix A and B. | ||
| |====================== | ||
|
|
||
|
|
||
| ## Extra Functionality | ||
|
|
||
| ### Layout | ||
| Besides row major and column major layouts, `layout` introduces the custom layout packed layout that refers to the VNNI format descibed in the following section. | ||
|
|
||
| ```c++ | ||
| namespace sycl::ext::intel::experimental::matrix { | ||
| enum class layout { | ||
| packed | ||
| }; | ||
| } | ||
| ``` | ||
|
|
||
|
|
||
| ### Layout argument in `joint_matrix_load` | ||
| `layout` in `joint_matrix_load` can take `packed` as argument to specify that the data has already been transformed into VNNI format (`packed`). in this case, `stride` argument of `joint_matrix_load` describes the number of elements between consecutive rows for packed layouts. | ||
|
|
||
| In order to get maximum performance on Intel AMX and Intel XMX, prepacking data in the memory is necessary. If users did not specify the packed layouts, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the `packed` layout to the user to specify that A or B have already been VNNIed. The packed or VNNI layout is introduced in the `VNNI layout` section below. | ||
|
|
||
| IMPORTANT: In the current Intel AMX and Intel XMX implementations, the layout in the load of matrix B (provided by the `layout memL` parameter below) must be `packed` or `row_major`. Automatic VNNI transform is supported on AMX. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C (provided by the `layout memL` parameter below) must also be `row_major`. | ||
|
|
||
| ### Store Operation | ||
| Besides store of matrix `accumulator`, the Intel implementation allows store on matrix `a` and `b` as well. | ||
|
|
||
| #### Store | ||
| ```c++ | ||
| namespace sycl::ext::intel::experimental::matrix { | ||
| template <typename Group, typename T, size_t NumRows, size_t NumCols, | ||
| use Use, layout Layout, access::address_space Space> | ||
| void joint_matrix_store(Group sg, | ||
| joint_matrix<Group, T, Use, NumRows, NumCols, Layout> &res, | ||
| multi_ptr<T, Space, IsDecorated> src, size_t stride); | ||
| } | ||
| ``` | ||
|
|
||
|
|
||
| ## VNNI/Packed Layout | ||
| Intel AMX and Intel XMX compute assumes that the B tile register (src1) is in the VNNI format as they need 32bit of K-data in A and B to be contiguous in memory. | ||
| The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transformation. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed` layout for a 16-bit type. | ||
|
|
||
| #### Example 1: 16-bit elements | ||
| // Example of a 4 row x 4 column matrix using a 16-bit data element, in row-major layout. | ||
| // Element a1 is contiguous in memory with element b1, etc. | ||
| // --------------------------------- | ||
| // a1, b1, c1, d1 | ||
| // a2, b2, c2, d2 | ||
| // a3, b3, c3, d3 | ||
| // a4, b4, c4, d4 | ||
| // --------------------------------- | ||
| // The same matrix reformatted in packed layout. | ||
| // Here, packing of 2 elements is needed to form 32 bits. | ||
| // Element a1 is contiguous in memory with element a2, etc. | ||
| // --------------------------------- | ||
| // a1, a2, b1, b2, c1, c2, d1, d2 | ||
| // a3, a4, b3, b4, c3, c4, d3, d4 | ||
|
|
||
| #### Example 2: 8-bit elements | ||
|
|
||
| // Example of a 4 row x 4 column matrix using a 8-bit data element, in row-major layout. | ||
| // Element a1 is contiguous in memory with element b1, etc. | ||
| // --------------------------------- | ||
| // a1, b1, c1, d1 | ||
| // a2, b2, c2, d2 | ||
| // a3, b3, c3, d3 | ||
| // a4, b4, c4, d4 | ||
| // --------------------------------- | ||
| // The same matrix reformatted in packed layout. | ||
| // Here, packing of 4 elements is needed to form 32 bits. | ||
| // Elements a1, a2, a3, a4 are contiguous in memory, etc. | ||
| // --------------------------------- | ||
| // a1, a2, a3, a4, b1, b2, b3, b4, c1, c2, c3, c4, d1, d2, d3, d4 | ||
|
|
||
| ## Supported Combinations Per Hardware | ||
|
|
||
| The table below provides a list of the combinations that `joint_matrix` implementations support on each of Intel AMX and Intel XMX hardware. Note that these can be returned in a parametrized way using the `tpu_params` query class. | ||
|
|
||
| ### Intel AMX Supported Combinations | ||
|
|
||
| [frame="none",options="header"] | ||
| |====================== | ||
| | A type | B type | Accumulator type | M | N | K | ||
| | (u)int8_t | (u)int8_t | int32_t | `< =` 16 | `< =` 16 | `< =` 64 | ||
| | bf16 | bf16 | fp32 | `< =` 16 | `< =` 16 | `< =` 32 | ||
| |====================== | ||
|
|
||
| ### Intel XMX Supported Combinations | ||
|
|
||
| [frame="none",options="header"] | ||
| |====================== | ||
| | A type | B type | Accumulator type | M | N | K | ||
| | (u)int8_t | (u)int8_t | int32_t | `< =` 8 | 16 | 32 | ||
| | fp16 | fp16 | fp32 | `< =` 8 | 16 | 16 | ||
| | bf16 | bf16 | fp32 | `< =` 8 | 16 | 16 | ||
| |====================== | ||
|
|
||
| ## Open Questions | ||
| - Should the same class, `joint_matrix`, handle both cases where sizes are constant (GPU case) and when sizes are variable (CPU case)? Note that a Intel AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes that can be variable. The ability to define only one interface for both would make it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. In a previous version of the design, we used `sycl::dynamic_extent` to differentiate between static and dynamic sizes. But since this was not implemented at all, we decided to remove it. We can revisit this design choice if this comes up as part of a customer request or if SPIRV matrix extension extends its support to dynamic sizes. | ||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw that you had difficulty with the formatting here. The above should work, and it will look better in the HTML rendering. The
+signs are an inline passthrough, which prevents Asciidoctor from doing special formatting.