diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc
new file mode 100644
index 0000000000000..883c73c655217
--- /dev/null
+++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc
@@ -0,0 +1,155 @@
+# Additional Intel-only specifics about the matrix extension for DPC++
+
+:source-highlighter: coderay
+:coderay-linenums-mode: table
+:dpcpp: pass:[DPC++]
+
+// This section needs to be after the document title.
+:doctype: book
+:toc2:
+:toc: left
+:encoding: utf-8
+:lang: en
+
+:blank: pass:[ +]
+
+// Set the default source code type in this document to C++,
+// for syntax highlighting purposes. This is needed because
+// docbook uses c++ and html5 uses cpp.
+:language: {basebackend@docbook:c++:cpp}
+
+
+== Notice
+
+Copyright (c) 2021-2022 Intel Corporation. All rights reserved.
+
+NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are
+trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc.
+used by permission by Khronos.
+
+This extension is written against the SYCL 2020 revision 5 specification. All
+references below to the "core SYCL specification" or to section numbers in the
+SYCL specification refer to that revision.
+
+**_NOTE:_** This document describes the extra features and details of the implementation of the `joint_matrix` extension on Intel AMX and Intel XMX. This is an initial experimental version to try out functionality and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**.
+
+## Introduction
+The Intel backend implementations on both Intel AMX and Intel XMX support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query interface, as they are defined in the sycl_ext_oneapi_matrix extension. This document presents supplementary Intel AMX and Intel XMX features that are not contained within the sycl_ext_oneapi_matrix extension. These additional features are built on top of the sycl_ext_oneapi_matrix extension but are supported only by the Intel AMX and Intel XMX backends; in particular, the additional supported layouts listed here enable extra performance and functionality.
+
+## Feature test macro
+
+This extension provides a feature-test macro as described in the core SYCL
+specification section 6.3.3 "Feature test macros". Therefore, an
+implementation supporting this extension must predefine the macro
+`SYCL_EXT_INTEL_MATRIX` to one of the values defined in the table below.
+Applications can test for the existence of this macro to determine if the
+implementation supports this feature, or applications can test the macro's
+value to determine which of the extension's APIs the implementation supports.
+
+[frame="none",options="header"]
+|======================
+|Value |Description
+|1 |Introduces the `packed` layout and extends `joint_matrix_store` to matrices A and B.
+|======================
+
+
+## Extra Functionality
+
+### Layout
+Besides the row major and column major layouts, `layout` introduces a custom layout, `packed`, that refers to the VNNI format described in the following section.
+
+```c++
+namespace sycl::ext::intel::experimental::matrix {
+enum class layout {
+  packed
+};
+}
+```
+
+
+### Layout argument in `joint_matrix_load`
+The `layout` argument of `joint_matrix_load` can take `packed` as a value to specify that the data has already been transformed into the VNNI format (`packed`). In this case, the `stride` argument of `joint_matrix_load` describes the number of elements between consecutive rows for the packed layout.
+
+In order to get maximum performance on Intel AMX and Intel XMX, prepacking data in memory is necessary. If the user does not specify the packed layout, the transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the `packed` layout so the user can specify that A or B has already been transformed into the VNNI format. The packed or VNNI layout is introduced in the `VNNI layout` section below.
+
+IMPORTANT: In the current Intel AMX and Intel XMX implementations, the layout in the load of matrix B (provided by the `layout memL` parameter below) must be `packed` or `row_major`. Automatic VNNI transform is supported on Intel AMX. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C (provided by the `layout memL` parameter below) must also be `row_major`.
+
+### Store Operation
+Besides the store of the `accumulator` matrix, the Intel implementation allows stores of matrices `a` and `b` as well.
+
+#### Store
+```c++
+namespace sycl::ext::intel::experimental::matrix {
+  template <typename Group, typename T,
+            sycl::ext::oneapi::experimental::matrix::use Use,
+            size_t Rows, size_t Cols,
+            sycl::ext::oneapi::experimental::matrix::layout Layout,
+            access::address_space Space, access::decorated IsDecorated>
+  void joint_matrix_store(Group sg,
+      sycl::ext::oneapi::experimental::matrix::joint_matrix<Group, T, Use, Rows, Cols, Layout> &res,
+      multi_ptr<T, Space, IsDecorated> src, size_t stride);
+}
+```
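+
+The following is an illustrative usage sketch of storing matrix `a`; the sub-group `sg`, the sizes `tM`, `tK`, and `K`, the index `sg_startx`, and the pointer `memA` are assumptions borrowed from the example kernel in the sycl_ext_oneapi_matrix extension, not part of this extension:
+
+```c++
+using namespace sycl::ext::oneapi::experimental::matrix;
+
+joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA;
+// ... fill or load tA ...
+// Store the A tile back to memory; the stride (here K) is the number of
+// elements between consecutive rows of the row major matrix.
+sycl::ext::intel::experimental::matrix::joint_matrix_store(
+    sg, tA, memA + sg_startx * tM * K, K);
+```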
+
+## VNNI/Packed Layout
+Intel AMX and Intel XMX compute assume that the B tile register (src1) is in the VNNI format, as they need 32 bits of K-data in A and B to be contiguous in memory.
+The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has already been packed by the user for performance reasons, the layout information is needed to inform the implementation about this transformation. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed` layout for a 16-bit type.
+
+#### Example 1: 16-bit elements
+    // Example of a 4 row x 4 column matrix using a 16-bit data element, in row-major layout.
+    // Element a1 is contiguous in memory with element b1, etc.
+    // ---------------------------------
+    // a1, b1, c1, d1
+    // a2, b2, c2, d2
+    // a3, b3, c3, d3
+    // a4, b4, c4, d4
+    // ---------------------------------
+    // The same matrix reformatted in packed layout.
+    // Here, packing of 2 elements is needed to form 32 bits.
+    // Element a1 is contiguous in memory with element a2, etc.
+    // ---------------------------------
+    // a1, a2, b1, b2, c1, c2, d1, d2
+    // a3, a4, b3, b4, c3, c4, d3, d4
+
+#### Example 2: 8-bit elements
+
+    // Example of a 4 row x 4 column matrix using an 8-bit data element, in row-major layout.
+    // Element a1 is contiguous in memory with element b1, etc.
+    // ---------------------------------
+    // a1, b1, c1, d1
+    // a2, b2, c2, d2
+    // a3, b3, c3, d3
+    // a4, b4, c4, d4
+    // ---------------------------------
+    // The same matrix reformatted in packed layout.
+    // Here, packing of 4 elements is needed to form 32 bits.
+    // Elements a1, a2, a3, a4 are contiguous in memory, etc.
+    // ---------------------------------
+    // a1, a2, a3, a4, b1, b2, b3, b4, c1, c2, c3, c4, d1, d2, d3, d4
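+
+The packing itself can be done by the user ahead of time. The helper below is a minimal host-side sketch of such a transform (illustrative only, not part of this extension); `vnni` is the blocking factor, 2 for 16-bit element types and 4 for 8-bit element types:
+
+```c++
+// Packs a row-major matrix src (rows x cols) into VNNI format in dst.
+// Assumes rows is a multiple of vnni.
+template <typename T>
+void vnni_pack(const T *src, T *dst, size_t rows, size_t cols, size_t vnni) {
+  for (size_t r = 0; r < rows; r += vnni)   // each block of vnni rows
+    for (size_t c = 0; c < cols; ++c)       // each column of the block
+      for (size_t v = 0; v < vnni; ++v)     // the vnni elements of one column
+        // become contiguous in the packed output
+        dst[(r / vnni) * cols * vnni + c * vnni + v] = src[(r + v) * cols + c];
+}
+```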
+
+## Supported Combinations Per Hardware
+
+The table below provides a list of the combinations that `joint_matrix` implementations support on each of Intel AMX and Intel XMX hardware. Note that these can be returned in a parametrized way using the `tpu_params` query class.
+
+### Intel AMX Supported Combinations
+
+[frame="none",options="header"]
+|======================
+| A type | B type | Accumulator type | M | N | K
+| (u)int8_t | (u)int8_t | int32_t | +<=+ 16 | +<=+ 16 | +<=+ 64
+| bf16 | bf16 | fp32 | +<=+ 16 | +<=+ 16 | +<=+ 32
+|======================
+
+### Intel XMX Supported Combinations
+
+[frame="none",options="header"]
+|======================
+| A type | B type | Accumulator type | M | N | K
+| (u)int8_t | (u)int8_t | int32_t | +<=+ 8 | 16 | 32
+| fp16 | fp16 | fp32 | +<=+ 8 | 16 | 16
+| bf16 | bf16 | fp32 | +<=+ 8 | 16 | 16
+|======================
+
+## Open Questions
+- Should the same class, `joint_matrix`, handle both cases where sizes are constant (GPU case) and when sizes are variable (CPU case)? Note that an Intel AMX 2d tile register permits variable sizes up to 1024 bytes (16 rows x 64 cols). The ability to define only one interface for both would make it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. In a previous version of the design, we used `sycl::dynamic_extent` to differentiate between static and dynamic sizes. But since this was not implemented at all, we decided to remove it. We can revisit this design choice if this comes up as part of a customer request or if the SPIR-V matrix extension extends its support to dynamic sizes.
diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc
index 5d23485d79cfc..4c7214ab56e7a 100644
--- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc
+++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc
@@ -33,10 +33,10 @@ SYCL specification refer to that revision.
 
 **_NOTE:_** _This document describes the current design and API for the matrix
 extension to {dpcpp}. This is an initial experimental version to try out functionality
-and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX), DPAS and Nvidia(R) Tensor Cores._
+and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (Intel(R) AMX), Intel(R) Xe Matrix Extensions (Intel(R) XMX) and Nvidia(R) Tensor Cores._
 
 ## Introduction
-This document presents an ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: Intel AMX in CPUs, DPAS in Intel GPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefit from the maximum performance these different hardware can offer.
+This document presents ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: Intel AMX in CPUs, Intel XMX in Intel GPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, and IBM Power MMA. All of these hardware units provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefits from the maximum performance these different hardware units can offer.
 
 ## Feature test macro
 
@@ -56,37 +56,31 @@ value to determine which of the extension's APIs the implementation supports.
 
 ## Matrix API Versions
 
-While this document presents the core API that unifies Intel AMX, DPAS, and Nvidia Tensor Cores, the implementations support slightly different versions of the API. For this reason, we introduce a new macro, namely `SYCL_EXT_ONEAPI_MATRIX_VERSION` to distinguish between these different implementations. The goal in the next few months is to get rid of this implementation versioning macro. These are the current values for this macro.
+While this document presents the core API that unifies Intel AMX, Intel XMX, and Nvidia Tensor Cores, the implementations support slightly different versions of the API. For this reason, we introduce a new macro, namely `SYCL_EXT_ONEAPI_MATRIX_VERSION`, to distinguish between these different implementations. The goal in the next few months is to get rid of this implementation versioning macro. These are the current values for this macro.
 
 [frame="none",options="header"]
 |======================
 |Value |Description
-|1 |Initial extension JIT implementation on Intel AMX and DPAS. load, store, mad, fill, piece-wise operations, and the query interface are supported. The old API used for this implementation is detailed in [matrix extension](doc/extensions/deprecated/sycl_ext_oneapi_deprecated_matrix_no_use.asciidoc)
-|2 |JIT implementation on Intel AMX and DPAS. load, store, mad, fill, piece-wise operations, and the query interface are supported
+|1 |Initial extension JIT implementation on Intel AMX and Intel XMX. Load, store, mad, fill, piece-wise operations, and the query interface are supported. The old API used for this implementation is detailed in link:../../deprecated/sycl_ext_oneapi_matrix_no_use.asciidoc[matrix extension]
+|2 |JIT implementation on Intel AMX and Intel XMX. Load, store, mad, fill, piece-wise operations, and the query interface are supported
 |3 |Implementation on Nvidia Tensor Cores
 |======================
 
 ## New `joint_matrix` class
-We introduce a new class called `joint_matrix`. The user needs to specify the type of the elements, shape, the matrix use, the memory layout, and the memory scope of the matrix. This results in the following description:
+We introduce a new class called `joint_matrix`. The user needs to specify the group memory scope, the type of the elements, the shape, the matrix use, and the memory layout of the matrix. This results in the following description:
 
 ```c++
 namespace sycl::ext::oneapi::experimental::matrix {
-template <typename T, use Use, size_t NumRows = sycl::dynamic_extent, size_t NumCols = sycl::dynamic_extent, layout Layout = layout::unused, typename Group = sycl::sub_group>
+template <typename Group, typename T, use Use, size_t Rows, size_t Cols, layout Layout = layout::dynamic>
 struct joint_matrix {
-  joint_matrix(Group g) {}
+  joint_matrix() {}
 };
 }
 ```
 
 IMPORTANT: Matrix layout defaulting to `layout::dynamic` applies only to matrices with `use::accumulator`
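+
+For example, in the following illustrative sketch (the sizes `tM`, `tN`, and `tK` are assumed to be defined elsewhere), matrices A and B must specify a layout, while the accumulator can rely on the `layout::dynamic` default:
+
+```c++
+joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA;
+joint_matrix<sub_group, int8_t, use::b, tK, tN, layout::row_major> tB;
+joint_matrix<sub_group, int32_t, use::accumulator, tM, tN> tC; // layout::dynamic
+```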
 
-#### Shape
-The same class, `joint_matrix`, should handle both cases where sizes are constant (GPU case) and when sizes are variable (CPU case). Note that a Intel AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes. The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes.
-
-IMPORTANT: In the current implementation, only the static extent is supported
-
 #### Use
 Specifying the usage of the matrix: matrix left (A), matrix right (B) or accumulator +(C)+ is required by backend implementations to reason about the layout of the matrix in registers.
 
@@ -100,47 +94,43 @@ enum class use {
 }
 ```
 
+#### Shape
+The shape of a `joint_matrix` refers to its number of rows `Rows` and number of columns `Cols`.
+
 #### Layout
-Besides row major and column major layouts, `layout` is flexible enough to introduce custom layouts such as packed layout.
+This specifies the memory layout, which can be row major or column major.
 
 ```c++
 namespace sycl::ext::oneapi::experimental::matrix {
 enum class layout {
   row_major,
   col_major,
-  packed,
   dynamic
-};
+ };
 }
 ```
 
-#### Memory Scope
-In this experimental API version, we used the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments.
+#### Group Memory Scope
+In this API, we use the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The group scope is added as an additional template parameter.
 
-IMPORTANT: In the current implementation, only the subgroup scope is supported
+IMPORTANT: In the current implementation, only the `sub_group` scope is supported
 
 When the group is a `sycl::sub_group`, a matrix is declared as follows:
 
 ```c++
-joint_matrix<int8_t, use::a, tM, tK> tA(sg);
+joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA;
 ```
 
 ## Matrix Operations and their Execution Scope
 We define three new functions needed to perform the main and common operations on matrices, namely load, store, and the actual multiply and add operation. This set of functions can be easily extended if the matrix hardware implements new features.
 
-The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data is being read/written in a row (`row_major`), column major (`column_major`) fashion, or if the data has already been transformed into VNNI format (`packed`). `stride` describes the number of elements between consecutive rows for row major and packed layouts, or between columns for the column major layout.
-
-Note that in order to get maximum performance on Intel AMX and DPAS, prepacking data in the memory is necessary. If users did not specify the packed layouts, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the `packed` layout to the user to specify that A or B have already been VNNIed. The packed or VNNI layout is introduced in the `VNNI layout` section below.
-
-IMPORTANT: In the current AMX and DPAS implementation, the layout in the load of matrix B (provided by the `layout memL` parameter below) must be `packed` or `row_major`. Automatic VNNI transform is supported on AMX. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C (provided by the `layout memL` parameter below) must also be `row_major`.
-
-Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The `Group` template argument can be a work-group or a subgroup. These functions will be called once by each work item in the group.
+Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The `Group` template argument can be a work-group or a sub-group. These functions will be called once by each work item in the group.
 
 To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these functions are collective operations. The {dpcpp} syntax is the following:
 
-IMPORTANT: In the current implementation, only the subgroup scope is supported.
+IMPORTANT: In the current implementation, only the `sub_group` scope is supported.
 
 #### Load
 ```c++
@@ -148,21 +138,24 @@ namespace sycl::ext::oneapi::experimental::matrix {
   template <typename Group, typename T, size_t NumRows, size_t NumCols,
             access::address_space Space, access::decorated IsDecorated>
   void joint_matrix_load(Group sg,
-                         joint_matrix<T, use::accumulator, NumRows, NumCols> &res,
-                         multi_ptr<T, Space, IsDecorated> src, size_t stride, layout memL);
+                         joint_matrix<Group, T, use::accumulator, NumRows, NumCols, layout::dynamic> &res,
+                         multi_ptr<T, Space, IsDecorated> src, size_t stride, layout Layout);
 
   template <typename Group, typename T, use Use, size_t NumRows, size_t NumCols,
             layout Layout, access::address_space Space, access::decorated IsDecorated>
   void joint_matrix_load(Group sg,
-                         joint_matrix<T, Use, NumRows, NumCols, Layout> &res,
+                         joint_matrix<Group, T, Use, NumRows, NumCols, Layout> &res,
                          multi_ptr<T, Space, IsDecorated> src, size_t stride);
 }
 ```
-`joint_matrix_load` loads data from memory to the 2d tiles/registers of Intel AMX/DPAS.
+`joint_matrix_load` loads data from memory to the 2d tiles/registers of the tensor hardware.
 
 We define two overloads of the load function depending on whether the memory layout was declared as part of the `joint_matrix` type or not.
-The first overload that takes memory layout as an argument is only available for a `joint_matrix` type that was declared with `layout::dynamic`.
-The second overload without a memory layout must not be used with a `joint_matrix` type that was declared with `layout::dynamic`.
+The first overload, which takes the memory layout as an argument, is only available for a `joint_matrix` type that used the default value `layout::dynamic`.
+The second overload, without a memory layout, must not be used with a `joint_matrix` type that used the default value `layout::dynamic`.
+
+The base pointer `src` here determines the starting address of the matrix to be loaded from. `Layout` determines whether the data is being read in a row major (`row_major`) or column major (`col_major`) fashion. `stride` describes the number of elements between consecutive rows for the row major layout, or between columns for the column major layout.
+
 #### Store
 ```c++
@@ -170,22 +163,25 @@ namespace sycl::ext::oneapi::experimental::matrix {
   template <typename Group, typename T, size_t NumRows, size_t NumCols,
             access::address_space Space, access::decorated IsDecorated>
   void joint_matrix_store(Group sg,
-                          joint_matrix<T, use::accumulator, NumRows, NumCols> &res,
-                          multi_ptr<T, Space, IsDecorated> src, size_t stride, layout memL);
+                          joint_matrix<Group, T, use::accumulator, NumRows, NumCols, layout::dynamic> &res,
+                          multi_ptr<T, Space, IsDecorated> dest, size_t stride, layout Layout);
 }
 ```
 This function stores the data in the accumulator matrix from the 2d tiles back to memory.
 
+The base pointer `dest` here determines the starting address of the matrix to be stored. `Layout` determines whether the data is being written in a row major (`row_major`) or column major (`col_major`) fashion. `stride` describes the number of elements between consecutive rows for the row major layout, or between columns for the column major layout.
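+
+The following is an illustrative sketch of the two load overloads (the pointers `memA` and `memC`, the sub-group `sg`, and the sizes are assumed from a surrounding kernel): matrix A bakes its layout into the type, while the accumulator, declared with the default `layout::dynamic`, passes the memory layout at runtime:
+
+```c++
+joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA;
+joint_matrix<sub_group, int32_t, use::accumulator, tM, tN> tC;
+
+// A: the layout is part of the type, so no layout argument is passed.
+joint_matrix_load(sg, tA, memA, K);
+// C: declared with layout::dynamic, so the memory layout is an argument.
+joint_matrix_load(sg, tC, memC, N, layout::row_major);
+```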
+ + #### Multiply and Add ```c++ namespace sycl::ext::oneapi::experimental::matrix { template - joint_matrix joint_matrix_mad(Group sg, - joint_matrix A, - joint_matrix B, - joint_matrix C); + joint_matrix joint_matrix_mad(Group sg, + joint_matrix A, + joint_matrix B, + joint_matrix C); } ``` The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulate the result with `C` and return the result. @@ -198,10 +194,10 @@ The current interface presented above assumes that all the matrices are directly namespace sycl::ext::oneapi::experimental::matrix { template - void joint_matrix_fill(Group sg, joint_matrix &m, Tv v); + void joint_matrix_fill(Group sg, joint_matrix &m, Tv v); } ``` -IMPORTANT: In the current implementation, only the subgroup scope is supported. +IMPORTANT: In the current implementation, only the `sub_group` scope is supported. #### Element Indexing and Piece-Wise Operations ##### Background @@ -226,28 +222,20 @@ Also, note that `get_wi_data` cannot return a fixed size array length because th 1- The main compilation mode of SYCL is JIT compilation and partitioning among WIs is implementation defined. -2- SG size is not generally fixed. - -3- AMX has the flexibility of allowing variable sizes on the matrix (`dynamic_extent`). - -In the case of CUDA backend which is SYCL AOT compiled and SG size = 32 known and fixed, the additional marray capability will be provided. +2- Sub group size is not generally fixed. The code listing below shows a synopsis of these new APIs. ```c++ namespace sycl::ext::oneapi::experimental::matrix { -template -struct joint_matrix { - wi_data get_wi_data(); -}; -template + wi_data get_wi_data(Group sg, joint_matrix Mat); + +template class wi_data { size_t length(); wi_element operator[](size_t i); }; -template class wi_element { @@ -259,58 +247,19 @@ class wi_element { ``` In the following example `wi_data_c` is a reference to the WI owned portion of the joint matrix `matC`. As such `wi_data_c[i] OP rhs` updates the corresponding matrix element in the joint_matrix `matC`. -Vectorization along the subgroup dimension will get enabled automatically to vectorize the contiguous portion of the matrix. +Vectorization along the sub group dimension will get enabled automatically to vectorize the contiguous portion of the matrix. ```c++ -auto wi_data_c = matC.get_wi_data(); +auto wi_data_c = get_wi_data(sg, matC); for (int i = 0; i < wi_data_c.length(); i++) wi_data_c[i] *= alpha; // Note that the indexing here "i" is in the vector owned by a WI, not in the matrix C ``` -IMPORTANT: In the current implementation, only the subgroup scope is supported. +IMPORTANT: In the current implementation, only the `sub_group` scope is supported. IMPORTANT: The WI data to joint matrix mapping coordinates information is not implemented yet. -IMPORTANT: In the Tensor Cores implementation, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the Tensor Cores backend. - -## VNNI/Packed Layout -Intel AMX and DPAS compute assumes that the B tile register (src1) is in the VNNI format as they need 32bit of K-data in A and B to be contiguous in memory. -The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. 
 
-IMPORTANT: In the Tensor Cores implementation, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the Tensor Cores backend.
-
-## VNNI/Packed Layout
-Intel AMX and DPAS compute assumes that the B tile register (src1) is in the VNNI format as they need 32bit of K-data in A and B to be contiguous in memory.
-The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transformation. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed` layout for a 16-bit type.
-
-#### Example 1: 16-bit elements
-    // Example of a 4 row x 4 column matrix using a 16-bit data element, in row-major layout.
-    // Element a1 is contiguous in memory with element b1, etc.
-    // ---------------------------------
-    // a1, b1, c1, d1
-    // a2, b2, c2, d2
-    // a3, b3, c3, d3
-    // a4, b4, c4, d4
-    // ---------------------------------
-    // The same matrix reformatted in packed layout.
-    // Here, packing of 2 elements is needed to form 32 bits.
-    // Element a1 is contiguous in memory with element a2, etc.
-    // ---------------------------------
-    // a1, a2, b1, b2, c1, c2, d1, d2
-    // a3, a4, b3, b4, c3, c4, d3, d4
-
-#### Example 2: 8-bit elements
-
-    // Example of a 4 row x 4 column matrix using a 8-bit data element, in row-major layout.
-    // Element a1 is contiguous in memory with element b1, etc.
-    // ---------------------------------
-    // a1, b1, c1, d1
-    // a2, b2, c2, d2
-    // a3, b3, c3, d3
-    // a4, b4, c4, d4
-    // ---------------------------------
-    // The same matrix reformatted in packed layout.
-    // Here, packing of 4 elements is needed to form 32 bits.
-    // Elements a1, a2, a3, a4 are contiguous in memory, etc.
-    // ---------------------------------
-    // a1, a2, a3, a4, b1, b2, b3, b4, c1, c2, c3, c4, d1, d2, d3, d4
-
-
 ## Example using int8_t type
 ```c++
 using namespace sycl::ext::oneapi::experimental::matrix;
@@ -328,16 +277,16 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item)
    const auto sg_startx = global_idx - item.get_local_id(0);
    const auto sg_starty = global_idy - item.get_local_id(1);
    sub_group sg = item.get_sub_group();
-   joint_matrix<int8_t, use::a, tM, tK> tA(sg);
-   joint_matrix<int8_t, use::b, tK, tN> tB(sg);
-   joint_matrix<int32_t, use::accumulator, tM, tN> tC(sg);
+   joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA;
+   joint_matrix<sub_group, int8_t, use::b, tK, tN, layout::row_major> tB;
+   joint_matrix<sub_group, int32_t, use::accumulator, tM, tN> tC;
    joint_matrix_fill(sg, tC, 0);
-   for (int k = 0; k < K; k += tk) {
+   for (int k = 0; k < K; k += tK) {
      joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K);
      joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN, N);
      tC = joint_matrix_mad(sg, tA, tB, tC);
    }
-   auto wi_data_c = matC.get_wi_data();
+   auto wi_data_c = get_wi_data(sg, tC);
    for (int i = 0; i < wi_data_c.length(); i++)
      wi_data_c[i] *= alpha; // The indexing here "i" is in the vector owned by a WI, not in the matrix C
    joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major);
@@ -345,7 +294,7 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item)
 ```
 
 == Query Interface
-Intel AMX, DPAS and Nvidia TPUs support different sizes and types.
+Intel AMX, Intel XMX and Nvidia TPUs support different sizes and types.
 The query interface is used to validate user code and inform the user about the types, sizes, scope, and layouts supported by the implementation. This also improves development and tuning productivity for both scientists and library developers. The query interface we are proposing here is a compile-time query, so there will be no runtime errors.
 
@@ -355,7 +304,7 @@ The query interface proposed here consists of three functionalities:
 
 - Default values: this provides a default shape if the user does not provide a specific combination. In this case, aliases to the `joint_matrix` type can be used, namely `joint_matrix_a/b/accumulator` where no additional argument is needed. This form happens when the user specifies all template parameters except the sizes of the matrices (`tiles`) M, N, and K.
 
-- General query: the general query interface provides information about sizes, types, static/dynamic, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query returns an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and accumulator. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively, depending on whether the implementation supports a continuous or discrete number of sizes. For example, the Intel AMX implementation supports a continuous number of sizes, so the `max_*` variant is applied and only the maximum number is returned. The DPAS implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using.
+- General query: the general query interface provides information about sizes, types, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and for efficient code generation if used by a library. The general query returns an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and accumulator. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively, depending on whether the implementation supports a continuous or discrete number of sizes. For example, the Intel AMX implementation supports a continuous number of sizes, so the `max_*` variant is applied and only the maximum number is returned. The Intel XMX implementation, on the other hand, supports a discrete list of numbers, so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using; a consumption sketch is shown after this list.
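+
+A sketch of consuming the general query result (illustrative only; it assumes a `num_combinations` member alongside the `combinations` array, which is not spelled out in this document):
+
+```c++
+tpu_params<tpu::amx, int8_t, int8_t, int> params;
+for (int i = 0; i < params.num_combinations; ++i) {
+  if (params.combinations[i].max_msize != 0) {
+    // Continuous sizes (e.g. Intel AMX): any M up to max_msize is valid.
+  } else {
+    // Discrete sizes (e.g. Intel XMX): msize is one exact supported value.
+  }
+}
+```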
 
 The table below provides a description for each of the member variables and type aliases in the `tpu_params` class and the forms in which they are defined.
 
@@ -371,9 +320,8 @@ The table below provides a description for each of the member variables and type
 |`joint_matrix_a`| validation, default values|type alias for `joint_matrix` for matrix A
 |`joint_matrix_b`| validation, default values| type alias for `joint_matrix` for matrix B
 |`joint_matrix_accumulator`| validation, default values| type alias for `joint_matrix` for matrix accumulator
-|`dynamic_p`| validation, default values, general query| a boolean that indicates whether the implementation supports dynamic sizes (true) or not (false)
-|numtiles| validation, default values, general query|indicates number of tiles in Intel AMX (does not apply to DPAS)
-|scope| validation, default values, general query| indicates the memory and execution scope supported by the TPU implementation
+|numtiles| validation, default values, general query|indicates number of tiles in Intel AMX (does not apply to Intel XMX)
+|scopes| validation, default values, general query| indicates the memory and execution scopes supported by the TPU implementation
 |`combination` | validation, default values, general query|composes the types and sizes of A, B, accumulator matrices allowed in one combination
 |`max_msize`, `max_nsize`, `max_ksize`| validation, default values, general query| if the TPU implementation supports a continuous number of element sizes, each of these members is non-zero, and the TPU implementation supports all element sizes from 1 up to (and including) that number. By contrast, if the TPU implementation supports a discrete number of element sizes, each of these members has the value zero
 |`msize`, `nsize`, `ksize`| validation, default values, general query| if the TPU implementation supports a discrete number of element sizes, each of these members is non-zero, and the value tells one of the supported element sizes. By contrast, if the TPU supports a continuous number of element sizes, each of these members has the value zero
 
@@ -384,9 +332,6 @@ The table below provides a description for each of the member variables and type
 
-
-
-
 ```c++
 namespace sycl::ext::oneapi::experimental::matrix {
 template <tpu u, typename Ta = void, typename Tb = void, typename Tc = void, int sM = 0, int sN = 0, int sK = 0, typename Enabled = void>
 struct tpu_params;
 
 // Validation form: Valid or not
 // Specialization when both types and sizes are given
-template <typename Ta, typename Tb, typename Tc, size_t sM, size_t sN, size_t sK>
+template <typename Ta, typename Tb, typename Tc, int sM, int sN, int sK>
 struct tpu_params<
     tpu::amx, Ta, Tb, Tc, sM, sN, sK,
     typename std::enable_if<(
@@ -420,17 +365,16 @@ struct tpu_params<
   static constexpr std::size_t K = (sK != 0) ? sK : ((sizeof(Ta) == 1) ? 64 : 32);
 
-  template <layout Layout, typename Group = sycl::sub_group>
-  using joint_matrix_a = joint_matrix<Ta, use::a, M, K, Layout, Group>;
-  template <layout Layout, typename Group = sycl::sub_group>
-  using joint_matrix_b = joint_matrix<Tb, use::b, K, N, Layout, Group>;
-  template <layout Layout, typename Group = sycl::sub_group>
-  using joint_matrix_accumulator = joint_matrix<Tc, use::accumulator, M, N, Layout, Group>;
+  template <typename Group, layout Layout>
+  using joint_matrix_a = joint_matrix<Group, Ta, use::a, M, K, Layout>;
+  template <typename Group, layout Layout>
+  using joint_matrix_b = joint_matrix<Group, Tb, use::b, K, N, Layout>;
+  template <typename Group>
+  using joint_matrix_accumulator = joint_matrix<Group, Tc, use::accumulator, M, N>;
 
-  static constexpr bool dynamic_p = false; // should be true in future implementations
-  // because Intel AMX hardware supports dynamic sizes
   static constexpr uint32_t numtiles = 8;
-  static constexpr scope_t scope = scope_t::sub_group;
+  static constexpr scope_t scopes[] = {scope_t::sub_group};
+  static constexpr int num_scopes = sizeof(scopes) / sizeof(scope_t);
   struct combination {
     uint32_t max_msize;
     uint32_t max_nsize;
@@ -470,17 +414,16 @@ struct tpu_params
-  template <layout Layout, typename Group = sycl::sub_group>
-  using joint_matrix_a = joint_matrix<Ta, use::a, M, K, Layout, Group>;
-  template <layout Layout, typename Group = sycl::sub_group>
-  using joint_matrix_b = joint_matrix<Tb, use::b, K, N, Layout, Group>;
-  template <layout Layout, typename Group = sycl::sub_group>
-  using joint_matrix_accumulator = joint_matrix<Tc, use::accumulator, M, N, Layout, Group>;
+  template <typename Group, layout Layout>
+  using joint_matrix_a = joint_matrix<Group, Ta, use::a, M, K, Layout>;
+  template <typename Group, layout Layout>
+  using joint_matrix_b = joint_matrix<Group, Tb, use::b, K, N, Layout>;
+  template <typename Group>
+  using joint_matrix_accumulator = joint_matrix<Group, Tc, use::accumulator, M, N>;
 
-  static constexpr bool dynamic_p = false; // should be true in future implementations because
-  // Intel AMX hardware supports dynamic sizes
   static constexpr uint32_t numtiles = 8;
-  static constexpr scope_t scope = scope_t::sub_group;
+  static constexpr scope_t scopes[] = {scope_t::sub_group};
+  static constexpr int num_scopes = sizeof(scopes) / sizeof(scope_t);
   struct combination {
     uint32_t max_msize;
     uint32_t max_nsize;
@@ -504,10 +447,9 @@ struct tpu_params
 struct tpu_params<tpu::amx, Ta, Tb, Tc> {
-  static constexpr bool dynamic_p = false; // should be true in future implementations because
-  // Intel AMX hardware supports dynamic sizes
   static constexpr uint32_t numtiles = 8;
-  static constexpr scope_t scope = scope_t::sub_group;
+  static constexpr scope_t scopes[] = {scope_t::sub_group};
+  static constexpr int num_scopes = sizeof(scopes) / sizeof(scope_t);
   struct combination {
     uint32_t max_msize;
     uint32_t max_nsize;
@@ -532,14 +474,15 @@ struct tpu_params<tpu::amx, Ta, Tb, Tc> {
 
 enum class tpu {
-  dpas,
+  xmx8,
+  xmx16,
   amx
 };
 
 enum class matrix_type {
   bf16,
   fp16,
-  fp19, // tfloat32
+  tf32,
   fp32,
   fp64,
   sint2,
@@ -568,22 +511,22 @@ enum class scope_t {
 
 ```c++
 // User can provide sizes besides the types and tpu_params can assert if they are supported or not
 // in this case, an assertion will happen as 16 is not a supported size for M
-using myparams = tpu_params<tpu::dpas, int8_t, int8_t, int, 16, 16, 32>;
+using myparams = tpu_params<tpu::xmx8, int8_t, int8_t, int, 16, 16, 32>;
 size_t NDRangeM = M / myparams::M; //Assertion would happen at this line
 size_t NDRangeN = N / myparams::N;
 ```
 
 === Default Values Example:
 ```c++
-using myparams = tpu_params_both<int8_t, int8_t, int>;
+using myparams = tpu_params_both<tpu::amx, int8_t, int8_t, int>;
 // use this to construct the ranges on the host side
 size_t NDRangeM = M / myparams::M;
 size_t NDRangeN = N / myparams::N;
 //if M, N, K do not multiply the default sizes, padding has to be done
 // device code: the matrices are constructed using the default dimensions
-myparams::joint_matrix_a<layout::row_major> sub_a(sg);
-myparams::joint_matrix_b<layout::row_major> sub_b(sg);
-myparams::joint_matrix_accumulator<> sub_c(sg);
+myparams::joint_matrix_a<sub_group, layout::row_major> sub_a;
+myparams::joint_matrix_b<sub_group, layout::row_major> sub_b;
+myparams::joint_matrix_accumulator<sub_group> sub_c;
 ```
 
@@ -591,15 +534,15 @@ myparams::joint_matrix_accumulator sub_c(sg);
 ```c++
 constexpr int M = 1500; // with msize = 8 and msize = 4,
 // M can be broken up to 125 sequence of 8-sized ops and remaining 500 using 125 sequence of 4-sized ops
-tpu_params<tpu::dpas, int8_t, int8_t, int> params;
+tpu_params<tpu::xmx8, int8_t, int8_t, int> params;
 constexpr int msize = break_dimension(params, M);
 constexpr int msize_remainder = break_dimension_remainder(params, M);
 constexpr int nsize = params.combinations[0].nsize;
 constexpr int ksize = params.combinations[0].ksize;
 // device code:
-joint_matrix<int8_t, use::a, msize, ksize> sub_a(sg);
-joint_matrix<int8_t, use::b, ksize, nsize> sub_b(sg);
-joint_matrix<int, use::accumulator, msize, nsize> sub_c(sg);
+joint_matrix<sub_group, int8_t, use::a, msize, ksize, layout::row_major> sub_a;
+joint_matrix<sub_group, int8_t, use::b, ksize, nsize, layout::row_major> sub_b;
+joint_matrix<sub_group, int, use::accumulator, msize, nsize> sub_c;
 //Remainder handling
 ```
 
@@ -619,8 +562,8 @@ The indexing provided inside the `wi_data` class accesses only the portion of the
 
 Within the joint matrix extension, we want to write, as much as possible, one code to run on different backends. If backend X states that a WI owns one exact row of the matrix for instance, writing the following code will work only on that backend for that version of hardware. If a different hardware and implementation is used, the same WI may own only half of the row if, for example, the SG size increased.
 
 ```c++
-auto data = C.get_wi_data();
-for (int i = 0; i < length; ++i) {
+auto data = get_wi_data(sg, C);
+for (int i = 0; i < data.length(); ++i) {
   sum_of_local_rows[row] += data[i];
 }
 ```
 
@@ -629,26 +572,16 @@ We want to keep backward compatibility in the joint matrix code when implementat
 So for the mapping problem, since this mapping is implementation-defined, one of the proposals is to add runtime functions like:
 
 ```c++
-auto data = C.get_wi_data();
-for (int i = 0; i < length; ++i) {
+auto data = get_wi_data(sg, C);
+for (int i = 0; i < data.length(); ++i) {
   auto [row, col] = data[i].get_coord();
   sum_of_local_rows[row] += data[i];
 }
 ```
 
-
-## Open Questions
-- Ronan Keryell: "It would be interesting to investigate whether providing also member functions would simplify the API. Provide both so it is possible to use the best one for each use case, while waiting for https://en.wikipedia.org/wiki/Uniform_Function_Call_Syntax to land into C++?"
-
-- In the future looking APIs, `get_wi_data` (that is currently under design) returns an owned object. Should this return a view object to make sure the original matrix C is changed after its slices are modified.
-
-- `dynamic_extent` on the shape of `joint_matrix` is only available on Intel AMX. Should this be part of the API?
-
-- This document still contains non-portable code between Intel AMX and DPAS, and Nvidia Tensor Cores such as: packed layout and dynamic_extent. Currently, these restrictions are explained in the spec text. But we might decide to move these to a separate Intel-specific additional matrix API document.
-
 ## TODO List
 - Add WI data to joint matrix mapping coordinates information for piece-wise operations. This will be added as part of the query or new methods to the 'get_wi_data' class.
-- Change the type of `scope` in the query interface to be able to return more than one value. This will be useful in the event we support other scopes like workgroup besides subgroups
-- Add a more realistic and complete example that shows the value of the general query
+- Add a more realistic and complete example that shows the value of the general query.
 
 ## Revision History
 
@@ -660,4 +593,5 @@ for (int i = 0; i < length; ++i) {
 |2 |2021-10-05 |Dounia Khaldi |JIT implementation on both Intel AMX and DPAS
 |3 |2022-05-16 |Dounia Khaldi |Add matrix fill and piece-wise operations support
 |4 |2022-08-25 |Dounia Khaldi |Update the matrix spec by adding the new matrix use parameter and remove reference to the AOT AMX initial implementation
+|5 |2022-11-07 |Dounia Khaldi |Update the matrix spec by making it portable across Intel AMX, Intel XMX and Nvidia Tensor Cores, and move the Intel-specific details to a separate extension document.
 |======================