diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc index 71affb2cbb255..96bc9e85c7adf 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc @@ -33,11 +33,20 @@ SYCL specification refer to that revision. **_NOTE:_** _This document describes the current design and API for the matrix extension to {dpcpp}. This is an initial experimental version to try out functionality -and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX) and DPAS. We are going to work with the community on incrementally improving +and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support for the matrix interface on Intel(R) Advanced Matrix Extensions (AMX), DPAS, and Nvidia® Tensor Cores. We are going to work with the community on incrementally improving the API to bring them closer to standard C++ (aligned with the `std::mdspan` and `std::mdarray` proposals) and SYCL in the next several months._ ## Introduction -This document presents an ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: Intel AMX in CPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefit from the maximum performance these different hardware can offer. +This document presents ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: Intel AMX in CPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia Tensor Cores, IBM Power MMA. All of these provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefits from the maximum performance these different architectures can offer. + +IMPORTANT: When compiling for Nvidia® GPUs, the matrix extension requires the GPU architecture version to be specified. + +This is the compilation command needed to enable the Tensor Cores matrix extension for the program "matrix-nvidia.cpp": + +```c++ +clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_80 -DSYCL_EXT_ONEAPI_MATRIX=3 matrix-nvidia.cpp -o output +``` +**_NOTE:_** _--cuda-gpu-arch may be set lower than sm_80 depending on the required matrix operation and whether it is supported by the target architecture._ ## Feature test macro @@ -57,11 +66,11 @@ value to determine which of the extension's APIs the implementation supports. |====================== ## New `joint_matrix` class -We introduce a new class called `joint_matrix`. The user needs to specify the type of the elements, shape, the memory layout, and the memory scope of the matrix. This results into the following description: +We introduce a new class called `joint_matrix`. The user needs to specify the type of the elements, shape, the memory layout, and the memory scope of the matrix.
This results in the following description: ```c++ namespace sycl::ext::oneapi::experimental::matrix { -template struct joint_matrix { joint_matrix(Group g) {} @@ -69,9 +78,17 @@ struct joint_matrix { } ``` +#### `joint_matrix` precision types + +For data types `b1` (CUDA backend only) and `tf32` the number of bits required for the correct precision of each element of the matrix is lower than the size of the storage type of the matrix. +For such cases it is necessary to instantiate a `joint_matrix` using the correct precision type, e.g. + +```c++ +joint_matrix tA(sg); +``` #### Memory Scope -In this experimental API version, we used the terminology of `joint_matrix` instead of plain `matrix` to emphasis that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments. +In this experimental API version, we used the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments. IMPORTANT: In the current implementation, only the subgroup scope is supported @@ -82,13 +99,13 @@ joint_matrix tA(sg); ``` #### Shape -The same class `joint_matrix` should handle both cases where sizes are constant (GPU case) and when sizes are variables (CPU case). Note that a Intel AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes. The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes. +The same class, `joint_matrix`, should handle both cases where sizes are constant (GPU case) and when sizes are variable (CPU case). Note that a Intel AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes. The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes. IMPORTANT: In the current implementation, only the static extent is supported #### Layout -Besides row major and column major layouts, `matrix_layout` is flexible enough to introduce customed layouts such as symmetric or tiled layouts. +Besides row major and column major layouts, `matrix_layout` is flexible enough to introduce custom layouts such as symmetric or tiled layouts. ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -101,23 +118,23 @@ enum class matrix_layout { } ``` -Intel AMX and DPAS hardware require B matrix to be in VNNI or 32 bits packed layout. If we multiply matrices A (M, K) and B (K, N) into a matrix C (M, N). The logical sizes are M, K, N. However, the packed shape for B tile uses the VNNI format, which is described below. The user must provide the information of packed_b layout to make the implementation allocate the right shape. The layout information for Intel AMX should be specified in user code as follows: +Intel AMX and DPAS hardware require B matrix to be in VNNI or 32 bits packed layout. If we multiply matrices A (M, K) and B (K, N) into a matrix C (M, N). The logical sizes are M, K, N. 
However, the packed shape for B tile uses the VNNI format, which is described below. The user must provide the packed_b layout information so that the implementation allocates the right shape. The layout information for Intel AMX should be specified in user code as follows: ```c++ joint_matrix tB(sg); ``` -IMPORTANT: In the current implementation, only `packed_b` layout is necessary to specify on matrix B, the layout on other matrices is ignored. - +IMPORTANT: In the current AMX and DPAS implementation, it is only necessary to specify the `packed_b` layout for matrix B; the layouts of the other matrices are ignored. +IMPORTANT: In the Nvidia® Tensor Cores implementation, only `row_major` and `col_major` layouts are supported. ## Matrix Operations and their Execution Scope -We define three new functions needed to perform the main and common operations on matrices namely, load, store, and the actual multiply and add operation. This set of functions can be easily extended if the tensor hardware implements new features. +We define three new functions needed to perform the main and common operations on matrices, namely load, store, and the actual multiply and add operation. This set of functions can be easily extended if the matrix hardware implements new features. -The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data are being read/written in a row (`row_major`), column major (`column_major`) fashion, or if the data has already been transformed into VNNI format (`packed_a`, `packed_b`). `stride` describes the number of elements between consecutive rows for row major and packed layout, columns for column major layout. +The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data is being read/written in a row major (`row_major`) or column major (`col_major`) fashion, or if the data has already been transformed into VNNI format (`packed_a`, `packed_b`). `stride` describes the number of elements between consecutive rows for the row major and packed layouts, or between columns for the column major layout. -Note that for getting maximum performance on Intel AMX and DPAS, prepacking data in the memory is necessary. If users did not specify the packed layouts (`packed_a` when matrix `C` is column major, `packed_b` when matrix `C` is row major), transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below. +Note that in order to get maximum performance on Intel AMX and DPAS, prepacking data in memory is necessary. If users do not specify the packed layouts (`packed_a` when matrix `C` is column major, `packed_b` when matrix `C` is row major), transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the layouts `packed_a` and `packed_b` so that the user can specify that A or B have already been VNNIed. The packed or VNNI layout is introduced in the `VNNI layout` section below. -IMPORTANT: In the current implementation, the layout in the load of matrix B must be `packed_b`. Therefore, both the template parameter for the declaration of the B matrix and the call to `joint_matrix_load` for the B matrix must specify the `packed_b` layout.
The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`. +IMPORTANT: In the current AMX and DPAS implementation, the layout in the load of matrix B must be `packed_b`. Therefore, both the template parameter for the declaration of the B matrix and the call to `joint_matrix_load` for the B matrix must specify the `packed_b` layout. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`. Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The `Group` template argument can be a work-group or a subgroup. These functions will be called once by each work item in the group. @@ -164,9 +181,83 @@ namespace sycl::ext::oneapi::experimental::matrix { ``` The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulate the result with `C` and return the result. +#### Binary Multiply and Add - `joint_matrix_bmad` (Nvidia® only) + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { +template +joint_matrix +joint_matrix_bmad( + Group sg, joint_matrix A, + joint_matrix B, + joint_matrix C, BinaryOperation Op); +} +``` + +Binary Multiply and Add (BMAD) operations replace the usual dot product between a row of matrix A (`matrix_use::a` with shape M by K) and a column of matrix B (`matrix_use::b` with shape K by N). Instead, a sequence of logical operations is performed: the AND or XOR logical operation is applied to the ith bit of a K-bit row of matrix A and the ith bit of a K-bit column of matrix B, to produce a 128-bit intermediate output. +The Population Count (popc) operator then operates on this intermediate output, and the result is added to the (M, N)th element of the matrix C (`matrix_use::accumulator`). Currently only the shape M = 8, N = 8, K = 128 is supported. +An important difference with respect to the `joint_matrix_mad` interface is the addition of the `BinaryOperation Op` parameter. `Op` may be either: + +`sycl::bit_and()` + +or + +`sycl::bit_xor()` + +The A, B, and C `joint_matrix` objects are constructed and loaded/stored in the normal way, using the previously defined `joint_matrix`, `joint_matrix_load`, and `joint_matrix_store` interfaces respectively. +The C matrix must be loaded from an array of 32-bit signed integers, and the single-bit A and B `joint_matrix` structs must be loaded from an array of unsigned 32-bit integers storing the packed binary matrix. +Each element of the array of unsigned 32-bit integers, from which a `joint_matrix` of type `matrix_use::a` or `matrix_use::b` is loaded, should contain 32 elements of a matrix row in packed format. + +IMPORTANT: When using Binary Multiply and Add, a `joint_matrix` of type `matrix_use::a` must have `matrix_layout::row_major`, and a `joint_matrix` of type `matrix_use::b` must have `matrix_layout::col_major`. In both cases the first template parameter of `joint_matrix` must be `precision::b1`. + +IMPORTANT: Binary Multiply and Add operations are an experimental hardware feature and all implementation details are subject to change. + +##### Motivation for BMAD
+ +Single-bit MADs can be used as part of Binarized Neural Networks (BNNs) in the case that *both* the activations *and* the weights are binarized. "Quantizing" a network to form a BNN represents the extreme limit of reducing the precision of the network degrees of freedom in order to gain performance and improve efficiency. +Hubara et al. (I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized Neural Networks, Advances in Neural Information Processing Systems 29 (NIPS 2016)) first demonstrated the utility of an algorithm that could use both binarized activations and weights with backpropagation, by keeping track of real-valued weights which are mapped to the binarized weights. In the backward pass the real-valued weights are updated according to a heuristic named the "Straight Through Estimator", whereby the gradient of the loss function with respect to the real weights is set equal to the gradient of the loss function with respect to the binarized weights. +This implies that the precision of the data type used in the matrix multiplications can be single bit, with the necessary addition of forward and backward element-wise mappings between the binarized and real-valued representations of the matrices. +This could prove a significant advantage for large models, since the computational cost of matrix multiplication scales with the number of elements per dimension, N, as O(N^3) for square matrices, whereas the corresponding element-wise operations scale as O(N^2). +Further algorithms based on this binarized approach have been proposed; see, e.g., Rastegari et al. (M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Computer Vision – ECCV 2016, 525-542), who compare a binarized version of a CNN (using an XNOR binary dot product) with corresponding full-precision models, in terms of both the accuracy and the performance of image classification on the ImageNet data set. + + +For an example of how binary MADs can be leveraged on current Nvidia® hardware, see A. Li and S. Su (IEEE Transactions on Parallel and Distributed Systems, 32(7):1878-1891, 2021). +See this paper for the details of the interfaces used in the Nvidia® CUDA runtime API.
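+
+For illustration, the following host-side sketch shows one possible way to pack a row-major single-bit matrix into the unsigned 32-bit words from which a `precision::b1` `joint_matrix` is loaded, with 32 consecutive elements of a row stored per word. The helper name and the bit ordering within each word are assumptions made for this example only; they are not mandated by this extension:
+
+```c++
+#include <cstdint>
+#include <vector>
+
+// Pack a row-major single-bit matrix (one 0/1 value per byte) into 32-bit
+// words, 32 consecutive row elements per word. Assumes cols is a multiple
+// of 32; the bit order within each word is an assumption for illustration.
+std::vector<uint32_t> pack_b1_row_major(const std::vector<uint8_t> &bits,
+                                        size_t rows, size_t cols) {
+  std::vector<uint32_t> packed((rows * cols) / 32, 0);
+  for (size_t r = 0; r < rows; ++r)
+    for (size_t c = 0; c < cols; ++c)
+      if (bits[r * cols + c] & 1)
+        packed[(r * cols + c) / 32] |= 1u << ((r * cols + c) % 32);
+  return packed;
+}
+```
+
+The packed words are then suitable as source data for `joint_matrix_load` of the single-bit A and B matrices, as in the example below.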
+ +##### Example using bitwise operations with `joint_matrix_bmad` + +```c++ +using namespace sycl::ext::oneapi::experimental::matrix; + +queue q; + q.submit([&](handler &cgh) { + auto accC = bufC.template get_access(cgh); + auto accA = bufA.template get_access(cgh); + auto accB = bufB.template get_access(cgh); + auto accD = bufD.template get_access(cgh); + range<2> LocalRange = {1, N_THREADS_PER_MATRIX_OP}; + range<2> GlobalRange = {Sub_Tiles_M, Sub_Tiles_N * N_THREADS_PER_MATRIX_OP}; + cgh.parallel_for>( + nd_range<2>(GlobalRange, LocalRange), [=](nd_item<2> item) { + sycl::sub_group sg = item.get_sub_group(); + const auto m = item.get_group().get_id()[0]; // row id of current submatrix of BIG C matrix + const auto n = item.get_group().get_id()[1]; // column id of current submatrix of BIG C matrix + joint_matrix sub_a; + joint_matrix sub_b; + joint_matrix sub_c; + joint_matrix_load(sg, sub_c, accC.get_pointer() + (m * M) * Big_N + n * N, Big_N); + for (int k = 0; k < Sub_Tiles_K; k++) { // row/col id of current submatrix of BIG A/B matrices + joint_matrix_load(sg, sub_a, accA.get_pointer() + (k * (K / 32)) + (m * M * (Big_K / 32)), Big_K); + joint_matrix_load(sg, sub_b, accB.get_pointer() + (n * N * (Big_K / 32)) + (k * (K / 32)), Big_K); + sub_c = joint_matrix_bmad(sg, sub_a, sub_b, sub_c, Op); + } + joint_matrix_store(sg, sub_c, accD.get_pointer() + (m * M) * Big_N + n * N, Big_N); + });}); +``` #### Matrix Initialization: `joint_matrix_fill` -The current interface presented above assumes that all the matrices are directly loaded from memory. This new function called `joint_matrix_fill` makes it possible to multiply a matrix which is not directly loaded from memory but rather initialized directly in the register. On Intel AMX, if the initialization constant is zero, this would map to `_tile_zero` intrinsic: +The current interface presented above assumes that all the matrices are directly loaded from memory. This new function called `joint_matrix_fill` makes it possible to multiply a matrix which is not directly loaded from memory but rather initialized directly in the register. On Intel AMX, if the initialization constant is zero, this would map to the `_tile_zero` intrinsic: ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -188,11 +279,11 @@ Class 2- Piece-wise operations where the operation depends on the element index // We explored multiple options to enable this feature in the matrix interface: 1) Allowing non-restrictive element indexing on the matrix elements would result into slow indexing on the GPU, 2) Operator overloading can represent only element-wise operations and not the operations on pieces (row, column, diagonal, etc) of the matrix. 3) Providing specific functions for these piece-wise operations can resolve some of the functions we know of today like the ones involved in quantization but it is not general to any problem that may occur in the future. ##### Explicit conversion with mapping from SIMD to SPMD -The data elements in a joint_matrix are distributed or shared across the work-items in the Group in an implementation-defined way. There is no fixed allocation of matrix elements owned by a `joint_matrix` instance to the WIs comprising the group used to instantiate it. For instance, the matrix is a shared entity among the work items in the case of the AMX backend because the AMX tile that holds the matrix data is a 2d register that is shared among the work items. Therefore the partitioning among the WIs is implementation defined. 
However, it is necessary to allocate WIs to specific elements of the matrix. In order to be able to perform piece-wise operations in a general and efficient way, we provide a conversion function from the joint_matrix domain that is owned by a group of work items to the portion that is owned by each work item. This enables the WI to perform piece-wise operations on the matrix within the SYCL SPMD programming model. +The data elements in a `joint_matrix` are distributed or shared across the work-items in the Group in an implementation-defined way. There is no fixed allocation of matrix elements owned by a `joint_matrix` instance to the WIs comprising the group used to instantiate it. For instance, the matrix is a shared entity among the work items in the case of the AMX backend because the AMX tile that holds the matrix data is a 2d register that is shared among the work items. Therefore the partitioning among the WIs is implementation defined. However, it is necessary to allocate WIs to specific elements of the matrix in order to perform element wise operations. In order to be able to perform element-wise operations in a general and efficient way, we provide a conversion function from the `joint_matrix` domain that is owned by a group of work items to the portion that is owned by each work item. This enables the WI to perform piece-wise operations on the matrix within the SYCL SPMD programming model. -We introduce a new function `get_wi_data` that provides a view of the portion of the matrix that is owned by the current WI. So modifying `wi_data` means also modifying the joint matrix corresponding elements. The indexing provided inside the `wi_data` class acesses only the portion of the current WI and returns `wi_element`. This latter holds a reference to the original joint_matrix that `wi_data` was constructed from. Users can use the `=` operator to update the element of the `joint_matrix` represented by the `wi_element` after the element-wise operation. +We introduce a new function `get_wi_data` that provides a view of the portion of the matrix that is owned by the current WI. The indexing provided inside the `wi_data` class accesses only the portion of the current WI and returns `wi_element`. This latter holds a reference to the original joint_matrix that `wi_data` was constructed from. This means that modifying `wi_data` also modifies the corresponding joint matrix elements. Users can use the `=` operator to update the element of the `joint_matrix` represented by the `wi_element` after the element-wise operation. -Using `get_wi_data`, it is not possible to know which portions of data are owned by each thread in the group as this is implementation defined and change from one backend to the other. For general piece-wise operations like sum of rows of a matrix, the WI data to joint matrix mapping coordinates information must be known to reason about the matrix view and extract the relevant piece. But for element-wise operations where the same operation is performed on all the elements of the matrix, having all the WIs in the group apply the operation inside a loop iterating over the `length` of `wi_data` guarantees the whole matrix element-wise operation. +Using `get_wi_data`, it is not possible to know which portions of data are owned by each thread in the group as this is implementation defined and changes from one backend to the other. 
For general piece-wise operations, such as summing the rows of a matrix, the WI data to joint matrix mapping must be known in order to reason about the matrix view and extract the relevant piece. However, for element-wise operations where the same operation is performed on all the elements of the matrix, having all the WIs in the group apply the operation inside a loop iterating over the `length` of `wi_data` guarantees the whole matrix element-wise operation. Therefore, this extension currently only supports class 1 of operations because the mapping between `get_wi_data` and `joint_matrix` elements is not required to be known for these operations. However, general piece-wise operations will be supported in the future as a new API will be provided to convey the mapping from `joint_matrix` domain to WI Domain (See Section "WI data to joint matrix mapping coordinates information for piece-wise operations for more information"). @@ -200,11 +291,11 @@ Also, note that `get_wi_data` cannot return a fixed size array length because th 1- The main compilation mode of SYCL is JIT compilation and partitioning among WIs is implementation defined. -2- SG size is not fixed (like in the CUDA backend where warp size is always 32). +2- SG size is not generally fixed. 3- AMX has the flexibility of allowing variable sizes on the matrix (`dynamic_extent`). -In the case of CUDA backend which is SYCL AOT compiled and SG size = 32 known and fixed, the additional marray capability will be provided. +IMPORTANT: In the CUDA backend it is possible to know how many elements are owned by each WI at compile time, and the SG size is known and fixed at 32. In this case a second, optional interface for performing element-wise operations is provided in the next section; it allows for more concise source code and in some cases performance benefits, but at the expense of not being portable to other backends. The code listing below shows a synopsis of these new APIs. @@ -242,12 +333,26 @@ for (int i = 0; i < wi_data_c.length(); i++) wi_data_c[i] *= alpha; // Note that the indexing here "i" is in the vector owned by a WI, not in the matrix C ``` -IMPORTANT: In the current implementation, only the subgroup scope is supported. +IMPORTANT: In the current implementation, only the subgroup scope is supported. -IMPORTANT: The WI data to joint matrix mapping coordinates information is not implemented yet. +IMPORTANT: The WI data to joint matrix mapping coordinates information is not implemented yet. -IMPORTANT: Since the current tensorcores implementation is AOT, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the tensorcores AOT backend. +##### `wi_data` as an `marray` for Nvidia® Tensor Cores + +In the Nvidia® Tensor Cores case it is possible to know how many elements are owned by each WI at compile time. Due to this, in the CUDA backend we introduce an `marray` data member, `joint_matrix.wi_marray`, to the `joint_matrix` struct; it represents the portion of the `joint_matrix` owned by each local work item. This enables Class 1 operations to be performed on the `joint_matrix` without requiring a loop. In addition, some math functions are optimized for `marray` using vectorized instructions.
An example is the `fma` SYCL math function, whose usage within the `joint_matrix` context is given in the following code snippet: + +```c++ +joint_matrix sub_a; +joint_matrix sub_b; +joint_matrix sub_c; +joint_matrix sub_d; +joint_matrix_fill(sg, sub_a, -1); +joint_matrix_fill(sg, sub_b, -1); +joint_matrix_fill(sg, sub_c, -1); +sub_d.wi_marray = fma(sub_a.wi_marray, sub_b.wi_marray, sub_c.wi_marray); +``` +IMPORTANT: `wi_marray` is not available for `precision::b1`. ## VNNI/Packed Layout Intel AMX and DPAS compute assumes register for B tile (src1) to be in VNNI format as they need 32bit of K-data in A and B to be contiguous in memory.
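+
+As an illustration, the following host-side sketch (an assumption made for this example; the exact packed shapes are described in the remainder of this section) shows how a row-major K x N matrix of 16-bit elements such as `bfloat16` could be rearranged into the VNNI layout, where pairs of consecutive rows are interleaved so that 32 bits of K-data become contiguous in memory:
+
+```c++
+#include <cstdint>
+#include <vector>
+
+// VNNI-pack a row-major K x N matrix of 16-bit elements: element (k, n) of
+// the logical matrix is stored at row k/2, column 2*n + k%2 of the packed
+// (K/2) x (2*N) matrix, so two consecutive K-values share a 32-bit word.
+// K is assumed to be even.
+std::vector<uint16_t> vnni_pack_16bit(const std::vector<uint16_t> &b,
+                                      size_t K, size_t N) {
+  std::vector<uint16_t> packed(K * N);
+  for (size_t k = 0; k < K; ++k)
+    for (size_t n = 0; n < N; ++n)
+      packed[(k / 2) * (2 * N) + 2 * n + (k % 2)] = b[k * N + n];
+  return packed;
+}
+```
+
+For 8-bit element types the interleaving factor is 4 rather than 2, so that four consecutive K-values occupy one 32-bit word.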