From ce12ec028681aa90133c518126014b0881d9e6bc Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Tue, 5 Oct 2021 09:40:33 -0500 Subject: [PATCH 1/6] [SYCL][Doc][matrix] Update the experimental matrix interface that matches the new AMX/DPAS JIT implementation This includes: - Minor changes to the joint_matrix interface - Added the query interface - Update implementation information: Added JIT support for both DPAS and AMX --- .../Matrix/dpcpp-joint-matrix.asciidoc | 374 +++++++++++++++--- 1 file changed, 309 insertions(+), 65 deletions(-) diff --git a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc index f7394df705a6a..0426a8ce6ed89 100644 --- a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc +++ b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc @@ -33,7 +33,7 @@ SYCL specification refer to that revision. **_NOTE:_** _This document describes the current design and API for the matrix extension to {dpcpp}. This is an initial experimental version to try out functionality -and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX). We are going to work with the community on incrementally improving +and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX) and DPAS. We are going to work with the community on incrementally improving the API to bring them closer to standard C++ (aligned with the `std::mdspan` and `std::mdarray` proposals) and SYCL in the next several months._ ## Introduction @@ -53,14 +53,16 @@ value to determine which of the extension's APIs the implementation supports. 
|====================== |Value |Description |1 |Initial extension implementation on AMX. Base features are supported. +|2 |Initial extension JIT implementation on AMX and DPAS. load, store, mad and the query interface are supported |====================== ## New `joint_matrix` class We introduce a new class called `joint_matrix`. The user needs to specify the type of the elements, shape, the memory layout, and the memory scope of the matrix. This results in the following description: ```c++ -namespace sycl::ext::intel::experimental::matrix { -template +namespace sycl::ext::oneapi::experimental::matrix { +template struct joint_matrix { joint_matrix(Group g) {} }; @@ -76,7 +78,7 @@ IMPORTANT: In the current implementation, only the subgroup scope is supported When the group is a `sycl::sub_group`, a matrix is declared as follows: ```c++ -joint_matrix tA(sg); +joint_matrix tA(sg); ``` #### Shape @@ -89,7 +91,7 @@ IMPORTANT: In the current implementation, only the static extent is supported Besides row major and column major layouts, `matrix_layout` is flexible enough to introduce custom layouts such as symmetric or tiled layouts. ```c++ -namespace sycl::ext::intel::experimental::matrix { +namespace sycl::ext::oneapi::experimental::matrix { enum class matrix_layout { row_major, col_major, @@ -99,10 +101,10 @@ enum class matrix_layout { } ``` -AMX hardware requires B matrix to be in VNNI or 32 bits packed layout. If we multiply matrices A (M, K) and B (K, N) into a matrix C (M, N). The logical sizes are M, K, N. However, the packed shape for B tile uses the VNNI format, which is described below. The user must provide the information of packed_b layout to make the implementation allocate the right shape. The layout information for AMX should be specified in user code as follows: +AMX and DPAS hardware require B matrix to be in VNNI or 32 bits packed layout. If we multiply matrices A (M, K) and B (K, N) into a matrix C (M, N). The logical sizes are M, K, N.
However, the packed shape for B tile uses the VNNI format, which is described below. The user must provide the information of packed_b layout to make the implementation allocate the right shape. The layout information for AMX should be specified in user code as follows: ```c++ -joint_matrix tB(sg); +joint_matrix tB(sg); ``` IMPORTANT: In the current implementation, only `packed_b` layout is necessary to specify on matrix B, the layout on other matrices is ignored. @@ -113,7 +115,7 @@ We define three new functions needed to perform the main and common operations o The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data are being read/written in a row (`row_major`), column major (`column_major`) fashion, or if the data has already been transformed into VNNI format (`packed_a`, `packed_b`). `stride` describes the number of elements between consecutive rows for row major and packed layout, columns for column major layout. -Note that for getting maximum performance on AMX, prepacking data in the memory is necessary. If users did not specify the packed layouts (`packed_a` in column major case, `packed_b` in row major case), transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A and/or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below. +Note that for getting maximum performance on AMX and DPAS, prepacking data in the memory is necessary. If users did not specify the packed layouts (`packed_a` in column major case, `packed_b` in row major case), transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A and/or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below. 
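As an illustration of the `stride` semantics for the packed case (a standalone sketch with a hypothetical helper, not part of the extension API): a logical K x N matrix of 8-bit elements packed with a VNNI factor of 4 is viewed as K/4 rows of N*4 elements, so the stride passed to `joint_matrix_load` for such a B matrix is N*4 elements:

```c++
#include <cstddef>

// Hypothetical helper (illustration only): shape of the packed_b view of a
// logical K x N matrix. The VNNI factor is 4 for 8-bit element types and
// 2 for 16-bit element types.
struct PackedBView {
  std::size_t rows;   // K / factor packed rows
  std::size_t stride; // N * factor elements between consecutive packed rows
};

constexpr PackedBView packed_b_view(std::size_t K, std::size_t N,
                                    std::size_t factor) {
  return {K / factor, N * factor};
}
```

For instance, for an int8 B matrix with K = 64 and N = 16, the packed view has 16 rows and a stride of 64 elements, which matches the `N*4` stride used in the int8 example later in this document.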
IMPORTANT: In the current implementation, the layout in the load of matrix B must be `packed_b`. Therefore, both the template parameter for the declaration of the B matrix and the call to `joint_matrix_load` for the B matrix must specify the `packed_b` layout. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`. @@ -121,29 +123,29 @@ Since the matrix functions are group operations (as defined in Section 4.17.3 of To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these functions are collective operations. The {dpcpp} syntax is the following: -IMPORTANT: In the current implementation, only the subgroup scope is supported. Moreover, a kernel using this extension must be decorated with the [[sycl::reqd_sub_group_size(1)]] attribute. +IMPORTANT: In the current implementation, only the subgroup scope is supported. #### Load ```c++ -namespace sycl::ext::intel::experimental::matrix { +namespace sycl::ext::oneapi::experimental::matrix { template - void joint_matrix_load(Group sg, joint_matrix &res, - multi_ptr src, size_t stride, matrix_layout layout = matrix_layout::row_major); + void joint_matrix_load(Group sg, joint_matrix &res, + multi_ptr src, size_t stride, matrix_layout memL); } ``` -This function loads data from memory to the 2d tiles of AMX that is a 2d storage. +This function loads data from memory to the 2d tiles/registers of AMX/DPAS. #### Store ```c++ -namespace sycl::ext::intel::experimental::matrix { +namespace sycl::ext::oneapi::experimental::matrix { template - void joint_matrix_store(Group sg, joint_matrix &res, - multi_ptr src, size_t stride, matrix_layout layout = matrix_layout::row_major); + void joint_matrix_store(Group sg, joint_matrix &res, + multi_ptr src, size_t stride, matrix_layout memL); } ``` This function stores the data from the 2d tiles back to memory. 
@@ -151,20 +153,20 @@ This function stores the data from the 2d tiles back to memory. #### Multiply and Add ```c++ -namespace sycl::ext::intel::experimental::matrix { - template - joint_matrix joint_matrix_mad(Group sg, joint_matrix A, - joint_matrix B, joint_matrix C); +namespace sycl::ext::oneapi::experimental::matrix { + template + joint_matrix joint_matrix_mad(Group sg, joint_matrix A, + joint_matrix B, joint_matrix C); } ``` The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulates the result with `C`, and returns the result. ## VNNI/Packed Layout -AMX compute assumes register for B tile (src1) to be in VNNI format as they need 32bit of K-data in A and B to be contiguous in memory. +AMX and DPAS compute assumes register for B tile (src1) to be in VNNI format as they need 32bit of K-data in A and B to be contiguous in memory. The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transform. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed_b` layout for a 16-bit type.
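Written as plain code, the transform is a simple re-indexing. The helper below is an illustrative sketch (not part of the extension): for a K x N matrix with VNNI blocking factor `factor` (2 for 16-bit types, 4 for 8-bit types), source element (k, n) lands at packed row k/factor, column n*factor + k%factor:

```c++
#include <cstddef>
#include <vector>

// Illustrative sketch only: pack a row_major K x N matrix into the packed_b
// (VNNI) layout. The packed view has K/factor rows of N*factor elements;
// source element (k, n) lands at packed position
// (k / factor, n * factor + k % factor).
template <typename T>
std::vector<T> pack_b_vnni(const std::vector<T> &src, std::size_t K,
                           std::size_t N, std::size_t factor) {
  std::vector<T> packed(K * N);
  for (std::size_t k = 0; k < K; ++k)
    for (std::size_t n = 0; n < N; ++n)
      packed[(k / factor) * (N * factor) + n * factor + (k % factor)] =
          src[k * N + n];
  return packed;
}
```

The 16-bit example that follows corresponds to `factor = 2`: consecutive pairs of rows are interleaved column by column.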
#### Example 1: 16-bit elements @@ -202,11 +204,10 @@ The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the ca ## Example using int8_t type ```c++ -using namespace sycl::ext::intel::experimental::matrix; +using namespace sycl::ext::oneapi::experimental::matrix; queue q; -range<2> G = {M, N}; -// For this first implementation, SG_SIZE has to be equal to one +range<2> G = {M/tM, N}; range<2> L = {1, SG_SIZE}; int8_t *memA = malloc_shared(M*K, q); int8_t *memB = malloc_shared(K*N, q); @@ -219,66 +220,309 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) const auto sg_startx = global_idx - item.get_local_id(0); const auto sg_starty = global_idy - item.get_local_id(1); sub_group sg = item.get_sub_group(); - joint_matrix tA(sg); + joint_matrix tA(sg); // For B, since current implementation does not support non packed layout, - // users need to specify the updated VNNI sizes along with the packed_b layout - joint_matrix tB(sg); - joint_matrix tC(sg); - joint_matrix_load(sg, tC, memC + sg_startx * tM * N + sg_starty, N, matrix_layout::row_major); + // users need to specify the packed_b layout + joint_matrix tB(sg); + joint_matrix tC(sg); + joint_matrix_load(sg, tC, memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, matrix_layout::row_major); for (int k = 0; k < K; k += tk) { joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K, matrix_layout::row_major); - joint_matrix_load(sg, tB, memB + k * N + sg_starty, N, matrix_layout::packed_b); // VNNI + joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN*4, N*4, matrix_layout::packed_b); // VNNI tC = joint_matrix_mad(sg, tA, tB, tC); } - joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty, N, matrix_layout::row_major); + joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, matrix_layout::row_major); }).wait(); - ``` -## Implementation Status -For oneAPI release 3, an AOT implementation is available on the CPU device to targets AMX hardware. 
we are using AMX tile intrinsics to implement the matrix load and store operations. Since we are currently emitting AMX intrinsics directly, this only enables AOT compilation. -Currently, this is the compilation command line needed to invoke AMX unit of Sapphire Rapids CPU: +== Query Interface +AMX, DPAS and Nvidia TPUs support different sizes and types. +The query interface is used to validate user code and inform them about supported types, sizes, scope, and layouts by the implementation. +This also offers development and tuning productivity by both scientists and library developers. The query interface we are proposing here is a compile-time query, +so there will be no runtime errors. +The query interface proposed here consists of three functionalities: + +- At compile time, inform the user whether a specific combination is valid or not. + +- Construct the matrices using a default shape if user does not provide a combination + +- General query interface for sizes, types, static/dynamic, scope. This is needed to void padding by the user, for tuning, and efficient code generation if used by a library. 
```c++ -clang++ -fsycl -march=sapphirerapids fsycl-targets="spir64_x86_64-unknown-linux" -O2 matmul-int8.cpp -o matmul-int8 -``` +namespace sycl::ext::oneapi::experimental::matrix { + + +template +struct tpu_params; + +// Valid or not: +// Specialization when both types and sizes are given +template +struct tpu_params< + tpu::amx, Ta, Tb, Tc, M, N, K, + typename std::enable_if<( + !std::is_same_v && !std::is_same_v && + !std::is_same_v && M != 0 && N != 0 && K != 0)>::type> { + // Validate that parameters are supported + static_assert( + (M == 0 && N == 0 && K == 0) || + (is_combination_valid_amx(M, N, K)), + "Invalid parameters for AMX, query valid types and maximum sizes " + "using: " + "tpu_params myparams; and then check out myparams.combinations array"); + + + // if combination is valid, construct the matrices + + static constexpr std::size_t defaultM = (M != 0) ? M : 16; + static constexpr std::size_t defaultN = (N != 0) ? N : 16; + static constexpr std::size_t defaultK = + (K != 0) ? K : ((sizeof(Ta) == 1) ? 64 : 32); + + template + using joint_matrix_a = + joint_matrix; + template + using joint_matrix_b = + joint_matrix; + template + using joint_matrix_c = + joint_matrix; + + bool dynamic_p = false; // should be true in future implementations + // because AMX hardware supports dynamic sizes + uint32_t numtiles = 8; + scope_t scope = scope_t::sub_group; +}; -Please refer to the section "Future Implementation Work" that talks about the future unified SPIR-V path that will enable JIT compilation. 
+// Sizes-only query +// Specialization for when only types are given, need to query only sizes +template +struct tpu_params && + !std::is_same_v && + !std::is_same_v)>::type> { + static_assert((are_types_valid_amx()), + "Invalid types for AMX, supported types are int8_t, uint8_t, " + "and bf16 (Note that unsigned short should be used in the" + "DPC++ code to implement bf16) "); + + // construct the matrices using the default sizes + static constexpr std::size_t defaultM = 16; + static constexpr std::size_t defaultN = 16; + static constexpr std::size_t defaultK = ((sizeof(Ta) == 1) ? 64 : 32); + + template + using joint_matrix_a = + joint_matrix; + template + using joint_matrix_b = + joint_matrix; + template + using joint_matrix_c = + joint_matrix; + + bool dynamic_p = false; // should be true in future implementations because + // AMX hardware supports dynamic sizes + uint32_t numtiles = 8; + scope_t scope = scope_t::sub_group; + struct combination { + uint32_t max_msize; + uint32_t max_nsize; + uint32_t max_ksize; + matrix_type atype; + matrix_type btype; + matrix_type ctype; + uint32_t msize; + uint32_t nsize; + uint32_t ksize; + }; + static constexpr combination combinations[] = { + {16, 16, (sizeof(Ta) == 1) ? 
64 : 32}}; + static constexpr int num_combinations = + sizeof(combinations) / sizeof(combination); +}; + +// General query: +// types are not given, no default sizes and no implicit matrix construction +template +struct tpu_params { + static constexpr std::size_t defaultM = -1; // depends on the type + static constexpr std::size_t defaultN = -1; + static constexpr std::size_t defaultK = -1; + + bool dynamic_p = false; // should be true in future implementations because + // AMX hardware supports dynamic sizes + uint32_t numtiles = 8; + scope_t scope = scope_t::sub_group; + struct combination { + uint32_t max_msize; + uint32_t max_nsize; + uint32_t max_ksize; + matrix_type atype; + matrix_type btype; + matrix_type ctype; + uint32_t msize; + uint32_t nsize; + uint32_t ksize; + }; + using mt = matrix_type; + static constexpr combination combinations[] = { + {16, 16, 64, mt::sint8, mt::sint8, mt::sint32}, + {16, 16, 64, mt::sint8, mt::uint8, mt::sint32}, + {16, 16, 64, mt::uint8, mt::sint8, mt::sint32}, + {16, 16, 64, mt::uint8, mt::uint8, mt::sint32}, + {16, 16, 32, mt::bf16, mt::bf16, mt::fp32}}; + static constexpr int num_combinations = + sizeof(combinations) / sizeof(combination); +}; + + +enum class tpu { + dpas, + amx +}; -### Current Implementation Restrictions -This section summarizes the specific features that this implementation supports. In future versions of this API and implementation, the expectation is to provide a query interface to guide the usage of this API. +enum class matrix_type { +bf16, +fp16, +fp19, // tfloat32 +fp32, +fp64, +sint2, +sint4, +sint8, +sint16, +sint32, +sint64, +uint2, +uint4, +uint8, +uint16, +uint32, +uint64 +}; -#### Type, Sizes, and Layouts -The types supported by this AMX implementation are restricted to the types that AMX hardware support. Although the AMX hardware supports 2d tiles with a maximum size of 16x64 bytes, this current implementation can handle any size. 
If the matrix size is bigger than 1024 bytes, it will be stored in memory rather than mapped to a 2d tile. Performance penalty may occur in this case. In order to get the best performance with this implementation, matrix sizes should be no larger than 16x64 bytes and B matrix should be already packed (put in VNNI format). +enum class scope_t { +sub_group, +work_group +}; +} +``` -More specifically, the following operation C = A*B+C can be performed on AMX with this interface where: -A(int8, any-size, row_major), B(int8, any-size, packed_b), C(int32, any-size, row_major) +=== Valid or not Example: +```c++ +// User can provide sizes and tpu_params can assert if they are supported or not +using myparams = tpu_params; +// use this to construct the ranges on the host side +size_t NDRangeM = M / myparams::M; +size_t NDRangeN = N / myparams::N; +// if M,N,K do not multiply the default sizes, padding has to be inserted +// device code: the matrices are constructed using the default dimensions +myparams::joint_matrix_a sub_a(sg); +myparams::joint_matrix_b sub_b(sg); +myparams::joint_matrix_c sub_c(sg); +``` -or +=== Default Values Example: +```c++ +using myparams = tpu_params_both; +// use this to construct the ranges on the host side +size_t NDRangeM = M / myparams::defaultM; +size_t NDRangeN = N / myparams::defaultN; +//if M,N,K do not multiply the default sizes, padding has to be done +// device code: the matrices are constructed using the default dimensions +myparams::joint_matrix_a sub_a(sg); +myparams::joint_matrix_b sub_b(sg); +myparams::joint_matrix_c sub_c(sg); -A(bf16, any-size, row_major), B(bf16, any-size, packed_b), C(float, any-size, row_major). +``` -No other types or layouts are supported at this time. 
+=== General Query Example: +```c++ +M = 1500; // with msize = 8 and msize = 4, + // M can be broken up to 125 sequence of 8-sized ops and remaining 500 using 125 sequence of 4-sized ops +constexpr myparams params = myparams(); +constexpr int msize = break_dimension(params.combinations, M); +constexpr int msize_remainder = break_dimension_remainder(params.combinations, M); +constexpr int nsize = params.combinations[0].nsize; +constexpr int ksize = params.combinations[0].ksize; +// device code: +joint_matrix sub_a(sg); +joint_matrix sub_b(sg); +joint_matrix sub_c(sg); +//Remainder handling +``` -#### Memory and Execution Scope -This current implementation only considers a sub-group scope. However, the sub-group size has to be equal to one in this first implementation. In this case, a kernel using this extension must be decorated with the [[sycl::reqd_sub_group_size(1)]] attribute. +//No need to provide more details in this section because the query interface can serve this. +//## Implementation Status -## Future Implementation Work +//### oneAPI 2022.0 release +//For oneAPI 2022.0 release, a JIT implementation has been made available on both AMX and DPAS hardware of the specific features discussed above. In this case, there is no need to specify any architectural options to the command line. The static query interface can be used to guide the usage of this API. +// The DPAS and AMX implementations support the logical capability support of the HW -### Unified LLVM IR and SPIRV JIT Enabling -To enable JIT compilation, a unified matrix IR needs to be added. Currently, there is no matrix type in LLVM IR or SPIR-V. We are working towards adding a new matrix type in both LLVM IR and SPIR-V. This JIT enabling is expected to be part of a future compiler release. -#### LLVM IR Extension -As a short-term solution, we are extending the https://llvm.org/docs/LangRef.html#llvm-matrix-transpose-intrinsic[existing LLVM IR matrix intrinsics] to include features like VNNI layout. 
The current matrix intrinsics use flattened vectors to represent the matrix. Therefore, we are exploring both adding matrix type to LLVM IR and also using MLIR `vector` dialect for this work. -#### SPIR-V Extension -The current draft proposal can be found https://gitlab.devtools.intel.com/OpenCL/opencl-extension-drafts/-/blob/master/SPV_INTEL_matrix.asciidoc[here]. -We are adding translation from LLVM IR matrix to SPIR-V matrix and vice versa in the LLVM to SPIR-V translator tool.

## Future-looking API

### Matrix Initialization: `joint_matrix_fill`
The current interface presented above assumes that all the matrices are directly loaded from memory. The new function `joint_matrix_fill` makes it possible to multiply a matrix that is not loaded from memory but rather initialized directly in the register. On AMX, if the initialization constant is zero, this would map to the `_tile_zero` intrinsic:

```c++
namespace sycl::ext::oneapi::experimental::matrix {
  template
  void joint_matrix_fill(Group sg, joint_matrix &m, const T& v);
}
```

### Element Indexing and Element-Wise Operations
There were multiple options on how to enable this feature.

#### Option 1: Non-restrictive element indexing
Allowing non-restrictive element indexing on the matrix elements as shown below would result in slow indexing on the GPU. Besides, it would rely heavily on SPIR-V and compiler vectorization:

```c++
matrix C;
for (int i = 0; i < 8; i++)
  for (int j = 0; j < 8; j++)
    C(i,j) *= alpha; //Align with mdspan
```
#### Option 2: Restrictive fast element indexing
In the DPC++ context, the expectation is that all element-wise operations will happen in a converged control path by all work items in the group.
Option 2 proposes a new set of element-wise operations by overloading existing operations to work on the `matrix` object.
An example is shown below:
```c++
joint_matrix C(sg);
C *= alpha;
```
The problem with this option is that it restricts the user to a very limited set of operations.

#### Option 3: Restrictive conversion in the interface from SIMD to SIMT
The Nvidia wmma interface added a new member to the `fragment` class to designate the WI-owned part of the matrix.
While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix. This makes it hard for the user to implement new operations like the sum of the rows of a matrix for quantized algorithms.

#### Proposal: Explicit conversion in the interface from SIMD to SIMT
We introduce a new function `get_wi_slice` that provides any portion of the matrix that the user wants, but in a SIMT array object:

```c++
namespace sycl::ext::oneapi::experimental::matrix {
template
 marray get_wi_slice(joint_matrix &m, size_t row_index,
 size_t col_index, size_t n_rows, size_t n_cols);
}
```

Example where each WI gets one column:
```c++
marray wi_C = get_wi_slice(C, 0, wi_idx, msize, 1);
for (int i = 0; i < msize; i++)
 row_sum += wi_C[i];
```

### Memory scope
The current experimental API uses `joint_` semantics to define the memory scope of the matrix. The long-term solution is to use the proposed https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/LocalMemory/SYCL_INTEL_local_memory.asciidoc[`group_local_memory` extension] to allocate the matrix in local memory associated with a SYCL group as shown in the example below.
@@ -297,9 +541,8 @@ We did not utilize this extension for this matrix API version because sub-group
- What should the API description include: (1) only features that are implemented, (2) features that are actually part of the API: currently implemented and the ones that we expect to implement in the future.
Specifically, should the document include things like dynamic_extent and Group? These are part of the API but are not currently implemented. ## TODO List -- Handle sub group sizes that are bigger than one. -- Add support for queries that gives information about the capabilities of the implementation on a particular device. -- Once the SPIRV translator work is done, this code generation work will move to the backend along enabling JIT compilation. +- Add support for fill matrix and element-wise operations features +- Add 'matrix_use' parameter to the matrix to distinguish between matrix A, B, and matrix accumulator. This is necessary for supporting VNNI and transpose transform ## Revision History @@ -307,4 +550,5 @@ We did not utilize this extension for this matrix API version because sub-group |====================== |Rev |Date |Author |Changes |1 |2021-04-13 |Dounia Khaldi |Initial public working draft. +|2 |2021-10-05 |Dounia Khaldi |JIT implementation on both AMX and DPAS |====================== From c1b7a130c5216b09b88012d5f4bb8ea7ca2dfcf5 Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Thu, 7 Oct 2021 15:08:01 -0500 Subject: [PATCH 2/6] [SYCL][Doc][matrix] Incorporate John's, Jack's, and Greg's reviews, specifically: - John: improve the naming for layouts arguments, fix valid or not example, replace SIMT with SPMD - Jack: clarify packed_a/b layouts, add a comment about ptx 'mma' , fix a typo (void), clarify the default sizes query - Greg: add layout to the parameter list of the joint_matrix type alias in the query, adding types alias, explain difference between msize and max_msize, remove mt alias as it can confuse the reader, fix indentation, fix the general query example, --- .../Matrix/dpcpp-joint-matrix.asciidoc | 172 +++++++++--- 1 file changed, 85 insertions(+), 87 deletions(-) diff --git a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc index 0426a8ce6ed89..00ea67327871c 100644
--- a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc +++ b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc @@ -37,7 +37,7 @@ and performance, and **future versions of this API may change in ways that are i the API to bring them closer to standard C++ (aligned with the `std::mdspan` and `std::mdarray` proposals) and SYCL in the next several months._ ## Introduction -This document presents an ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: AMX in Intel CPU, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefit from the maximum performance these different hardware can offer. +This document presents an ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: Intel AMX in CPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefit from the maximum performance these different hardware can offer. ## Feature test macro @@ -52,8 +52,8 @@ value to determine which of the extension's APIs the implementation supports. [frame="none",options="header"] |====================== |Value |Description -|1 |Initial extension implementation on AMX. Base features are supported. -|2 |Initial extension JIT implementation on AMX and DPAS. load, store, mad and the query interface are supported +|1 |Initial extension implementation on Intel AMX. Base features are supported. +|2 |Initial extension JIT implementation on Intel AMX and DPAS. 
load, store, mad and the query interface are supported |====================== ## New `joint_matrix` class @@ -82,7 +82,7 @@ joint_matrix tA(sg); ``` #### Shape -The same class `joint_matrix` should handle both cases where sizes are constant (GPU case) and when sizes are variables (CPU case). Note that a AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes. The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes. +The same class `joint_matrix` should handle both cases where sizes are constant (GPU case) and when sizes are variables (CPU case). Note that a Intel AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes. The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes. IMPORTANT: In the current implementation, only the static extent is supported @@ -101,7 +101,7 @@ enum class matrix_layout { } ``` -AMX and DPAS hardware require B matrix to be in VNNI or 32 bits packed layout. If we multiply matrices A (M, K) and B (K, N) into a matrix C (M, N). The logical sizes are M, K, N. However, the packed shape for B tile uses the VNNI format, which is described below. The user must provide the information of packed_b layout to make the implementation allocate the right shape. The layout information for AMX should be specified in user code as follows: +Intel AMX and DPAS hardware require B matrix to be in VNNI or 32 bits packed layout. If we multiply matrices A (M, K) and B (K, N) into a matrix C (M, N). The logical sizes are M, K, N. However, the packed shape for B tile uses the VNNI format, which is described below. 
The user must provide the information of packed_b layout to make the implementation allocate the right shape. The layout information for Intel AMX should be specified in user code as follows: ```c++ joint_matrix tB(sg); @@ -115,7 +115,7 @@ We define three new functions needed to perform the main and common operations o The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data are being read/written in a row (`row_major`), column major (`column_major`) fashion, or if the data has already been transformed into VNNI format (`packed_a`, `packed_b`). `stride` describes the number of elements between consecutive rows for row major and packed layout, columns for column major layout. -Note that for getting maximum performance on AMX and DPAS, prepacking data in the memory is necessary. If users did not specify the packed layouts (`packed_a` in column major case, `packed_b` in row major case), transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A and/or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below. +Note that for getting maximum performance on Intel AMX and DPAS, prepacking data in the memory is necessary. If users did not specify the packed layouts (`packed_a` when matrix `C` is column major, `packed_b` when matrix `C` is row major), transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below. IMPORTANT: In the current implementation, the layout in the load of matrix B must be `packed_b`. 
Therefore, both the template parameter for the declaration of the B matrix and the call to `joint_matrix_load` for the B matrix must specify the `packed_b` layout. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`. @@ -129,13 +129,13 @@ IMPORTANT: In the current implementation, only the subgroup scope is supported. ```c++ namespace sycl::ext::oneapi::experimental::matrix { template - void joint_matrix_load(Group sg, joint_matrix &res, - multi_ptr src, size_t stride, matrix_layout memL); + void joint_matrix_load(Group sg, joint_matrix &res, + multi_ptr src, size_t stride, matrix_layout MemLayout); } ``` -This function loads data from memory to the 2d tiles/registers of AMX/DPAS. +This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS. #### Store @@ -166,7 +166,7 @@ The matrix multiply and add function performs the multiply operation on the matr ## VNNI/Packed Layout -AMX and DPAS compute assumes register for B tile (src1) to be in VNNI format as they need 32bit of K-data in A and B to be contiguous in memory. +Intel AMX and DPAS compute assumes the register for the B tile (src1) to be in VNNI format, as they need 32 bits of K-data in A and B to be contiguous in memory. The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has already been packed by the user for performance reasons, the layout information is needed to inform the implementation about this transform. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed_b` layout for a 16-bit type. #### Example 1: 16-bit elements @@ -236,7 +236,7 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) ``` == Query Interface -AMX, DPAS and Nvidia TPUs support different sizes and types. +Intel AMX, DPAS and Nvidia TPUs support different sizes and types.
The query interface is used to validate user code and inform users about the types, sizes, scope, and layouts supported by the implementation. This also improves development and tuning productivity for both scientists and library developers. The query interface we are proposing here is a compile-time query, so there will be no runtime errors. @@ -244,9 +244,9 @@ The query interface proposed here consists of three functionalities: - At compile time, inform the user whether a specific combination is valid or not. -- Construct the matrices using a default shape if user does not provide a combination +- Construct the matrices using a default shape if the user does not provide a combination. This corresponds to the case where the user provides the sizes of large `tile` matrices but does not specify the sizes of the corresponding submatrices of the `tiles`. In this case, the query will construct these submatrices of the matrices whose size the user provided. -- General query interface for sizes, types, static/dynamic, scope. This is needed to void padding by the user, for tuning, and efficient code generation if used by a library. +- General query interface for sizes, types, static/dynamic, scope. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query returns an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively depending on whether the implementation supports a continuous or discrete number of sizes. For example, the Intel AMX implementation supports a continuous number of sizes, so the `max_*` variant is applied and only the maximum number is returned. The DPAS implementation, on the other hand, supports a discrete list of numbers, so the `msize, nsize, ksize` variant is applied.
```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -255,7 +255,7 @@ namespace sycl::ext::oneapi::experimental::matrix { template struct tpu_params; -// Valid or not: +// Valid or Not: // Specialization when both types and sizes are given template struct tpu_params< @@ -267,11 +267,15 @@ struct tpu_params< static_assert( (M == 0 && N == 0 && K == 0) || (is_combination_valid_amx(M, N, K)), - "Invalid parameters for AMX, query valid types and maximum sizes " + "Invalid parameters for Intel AMX, query valid types and maximum sizes " "using: " "tpu_params myparams; and then check out myparams.combinations array"); + using type_a = Ta; // this type alias is not available in the current implementation + using type_b = Tb; // this type alias is not available in the current implementation + using type_c = Tc; // this type alias is not available in the current implementation + // if combination is valid, construct the matrices static constexpr std::size_t defaultM = (M != 0) ? M : 16; @@ -279,18 +283,15 @@ struct tpu_params< static constexpr std::size_t defaultK = (K != 0) ? K : ((sizeof(Ta) == 1) ? 
64 : 32); + template + using joint_matrix_a = joint_matrix; template - using joint_matrix_a = - joint_matrix; - template - using joint_matrix_b = - joint_matrix; + using joint_matrix_b = joint_matrix; template - using joint_matrix_c = - joint_matrix; + using joint_matrix_c = joint_matrix; bool dynamic_p = false; // should be true in future implementations - // because AMX hardware supports dynamic sizes + // because Intel AMX hardware supports dynamic sizes uint32_t numtiles = 8; scope_t scope = scope_t::sub_group; }; @@ -303,27 +304,28 @@ struct tpu_params && !std::is_same_v)>::type> { static_assert((are_types_valid_amx()), - "Invalid types for AMX, supported types are int8_t, uint8_t, " + "Invalid types for Intel AMX, supported types are int8_t, uint8_t, " "and bf16 (Note that unsigned short should be used in the" "DPC++ code to implement bf16) "); - + + using type_a = Ta; // this type alias is not available in the current implementation + using type_b = Tb; // this type alias is not available in the current implementation + using type_c = Tc; // this type alias is not available in the current implementation + // construct the matrices using the default sizes static constexpr std::size_t defaultM = 16; static constexpr std::size_t defaultN = 16; static constexpr std::size_t defaultK = ((sizeof(Ta) == 1) ? 
64 : 32); + template + using joint_matrix_a = joint_matrix; template - using joint_matrix_a = - joint_matrix; - template - using joint_matrix_b = - joint_matrix; + using joint_matrix_b = joint_matrix; template - using joint_matrix_c = - joint_matrix; + using joint_matrix_c = joint_matrix; bool dynamic_p = false; // should be true in future implementations because - // AMX hardware supports dynamic sizes + // Intel AMX hardware supports dynamic sizes uint32_t numtiles = 8; scope_t scope = scope_t::sub_group; struct combination { @@ -352,9 +354,9 @@ struct tpu_params { static constexpr std::size_t defaultK = -1; bool dynamic_p = false; // should be true in future implementations because - // AMX hardware supports dynamic sizes + // Intel AMX hardware supports dynamic sizes uint32_t numtiles = 8; - scope_t scope = scope_t::sub_group; + constscope_t scope = scope_t::sub_group; struct combination { uint32_t max_msize; uint32_t max_nsize; @@ -366,13 +368,13 @@ struct tpu_params { uint32_t nsize; uint32_t ksize; }; - using mt = matrix_type; + static constexpr combination combinations[] = { - {16, 16, 64, mt::sint8, mt::sint8, mt::sint32}, - {16, 16, 64, mt::sint8, mt::uint8, mt::sint32}, - {16, 16, 64, mt::uint8, mt::sint8, mt::sint32}, - {16, 16, 64, mt::uint8, mt::uint8, mt::sint32}, - {16, 16, 32, mt::bf16, mt::bf16, mt::fp32}}; + {16, 16, 64, matrix_type::sint8, matrix_type::sint8, matrix_type::sint32}, + {16, 16, 64, matrix_type::sint8, matrix_type::uint8, matrix_type::sint32}, + {16, 16, 64, matrix_type::uint8, matrix_type::sint8, matrix_type::sint32}, + {16, 16, 64, matrix_type::uint8, matrix_type::uint8, matrix_type::sint32}, + {16, 16, 32, matrix_type::bf16, matrix_type::bf16, matrix_type::fp32}}; static constexpr int num_combinations = sizeof(combinations) / sizeof(combination); }; @@ -384,28 +386,28 @@ enum class tpu { }; enum class matrix_type { -bf16, -fp16, -fp19, // tfloat32 -fp32, -fp64, -sint2, -sint4, -sint8, -sint16, -sint32, -sint64, -uint2, -uint4, 
-uint8, -uint16, -uint32, -uint64 + bf16, + fp16, + fp19, // tfloat32 + fp32, + fp64, + sint2, + sint4, + sint8, + sint16, + sint32, + sint64, + uint2, + uint4, + uint8, + uint16, + uint32, + uint64 }; enum class scope_t { -sub_group, -work_group + sub_group, + work_group }; } ``` @@ -413,16 +415,11 @@ work_group === Valid or not Example: ```c++ -// User can provide sizes and tpu_params can assert if they are supported or not -using myparams = tpu_params; -// use this to construct the ranges on the host side -size_t NDRangeM = M / myparams::M; -size_t NDRangeN = N / myparams::N; -// if M,N,K do not multiply the default sizes, padding has to be inserted -// device code: the matrices are constructed using the default dimensions -myparams::joint_matrix_a sub_a(sg); -myparams::joint_matrix_b sub_b(sg); -myparams::joint_matrix_c sub_c(sg); +// User can provide sizes besides the types and tpu_params can assert if they are supported or not +// in this case, an assertion will happen as 16 is not a supported size for M +using myparams = tpu_params; +size_t NDRangeM = M / myparams::defaultM; //Assertion would happen at this line +size_t NDRangeN = N / myparams::defaultN; ``` === Default Values Example: ```c++ @@ -433,17 +430,17 @@ size_t NDRangeM = M / myparams::defaultM; size_t NDRangeN = N / myparams::defaultN; //if M,N,K do not multiply the default sizes, padding has to be done // device code: the matrices are constructed using the default dimensions -myparams::joint_matrix_a sub_a(sg); -myparams::joint_matrix_b sub_b(sg); -myparams::joint_matrix_c sub_c(sg); +myparams::joint_matrix_a sub_a(sg); +myparams::joint_matrix_b sub_b(sg); +myparams::joint_matrix_c sub_c(sg); ``` === General Query Example: ```c++ -M = 1500; // with msize = 8 and msize = 4, +constexpr int M = 1500; // with msize = 8 and msize = 4, // M can be broken into 125 sequences of 8-sized ops and the remaining 500 into 125 sequences of 4-sized ops -constexpr myparams params = myparams(); +tpu_params params; constexpr
int msize = break_dimension(params.combinations, M); constexpr int msize_remainder = break_dimension_remainder(params.combinations, M); constexpr int nsize = params.combinations[0].nsize; @@ -455,13 +452,13 @@ joint_matrix sub_c(sg); //Remainder handling ``` -//No need to provide more details in this section because the query interface can serve this. +//We don't need to provide more details in this section because the query interface can serve this. //## Implementation Status //### oneAPI 2022.0 release -//For oneAPI 2022.0 release, a JIT implementation has been made available on both AMX and DPAS hardware of the specific features discussed above. In this case, there is no need to specify any architectural options to the command line. The static query interface can be used to guide the usage of this API. -// The DPAS and AMX implementations support the logical capability support of the HW +//For the oneAPI 2022.0 release, a JIT implementation of the specific features discussed above has been made available on both Intel AMX and DPAS hardware. In this case, there is no need to specify any architectural options on the command line. The static query interface can be used to guide the usage of this API. +// The DPAS and Intel AMX implementations support the logical capabilities of the HW @@ -469,7 +466,7 @@ joint_matrix sub_c(sg); ## Future-looking API ### Matrix Initialization: `joint_matrix_fill` -The current interface presented above assumes that all the matrices are directly loaded from memory.
This new function called `joint_matrix_fill` makes it possible to multiply a matrix which is not directly loaded from memory but rather initialized directly in the register. On Intel AMX, if the initialization constant is zero, this would map to the `_tile_zero` intrinsic: ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -480,7 +477,7 @@ namespace sycl::ext::oneapi::experimental::matrix { ``` ### Element Indexing and Element-Wise Operations -There were multiple options on how to enable this feature. +There are multiple options on how to enable this feature. #### Option 1: Non-restrictive element indexing Allowing non-restrictive element indexing on the matrix element as shown below would result in slow indexing on the GPU. @@ -501,12 +498,13 @@ joint_matrix C(sg); ``` The problem with this option is that it is restrictive to a very limited set of operations. -#### Option3: Restrictive conversion in the interface from SIMD to SIMT +#### Option 3: Restrictive conversion in the interface from SIMD to SPMD The Nvidia wmma interface added a new member to the `fragment` class to designate the WI owned part of the matrix. -While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix. This puts restriction on the user to implement new operations like sum of rows of a matrix for quantized algorithms. +While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix. + However, using the `mma` ptx instructions as opposed to the `wmma` ptx instructions, the mapping is known. Knowing this mapping is important for the user to implement new operations like sum of rows of a matrix for quantized algorithms.
-#### proposal: Explicit conversion in the interface from SIMD to SIMT -We introduce a new function `get_wi_slice` that provides any portion of the matrix that the user wants but in a SIMT array object:. +#### Proposal: Explicit conversion in the interface from SIMD to SPMD +We introduce a new function `get_wi_slice` that provides any portion of the matrix that the user wants, but as an SPMD array object: ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -536,7 +534,7 @@ We did not utilize this extension for this matrix API version because sub-group ## Open Questions - Besides row, col major and packed (VNNI) layout, what are the additional layouts that should absolutely be added? -- Are there alternative names for the `packed_a` and `packed_b` layouts that would be clearer to distinguish between the VNNI Layout in matrix A and VNNI layout in matrix B of a matrix multiply and add operation on AMX? +- Are there alternative names for the `packed_a` and `packed_b` layouts that would be clearer to distinguish between the VNNI layout in matrix A and the VNNI layout in matrix B of a matrix multiply and add operation on Intel AMX? - Ronan Keryell: "It would be interesting to investigate whether providing also member functions would simplify the API. Provide both so it is possible to use the best one for each use case, while waiting for https://en.wikipedia.org/wiki/Uniform_Function_Call_Syntax to land into C++?" - What should the API description include: (1) only features that are implemented, (2) features that are actually part of the API: currently implemented and the ones that we expect to implement in the future. Specifically, should the document include things like dynamic_extent and Group? These are part of the API but are not currently implemented.
@@ -550,5 +548,5 @@ We did not utilize this extension for this matrix API version because sub-group |====================== |Rev |Date |Author |Changes |1 |2021-04-13 |Dounia Khaldi |Initial public working draft. -|2 |2021-10-05 |Dounia Khaldi |JIT implementation on both AMX and DPAS +|2 |2021-10-05 |Dounia Khaldi |JIT implementation on both Intel AMX and DPAS |====================== From e8482ac227b594da21422f10d01c90bccbea64b1 Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Fri, 15 Oct 2021 20:20:52 -0500 Subject: [PATCH 3/6] [SYCL][Doc][matrix] Incorporate Greg's review specifically: - Add changing the default sizes in the query to M, N, K to the todo list - Add missing layout to the alias matrices in query - Add comment about the combinations array in the size-only query case - Add the combinations array to the validation case as well, for consistency - Add a table that explains each of the query class members and type aliases - Adjust the order of types and sizes in the combination type --- .../Matrix/dpcpp-joint-matrix.asciidoc | 101 +++++++++++++----- 1 file changed, 76 insertions(+), 25 deletions(-) diff --git a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc index 00ea67327871c..2d797925ea19c 100644 --- a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc +++ b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc @@ -246,7 +246,37 @@ The query interface proposed here consists of three functionalities: - Construct the matrices using a default shape if user does not provide a combination. This corresponds to the case where the user provides the sizes of large `tile` matrices but does not specify the sizes of the corresponding submatrices of the `tiles`. In this case, the query will construct these submatrices of the matrices whose size the user provided. -- General query interface for sizes, types, static/dynamic, scope. 
This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query return an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively depending whether the implementation supports a continuous or discrete number of sizes. For example, Intel AMX implementation supports a continuous number of sizes so the `max_*` variant is applied and only the maximum number is returned. DPAS implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. +- General query interface for sizes, types, static/dynamic, scope. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query return an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively depending whether the implementation supports a continuous or discrete number of sizes. For example, Intel AMX implementation supports a continuous number of sizes so the `max_*` variant is applied and only the maximum number is returned. DPAS implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. + +The table below provides a desciption for each of the member variables and type aliases in `tpu_params` class. 
+ +[frame="none",options="header"] +|====================== +| Member/type alias in `tpu_params` |Description +|`type_a`| type alias for the type of matrix A +|`type_b`| type alias for the type of matrix B +|`type_c`| type alias for the type of matrix C +|`defaultM`| when no sizes are provided by the user, indicates the suggested default size for M; usually this corresponds to the maximum size the implementation supports +|`defaultN`| when no sizes are provided by the user, indicates the suggested default size for N; usually this corresponds to the maximum size the implementation supports +|`defaultK`| when no sizes are provided by the user, indicates the suggested default size for K; usually this corresponds to the maximum size the implementation supports +|`joint_matrix_a`| type alias for `joint_matrix` for matrix A +|`joint_matrix_b`| type alias for `joint_matrix` for matrix B +|`joint_matrix_c`| type alias for `joint_matrix` for matrix C +|`dynamic_p`| a boolean that indicates whether the implementation supports dynamic sizes (true) or not (false) +|numtiles| indicates number of tiles in Intel AMX (does not apply to DPAS) +|scope| indicates the memory and execution scope supported by the TPU implementation +|`combination` | composes the types and sizes of A, B, C matrices allowed in one combination +|`max_msize`, `max_nsize`, `max_ksize`|When one of these members is non-zero, it indicates that the TPU supports all element sizes in the range from 1 up to the given value. By contrast, a zero value indicates that the TPU implementation supports only a discrete set of element sizes, which are given by the corresponding msize, nsize, or ksize members +|`msize`, `nsize`, `ksize`| presents one of the sizes that the TPU implementation supports +|`atype`, `btype`, `ctype`| indicates the types supported in the combination +|`combinations` |Tells the set of supported matrix sizes and types according to the template parameters that are provided. 
In the "general query" form, the user provides only the TPU type, so the combinations array contains all supported tile sizes and element types for that TPU. In the "default values" form, the user provides the TPU type and element types, so the combinations array contains only those supported matrix sizes and element types that match those element types on that TPU. In the "validation" form, the user provides the TPU type, element types, and element sizes. +|`num_combinations`| indicates number of combinations supported by the TPU implementation which corresponds to the size of the `combinations` array +|====================== + + + + + ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -285,15 +315,31 @@ struct tpu_params< template using joint_matrix_a = joint_matrix; - template + template using joint_matrix_b = joint_matrix; - template + template using joint_matrix_c = joint_matrix; - bool dynamic_p = false; // should be true in future implementations + static constexpr bool dynamic_p = false; // should be true in future implementations // because Intel AMX hardware supports dynamic sizes - uint32_t numtiles = 8; - scope_t scope = scope_t::sub_group; + static constexpr uint32_t numtiles = 8; + static constexpr scope_t scope = scope_t::sub_group; + struct combination { + uint32_t max_msize; + uint32_t max_nsize; + uint32_t max_ksize; + uint32_t msize; + uint32_t nsize; + uint32_t ksize; + matrix_type atype; + matrix_type btype; + matrix_type ctype; + }; + // In this case, the combinations array contains only the combination that the user provided + static constexpr combination combinations[] = { + {16, 16, (sizeof(Ta) == 1) ? 
64 : 32, M, N, K}}; + static constexpr int num_combinations = + sizeof(combinations) / sizeof(combination); }; // Sizes-only query @@ -319,26 +365,28 @@ struct tpu_params using joint_matrix_a = joint_matrix; - template + template using joint_matrix_b = joint_matrix; - template + template using joint_matrix_c = joint_matrix; - bool dynamic_p = false; // should be true in future implementations because + static constexpr bool dynamic_p = false; // should be true in future implementations because // Intel AMX hardware supports dynamic sizes - uint32_t numtiles = 8; - scope_t scope = scope_t::sub_group; + static constexpr uint32_t numtiles = 8; + static constexpr scope_t scope = scope_t::sub_group; struct combination { uint32_t max_msize; uint32_t max_nsize; uint32_t max_ksize; - matrix_type atype; - matrix_type btype; - matrix_type ctype; uint32_t msize; uint32_t nsize; uint32_t ksize; + matrix_type atype; + matrix_type btype; + matrix_type ctype; }; + // In this case, the combinations array contain only the combinations that correspond to the Ta, Tb, and Tc + // types that the user provided static constexpr combination combinations[] = { {16, 16, (sizeof(Ta) == 1) ? 
64 : 32}}; static constexpr int num_combinations = @@ -353,28 +401,28 @@ struct tpu_params { static constexpr std::size_t defaultN = -1; static constexpr std::size_t defaultK = -1; - bool dynamic_p = false; // should be true in future implementations because + static constexpr bool dynamic_p = false; // should be true in future implementations because // Intel AMX hardware supports dynamic sizes - uint32_t numtiles = 8; - constscope_t scope = scope_t::sub_group; + static constexpr uint32_t numtiles = 8; + static constexpr scope_t scope = scope_t::sub_group; struct combination { uint32_t max_msize; uint32_t max_nsize; uint32_t max_ksize; - matrix_type atype; - matrix_type btype; - matrix_type ctype; uint32_t msize; uint32_t nsize; uint32_t ksize; + matrix_type atype; + matrix_type btype; + matrix_type ctype; }; static constexpr combination combinations[] = { - {16, 16, 64, matrix_type::sint8, matrix_type::sint8, matrix_type::sint32}, - {16, 16, 64, matrix_type::sint8, matrix_type::uint8, matrix_type::sint32}, - {16, 16, 64, matrix_type::uint8, matrix_type::sint8, matrix_type::sint32}, - {16, 16, 64, matrix_type::uint8, matrix_type::uint8, matrix_type::sint32}, - {16, 16, 32, matrix_type::bf16, matrix_type::bf16, matrix_type::fp32}}; + {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::sint8, matrix_type::sint32}, + {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::uint8, matrix_type::sint32}, + {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::sint8, matrix_type::sint32}, + {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::uint8, matrix_type::sint32}, + {16, 16, 32, 0, 0,0, matrix_type::bf16, matrix_type::bf16, matrix_type::fp32}}; static constexpr int num_combinations = sizeof(combinations) / sizeof(combination); }; @@ -541,6 +589,9 @@ We did not utilize this extension for this matrix API version because sub-group ## TODO List - Add support for fill matrix and element-wise operations features - Add 'matrix_use' parameter to the matrix to 
distinguish between matrix A, B, and matrix accumulator. This is necessary for supporting VNNI and transpose transform +- Change the names of the default sizes in the query from defaultM, defaultN, defaultK to M, N, K +- Change the type of `scope` in the query interface to be able to return more than one value. This will be useful in the event we support other scopes like workgroup besides subgroups + ## Revision History From 206d2cf2202db1dcdf153fd72620416f4d3717f4 Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Mon, 25 Oct 2021 13:54:15 -0500 Subject: [PATCH 4/6] [SYCL][Matrix] Incorporate Greg's and Alexey S.'s reviews, specifically, - Greg: Add an additional column to the table to specify, for each member, in which forms they are defined - Greg: Clarify the definition of the default values form - Alexey S.: add an open question about the returned type of get_wi_slice: owned vs. view object --- .../Matrix/dpcpp-joint-matrix.asciidoc | 62 ++++++++++--------- 1 file changed, 32 insertions(+), 30 deletions(-) diff --git a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc index 2d797925ea19c..eb04df0d1d3c0 100644 --- a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc +++ b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc @@ -242,35 +242,35 @@ This also offers development and tuning productivity by both scientists and libr so there will be no runtime errors. The query interface proposed here consists of three functionalities: -- At compile time, inform the user whether a specific combination is valid or not. +- Validation: at compile time, the validation functionality informs the user whether a specific combination is valid or not. This takes place when the user specifies all template parameters. -- Construct the matrices using a default shape if user does not provide a combination.
This corresponds to the case where the user provides the sizes of large `tile` matrices but does not specify the sizes of the corresponding submatrices of the `tiles`. In this case, the query will construct these submatrices of the matrices whose size the user provided. +- Default values: this provides a default shape if the user does not provide a specific combination. In this case, aliases to the `joint_matrix` type can be used, namely `joint_matrix_a/b/c` where no additional argument is needed. This form happens when the user specifies all template parameters except the sizes of the matrices (`tiles`) M, N, and K. -- General query interface for sizes, types, static/dynamic, scope. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query return an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C. +- General query: the general query interface provides information about sizes, types, static/dynamic, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query returns an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C.
Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively depending whether the implementation supports a continuous or discrete number of sizes. For example, Intel AMX implementation supports a continuous number of sizes so the `max_*` variant is applied and only the maximum number is returned. DPAS implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using. -The table below provides a desciption for each of the member variables and type aliases in `tpu_params` class. +The table below provides a description for each of the member variables and type aliases in `tpu_params` class and the forms in which they are defined. [frame="none",options="header"] |====================== -| Member/type alias in `tpu_params` |Description -|`type_a`| type alias for the type of matrix A -|`type_b`| type alias for the type of matrix B -|`type_c`| type alias for the type of matrix C -|`defaultM`| when no sizes are provided by the user, indicates the suggested default size for M; usually this corresponds to the maximum size the implementation supports -|`defaultN`| when no sizes are provided by the user, indicates the suggested default size for N; usually this corresponds to the maximum size the implementation supports -|`defaultK`| when no sizes are provided by the user, indicates the suggested default size for K; usually this corresponds to the maximum size the implementation supports -|`joint_matrix_a`| type alias for `joint_matrix` for matrix A -|`joint_matrix_b`| type alias for `joint_matrix` for matrix B -|`joint_matrix_c`| type alias for `joint_matrix` for matrix C -|`dynamic_p`| a boolean that indicates whether the implementation supports dynamic sizes (true) or not (false) -|numtiles| indicates number of tiles in Intel AMX (does not apply to DPAS) -|scope| indicates the memory and 
execution scope supported by the TPU implementation -|`combination` | composes the types and sizes of A, B, C matrices allowed in one combination -|`max_msize`, `max_nsize`, `max_ksize`|When one of these members is non-zero, it indicates that the TPU supports all element sizes in the range from 1 up to the given value. By contrast, a zero value indicates that the TPU implementation supports only a discrete set of element sizes, which are given by the corresponding msize, nsize, or ksize members -|`msize`, `nsize`, `ksize`| presents one of the sizes that the TPU implementation supports -|`atype`, `btype`, `ctype`| indicates the types supported in the combination -|`combinations` |Tells the set of supported matrix sizes and types according to the template parameters that are provided. In the "general query" form, the user provides only the TPU type, so the combinations array contains all supported tile sizes and element types for that TPU. In the "default values" form, the user provides the TPU type and element types, so the combinations array contains only those supported matrix sizes and element types that match those element types on that TPU. In the "validation" form, the user provides the TPU type, element types, and element sizes. 
-|`num_combinations`| indicates number of combinations supported by the TPU implementation which corresponds to the size of the `combinations` array +| Member/type alias in `tpu_params` | Forms they are defined in |Description +|`type_a`| validation, default values|type alias for the type of matrix A +|`type_b`| validation, default values|type alias for the type of matrix B +|`type_c`| validation, default values|type alias for the type of matrix C +|`defaultM`| validation, default values, general query|when no sizes are provided by the user, indicates the suggested default size for M; usually this corresponds to the maximum size the implementation supports +|`defaultN`| validation, default values, general query|when no sizes are provided by the user, indicates the suggested default size for N; usually this corresponds to the maximum size the implementation supports +|`defaultK`| validation, default values, general query|when no sizes are provided by the user, indicates the suggested default size for K; usually this corresponds to the maximum size the implementation supports +|`joint_matrix_a`| validation, default values|type alias for `joint_matrix` for matrix A +|`joint_matrix_b`| validation, default values| type alias for `joint_matrix` for matrix B +|`joint_matrix_c`| validation, default values| type alias for `joint_matrix` for matrix C +|`dynamic_p`| validation, default values, general query| a boolean that indicates whether the implementation supports dynamic sizes (true) or not (false) +|numtiles| validation, default values, general query|indicates number of tiles in Intel AMX (does not apply to DPAS) +|scope| validation, default values, general query| indicates the memory and execution scope supported by the TPU implementation +|`combination` | validation, default values, general query|composes the types and sizes of A, B, C matrices allowed in one combination +|`max_msize`, `max_nsize`, `max_ksize`| validation, default values, general query|When one of 
these members is non-zero, it indicates that the TPU supports all element sizes in the range from 1 up to the given value. By contrast, a zero value indicates that the TPU implementation supports only a discrete set of element sizes, which are given by the corresponding msize, nsize, or ksize members +|`msize`, `nsize`, `ksize`| validation, default values, general query|presents one of the sizes that the TPU implementation supports +|`atype`, `btype`, `ctype`| validation, default values, general query| indicates the types supported in the combination +|`combinations` | validation, default values, general query|Tells the set of supported matrix sizes and types according to the template parameters that are provided. In the "general query" form, the user provides only the TPU type, so the combinations array contains all supported tile sizes and element types for that TPU. In the "default values" form, the user provides the TPU type and element types, so the combinations array contains only those supported matrix sizes and element types that match those element types on that TPU. In the "validation" form, the user provides the TPU type, element types, and element sizes so only this specific combination is returned in the combinations array. 
+|`num_combinations`| validation, default values, general query|indicates number of combinations supported by the TPU implementation which corresponds to the size of the `combinations` array |====================== @@ -285,7 +285,7 @@ namespace sycl::ext::oneapi::experimental::matrix { template struct tpu_params; -// Valid or Not: +// Validation form: Valid or not // Specialization when both types and sizes are given template struct tpu_params< @@ -342,7 +342,7 @@ struct tpu_params< sizeof(combinations) / sizeof(combination); }; -// Sizes-only query +// Default values form: Sizes-only query // Specialization for when only types are given, need to query only sizes template struct tpu_params struct tpu_params { @@ -461,7 +461,7 @@ enum class scope_t { ``` -=== Valid or not Example: +=== Validation Example: ```c++ // User can provide sizes besides the types and tpu_params can assert if they are supported or not // in this case, an assertion will happen as 16 is not a supported size for M @@ -489,8 +489,8 @@ myparams::joint_matrix_c sub_c(sg); constexpr int M = 1500; // with msize = 8 and msize = 4, // M can be broken up to 125 sequence of 8-sized ops and remaining 500 using 125 sequence of 4-sized ops tpu_params params; -constexpr int msize = break_dimension(params.combinations, M); +constexpr int msize = break_dimension(params, M); +constexpr int msize_remainder = break_dimension_remainder(params, M); constexpr int nsize = params.combinations[0].nsize; constexpr int ksize = params.combinations[0].ksize; // device code: @@ -583,8 +583,10 @@ We did not utilize this extension for this matrix API version because sub-group ## Open Questions - Besides row, col major and packed (VNNI) layout, what are the additional layouts that should absolutely be added? 
- Are there alternative names for the `packed_a` and `packed_b` layouts that would be clearer to distinguish between the VNNI Layout in matrix A and VNNI layout in matrix B of a matrix multiply and add operation on Intel AMX? +-- Yes, this will be addressed in the next revision where a `use` argument will be introduced to distinguish between the right (B), left (A), and accumulator matrices. - Ronan Keryell: "It would be interesting to investigate whether providing also member functions would simplify the API. Provide both so it is possible to use the best one for each use case, while waiting for https://en.wikipedia.org/wiki/Uniform_Function_Call_Syntax to land into C++?" -- What should the API description include: (1) only features that are implemented, (2) features that are actually part of the API: currently implemented and the ones that we expect implementing them in the future. Specifically, should the document include things like dynamic_ extent and Group? These are part of the API but are not currently implemented. + +- In future-looking APIs, `get_wi_slice` (currently under design) returns an owned object. Should it return a view object instead, so that changes made through the slices are reflected in the original matrix C? 
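The `break_dimension` and `break_dimension_remainder` helpers used in the dynamic-sizes example are left undefined by this document. Below is a minimal, host-only sketch of one possible policy: pick the largest `msize` the query reports, then cover the leftover rows with a smaller supported `msize`. The `fake_params` struct and its `combinations` contents are illustrative stand-ins for a real `tpu_params` instantiation, and the splitting policy itself is an assumption, not mandated behavior.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a tpu_params query result: the real class
// exposes a `combinations` array and `num_combinations` (sizes below
// are placeholders, not actual hardware capabilities).
struct combination {
  std::size_t msize, nsize, ksize;
};

struct fake_params {
  static constexpr combination combinations[] = {
      {8, 16, 16}, {4, 16, 16}, {2, 16, 16}, {1, 16, 16}};
  static constexpr std::size_t num_combinations = 4;
};

// Pick the largest supported msize that is no larger than M.
template <typename Params>
constexpr std::size_t break_dimension(Params, std::size_t M) {
  std::size_t best = 1;
  for (std::size_t i = 0; i < Params::num_combinations; ++i)
    if (Params::combinations[i].msize <= M &&
        Params::combinations[i].msize > best)
      best = Params::combinations[i].msize;
  return best;
}

// Pick a supported msize that evenly covers the leftover rows;
// returns 0 when M divides evenly and no remainder loop is needed.
template <typename Params>
constexpr std::size_t break_dimension_remainder(Params, std::size_t M) {
  std::size_t rem = M % break_dimension(Params{}, M);
  if (rem == 0) return 0;
  for (std::size_t i = 0; i < Params::num_combinations; ++i)
    if (rem % Params::combinations[i].msize == 0)
      return Params::combinations[i].msize;
  return 1;
}
```

With these definitions, `break_dimension(fake_params{}, 1500)` selects 8 and `break_dimension_remainder(fake_params{}, 1500)` selects 4, matching the spirit of the example above (other splits, such as the 125/125 split in the example's comment, are equally valid choices for an implementation).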
## TODO List - Add support for fill matrix and element-wise operations features From 41133cc8c5c9f8c2ef38e6cff9341c164e7f131b Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Wed, 27 Oct 2021 13:54:50 -0500 Subject: [PATCH 5/6] [SYCL][Matrix] Add a todo item about Greg's request to add a more realistic general query example --- sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc | 1 + 1 file changed, 1 insertion(+) diff --git a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc index eb04df0d1d3c0..48d71c0eebe3e 100644 --- a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc +++ b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc @@ -593,6 +593,7 @@ We did not utilize this extension for this matrix API version because sub-group - Add 'matrix_use' parameter to the matrix to distinguish between matrix A, B, and matrix accumulator. This is necessary for supporting VNNI and transpose transform - Change the names default sizes in the query from defaultM, defaultN, defaultK to M,N,K - Change the type of `scope` in the query interface to be able to return more than one value. 
This will be useful in the event we support other scopes like workgroup besides subgroups +- Add a more realistic and complete example that shows the value of the general query ## Revision History From 7dab76e1d33341b1e6bf339ab933552281abb3e2 Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Mon, 1 Nov 2021 08:52:47 -0500 Subject: [PATCH 6/6] [SYCL][Matrix] Incorporate suggestions from Greg to modify some text --- .../Matrix/dpcpp-joint-matrix.asciidoc | 16 ++++++---------- 1 file changed, 6 insertions(+), 10 deletions(-) diff --git a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc index 48d71c0eebe3e..f3b96a8827ea0 100644 --- a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc +++ b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc @@ -256,9 +256,9 @@ The table below provides a description for each of the member variables and type |`type_a`| validation, default values|type alias for the type of matrix A |`type_b`| validation, default values|type alias for the type of matrix B |`type_c`| validation, default values|type alias for the type of matrix C -|`defaultM`| validation, default values, general query|when no sizes are provided by the user, indicates the suggested default size for M; usually this corresponds to the maximum size the implementation supports -|`defaultN`| validation, default values, general query|when no sizes are provided by the user, indicates the suggested default size for N; usually this corresponds to the maximum size the implementation supports -|`defaultK`| validation, default values, general query|when no sizes are provided by the user, indicates the suggested default size for K; usually this corresponds to the maximum size the implementation supports +|`defaultM`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for M; usually this corresponds to the maximum size the implementation supports. 
In validation mode, where the user does provide sizes, this is the same value M that the user provides if M is supported by the implementation +|`defaultN`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for N; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value N that the user provides if N is supported by the implementation +|`defaultK`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for K; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value K that the user provides if K is supported by the implementation |`joint_matrix_a`| validation, default values|type alias for `joint_matrix` for matrix A |`joint_matrix_b`| validation, default values| type alias for `joint_matrix` for matrix B |`joint_matrix_c`| validation, default values| type alias for `joint_matrix` for matrix C @@ -266,10 +266,10 @@ The table below provides a description for each of the member variables and type |numtiles| validation, default values, general query|indicates number of tiles in Intel AMX (does not apply to DPAS) |scope| validation, default values, general query| indicates the memory and execution scope supported by the TPU implementation |`combination` | validation, default values, general query|composes the types and sizes of A, B, C matrices allowed in one combination -|`max_msize`, `max_nsize`, `max_ksize`| validation, default values, general query|When one of these members is non-zero, it indicates that the TPU supports all element sizes in the range from 1 up to the given value. 
By contrast, a zero value indicates that the TPU implementation supports only a discrete set of element sizes, which are given by the corresponding msize, nsize, or ksize members -|`msize`, `nsize`, `ksize`| validation, default values, general query|presents one of the sizes that the TPU implementation supports +|`max_msize`, `max_nsize`, `max_ksize`| validation, default values, general query| if the TPU implementation supports a continuous number of element sizes, each of these members is non-zero, and the TPU implementation supports all element sizes from 1 up to (and including) that number. By contrast, if the TPU implementation supports a discrete number of element sizes, each of these members has the value zero +|`msize`, `nsize`, `ksize`| validation, default values, general query| if the TPU implementation supports a discrete number of element sizes, each of these members is non-zero, and the value tells one of the supported element sizes. By contrast, if the TPU supports a continuous number of element sizes, each of these members has the value zero |`atype`, `btype`, `ctype`| validation, default values, general query| indicates the types supported in the combination -|`combinations` | validation, default values, general query|Tells the set of supported matrix sizes and types according to the template parameters that are provided. In the "general query" form, the user provides only the TPU type, so the combinations array contains all supported tile sizes and element types for that TPU. In the "default values" form, the user provides the TPU type and element types, so the combinations array contains only those supported matrix sizes and element types that match those element types on that TPU. In the "validation" form, the user provides the TPU type, element types, and element sizes so only this specific combination is returned in the combinations array. 
+|`combinations` | validation, default values, general query| tells the set of supported matrix sizes and types according to the template parameters that are provided. In the "general query" form, the user provides only the TPU type, so the combinations array contains all supported tile sizes and element types for that TPU. In the "default values" form, the user provides the TPU type and element types, so the combinations array contains only those supported matrix sizes and element types that match those element types on that TPU. In the "validation" form, the user provides the TPU type, element types, and element sizes so only this specific combination is returned in the combinations array. |`num_combinations`| validation, default values, general query|indicates number of combinations supported by the TPU implementation which corresponds to the size of the `combinations` array |====================== @@ -397,10 +397,6 @@ struct tpu_params struct tpu_params { - static constexpr std::size_t defaultM = -1; // depends on the type - static constexpr std::size_t defaultN = -1; - static constexpr std::size_t defaultK = -1; - static constexpr bool dynamic_p = false; // should be true in future implementations because // Intel AMX hardware supports dynamic sizes static constexpr uint32_t numtiles = 8;