Matrix Programming Extension for DPC++: SYCL_EXT_ONEAPI_MATRIX

Notice

Note	Khronos® is a registered trademark and SYCL™ and SPIR™ are trademarks of The Khronos Group Inc. OpenCL™ is a trademark of Apple Inc. used by permission by Khronos.

This extension is written against the SYCL 2020 revision 3 specification. All references below to the "core SYCL specification" or to section numbers in the SYCL specification refer to that revision.

NOTE: This document describes the current design and API for the matrix extension to DPC++. This is an initial experimental version to try out functionality and performance, and future versions of this API may change in ways that are incompatible with this experimental version. The current implementation provides support of the matrix interface on Intel® Advanced Matrix Extensions (AMX). We are going to work with the community on incrementally improving the API to bring them closer to standard C++ (aligned with the std::mdspan and std::mdarray proposals) and SYCL in the next several months.

Introduction

This document presents an ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: AMX in Intel CPU, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefit from the maximum performance these different hardware can offer.

Feature test macro

This extension provides a feature-test macro as described in the core SYCL specification section 6.3.3 "Feature test macros". Therefore, an implementation supporting this extension must predefine the macro SYCL_EXT_ONEAPI_MATRIX to one of the values defined in the table below. Applications can test for the existence of this macro to determine if the implementation supports this feature, or applications can test the macro’s value to determine which of the extension’s APIs the implementation supports.

Value	Description
1	Initial extension implementation on AMX. Base features are supported.

New `joint_matrix` class

We introduce a new class called joint_matrix. The user needs to specify the type of the elements, shape, the memory layout, and the memory scope of the matrix.

Memory Scope

In this experimental API version, we used the terminology of joint_matrix instead of plain matrix to emphasis that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments.

Shape

The same class joint_matrix should handle both cases where sizes are constant (GPU case) and when sizes are variables (CPU case). Note that a AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes. The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. We use sycl::dynamic_extent to differentiate between static and dynamic sizes.

Layout

Layout is necessary on the type to be able to calculate the physical offset if the user needs to access a single entry for some purpose. Besides row major and column major layout, matrix_layout is flexible enough to introduce customed layouts such as symmetric or tiled layouts.

This results into the following description:

namespace sycl::ext::intel::experimental::matrix {
template <typename Group, typename T, size_t Rows=sycl::dynamic_extent, size_t Cols=sycl::dynamic_extent, matrix_layout Layout = matrix_layout::row_major>
struct joint_matrix {
    joint_matrix(Group g) {}
};
}

When the group is a sycl::sub_group, a matrix is declared as follows:

joint_matrix<sub_group, int8_t, tM, tN> tA(sg);

AMX hardware requires both A and B to be in VNNI or 32 bits packed layout. If we multiply matrices A (M, K) and B (K, N) into a matrix C (M, N). The logical sizes are M, K, N. However, the packed size for B is different. Let’s take an example of bf16 type. The size of packed B is (K/2, N*2). For maximum performance, the allocation of B tile has to be done at the declaration of B matrix point using these VNNIed sizes. This motivate the choice of adding layout argument to the matrix type.

Same applies to matrix A. But in the case where the memory layout (where the matrices will be loaded from) is column major. Here, we multiply matrices A (K, M) and B (N, K) into a matrix C (N, M). The packed size for A is different. For bf16 type, the size of packed A is (K/2, M*2).

namespace sycl::ext::intel::experimental::matrix {
enum class matrix_layout {
  row_major,
  col_major,
  packed_a,
  packed_b
};
}

Matrix Operations and their Execution Scope

We define three new functions needed to perform the main and common operations on matrices namely, load, store, and the actual multiply and add operation. This set of functions can be easily extended if the tensor hardware implements new features.

The base pointer determines the starting address of the matrix to be loaded/stored. layout determines whether the data are being read/written in a row (row_major), column major (column_major) fashion, or if the data has already been transformed into VNNI format (packed_a, packed_b).

Note that for getting maximum performance on AMX, prepacking data in the memory is necessary. If users did not specify the packed layouts (packed_a in column major case, packed_b in row major case), transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts packed_a and packed_b to the user to specify that A and/or B have already been VNNIed. The packed or VNNI layout is introduced in VNNI layout section below.

Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The Group template argument can be a work-group or a sub-group. These functions will be called once by each work item in the group.

To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these functions are collective operations. The DPC++ syntax is the following:

Load

namespace sycl::ext::intel::experimental::matrix {
  template <typename Group, typename T, size_t NumRows, size_t NumCols,
          matrix_layout Layout,
          access::address_space Space>
  void joint_matrix_load(Group sg, joint_matrix<Group, T, NumRows, NumCols> &res,
		    multi_ptr<T, Space> src, size_t stride, matrix_layout layout = matrix_layout::row_major);
}

This function loads data from memory to the 2d tiles of AMX that is a 2d storage.

Store

namespace sycl::ext::intel::experimental::matrix {
  template <typename Group, typename T, size_t NumRows, size_t NumCols,
          matrix_layout Layout,
          access::address_space Space>
  void joint_matrix_store(Group sg, matrix<Group, T, NumRows, NumCols> &res,
		     multi_ptr<T, Space> src, size_t stride, matrix_layout layout = matrix_layout::row_major);
}

This function stores the data from the 2d tiles back to memory.

Multiply and Add

namespace sycl::ext::intel::experimental::matrix {
  template <typename Group, typename T1, typename T2, std::size_t M,
          std::size_t K, std::size_t N,
	  matrix_layout LayoutA, matrix_layout LayoutB,
          matrix_layout LayoutC>
  joint_matrix<Group, T2, M, N> joint_matrix_mad(Group sg, joint_matrix<Group, T1, M, K> A,
               joint_matrix<Group, T1, K, N> B, joint_matrix<Group, T2, M, N> C);
}

The matrix multiply and add function performs the multiply operation on the matrices A and B, accumulate the result with C and return the result.

VNNI/Packed Layout

AMX compute assumes register for B tile (src1) to be in VNNI format as they need 32bit of K-data in A and B to be contiguous in memory. The VNNI blocking factor is 2 in the case of 16bits, 4 in the case of 8 bits elements. While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transform.

// Example of bf16 data type:
// ---------------------------------
// a1, b1, c1, d1
// a2, b2, c2, d2
// a3, b3, c3, d3
// a4, b4, c4, d4
// ---------------------------------
// reformat to
// ---------------------------------
// a1, a2, b1, b2, c1, c2, d1, d2
// a3, a4, b3, b4, c3, c4, d3, d4

Example using int8_t type

using namespace sycl::ext::intel::experimental::matrix;

queue q;
range<2> G = {M, N};
// For this first implementation, SG_SIZE has to be equal to one
range<2> L = {1, SG_SIZE};
int8_t *memA = malloc_shared<int8_t>(M*K, q);
int8_t *memB = malloc_shared<int8_t>(K*N, q);
Int32_t *memC = malloc_shared<int32_t>(M*N, q);
//Assuming memB has already been VNNIed
q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item)
  [[sycl::reqd_sub_group_size(SG_SIZE)]] {
   const auto global_idx = item.get_global_id(0);
   const auto global_idy = item.get_global_id(1);
   const auto sg_startx = global_idx - item.get_local_id(0);
   const auto sg_starty = global_idy - item.get_local_id(1);
   sub_group sg = item.get_sub_group();
   joint_matrix<sub_group, int8_t, tM, tK> tA(sg);
   // For B, since current implementation does not support non packed layout,
   // users need to specify the updated VNNI sizes along with the packed_b layout
   joint_matrix<sub_group, int8_t, tK/4, tN*4, packed_b> tB(sg);
   joint_matrix<sub_group, int32_t, tM, tN> tC(sg);
   joint_matrix_load(sg, tC, memC + sg_startx * tM * N + sg_starty, N, matrix_layout::row_major);
   for (int k = 0; k < K; k += tk) {
     joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K, matrix_layout::row_major);//collective
     joint_matrix_load(sg, tB, memB + k * N + sg_starty, N, matrix_layout::packed_b);//VNNI
     tC = joint_matrix_mad(sg, tA, tB, tC);
   }
   joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty, N, matrix_layout::row_major);
}).wait();

Implementation Status

For oneAPI release 3, an AOT implementation is available on the CPU device to targets AMX hardware. we are using AMX tile intrinsics to implement the matrix load and store operations. Since we are currently emitting AMX intrinsics directly, this only enables AOT compilation. Please refer to the following section that talks about the future unified SPIR-V path that will enable JIT compilation.

Currently, this is the compilation command line needed to invoke AMX unit of Sapphire Rapids CPU:

clang++ -fsycl -march=sapphirerapids fsycl-targets="spir64_x86_64-uknown-linux-sycldevice"  -O2 matmul-int8.cpp -o matmul-int8

Current Implementation Restrictions

This section provides the specific features that this implementation supports. However, in future versions of this API and implementation, the expectation is to provide a query interface to guide the usage of this API.

Type, Sizes, and Layouts

The types supported by this AMX implementation are restricted to the types that AMX hardware support. Although the AMX hardware supports 2d tiles with a maximum size of 16x64 bytes, this current implementation can handle any size. If the matrix size is bigger than 1024 bytes, it will be stored in memory rather than mapped to a 2d tile. Performance penalty may occur in this case. In order to get the best performance with this implementation, matrix sizes should be no larger than 16x64 bytes and A and B matrices should be already packed (put in VNNI format).

More specifically, the following operation C = A*B+C can be performed on AMX with this interface where: A(int8, any-size, row_major), B(int8, any-size, packed_b), C(int32, any-size, row_major) or A(bf16, any-size, row_major), B(bf16, any-size, packed_b), C(float, any-size, row_major)

Memory and Execution Scope

This current implementation only considers a sub-group scope. However, the sub-group size has to be equal to one in this first implementation.

Future Implementation Work

Unfied LLVM IR and SPIRV JIT Enabling

To enable JIT compilation, a unified matrix IR needs to be added. Currently, there is no matrix type in LLVM IR or SPIR-V. We are working towards adding a new matrix type in both LLVM IR and SPIR-V. This JIT enabling is expected to be part of a future compiler release.

LLVM IR Extension

As a short-term solution, we are extending the existing LLVM IR matrix intrinsics to include features like VNNI layout. The current matrix intrinsics use flattened vectors to represent the matrix. Therefore, we are exploring both adding matrix type to LLVM IR and also using MLIR vector dialect for this work.

SPIR-V Extension

The current draft proposal can be found here. We are adding translation from LLVM IR matrix to SPIR-V matrix and vice versa in the LLVM to SPIR-V translator tool.

Future-looking API

Memory scope

The current experimental API uses joint_ semantics to define the memory scope of the matrix. The long term solution is to use the proposed group_local_memory extension to allocate the matrix in local memory associated with a SYCL group as shown in the example below.

multi_ptr<matrix<T>, address_space::local_space> tA_ptr = group_local_memory<matrix<sub_group, int8_t, tM, tN>>(sg);

We did not utilize this extension for this matrix API version because sub-group local memory is not yet well defined in DPC++. Moreover, the representation of this notion in LLVM IR and SPIR-V is not clear yet.

Open Questions

Besides row, col major and packed (VNNI) layout, what are the additional layouts that should absolutely be added?
Are there alternative names for the packed_a and packed_b layouts that would be clearer to distinguish between the VNNI Layout in matrix A and VNNI layout in matrix B of a matrix multiply and add operation on AMX?
Ronan Keryell: "It would be interesting to investigate whether providing also member functions would simplify the API. Provide both so it is possible to use the best one for each use case, while waiting for https://en.wikipedia.org/wiki/Uniform_Function_Call_Syntax to land into C++?"
What should the API description include: (1) only features that are implemented, (2) features that are actually part of the API: currently implemented and the ones that we expect implementing them in the future. Specifically, should the document include things like dynamic_ extent and Group? These are part of the API but are not currently implemented.

TODO List

Handle sub group sizes that are bigger than one.
Add support for queries that gives information about the capabilities of the implementation on a particular device.
Once the SPIRV translator work is done, this code generation work will move to the backend along enabling JIT compilation.

Revision History

Rev	Date	Author	Changes
1	2021-04-13	Dounia Khaldi	Initial public working draft.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Matrix Programming Extension for DPC++: SYCL_EXT_ONEAPI_MATRIX

Notice

Introduction

Feature test macro

New `joint_matrix` class

Memory Scope

Shape

Layout

Matrix Operations and their Execution Scope

Load

Store

Multiply and Add

VNNI/Packed Layout

Example using int8_t type

Implementation Status

Current Implementation Restrictions

Type, Sizes, and Layouts

Memory and Execution Scope

Future Implementation Work

Unfied LLVM IR and SPIRV JIT Enabling

LLVM IR Extension

SPIR-V Extension

Future-looking API

Memory scope

Open Questions

TODO List

Revision History

FilesExpand file tree

dpcpp-joint-matrix.asciidoc

Latest commit

History

dpcpp-joint-matrix.asciidoc

File metadata and controls

Matrix Programming Extension for DPC++: SYCL_EXT_ONEAPI_MATRIX

Notice

Introduction

Feature test macro

New joint_matrix class

Memory Scope

Shape

Layout

Matrix Operations and their Execution Scope

Load

Store

Multiply and Add

VNNI/Packed Layout

Example using int8_t type

Implementation Status

Current Implementation Restrictions

Type, Sizes, and Layouts

Memory and Execution Scope

Future Implementation Work

Unfied LLVM IR and SPIRV JIT Enabling

LLVM IR Extension

SPIR-V Extension

Future-looking API

Memory scope

Open Questions

TODO List

Revision History

New `joint_matrix` class