Skip to content
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
164 changes: 111 additions & 53 deletions sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,7 @@ value to determine which of the extension's APIs the implementation supports.
|======================
|Value |Description
|1 |Initial extension implementation on Intel AMX. Base features are supported.
|2 |Initial extension JIT implementation on Intel AMX and DPAS. load, store, mad and the query interface are supported
|2 |Initial extension JIT implementation on Intel AMX and DPAS. load, store, mad, fill, piece-wise operations, and the query interface are supported
|======================

## New `joint_matrix` class
Expand Down Expand Up @@ -165,6 +165,93 @@ namespace sycl::ext::oneapi::experimental::matrix {
The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulate the result with `C` and return the result.


#### Matrix Initialization: `joint_matrix_fill`
The current interface presented above assumes that all the matrices are directly loaded from memory. This new function called `joint_matrix_fill` makes it possible to multiply a matrix which is not directly loaded from memory but rather initialized directly in the register. On Intel AMX, if the initialization constant is zero, this would map to `_tile_zero` intrinsic:

```c++
namespace sycl::ext::oneapi::experimental::matrix {
template <typename Group, typename T, size_t NumRows, size_t NumCols,
matrix_layout L, typename Tv>
void joint_matrix_fill(Group sg, joint_matrix<T, NumRows, NumCols, L, Group> &m, const Tv v);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void joint_matrix_fill(Group sg, joint_matrix<T, NumRows, NumCols, L, Group> &m, const Tv v);
void joint_matrix_fill(Group sg, joint_matrix<T, NumRows, NumCols, L, Group> &m, Tv v);

It does not make sense to use const here when passing a parameter by value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are there separate types for T and Tv? Don't you want v to have the same type T?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v can have a different type

}
```
IMPORTANT: In the current implementation, only the subgroup scope is supported.

#### Element Indexing and Piece-Wise Operations
Besides matrix multiply and add, matrices are used in linear and non linear piece-wise operations. Activation functions are an example of element-wise operations. They can be linear like `ReLU` that, for each value `z`, returns the maximum between `z` and zero, or non linear like `Sigmoid` that calculates `1/(1+ exp(-z))`. Quantization that is needed for conversion between low precision types like `int8_t` and `fp32` uses piece-wise operations. For instance, quantized GEMM for `int8_t` is calculated using `A*B + sum_rows_A + sum_cols_B + scalar_zero_point`. `sum_rows_A` and `sum_cols_B` do not operate on elements of the matrix but on pieces: row in `sum_rows_A` and columns in `sum_cols_B`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the last sentence about quantized GEMM. How do the APIs in this section help compute the sum or rows or columns in A or B?

I think the main point here is that the element indexing API can be used in cases where some computation needs to be performed individually on each element of the matrix. Correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is that: element-wise operations where computation needs to be performed individually on each element of the matrix. But there is also piece-wise operations like sum of rows. An example on how to do that using get_wi_data can be:

auto data = C.get_wi_data();
for (int i = 0; i < length; ++i) {
auto row, col = data[i].get_coord();
sum_of_local_rows[row] += data[i];
}

But in this case, the user needs to know the mapping WI data to matrix first. As you can see here, the mapping function is missing. I added in the future looking API the mapping function as future work.

Copy link
Contributor

@JackAKirk JackAKirk May 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all that needs to be stated is that:

Suggested change
Besides matrix multiply and add, matrices are used in linear and non linear piece-wise operations. Activation functions are an example of element-wise operations. They can be linear like `ReLU` that, for each value `z`, returns the maximum between `z` and zero, or non linear like `Sigmoid` that calculates `1/(1+ exp(-z))`. Quantization that is needed for conversion between low precision types like `int8_t` and `fp32` uses piece-wise operations. For instance, quantized GEMM for `int8_t` is calculated using `A*B + sum_rows_A + sum_cols_B + scalar_zero_point`. `sum_rows_A` and `sum_cols_B` do not operate on elements of the matrix but on pieces: row in `sum_rows_A` and columns in `sum_cols_B`.
Besides matrix multiply and add, this extension aims to make it possible to perform piece-wise operations on matrices in a SPMD manner. The mechanisms that are recommended to perform such piece-wise operations depend upon which of the following classes the operation falls into:
Class "1". Element-wise operations that are performed identically on every element of the matrix.
Class "2". Element-wise operations that depend on the element index of the matrix or operations that take multiple elements as operands (such as a sum of all elements in a row for example).
This extension currently only supports case 1). However a proposal for supporting 2) (for some backends) in the future is provided in a later section.

Then continue with the explanation of how case 1) is dealt with. Case 2) seems to have been considered in section "### WI data to joint matrix mapping coordinates information for piece-wise operations" and requires that the backend knows the mapping from "joint_matrix Domain" to "WI Domain".

Copy link
Contributor

@JackAKirk JackAKirk May 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically case 1) doesn't require mapping between get_data and joint_matrix, but cases 2) do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make the change, thanks


We explored multiple options to enable this feature in the matrix interface: 1) Allowing non-restrictive element indexing on the matrix elements would result into slow indexing on the GPU, 2) Operator overloading can represent only element-wise operations and not the operations on pieces (row, column, diagonal, etc) of the matrix. 3) Providing specific functions for these piece-wise operations can resolve some of the functions we know of today like the ones involved in quantization but it is not general to any problem that may occur in the future.
Copy link
Contributor

@JackAKirk JackAKirk May 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the paragraph beginning "We explored" can be removed also I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay


In order to be able to perform any piece-wise operation in a general and efficient way, we provide a mapping conversion function from the matrix domain that is owned by a group of work items to the portion that is owned by each work item. This enables the WI to perform piece-wise operations on the matrix within the SYCL SPMD programming model. In joint matrix, it is up to the implementation to distribute the matrix among the work items or keep it shared. For instance, the matrix is a shared entity among the work items in the case of AMX where the AMX tile that holds the matrix data is a 2d register that is shared among the work items. Therefore the partitioning among the WIs is implementation defined. Here, for the purpose of piece-wise operations, the conversion to the SPMD model happens using the matrix elements to WI mapping. Besides, the WI data to joint matrix mapping coordinates information must be known in order to extract the relevant piece for operations like sum of rows of a matrix.
Copy link
Contributor

@JackAKirk JackAKirk May 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AMX example is a good addition here. I think it can be made a bit more clear and precise:

Suggested change
In order to be able to perform any piece-wise operation in a general and efficient way, we provide a mapping conversion function from the matrix domain that is owned by a group of work items to the portion that is owned by each work item. This enables the WI to perform piece-wise operations on the matrix within the SYCL SPMD programming model. In joint matrix, it is up to the implementation to distribute the matrix among the work items or keep it shared. For instance, the matrix is a shared entity among the work items in the case of AMX where the AMX tile that holds the matrix data is a 2d register that is shared among the work items. Therefore the partitioning among the WIs is implementation defined. Here, for the purpose of piece-wise operations, the conversion to the SPMD model happens using the matrix elements to WI mapping. Besides, the WI data to joint matrix mapping coordinates information must be known in order to extract the relevant piece for operations like sum of rows of a matrix.
There is no fixed allocation of matrix elements held by a `joint_matrix` instance to the WIs comprising the group used to instantiate it. For instance, the matrix is a shared entity among the WIs in the case of the AMX backend because the AMX tile that holds the matrix data is a 2d register that is shared among the work items. However it is necessary to allocate WIs to specific elements of the matrix. In order to be able to perform piece-wise operations in a general and efficient way, we provide a conversion function from the joint_matrix domain that is owned by a group of work items to the portion that is owned by each work item. This enables the WI to perform piece-wise operations on the matrix within the SYCL SPMD programming model. The mapping between the matrix data owned by a WI and them matrix data owned by the joint matrix may not be known for all backends. As such, additional conditions are necessary for class "2" piece-wise operations that we propose in a later section...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do, thanks


Nvidia wmma interface added a new member to `fragment` class to designate the WI owned part of the matrix.
Copy link
Contributor

@JackAKirk JackAKirk May 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to remove this whole paragraph beginning "Nvidia wmma interface" because it (or something similar) is more appropriate for the CUDA backend spec.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay

While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix.
However using the `mma` ptx instructions as opposed to the `wmma` ptx instructions the mapping is known. Knowing this mapping is important for the user to implement new operations like sum of rows of a matrix for quantized algorithms.

##### Solution: Explicit conversion with mapping from SIMD to SPMD
We introduce a new function `get_wi_data` that provides a view of the portion of the matrix that is owned by the current WI. So modifying `wi_data` means also modifying the joint matrix corresponding elements. The indexing provided inside the `wi_data` class acesses only the portion of the current WI and returns `wi_element`. This latter holds a reference to the original joint_matrix that `wi_data` was constructed from. Users can use the `=` operator to update the element of the `joint_matrix` represented by the `wi_element` after the element-wise operation.

Using `get_wi_data`, it is not possible to know which portions of data are owned by each thread in the group as this is implementation defined and change from one backend to the other. For general piece-wise operations like sum of rows of a matrix, the WI data to joint matrix mapping coordinates information is needed to reason about the matrix view. But for element-wise operations where the same operation is performed on all the elements of the matrix, having all the WIs in the group apply the operation inside a loop iterating over the `length` of `wi_data` guarantees the whole matrix element-wise operation.


```c++
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of these code synopses is to show the API, not the implementation. Therefore, remove the function bodies and all the private data members. For example:

namespace sycl::ext::oneapi::experimental::matrix {

template <typename T, size_t NumRows, size_t NumCols,
          matrix_layout Layout = matrix_layout::row_major,
          typename Group = sycl::sub_group>
struct joint_matrix {
   wi_data<T, NumRows, NumCols, Layout, Group> get_wi_data();
};

/* ... */

} // namespace sycl::ext::oneapi::experimental::matrix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do that, thanks

namespace sycl::ext::oneapi::experimental::matrix {
template <typename T, size_t NumRows, size_t NumCols,
matrix_layout Layout = matrix_layout::row_major,
typename Group = sycl::sub_group>
struct joint_matrix {
joint_matrix(Group sg) {…}
wi_data<T, NumRows, NumCols, Layout, Group> get_wi_data() {
return wi_data<T, NumRows, NumCols, Layout, Group>(*this);
}
};
template <typename T, size_t NumRows, size_t NumCols, matrix_layout Layout, typename Group>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do wi_data and wi_element really need all these template parameters? It seems like it would be easier to use if the only template parameter was T. It seems like the other template parameters are only there because there is a private data member M (a reference to matrix). However, you only seem to use M.spvm in the function bodies. Could you instead just store the spvm member directly in wi_data and wi_element?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on your suggestion? I get what you want to do but did not get the how.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delayed response. I was OOO for about 2 months and just got back recently.

I was thinking that these types could be simplified to have fewer template parameters, so the goal would be to have an API like this:

template <typename T>
class wi_data {
  size_t length();
  wi_element<T> operator[](size_t i);
};

template <typename T>
class wi_element {
  operator T();
  wi_element &operator=(const T &rhs);
};

Very roughly, I was thinking that you could accomplish this by changing the private data member included in wi_data and wi_element. Currently, these both contain a reference to the joint matrix M. However, it seems like they only need to use M.spvm. Therefore, could you change the implementation to hold just the spvm like:

template <typename T>
class wi_data {
  /* not sure what type */ spvm;
public:
  size_t length() {return __spirv_JointMatrixWorkItemLengthINTEL(spvm);}
  wi_element<T> operator[](size_t i) {
    return wi_element<T>(spvm, i);
  }
};

template <typename T>
class wi_element {
  /* not sure what type */ spvm;
  std::size_t idx;

public:
  operator T() {
    return __spirv_VectorExtractDynamic(spvm, idx);
  }
  wi_element &operator=(const T &rhs) {
    M.spvm = __spirv_VectorInsertDynamic(spvm, rhs, idx);
    return *this;
  }
};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gmlueck
It looks like spvm type also needs these template parameters. SO I don't think we can reduce them:
__spv::__spirv_JointMatrixINTEL<
T, NumRows, NumCols, spv_matrix_layout_traits::value> *spvm;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's too bad.

class wi_data {
joint_matrix<T, NumRows, NumCols, Layout, Group> &M;
public:
wi_data(joint_matrix<T, NumRows, NumCols, Layout, Group> &Mat) : M(Mat){}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume we do NOT want application code to construct a wi_data (or wi_element) directly? Instead, I presume we want application to call joint_matrix::get_wi_data to get the wi_data? If that is the case, these constructors should be private in the implementation, and joint_matrix should be a friend, so that it can construct the objects.

The code synopsis, then, would only list the public member functions:

template <typename T, size_t NumRows, size_t NumCols, matrix_layout Layout, typename Group>
class wi_data {
 public:
  size_t length();
  wi_element<T, NumRows, NumCols, Layout, Group> operator[](size_t i);
};

template <typename T, size_t NumRows, size_t NumCols, matrix_layout Layout, typename Group>
class wi_element {
 public:
  operator T();
  wi_element &operator=(const T &rhs);
};

size_t length() {return __spirv_JointMatrixWorkItemLengthINTEL(M.spvm);}
wi_element<T, NumRows, NumCols, Layout, Group> operator[](size_t i) {
return wi_element<T, NumRows, NumCols, Layout, Group>(M, i);
}
};
template <typename T, size_t NumRows, size_t NumCols,
matrix_layout Layout = matrix_layout::row_major,
typename Group = sycl::sub_group>
class wi_element {
joint_matrix<T, NumRows, NumCols, Layout, Group> &M;
std::size_t idx;

public:
wi_element(joint_matrix<T, NumRows, NumCols, Layout, Group> &Mat, std::size_t i): M(Mat), idx(i) {}
operator T() {
return __spirv_VectorExtractDynamic(M.spvm, idx);
}
wi_element &operator=(const T &rhs) {
M.spvm = __spirv_VectorInsertDynamic(M.spvm, rhs, idx);
return *this;
}
};
}
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the code synopsis, there should be some description of the member functions. I'd suggest three tables, one for each class:

  • Table describing member functions of joint_matrix (get_wi_data)
  • Table describing member functions of wi_data
  • Table describing member functions of wi_element.

You can see an example here: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/proposed/sycl_ext_oneapi_device_global.asciidoc#representation-of-device-globals

(Scroll down to the table after the code synopsis.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we don't have that many members, I added description in the text. Let me know if it looks enough.


Example where each WI gets a vector in `wi_data_c`. Vectorization along the subgroup dimension will get enabled automatically to vectorize the contiguous portion of the matrix. Since `wi_data_c` constructs a view of the joint matrix `matC`, `wi_data_c[i] OP rhs` updates the corresponding matrix element in the joint_matrix `matC`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence:

"Vectorization along the subgroup dimension will get enabled automatically to vectorize the contiguous portion of the matrix."

implies to me that WI portion of joint_matrix represents contiguous elements of the matrix (either row or column major), but this is not generally true: In the Tensor Cores backend it is rarely true: in most cases the WI owned portion of joint_matrix includes elements that are not contiguous. I'm not sure if there is any value to this sentence and I think it can be removed. Instead there could be:

Suggested change
Example where each WI gets a vector in `wi_data_c`. Vectorization along the subgroup dimension will get enabled automatically to vectorize the contiguous portion of the matrix. Since `wi_data_c` constructs a view of the joint matrix `matC`, `wi_data_c[i] OP rhs` updates the corresponding matrix element in the joint_matrix `matC`.
In the following example `wi_data_c` is a reference to the WI owned portion of the joint matrix `matC`. As such `wi_data_c[i] OP rhs` updates the corresponding matrix element in the joint_matrix `matC`.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not imply WI portion of joint matrix are contiguous. It implies round robin distribution among the WIs.
Let's take a matrix example of 3 rows x 2 cols, if we assume we have SG size of 2 just for illustration here. To ensure vectorization along the SG dimension, each WI should own 1 column. so you have simd of 2 (1 row) in this case and 3 vector instructions (for each row). Does this make sense?
Does the CUDA backend generate vector code for the loop around WI portions?



```c++
auto wi_data_c = matC.get_wi_data();
for (int i = 0; i < wi_data_c.length(); i++)
wi_data_c[i] *= alpha; // Note that the indexing here “i” is in the vector owned by a WI, not in the matrix C
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
wi_data_c[i] *= alpha; // Note that the indexing here “i” is in the vector owned by a WI, not in the matrix C
wi_data_c[i] *= alpha; // Note that the indexing here "i" is in the vector owned by a WI, not in the matrix C

```

IMPORTANT: In the current implementation, only the subgroup scope is supported.

IMPORTANT: The WI data to joint matrix mapping coordinates information is not implemented yet.

IMPORTANT: Since the current tensorcores implementation is AOT, it is possible to know how many elements are owned by each WI at compile time. so `wi_data` in this case can be of type `marray`. An additional interface will be provided for the tensorcores AOT backend.
Copy link
Contributor

@JackAKirk JackAKirk May 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarification:

The current Tensor Cores implementation isn't AOT: it is JIT just at a lower level than DPC++: DPC++ compiles down to CUDA "PTX" asm, which can then be compiled down to a lower level asm called "SASS" at execution time. Despite this, the number of data elements in joint_matrix owned by each WI is known at compile time for all compilation versions. The thing that isn't known at compile time is the mapping between these WI data elements and the subgroup matrix owned by joint_matrix as a whole.
The other point is that, from the available information in this doc, I don't see that it follows that the WI owned elements can only be returned as an array if and only if "the number of data elements in joint_matrix owned by each WI is known at compile time". But this point is not important to dwell on because I think such an explanation is not really necessary for this "IMPORTANT" note: the point here is to make the user aware that a limited number of backends (Tensor Cores only atm) can return the WI portion of the matrix as an marray (although as specified earlier in the document the order of the marray elements does not have a defined mapping to the order of the matrix owned by the joint_matrix subgroup as a whole).

Suggested change
IMPORTANT: Since the current tensorcores implementation is AOT, it is possible to know how many elements are owned by each WI at compile time. so `wi_data` in this case can be of type `marray`. An additional interface will be provided for the tensorcores AOT backend.
IMPORTANT: The Tensor Cores implementation is capable of returning the complete set of matrix elements owned by a WI as an `marray`, which can be useful in cases where the user wishes to perform identical operations on every element of the matrix efficiently and conveniently, where SYCL math functions optimized for `marray` exist. An additional interface will be provided for the Tensor Cores backend to return the `marray`.

Copy link
Contributor Author

@dkhaldi dkhaldi May 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I say AOT, I am referring to the SYCL compilation flow . When users have to specify -fsycl-targets this is AOT. JIT is when the compiler generates SPIRV. At runtime SPIRV is translated to the target-specific offload binaries.

I will add a comment about the fact that get_wi_data length cannot be constexpr because:
1- SYCL JIT compilation and partitioning among WIs is implementation defined. There is no way to provide a constexpr length that is true for every backend (it has to be a return value of a SPIRV function so it is not constexpr)
2- SG size is not fixed (like in the CUDA backend where warp size is always 32)
3- Even if we solve the first two, since AMX tile size can be a runtime variable, AMX case will still have to return variable length

In the case of CUDA backend which is SYCL AOT compiled and SG size = 32 known and fixed, the additional marray capability will be provided.



## VNNI/Packed Layout
Intel AMX and DPAS compute assumes register for B tile (src1) to be in VNNI format as they need 32bit of K-data in A and B to be contiguous in memory.
The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transform. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed_b` layout for a 16-bit type.
Expand Down Expand Up @@ -225,12 +312,15 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item)
// users need to specify the packed_b layout
joint_matrix<int8_t, tK, tN, packed_b> tB(sg);
joint_matrix<int32_t, tM, tN> tC(sg);
joint_matrix_load(sg, tC, memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, matrix_layout::row_major);
joint_matrix_fill(sg, tC, 0);
for (int k = 0; k < K; k += tk) {
joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K, matrix_layout::row_major);
joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN*4, N*4, matrix_layout::packed_b); // VNNI
tC = joint_matrix_mad(sg, tA, tB, tC);
}
auto wi_slice_c = matC.get_wi_data();
for (int i = 0; i < wi_slice_c.length(); i++)
wi_slice_c[i] *= alpha; // The indexing here “i” is in the vector owned by a WI, not in the matrix C
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
wi_slice_c[i] *= alpha; // The indexing here “i” is in the vector owned by a WI, not in the matrix C
wi_slice_c[i] *= alpha; // The indexing here "i" is in the vector owned by a WI, not in the matrix C

Minor nit: looks like you cut-and-paste this code from a Word document, which introduced non-ascii quote characters. They should be changed to standard double-quote characters.

joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, matrix_layout::row_major);
}).wait();
```
Expand Down Expand Up @@ -509,71 +599,38 @@ joint_matrix<int, msize, nsize> sub_c(sg);

## Future-looking API

### Matrix Initialization: `joint_matrix_fill`
The current interface presented above assumes that all the matrices are directly loaded from memory. This new function called `joint_matrix_fill` makes it possible to multiply a matrix which is not directly loaded from memory but rather initialized directly in the register. On Intel AMX, if the initialization constant is zero, this would map to `_tile_zero` intrinsic:

```c++
namespace sycl::ext::oneapi::experimental::matrix {
template <typename Group, typename T, size_t NumRows, size_t NumCols,
matrix_layout L>
void joint_matrix_fill(Group sg, joint_matrix<T, NumRows, NumCols, L, Group> &m, const T& v);
}
```

### Element Indexing and Element-Wise Operations
There are multiple options on how to enable this feature.
### Memory scope
The current experimental API uses `joint_` semantics to define the memory scope of the matrix. The long term solution is to use the proposed link:../supported/sycl_ext_oneapi_local_memory.asciidoc[`group_local_memory` extension] to allocate the matrix in local memory associated with a SYCL group as shown in the example below.

#### Option 1: Non-restrictive element indexing
Allowing non-restrictive element indexing on the matrix element as shown below would result into slow indexing on the GPU.
Besides, it will rely heavily on spirv and compiler vectorization:

```c++
matrix<int, 8, 8> C;
for (int i = 0; i < 8; i++)
for (int j = 0; j < 8; j++)
C(i,j) *= alpha; //Align with mdspan
```
#### Option2: Restrictive fast element indexing
In the DPC++ context, the expectation is that all element-wise operations will happen in a converged control path by all work items in the group.
Option 2 proposes a new set of element-wise operations by overloading existing operations to work on `matrix` object. An example is shown below:
```c++
joint_matrix<ONEAPI::sub_group, int, 8, 8> C(sg);
C *= alpha;
multi_ptr<matrix<T>, address_space::local_space> tA_ptr = group_local_memory<matrix<sub_group, int8_t, tM, tN>>(sg);
```
The problem with this option is that it is restrictive to a very limited set of operations.

#### Option3: Restrictive conversion in the interface from SIMD to SPMD
Nvidia wmma interface added a new member to `fragment` class to designate the WI owned part of the matrix.
While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix.
However using the `mma` ptx instructions as opposed to the `wmma` ptx instructions the mapping is known. Knowing this mapping is important for the user to implement new operations like sum of rows of a matrix for quantized algorithms.
We did not utilize this extension for this matrix API version because sub-group local memory is not yet well defined in {dpcpp}. Moreover, the representation of this notion in LLVM IR and SPIR-V is not clear yet.

#### proposal: Explicit conversion in the interface from SIMD to SPMD
We introduce a new function `get_wi_data` that provides any portion of the matrix that the user wants but in a SPMD array object:.
### WI data to joint matrix mapping coordinates information for piece-wise operations
The indexing provided inside the `wi_data` class acesses only the portion of the current WI. It is not possible the location or coordinates of this portion in the original matrix. This coordinates mapping is implementation defined and change from one backend to the other. For general piece-wise operations like sum of rows of a matrix, the WI data to joint matrix mapping coordinates information is needed to reason about the matrix view.
With joint matrix, we want to write, as much as possible, one code to run on different backends. So if backend X states that a WI owns one exact row of the matrix for instance. Writing the following code will work only on that backend for that version of hardware. The hardware and implementations change, for instance, the same WI can own half of the row because SG size increased or hardware units increased.

```c++
namespace sycl::ext::oneapi::experimental::matrix {
template <typename Group, typename T, size_t NumRows, size_t NumCols, matrix_layout L>
marray<T, n_rows * n_cols> get_wi_data(joint_matrix<T, NumRows, NumCols, L, Group> &m, size_t row_index,
size_t col_index, size_t n_rows, size_t n_cols);
auto data = C.get_wi_data();
for (int i = 0; i < length; ++i) {
sum_of_local_rows[row] += data[i];
}
```

Example where each WI gets 1 column:
```c++
marray<T,msize> wi_C = get_wi_data(C, 0, wi_idx, msize, 1, matrix_layout::row_major);
for (int i = 0; i < msize; i++)
row_sum += wi_C[i];
```


### Memory scope
The current experimental API uses `joint_` semantics to define the memory scope of the matrix. The long term solution is to use the proposed link:../supported/sycl_ext_oneapi_local_memory.asciidoc[`group_local_memory` extension] to allocate the matrix in local memory associated with a SYCL group as shown in the example below.

We want to keep backward compatibility in the joint matrix code when implementations or hardware change. To that end, instead of hard-code this mapping, we write general backend and target-agnostic, especially in the JIT compilation mode of SYCL. This is possible by querying this mapping so code does not have to change from one version to the other.

So for the mapping problem, since this mapping is implementation-defined, one of the proposals is to add runtime functions like:
```c++
multi_ptr<matrix<T>, address_space::local_space> tA_ptr = group_local_memory<matrix<sub_group, int8_t, tM, tN>>(sg);
auto data = C.get_wi_data();
for (int i = 0; i < length; ++i) {
auto row, col = data[i].get_coord();
sum_of_local_rows[row] += data[i];
}
```
We did not utilize this extension for this matrix API version because sub-group local memory is not yet well defined in {dpcpp}. Moreover, the representation of this notion in LLVM IR and SPIR-V is not clear yet.


## Open Questions
Expand All @@ -585,7 +642,7 @@ We did not utilize this extension for this matrix API version because sub-group
- In the future looking APIs, `get_wi_data` (that is currently under design) returns an owned object. Should this return a view object to make sure the original matrix C is changed after its slices are modified.

## TODO List
- Add support for fill matrix and element-wise operations features
- Add WI data to joint matrix mapping coordinates information for piece-wise operations. This will be added as part of the query or new methods to the 'get_wi_data' class.
- Add 'matrix_use' parameter to the matrix to distinguish between matrix A, B, and matrix accumulator. This is necessary for supporting VNNI and transpose transform
- Change the names default sizes in the query from defaultM, defaultN, defaultK to M,N,K
- Change the type of `scope` in the query interface to be able to return more than one value. This will be useful in the event we support other scopes like workgroup besides subgroups
Expand All @@ -599,4 +656,5 @@ We did not utilize this extension for this matrix API version because sub-group
|Rev |Date |Author |Changes
|1 |2021-04-13 |Dounia Khaldi |Initial public working draft.
|2 |2021-10-05 |Dounia Khaldi |JIT implementation on both Intel AMX and DPAS
|3 |2022-05-16 |Dounia Khaldi |Add matrix fill and piece-wise operations support
|======================