[SYCL][Matrix] Add documentation about new matrix features #6157
@@ -53,7 +53,7 @@ value to determine which of the extension's APIs the implementation supports.
|======================
|Value |Description
|1 |Initial extension implementation on Intel AMX. Base features are supported.
-|2 |Initial extension JIT implementation on Intel AMX and DPAS. load, store, mad and the query interface are supported
+|2 |Initial extension JIT implementation on Intel AMX and DPAS. load, store, mad, fill, piece-wise operations, and the query interface are supported
|======================

## New `joint_matrix` class

@@ -165,6 +165,85 @@ namespace sycl::ext::oneapi::experimental::matrix {
The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulate the result with `C` and return the result.

+#### Matrix Initialization: `joint_matrix_fill`
+The current interface presented above assumes that all the matrices are directly loaded from memory. This new function called `joint_matrix_fill` makes it possible to multiply a matrix which is not directly loaded from memory but rather initialized directly in the register. On Intel AMX, if the initialization constant is zero, this would map to `_tile_zero` intrinsic:

+```c++
+namespace sycl::ext::oneapi::experimental::matrix {
+  template <typename Group, typename T, size_t NumRows, size_t NumCols,
+            matrix_layout L, typename Tv>
+  void joint_matrix_fill(Group sg, joint_matrix<T, NumRows, NumCols, L, Group> &m, const Tv v);
+}
+```
+IMPORTANT: In the current implementation, only the subgroup scope is supported.

+#### Element Indexing and Piece-Wise Operations
+Besides matrix multiply and add, matrices are used in linear and non linear piece-wise operations. Activation functions are an example of element-wise operations. They can be linear like `ReLU` that, for each value `z`, returns the maximum between `z` and zero, or non linear like `Sigmoid` that calculates `1/(1+ exp(-z))`. Quantization that is needed for conversion between low precision types like `int8_t` and `fp32` uses piece-wise operations. For instance, quantized GEMM for `int8_t` is calculated using `A*B + sum_rows_A + sum_cols_B + scalar_zero_point`. `sum_rows_A` and `sum_cols_B` do not operate on elements of the matrix but on pieces: row in `sum_rows_A` and columns in `sum_cols_B`.
-Besides matrix multiply and add, matrices are used in linear and non linear piece-wise operations. Activation functions are an example of element-wise operations. They can be linear like `ReLU` that, for each value `z`, returns the maximum between `z` and zero, or non linear like `Sigmoid` that calculates `1/(1+ exp(-z))`. Quantization that is needed for conversion between low precision types like `int8_t` and `fp32` uses piece-wise operations. For instance, quantized GEMM for `int8_t` is calculated using `A*B + sum_rows_A + sum_cols_B + scalar_zero_point`. `sum_rows_A` and `sum_cols_B` do not operate on elements of the matrix but on pieces: row in `sum_rows_A` and columns in `sum_cols_B`.
+Besides matrix multiply and add, this extension aims to make it possible to perform piece-wise operations on matrices in a SPMD manner. The mechanisms that are recommended to perform such piece-wise operations depend upon which of the following classes the operation falls into:
+Class "1". Element-wise operations that are performed identically on every element of the matrix.
+Class "2". Element-wise operations that depend on the element index of the matrix or operations that take multiple elements as operands (such as a sum of all elements in a row for example).
+This extension currently only supports case 1). However a proposal for supporting 2) (for some backends) in the future is provided in a later section.
Then continue with the explanation of how case 1) is dealt with. Case 2) seems to have been considered in section "### WI data to joint matrix mapping coordinates information for piece-wise operations" and requires that the backend knows the mapping from "joint_matrix Domain" to "WI Domain".
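To make the distinction between the two classes concrete, here is a minimal host-only sketch in plain C++. It does not use the SYCL matrix API: `distributed_matrix`, the round-robin ownership layout, and `linear_index` are hypothetical stand-ins chosen for illustration, and a real backend picks its own layout. `relu` (class 1) never needs to know where an element sits in the matrix, while `row_sums` (class 2) cannot work without the slice-to-matrix mapping.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model: a rows x cols matrix whose elements are dealt
// out round-robin to num_wi work-items. Real backends choose their own layout.
struct distributed_matrix {
  std::size_t rows, cols, num_wi;
  std::vector<std::vector<float>> slices; // slices[w] = elements owned by w

  distributed_matrix(std::size_t r, std::size_t c, std::size_t n,
                     const std::vector<float> &row_major)
      : rows(r), cols(c), num_wi(n), slices(n) {
    for (std::size_t i = 0; i < row_major.size(); ++i)
      slices[i % n].push_back(row_major[i]);
  }

  // The mapping that class 2 operations need: slot s of work-item w
  // corresponds to this row-major linear index in the matrix.
  std::size_t linear_index(std::size_t w, std::size_t s) const {
    return s * num_wi + w;
  }
};

// Class 1: ReLU. Each work-item only touches its own slice; no coordinates.
inline void relu(distributed_matrix &m) {
  for (auto &slice : m.slices)
    for (float &v : slice)
      v = std::max(v, 0.0f);
}

// Class 2: per-row sums. Each element must be attributed to its row, so the
// slice-to-matrix mapping is required.
inline std::vector<float> row_sums(const distributed_matrix &m) {
  std::vector<float> sums(m.rows, 0.0f);
  for (std::size_t w = 0; w < m.num_wi; ++w)
    for (std::size_t s = 0; s < m.slices[w].size(); ++s)
      sums[m.linear_index(w, s) / m.cols] += m.slices[w][s];
  return sums;
}
```

In this model the class 2 operation only works because the layout is known to the caller, which is the point made in the comment above about the backend having to expose the mapping.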
Basically, case 1) doesn't require a mapping between get_data and joint_matrix, but case 2) does.
I will make the change, thanks
The paragraph beginning "We explored" can probably be removed as well.
Okay
It would be good to remove this whole paragraph beginning "Nvidia wmma interface" because it (or something similar) is more appropriate for the CUDA backend spec.
Okay
The remaining paragraphs above seem inappropriate for an API specification. The audience for this document wants to know what this API does and how to use it. However, these paragraphs seem more like a justification for why this API was chosen vs. some other possibility. That's not really the purpose of this document. I'd suggest either removing them or moving them to a new section towards the bottom of the document titled something like "Background on the element indexing operations".
I will move these into a background subsection, but I will keep that subsection within this section. Let's see if it looks better.
This sentence should be expanded to explain the purpose of this API better. The reader needs to understand that each work-item contains only a subset of the elements in the matrix. The sentence above sort of mentions this, but I think it could be clearer. For example:
The data elements in a joint_matrix are distributed across the work-items in the Group in an implementation-defined way, such that each work-item owns a unique subset of the data elements. An application can use the APIs in this section to access the data elements owned by each work-item. This is especially useful for algorithms that operate on each data element individually.
I think this last sentence could replace the first paragraph you have "Besides matrix multiply and add, matrices are used in linear ...". However, if you think there's more to say about when these APIs are useful, you could add some more sentences here explaining it.
Then finish up by saying something like:
The code listing below shows a synopsis of these new APIs.
I already added more clarifications based on Jack's first review. I will add more based on your input as well. However, note that "such that each work-item owns a unique subset of the data elements" is not always true, as in the AMX case for instance. There, a matrix is allocated in a 2D register tile that is shared by the whole subgroup (a register in this case).
However, note that "such that each work-item owns a unique subset of the data elements" is not always true, as in the AMX case for instance. There, a matrix is allocated in a 2D register tile that is shared by the whole subgroup (a register in this case).
Are you saying that when one work-item calls get_wi_data that it might get overlapping elements that are also returned from some other work-item's call to get_wi_data? If this is the case, I don't see how this API is very useful. For example, code like this would result in some elements being incremented twice:
auto wi_data_c = matC.get_wi_data();
for (int i = 0; i < wi_data_c.length(); i++)
  wi_data_c[i] += 1;
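The concern in this comment can be checked with a small host-only model. The sketch below assumes, purely for illustration, that work-items own disjoint strided subsets of the matrix; `wi_view`, `get_view`, and `increment_all` are hypothetical names and are not part of the extension. Under that assumption, running the loop above once per work-item increments every element exactly once.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the per-work-item view returned by get_wi_data().
// It holds pointers to the matrix elements that this work-item owns.
class wi_view {
  std::vector<int *> owned_;

public:
  explicit wi_view(std::vector<int *> owned) : owned_(std::move(owned)) {}
  std::size_t length() const { return owned_.size(); }
  int &operator[](std::size_t i) { return *owned_[i]; }
};

// Assumed partition: work-item wi of num_wi owns every num_wi-th element,
// so the subsets of different work-items never overlap.
inline wi_view get_view(std::vector<int> &matrix, std::size_t wi,
                        std::size_t num_wi) {
  std::vector<int *> owned;
  for (std::size_t i = wi; i < matrix.size(); i += num_wi)
    owned.push_back(&matrix[i]);
  return wi_view(std::move(owned));
}

// Emulates every work-item running the increment loop on its own view.
inline void increment_all(std::vector<int> &matrix, std::size_t num_wi) {
  for (std::size_t wi = 0; wi < num_wi; ++wi) {
    wi_view wi_data_c = get_view(matrix, wi, num_wi);
    for (std::size_t i = 0; i < wi_data_c.length(); i++)
      wi_data_c[i] += 1;
  }
}
```

If two views shared an element, that element would end up incremented twice, which is exactly the failure mode the comment asks about.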
Are you saying that when one work-item calls get_wi_data that it might get overlapping elements
No, this is not possible.
The purpose of these code synopses is to show the API, not the implementation. Therefore, remove the function bodies and all the private data members. For example:
namespace sycl::ext::oneapi::experimental::matrix {
template <typename T, size_t NumRows, size_t NumCols,
          matrix_layout Layout = matrix_layout::row_major,
          typename Group = sycl::sub_group>
struct joint_matrix {
  wi_data<T, NumRows, NumCols, Layout, Group> get_wi_data();
};
/* ... */
} // namespace sycl::ext::oneapi::experimental::matrix
Will do that, thanks
Do wi_data and wi_element really need all these template parameters? It seems like it would be easier to use if the only template parameter was T. It seems like the other template parameters are only there because there is a private data member M (a reference to matrix). However, you only seem to use M.spvm in the function bodies. Could you instead just store the spvm member directly in wi_data and wi_element?
Can you elaborate on your suggestion? I get what you want to do but did not get the how.
Sorry for the delayed response. I was OOO for about 2 months and just got back recently.
I was thinking that these types could be simplified to have fewer template parameters, so the goal would be to have an API like this:
template <typename T>
class wi_data {
  size_t length();
  wi_element<T> operator[](size_t i);
};
template <typename T>
class wi_element {
  operator T();
  wi_element &operator=(const T &rhs);
};
Very roughly, I was thinking that you could accomplish this by changing the private data member included in wi_data and wi_element. Currently, these both contain a reference to the joint matrix M. However, it seems like they only need to use M.spvm. Therefore, could you change the implementation to hold just the spvm like:
template <typename T>
class wi_data {
  /* not sure what type */ spvm;
public:
  size_t length() { return __spirv_JointMatrixWorkItemLengthINTEL(spvm); }
  wi_element<T> operator[](size_t i) {
    return wi_element<T>(spvm, i);
  }
};
template <typename T>
class wi_element {
  /* not sure what type */ spvm;
  std::size_t idx;
public:
  operator T() {
    return __spirv_VectorExtractDynamic(spvm, idx);
  }
  wi_element &operator=(const T &rhs) {
    // Store into the member directly; there is no longer a matrix reference M.
    spvm = __spirv_VectorInsertDynamic(spvm, rhs, idx);
    return *this;
  }
};
@gmlueck
It looks like the spvm type also needs these template parameters, so I don't think we can reduce them:
__spv::__spirv_JointMatrixINTEL<
    T, NumRows, NumCols, spv_matrix_layout_traits<Layout>::value> *spvm;
That's too bad.
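For readers unfamiliar with the proxy-reference pattern discussed in this thread, the following host-only sketch shows the single-parameter shape that was proposed. It is an illustration under a loud assumption: the opaque SPIR-V matrix handle is replaced by a plain `std::vector<T>`, which is why one template parameter is enough here. As the reply above explains, the real handle type depends on all of the matrix template parameters, so this simplification does not carry over to the actual implementation. Constructors are left public for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Proxy for one element: reads convert to T, writes go back to the storage.
template <typename T> class wi_element {
  std::vector<T> &storage_; // stand-in for the opaque matrix handle
  std::size_t idx_;

public:
  wi_element(std::vector<T> &storage, std::size_t idx)
      : storage_(storage), idx_(idx) {}
  operator T() const { return storage_[idx_]; }
  wi_element &operator=(const T &rhs) {
    storage_[idx_] = rhs;
    return *this;
  }
};

// View over the elements owned by one work-item; indexing yields a proxy.
template <typename T> class wi_data {
  std::vector<T> &storage_;

public:
  explicit wi_data(std::vector<T> &storage) : storage_(storage) {}
  std::size_t length() const { return storage_.size(); }
  wi_element<T> operator[](std::size_t i) {
    return wi_element<T>(storage_, i);
  }
};
```

A proxy is needed because `operator[]` cannot return a plain `T &` when the underlying storage is only reachable through extract and insert operations.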
I presume we do NOT want application code to construct a wi_data (or wi_element) directly? Instead, I presume we want application to call joint_matrix::get_wi_data to get the wi_data? If that is the case, these constructors should be private in the implementation, and joint_matrix should be a friend, so that it can construct the objects.
The code synopsis, then, would only list the public member functions:
template <typename T, size_t NumRows, size_t NumCols, matrix_layout Layout, typename Group>
class wi_data {
public:
  size_t length();
  wi_element<T, NumRows, NumCols, Layout, Group> operator[](size_t i);
};
template <typename T, size_t NumRows, size_t NumCols, matrix_layout Layout, typename Group>
class wi_element {
public:
  operator T();
  wi_element &operator=(const T &rhs);
};
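The private-constructor-plus-friend arrangement suggested in this comment can be sketched in isolation as follows. The names `matrix`, `view`, and `get_view` are hypothetical and only mirror the roles of `joint_matrix`, `wi_data`, and `get_wi_data`; the fixed size of 4 is arbitrary. The `static_assert` checks that code outside the friend cannot construct the view from raw arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

class matrix; // hypothetical owner type

class view {
  int *data_;
  std::size_t size_;
  // Private: application code cannot build a view from raw arguments.
  view(int *data, std::size_t size) : data_(data), size_(size) {}
  friend class matrix; // only matrix may construct a view

public:
  std::size_t length() const { return size_; }
  int &operator[](std::size_t i) { return data_[i]; }
};

class matrix {
  int elems_[4] = {0, 0, 0, 0};

public:
  view get_view() { return view(elems_, 4); }
};

// std::is_constructible performs the access check from an unrelated context,
// so it reports false for the private constructor.
static_assert(!std::is_constructible<view, int *, std::size_t>::value,
              "view must only be obtainable through matrix::get_view()");
```

The public synopsis then only needs to list `length()` and `operator[]`, as in the listing above.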
After the code synopsis, there should be some description of the member functions. I'd suggest three tables, one for each class:
- Table describing member functions of joint_matrix (get_wi_data)
- Table describing member functions of wi_data
- Table describing member functions of wi_element.
You can see an example here: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/proposed/sycl_ext_oneapi_device_global.asciidoc#representation-of-device-globals
(Scroll down to the table after the code synopsis.)
Since we don't have that many members, I added the descriptions in the text. Let me know if that looks sufficient.
-wi_data_c[i] *= alpha; // Note that the indexing here “i” is in the vector owned by a WI, not in the matrix C
+wi_data_c[i] *= alpha; // Note that the indexing here "i" is in the vector owned by a WI, not in the matrix C
-wi_slice_c[i] *= alpha; // The indexing here “i” is in the vector owned by a WI, not in the matrix C
+wi_slice_c[i] *= alpha; // The indexing here "i" is in the vector owned by a WI, not in the matrix C
Minor nit: looks like you cut-and-paste this code from a Word document, which introduced non-ascii quote characters. They should be changed to standard double-quote characters.
It does not make sense to use const here when passing a parameter by value.
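The remark applies to the `const Tv v` parameter of `joint_matrix_fill` in the diff. As a small standalone illustration (the function `twice` is made up for this example), top-level `const` on a by-value parameter is not part of the function type in C++, so it has no effect on callers and only restricts the function body from modifying its own copy.

```cpp
#include <cassert>
#include <type_traits>

// Two declarations of the same function: top-level const on a by-value
// parameter is ignored when forming the function type.
int twice(int v);
int twice(const int v); // redeclaration, not an overload

// In the definition, const only prevents the body from modifying its copy.
int twice(const int v) { return 2 * v; }

static_assert(std::is_same<int(int), int(const int)>::value,
              "top-level const does not change the function signature");
```

For that reason an API synopsis usually omits the qualifier and declares the parameter simply as `Tv v`.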