[SYCL][Matrix] Add documentation about new matrix features #6157
```diff
@@ -182,14 +182,17 @@ Besides matrix multiply and add, matrices are used in linear and non linear piec
 We explored multiple options to enable this feature in the matrix interface: 1) Allowing non-restrictive element indexing on the matrix elements would result into slow indexing on the GPU, 2) Operator overloading can represent only element-wise operations and not the operations on pieces (row, column, diagonal, etc) of the matrix. 3) Providing specific functions for these piece-wise operations can resolve some of the functions we know of today like the ones involved in quantization but it is not general to any problem that may occur in the future.
-In order to be able to perform any piece-wise operation in a general and in an efficient way, we provide a mapping conversion function from the matrix domain that is owned by a group of work items to the portion that is owned by a work item. Besides, the WI data to joint matrix mapping coordinates information must be known.
+In order to be able to perform any piece-wise operation in a general and efficient way, we provide a mapping conversion function from the matrix domain that is owned by a group of work items to the portion that is owned by each work item. This enables the WI to perform piece-wise operations on the matrix within the SYCL SPMD programming model. In joint matrix, it is up to the implementation to distribute the matrix among the work items or keep it shared. For instance, the matrix is a shared entity among the work items in the case of AMX where the AMX tile that holds the matrix data is a 2d register that is shared among the work items. Therefore the partitioning among the WIs is implementation defined. Here, for the purpose of piece-wise operations, the conversion to the SPMD model happens using the matrix elements to WI mapping. Besides, the WI data to joint matrix mapping coordinates information must be known in order to extract the relevant piece for operations like sum of rows of a matrix.
```
Suggested change:
```diff
-In order to be able to perform any piece-wise operation in a general and efficient way, we provide a mapping conversion function from the matrix domain that is owned by a group of work items to the portion that is owned by each work item. This enables the WI to perform piece-wise operations on the matrix within the SYCL SPMD programming model. In joint matrix, it is up to the implementation to distribute the matrix among the work items or keep it shared. For instance, the matrix is a shared entity among the work items in the case of AMX where the AMX tile that holds the matrix data is a 2d register that is shared among the work items. Therefore the partitioning among the WIs is implementation defined. Here, for the purpose of piece-wise operations, the conversion to the SPMD model happens using the matrix elements to WI mapping. Besides, the WI data to joint matrix mapping coordinates information must be known in order to extract the relevant piece for operations like sum of rows of a matrix.
+There is no fixed allocation of matrix elements held by a `joint_matrix` instance to the WIs comprising the group used to instantiate it. For instance, the matrix is a shared entity among the WIs in the case of the AMX backend because the AMX tile that holds the matrix data is a 2d register that is shared among the work items. However it is necessary to allocate WIs to specific elements of the matrix. In order to be able to perform piece-wise operations in a general and efficient way, we provide a conversion function from the joint_matrix domain that is owned by a group of work items to the portion that is owned by each work item. This enables the WI to perform piece-wise operations on the matrix within the SYCL SPMD programming model. The mapping between the matrix data owned by a WI and the matrix data owned by the joint matrix may not be known for all backends. As such, additional conditions are necessary for class "2" piece-wise operations that we propose in a later section...
```
Will do, thanks
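A minimal sketch of the piece-wise flow under discussion, assuming the `get_wi_data()` interface proposed later in this review (`matC` and the surrounding kernel scaffolding are assumed; the ReLU is just an arbitrary element-wise operation):

```c++
// Illustration only: apply an element-wise ReLU through the WI-owned view.
// get_wi_data(), length(), and indexed access follow the synopsis discussed
// below in this review; none of this is finalized API.
auto wi_data = matC.get_wi_data();         // WI-owned portion of the joint matrix
for (int i = 0; i < wi_data.length(); ++i) {
  float v = wi_data[i];                    // wi_element converts to the element type
  wi_data[i] = (v > 0.0f) ? v : 0.0f;      // writing through the view updates matC
}
```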
It would be good to remove this whole paragraph beginning "Nvidia wmma interface" because it (or something similar) is more appropriate for the CUDA backend spec.
Okay
The purpose of these code synopses is to show the API, not the implementation. Therefore, remove the function bodies and all the private data members. For example:
```c++
namespace sycl::ext::oneapi::experimental::matrix {

template <typename T, size_t NumRows, size_t NumCols,
          matrix_layout Layout = matrix_layout::row_major,
          typename Group = sycl::sub_group>
struct joint_matrix {
  wi_data<T, NumRows, NumCols, Layout, Group> get_wi_data();
};

/* ... */

} // namespace sycl::ext::oneapi::experimental::matrix
```
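For context, a hedged usage sketch of this synopsis (the queue setup, matrix shape, and the zero-fill are arbitrary choices for illustration, and the matrix loads/stores are omitted):

```c++
#include <sycl/sycl.hpp>
using namespace sycl::ext::oneapi::experimental::matrix;

void zero_accumulator(sycl::queue &q) {
  // One work-group of 16 work items, purely for illustration.
  q.parallel_for(sycl::nd_range<2>({1, 16}, {1, 16}), [=](sycl::nd_item<2> it) {
    sycl::sub_group sg = it.get_sub_group();
    joint_matrix<float, 8, 16> matC(sg);   // template parameters as in the synopsis
    auto wi_data = matC.get_wi_data();     // the per-WI view declared above
    for (int i = 0; i < wi_data.length(); ++i)
      wi_data[i] = 0.0f;                   // e.g., zero-initialize an accumulator
  });
}
```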
Will do that, thanks
After the code synopsis, there should be some description of the member functions. I'd suggest three tables, one for each class:
- Table describing member functions of `joint_matrix` (`get_wi_data`)
- Table describing member functions of `wi_data`
- Table describing member functions of `wi_element`
You can see an example here: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/proposed/sycl_ext_oneapi_device_global.asciidoc#representation-of-device-globals
(Scroll down to the table after the code synopsis.)
Since we don't have that many members, I added the description in the text. Let me know if that looks sufficient.
This sentence:
"Vectorization along the subgroup dimension will get enabled automatically to vectorize the contiguous portion of the matrix."
implies to me that the WI portion of joint_matrix represents contiguous elements of the matrix (either row or column major), but this is not generally true. In the Tensor Cores backend it is rarely true: in most cases the WI-owned portion of joint_matrix includes elements that are not contiguous. I'm not sure there is any value to this sentence, and I think it can be removed. Instead there could be:
Suggested change:
```diff
-Example where each WI gets a vector in `wi_data_c`. Vectorization along the subgroup dimension will get enabled automatically to vectorize the contiguous portion of the matrix. Since `wi_data_c` constructs a view of the joint matrix `matC`, `wi_data_c[i] OP rhs` updates the corresponding matrix element in the joint_matrix `matC`.
+In the following example `wi_data_c` is a reference to the WI owned portion of the joint matrix `matC`. As such `wi_data_c[i] OP rhs` updates the corresponding matrix element in the joint_matrix `matC`.
```
It does not imply that the WI portion of the joint matrix is contiguous; it implies a round-robin distribution among the WIs.
Let's take a matrix example of 3 rows x 2 cols, and assume an SG size of 2 just for illustration. To ensure vectorization along the SG dimension, each WI should own 1 column, so you have a SIMD width of 2 (1 row) in this case and 3 vector instructions (one for each row). Does this make sense?
Does the CUDA backend generate vector code for the loop around WI portions?
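To make that distribution concrete, a hedged illustration of the 3x2 example (the ownership layout is the round-robin scheme described above, not a guarantee made by any backend):

```c++
// 3 rows x 2 cols over a sub-group of size 2, each WI owning one column:
//
//            WI0 (col 0)   WI1 (col 1)
//   row 0:   M[0][0]       M[0][1]    <- vector instruction 1 (SIMD width 2)
//   row 1:   M[1][0]       M[1][1]    <- vector instruction 2
//   row 2:   M[2][0]       M[2][1]    <- vector instruction 3
//
// The per-WI loop below walks the 3 rows of the owned column; the
// vectorizable dimension is the sub-group (width 2).
auto wi_data = mat.get_wi_data();           // 3 elements per WI in this layout
for (int i = 0; i < wi_data.length(); ++i)
  wi_data[i] += 1;                          // executed in lock-step across the SG
```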
Clarification:
The current Tensor Cores implementation isn't AOT: it is JIT, just at a lower level than DPC++: DPC++ compiles down to CUDA "PTX" asm, which can then be compiled down to a lower-level asm called "SASS" at execution time. Despite this, the number of data elements in joint_matrix owned by each WI is known at compile time for all compilation versions. The thing that isn't known at compile time is the mapping between these WI data elements and the subgroup matrix owned by joint_matrix as a whole.
The other point is that, from the available information in this doc, I don't see that it follows that the WI-owned elements can be returned as an array if and only if "the number of data elements in joint_matrix owned by each WI is known at compile time". But this point is not important to dwell on, because I think such an explanation is not really necessary for this "IMPORTANT" note: the point here is to make the user aware that a limited number of backends (Tensor Cores only at the moment) can return the WI portion of the matrix as an `marray` (although, as specified earlier in the document, the order of the `marray` elements does not have a defined mapping to the order of the matrix owned by the joint_matrix subgroup as a whole).
Suggested change:
```diff
-IMPORTANT: Since the current tensorcores implementation is AOT, it is possible to know how many elements are owned by each WI at compile time. so `wi_data` in this case can be of type `marray`. An additional interface will be provided for the tensorcores AOT backend.
+IMPORTANT: The Tensor Cores implementation is capable of returning the complete set of matrix elements owned by a WI as an `marray`, which can be useful in cases where the user wishes to perform identical operations on every element of the matrix efficiently and conveniently, where SYCL math functions optimized for `marray` exist. An additional interface will be provided for the Tensor Cores backend to return the `marray`.
```
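As a hedged sketch of what such an `marray` interface could enable (`get_wi_marray` is a placeholder name, not part of this proposal, and the 4-element size assumes an 8x16 matrix over a warp of 32):

```c++
// Placeholder accessor: get_wi_marray() is illustrative, not proposed API.
// With the WI portion held in a sycl::marray, the marray overloads of the
// SYCL math functions apply one operation to every owned element at once.
sycl::marray<float, 4> wi_elems = matC.get_wi_marray(); // hypothetical accessor
wi_elems = sycl::fabs(wi_elems);                        // element-wise |x| via the
                                                        // marray overload of fabs
```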
When I say AOT, I am referring to the SYCL compilation flow: when users have to specify `-fsycl-targets`, this is AOT. JIT is when the compiler generates SPIR-V; at runtime the SPIR-V is translated to the target-specific offload binaries.
I will add a comment about the fact that the `get_wi_data` length cannot be `constexpr` because:
1. Under SYCL JIT compilation, the partitioning among WIs is implementation defined. There is no way to provide a constexpr length that is true for every backend (it has to be the return value of a SPIR-V function, so it is not constexpr).
2. The SG size is not fixed (unlike in the CUDA backend, where the warp size is always 32).
3. Even if we solve the first two, since the AMX tile size can be a runtime variable, the AMX case will still have to return a variable length.
In the case of the CUDA backend, which is SYCL AOT compiled and where the SG size is known and fixed at 32, the additional `marray` capability will be provided.
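A hedged illustration of that contrast (`get_wi_marray` is the same placeholder name as above, not proposed API):

```c++
// Generic interface: the per-WI length is only known at run time.
auto wi_data = matC.get_wi_data();
for (int i = 0; i < wi_data.length(); ++i)  // length() cannot be constexpr here
  wi_data[i] *= 2;

// Hypothetical CUDA-backend variant: with the SG size fixed at 32, the per-WI
// element count is a compile-time constant, e.g. 8 * 16 / 32 = 4 elements.
constexpr size_t N = 8 * 16 / 32;
sycl::marray<float, N> arr = matC.get_wi_marray();  // placeholder accessor
```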
Probably the paragraph beginning "We explored" can also be removed, I think.
Okay