CUDA: MoE helper in device code, better tile sizes #15525
JohannesGaessler merged 6 commits into ggml-org:master
Conversation
Looking at the code I realized that it's possible to reduce the number of tiles/CUDA blocks that need to be considered for MoE. These tiles/CUDA blocks would be skipped regardless, but you can squeeze out a few more % if you never need to do this check in the first place. (I updated the table in the OP.)
ggml/src/ggml-cuda/common.cuh (outdated)

```diff
 template<int width = WARP_SIZE>
 static __device__ __forceinline__ int warp_reduce_all(int x) {
-#ifdef GGML_USE_HIP
+#ifndef GGML_USE_HIP
```
Why restrict this to CUDA? Might as well use `__all` on HIP and check `ggml_cuda_get_physical_warp_size`; even if we never use this with width == 64, that should still help RDNA.
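A sketch of what the suggestion could look like (illustrative only, not the merged code; it assumes `WARP_SIZE` and `ggml_cuda_get_physical_warp_size` as used elsewhere in `common.cuh`): use the hardware vote intrinsic on both backends instead of restricting it to CUDA, with a shuffle-based fallback for sub-warp widths.

```cuda
// Illustrative sketch only: warp-wide "all" reduction using the vote
// intrinsic on both CUDA and HIP.
template<int width = WARP_SIZE>
static __device__ __forceinline__ int warp_reduce_all(int x) {
#ifdef GGML_USE_HIP
    if (width == ggml_cuda_get_physical_warp_size()) {
        return __all(x); // covers wave32 (RDNA) and wave64 (CDNA)
    }
    #pragma unroll
    for (int offset = width/2; offset > 0; offset >>= 1) {
        x = x && __shfl_xor(x, offset, width);
    }
    return x;
#else
    if (width == WARP_SIZE) {
        return __all_sync(0xFFFFFFFF, x); // CUDA vote intrinsic
    }
    #pragma unroll
    for (int offset = width/2; offset > 0; offset >>= 1) {
        x = x && __shfl_xor_sync(0xFFFFFFFF, x, offset, width);
    }
    return x;
#endif // GGML_USE_HIP
}
```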
I just wasn't aware of the `__all` intrinsic.
ggml/src/ggml-cuda/common.cuh (outdated)

```cpp
template<int width = WARP_SIZE>
static __device__ __forceinline__ int warp_reduce_any(int x) {
#ifndef GGML_USE_HIP
```
I think the problem was that I was misusing the warp reduce functions. I forgot that the default width is `WARP_SIZE`.
Yep, `warp_reduce_any<warp_size>` is needed.
Should be ok now I think, let me check.
Correctness is fine and performance looks fine too. CDNA:

(benchmark results omitted)
I'm noticing that for very large batch sizes there are issues with shared memory limits. On Pascal it should be …
Previously I was storing the token position in the physical batch and the expert index as 32-bit integers. However, that many bits are not needed: if you store the token position with 22 bits and the expert index with 10 bits, you only need half as much shared memory, and the maximum physical batch size becomes twice as high as I outlined in my previous post, which I think will be enough. The assigned bits only become insufficient for a physical batch size > 4M or more than 1024 used experts, which should be well above the values that the code needs to run for.
* CUDA: MoE helper in device code, better tile sizes
* reduce superfluous CUDA blocks
This PR moves the helper code for the MoE MMQ kernel from host code to device code, which eliminates the need for synchronization. It also sets a tighter bound on the maximum number of columns that the kernel selection logic optimizes for, so there is less waste at small batch sizes.
Performance changes
@IMbackK if you could also check performance that would be appreciated.