CUDA: Add fastdiv to k_bin_bcast*, giving 1-3% E2E performance (#15872)
Conversation
…15872)

* Add fastdiv and fastmodulo to `k_bin_bcast` kernel
* Address review comments
* `prod_` instead of `prod` suffix
* Add test case for `k_bin_bcast_unravel` in CUDA backend
This PR applies `fastdiv` and `fastmodulo`, introduced by #15715, to `k_bin_bcast` and `k_bin_bcast_unravel`, giving around a 1-3% E2E performance improvement on Ada Lovelace and Blackwell GPUs.

While changing the host logic in `launch_bin_bcast_pack` I was surprised to see that we keep `ne*` in 64-bit precision but use only the 32 least significant bits in the actual kernel. This could potentially lead to semantic bugs where we do not iterate over all elements of `src0`/`src1`, or am I missing something here?

Perf numbers