Weight gradient kernels for dense and MoE models #95
Conversation
I plan to merge it after #94, thanks!
To clarify, you can refer to the profile-data repo for a performance comparison against the internal CUTLASS implementation.
Why is the difference nearly twice as large? --update--
@hxdtest Thank you very much for your feedback. |
Thank you for your reply. After fixing the test code, the results are close.
Fantastic work! I used DeepGEMM and built an FP8 Linear layer to replace
@hxdtest Can you please share your Linear layer wrapper as a quick-start util? It would be helpful.
Thanks for your work on the FP8 linear module. But the implementation contains many unfused kernels, e.g. Just a reminder, in case you care about end-to-end performance :)
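The wrapper itself isn't shown in the thread, but the "unfused kernels" remark above is easy to see in the typical quantize-then-matmul pattern such a layer follows. Below is a hedged NumPy sketch of that pattern with simulated per-tensor E4M3 scaling (FP8 rounding itself is omitted for brevity); the names and scaling scheme are illustrative assumptions, not DeepGEMM's API:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def quantize_fp8(x: np.ndarray):
    """Per-tensor scaling so values fit the E4M3 range.

    Simulated: values are scaled and clipped, but mantissa
    rounding to actual FP8 is omitted for brevity.
    """
    scale = max(float(np.abs(x).max()) / E4M3_MAX, 1e-12)
    return np.clip(x / scale, -E4M3_MAX, E4M3_MAX), scale


def fp8_linear(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Naive 'unfused' FP8 linear: quantize each operand, matmul,
    then rescale. On a GPU each of these steps would be a separate
    kernel launch -- the overhead the comment above refers to."""
    xq, sx = quantize_fp8(x)
    wq, sw = quantize_fp8(w)
    return (xq @ wq.T) * (sx * sw)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)).astype(np.float32)  # [batch, in]
w = rng.standard_normal((8, 16)).astype(np.float32)  # [out, in]
y = fp8_linear(x, w)                                 # [batch, out]
```

A fused implementation would fold the scaling into the GEMM epilogue instead of materializing intermediate tensors.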
This Pull Request introduces `deepgemm.wgrad_gemm_fp8_fp8_fp32_nt` and `k_grouped_wgrad_gemm_fp8_fp8_fp32_nt`, optimized weight gradient kernels for dense and MoE models. These kernels achieve a ~20% speedup compared to the internal CUTLASS implementation. For detailed usage, refer to the function documentation.
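The kernel names suggest an NT-layout weight-gradient GEMM with FP8 operands and FP32 accumulation. As a rough functional reference of what the dense variant computes (dW = dYᵀ·X) — not the CUDA kernel itself, and with shapes chosen purely for illustration:

```python
import numpy as np


def wgrad_reference(dy: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Functional reference for a dense weight-gradient GEMM.

    dy: output gradient, shape [M, N]
    x:  layer input,     shape [M, K]
    Returns dW with shape [N, K], accumulated in float32
    (the real kernel consumes FP8 operands and writes FP32).
    """
    return dy.astype(np.float32).T @ x.astype(np.float32)


rng = np.random.default_rng(0)
dy = rng.standard_normal((8, 4)).astype(np.float32)
x = rng.standard_normal((8, 16)).astype(np.float32)
dw = wgrad_reference(dy, x)  # shape [4, 16]
```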
- Weight gradient GEMMs for dense models
- Grouped weight gradient GEMMs for MoE models
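For the MoE case, "grouped" plausibly means one weight-gradient GEMM per expert over that expert's slice of tokens. A hedged NumPy sketch of that semantics (the grouping scheme and shapes here are illustrative assumptions, not the kernel's actual interface):

```python
import numpy as np


def grouped_wgrad_reference(dy, x, group_sizes):
    """Per-expert weight gradients for an MoE layer.

    dy: [M_total, N] output gradients, tokens sorted by expert
    x:  [M_total, K] inputs, in the same token order
    group_sizes: number of tokens routed to each expert
                 (sums to M_total)
    Returns a list of [N, K] gradients, one per expert,
    accumulated in float32.
    """
    grads, start = [], 0
    for m in group_sizes:
        sl = slice(start, start + m)
        grads.append(dy[sl].astype(np.float32).T
                     @ x[sl].astype(np.float32))
        start += m
    return grads


rng = np.random.default_rng(1)
dy = rng.standard_normal((10, 4)).astype(np.float32)
x = rng.standard_normal((10, 8)).astype(np.float32)
grads = grouped_wgrad_reference(dy, x, [3, 7])  # two experts
```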