{"pull_number":"20644","title":"ggml-cuda: Add NVFP4 dp4a kernel","body":"This PR adds the initial plumbing for basic CUDA support for NVFP4: it includes a single NVFP4xQ8_1 dp4a kernel. MMA and Blackwell kernels are not included here; they are left for a separate PR.\r\n\r\n`vec_dot_mma` is linked to the dp4a kernel, so dp4a still runs even when BLACKWELL_MMA_AVAILABLE is set.\r\nThere is a branch for NVFP4 in `mmq_write_back_mma` that converts results back to the dp4a layout; it will be removed once the MMA kernel is wired in.\r\n\r\nThe kernel was tuned to extract as much performance as possible from dp4a. Comparisons below.\r\n**CPU vs DP4A**\r\n\r\n| Model | CPU (pp64) | DP4A (pp64) | Speedup | CPU (tg16) | DP4A (tg16) | Speedup |\r\n|---|---:|---:|---:|---:|---:|---:|\r\n| Qwen3.5-0.8B | 59.63 | 3557.08 | **59.65x** | 32.33 | 329.60 | **10.19x** |\r\n| Qwen3.5-27B | 1.27 | 594.14 | **467.83x** | 1.08 | 52.65 | **48.75x** |\r\n\r\n**pp512 / tg128**\r\n\r\n| Model | pp512 t/s | tg128 t/s |\r\n|---|---:|---:|\r\n| Qwen3-4B | 9031.22 | 271.69 |\r\n| Qwen3-8B | 4909.27 | 175.93 |\r\n| Qwen3.5-27B | 1482.89 | 63.47 |\r\n| Qwen3.5-0.8B | 25596.92 | 388.12 |\r\n| Qwen3.5-0.8B-Q4_K_M | 38339.57 | 521.69 |\r\n\r\nAI assistance was used in refactoring and writing some of this code. Each line has been scrutinized, and the result was hand-edited to be as neat and minimal as possible. `test-backend-ops` passes; CPU<>GPU parity was tested with a separate tool and is exact across multiple tile sizes, and KLD/PPL was verified on several models and is as expected (tested on GPU only; CPU is too slow).","pull_head_sha":"1d9aa514d5aca8f7670636fd6af8791e6810e99c","loci_pr_branch":"loci/pr-20644-nvfp4-dp4a","short_merge_base":"49bfdde","loci_main_branch":"loci/main-49bfdde","use_loci_base":0}