UPSTREAM PR #19769: WIP: ggml : add NVFP4 quantization type support (#1194)
Conversation
Overview

Analysis of the NVFP4 quantization support implementation across 111,670 functions (42 modified, 12 new, 0 removed, 111,616 unchanged) in 15 binaries shows minimal system-wide performance impact. Overall power consumption increased 0.11% (+1.7 μJ).

Binary Power Consumption Changes:

Function Analysis

Quantization Kernels (Performance-Critical):
Model Loading Functions (Non-Critical):
STL Template Functions:
Other analyzed functions showed minor changes (20-80 ns) in initialization and utility paths, attributed to compiler artifacts rather than algorithmic modifications.

Additional Findings

Inference Hot Path Preserved: No changes to matrix multiplication (GEMM), attention mechanisms, or KV cache operations, which are the primary performance bottlenecks (70-90% of inference time). Quantization kernel regressions (25-32 ns per call) translate to <0.25% impact on per-token generation time.

Multi-Backend Integration: NVFP4 support added across the CUDA, Vulkan, and Metal backends with MoE compatibility, extending the 40+ quantization format ecosystem. The implementation leverages the existing kvalues_mxfp4 lookup table infrastructure.

Root Causes: Performance changes stem from: (1) binary layout effects on the instruction cache (quantization kernels), (2) STL template recompilation from enum expansion (model loading), and (3) GCC 13 ARM64 compiler optimizations (STL functions). No algorithmic inefficiencies or code quality issues were identified.

🔎 Full breakdown: Loci Inspector.
Remove NVFP4 support from GPU backends and architecture-specific optimized dot products. These should be added in separate PRs so backend specialists can review them independently.

Reverted files:
- ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh, quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh
- ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h, ggml-metal-ops.cpp
- ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/*
- ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c

Core NVFP4 support (type definition, CPU fallback dot product, quantization, dequantization, conversion) is retained.
After shelving backend-specific SIMD implementations, the generic CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390 platforms that previously relied on arch-specific versions.
Previously, values with ue4m3_exp <= 0 were clamped to 0, causing all small scales to underflow. This made NVFP4 quantization via llama-quantize produce garbage (PPL = 5.8M) since typical transformer weights have amax/6.0 in the range 0.001-0.01, which falls in the UE4M3 subnormal range. Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7), matching the decode path in ggml_ue4m3_to_fp32. Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33), comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15).
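The decode rule described above can be sketched in scalar C. This is illustrative only, assuming the standard E4M3 bit layout (4 exponent bits, 3 mantissa bits, bias 7, no sign bit) and ignoring any reserved top code point; the actual helper in the patch is ggml_ue4m3_to_fp32, whose internals are not shown here.

```c
#include <stdint.h>

// Sketch of UE4M3 (unsigned, 4 exponent + 3 mantissa bits, bias 7) decode.
// Illustrative reference, not the patch's ggml_ue4m3_to_fp32 itself.
static float ue4m3_to_fp32_ref(uint8_t x) {
    // 2^(e-7) for biased exponent e = 0..15 (entry 0 unused: e==0 is subnormal)
    static const float pow2[16] = {
        0x1p-7f, 0x1p-6f, 0x1p-5f, 0x1p-4f, 0x1p-3f, 0x1p-2f, 0x1p-1f, 0x1p0f,
        0x1p1f,  0x1p2f,  0x1p3f,  0x1p4f,  0x1p5f,  0x1p6f,  0x1p7f,  0x1p8f,
    };
    const int exp = (x >> 3) & 0x0F; // 4 exponent bits
    const int man = x & 0x07;        // 3 mantissa bits
    if (exp == 0) {
        // Subnormal: man * 2^-9 (= man/8 * 2^(1-7)). Before the fix, the
        // encode side clamped these codes to 0, collapsing all small scales.
        return (float) man * 0x1p-9f;
    }
    // Normal: (1 + man/8) * 2^(exp - 7)
    return (1.0f + (float) man / 8.0f) * pow2[exp];
}
```

With this rule, the smallest nonzero scale is 2^-9 ≈ 0.00195, which covers the amax/6.0 ≈ 0.001-0.01 range the commit message mentions.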
Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products. tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup
- Add ue4m3_scale_lut[128] to ggml-common.h, replacing the branch-heavy ggml_ue4m3_to_fp32() in the hot loop
- Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32
- Accumulate with vfmaq_f32 into float32x4_t vector accumulators

tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed)
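A 128-entry table covers every encodable UE4M3 scale, since the format has only 7 payload bits (4 exponent + 3 mantissa, no sign). The name ue4m3_scale_lut comes from the commit message; the construction below is an assumption matching the subnormal/normal decode rule, and it ignores any reserved top code point.

```c
#include <stdint.h>

// Hypothetical construction of the 128-entry UE4M3 -> fp32 scale table
// that replaces the branchy decode in the dot-product hot loop.
static float ue4m3_scale_lut[128];

static void init_ue4m3_scale_lut(void) {
    for (int x = 0; x < 128; x++) {
        const int exp = x >> 3; // biased exponent, 0..15
        const int man = x & 7;  // mantissa, 0..7
        ue4m3_scale_lut[x] = (exp == 0)
            ? (float) man * 0x1p-9f // subnormal: man * 2^-9
            // normal: (1 + man/8) * 2^(exp - 7)
            : (1.0f + (float) man / 8.0f) * (float)(1 << exp) * 0x1p-7f;
    }
}
```

The hot loop then replaces a branchy function call with a single indexed load, which is what enables keeping the block-scale multiply inside the vfmaq_f32 accumulation.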
Alternative approach: rearrange the q8 data to match the NVFP4 lo/hi nibble layout instead of rearranging the looked-up NVFP4 values. Eliminates the vcombine_s8(vget_low, vget_low) shuffles. Performance is equivalent (~18.5 t/s): the bottleneck is the 2x block overhead from QK=16 vs QK=32, not the shuffle instructions.
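A scalar view of the layout problem the two NEON variants are solving. The layout here is an assumption (byte j holds element j in its low nibble and element j+8 in its high nibble, as in other ggml 4-bit formats), and the table is the E2M1 code points scaled by 2 so entries stay integral, in the style of the kvalues_mxfp4 table mentioned above; a real kernel would fold that factor of 2 into the block scale.

```c
#include <stdint.h>

// Scalar sketch of one NVFP4 x q8 block dot product (layout assumed:
// low nibbles = elements 0..7, high nibbles = elements 8..15). A SIMD
// kernel must either shuffle the looked-up fp4 values into element
// order, or pre-rearrange the q8 operand into lo/hi order; both give
// the same sum, which is the equivalence the commit relies on.
static int dot_block_scalar(const uint8_t qs[8], const int8_t q8[16]) {
    // E2M1 code points {0, .5, 1, 1.5, 2, 3, 4, 6, -0, ...} scaled by 2
    static const int8_t kvalues[16] = {
        0, 1, 2, 3, 4, 6, 8, 12, 0, -1, -2, -3, -4, -6, -8, -12,
    };
    int sum = 0;
    for (int j = 0; j < 8; j++) {
        sum += kvalues[qs[j] & 0x0F] * q8[j];     // low nibble: element j
        sum += kvalues[qs[j] >>   4] * q8[j + 8]; // high nibble: element j+8
    }
    return sum;
}
```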
Note
Source pull request: ggml-org/llama.cpp#19769
I'm not super experienced with the ggml/gguf internals so feedback is very welcome. Note on AI usage: Claude Opus 4.6 was used for navigating the codebase, debugging, and writing parts of the code. All changes have been reviewed and tested manually. Open to reworking anything that doesn't meet the project's standards.
This adds support for NVIDIA's NVFP4 quantization format (FP4 E2M1 weights, UE4M3 per-block scale, 16 elements per block). This is the format produced by NVIDIA ModelOpt's NVFP4 algorithm. The main differences from MXFP4 are the scale encoding (UE4M3 vs. E8M0) and the block size (16 vs. 32 elements).
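For reference, the E2M1 code points and a possible in-memory block layout. The struct name and field layout below are illustrative, not necessarily what the patch uses; only the format parameters (16 elements per block, one UE4M3 scale byte, 4-bit codes) come from the description above.

```c
#include <stdint.h>

// FP4 E2M1 code points (1 sign, 2 exponent, 1 mantissa bit):
// 0, 0.5, 1, 1.5, 2, 3, 4, 6 and their negatives (sign in bit 3).
static const float e2m1_values[16] = {
     0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f,
};

// Hypothetical block layout: one UE4M3 scale byte plus 16 FP4 codes
// packed two per byte -> 9 bytes per 16 weights = 4.5 bits per weight
// (the 4.70 BPW reported later includes whole-model overhead).
typedef struct {
    uint8_t d;      // per-block scale, UE4M3 encoded
    uint8_t qs[8];  // 16 x FP4 E2M1, two codes per byte
} block_nvfp4;
```

Dequantization is then just decode-the-scale times table lookup per element, which is why the existing kvalues_mxfp4-style infrastructure carries over.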
What's in here:
Tested with models from https://huggingface.co/NVFP4 on an NVIDIA Blackwell GPU (13.1 driver), x86 AVX512, and an Apple M5 MacBook (CPU, Metal, and emulated Vulkan via MoltenVK). Ran llama-bench and a basic server smoke test. Would appreciate help with benchmarking if someone has a good baseline to compare against.