
UPSTREAM PR #19769: WIP: ggml : add NVFP4 quantization type support#1194

Open
loci-dev wants to merge 39 commits into main from loci/pr-19769-feat-nvfp4

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#19769

I'm not super experienced with the ggml/gguf internals so feedback is very welcome. Note on AI usage: Claude Opus 4.6 was used for navigating the codebase, debugging, and writing parts of the code. All changes have been reviewed and tested manually. Open to reworking anything that doesn't meet the project's standards.

This adds support for NVIDIA's NVFP4 quantization format (FP4 E2M1 weights, UE4M3 per-block scale, 16 elements per block). This is the format produced by NVIDIA ModelOpt's NVFP4 algorithm. The main difference from the existing MXFP4 type is the scale encoding (UE4M3 vs E8M0).
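The block layout described above can be sketched as a C struct (a hypothetical sketch: the macro and field names follow ggml conventions such as block_mxfp4, but are assumptions, not taken from the PR):

```c
#include <stdint.h>

// Hypothetical NVFP4 block: 16 FP4 (E2M1) weights packed two per byte,
// plus one UE4M3-encoded scale byte -> 9 bytes per 16 weights.
#define QK_NVFP4 16

typedef struct {
    uint8_t d;                 // block scale, UE4M3 (unsigned E4M3) encoded
    uint8_t qs[QK_NVFP4 / 2];  // 16 x 4-bit E2M1 values, low/high nibbles
} block_nvfp4;
```

Under this layout, sizeof(block_nvfp4) is 9, i.e. 4.5 bits per weight before any per-tensor metadata.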

What's in here:

  • New GGML_TYPE_NVFP4 type, block struct, UE4M3 conversion helpers, reference quantize/dequantize
  • convert_hf_to_gguf.py detects NVFP4 ModelOpt models and repacks into the GGUF block format
  • CPU backend: scalar dot product + ARM NEON + x86 AVX2/AVX512 optimized paths
  • CUDA backend: dequantize, mat-vec, mat-mat (MMQ with q8_1), on-device quantization
  • Metal backend: dequant functions, mat-vec, mat-mat, get_rows kernels
  • Vulkan backend: dequant shader, all mul_mat paths (scalar, coopmat2, vecq, mm, mmq), pipeline setup
  • gguf-py: type constant, quant/dequant, endian conversion
  • Tests added to test-backend-ops and test-quantize-fns

Tested with models from https://huggingface.co/NVFP4 on an NVIDIA Blackwell GPU with the 13.1 driver, x86 AVX512, and an Apple M5 MacBook (CPU, Metal, and emulated Vulkan via MoltenVK). Ran llama-bench and a basic server smoke test. Would appreciate help with performance validation if someone has a good baseline to compare against.

@loci-review

loci-review bot commented Feb 21, 2026

Overview

Analysis of NVFP4 quantization support implementation across 111,670 functions (42 modified, 12 new, 0 removed, 111,616 unchanged) in 15 binaries shows minimal system-wide performance impact. Overall power consumption increased 0.11% (+1.7μJ).

Binary Power Consumption Changes:

  • build.bin.libllama.so: -0.046%
  • build.bin.libggml-base.so: +0.538%
  • build.bin.libggml-cpu.so: +0.87%
  • build.bin.llama-quantize: -0.001%
  • build.bin.llama-bench: -0.001%
  • build.bin.libggml.so, build.bin.llama-tokenize, build.bin.llama-cvector-generator, build.bin.libmtmd.so, build.bin.llama-tts, build.bin.llama-qwen2vl-cli, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli: 0.0%

Function Analysis

Quantization Kernels (Performance-Critical):

  • dequantize_row_q8_0 (libggml-base.so): Response time +24.74ns (+4.20%), throughput +24.74ns (+5.86%). Zero source code changes; regression attributed to binary layout effects from NVFP4 function insertion.
  • dequantize_row_mxfp4 (libggml-base.so): Response time +31.89ns (+4.45%), throughput +31.89ns (+5.38%). Unchanged source code; instruction cache misalignment from adjacent NVFP4 additions.

Model Loading Functions (Non-Critical):

  • llama_model_ftype_name (libllama.so): Response time +81.82ns (+3.45%), throughput +60.03ns (+3.75%). Added NVFP4 case to format name switch statement.
  • std::vector<gguf_kv>::emplace_back (libggml-base.so): Response time -1.77ns (-0.01%), throughput +78.15ns (+37.92%). Template recompilation from ggml_type enum expansion (40→41 types).
  • gguf_set_val_u8 (libggml-base.so): Response time +42.75ns (+0.18%), throughput +41.82ns (+33.91%). No source changes; compiler optimization differences.

STL Template Functions:

  • std::vector::end() (libllama.so): Response time -183.29ns (-69.24%), throughput -183.29ns (-75.41%). Compiler optimization improvements in tokenization path.
  • std::unique_ptr::operator= variants (libllama.so): Bidirectional changes (+97.86% and -49.46% throughput) reflect compiler non-determinism in template instantiation for graph construction.

Other analyzed functions showed minor changes (20-80ns) in initialization and utility paths, attributed to compiler artifacts rather than algorithmic modifications.

Additional Findings

Inference Hot Path Preserved: No changes to matrix multiplication (GEMM), attention mechanisms, or KV cache operations—the primary performance bottlenecks (70-90% of inference time). Quantization kernel regressions (25-32ns per call) translate to <0.25% impact on per-token generation time.

Multi-Backend Integration: NVFP4 support added across CUDA, Vulkan, and Metal backends with MoE compatibility, extending the 40+ quantization format ecosystem. Implementation leverages existing kvalues_mxfp4 lookup table infrastructure.

Root Causes: Performance changes stem from: (1) binary layout effects on instruction cache (quantization kernels), (2) STL template recompilation from enum expansion (model loading), and (3) GCC 13 ARM64 compiler optimizations (STL functions). No algorithmic inefficiencies or code quality issues identified.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 6 times, most recently from 2cecc98 to a92fe2a Compare February 26, 2026 02:16

Remove NVFP4 support from GPU backends and architecture-specific
optimized dot products. These should be added in separate PRs so
backend specialists can review them independently.

Reverted files:
- ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh,
  quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh
- ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h,
  ggml-metal-ops.cpp
- ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/*
- ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c

Core NVFP4 support (type definition, CPU fallback dot product,
quantization, dequantization, conversion) is retained.

After shelving backend-specific SIMD implementations, the generic
CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390
platforms that previously relied on arch-specific versions.

Previously, values with ue4m3_exp <= 0 were clamped to 0, causing
all small scales to underflow. This made NVFP4 quantization via
llama-quantize produce garbage (PPL = 5.8M) since typical transformer
weights have amax/6.0 in the range 0.001-0.01, which falls in the
UE4M3 subnormal range.

Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7),
matching the decode path in ggml_ue4m3_to_fp32.

Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33),
comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15).

Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using
vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products.

tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup

- Add ue4m3_scale_lut[128] to ggml-common.h replacing branch-heavy
  ggml_ue4m3_to_fp32() in the hot loop
- Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32
- Accumulate with vfmaq_f32 into float32x4_t vector accumulators

tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed)
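The table-driven variant from the first bullet can be sketched like this (the table name comes from the commit message; the runtime initializer is an assumption — in ggml-common.h the 128 entries would be precomputed constants):

```c
#include <math.h>

// UE4M3 has 4 exponent + 3 mantissa bits = 7 bits, hence 128 entries.
// Indexing the table replaces the exp == 0 branch in the hot dot-product loop.
static float ue4m3_scale_lut[128];

static void ue4m3_lut_init(void) {
    for (int v = 0; v < 128; ++v) {
        const int exp = (v >> 3) & 0x0F;
        const int man = v & 0x07;
        ue4m3_scale_lut[v] = (exp == 0)
            ? (float) man * 0x1p-9f                              // subnormals
            : (1.0f + (float) man / 8.0f) * ldexpf(1.0f, exp - 7);
    }
}
```

In the dot product, the scale then becomes a single load, ue4m3_scale_lut[x->d], instead of a branchy conversion per block.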

Alternative approach: rearrange q8 data to match the NVFP4 lo/hi
nibble layout instead of rearranging the looked-up NVFP4 values.
Eliminates vcombine_s8(vget_low, vget_low) shuffles.

Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x
block overhead from QK=16 vs QK=32, not the shuffle instructions.

@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 6fa8e23 to f2637dc Compare March 15, 2026 02:18
@loci-dev loci-dev force-pushed the main branch 12 times, most recently from e6c519b to 59f2b25 Compare March 23, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 89a1190 to 8fec234 Compare March 30, 2026 02:18