UPSTREAM PR #19769: WIP: ggml : add NVFP4 quantization type support (#1194)
Conversation
Overview

Analysis of the NVFP4 quantization support implementation across 111,670 functions (42 modified, 12 new, 0 removed, 111,616 unchanged) in 15 binaries shows minimal system-wide performance impact. Overall power consumption increased 0.11% (+1.7 μJ).

Binary Power Consumption Changes:

Function Analysis

Quantization Kernels (Performance-Critical):
Model Loading Functions (Non-Critical):
STL Template Functions:
Other analyzed functions showed minor changes (20-80 ns) in initialization and utility paths, attributed to compiler artifacts rather than algorithmic modifications.

Additional Findings

Inference Hot Path Preserved: No changes to matrix multiplication (GEMM), attention mechanisms, or KV cache operations, which are the primary performance bottlenecks (70-90% of inference time). Quantization kernel regressions (25-32 ns per call) translate to <0.25% impact on per-token generation time.

Multi-Backend Integration: NVFP4 support added across the CUDA, Vulkan, and Metal backends with MoE compatibility, extending the 40+ quantization format ecosystem. The implementation leverages the existing kvalues_mxfp4 lookup table infrastructure.

Root Causes: Performance changes stem from: (1) binary layout effects on the instruction cache (quantization kernels), (2) STL template recompilation from enum expansion (model loading), and (3) GCC 13 ARM64 compiler optimizations (STL functions). No algorithmic inefficiencies or code quality issues were identified.

🔎 Full breakdown: Loci Inspector.
Remove NVFP4 support from GPU backends and architecture-specific optimized dot products. These should be added in separate PRs so backend specialists can review them independently.

Reverted files:
- ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh, quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh
- ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h, ggml-metal-ops.cpp
- ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/*
- ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c

Core NVFP4 support (type definition, CPU fallback dot product, quantization, dequantization, conversion) is retained.
After shelving backend-specific SIMD implementations, the generic CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390 platforms that previously relied on arch-specific versions.
Previously, values with ue4m3_exp <= 0 were clamped to 0, causing all small scales to underflow. This made NVFP4 quantization via llama-quantize produce garbage (PPL = 5.8M) since typical transformer weights have amax/6.0 in the range 0.001-0.01, which falls in the UE4M3 subnormal range. Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7), matching the decode path in ggml_ue4m3_to_fp32. Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33), comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15).
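The decode rule described above can be sketched in scalar C. This is illustrative only, assuming the standard E4M3 bit layout (4 exponent bits, 3 mantissa bits, bias 7, no sign bit) and ignoring any reserved top code point; the actual helper in the patch is ggml_ue4m3_to_fp32, whose internals are not shown here.

```c
#include <stdint.h>

// Sketch of UE4M3 (unsigned, 4 exponent + 3 mantissa bits, bias 7) decode.
// Illustrative reference, not the patch's ggml_ue4m3_to_fp32 itself.
static float ue4m3_to_fp32_ref(uint8_t x) {
    // 2^(e-7) for biased exponent e = 0..15 (entry 0 unused: e==0 is subnormal)
    static const float pow2[16] = {
        0x1p-7f, 0x1p-6f, 0x1p-5f, 0x1p-4f, 0x1p-3f, 0x1p-2f, 0x1p-1f, 0x1p0f,
        0x1p1f,  0x1p2f,  0x1p3f,  0x1p4f,  0x1p5f,  0x1p6f,  0x1p7f,  0x1p8f,
    };
    const int exp = (x >> 3) & 0x0F; // 4 exponent bits
    const int man = x & 0x07;        // 3 mantissa bits
    if (exp == 0) {
        // Subnormal: man * 2^-9 (= man/8 * 2^(1-7)). Before the fix, the
        // encode side clamped these codes to 0, collapsing all small scales.
        return (float) man * 0x1p-9f;
    }
    // Normal: (1 + man/8) * 2^(exp - 7)
    return (1.0f + (float) man / 8.0f) * pow2[exp];
}
```

With this rule, the smallest nonzero scale is 2^-9 ≈ 0.00195, which covers the amax/6.0 ≈ 0.001-0.01 range the commit message mentions.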
Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products. tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup
- Add ue4m3_scale_lut[128] to ggml-common.h, replacing the branch-heavy ggml_ue4m3_to_fp32() in the hot loop
- Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32
- Accumulate with vfmaq_f32 into float32x4_t vector accumulators

tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed)
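A 128-entry table covers every encodable UE4M3 scale, since the format has only 7 payload bits (4 exponent + 3 mantissa, no sign). The name ue4m3_scale_lut comes from the commit message; the construction below is an assumption matching the subnormal/normal decode rule, and it ignores any reserved top code point.

```c
#include <stdint.h>

// Hypothetical construction of the 128-entry UE4M3 -> fp32 scale table
// that replaces the branchy decode in the dot-product hot loop.
static float ue4m3_scale_lut[128];

static void init_ue4m3_scale_lut(void) {
    for (int x = 0; x < 128; x++) {
        const int exp = x >> 3; // biased exponent, 0..15
        const int man = x & 7;  // mantissa, 0..7
        ue4m3_scale_lut[x] = (exp == 0)
            ? (float) man * 0x1p-9f // subnormal: man * 2^-9
            // normal: (1 + man/8) * 2^(exp - 7)
            : (1.0f + (float) man / 8.0f) * (float)(1 << exp) * 0x1p-7f;
    }
}
```

The hot loop then replaces a branchy function call with a single indexed load, which is what enables keeping the block-scale multiply inside the vfmaq_f32 accumulation.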
Alternative approach: rearrange the q8 data to match the NVFP4 lo/hi nibble layout instead of rearranging the looked-up NVFP4 values. Eliminates the vcombine_s8(vget_low, vget_low) shuffles. Performance is equivalent (~18.5 t/s): the bottleneck is the 2x block overhead from QK=16 vs QK=32, not the shuffle instructions.
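A scalar view of the layout problem the two NEON variants are solving. The layout here is an assumption (byte j holds element j in its low nibble and element j+8 in its high nibble, as in other ggml 4-bit formats), and the table is the E2M1 code points scaled by 2 so entries stay integral, in the style of the kvalues_mxfp4 table mentioned above; a real kernel would fold that factor of 2 into the block scale.

```c
#include <stdint.h>

// Scalar sketch of one NVFP4 x q8 block dot product (layout assumed:
// low nibbles = elements 0..7, high nibbles = elements 8..15). A SIMD
// kernel must either shuffle the looked-up fp4 values into element
// order, or pre-rearrange the q8 operand into lo/hi order; both give
// the same sum, which is the equivalence the commit relies on.
static int dot_block_scalar(const uint8_t qs[8], const int8_t q8[16]) {
    // E2M1 code points {0, .5, 1, 1.5, 2, 3, 4, 6, -0, ...} scaled by 2
    static const int8_t kvalues[16] = {
        0, 1, 2, 3, 4, 6, 8, 12, 0, -1, -2, -3, -4, -6, -8, -12,
    };
    int sum = 0;
    for (int j = 0; j < 8; j++) {
        sum += kvalues[qs[j] & 0x0F] * q8[j];     // low nibble: element j
        sum += kvalues[qs[j] >>   4] * q8[j + 8]; // high nibble: element j+8
    }
    return sum;
}
```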
Note
Source pull request: ggml-org/llama.cpp#19769
I'm not super experienced with the ggml/gguf internals so feedback is very welcome. Note on AI usage: Claude Opus 4.6 was used for navigating the codebase, debugging, and writing parts of the code. All changes have been reviewed and tested manually. Open to reworking anything that doesn't meet the project's standards.
This adds support for NVIDIA's NVFP4 quantization format (FP4 E2M1 weights, UE4M3 per-block scale, 16 elements per block). This is the format produced by NVIDIA ModelOpt's NVFP4 algorithm. The main differences from MXFP4 are the scale encoding (UE4M3 vs. E8M0) and the block size (16 vs. 32 elements).
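For reference, the E2M1 code points and a possible in-memory block layout. The struct name and field layout below are illustrative, not necessarily what the patch uses; only the format parameters (16 elements per block, one UE4M3 scale byte, 4-bit codes) come from the description above.

```c
#include <stdint.h>

// FP4 E2M1 code points (1 sign, 2 exponent, 1 mantissa bit):
// 0, 0.5, 1, 1.5, 2, 3, 4, 6 and their negatives (sign in bit 3).
static const float e2m1_values[16] = {
     0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f,
};

// Hypothetical block layout: one UE4M3 scale byte plus 16 FP4 codes
// packed two per byte -> 9 bytes per 16 weights = 4.5 bits per weight
// (the 4.70 BPW reported later includes whole-model overhead).
typedef struct {
    uint8_t d;      // per-block scale, UE4M3 encoded
    uint8_t qs[8];  // 16 x FP4 E2M1, two codes per byte
} block_nvfp4;
```

Dequantization is then just decode-the-scale times table lookup per element, which is why the existing kvalues_mxfp4-style infrastructure carries over.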
What's in here:
Tested with models from https://huggingface.co/NVFP4 on an NVIDIA Blackwell GPU (13.1 driver), x86 AVX512, and an Apple M5 MacBook (CPU, Metal, and emulated Vulkan via MoltenVK). Ran llama-bench and a basic server smoke test. Would appreciate help with benchmarking if someone has a good baseline to compare against.