ggml : add NVFP4 quantization type support by richarddd · Pull Request #19769 · ggml-org/llama.cpp

richarddd · 2026-02-20T21:50:47Z

I'm not super experienced with the ggml/gguf internals so feedback is very welcome. Note on AI usage: Claude Opus 4.6 was used for navigating the codebase, debugging, and writing parts of the code. All changes have been reviewed and tested manually. Open to reworking anything that doesn't meet the project's standards.

This adds support for NVIDIA's NVFP4 quantization format (FP4 E2M1 weights, UE4M3 per-block scale, 16 elements per block). This is the format produced by NVIDIA ModelOpt's NVFP4 algo. The main difference is the scale encoding (UE4M3 vs E8M0).

What's in here:

New GGML_TYPE_NVFP4 type, block struct, UE4M3 conversion helpers, reference quantize/dequantize
convert_hf_to_gguf.py detects NVFP4 ModelOpt models and repacks into the GGUF block format
CPU backend: scalar dot product + ARM NEON
gguf-py: type constant, quant/dequant, endian conversion
Tests added to test-backend-ops and test-quantize-fns

Tested with models from https://huggingface.co/NVFP4 Apple M5 MacBook (CPU, NEON) Ran llama-bench and a basic server smoke test. Would appreciate help with that if someone has a good baseline to compare against.

Here is a Qwen3-4B model to test with.

JohannesGaessler · 2026-02-20T22:16:40Z

As is clearly laid out in the llama.cpp contributing guidelines:

When adding support for a new model or feature, focus on CPU support only in the initial PR unless you have a good reason not to. Add support for other backends like CUDA in follow-up PRs

pwilkin · 2026-02-20T23:14:57Z

I would really love NVFP4 support and I appreciate the work done here, but as @JohannesGaessler has already mentioned, the ratio of verified information to maintainer-needed work is way too high with this PR.

Please:

shelf all the backend implementations for now, they should be added in separate PRs so people specialized in specific backends can look at them
provide a GGUF of a converted model, preferrably one that can be ran comfortably by most mtaintainers (i.e. rather 8B or 12B than 400B).
make a KLD analysis for a full FP16 version as documented here
make perplexity and KLD checks for your quantized model as well as a comparable "standard" quant (Q4_1 would probably be a good choice here)
run benchmark tests for a known benchmark (you can use a tool such as Inspect AI, a good quick general benchmark to run is for example ARC Challenge

jeffbolznv · 2026-02-20T23:21:44Z

It would be great if nvfp4 could be stored in larger blocks that are at least a multiple of 4B (16B would be better).

JohannesGaessler · 2026-02-20T23:36:45Z

I agree that memory alignment is relevant, as long as the tensor dimensions are multiples of e.g. 256 it should be feasible to permute the data upon load though (except for maybe CPU+GPU hybrid inference where the overhead could be relevant).

ggerganov · 2026-02-23T08:25:34Z

make a KLD analysis for a full FP16 version as documented here

make perplexity and KLD checks for your quantized model as well as a comparable "standard" quant (Q4_1 would probably be a good choice here)

run benchmark tests for a known benchmark (you can use a tool such as Inspect AI, a good quick general benchmark to run is for example ARC Challenge

Btw, @pwilkin these are not really necessary for NVFP4 - adding support for this data type would not depend on the outcome of these. They are good for sanity checks, but other than that do not matter much. The main use case of NVFP4 is to load models that are already trained in that format - not to quantize models with it.

Regarding the alignment - I guess we can make blocks of 256 which would result in alignment of 16 bytes. Though we risk not being able to load tensors with dimension that is not multiple of 256. There was the same dilemma for MXFP4 and gpt-oss unfortunately has shapes that are only divisible by 64 but not 256.

am17an · 2026-02-23T11:23:10Z

NVFP4 also has a separate per tensor float scale which this PR doesn't take into account, unless I'm wrong. Also this whole PR is pretty much AI generated from what I can see. I had plans to add nvfp4 support after mxfp4 but another developer had promised to do it but since has not delivered so I will also create a PR for nvfp4 support in the meantime.

pwilkin · 2026-02-23T11:36:00Z

@ggerganov I know but I meant it exactly as a sanity check.

pwilkin · 2026-02-23T11:36:40Z

NVFP4 also has a separate per tensor float scale which this PR doesn't take into account, unless I'm wrong. Also this whole PR is pretty much AI generated from what I can see. I had plans to add nvfp4 support after mxfp4 but another developer had promised to do it but since has not delivered so I will also create a PR for nvfp4 support in the meantime.

Yeah I'm pretty frustrated as I was also thinking about working on it and was hoping this PR goes somewhere but seems it's going nowhere so far :/

richarddd · 2026-02-23T12:33:10Z

NVFP4 also has a separate per tensor float scale which this PR doesn't take into account, unless I'm wrong. Also this whole PR is pretty much AI generated from what I can see. I had plans to add nvfp4 support after mxfp4 but another developer had promised to do it but since has not delivered so I will also create a PR for nvfp4 support in the meantime.

It's taken into account. And regarding AI, as mentioned in the PR, I leaned on AI and following principles patterns applied in the MXFP4 PR. I'll remove the half-baked backend implementation and stick with NEON + generic CPU implementation for now. Again, this is a WIP which proves the concept and implements a lot of the boilerplate. I'll also increased blocksize to 64.

am17an · 2026-02-23T13:11:42Z

It's taken into account.

It is not. Please see the f32 scale as presented here https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/

As a reminder: you are supposed to know the content of the PR even if the PR is written with AI help. See the contributing guidelines.

richarddd · 2026-02-23T13:18:08Z

I would really love NVFP4 support and I appreciate the work done here, but as @JohannesGaessler has already mentioned, the ratio of verified information to maintainer-needed work is way too high with this PR.

Please:

shelf all the backend implementations for now, they should be added in separate PRs so people specialized in specific backends can look at them

provide a GGUF of a converted model, preferrably one that can be ran comfortably by most mtaintainers (i.e. rather 8B or 12B than 400B).

make a KLD analysis for a full FP16 version as documented here

make perplexity and KLD checks for your quantized model as well as a comparable "standard" quant (Q4_1 would probably be a good choice here)

run benchmark tests for a known benchmark (you can use a tool such as Inspect AI, a good quick general benchmark to run is for example ARC Challenge

Addressed these comments.

Here are results for Qwen3-4B

	NVFP4 (5.0 BPW)	Q4_1 (5.15 BPW)
PPL	15.25 (+8.0%)	15.81 (+12.0%)
Mean KLD	0.110	0.112
tg128 t/s	15.2	14.7
ARC Challenge (Inspect AI)	80%

am17an · 2026-02-23T14:10:04Z

Okay, not sure if that works but if it does then it's great since it simplifies the implementation quite a bit. The current state of your PR is not ok though, I see random changes in the CUDA and Vulkan code. Can you fix it?

richarddd · 2026-02-23T14:36:33Z

I see random changes in the CUDA and Vulkan code. Can you fix it?

Thanks, I noticed that as well. The problem was a one-time thing from the shelf commit targeting an older master. PR should be clean now

CISC · 2026-03-11T20:03:30Z

I guess we could hold on merging this until we prototype this and make sure there aren't any surprises?

Ouch. :)

ggerganov · 2026-03-11T20:04:38Z

No worries, we don't have other alternatives either way, so if the repack does not work out we'll have to live with the 4 byte alignment.

CISC · 2026-03-11T20:09:19Z

No worries, we don't have other alternatives either way, so if the repack does not work out we'll have to live with the 4 byte alignment.

Well, come to think of it, can we not have two NVFP4 quants? One with 16-byte alignment and this one to fall back on if that won't fit?

ggerganov · 2026-03-11T20:15:28Z

Sounds like too much redundancy and extra complexity for not much benefit.

CISC · 2026-03-11T20:19:30Z

Sounds like too much redundancy and extra complexity for not much benefit.

True, let's hope repacking pans out.

am17an · 2026-03-12T02:33:53Z

4 byte alignment is already quite good. Each CUDA thread reading 4 bytes in a warp leads to a 128 byte transaction which is ideal.

* WIP: add NVFP4 quantization support * tests * improve NVFP4 dot product implementation performance and fix bad super call * typo * Use nvfp4 kvalues * vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table * vulcal and perf fixes * wip * Fix metal * fix vulcan * Rename threshold & fix wrong scale * Fix MOE * Shelf backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD) Remove NVFP4 support from GPU backends and architecture-specific optimized dot products. These should be added in separate PRs so backend specialists can review them independently. Reverted files: - ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh, quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh - ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h, ggml-metal-ops.cpp - ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/* - ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c Core NVFP4 support (type definition, CPU fallback dot product, quantization, dequantization, conversion) is retained. * Fix arch-fallback.h: add NVFP4 generic fallback for all platforms After shelving backend-specific SIMD implementations, the generic CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390 platforms that previously relied on arch-specific versions. * quantize: add NVFP4 as a quantization type option * Fix ggml_fp32_to_ue4m3: handle subnormal values Previously, values with ue4m3_exp <= 0 were clamped to 0, causing all small scales to underflow. This made NVFP4 quantization via llama-quantize produce garbage (PPL = 5.8M) since typical transformer weights have amax/6.0 in the range 0.001-0.01, which falls in the UE4M3 subnormal range. Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7), matching the decode path in ggml_ue4m3_to_fp32. Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33), comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15). * Restore ARM NEON NVFP4 dot product implementation Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products. tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup * Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq - Add ue4m3_scale_lut[128] to ggml-common.h replacing branch-heavy ggml_ue4m3_to_fp32() in the hot loop - Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32 - Accumulate with vfmaq_f32 into float32x4_t vector accumulators tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed) * ARM NEON NVFP4: rearrange q8 to match nibble layout Alternative approach: rearrange q8 data to match the NVFP4 lo/hi nibble layout instead of rearranging the looked-up NVFP4 values. Eliminates vcombine_s8(vget_low, vget_low) shuffles. Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x block overhead from QK=16 vs QK=32, not the shuffle instructions. * CPU only backend 64 super-block layout * cleanup * Remove unused LUT * int * exclude NVFP4 from unsupported ops in metal build * remove quantization for now * store scales as native UE4M3, preserve original model bits when possible * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * correct comment * format * reduce duplication and cleanup * Address comments * move detection to prepare_tensors * Use math instead of const * Move * fix comment * Shelf quantize tests * Rebase and move check * cleanup * lint * Update gguf-py/gguf/scripts/gguf_convert_endian.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Use fallback quant config * Simplify Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * organize * Refactor * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * add quantize_nvfp4 (required for test_quants.py) * add quantize_nvfp4 (required for test_quants.py) * add quantize_nvfp4 (required for test_quants.py) * fix return type --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* 'master' of github.com:ggml-org/llama.cpp: (33 commits) convert : better mtp check and fix return [no ci] (ggml-org#20419) vulkan: fix SSM_CONV PP scaling with large ubatch sizes (ggml-org#20379) New conversations now auto-select the first loaded model (ggml-org#20403) ggml-virtgpu: Fix some build commands (ggml-org#20341) metal : avoid divisions in bin kernel (ggml-org#20426) ci: Setup self-hosted CI for Intel Linux Vulkan backend (ggml-org#20154) vulkan: fix l2_norm epsilon handling (ggml-org#20350) vulkan: fix OOB check in flash_attn_mask_opt (ggml-org#20296) vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (ggml-org#20059) opencl: use larger workgroup size for get_rows (ggml-org#20316) opencl: add cumsum op (ggml-org#18981) hip: compile debug builds with -O2 on hip to avoid a compiler bug (ggml-org#20392) common/parser: add GigaChatV3/3.1 models support (ggml-org#19931) model : add support for Phi4ForCausalLMV (ggml-org#20168) graph : add optional scale parameter to build_lora_mm [no ci] (ggml-org#20427) common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up (ggml-org#20416) ggml-webgpu: Add supports for `GGML_OP_REPEAT` (ggml-org#20230) llama : enable chunked fused GDN path (ggml-org#20340) llama : whitespace cleanup (ggml-org#20422) ggml : add NVFP4 quantization type support (ggml-org#19769) ...

JohannesGaessler · 2026-03-12T18:53:36Z

4 byte alignment is already quite good. Each CUDA thread reading 4 bytes in a warp leads to a 128 byte transaction which is ideal.

For synchronous data copies I agree, for asynchronous copies chunks of 16 bytes work better in my excperience.

michaelw9999 · 2026-03-12T19:06:36Z

I've got the current version working with CUDA converting to pack SoA (without 4/6 or any fancy stuff) but it's not as fast as it should be (about 13,000 tk/s on Qwen4-B). Should I post it anywhere or do we have a thread to discuss follow up NVFP4 tasks? Having issues converting models and have fixes for the py script. Hope I can contribute something. Thanks

richarddd · 2026-03-12T19:57:12Z

I've got the current version working with CUDA converting to pack SoA (without 4/6 or any fancy stuff) but it's not as fast as it should be (about 13,000 tk/s on Qwen4-B). Should I post it anywhere or do we have a thread to discuss follow up NVFP4 tasks? Having issues converting models and have fixes for the py script. Hope I can contribute something. Thanks

@michaelw9999 I think individual PRs. Small isolated onces. If improvements are incremental, they should rather be separate PR's IMO. For example, one with basic CUDA support, one for 4/6 and maybe some fancy stuff etc

JohannesGaessler · 2026-03-12T20:24:41Z

The CUDA code should have the following pieces for basic support: NVFP4 dequantization + cuBLAS, MMVQ support, MMQ support via dp4a, MMQ support via tensor cores. For new contributors please only as individual and self-contained PRs, for more experienced contributors I think it's fine to do multiple things at once. Fancy stuff should come after that with evidence that it is an improvement.

xkmire · 2026-03-12T20:26:53Z

Thanks very much for the NVFP4 work!!

I found two very interesting NVFP4 models on huggingface:

txn545/Qwen3.5-122B-A10B-NVFP4 quantized using the NVIDIA Model Optimizer.
AxionML/Qwen3.5-122B-A10B-NVFP4 quantized using NVIDIA TensorRT Model Optimizer.

I tried to convert them to gguf, but both failed.

ValueError: Can not map tensor 'model.language_model.layers.0.mlp.shared_expert.down_proj.weight'
ValueError: Can not map tensor 'model.language_model.layers.0.linear_attn.in_proj_a.weight'

I was just wondering, if this are the kind of models that is intended to work with the NVFP4 support I have seen going into llama.cpp the last days.

If yes, I tink I might have a go at trying to figure out why they fail. Not sure I will be able to find out how to fix, but eager to get my new expensive GPU to run at its best...

michaelw9999 · 2026-03-12T20:48:18Z

Hi @xkmire I put a proposed fix in. See #20505. It was working well with your Qwen3.5-122B-A10B-NVFP4 model :-))

…

Message ID: ***@***.***>

vbooka1 · 2026-03-13T11:23:27Z

Hello, I am getting error "Quant method is not yet supported: 'modelopt'" when trying to convert NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 ( https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/ ) to .gguf

error log: #20411 (comment)

CISC · 2026-03-13T11:30:41Z

Hello, I am getting error "Quant method is not yet supported: 'modelopt'" when trying to convert NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 ( https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/ ) to .gguf

error log: #20411 (comment)

Seems they have per-tensor quant_algo, which we don't check for, so repacking never kicks in.

ORippler · 2026-03-13T13:33:53Z

The CUDA code should have the following pieces for basic support: NVFP4 dequantization + cuBLAS, MMVQ support, MMQ support via dp4a, MMQ support via tensor cores.

Most likely stating the obvious: For MMVQ and MMQ dp4a path, it makes sense to do computations in BF16/FP16, as throughput is equal for FP and ALU in CUDA cores and we can save the I2F conversion via fp4 intrinsics (on the hardware that supports those of course).

Just wanted to point this out as the CPU path in this PR does ALU followed by I2F.

ORippler · 2026-03-13T13:39:31Z

4 byte alignment is already quite good. Each CUDA thread reading 4 bytes in a warp leads to a 128 byte transaction which is ideal.

For synchronous data copies I agree, for asynchronous copies chunks of 16 bytes work better in my excperience.

4 byte is the minimum we need to be able to issue LDGSTS via cg::memcpy_async and reduce register pressure by bypassing registers for the store op, and wider should always be better (as it doesn't depend on the MMU to pack LDGs issued across threads into the same read call and has higher IPC)

JohannesGaessler · 2026-03-14T10:02:56Z

Regarding MMVQ: currently the activations are unconditionally converted to q8_1, if we intend to use floating-point math we will need to extend this. More generally, if we add a path using floating-point math it may make sense to use it for small matrices to remove the overhead from quantizing the activations. This table doesn't seem to list the throughput of __dp4a but I would assume for a matrix vector multiplication it won't make much of a difference either way though. We should maybe also try to define what we want to put in mmvq.cu vs. mmvf.cu. The way that would make sense to me is to use MMVQ for block-wise src0 data types and to use MMVF for scalar data types. (MMVF is strictly not needed since cuBLAS could be used, we basically only have it for performance reasons, particularly for MoE models.)

ORippler · 2026-03-18T13:19:10Z

Regarding MMVQ: currently the activations are unconditionally converted to q8_1, if we intend to use floating-point math we will need to extend this. More generally, if we add a path using floating-point math it may make sense to use it for small matrices to remove the overhead from quantizing the activations. This table doesn't seem to list the throughput of __dp4a but I would assume for a matrix vector multiplication it won't make much of a difference either way though. We should maybe also try to define what we want to put in mmvq.cu vs. mmvf.cu. The way that would make sense to me is to use MMVQ for block-wise src0 data types and to use MMVF for scalar data types. (MMVF is strictly not needed since cuBLAS could be used, we basically only have it for performance reasons, particularly for MoE models.)

I took the time to setup a script to bench dp4a vs f16 paths for mxfp4 inputs (as the gist benches mxfp4/mxfp4 instead of mxfp4/q8_0 the raw gain of f16 vs. dp4a will be less, though we can offset it by foregoing activation-quantization as you mentioned, which would be the much bigger perf gain):

https://gist.github.com/ORippler/1ac0757dc9bc462e4bf5c19a71b67c67

Cross-posting/quoting entries from there:

Some numbers on BW system (SM120)

(.venv) ➜  scratchpad ./run_nvfp4_compare.sh --elements=16777216 --iters=50 --warmup=5 --repeats=5 
GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (SM 12.0)
elements=16777216 iters=50 warmup=5 repeats=5 blocks=752 threads=256 requested_scale_a=0.125 requested_scale_b=0.0625 e8m0_scale_a=0.125 e8m0_scale_b=0.0625

Path                                              avg ms          GMac/s        checksum
FP4(E2M1) -> INT8 + DP4A + post-scale              0.150        5591.570       31150.195
FP4(E2M1) -> f16 + f16-domain multiply             0.108        7739.711       31150.195

DP4A path speed vs f16 path: 0.722x
L1 relative output delta (f16 ref): 0.000
(.venv) ➜  scratchpad ./run_nvfp4_compare.sh --elements=167772160 --iters=50 --warmup=5 --repeats=5
GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (SM 12.0)
elements=167772160 iters=50 warmup=5 repeats=5 blocks=752 threads=256 requested_scale_a=0.125 requested_scale_b=0.0625 e8m0_scale_a=0.125 e8m0_scale_b=0.0625

Path                                              avg ms          GMac/s        checksum
FP4(E2M1) -> INT8 + DP4A + post-scale              3.301        2540.992       18451.855
FP4(E2M1) -> f16 + f16-domain multiply             3.217        2607.946       18451.855

DP4A path speed vs f16 path: 0.974x
L1 relative output delta (f16 ref): 0.000

Why is F16 path preferred even if throughput of DP4a is higher theoretically? Two-reasons:

We use ALU and FMA pipes instead of almost exclusively the ALU pipe (safe for the multiplication with chunk scale in DP4a path).
We have specific instructions available in our ISA to convert from fp4x2 to half2, yet none exist for conversion from packed fp4 to int32, meaning we have to spend more ALU instructions on this (basically the most costly part is repacking the loaded data in the registers)

Scaling of F16 vs. DP4a depends on the workload size. In the first setting, we stay within cache, whereas in the second setting we are at 100% memory bandwidth (and thus waiting for data most of time).

Added in ggml-org/llama.cpp#19769  --- > [!NOTE] > **Low Risk** > Low risk: adds a new GGUF quantization enum entry plus description/size metadata, with minimal impact beyond recognizing an additional quant label and computing its bpw. > > **Overview** > Adds support for the new `NVFP4` GGUF quantization type (from llama.cpp PR #19769). > > Updates quantization metadata to recognize `NVFP4` in filename parsing/ordering (`packages/tasks/src/gguf.ts`) and documents its description plus bits-per-weight calculation (`packages/gguf/src/quant-descriptions.ts`). > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 0fc05ce. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>

michaelw9999 · 2026-03-18T17:48:38Z

Super helpful, thank you! Your numbers match almost perfectly the real numbers I am seeing with NVFP4. I've been working out the best path for updating the dp4a-only model. On F16 some models overflow (another issue) but forcing F32, which fixes that, slows max prefill by 0.72x; avg 9000tk/s to 6500tk/s.. == 0.72.

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 32606 MiB):
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32606 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 4B BF16                  |   2.63 GiB |     4.02 B | CUDA       |  99 |           pp512 |     8914.99 ± 950.13 |
| qwen3 4B BF16                  |   2.63 GiB |     4.02 B | CUDA       |  99 |           tg128 |        280.78 ± 0.60 |

build: 6e5081b5a (8381)
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 32606 MiB):
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32606 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 4B BF16                  |   2.63 GiB |     4.02 B | CUDA       |  99 |           pp512 |     6646.92 ± 567.05 |
| qwen3 4B BF16                  |   2.63 GiB |     4.02 B | CUDA       |  99 |           tg128 |        280.52 ± 0.88 |

I've been trying to find a way to use F32 without such a huge loss but I guess that is how it will be on SM120.
Side thought: need to fix convert_hf_... to stop naming models BF16 on the NVFP4 conversions.

* WIP: add NVFP4 quantization support * tests * improve NVFP4 dot product implementation performance and fix bad super call * typo * Use nvfp4 kvalues * vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table * vulcal and perf fixes * wip * Fix metal * fix vulcan * Rename threshold & fix wrong scale * Fix MOE * Shelf backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD) Remove NVFP4 support from GPU backends and architecture-specific optimized dot products. These should be added in separate PRs so backend specialists can review them independently. Reverted files: - ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh, quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh - ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h, ggml-metal-ops.cpp - ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/* - ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c Core NVFP4 support (type definition, CPU fallback dot product, quantization, dequantization, conversion) is retained. * Fix arch-fallback.h: add NVFP4 generic fallback for all platforms After shelving backend-specific SIMD implementations, the generic CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390 platforms that previously relied on arch-specific versions. * quantize: add NVFP4 as a quantization type option * Fix ggml_fp32_to_ue4m3: handle subnormal values Previously, values with ue4m3_exp <= 0 were clamped to 0, causing all small scales to underflow. This made NVFP4 quantization via llama-quantize produce garbage (PPL = 5.8M) since typical transformer weights have amax/6.0 in the range 0.001-0.01, which falls in the UE4M3 subnormal range. Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7), matching the decode path in ggml_ue4m3_to_fp32. Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33), comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15). * Restore ARM NEON NVFP4 dot product implementation Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products. tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup * Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq - Add ue4m3_scale_lut[128] to ggml-common.h replacing branch-heavy ggml_ue4m3_to_fp32() in the hot loop - Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32 - Accumulate with vfmaq_f32 into float32x4_t vector accumulators tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed) * ARM NEON NVFP4: rearrange q8 to match nibble layout Alternative approach: rearrange q8 data to match the NVFP4 lo/hi nibble layout instead of rearranging the looked-up NVFP4 values. Eliminates vcombine_s8(vget_low, vget_low) shuffles. Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x block overhead from QK=16 vs QK=32, not the shuffle instructions. * CPU only backend 64 super-block layout * cleanup * Remove unused LUT * int * exclude NVFP4 from unsupported ops in metal build * remove quantization for now * store scales as native UE4M3, preserve original model bits when possible * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * correct comment * format * reduce duplication and cleanup * Address comments * move detection to prepare_tensors * Use math instead of const * Move * fix comment * Shelf quantize tests * Rebase and move check * cleanup * lint * Update gguf-py/gguf/scripts/gguf_convert_endian.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Use fallback quant config * Simplify Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * organize * Refactor * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * add quantize_nvfp4 (required for test_quants.py) * add quantize_nvfp4 (required for test_quants.py) * add quantize_nvfp4 (required for test_quants.py) * fix return type --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

JohannesGaessler · 2026-03-24T22:04:43Z

Tests added to test-backend-ops and test-quantize-fns

There aren't actually any NVFP4 test cases in test-backend-ops. Was this overlooked?

CISC · 2026-03-24T22:16:17Z

Tests added to test-backend-ops and test-quantize-fns

There aren't actually any NVFP4 test cases in test-backend-ops. Was this overlooked?

I think they were removed again, TBD at first backend support.

michaelw9999 · 2026-03-24T22:20:15Z

Tests added to test-backend-ops and test-quantize-fns

There aren't actually any NVFP4 test cases in test-backend-ops. Was this overlooked?

@JohannesGaessler @CISC I can add them in a new PR if you like. Have various different tests. I did not include any in the CUDA PR.

JohannesGaessler · 2026-03-24T23:15:31Z

No tests beyond test-backend-ops should be needed.

richarddd requested review from 0cc4m, CISC, JohannesGaessler and ggerganov as code owners February 20, 2026 21:50

richarddd force-pushed the feat/nvfp4 branch from 9cd0f58 to 86dd3fc Compare February 20, 2026 21:51

loci-dev mentioned this pull request Feb 21, 2026

UPSTREAM PR #19769: WIP: ggml : add NVFP4 quantization type support auroralabs-loci/llama.cpp#1194

Open

richarddd marked this pull request as draft February 23, 2026 12:27

This comment was marked as outdated.

Sign in to view

richarddd force-pushed the feat/nvfp4 branch from 5f8f21b to ffab58b Compare February 23, 2026 14:28

github-actions bot added the examples label Feb 23, 2026

richarddd marked this pull request as ready for review February 23, 2026 17:06

richarddd deleted the feat/nvfp4 branch March 11, 2026 20:35

richarddd mentioned this pull request Mar 11, 2026

graph : add optional scale parameter to build_lora_mm #20427

Merged

CISC mentioned this pull request Mar 18, 2026

Add NVFP4 GGUF QuantizationType huggingface/huggingface.js#2046

Merged

Conversation

richarddd commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JohannesGaessler commented Feb 20, 2026

Uh oh!

pwilkin commented Feb 20, 2026

Uh oh!

jeffbolznv commented Feb 20, 2026

Uh oh!

JohannesGaessler commented Feb 20, 2026

Uh oh!

ggerganov commented Feb 23, 2026

Uh oh!

am17an commented Feb 23, 2026

Uh oh!

pwilkin commented Feb 23, 2026

Uh oh!

pwilkin commented Feb 23, 2026

Uh oh!

richarddd commented Feb 23, 2026

Uh oh!

am17an commented Feb 23, 2026

Uh oh!

richarddd commented Feb 23, 2026

Uh oh!

This comment was marked as outdated.

am17an commented Feb 23, 2026

Uh oh!

richarddd commented Feb 23, 2026

Uh oh!

CISC commented Mar 11, 2026

Uh oh!

ggerganov commented Mar 11, 2026

Uh oh!

CISC commented Mar 11, 2026

Uh oh!

ggerganov commented Mar 11, 2026

Uh oh!

CISC commented Mar 11, 2026

Uh oh!

am17an commented Mar 12, 2026

Uh oh!

JohannesGaessler commented Mar 12, 2026

Uh oh!

michaelw9999 commented Mar 12, 2026

Uh oh!

richarddd commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JohannesGaessler commented Mar 12, 2026

Uh oh!

xkmire commented Mar 12, 2026

Uh oh!

michaelw9999 commented Mar 12, 2026 via email • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vbooka1 commented Mar 13, 2026

Uh oh!

CISC commented Mar 13, 2026

Uh oh!

ORippler commented Mar 13, 2026

Uh oh!

ORippler commented Mar 13, 2026

Uh oh!

JohannesGaessler commented Mar 14, 2026

Uh oh!

ORippler commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

michaelw9999 commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JohannesGaessler commented Mar 24, 2026

Uh oh!

CISC commented Mar 24, 2026

Uh oh!

michaelw9999 commented Mar 24, 2026

Uh oh!

JohannesGaessler commented Mar 24, 2026

richarddd commented Feb 20, 2026 •

edited

Loading

richarddd commented Mar 12, 2026 •

edited

Loading

michaelw9999 commented Mar 12, 2026 via email •

edited

Loading

ORippler commented Mar 18, 2026 •

edited

Loading

michaelw9999 commented Mar 18, 2026 •

edited

Loading