
ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations #17977

Merged
max-krasnyansky merged 4 commits into ggml-org:master from ngdxzy:real_q8_0
Dec 19, 2025

Conversation

@ngdxzy
Contributor

@ngdxzy ngdxzy commented Dec 12, 2025

Description:

This PR implements true Q8_0 quantization for the Hexagon NPU backend, building on and integrating substantial previous work, to align its behavior with the CPU implementation in llama.cpp and improve the numerical accuracy of mixed-precision matmul operations.

Background:

In the current Hexagon NPU pipeline, quantization is performed on-the-fly during matrix multiplication, where FP32 activations are quantized and multiplied with already quantized weights. As a result, the quantization group size directly impacts the numerical behavior of these mixed-precision matmul operations, making alignment with the CPU Q8_0 scheme particularly important for correctness.

Previously, the Hexagon backend only supported quantize_block_fp32_q8x4, which uses a group size of 128. While functional, this does not match the standard Q8_0 definition used by the CPU backend in llama.cpp, leading to accuracy differences.

What's new:

  1. Implemented true Q8_0 quantization kernels with smaller group sizes:
  • quantize_block_fp32_q8x1 with group size 32
  • quantize_block_fp32_q8x2 with group size 64
  2. Retained the original quantize_block_fp32_q8x4 implementation (group size 128) for compatibility and performance comparisons.
  3. Introduced a function-pointer-based dispatch mechanism to select the Q8 quantization kernel at runtime.
  • Enables dynamic switching between q8x1 / q8x2 / q8x4 without code duplication.
  • Facilitates future debugging, validation, and accuracy/performance trade-off studies.
  • Allows easier experimentation with different group sizes while keeping the call sites unchanged.
  4. Aligned scale computation and quantization behavior with the CPU Q8_0 implementation in llama.cpp.

Why this matters:

  • Aligns Hexagon NPU Q8_0 quantization with the CPU implementation in llama.cpp
  • Improves quantization accuracy by using smaller group sizes
  • Reduces numerical discrepancies between CPU and NPU backends
  • Preserves the original q8x4 path for performance-oriented use cases
  • Validated on the K projection of layer 0 in the Qwen3-0.6B model, showing an over 35% reduction in L2 error with no observable performance regression.

Summary:

  • quantize_block_fp32_q8x1 → group size 32
  • quantize_block_fp32_q8x2 → group size 64
  • quantize_block_fp32_q8x4 → group size 128

This change aligns the Hexagon NPU backend with the true Q8_0 quantization scheme used on CPU, improving correctness while retaining flexibility for performance tuning and future experimentation.

@ngdxzy ngdxzy changed the title ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU (add q8x1 / q8x2 paths) ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU (add q8x1 / q8x2 paths) for more accurate mixed-precision matmul operations Dec 12, 2025
@ngdxzy ngdxzy changed the title ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU (add q8x1 / q8x2 paths) for more accurate mixed-precision matmul operations ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations Dec 12, 2025
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Dec 13, 2025
@max-krasnyansky
Member

Had a chance to review and play with this now.

My original implementation of dyn.quant was heavily fp32-based and used block-size 32.
I spent quite a bit of time tuning it and eventually ended up with fp16-based version and block-size 128.
So my first reaction on seeing this was that it's going to be quite a bit slower.
Surprisingly, it's not too bad, but it's still slower. That's fairly obvious from the code (it uses a lot more instructions, more stores, etc), but it's somewhat tricky to profile because it's subject to L2/DDR perf variability.

Here is a quick test on Gen5 with llama-bench and OPMASK=0x3 (ie dyn.quant enabled and rest of matmul disabled).

OPMASK=0x3 ./scripts/.../run-tool.sh llama-bench --device HTP0 -m Llama-3.2-3B-Instruct-Q4_0.gguf -t 6 -p 128 -n 0 --cpu-mask 0xfc --cpu-strict 1 --poll 1000              

Before
| llama 3B Q4_0   |   1.78 GiB |   3.21 B | OpenCL,HTP |  99 | 6 | 0xfc  | 1 | 1000 | HTP0  | pp128 | 1740.01 ± 36.77  |

After

| llama 3B Q4_0   |   1.78 GiB |   3.21 B | OpenCL,HTP |  99 | 6 | 0xfc  | 1 | 1000 | HTP0  | pp128 | 1697.04 ± 38.54 |

Also if you were to compile with GGML_HEXAGON_HTP_DEBUG=ON

Before

CDSP0:[SU]: quantize-fp32-q8x4: 0/8 : n-rows 128 (0:16) row-size 32768 -> 8704 usec 172
CDSP0:[SU]: quantize-fp32-q8x4: 1/8 : n-rows 128 (16:32) row-size 32768 -> 8704 usec 181 
CDSP0:[SU]: quantize-fp32-q8x4: 2/8 : n-rows 128 (32:48) row-size 32768 -> 8704 usec 181 
CDSP0:[SU]: quantize-fp32-q8x4: 6/8 : n-rows 128 (96:112) row-size 32768 -> 8704 usec 180
CDSP0:[SU]: quantize-fp32-q8x4: 7/8 : n-rows 128 (112:128) row-size 32768 -> 8704 usec 180
CDSP0:[SU]: quantize-fp32-q8x4: 4/8 : n-rows 128 (64:80) row-size 32768 -> 8704 usec 181 
CDSP0:[SU]: quantize-fp32-q8x4: 3/8 : n-rows 128 (48:64) row-size 32768 -> 8704 usec 181 
CDSP0:[SU]: quantize-fp32-q8x4: 5/8 : n-rows 128 (80:96) row-size 32768 -> 8704 usec 180 

After

CDSP0:[SU]: quantize-fp32-q8x4: 7/8 : n-rows 128 (112:128) row-size 32768 -> 8704 usec 186
CDSP0:[SU]: quantize-fp32-q8x4: 2/8 : n-rows 128 (32:48) row-size 32768 -> 8704 usec 184
CDSP0:[SU]: quantize-fp32-q8x4: 3/8 : n-rows 128 (48:64) row-size 32768 -> 8704 usec 185
CDSP0:[SU]: quantize-fp32-q8x4: 6/8 : n-rows 128 (96:112) row-size 32768 -> 8704 usec 184
CDSP0:[SU]: quantize-fp32-q8x4: 1/8 : n-rows 128 (16:32) row-size 32768 -> 8704 usec 186
CDSP0:[SU]: quantize-fp32-q8x4: 0/8 : n-rows 128 (0:16) row-size 32768 -> 8704 usec 184
CDSP0:[SU]: quantize-fp32-q8x4: 5/8 : n-rows 128 (80:96) row-size 32768 -> 8704 usec 184
CDSP0:[SU]: quantize-fp32-q8x4: 4/8 : n-rows 128 (64:80) row-size 32768 -> 8704 usec 185

Oftentimes the numbers are about the same, but I do see consistently better perf from the quantizer in master.
My thinking is that the small improvement in perplexity score (~0.0857) is not worth it. But it's good to have options.
Also, whenever we get to enabling HMX-INT8/4 (we'll start with FP16 as I mentioned in other discussions) we might need to introduce another dyn-quantizer scheme that computes/uses intermediate INT scales.
With that in mind, how about we introduce a GGML_HEXAGON_DYNQUANT_TYPE cmake option?

The default would be Q8x4 (ie block-size 128) and folks can enable Q8x1 if their use case benefits from improved precision.
In the future we can add additional types there which are more lossy but more performant.

With the cmake option (which adds a compile definition) we don't need to pass a function pointer.
Just add #define QUANTIZE_BLOCK_FP32 and set it based on the compile definition.
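The compile-definition approach suggested here can be sketched as follows: a group-size definition set by CMake selects the kernel behind a single QUANTIZE_BLOCK_FP32 macro. Only the macro name, the FP32_QUANTIZE_GROUP_SIZE definition, and the 32/64/128 choices come from this PR; the stub function bodies below are placeholders for the real kernels:

```c
// Compile-time kernel selection via a CMake-provided definition.
// The real kernels live in the Hexagon backend; these stubs just
// return their group size so the dispatch can be demonstrated.
static int quantize_block_fp32_q8x1(void) { return 32;  }  // stub
static int quantize_block_fp32_q8x2(void) { return 64;  }  // stub
static int quantize_block_fp32_q8x4(void) { return 128; }  // stub

#ifndef FP32_QUANTIZE_GROUP_SIZE
#define FP32_QUANTIZE_GROUP_SIZE 128  // default: the Q8x4 path
#endif

#if FP32_QUANTIZE_GROUP_SIZE == 32
#define QUANTIZE_BLOCK_FP32 quantize_block_fp32_q8x1
#elif FP32_QUANTIZE_GROUP_SIZE == 64
#define QUANTIZE_BLOCK_FP32 quantize_block_fp32_q8x2
#elif FP32_QUANTIZE_GROUP_SIZE == 128
#define QUANTIZE_BLOCK_FP32 quantize_block_fp32_q8x4
#else
#error "FP32_QUANTIZE_GROUP_SIZE must be 32, 64, or 128"
#endif
```

Call sites then use QUANTIZE_BLOCK_FP32 directly, with no function-pointer indirection and no runtime branch.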

@ngdxzy
Contributor Author

ngdxzy commented Dec 18, 2025

That’s a good idea, and it makes sense to me. I can update the CMakeLists.txt to support this. Currently, group sizes of 32, 64, and 128 are implemented.

Member

@max-krasnyansky left a comment

Looks great. Thanks for the quick turnaround.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Dec 18, 2025
@max-krasnyansky
Member

Sorry. I just realized that the build fails if CMakePreset is not updated.

/workspace/ggml/src/ggml-hexagon/htp/matmul-ops.c:1798:2: error: "FP32_QUANTIZE_GROUP_SIZE must be 32, 64, or 128"

That's because the default is not kicking in properly. Here is one of the compiler commands with a clean build (ie no prev build dir).

/opt/hexagon/6.4.0.2/tools/HEXAGON_Tools/19.0.04/Tools/bin/hexagon-clang -DFP32_QUANTIZE_GROUP_SIZE=OFF  <<<<

@ngdxzy
Contributor Author

ngdxzy commented Dec 18, 2025

Sorry about that; I was using the UserPreset.json when building. It should now be fixed.

@max-krasnyansky max-krasnyansky merged commit ce734a8 into ggml-org:master Dec 19, 2025
68 checks passed
Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
…e accurate mixed-precision matmul operations (ggml-org#17977)

* feat: implement real Q8_0

* feat: adding cmake option for configuring FP32 quantize group size

* typo: set() shall be used

---------

Co-authored-by: ngdxzy <zhenyu_xu@uri.edu>
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026
…e accurate mixed-precision matmul operations (#17977)

* feat: implement real Q8_0

* feat: adding cmake option for configuring FP32 quantize group size

* typo: set() shall be used

---------

Co-authored-by: ngdxzy <zhenyu_xu@uri.edu>