
ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations #17977

Merged
max-krasnyansky merged 4 commits into ggml-org:master from ngdxzy:real_q8_0
Dec 19, 2025

Conversation

@ngdxzy
Contributor

@ngdxzy ngdxzy commented Dec 12, 2025

Description:

This PR implements true Q8_0 quantization for the Hexagon NPU backend, building on and integrating substantial previous work, to align its behavior with the CPU implementation in llama.cpp and improve the numerical accuracy of mixed-precision matmul operations.

Background:

In the current Hexagon NPU pipeline, quantization is performed on-the-fly during matrix multiplication, where FP32 activations are quantized and multiplied with already quantized weights. As a result, the quantization group size directly impacts the numerical behavior of these mixed-precision matmul operations, making alignment with the CPU Q8_0 scheme particularly important for correctness.

Previously, the Hexagon backend only supported quantize_block_fp32_q8x4, which uses a group size of 128. While functional, this does not match the standard Q8_0 definition used by the CPU backend in llama.cpp, leading to accuracy differences.

What's new:

  1. Implemented true Q8_0 quantization kernels with smaller group sizes:
  • quantize_block_fp32_q8x1 with group size 32
  • quantize_block_fp32_q8x2 with group size 64
  2. Retained the original quantize_block_fp32_q8x4 implementation (group size 128) for compatibility and performance comparisons.
  3. Introduced a function-pointer-based dispatch mechanism to select the Q8 quantization kernel at runtime.
  • Enables dynamic switching between q8x1 / q8x2 / q8x4 without code duplication.
  • Facilitates future debugging, validation, and accuracy/performance trade-off studies.
  • Allows easier experimentation with different group sizes while keeping the call sites unchanged.
  4. Aligned scale computation and quantization behavior with the CPU Q8_0 implementation in llama.cpp.

Why this matters:

  • Aligns Hexagon NPU Q8_0 quantization with the CPU implementation in llama.cpp
  • Improves quantization accuracy by using smaller group sizes
  • Reduces numerical discrepancies between CPU and NPU backends
  • Preserves the original q8x4 path for performance-oriented use cases
  • Validated on the K projection of layer 0 in the Qwen3-0.6B model, showing an over 35% reduction in L2 error with no observable performance regression.

Summary:

  • quantize_block_fp32_q8x1 → group size 32
  • quantize_block_fp32_q8x2 → group size 64
  • quantize_block_fp32_q8x4 → group size 128

This change aligns the Hexagon NPU backend with the true Q8_0 quantization scheme used on CPU, improving correctness while retaining flexibility for performance tuning and future experimentation.

@ngdxzy ngdxzy changed the title ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU (add q8x1 / q8x2 paths) ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU (add q8x1 / q8x2 paths) for more accurate mixed-precision matmul operations Dec 12, 2025
@ngdxzy ngdxzy changed the title ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU (add q8x1 / q8x2 paths) for more accurate mixed-precision matmul operations ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations Dec 12, 2025
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Dec 13, 2025
@max-krasnyansky
Member

Had a chance to review and play with this now.

My original implementation of dyn.quant was heavily fp32-based and used block-size 32.
I spent quite a bit of time tuning it and eventually ended up with fp16-based version and block-size 128.
So my first reaction on seeing this was that it's going to be quite a bit slower.
Surprisingly, it's not too bad, but it's still slower. That's fairly obvious from the code (it uses a lot more instructions, more stores, etc), but it's somewhat tricky to profile because it's subject to L2/DDR perf variability.

Here is a quick test on Gen5 with llama-bench and OPMASK=0x3 (ie dyn.quant enabled and rest of matmul disabled).

OPMASK=0x3 ./scripts/.../run-tool.sh llama-bench --device HTP0 -m Llama-3.2-3B-Instruct-Q4_0.gguf -t 6 -p 128 -n 0 --cpu-mask 0xfc --cpu-strict 1 --poll 1000              

Before
| llama 3B Q4_0   |   1.78 GiB |   3.21 B | OpenCL,HTP |  99 | 6 | 0xfc  | 1 | 1000 | HTP0  | pp128 | 1740.01 ± 36.77  |

After

| llama 3B Q4_0   |   1.78 GiB |   3.21 B | OpenCL,HTP |  99 | 6 | 0xfc  | 1 | 1000 | HTP0  | pp128 | 1697.04 ± 38.54 |

Also if you were to compile with GGML_HEXAGON_HTP_DEBUG=ON

Before

CDSP0:[SU]: quantize-fp32-q8x4: 0/8 : n-rows 128 (0:16) row-size 32768 -> 8704 usec 172
CDSP0:[SU]: quantize-fp32-q8x4: 1/8 : n-rows 128 (16:32) row-size 32768 -> 8704 usec 181 
CDSP0:[SU]: quantize-fp32-q8x4: 2/8 : n-rows 128 (32:48) row-size 32768 -> 8704 usec 181 
CDSP0:[SU]: quantize-fp32-q8x4: 6/8 : n-rows 128 (96:112) row-size 32768 -> 8704 usec 180
CDSP0:[SU]: quantize-fp32-q8x4: 7/8 : n-rows 128 (112:128) row-size 32768 -> 8704 usec 180
CDSP0:[SU]: quantize-fp32-q8x4: 4/8 : n-rows 128 (64:80) row-size 32768 -> 8704 usec 181 
CDSP0:[SU]: quantize-fp32-q8x4: 3/8 : n-rows 128 (48:64) row-size 32768 -> 8704 usec 181 
CDSP0:[SU]: quantize-fp32-q8x4: 5/8 : n-rows 128 (80:96) row-size 32768 -> 8704 usec 180 

After

CDSP0:[SU]: quantize-fp32-q8x4: 7/8 : n-rows 128 (112:128) row-size 32768 -> 8704 usec 186
CDSP0:[SU]: quantize-fp32-q8x4: 2/8 : n-rows 128 (32:48) row-size 32768 -> 8704 usec 184
CDSP0:[SU]: quantize-fp32-q8x4: 3/8 : n-rows 128 (48:64) row-size 32768 -> 8704 usec 185
CDSP0:[SU]: quantize-fp32-q8x4: 6/8 : n-rows 128 (96:112) row-size 32768 -> 8704 usec 184
CDSP0:[SU]: quantize-fp32-q8x4: 1/8 : n-rows 128 (16:32) row-size 32768 -> 8704 usec 186
CDSP0:[SU]: quantize-fp32-q8x4: 0/8 : n-rows 128 (0:16) row-size 32768 -> 8704 usec 184
CDSP0:[SU]: quantize-fp32-q8x4: 5/8 : n-rows 128 (80:96) row-size 32768 -> 8704 usec 184
CDSP0:[SU]: quantize-fp32-q8x4: 4/8 : n-rows 128 (64:80) row-size 32768 -> 8704 usec 185

Oftentimes the numbers are about the same, but I do see consistently better perf from the quantizer in master.
My thinking is that the small improvement in perplexity score (~0.0857) is not worth it. But it's good to have options.
Also, whenever we get to enabling HMX-INT8/4 (we'll start with FP16 as I mentioned in other discussions) we might need to introduce another dyn-quantizer scheme that computes/uses intermediate INT scales.
With that in mind, how about we introduce a GGML_HEXAGON_DYNQUANT_TYPE cmake option?

The default would be Q8x4 (ie block-size 128) and folks can enable Q8x1 if their use case benefits from improved precision.
In the future we can add additional types there which are more lossy but more performant.

With the cmake option (which adds a compile definition) we don't need to pass a function pointer.
Just add #define QUANTIZE_BLOCK_FP32 and set it based on the compile definition.
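The compile-definition approach suggested here can be sketched as follows: a group-size definition set by CMake selects the kernel behind a single QUANTIZE_BLOCK_FP32 macro. Only the macro name, the FP32_QUANTIZE_GROUP_SIZE definition, and the 32/64/128 choices come from this PR; the stub function bodies below are placeholders for the real kernels:

```c
// Compile-time kernel selection via a CMake-provided definition.
// The real kernels live in the Hexagon backend; these stubs just
// return their group size so the dispatch can be demonstrated.
static int quantize_block_fp32_q8x1(void) { return 32;  }  // stub
static int quantize_block_fp32_q8x2(void) { return 64;  }  // stub
static int quantize_block_fp32_q8x4(void) { return 128; }  // stub

#ifndef FP32_QUANTIZE_GROUP_SIZE
#define FP32_QUANTIZE_GROUP_SIZE 128  // default: the Q8x4 path
#endif

#if FP32_QUANTIZE_GROUP_SIZE == 32
#define QUANTIZE_BLOCK_FP32 quantize_block_fp32_q8x1
#elif FP32_QUANTIZE_GROUP_SIZE == 64
#define QUANTIZE_BLOCK_FP32 quantize_block_fp32_q8x2
#elif FP32_QUANTIZE_GROUP_SIZE == 128
#define QUANTIZE_BLOCK_FP32 quantize_block_fp32_q8x4
#else
#error "FP32_QUANTIZE_GROUP_SIZE must be 32, 64, or 128"
#endif
```

Call sites then use QUANTIZE_BLOCK_FP32 directly, with no function-pointer indirection and no runtime branch.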

@ngdxzy
Contributor Author

ngdxzy commented Dec 18, 2025

That’s a good idea, and it makes sense to me. I can update the CMakeLists.txt to support this. Currently, group sizes of 32, 64, and 128 are implemented.

Member

@max-krasnyansky left a comment

Looks great. Thanks for the quick turnaround.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Dec 18, 2025
@max-krasnyansky
Member

Sorry. I just realized that the build fails if CMakePreset is not updated.

/workspace/ggml/src/ggml-hexagon/htp/matmul-ops.c:1798:2: error: "FP32_QUANTIZE_GROUP_SIZE must be 32, 64, or 128"

That's because the default is not kicking in properly. Here is one of the compiler commands with a clean build (ie no prev build dir).

/opt/hexagon/6.4.0.2/tools/HEXAGON_Tools/19.0.04/Tools/bin/hexagon-clang -DFP32_QUANTIZE_GROUP_SIZE=OFF  <<<<

@ngdxzy
Contributor Author

ngdxzy commented Dec 18, 2025

Sorry about that; I was using the UserPreset.json when building. It should now be fixed.

@max-krasnyansky max-krasnyansky merged commit ce734a8 into ggml-org:master Dec 19, 2025
68 checks passed
Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
…e accurate mixed-precision matmul operations (ggml-org#17977)

* feat: implement real Q8_0

* feat: adding cmake option for configuring FP32 quantize group size

* typo: set() shall be used

---------

Co-authored-by: ngdxzy <zhenyu_xu@uri.edu>
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026
…e accurate mixed-precision matmul operations (#17977)

* feat: implement real Q8_0

* feat: adding cmake option for configuring FP32 quantize group size

* typo: set() shall be used

---------

Co-authored-by: ngdxzy <zhenyu_xu@uri.edu>