ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations#17977
Conversation
Had a chance to review and play with this now. My original implementation of dyn.quant was heavily fp32-based and used block size 32. Here is a quick test on Gen5. Also, if you compile with the Q8x1 option, oftentimes the numbers are about the same, but I do see consistently better perf from this PR's quantizer. The default would be Q8x4 (i.e. block size 128), and folks can enable Q8x1 if their use case benefits from improved precision. With the cmake option (that adds a compile definition) we don't need to pass a function pointer.
That's a good idea, and it makes sense to me. I can update the CMakeLists.txt to support this. Currently, group sizes of 32, 64, and 128 are implemented.
Sorry, I just realized that the build fails if the CMakePreset is not updated, because the default is not kicking in properly. Here is one of the compiler commands from a clean build (i.e. no previous build dir).
Sorry for that; I was using the UserPreset.json when building. It should now be fixed.
ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations (ggml-org#17977)

* feat: implement real Q8_0
* feat: adding cmake option for configuring FP32 quantize group size
* typo: set() shall be used

Co-authored-by: ngdxzy <zhenyu_xu@uri.edu>
Description:
This PR implements true Q8_0 quantization for the Hexagon NPU backend, building on and integrating substantial previous work, to align its behavior with the CPU implementation in llama.cpp and improve the numerical accuracy of mixed-precision matmul operations.
Background:
In the current Hexagon NPU pipeline, quantization is performed on-the-fly during matrix multiplication, where FP32 activations are quantized and multiplied with already quantized weights. As a result, the quantization group size directly impacts the numerical behavior of these mixed-precision matmul operations, making alignment with the CPU Q8_0 scheme particularly important for correctness.
Previously, the Hexagon backend only supported quantize_block_fp32_q8x4, which uses a group size of 128. While functional, this does not match the standard Q8_0 definition (group size 32) used by the CPU backend in llama.cpp, leading to accuracy differences.
What's new:
- True Q8_0 quantization (group size 32) for on-the-fly activation quantization, matching the CPU backend's Q8_0 scheme.
- A CMake option that selects the FP32 quantize group size at compile time (32, 64, and 128 are implemented); it is applied as a compile definition, so no function pointer needs to be passed.
- The default remains Q8x4 (group size 128), and the more precise Q8x1 path can be enabled for use cases that benefit from improved accuracy.
Why this matters:
Because activations are quantized on the fly during matmul, aligning the quantization group size with the CPU Q8_0 scheme makes the NPU's mixed-precision results numerically consistent with the CPU path, while the compile-time option preserves the existing, faster default for workloads that do not need the extra precision.
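If exposed as described in the review thread, selecting the group size at configure time would look something like the following. The option name below is hypothetical; check the CMakeLists.txt changes in this PR for the actual name.

```shell
# Hypothetical option name: the PR adds a cmake option that becomes a
# compile definition selecting the FP32 quantize group size (32, 64, or 128).
cmake -B build -DGGML_HEXAGON=ON -DGGML_HEXAGON_QUANT_GROUP_SIZE=32
cmake --build build
```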
Summary:
This change aligns the Hexagon NPU backend with the true Q8_0 quantization scheme used on CPU, improving correctness while retaining flexibility for performance tuning and future experimentation.