UPSTREAM PR #17951: ggml-cpu:fix RISC-V Q4_0 repack select and RVV feature reporting #531

Open: loci-dev wants to merge 2 commits into main from upstream-PR17951-branch_ixgbe-fix_riscv_q4_0_repack_selection
@loci-dev

Mirrored from ggml-org/llama.cpp#17951

Changes included:

  • Add ggml_cpu_get_rvv_cnt() and RVV vector-length initialization.
  • Export RVV_CNT in CPU feature list.
  • Update ggml_repack_get_optimal_repack_type() to enable Q4_0 repack when
    ggml_cpu_has_riscv_v() and rvv_cnt >= QK4_0.

Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
@loci-review

loci-review Bot commented Dec 12, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #531

Overview

PR #531 adds RISC-V vector extension (RVV) support to Q4_0 quantization repacking. It detects the RVV vector length at runtime and enables the optimized 8x8 block layout when the hardware vector length is at least 256 bits. The change is platform-specific and touches 4 files, with 33 additions and 1 deletion.

Code Changes Analysis

The implementation adds ggml_cpu_get_rvv_cnt() to query RISC-V vector register length at runtime, mirroring the existing ARM SVE pattern. The core modification updates ggml_repack_get_optimal_repack_type() in repack.cpp to include RISC-V in the platform selection logic alongside existing AVX2 and SVE paths. The changes are isolated to RISC-V-specific code paths with appropriate compilation guards, ensuring zero impact on x86-64 and ARM platforms.
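The selection rule described above can be sketched roughly as follows. This is a simplified model, not the code from the PR: `CpuFeatures`, `RepackKind`, `pick_q4_0_repack`, and `sve_path_ok` are hypothetical names, the vector length is assumed to be reported in bytes (so that `rvv_cnt >= QK4_0` corresponds to 256 bits), and the multiple-of-8 row check is an assumption added for illustration.

```cpp
#include <cstdint>

// Q4_0 block size in ggml (elements per quantization block).
constexpr int QK4_0 = 32;

// Hypothetical result type: which repacked layout to use, if any.
enum class RepackKind { None, Q4_0_8x8 };

// Hypothetical snapshot of the runtime CPU feature queries.
struct CpuFeatures {
    bool has_avx2    = false;
    bool sve_path_ok = false; // stands in for the existing ARM SVE condition
    bool has_riscv_v = false; // models ggml_cpu_has_riscv_v()
    int  rvv_cnt     = 0;     // assumed: RVV vector length in bytes
};

// Pick a repack layout for a Q4_0 tensor with `rows` rows.
inline RepackKind pick_q4_0_repack(const CpuFeatures & cpu, int64_t rows) {
    // New condition from the PR description: RVV present and wide enough.
    const bool rvv_ok = cpu.has_riscv_v && cpu.rvv_cnt >= QK4_0;
    const bool wide   = cpu.has_avx2 || cpu.sve_path_ok || rvv_ok;
    // Assumed for illustration: 8x8 interleaving needs rows in multiples of 8.
    if (wide && rows % 8 == 0) {
        return RepackKind::Q4_0_8x8;
    }
    return RepackKind::None;
}
```

In this model, a RISC-V CPU reporting `rvv_cnt = 32` takes the 8x8 path, while one reporting `rvv_cnt = 16` falls back to no repacking.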

Performance Impact

Inference Performance:
No functions in the critical inference path (llama_decode, llama_encode, llama_tokenize) were modified. The changes affect preprocessing during model loading, not the hot path execution. Therefore, tokens per second remains unchanged for all platforms.

Power Consumption:
Analysis shows negligible power consumption changes across all binaries:

  • libggml-cpu.so: -0.3% (116,901 nJ → 116,550 nJ)
  • libllama.so: -0.0% (195,495 nJ, no meaningful change)
  • All other binaries: 0.0% change

The 0.3% reduction in libggml-cpu.so corresponds to an absolute change of 351 nJ, which is within measurement noise and does not indicate actual power savings.
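For reference, the percentage follows from the two reported energy figures. The helper below is a hypothetical one-liner, not part of the analysis tooling; it only reproduces the arithmetic.

```cpp
// Relative change of `after` versus `before`, in percent.
inline double percent_change(double before, double after) {
    return (after - before) / before * 100.0;
}
```

With the libggml-cpu.so figures, `percent_change(116901.0, 116550.0)` evaluates to roughly -0.30, matching the reported -0.3% and the 351 nJ difference.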

RISC-V-Specific Impact:
For RISC-V platforms with an RVV vector length of at least 256 bits, the change selects the 8x8 Q4_0 repack layout during model loading. The repacking itself is a one-time preprocessing cost and adds no per-token work. The resulting layout is meant to improve memory access patterns for subsequent Q4_0 matrix operations; it does not affect the functions analyzed in previous performance reports (quantize_row_q4_K, quantize_row_q6_K, parameter setters).

Key Findings

No Impact on Analyzed Performance Metrics:
The 10 functions with highest response time changes identified in prior analysis (ggml_vec_argmax_f32 +74 ns, quantize_row_q6_K +14 ns, parameter setters +11-21 ns) are unrelated to this PR. Those regressions stem from validation logic and quantization algorithm changes in the baseline comparison, not from RVV support additions.

Platform Isolation:
All changes are conditionally compiled for RISC-V only. x86-64 and ARM code paths remain identical, confirmed by zero performance delta on non-RISC-V binaries.

Preprocessing vs Runtime:
The repack selection logic executes during model loading, not during token generation. The ggml_repack_get_optimal_repack_type() function determines data layout for subsequent operations but is not called in the inference loop.

@loci-review

loci-review Bot commented Dec 12, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #531

Version Comparison: 09fbc8c1 vs fd9769c0
Binary: build.bin.libggml-cpu.so


Analysis Classification: Condition 1

This PR introduces RISC-V RVV support for Q4_0 quantization without modifying core computational logic on x86_64 architecture. The observed performance variations are within measurement noise and do not represent functional changes to the inference pipeline.

Performance Metrics:

  • Power consumption change: -0.48% (-566 nJ) in libggml-cpu.so
  • All other binaries: 0.00% change
  • Largest absolute changes: 74 ns (ggml_vec_argmax_f32), 11 ns (parameter accessors)

Code Changes:

  • Added ggml_cpu_get_rvv_vlen() API for RISC-V vector length detection
  • Modified ggml_repack_get_optimal_repack_type() to enable Q4_0 8x8 repacking on RISC-V when vector length is sufficient
  • Changes are architecture-specific and conditionally compiled for RISC-V only
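A minimal sketch of how such a vector-length query might look, assuming a toolchain that provides the RVV intrinsics header. `query_rvv_vlen_bytes` and `rvv_wide_enough_for_q4_0` are hypothetical names, not the functions added by the PR; `__riscv_vlenb()` is the RVV intrinsic that returns the vector register length in bytes. On non-RISC-V builds the guarded code compiles to a stub that returns 0, which mirrors the conditional-compilation point in the list above.

```cpp
#if defined(__riscv) && defined(__riscv_v_intrinsic)
#include <riscv_vector.h>
#endif

// Hypothetical helper: vector register length in bytes, or 0 when the
// build does not target RISC-V with the vector intrinsics available.
inline int query_rvv_vlen_bytes() {
#if defined(__riscv) && defined(__riscv_v_intrinsic)
    return static_cast<int>(__riscv_vlenb());
#else
    return 0;
#endif
}

// Hypothetical helper: true when the vector length covers one Q4_0 block
// (32 bytes, i.e. 256 bits, assuming the length is counted in bytes).
inline bool rvv_wide_enough_for_q4_0(int vlen_bytes) {
    return vlen_bytes >= 32;
}
```

Caching the queried value once at startup, rather than calling the intrinsic at every decision point, would match the "initialization" step mentioned in the PR description, though the exact mechanism used there is not shown on this page.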

Inference Impact:
No functions in the tokenization or inference pipeline (llama_decode, llama_encode, llama_tokenize) were modified. The changes affect only RISC-V-specific feature detection and quantization path selection. On x86_64 systems, tokens per second remains unchanged.

@loci-dev loci-dev force-pushed the main branch 24 times, most recently from f70847d to 45e0e28 Compare December 14, 2025 22:08
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 9f1f66d to ec69147 Compare December 19, 2025 12:14