get_rows & dequantize function implementation for repacked weights of type q4_0 (q4_0x8) #3223
swetha097 wants to merge 15 commits into ggml-org:master
Conversation
@danbev Addressed the review comments.
Let's merge this after the refactoring in ggml-org/llama.cpp#13892 to avoid resolving the conflicts.
@swetha097 Would you be able to take a look at resolving the conflict reported here?
@danbev Resolved the conflicts.
@swetha097 Thanks! I think there will be a sync of ggml/llama.cpp shortly; hopefully this won't cause conflicts, but I'm not sure.
@swetha097 Hm, I am a bit confused about why this would make a difference for the performance. The only place that we use … (Lines 2543 to 2548 in 1e72e4b). How does this change lead to increased usage of the …
@ggerganov This tensor is used by both GGML_OP_MUL_MAT and GGML_OP_GET_ROWS. Because the original get_rows operation could not handle repacked weights, the tensor remained in a standard format, preventing GGML_OP_MUL_MAT from using its most optimized kernel (ggml_gemm_q4_0_8x8_q8_0); previously it fell back to the gemm4xN kernel instead.
@swetha097 Thanks, got it. @eddnjjn Could you help review this change?
ggml/src/ggml-cpu/repack.cpp (Outdated)

```cpp
const char * base_ptr_for_higher_dims_in_src0 = (const char *)src0->data + i11 * nb02 + i12 * nb03;

// Pointer to the first block_q4_0x8 of the identified row_group_idx
const block_q4_0x8 * p_first_repacked_block_of_group_x8 = (const block_q4_0x8 *)(base_ptr_for_higher_dims_in_src0 + row_group_idx * stride_between_actual_row_groups);
```
Wouldn't this fail if the type of this template is actually block_q4_0x4?
I think we should add support for testing extra buffer types to test-backend-ops before adding more complexity to this code.
Yes, I agree. We'll have to add tests before merging more stuff to the repacking.
I have updated the code with template changes that differentiate between the block_q4_0 variants (q4_0x8, q4_0x4). Please have a look at the changes and share your feedback.
Are there any other dependencies that should be resolved for the PR to move forward? Please share your thoughts.
The main concern here is that we don't have a way to test these changes, as noted earlier by slaren. The PR should wait until we extend test-backend-ops.
Given that llama.cpp already includes test-backend-ops, would it be acceptable to open this PR in llama.cpp directly?
This implements the GGML_OP_GET_ROWS operation specifically for the repacked (block-interleaved) 4-bit quantized format (q4_0x8) defined within ggml-cpu-aarch64.cpp.
The following gains were observed from the changes in this PR: they allow for increased usage of the GEMM function (ggml_gemm_q4_0_8x8_q8_0) for the q4_0 type defined within ggml-cpu-aarch64.cpp.
master branch commit - 82f461eaa4e6a1ba29fc0dbdaa415a9934ee8a1d
development (get_rows) branch commit - d6cc466bd48dd27474ecb00c3baba2e8a887f6c4
The PR was tested on an AMD Raphael 7600X, which supports the following flags:
Model for performance tests downloaded from https://huggingface.co/ggerganov/whisper.cpp/blob/main/ggml-base.en.bin and quantized to q4_0.
This patch was also tested with the llama.cpp repository, and the perplexity of Q4_0 models was verified to be the same before and after the changes:
Model used for perplexity test quantized from - https://huggingface.co/meta-llama/Llama-2-7b