
get_rows & dequantize function implementation for repacked weights of type q6_K (q6_Kx8) #16743

Open
swetha097 wants to merge 2 commits into ggml-org:master from swetha097:q6_K/get_rows_and_dequantize

Conversation

@swetha097

NOTE: Creating the PR with the changes required for whisper.cpp here, as llama.cpp already includes test-backend-ops coverage.

  • This implements the GGML_OP_GET_ROWS operation specifically for repacked (block interleaved) 6-bit quantized format (q6_Kx8).
  • Observed gains from the changes in this PR: they allow increased use of the GEMM function (ggml_gemm_q6_K_8x8_q8_0) for the q6_K type.
  • The PR was tested on an AMD Raphael 7600X with whisper, which supports the following flags:
    system_info: n_threads = 4 / 12 | WHISPER : COREML = 0 | OPENVINO = 0 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
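The core of the change described above is a GET_ROWS that can extract a single logical row out of a block-interleaved (repacked) weight buffer. The sketch below illustrates the indexing only, under heavy simplification: it is not the actual q6_K layout (a real block_q6_K holds 256 quantized 6-bit values plus scales, and dequantization is involved). Here a "block" is just `BLK` plain ints, `INTER = 8` mirrors the q6_Kx8 interleave factor, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define INTER 8  /* rows interleaved per group, as in q6_Kx8          */
#define BLK   4  /* values per block (a real q6_K block holds 256)    */

/* Hypothetical repacked layout: rows are grouped in sets of INTER;
   within a group the data is ordered block-major, i.e. for block b the
   INTER rows' blocks are stored contiguously. The block of (group, b,
   pos) starts at ((group*n_blocks + b)*INTER + pos)*BLK. */

/* Repack a plain row-major matrix (n_rows x n_blocks*BLK) into the
   interleaved layout, for testing the extraction below. */
static void repack(const int *plain, int n_rows, int n_blocks, int *out) {
    for (int r = 0; r < n_rows; r++) {
        int group = r / INTER, pos = r % INTER;
        for (int b = 0; b < n_blocks; b++) {
            int *blk = out + ((size_t)(group * n_blocks + b) * INTER + pos) * BLK;
            for (int j = 0; j < BLK; j++)
                blk[j] = plain[(size_t)r * n_blocks * BLK + b * BLK + j];
        }
    }
}

/* GET_ROWS analogue: recover logical row r from the repacked buffer. */
static void get_row_repacked(const int *src, int n_blocks, int r, int *dst) {
    int group = r / INTER;  /* which INTER-row group the row lives in   */
    int pos   = r % INTER;  /* position of the row inside that group    */
    for (int b = 0; b < n_blocks; b++) {
        const int *blk = src + ((size_t)(group * n_blocks + b) * INTER + pos) * BLK;
        for (int j = 0; j < BLK; j++)
            dst[b * BLK + j] = blk[j];
    }
}
```

In the real kernel the inner copy would additionally dequantize the 6-bit block into floats, but the group/position arithmetic that undoes the 8-row interleave is the part GET_ROWS must get right.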

master branch commit - swetha097/whisper.cpp@fc45bb8

q6_K repacking commit (block interleaving approach for Q6_K quantization for x64/x86 SIMD architectures) - swetha097/whisper.cpp@d89aaf2

development (get_rows) branch commit - swetha097/whisper.cpp@de9839e

Model for performance tests downloaded from https://huggingface.co/ggerganov/whisper.cpp/blob/main/ggml-base.en.bin and quantized to q6_K.

This patch was also tested with the llama.cpp repository, and the perplexity of Q6_K models was verified to be identical before and after the changes:

Final estimate: PPL = 5.3669 +/- 0.13305

Model used for the perplexity test was quantized from https://huggingface.co/meta-llama/Llama-2-7b

This PR is to be merged after - Q6_K - Block Interleaving Implementation for x86 SIMD (AVX512/AVX2) #15275

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Oct 23, 2025
@Alcpz
Collaborator

Alcpz commented Mar 11, 2026

@swetha097 I've come across an issue due to the lack of support for GET_ROWS in CPU_REPACK (described in #20396). I think your PR essentially solves the problem, but it's been open for a while. Can I help you somehow to move this forward (testing, rebasing)? This seemed blocked by the lack of support for Q6_K repack, but despite #15275 still being open, other contributions have already enabled q6_K, so it should be fine to review and merge this.

@swetha097
Author

swetha097 commented Mar 17, 2026

> @swetha097 I've come across an issue due to the lack of support with GET_ROWS in CPU_REPACK (described in #20396). I think your PR essentially solves the problem, but it's been open for a while. Can I help you somehow to move this forward (testing, rebasing). This seemed blocked due to the lack of support for Q6_K repack, but despite #15275 being still open, other contributions already enabled q6_K, so it should be fine to review and merge this in.

Hi @Alcpz
We are rebasing and testing this PR, which is in progress. Our team has also opened the Q6_K PR - block interleaving for the x86 architecture - #19706; could you help us get that PR merged into the master branch?

@Alcpz
Collaborator

Alcpz commented Mar 17, 2026

I can give a hand on the other PR. This one should be pretty straightforward. As I mentioned, #20396 is essentially this PR, adapted a bit. Once you are confident it's good to go, ping ggerganov. Slaren is taking a break, so he won't be able to help with the review.

Edit: Also ping me if you need help here as well.

@ggerganov
Member

@Alcpz It's difficult to extend the repack logic without having a testing infrastructure for the extra buffer types in place first (see the discussion in ggml-org/whisper.cpp#3223). So for now we avoid such changes.

@Alcpz
Collaborator

Alcpz commented Mar 17, 2026

I understand. It's been a pain to test my PRs, and honestly, I'm grateful that you agreed to merge those (which add a few new repack types).
So this is blocked until #16004 gets in, right?

I had a chat with @tdakhran, and he found that models with tied embeddings currently duplicate the tensor; after investigating, I landed on this PR. The lack of GET_ROWS support causes this duplication, since we now need one repacked tensor and one without repacking. For small models this yields a very high memory footprint that heavily affects low-memory devices, so I was looking into whether I could help move this forward somehow.

@ggerganov
Member

> So this is blocked until #16004 gets in right?

I think so, though it's quite low priority on my end.

We basically need a mechanism to exercise and verify all of the repack logic. This also requires CI workflows and respective hardware that would run it regularly.

