Q6_K - Block Interleaving Implementation for x86 SIMD (AVX512/AVX2) by Srihari-mcw · Pull Request #15275 · ggml-org/llama.cpp

Srihari-mcw · 2025-08-12T19:05:03Z

The PR contains block interleaving approach for Q6_K quantization for x64/x86 SIMD Architecture
Initial gains were observed with prompt processing with the above changes compared to the tested Q6_K model
The GEMM function was implemented for AVX512/AVX2 and GEMV functions are implemented for the AVX2 architecture
repack_q6_K_to_q6_K_8_bl function rearranges the weight in Q6_K format to Q6_Kx8 format(block_q6_Kx8)

Block Interleaving Formats

Block_Q6_Kx8 :

Used to contain data of 8 Q6_K blocks in interleaved fashion
uint8 scales[128] - Scales from source Q6_K blocks are taken. Every 16 byte here is packed such that it contains scales for corresponding sub blocks from Q6_K structure - There are 16 sub blocks in original Q6_K structure
The d values from source Q6_K blocks are stored together in an array
Quant values (hbits and lbits) from the source Q6_K blocks are sequentially extracted and interleaved into groups of eight bytes

Performance numbers with llama2 7B model quantized to Q6_K is attached here

GCC Linux :

Q6_K Model :

model	size	params	backend	threads	test	t/s	speedup	Commit id
llama 7B Q6_K	5.15 GiB	6.74 B	CPU	6	pp 512	40.22 ± 0.04		79c116 - Base Commit
llama 7B Q6_K	5.15 GiB	6.74 B	CPU	6	pp 512	45.51 ± 0.07	13.15%	3b3d551 - AVX2 Version
llama 7B Q6_K	5.15 GiB	6.74 B	CPU	6	pp 512	59.81 ± 0.11	48.71%	3b3d551 - AVX512 Version
llama 7B Q6_K	5.15 GiB	6.74 B	CPU	6	tg 128	10.55 ± 0.00		79c116 - Base Commit
llama 7B Q6_K	5.15 GiB	6.74 B	CPU	6	tg 128	10.29 ± 0.00	-2.46%	3b3d551 - AVX2 Version
llama 7B Q6_K	5.15 GiB	6.74 B	CPU	6	tg 128	10.29 ± 0.00	-2.46%	3b3d551 - AVX512 Version

GCC Version = 12.3

The PR was tested in AMD Granite Ridge 9600X which supports the following flags by default :

Srihari-mcw · 2025-08-14T02:54:12Z

The perplexity results with llama2 7B are tabulated as follows :

model	perplexity (Final estimate PPL)	Commit id
llama 7B Q6_K	5.8164 +/- 0.03250	79c116 - Base Commit
llama 7B Q6_K	5.8163 +/- 0.03250	3b3d551 - Updated Commit

jukofyork · 2025-08-14T11:17:40Z

AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1

Interesting the AVX512 is so much faster prompt processing. Which of these is making the most difference?

Srihari-mcw · 2025-08-15T13:59:32Z

AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1

Interesting the AVX512 is so much faster prompt processing. Which of these is making the most difference?

@jukofyork Repacking of weights enables much more efficient usage of AVX512 which is not the case with existing setup. Thanks

Srihari-mcw · 2025-08-15T14:00:02Z

Update : Scalar code accuracy issues are fixed and the code is ready for further review. Thanks

jukofyork · 2025-08-15T14:40:32Z

AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1

Interesting the AVX512 is so much faster prompt processing. Which of these is making the most difference?

@jukofyork Repacking of weights enables much more efficient usage of AVX512 which is not the case with existing setup. Thanks

Thanks - when it gets finalised then I will give this a try with my dual Xeon Gold 6248:

https://en.wikichip.org/wiki/intel/xeon_gold/6248

system_info: n_threads = 80 (n_threads_batch = 80) / 80 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

the main thing they have is AVX512_VNNI = 1 due to being Cascade Lake, so will be interesting to see what effect this PR has.

I currently run large MoE models with everything in Q6_K apart from the non-shared expert tensors which I use Q4_K, and the Q4_K is only kept on the CPU/RAM for small batch sizes where the cost of offloading to the GPU via the PCI-E bus is too much.

Srihari-mcw · 2025-11-12T04:29:31Z

Hi @slaren / @ggerganov , any thoughts on further steps with regards to this PR. Thanks

ggerganov

Some minor formatting comments.

The main issue as usual is that we don't have CI for AVX512 and hard to approve these changes. Will ping you if we encounter any problems in the future.

ggerganov · 2025-11-12T12:59:45Z

ggml/src/ggml-cpu/repack.cpp

+    block_q6_Kx8* dst = (block_q6_Kx8*)t->data;
+    const block_q6_K* src = (const block_q6_K*)data;


Suggested change

block_q6_Kx8* dst = (block_q6_Kx8*)t->data;

const block_q6_K* src = (const block_q6_K*)data;

block_q6_Kx8 * dst = (block_q6_Kx8 *)t->data;

const block_q6_K * src = (const block_q6_K *)data;

ggerganov · 2025-11-12T12:59:56Z

ggml/src/ggml-cpu/repack.cpp

    GGML_UNUSED(data_size);
 }

+static int repack_q6_K_to_q6_K_8_bl(struct ggml_tensor* t, int interleave_block, const void* GGML_RESTRICT data, size_t data_size) {


Suggested change

static int repack_q6_K_to_q6_K_8_bl(struct ggml_tensor* t, int interleave_block, const void* GGML_RESTRICT data, size_t data_size) {

static int repack_q6_K_to_q6_K_8_bl(struct ggml_tensor * t, int interleave_block, const void * GGML_RESTRICT data, size_t data_size) {

ggerganov · 2025-11-12T13:00:11Z

ggml/src/ggml-cpu/repack.cpp

+    }
+    return out;
+


Suggested change

}

return out;

}

return out;

ggerganov · 2025-11-12T13:00:23Z

ggml/src/ggml-cpu/repack.cpp

+    for (int i = 0; i < 128; i++) {
+
+        // Index for selecting which q6k super block


Suggested change

for (int i = 0; i < 128; i++) {

// Index for selecting which q6k super block

for (int i = 0; i < 128; i++) {

// Index for selecting which q6k super block

ggerganov · 2025-11-12T13:00:35Z

ggml/src/ggml-cpu/repack.cpp

 }

+
+static block_q6_Kx8 make_block_q6_Kx8(block_q6_K* in, unsigned int blck_size_interleave) {


Suggested change

static block_q6_Kx8 make_block_q6_Kx8(block_q6_K* in, unsigned int blck_size_interleave) {

static block_q6_Kx8 make_block_q6_Kx8(block_q6_K * in, unsigned int blck_size_interleave) {

ggerganov · 2025-11-12T13:01:06Z

ggml/src/ggml-cpu/repack.cpp

+                    const int8_t *scales_0 = b_ptr[l].scales + (k / 4) * 64;
+                    const int8_t *scales_1 = b_ptr[l].scales + (k / 4) * 64 + 16;
+                    const int8_t *scales_2 = b_ptr[l].scales + (k / 4) * 64 + 32;
+                    const int8_t *scales_3 = b_ptr[l].scales + (k / 4) * 64 + 48;


Suggested change

const int8_t *scales_0 = b_ptr[l].scales + (k / 4) * 64;

const int8_t *scales_1 = b_ptr[l].scales + (k / 4) * 64 + 16;

const int8_t *scales_2 = b_ptr[l].scales + (k / 4) * 64 + 32;

const int8_t *scales_3 = b_ptr[l].scales + (k / 4) * 64 + 48;

const int8_t * scales_0 = b_ptr[l].scales + (k / 4) * 64;

const int8_t * scales_1 = b_ptr[l].scales + (k / 4) * 64 + 16;

const int8_t * scales_2 = b_ptr[l].scales + (k / 4) * 64 + 32;

const int8_t * scales_3 = b_ptr[l].scales + (k / 4) * 64 + 48;

ggerganov · 2025-11-12T13:01:32Z

ggml/src/ggml-cpu/repack.cpp

+                const int8_t *scales_0 = b_ptr[l].scales + (k / 4) * 64;
+                const int8_t *scales_1 = b_ptr[l].scales + (k / 4) * 64 + 16;
+                const int8_t *scales_2 = b_ptr[l].scales + (k / 4) * 64 + 32;
+                const int8_t *scales_3 = b_ptr[l].scales + (k / 4) * 64 + 48;


Suggested change

const int8_t *scales_0 = b_ptr[l].scales + (k / 4) * 64;

const int8_t *scales_1 = b_ptr[l].scales + (k / 4) * 64 + 16;

const int8_t *scales_2 = b_ptr[l].scales + (k / 4) * 64 + 32;

const int8_t *scales_3 = b_ptr[l].scales + (k / 4) * 64 + 48;

const int8_t * scales_0 = b_ptr[l].scales + (k / 4) * 64;

const int8_t * scales_1 = b_ptr[l].scales + (k / 4) * 64 + 16;

const int8_t * scales_2 = b_ptr[l].scales + (k / 4) * 64 + 32;

const int8_t * scales_3 = b_ptr[l].scales + (k / 4) * 64 + 48;

ggerganov · 2025-11-12T13:02:03Z

ggml/src/ggml-cpu/arch/x86/repack.cpp

+                    const __m256i rhs_mat_0145_30_sp2 = _mm256_shuffle_epi32(rhs_mat_0145_30, 221); //B30(4-7) B31(4-7) B30(4-7) B31(4-7) B34(4-7) B35(4-7) B34(4-7) B35(4-7)
+                    const __m256i rhs_mat_2367_30_sp2 = _mm256_shuffle_epi32(rhs_mat_2367_30, 221); //B32(4-7) B33(4-7) B32(4-7) B33(4-7) B36(4-7) B37(4-7) B36(4-7) B37(4-7)
+
+                    const __m256i rhs_mat_0145_31_sp2 = _mm256_shuffle_epi32(rhs_mat_0145_31, 221); //B30(12-15) B31(12-15) B30(12-15) B31(12-15) B34(12-15) B35(12-15) B34(12-15) B35(12-15)
+                    const __m256i rhs_mat_2367_31_sp2 = _mm256_shuffle_epi32(rhs_mat_2367_31, 221); //B32(12-15) B33(12-15) B32(12-15) B33(12-15) B36(12-15) B37(12-15) B36(12-15) B37(12-15)
+
+                    const __m256i rhs_mat_0145_40_sp2 = _mm256_shuffle_epi32(rhs_mat_0145_40, 221); //B40(4-7) B41(4-7) B40(4-7) B41(4-7) B44(4-7) B45(4-7) B44(4-7) B45(4-7)
+                    const __m256i rhs_mat_2367_40_s


Suggested change

const __m256i rhs_mat_0145_30_sp2 = _mm256_shuffle_epi32(rhs_mat_0145_30, 221); //B30(4-7) B31(4-7) B30(4-7) B31(4-7) B34(4-7) B35(4-7) B34(4-7) B35(4-7)

const __m256i rhs_mat_2367_30_sp2 = _mm256_shuffle_epi32(rhs_mat_2367_30, 221); //B32(4-7) B33(4-7) B32(4-7) B33(4-7) B36(4-7) B37(4-7) B36(4-7) B37(4-7)

const __m256i rhs_mat_0145_31_sp2 = _mm256_shuffle_epi32(rhs_mat_0145_31, 221); //B30(12-15) B31(12-15) B30(12-15) B31(12-15) B34(12-15) B35(12-15) B34(12-15) B35(12-15)

const __m256i rhs_mat_2367_31_sp2 = _mm256_shuffle_epi32(rhs_mat_2367_31, 221); //B32(12-15) B33(12-15) B32(12-15) B33(12-15) B36(12-15) B37(12-15) B36(12-15) B37(12-15)

const __m256i rhs_mat_0145_40_sp2 = _mm256_shuffle_epi32(rhs_mat_0145_40, 221); //B40(4-7) B41(4-7) B40(4-7) B41(4-7) B44(4-7) B45(4-7) B44(4-7) B45(4-7)

const __m256i rhs_mat_2367_40_s

}

#else

ggml_gemm_q6_K_8x8_q8_K_generic(n, s, bs, vx, vy, nr, nc);

#endif

ggerganov · 2025-11-14T08:51:36Z

This failure is a bit suspicious: https://github.com/ggml-org/llama.cpp/actions/runs/19328651203/job/55285977638?pr=15275

Will rerun the CI and see if it happens again.

ggerganov · 2025-12-17T18:56:14Z

The CI does not pass: https://github.com/ggml-org/llama.cpp/actions/runs/20304981395/job/58319854410?pr=15275

Srihari-mcw · 2025-12-24T06:14:47Z

Hi @ggerganov , most CI/CD issues were fixed. Can you pls comment on next steps here? Thanks

Manogna-Sree · 2026-03-17T07:44:24Z

This PR has been closed due to multiple merge conflicts with the master branch. A new PR(PR-19706) has been created with similar changes and there is no impact on the performance numbers that were shared earlier

ggerganov · 2026-03-17T13:08:02Z

Without having hardware to run this, it's difficult to review and accept. We should provision AVX512-capable runners to the CI first.

github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Aug 12, 2025

swetha097 mentioned this pull request Oct 23, 2025

get_rows & dequantize function implementation for repacked weights of type q6_K (q6_Kx8) #16743

Open

ggerganov approved these changes Nov 12, 2025

View reviewed changes

Srihari-mcw requested a review from slaren as a code owner November 13, 2025 10:36

Manogna-Sree and others added 19 commits December 16, 2025 22:50

Initial interleaving support for Q6_K Block Interleaving

aedac0d

Fix for inaccuracy of GEMM Q6K

4630b51

Initial implementation of GEMM Q6_K for edge handling case

5311e52

Avx512 implementation of GEMM Q6K

684c4ca

Avx512 implementation of GEMM Q6K for edge handling case

ed66268

GEMV scalar implementation

61a8c04

GEMM scalar implementation

6e46dc1

Initial cleanup of GEMM

56b1f7d

Further cleanup of GEMM

e1c3c05

Cleanup commit for AVX2 GEMM bigger loop

d6fb079

Cleanup of smaller loop of AVX2'

266fa80

Further cleanup

c29ac56

Add further fixes and updates to scalar code

4806d6a

Cleanup GEMV Code

5c851ca

Fix issues with scalar version

be80640

Rename variables to maintain convention in other functions

a3957d1

Remove print

ac42365

Fix for inaccuracies in the scalar version

b407f18

Fix CI/CD issues

75712bc

Manogna-Sree and others added 3 commits December 16, 2025 22:50

Remove empty line

9976c21

Remove trailing whitespaces

2913ac9

Address review comments

55f21c8

Srihari-mcw force-pushed the block_interleaving_implementation_of_q6k branch from b759685 to 55f21c8 Compare December 17, 2025 13:46

Fix CI issues

2383eb2

Manogna-Sree mentioned this pull request Feb 18, 2026

Q6_K - Block Interleaving Implementation for x86 SIMD (AVX512/AVX2) #19706

Open

		block_q6_Kx8* dst = (block_q6_Kx8*)t->data;
		const block_q6_K* src = (const block_q6_K*)data;

	static int repack_q6_K_to_q6_K_8_bl(struct ggml_tensor* t, int interleave_block, const void* GGML_RESTRICT data, size_t data_size) {
	static int repack_q6_K_to_q6_K_8_bl(struct ggml_tensor * t, int interleave_block, const void * GGML_RESTRICT data, size_t data_size) {

		for (int i = 0; i < 128; i++) {

		// Index for selecting which q6k super block

		}


		static block_q6_Kx8 make_block_q6_Kx8(block_q6_K* in, unsigned int blck_size_interleave) {

Conversation

Srihari-mcw commented Aug 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Srihari-mcw commented Aug 14, 2025

Uh oh!

jukofyork commented Aug 14, 2025

Uh oh!

Srihari-mcw commented Aug 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Srihari-mcw commented Aug 15, 2025

Uh oh!

jukofyork commented Aug 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Srihari-mcw commented Nov 12, 2025

Uh oh!

ggerganov left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ggerganov commented Nov 14, 2025

Uh oh!

ggerganov commented Dec 17, 2025

Uh oh!

Srihari-mcw commented Dec 24, 2025

Uh oh!

Manogna-Sree commented Mar 17, 2026

Uh oh!

ggerganov commented Mar 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Srihari-mcw commented Aug 12, 2025 •

edited

Loading

Srihari-mcw commented Aug 15, 2025 •

edited

Loading

jukofyork commented Aug 15, 2025 •

edited

Loading