Kimi Linear chunk size = 16 #19827

Merged
ggerganov merged 26 commits into ggml-org:master from ymcki:dn on Mar 5, 2026
Conversation

@ymcki (Contributor) commented Feb 23, 2026

~15% gain in pp from the block implementation.

Also, the CUDA0 compute buffer for IQ3_M at 192k context drops from 1693.51 MB to 991.29 MB on a 3090. This allows the context to increase further to 272k.

I also tried to implement the same thing for gated delta net, but it regresses in both pp speed and CUDA0 compute buffer usage. I have still included the GDN code in case someone would like to explore whether GDN can be sped up via blocks.

orig, pp 849t/s, tg 61t/s
| model                          |       size |     params | backend    | ngl | n_batch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |      64 |   pp512 @ d8192 |        504.32 ± 1.84 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |      64 |    tg32 @ d8192 |         61.20 ± 1.85 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     128 |   pp512 @ d8192 |        522.16 ± 1.00 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     128 |    tg32 @ d8192 |         61.25 ± 1.71 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     256 |   pp512 @ d8192 |        690.77 ± 2.34 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     256 |    tg32 @ d8192 |         61.07 ± 1.95 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     512 |   pp512 @ d8192 |        849.02 ± 6.51 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     512 |    tg32 @ d8192 |         60.70 ± 1.63 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |    1024 |   pp512 @ d8192 |        844.93 ± 6.80 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |    1024 |    tg32 @ d8192 |         60.74 ± 1.64 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |    2048 |   pp512 @ d8192 |        844.30 ± 7.04 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |    2048 |    tg32 @ d8192 |         60.49 ± 1.42 |

build: 23cccea2 (8102)
block, pp 974t/s, tg 61t/s
| model                          |       size |     params | backend    | ngl | n_batch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |      64 |   pp512 @ d8192 |        519.94 ± 1.99 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |      64 |    tg32 @ d8192 |         62.09 ± 1.99 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     128 |   pp512 @ d8192 |        556.54 ± 1.10 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     128 |    tg32 @ d8192 |         61.78 ± 1.71 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     256 |   pp512 @ d8192 |        762.60 ± 0.48 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     256 |    tg32 @ d8192 |         61.53 ± 1.98 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     512 |   pp512 @ d8192 |       973.75 ± 12.92 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     512 |    tg32 @ d8192 |         61.44 ± 1.80 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |    1024 |   pp512 @ d8192 |       969.33 ± 12.86 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |    1024 |    tg32 @ d8192 |         61.35 ± 1.62 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |    2048 |   pp512 @ d8192 |       968.47 ± 14.16 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |    2048 |    tg32 @ d8192 |         61.28 ± 1.60 |

build: 23cccea2 (8102)
block implementation for gated delta net
        const int64_t HB = H_k * n_seqs;
        kb = ggml_new_tensor_4d(ctx0, GGML_TYPE_F32, CS, CS, n_chunks, HB);
        kb = ggml_clamp(ctx0, kb, 0.0f, 0.0f);
        kq = ggml_new_tensor_4d(ctx0, GGML_TYPE_F32, CS, CS, n_chunks, HB);
        kq = ggml_clamp(ctx0, kq, 0.0f, 0.0f);
        const int64_t block_size = BLOCK_SIZE;
        const int64_t n_blocks = CHUNK_SIZE / BLOCK_SIZE;

        ggml_tensor * k_block[n_blocks];
        ggml_tensor * q_block[n_blocks];
        ggml_tensor * g_block[n_blocks];
        ggml_tensor * g_block_bc[n_blocks];
        for (int64_t j = 0; j < n_blocks; ++j) {
            int64_t j_start = j * block_size;

            // k_i_block: [S, block_size, C, HB]
            k_block[j] = ggml_view_4d(ctx0, k,
                S_k, block_size, n_chunks, HB,
                k->nb[1], k->nb[2], k->nb[3],
                j_start * k->nb[1]);

            // q_i_block: [S, block_size, C, HB]
            q_block[j] = ggml_view_4d(ctx0, q,
                    S_k, block_size, n_chunks, HB,
                    q->nb[1], q->nb[2], q->nb[3],
                    j_start * q->nb[1]);

            // g_j_block: [S, block_size, C, HB]
            g_block[j] = ggml_cont(ctx0, ggml_view_4d(ctx0, g_cs,
                block_size, 1, n_chunks, HB,
                g_cs->nb[1], g_cs->nb[2], g_cs->nb[3],
                j_start * g_cs->nb[0]));

            g_block_bc[j] = ggml_reshape_4d(ctx0, g_block[j], 1, block_size, n_chunks, HB);
            g_block_bc[j] = ggml_repeat_4d(ctx0, g_block_bc[j], block_size, block_size, n_chunks, HB);
        }

        for (int64_t j = 0; j < n_blocks; ++j) {
            int64_t j_start = j * block_size;

            for (int64_t i = 0; i <= j; ++i) {
                int64_t i_start = i * block_size;

                ggml_tensor * decay_mask = ggml_sub(ctx0, g_block_bc[j], g_block[i]);
                cb(decay_mask, "decay_mask", il);

                // Apply diag_mask only at diagonal blocks
                if (i == j) {
                    decay_mask = ggml_tri(ctx0, decay_mask, GGML_TRI_TYPE_LOWER_DIAG);
                }
                decay_mask = ggml_exp(ctx0, decay_mask);

                ggml_tensor * Akk_block = ggml_mul_mat(ctx0, k_block[i], k_block[j]);
                ggml_tensor * Aqk_block = ggml_mul_mat(ctx0, q_block[i], k_block[j]);
                Akk_block = ggml_mul(ctx0, Akk_block, decay_mask);
                Aqk_block = ggml_mul(ctx0, Aqk_block, decay_mask);

                if (i == j) {
                    Aqk_block = ggml_tri(ctx0, Aqk_block, GGML_TRI_TYPE_LOWER_DIAG);
                    Akk_block = ggml_tri(ctx0, Akk_block, GGML_TRI_TYPE_LOWER);
                }

                // Accumulate into Akk and Aqk at position [j_start:j_end, i_start:i_end]
                kb = ggml_set(ctx0, kb, Akk_block,
                    kb->nb[1], kb->nb[2], kb->nb[3],
                    i_start * kb->nb[0] + j_start * kb->nb[1]);
                kq = ggml_set(ctx0, kq, Aqk_block,
                    kq->nb[1], kq->nb[2], kq->nb[3],
                    i_start * kq->nb[0] + j_start * kq->nb[1]);
            }
        }
        kb = ggml_mul(ctx0, kb, b);
        cb(kq, "kq", il);

@ymcki ymcki requested a review from CISC as a code owner February 23, 2026 12:14
@github-actions github-actions bot added the model Model specific label Feb 23, 2026
Comment on lines +96 to +99
kb = ggml_new_tensor_4d(ctx0, GGML_TYPE_F32, CS, CS, n_chunks, HB);
kb = ggml_clamp(ctx0, kb, 0.0f, 0.0f);
kq = ggml_new_tensor_4d(ctx0, GGML_TYPE_F32, CS, CS, n_chunks, HB);
kq = ggml_clamp(ctx0, kq, 0.0f, 0.0f);
Member:
Didn't we just manage to get rid of these?

Contributor Author:

Do you know of a better way to initialize a ggml tensor to zeros?

Member:

The best way is not at all; I was talking about the ggml_new_tensors specifically. Second best, ggml_fill. :)

Contributor Author:
Do you mean I should get rid of ggml_new_tensor? I can think of ggml_concat-ing the rows for kb and kq. Would that work for you?

@CISC CISC requested a review from ggerganov February 23, 2026 13:24
@ymcki (Contributor Author) commented Feb 24, 2026

Changed from new_tensor+clamp to concat+pad. pp goes from 974 t/s to 939 t/s, but the CUDA0 compute buffer reduces from 991.29 MB to 831.29 MB. This allows IQ3_M to increase context from 272k to 288k on a 3090. Since I am a context-over-speed person, this version is better for me.

concat+pad, pp 939t/s tg 61t/s
| model                          |       size |     params | backend    | ngl | n_batch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |      64 |   pp512 @ d8192 |        346.78 ± 0.52 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |      64 |    tg32 @ d8192 |         60.96 ± 2.75 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     128 |   pp512 @ d8192 |        502.59 ± 1.13 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     128 |    tg32 @ d8192 |         61.22 ± 2.01 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     256 |   pp512 @ d8192 |        717.97 ± 0.85 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     256 |    tg32 @ d8192 |         60.87 ± 2.16 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     512 |   pp512 @ d8192 |       939.05 ± 13.21 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     512 |    tg32 @ d8192 |         60.96 ± 1.86 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |    1024 |   pp512 @ d8192 |       937.45 ± 10.85 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |    1024 |    tg32 @ d8192 |         60.59 ± 2.02 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |    2048 |   pp512 @ d8192 |       936.86 ± 13.87 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |    2048 |    tg32 @ d8192 |         60.71 ± 1.84 |

build: ac46b38d (8154)

@ymcki (Contributor Author) commented Mar 1, 2026

Committed a version with binary concat on the rows for the concat+pad version. This saves one ggml_concat, so pp is slightly faster.

Here is a comparison of the three implementations when running IQ3_M at 192k context on a 3090:

| Version | compute buffer | pp t/s | tg t/s |
| ------- | -------------: | -----: | -----: |
| current | 1693.51 MB | 849 | 61 |
| naive block | 991.29 MB | 974 | 61 |
| pad+binary row concat | 831.29 MB | 944 | 61 |

Do we want the faster naive block, or the slower pad+concat that can run more context? Or is it possible to improve either one and get the best of both worlds?

pad+binary row concat: pp 944t/s tg 61t/s
| model                          |       size |     params | backend    | ngl | n_batch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |      64 |   pp512 @ d8192 |        347.74 ± 4.80 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |      64 |    tg32 @ d8192 |         61.32 ± 2.82 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     128 |   pp512 @ d8192 |        504.89 ± 0.95 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     128 |    tg32 @ d8192 |         61.23 ± 1.99 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     256 |   pp512 @ d8192 |        720.77 ± 1.01 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     256 |    tg32 @ d8192 |         61.20 ± 2.16 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     512 |   pp512 @ d8192 |       944.48 ± 12.16 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     512 |    tg32 @ d8192 |         61.09 ± 2.05 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |    1024 |   pp512 @ d8192 |       943.39 ± 14.09 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |    1024 |    tg32 @ d8192 |         60.80 ± 1.81 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |    2048 |   pp512 @ d8192 |       943.56 ± 14.52 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |    2048 |    tg32 @ d8192 |         60.79 ± 1.85 |

build: ec25a26c (8155)

@ymcki (Contributor Author) commented Mar 2, 2026

I find that by setting the chunk size to 16 for kda, you can get 1112 t/s pp, i.e. a 31% gain. The compute buffer is also reduced to 705.29 MB for IQ3_M at 192k context, which allows me to run 304k context for IQ3_M on a 3090.

@CISC, I think this commit should be an easy review. Just confirm the performance gain and it should be ready to merge.

I heard that more shared memory on a CUDA card can raise the optimal chunk size, so in the long run this number should probably be configurable.

I tried both kda and gdn. gdn's optimum seems to be 64, just like before. Both crashed at chunk size == 8 with this error:

/home/user/llama.cpp/ggml/src/ggml.c:1706: GGML_ASSERT(obj_new) failed
| chunk_size | 128 | 64 | 32 | 16 | 8 |
| ---------- | --: | --: | --: | --: | --: |
| Qwen3-Coder-Next TQ1 | 1447.20 | 1473.92 | 1421.61 | 1276.84 | crash |
| Kimi-Linear Q2K | 623.22 | 879.73 | 1057.31 | 1112.49 | crash |
chunk size = 16 for kda, pp 1112t/s tg 61t/s
| model                          |       size |     params | backend    | ngl | n_batch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |      64 |   pp512 @ d8192 |        594.94 ± 2.41 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |      64 |    tg32 @ d8192 |         61.60 ± 2.89 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     128 |   pp512 @ d8192 |        618.10 ± 1.40 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     128 |    tg32 @ d8192 |         61.67 ± 2.01 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     256 |   pp512 @ d8192 |        855.32 ± 3.23 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     256 |    tg32 @ d8192 |         61.38 ± 1.89 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     512 |   pp512 @ d8192 |      1112.49 ± 18.23 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |     512 |    tg32 @ d8192 |         60.66 ± 1.88 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |    1024 |   pp512 @ d8192 |      1108.38 ± 20.01 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |    1024 |    tg32 @ d8192 |         60.89 ± 1.78 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |    2048 |   pp512 @ d8192 |      1106.38 ± 22.11 |
| kimi-linear 48B.A3B Q2_K - Medium |  16.78 GiB |    49.12 B | CUDA       |  99 |    2048 |    tg32 @ d8192 |         61.11 ± 1.89 |

build: 8c96f826 (8156)

@ymcki ymcki changed the title Kimi Linear block implementation Kimi Linear chunk size = 16 Mar 2, 2026
@pwilkin (Member) commented Mar 2, 2026

@ymcki for 8 you'd have to increase the max graph nodes, it assumes chunk size 64 as of now.

@ymcki (Contributor Author) commented Mar 2, 2026

> @ymcki for 8 you'd have to increase the max graph nodes, it assumes chunk size 64 as of now.

Doubled the max graph nodes, so now it can run chunk size = 8.

pp t/s:

| chunk_size | 128 | 64 | 32 | 16 | 8 |
| ---------- | --: | --: | --: | --: | --: |
| Qwen3-Coder-Next TQ1 | 1447.20 | 1473.92 | 1421.61 | 1276.84 | 737.54 |
| Kimi-Linear Q2K | 623.22 | 879.73 | 1057.31 | 1112.49 | 1096.31 |

CUDA0 compute buffer:

| chunk_size | 128 | 64 | 32 | 16 | 8 |
| ---------- | --: | --: | --: | --: | --: |
| Qwen3-Coder-Next TQ1 160k | 516.01 | 516.01 | 516.01 | 516.01 | 522.73 |
| Kimi-Linear IQ3M 192k | OOM | 1693.51 | 961.29 | 705.29 | 669.51 |

Interestingly, the compute buffer stays the same for Q3N but keeps decreasing for KL. While chunk size = 8 gives KL a further VRAM saving, I am not sure it is worth it, as the bs=512 graph node count balloons from 14232 to 23832.

So based on these numbers, Q3N can stay at 64. KL can be either 16 or 8; 8 could be slightly better if a large graph node count is not a problem.

         b = ggml_permute(ctx0, b, 0, 2, 1, 3); // [ 1, n_tokens, H_v, n_seqs]

-        const int CS = CHUNK_SIZE;
+        const int CS = kda ? 16 : 64; // chunk size
Member:
Update the PR to change only this line - the rest of the changes are not really needed.

Contributor Author:
The other change I made was removing the line "#define CHUNK_SIZE 64". Why is it still needed when no one is referencing it any more?

Member:
Yes, keep it. I was referring to the rest of the changes in the graph implementation.

Contributor Author:
Should be ok now.

@ggerganov ggerganov merged commit a0ed91a into ggml-org:master Mar 5, 2026
78 checks passed
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 10, 2026
* models : add llm_build_delta_net_base

* cont : keep qwen35 and qwen35moe graphs intact

* cont : add comments [no ci]

* add kimi linear to delta-net-base

* removed unnecessary ggml_cont from g_exp_t

* removed ggml_cont from g_diff_exp_t. moved ggml_cont for o to kimi-linear.cpp

* removed unnecessary diag mask

* cont : simplify

* cont : avoid graph splits

* scale q after mul instead of beginning

* scale q after mul instead of beginning

* identical ppl

* cont : fix scale and decay mask

* minor : remove TODO

* block implementation for kda

* remove space at the end of line 101

* concat+pad

* pad+binary row concat

* chunk size 16 for kda

* removed minor differences to master

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026
(same commit message as above)
Labels

model Model specific

4 participants