Kimi Linear chunk size = 16 #19827
Conversation
src/models/delta-net-base.cpp
kb = ggml_new_tensor_4d(ctx0, GGML_TYPE_F32, CS, CS, n_chunks, HB);
kb = ggml_clamp(ctx0, kb, 0.0f, 0.0f);
kq = ggml_new_tensor_4d(ctx0, GGML_TYPE_F32, CS, CS, n_chunks, HB);
kq = ggml_clamp(ctx0, kq, 0.0f, 0.0f);
Didn't we just manage to get rid of these?
Do you know of a better way to initialize ggml tensor to zeroes?
The best way is to not do it at all; I was talking about the ggml_new_tensors specifically. Failing that, ggml_fill. :)
Do you mean I should get rid of ggml_new_tensor? I can think of ggml_concat-ing the rows for kb and kq. Would that work for you?
Changed from new_tensor+clamp to concat+pad. pp goes from 974t/s to 939t/s, but the CUDA0 compute buffer reduces from 991.29MB to 831.29MB. This allows IQ3_M to increase context from 272k to 288k on 3090. Since I am a context-over-speed person, this version is better for me. concat+pad, pp 939t/s tg 61t/s
Committed a version with binary concat on the rows for the concat+pad version. This saves one ggml_concat, so pp is slightly faster. Here is a comparison of the three implementations when running IQ3_M at 192k context on 3090.
Do we want the faster naive block, or the slower pad+concat that can run more context? Or is it possible to improve either one to get the best of both worlds? pad+binary row concat: pp 944t/s tg 61t/s
I find that by setting chunk size to 16 for kda, you can get 1112t/s pp, i.e. a 31% gain. The compute buffer is also reduced to 705.29MB for IQ3_M at 192k context. This allows me to run 304k context for IQ3_M on 3090. @CISC, I think this commit should be an easy review; just confirm the performance gain and then it should be ready to merge. I heard that an increase in shared memory on CUDA cards can increase the optimal chunk size, so in the long run this number should probably be configurable? I tried both kda and gdn; gdn's optimum seems to be 64, just like before. They both crashed when chunk size == 8 with this error:
chunk size = 16 for kda, pp 1112t/s tg 61t/s
@ymcki for 8 you'd have to increase the max graph nodes, it assumes chunk size 64 as of now.
Doubled max graph nodes, so now it can run chunk size = 8. pp t/s:
CUDA0 compute buffer:
Interestingly, the compute buffer stays the same for Q3N but decreases for KL. While chunk size = 8 has further VRAM savings for KL, I am not sure if it is worth it, as the bs=512 graph nodes ballooned from 14232 to 23832. So based on these numbers, Q3N can stay at 64. KL can be either 16 or 8; 8 can be slightly better if too many graph nodes is not a bad thing.
b = ggml_permute(ctx0, b, 0, 2, 1, 3); // [ 1, n_tokens, H_v, n_seqs]
- const int CS = CHUNK_SIZE;
+ const int CS = kda ? 16 : 64; // chunk size
Update the PR to change only this line - the rest of the changes are not really needed.
The other change I made was removing the line "#define CHUNK_SIZE 64". Why is it still needed when no one references it any more?
Yes, keep it. I was referring to the rest of the changes in the graph implementation.
* models : add llm_build_delta_net_base
* cont : keep qwen35 and qwen35moe graphs intact
* cont : add comments [no ci]
* add kimi linear to delta-net-base
* removed unnecessary ggml_cont from g_exp_t
* removed ggml_cont from g_diff_exp_t. moved ggml_cont for o to kimi-linear.cpp
* removed unnecessary diag mask
* cont : simplify
* cont : avoid graph splits
* scale q after mul instead of beginning
* scale q after mul instead of beginning
* identical ppl
* cont : fix scale and decay mask
* minor : remove TODO
* block implementation for kda
* remove space at the end of line 101
* concat+pad
* pad+binary row concat
* chunk size 16 for kda
* removed minor differences to master

Co-authored-by: Georgi Gerganov <[email protected]>
~15% gain in pp.
Also, CUDA0 compute buffer of IQ3_M at 192k context reduces from 1693.51MB to 991.29MB on 3090. This allows context to further increase to 272k.
Also, I tried to implement the same thing for gated delta net. However, both a pp speed loss and a CUDA0 compute buffer loss were observed. Anyway, I included the code for GDN in case someone would like to explore whether it is possible to speed up GDN via blocks.
orig, pp 849t/s, tg 61t/s
block, pp 974t/s, tg 61t/s
block implementation for gated delta net