Faster long context TG on CUDA for GLM-4.5/4.6/4.7/AIR #1183
Conversation
This makes no sense. It has improved mixed CPU+GPU inference from 5.54 tokens/s (in this comment) to 6.20 t/s under the same conditions. The output has changed, but seems very coherent. I need to experiment a bit more with other GLM models, but this looks very good. Thanks for this upgrade!
uh oh? Have you measured the PPL?
When you split FA into two parts, that modifies the order in which multiply-adds are accumulated, and that changes the accumulated result due to the finite precision of floating point arithmetic. So, not getting the exact same sequence of tokens (with the same random number seed) is expected. But yes, I did indeed verify that PPL is the same within numerical round-off precision.
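As a toy illustration of the point above (not tied to the actual CUDA kernels): floating point addition is not associative, so regrouping the same numbers can change the result.

```python
# Floating-point addition is not associative: regrouping a sum, as
# happens when the FA computation is split into two parts, can change
# the result due to round-off.
a, b, c = 1.0e16, -1.0e16, 1.0

left = (a + b) + c   # cancel first, then add 1.0 -> 1.0
right = a + (b + c)  # 1.0 is absorbed into -1e16 first -> 0.0

print(left, right)   # 1.0 0.0
```

The effect in a real attention kernel is far smaller, of course, but it is enough to occasionally flip which token wins the sampling step.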
I unfortunately seem to be getting worse performance with this PR below 32k context. I haven't tested above this context size (perhaps performance is better past 32k), but otherwise the results aren't promising. Maybe it has something to do with my specific override-tensor configuration, or my hardware?
OS: Arch Linux
Command structure:
Commit: 2a7cc09
PR
The GLM4-MoE models are notorious for a strong decline in inference performance with increasing context length, which is due to their unfortunate GQA ratio of 12.
This PR remedies the situation to some extent. It uses a technique similar to the one in PR #1182 to improve long-context TG performance on CUDA for the GLM4-MoE series of models. But unlike #1182, where there is a single KV head and hence simple views are sufficient to split the FA computation into two parts, here we have 8 KV heads (fewer with split mode `graph`), so one needs to create two contiguous copies of the `Q` tensor to obtain the required splits.

Caveat: the PR does not improve performance when a quantized KV cache is used. Implementing the optimization for a quantized KV cache is a bit more involved, so it is left for a follow-up PR.
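For readers unfamiliar with the split-FA trick, here is a rough NumPy sketch (names and shapes are hypothetical; this mirrors the idea, not the actual CUDA implementation) of how attention computed over two halves of the KV cache can be recombined, exact up to round-off, via the usual flash-attention max/sum-exp rescaling:

```python
# Illustrative sketch of splitting flash attention over two halves of
# the KV cache and recombining the partial results.
import numpy as np

def attn_part(q, k, v):
    # Attention over one KV slice; returns the unnormalized partial
    # output plus the per-row max and sum of exponentials needed to
    # merge slices later.
    s = q @ k.T                          # (n_q, n_kv) scores
    m = s.max(axis=-1, keepdims=True)    # per-row max for stability
    p = np.exp(s - m)
    return p @ v, m, p.sum(axis=-1, keepdims=True)

def attn_split(q, k, v, split):
    # Attention with k/v split into [0:split) and [split:), with both
    # partial results rescaled to a common max before combining.
    o1, m1, l1 = attn_part(q, k[:split], v[:split])
    o2, m2, l2 = attn_part(q, k[split:], v[split:])
    m = np.maximum(m1, m2)
    a1, a2 = np.exp(m1 - m), np.exp(m2 - m)
    return (a1 * o1 + a2 * o2) / (a1 * l1 + a2 * l2)

def attn_ref(q, k, v):
    # Single-pass softmax attention for reference.
    s = q @ k.T
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 64))
k = rng.standard_normal((256, 64))
v = rng.standard_normal((256, 64))

# Agrees with the single-pass reference up to floating-point round-off.
print(np.allclose(attn_ref(q, k, v), attn_split(q, k, v, 128)))
```

The two parts can then run concurrently (e.g. on different GPUs or streams), which is where the long-context gain comes from.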
The following graph shows TG performance as a function of context length for GLM-4.5-AIR-IQ1_KT on a 4x3090 system (but for the 2x3090 data points only 2 GPUs are selected). Main branch and PR coincide up to a certain point because the split is only done above a given threshold that depends on the number of participating GPUs. For 8 GPUs the split only kicks in above 64k tokens, so it is not shown here. For split mode `layer` we gain ~30% at a context of 64k tokens. For 2 GPUs and split mode `graph`, where we can only go up to a context of 32k tokens with this model, the gain is about 13%. For 4 GPUs and split mode `graph` the speedup is ~10% at 64k tokens.