
Faster long context TG on CUDA for GLM-4.5/4.6/4.7/AIR #1183

Merged
ikawrakow merged 3 commits into main from ik/glm45_tg_fa_hack on Jan 24, 2026
Conversation

@ikawrakow
Owner

@ikawrakow ikawrakow commented Jan 23, 2026

The GLM4-MoE models are notorious for a steep decline in inference performance with increasing context length, which is due to their unfortunate GQA ratio of 12 (12 query heads sharing each KV head).

This PR remedies the situation to some extent. It uses a technique similar to PR #1182 to improve long-context TG performance on CUDA for the GLM4-MoE series of models. But unlike #1182, where there is a single KV head and hence simple views are sufficient to split the FA computation in two parts, here we have 8 KV heads (fewer with split mode graph), so one needs to materialize two contiguous copies of the Q tensor to obtain the required splits.

Caveat: the PR does not improve the performance when quantized KV cache is used. Implementing the optimization for quantized KV cache is a bit more involved, so it is left for a follow up PR.
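For intuition, the standard way to split an attention computation into parts and recombine them exactly is to split over the KV sequence and merge the partial results with a log-sum-exp correction. Below is a minimal NumPy sketch of that split for a single head; the function names and shapes are illustrative, not the PR's actual CUDA implementation.

```python
import numpy as np

def attn(q, k, v):
    # Reference single-pass attention for one head:
    # q is (d,), k and v are (n, d).
    s = k @ q / np.sqrt(q.size)
    p = np.exp(s - s.max())
    return (p @ v) / p.sum()

def attn_split(q, k, v, cut):
    # Split the KV sequence in two, compute partial attention per part,
    # then merge with the usual log-sum-exp correction.
    def partial(kp, vp):
        s = kp @ q / np.sqrt(q.size)
        m = s.max()
        p = np.exp(s - m)
        return p @ vp, p.sum(), m   # unnormalized output, softmax sum, running max

    o1, z1, m1 = partial(k[:cut], v[:cut])
    o2, z2, m2 = partial(k[cut:], v[cut:])
    m = max(m1, m2)
    w1, w2 = np.exp(m1 - m), np.exp(m2 - m)
    return (w1 * o1 + w2 * o2) / (w1 * z1 + w2 * z2)

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal((128, 64))
v = rng.standard_normal((128, 64))
print(np.allclose(attn(q, k, v), attn_split(q, k, v, 50)))  # → True
```

The merge is exact up to floating-point round-off, which is why the two parts can be computed independently (e.g. on separate streams or GPUs) and combined afterwards.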

The following graph shows TG performance as a function of context length for GLM-4.5-AIR-IQ1_KT on a 4x3090 system (for the 2x3090 data points only 2 GPUs are selected). Main branch and PR coincide up to a certain point because the split is only done above a threshold that depends on the number of participating GPUs. For 8 GPUs the split only kicks in above 64k tokens, so is not shown here. For split mode layer we gain ~30% at a context of 64k. For 2 GPUs and split mode graph, where we can only go up to a context of 32k tokens with this model, the gain is about 13%. For 4 GPUs and split mode graph the speedup is ~10% at 64k tokens.

[glm45: graph of TG performance vs. context length]

@abc-nix
Contributor

abc-nix commented Jan 23, 2026

This makes no sense. It has improved mixed CPU+GPU inference from 5.54 tokens/s (in this comment) to 6.20 t/s under the same conditions. Output has changed, but seems very coherent. I need to experiment a bit more with other GLM models, but this looks very good.

Thanks for this upgrade!

@magikRUKKOLA

> Output has changed, but seems very coherent.

uh oh? Have you measured the PPL?

@ikawrakow
Owner Author

> > Output has changed, but seems very coherent.
>
> uh oh? Have you measured the PPL?

When you split FA into two parts, that changes the order in which multiply-adds are accumulated, and that changes the accumulated result due to the finite precision of floating-point arithmetic. So, not getting the exact same sequence of tokens (using the same random number seed) is expected.
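A minimal single-precision illustration of this non-associativity (the values here are purely illustrative):

```python
import numpy as np

# float32 has a 24-bit mantissa, so adding 1.0 to 1e8 loses the 1.0 entirely.
a = np.float32(1e8)
b = np.float32(1.0)

one_pass  = ((a + b) - a) + b   # accumulate left to right
two_parts = (a + b) + (b - a)   # accumulate in two parts, then combine

print(one_pass, two_parts)  # → 1.0 0.0
```

The two orderings disagree by the full magnitude of the small terms; in real FA kernels the discrepancy is far smaller but still enough to flip an occasional token when sampling.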

But yes, I did indeed verify that PPL is the same within numerical round-off precision.

@ikawrakow ikawrakow merged commit 04beeff into main Jan 24, 2026
@Geechan

Geechan commented Jan 25, 2026

I unfortunately seem to be getting worse performance with this PR below 32k context. I haven't tested above this context size (perhaps performance is better past 32k), but the results aren't promising otherwise. Maybe it has something to do with my specific override tensor configuration, or my hardware?

- OS: Arch Linux
- CPU: Epyc 7763 (64 cores, 128 threads)
- GPU(s): 2x Quadro RTX 8000 (NVLink)
- RAM: 8-channel DDR4-3200
- Model: GLM 4.7
- Layers: layers 29 onwards have up/gate/down tensors offloaded to CPU, rest on GPUs
- NCCL installed; peer access enabled

Command structure:

```shell
llama-server \
    --n-gpu-layers 999 --threads 64 --threads-batch 64 --batch-size 4096 --ubatch-size 4096 -sm graph --no-mmap \
    --override-tensor "blk\.(29|[3-9][0-9])\.ffn_(up|gate|down)_exps\.weight=CPU" \
    --ctx-size 38912 -fa on -gr -smgs --cache-ram 32768 --port 15000 --host 0.0.0.0 \
    --model "/mnt/GGUF/GLM/4.7/GLM-4.7-Q4_K_M/GLM-4.7-Q4_K_M-00001-of-00006.gguf"
```

Commit 2a7cc09

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 4096 | 1024 | 0 | 18.762 | 218.32 | 91.271 | 11.22 |
| 4096 | 1024 | 4096 | 19.565 | 209.35 | 94.418 | 10.85 |
| 4096 | 1024 | 8192 | 20.531 | 199.51 | 97.614 | 10.49 |
| 4096 | 1024 | 12288 | 21.444 | 191.01 | 102.179 | 10.02 |
| 4096 | 1024 | 16384 | 22.584 | 181.37 | 103.762 | 9.87 |
| 4096 | 1024 | 20480 | 23.534 | 174.04 | 104.913 | 9.76 |
| 4096 | 1024 | 24576 | 24.291 | 168.62 | 107.528 | 9.52 |
| 4096 | 1024 | 28672 | 25.049 | 163.52 | 110.484 | 9.27 |
| 4096 | 1024 | 32768 | 25.993 | 157.58 | 113.302 | 9.04 |

PR

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 4096 | 1024 | 0 | 18.846 | 217.35 | 91.742 | 11.16 |
| 4096 | 1024 | 4096 | 19.623 | 208.74 | 95.259 | 10.75 |
| 4096 | 1024 | 8192 | 20.726 | 197.62 | 98.006 | 10.45 |
| 4096 | 1024 | 12288 | 21.947 | 186.63 | 102.294 | 10.01 |
| 4096 | 1024 | 16384 | 22.896 | 178.89 | 106.959 | 9.57 |
| 4096 | 1024 | 20480 | 23.732 | 172.60 | 109.421 | 9.36 |
| 4096 | 1024 | 24576 | 24.487 | 167.27 | 112.971 | 9.06 |
| 4096 | 1024 | 28672 | 25.408 | 161.21 | 116.765 | 8.77 |
| 4096 | 1024 | 32768 | 26.405 | 155.12 | 120.420 | 8.50 |
