
Faster hybrid inference when shared experts #1191

Merged

ikawrakow merged 1 commit into main from ik/shexps_better_hybrid on Jan 26, 2026

Conversation

ikawrakow (Owner) commented Jan 25, 2026

This PR improves hybrid CPU/GPU performance for MoE models with shared experts (assuming the shared experts are in VRAM) when using split mode graph.

It is nothing major, but not entirely negligible either. I do see 4-5% better performance for GLM-4.5-AIR with all routed experts left on the CPU.

Oh, people running with tiny batch sizes are unlikely to see PP improvement.

Update

It looks like the positive effect is greater with more GPUs, at least for PP. As an example, below are results for GLM-4.7-3.35bpw running on 4x3090 GPUs and a Ryzen 3995WX CPU, with all routed experts left in RAM. Here I observe ~8% better PP performance.

**Main branch**

| PP | TG | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s) |
|---|---|---|---|---|---|---|
| 4096 | 64 | 0 | 12.286 | 333.38 | 6.231 | 10.27 |
| 4096 | 64 | 4096 | 12.251 | 334.35 | 6.456 | 9.91 |
| 4096 | 64 | 8192 | 12.513 | 327.33 | 6.590 | 9.71 |
| 4096 | 64 | 12288 | 12.791 | 320.23 | 6.614 | 9.68 |
| 4096 | 64 | 16384 | 13.005 | 314.96 | 6.720 | 9.52 |
| 4096 | 64 | 20480 | 13.236 | 309.46 | 6.747 | 9.49 |
| 4096 | 64 | 24576 | 13.557 | 302.12 | 6.852 | 9.34 |
| 4096 | 64 | 28672 | 13.883 | 295.03 | 6.926 | 9.24 |
| 4096 | 64 | 32768 | 14.013 | 292.29 | 7.132 | 8.97 |
| 4096 | 64 | 36864 | 14.292 | 286.59 | 7.040 | 9.09 |
| 4096 | 64 | 40960 | 14.545 | 281.60 | 7.108 | 9.00 |
| 4096 | 64 | 45056 | 14.853 | 275.77 | 7.179 | 8.91 |
| 4096 | 64 | 49152 | 15.171 | 269.99 | 7.405 | 8.64 |
| 4096 | 64 | 53248 | 15.418 | 265.67 | 7.361 | 8.69 |
| 4096 | 64 | 57344 | 15.665 | 261.48 | 7.369 | 8.69 |
| 4096 | 64 | 61440 | 15.974 | 256.41 | 7.418 | 8.63 |
**This PR**

| PP | TG | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s) |
|---|---|---|---|---|---|---|
| 4096 | 64 | 0 | 11.365 | 360.39 | 6.006 | 10.66 |
| 4096 | 64 | 4096 | 11.330 | 361.51 | 6.172 | 10.37 |
| 4096 | 64 | 8192 | 11.571 | 353.98 | 6.408 | 9.99 |
| 4096 | 64 | 12288 | 11.817 | 346.62 | 6.392 | 10.01 |
| 4096 | 64 | 16384 | 12.072 | 339.30 | 6.424 | 9.96 |
| 4096 | 64 | 20480 | 12.320 | 332.48 | 6.489 | 9.86 |
| 4096 | 64 | 24576 | 12.652 | 323.73 | 6.584 | 9.72 |
| 4096 | 64 | 28672 | 12.850 | 318.75 | 6.821 | 9.38 |
| 4096 | 64 | 32768 | 13.117 | 312.27 | 6.782 | 9.44 |
| 4096 | 64 | 36864 | 13.415 | 305.34 | 6.764 | 9.46 |
| 4096 | 64 | 40960 | 13.680 | 299.42 | 6.815 | 9.39 |
| 4096 | 64 | 45056 | 13.996 | 292.66 | 6.883 | 9.30 |
| 4096 | 64 | 49152 | 14.283 | 286.77 | 6.911 | 9.26 |
| 4096 | 64 | 53248 | 14.584 | 280.85 | 7.177 | 8.92 |
| 4096 | 64 | 57344 | 14.853 | 275.76 | 7.042 | 9.09 |
| 4096 | 64 | 61440 | 15.144 | 270.46 | 7.206 | 8.88 |
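As a rough cross-check (not from the thread itself), the quoted ~8% PP improvement can be recomputed from the N_KV=0 rows of the two tables above:

```python
# Cross-check of the quoted ~8% PP gain, using the S_PP (t/s) values
# from the N_KV = 0 rows of the two tables above. Illustrative only.
main_pp = 333.38  # main branch, N_KV = 0
pr_pp = 360.39    # this PR, N_KV = 0

speedup_pct = (pr_pp / main_pp - 1) * 100
print(f"PP speedup at N_KV=0: {speedup_pct:.1f}%")  # ~8.1%
```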

Nexesenex (Contributor) commented Jan 25, 2026

With this PR (#1191, and without #1190 merged), I gain a solid and precious 4-5% TG in hybrid inference with GLM 4.6, compared to running without it.

```
CD /D Q:\LLAMA_IK_TP_TEST
llama-server -m X:\GGUF-Tool-Suite\GLM-4.6-S\GLM-4.6-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01760.gguf -t 18 -ngl 150 -sm graph -smgs -mea 0 -b 128 -mg 0 --device CUDA0,CUDA1 -ts 47,47 -fa 1 -cuda fusion=1,offload-batch-size=128,mmq-id-size=128,enable-p2p=0 -ot "^output.weight$=CUDA0" -ot "^blk.(17).ffn_down_exps.weight$=CUDA1" -ot "^blk.([1][7-9]|[2][0-9])\.ffn_(up|down|gate)_exps\.weight$=CPU" -ot "^blk.([3-7][0-9]|9[0-1]).ffn_(up|down|gate)_exps.weight$=CPU" -no-ooae -mqkv -gr -ger --chat-template chatglm4 --override-kv glm4moe.expert_used_count=int:7 -ser 6,0.2 -c 81920 -ctk q6_0 -ctv iq4_nl -khad --context-shift 1 --host 127.0.0.1 --port 8080 -cram 0 -cram-n-min 999999
```
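As an aside, the `-ot` tensor-override arguments in the command above are plain regexes matched against tensor names. A quick sketch (using hypothetical tensor names for illustration) shows how one of them behaves:

```python
import re

# One of the -ot override patterns from the command above; it pins the
# routed-expert tensors of blocks 17-29 to the CPU.
pattern = r"^blk.([1][7-9]|[2][0-9])\.ffn_(up|down|gate)_exps\.weight$"

# Hypothetical tensor names, for illustration only.
assert re.match(pattern, "blk.18.ffn_up_exps.weight")       # block 18: matched, stays on CPU
assert not re.match(pattern, "blk.16.ffn_up_exps.weight")   # block 16: not matched
print("pattern behaves as expected")
```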

ikawrakow merged commit 30381fc into main on Jan 26, 2026
@Quairon-Nailo

This is magic. With this I've finally been able to break the 11 t/s barrier after sitting at 10.5 for so long. The only issue I've found is that, for some reason, after a long idle period (even without closing the server), the first generation is now noticeably slower (around 9-9.5 t/s); after that it works fine. I don't know why that happens now; it never slowed down that much on the first generation before. In any case, I'm happy with the overall speed.

Geechan commented Jan 27, 2026

As an addendum to #1183 (comment), I get the following speeds with this PR. Still not as good as 2a7cc09, but slightly better than #1183. This PR is definitely improving speeds, but its gains seem to be counteracted by the old PR.

| PP | TG | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s) |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 18.687 | 219.19 | 90.707 | 11.29 |
| 4096 | 1024 | 4096 | 19.487 | 210.19 | 94.338 | 10.85 |
| 4096 | 1024 | 8192 | 20.510 | 199.70 | 97.727 | 10.48 |
| 4096 | 1024 | 12288 | 21.692 | 188.82 | 101.717 | 10.07 |
| 4096 | 1024 | 16384 | 22.837 | 179.36 | 106.055 | 9.66 |
| 4096 | 1024 | 20480 | 23.573 | 173.75 | 108.504 | 9.44 |
| 4096 | 1024 | 24576 | 24.281 | 168.69 | 112.288 | 9.12 |
| 4096 | 1024 | 28672 | 25.220 | 162.41 | 115.906 | 8.83 |
| 4096 | 1024 | 32768 | 26.221 | 156.21 | 119.853 | 8.54 |
