Faster hybrid inference when shared experts by ikawrakow · Pull Request #1191 · ikawrakow/ik_llama.cpp

ikawrakow · 2026-01-25T14:58:45Z

This PR improves hybrid CPU/GPU performance for MoE models with shared experts (assuming the shared experts are in VRAM) when using split mode graph.

It is nothing major, but not entirely negligible either. I do see 4-5% better performance for GLM-4.5-AIR with all routed experts left on the CPU.

Oh, people running with tiny batch sizes are unlikely to see PP improvement.

Update

It looks like the positive effect is greater with more GPUs, at least for PP. As an example, below are results for GLM-4.7-3.35bpw running on 4x3090 and Ryzem-3995WX CPU, all routed experts left in RAM. Here I observe ~8% better PP performance.

Main branch

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
4096	64	0	12.286	333.38	6.231	10.27
4096	64	4096	12.251	334.35	6.456	9.91
4096	64	8192	12.513	327.33	6.590	9.71
4096	64	12288	12.791	320.23	6.614	9.68
4096	64	16384	13.005	314.96	6.720	9.52
4096	64	20480	13.236	309.46	6.747	9.49
4096	64	24576	13.557	302.12	6.852	9.34
4096	64	28672	13.883	295.03	6.926	9.24
4096	64	32768	14.013	292.29	7.132	8.97
4096	64	36864	14.292	286.59	7.040	9.09
4096	64	40960	14.545	281.60	7.108	9.00
4096	64	45056	14.853	275.77	7.179	8.91
4096	64	49152	15.171	269.99	7.405	8.64
4096	64	53248	15.418	265.67	7.361	8.69
4096	64	57344	15.665	261.48	7.369	8.69
4096	64	61440	15.974	256.41	7.418	8.63

This PR

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
4096	64	0	11.365	360.39	6.006	10.66
4096	64	4096	11.330	361.51	6.172	10.37
4096	64	8192	11.571	353.98	6.408	9.99
4096	64	12288	11.817	346.62	6.392	10.01
4096	64	16384	12.072	339.30	6.424	9.96
4096	64	20480	12.320	332.48	6.489	9.86
4096	64	24576	12.652	323.73	6.584	9.72
4096	64	28672	12.850	318.75	6.821	9.38
4096	64	32768	13.117	312.27	6.782	9.44
4096	64	36864	13.415	305.34	6.764	9.46
4096	64	40960	13.680	299.42	6.815	9.39
4096	64	45056	13.996	292.66	6.883	9.30
4096	64	49152	14.283	286.77	6.911	9.26
4096	64	53248	14.584	280.85	7.177	8.92
4096	64	57344	14.853	275.76	7.042	9.09
4096	64	61440	15.144	270.46	7.206	8.88

Nexesenex · 2026-01-25T17:41:01Z

With this PR 1191 (and without #1190 merged), I gain a solid and precious 4-5% TG compared to without in Hybrid inference with GLM 4.6.

CD /D Q:\LLAMA_IK_TP_TEST llama-server -m X:\GGUF-Tool-Suite\GLM-4.6-S\GLM-4.6-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01760.gguf -t 18 -ngl 150 -sm graph -smgs -mea 0 -b 128 -mg 0 --device CUDA0,CUDA1 -ts 47,47 -fa 1 -cuda fusion=1,offload-batch-size=128,mmq-id-size=128,enable-p2p=0 -ot "^output.weight$=CUDA0" -ot "^blk.(17).ffn_down_exps.weight$=CUDA1" -ot "^blk.([1][7-9]|[2][0-9])\.ffn_(up|down|gate)_exps\.weight$=CPU" -ot "^blk.([3-7][0-9]|9[0-1]).ffn_(up|down|gate)_exps.weight$=CPU" -no-ooae -mqkv -gr -ger --chat-template chatglm4 --override-kv glm4moe.expert_used_count=int:7 -ser 6,0.2 -c 81920 -ctk q6_0 -ctv iq4_nl -khad --context-shift 1 --host 127.0.0.1 --port 8080 -cram 0 -cram-n-min 999999

Quairon-Nailo · 2026-01-26T17:24:23Z

This is magic. With this I've finally been able to break the 11 t/s barrier after being at 10.5 for so long. The only issue I've found is that, for some reason, now after a long idle time (even without closing the server), the first generation is noticeably slower (like 9-9.5 t/s), but after that it works fine, I don't know why that happens now, it never slowed down that much in the first generation before. In any case, overall I'm happy with the speed.

Geechan · 2026-01-27T11:27:42Z

As an addendum to #1183 (comment), I get the following speeds with this PR. Still not as good as 2a7cc09, but slightly better than #1183. This PR is definitely working to improve speeds, but seems to be counteracted by the old PR.

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
4096	1024	0	18.687	219.19	90.707	11.29
4096	1024	4096	19.487	210.19	94.338	10.85
4096	1024	8192	20.510	199.70	97.727	10.48
4096	1024	12288	21.692	188.82	101.717	10.07
4096	1024	16384	22.837	179.36	106.055	9.66
4096	1024	20480	23.573	173.75	108.504	9.44
4096	1024	24576	24.281	168.69	112.288	9.12
4096	1024	28672	25.220	162.41	115.906	8.83
4096	1024	32768	26.221	156.21	119.853	8.54

Faster hybrid inference when shared experts

109686a

ikawrakow merged commit 30381fc into main Jan 26, 2026

Geechan mentioned this pull request Jan 27, 2026

Much faster long-context TG for GLM-4.5/4.6/4.7/AIR #1193

Merged

This was referenced Feb 4, 2026

Much faster rng sampling #1187

Merged

--numa mirror: mirror model weights to every Numa node in the system ggml-org/llama.cpp#16000

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster hybrid inference when shared experts#1191

Faster hybrid inference when shared experts#1191
ikawrakow merged 1 commit intomainfrom
ik/shexps_better_hybrid

ikawrakow commented Jan 25, 2026 •

edited

Loading

Uh oh!

Nexesenex commented Jan 25, 2026 •

edited

Loading

Uh oh!

Quairon-Nailo commented Jan 26, 2026

Uh oh!

Geechan commented Jan 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

ikawrakow commented Jan 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Update

Uh oh!

Nexesenex commented Jan 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Quairon-Nailo commented Jan 26, 2026

Uh oh!

Geechan commented Jan 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ikawrakow commented Jan 25, 2026 •

edited

Loading

Nexesenex commented Jan 25, 2026 •

edited

Loading

Geechan commented Jan 27, 2026 •

edited

Loading