Merge ffn_up and ffn_gate experts tensors (part 2)#1139

Merged
ikawrakow merged 4 commits into main from ik/merge_up_gate_exps_3
Jan 13, 2026

Conversation

ikawrakow (Owner) commented Jan 12, 2026

This PR is a follow-up to PR #1137 and extends the ability to merge the ffn_up_exps and ffn_gate_exps tensors at run time to more models:

  • GLM-4.5-AIR/4.6/4.7
  • Minimax-M2
  • Mimo2-Flash
  • DeepSeek-Lite/V3/R1, Kimi-K2
  • Bailing-MoE (Ling/Ring)
  • Qwen2-MoE
  • Llama-4
  • Hunyuan-MoE
  • Qwen3VL-MoE
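Conceptually, merging the up and gate expert tensors means concatenating the two projections so a single matmul produces both halves of the FFN activation, instead of launching two smaller matmuls per expert. A minimal sketch of that idea in plain NumPy, assuming the usual SwiGLU MoE FFN layout (names and shapes here are illustrative, not this repo's actual kernels):

```python
import numpy as np

d_model, d_ff = 8, 16
rng = np.random.default_rng(0)
W_up = rng.standard_normal((d_ff, d_model))    # stands in for ffn_up_exps (one expert)
W_gate = rng.standard_normal((d_ff, d_model))  # stands in for ffn_gate_exps
x = rng.standard_normal(d_model)

def silu(v):
    return v / (1.0 + np.exp(-v))

# Separate path: two matmuls per expert.
y_sep = silu(W_gate @ x) * (W_up @ x)

# Merged path: one matmul on the concatenated weights, then split.
W_merged = np.concatenate([W_up, W_gate], axis=0)
up, gate = np.split(W_merged @ x, 2)
y_merged = silu(gate) * up

assert np.allclose(y_sep, y_merged)
```

The results are bit-for-bit reorderings of the same arithmetic; the win is fewer, larger GEMM launches, which is why the effect shows up mainly in prompt processing.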

Here are some quick llama-bench results for a bunch of models on 3090 GPUs:

| Model | PP-2048 (no -mqkv) t/s | PP-2048 (with -mqkv) t/s | Speedup |
|---|---|---|---|
| minimax-m2 230B.A10B Q2_K_M | 1433.98 | 1514.93 | 1.056 |
| glm4moe 106B.A12B IQ1_KT | 2044.11 | 2189.89 | 1.071 |
| glm4moe 355B.A32B IQ2_KS | 804.41 | 838.59 | 1.043 |
| mimo2 310B.A15B IQ2_XXS | 759.83 | 808.13 | 1.064 |
| deepseek2 16B Q4_0 | 11760.35 | 12504.46 | 1.063 |
| bailingmoe2 16B.A1B Q4_K_M | 18827.73 | 20541.12 | 1.091 |

MrHills-rs commented Jan 12, 2026

build: 3f24a6a (334)

Qwen3-VL-235B-A22B (IQ3_XS — 3.3 bpw)

Settings: CUDA backend, ngl 95, n_batch 8192, n_ubatch 4096, ctx k/v q8_0, amb 512, offload `blk.(?:[0-9]|[1-6][0-9]|[8][0-4]).ffn.*_exps.*=CPU`

| Variant | VRAM | Effective params | muge | test | Speed (t/s) |
|---|---|---|---|---|---|
| no muge | 90.55 GiB | 238.44 B | 0 | pp8192 | 869.00 ± 6.40 |
| no muge | 90.55 GiB | 238.44 B | 0 | tg512 | 8.69 ± 0.01 |
| no muge | 90.55 GiB | 238.44 B | 0 | tg128 | 8.69 ± 0.04 |
| muge | 146.15 GiB | 389.84 B | 1 | pp8192 | 907.32 ± 8.40 |
| muge | 146.15 GiB | 389.84 B | 1 | tg512 | 8.49 ± 0.00 |
| muge | 146.15 GiB | 389.84 B | 1 | tg128 | 8.73 ± 0.03 |

MiniMax-M2.1 (230B.A10B — IQ3_S mix — 3.66 bpw)

Settings: CUDA backend, ngl 95, n_batch 8192, n_ubatch 8192, ctx k/v q8_0, amb 512, offload `blk.(?:[0-9]|[1-4][0-9]|[5][0-4]).ffn.*_exps.*=CPU`

| Variant | VRAM | Effective params | muge | test | Speed (t/s) |
|---|---|---|---|---|---|
| no muge | 93.12 GiB | 228.69 B | 0 | pp8192 | 1419.85 ± 6.63 |
| no muge | 93.12 GiB | 228.69 B | 0 | tg512 | 13.35 ± 0.02 |
| no muge | 93.12 GiB | 228.69 B | 0 | tg128 | 13.43 ± 0.02 |
| muge | 153.06 GiB | 378.48 B | 1 | pp8192 | 1492.79 ± 16.75 |
| muge | 153.06 GiB | 378.48 B | 1 | tg128 | 13.38 ± 0.03 |
| muge | 153.06 GiB | 378.48 B | 1 | tg512 | 13.41 ± 0.04 |

So basically a marginal bump for pp, tg unchanged. Cool.

7800X3D
128 GB DDR5-6000
5090 (PCIe 5.0)

(With muge it's showing high memory usage; I don't have enough RAM for that, though I take it that's not impacting performance much.)

Commands:

```
build/bin/llama-bench -m models/Qwen3-VL-235B-A22B-Thinking.i1-IQ3_XS.gguf -ot "blk.(?:[0-9]|[1-6][0-9]|[8][0-4]).ffn.*_exps.*=CPU" -b 8192 -ub 4096 -ctk q8_0 -ctv q8_0 --threads 8 -ngl 95 -amb 512 -p x -n x -mqkv 1 -muge (0 or 1)

build/bin/llama-bench -m models/MiniMax-M2.1-IQ3_M.gguf -ot "blk.(?:[0-9]|[1-4][0-9]|[5][0-4]).ffn.*_exps.*=CPU" -b 8192 -ub 8192 -ctk q8_0 -ctv q8_0 --threads 8 -ngl 95 -amb 512 -p x -n x -mqkv 1 -muge (0 or 1) -r 3
```
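The `-ot` flag takes a `<tensor-name regex>=<backend>` pair. As a sanity check on which layers the Qwen3-VL pattern actually routes to the CPU, here is a short sketch, assuming the usual `blk.<i>.ffn_up_exps.weight` tensor naming and that the pattern's `*` characters were eaten by markdown rendering (the unescaped `.` acts as a regex wildcard, which is harmless here):

```python
import re

pat = re.compile(r"blk.(?:[0-9]|[1-6][0-9]|[8][0-4]).ffn.*_exps")
offloaded = [i for i in range(95) if pat.match(f"blk.{i}.ffn_up_exps.weight")]
print(offloaded)  # blocks 0-69 and 80-84; 70-79 and 85-94 stay on the GPU
```

Note the gap: blocks 70-79 are not matched by any alternative, so they remain on the GPU along with 85-94.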
