Merge ffn_up and ffn_gate experts tensors (part 2)#1139

Merged
ikawrakow merged 4 commits into main from ik/merge_up_gate_exps_3
Jan 13, 2026

Conversation

ikawrakow (Owner) commented Jan 12, 2026

This PR is a follow-up to PR #1137 and extends the ability to merge the ffn_up_exps and ffn_gate_exps tensors at run time to more models:

  • GLM-4.5-AIR/4.6/4.7
  • Minimax-M2
  • Mimo2-Flash
  • DeepSeek-Lite/V3/R1, Kimi-K2
  • Bailing-MoE (Ling/Ring)
  • Qwen2-MoE
  • Llama-4
  • Hunyuan-MoE
  • Qwen3VL-MoE
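Conceptually, merging the up and gate expert tensors means concatenating the two projections so a single matmul produces both halves of the FFN activation, instead of launching two smaller matmuls per expert. A minimal sketch of that idea in plain NumPy, assuming the usual SwiGLU MoE FFN layout (names and shapes here are illustrative, not this repo's actual kernels):

```python
import numpy as np

d_model, d_ff = 8, 16
rng = np.random.default_rng(0)
W_up = rng.standard_normal((d_ff, d_model))    # stands in for ffn_up_exps (one expert)
W_gate = rng.standard_normal((d_ff, d_model))  # stands in for ffn_gate_exps
x = rng.standard_normal(d_model)

def silu(v):
    return v / (1.0 + np.exp(-v))

# Separate path: two matmuls per expert.
y_sep = silu(W_gate @ x) * (W_up @ x)

# Merged path: one matmul on the concatenated weights, then split.
W_merged = np.concatenate([W_up, W_gate], axis=0)
up, gate = np.split(W_merged @ x, 2)
y_merged = silu(gate) * up

assert np.allclose(y_sep, y_merged)
```

The results are bit-for-bit reorderings of the same arithmetic; the win is fewer, larger GEMM launches, which is why the effect shows up mainly in prompt processing.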

Here are some quick llama-bench results for a bunch of models on 3090 GPUs:

| Model | PP-2048 (no -mqkv) t/s | PP-2048 (with -mqkv) t/s | Speedup |
|---|---|---|---|
| minimax-m2 230B.A10B Q2_K_M | 1433.98 | 1514.93 | 1.056 |
| glm4moe 106B.A12B IQ1_KT | 2044.11 | 2189.89 | 1.071 |
| glm4moe 355B.A32B IQ2_KS | 804.41 | 838.59 | 1.043 |
| mimo2 310B.A15B IQ2_XXS | 759.83 | 808.13 | 1.064 |
| deepseek2 16B Q4_0 | 11760.35 | 12504.46 | 1.063 |
| bailingmoe2 16B.A1B Q4_K_M | 18827.73 | 20541.12 | 1.091 |

MrHills-rs commented Jan 12, 2026

build: 3f24a6a (334)

Qwen3-VL-235B-A22B (IQ3_XS — 3.3 bpw)

Settings: CUDA backend, ngl 95, n_batch 8192, n_ubatch 4096, ctx k/v q8_0, amb 512, offload `blk.(?:[0-9]|[1-6][0-9]|[8][0-4]).ffn.*_exps.*=CPU`

| Variant | VRAM | Effective params | muge | test | Speed (t/s) |
|---|---|---|---|---|---|
| no muge | 90.55 GiB | 238.44 B | 0 | pp8192 | 869.00 ± 6.40 |
| no muge | 90.55 GiB | 238.44 B | 0 | tg512 | 8.69 ± 0.01 |
| no muge | 90.55 GiB | 238.44 B | 0 | tg128 | 8.69 ± 0.04 |
| muge | 146.15 GiB | 389.84 B | 1 | pp8192 | 907.32 ± 8.40 |
| muge | 146.15 GiB | 389.84 B | 1 | tg512 | 8.49 ± 0.00 |
| muge | 146.15 GiB | 389.84 B | 1 | tg128 | 8.73 ± 0.03 |

MiniMax-M2.1 (230B.A10B — IQ3_S mix — 3.66 bpw)

Settings: CUDA backend, ngl 95, n_batch 8192, n_ubatch 8192, ctx k/v q8_0, amb 512, offload `blk.(?:[0-9]|[1-4][0-9]|[5][0-4]).ffn.*_exps.*=CPU`

| Variant | VRAM | Effective params | muge | test | Speed (t/s) |
|---|---|---|---|---|---|
| no muge | 93.12 GiB | 228.69 B | 0 | pp8192 | 1419.85 ± 6.63 |
| no muge | 93.12 GiB | 228.69 B | 0 | tg512 | 13.35 ± 0.02 |
| no muge | 93.12 GiB | 228.69 B | 0 | tg128 | 13.43 ± 0.02 |
| muge | 153.06 GiB | 378.48 B | 1 | pp8192 | 1492.79 ± 16.75 |
| muge | 153.06 GiB | 378.48 B | 1 | tg128 | 13.38 ± 0.03 |
| muge | 153.06 GiB | 378.48 B | 1 | tg512 | 13.41 ± 0.04 |

So basically a marginal bump for pp, tg unchanged. Cool.

7800X3D
128 GB DDR5-6000
5090 (PCIe 5.0)

(With muge it's showing high memory usage; I don't have enough RAM for that, though I take it that's not impacting performance much.)

Commands:

```
build/bin/llama-bench -m models/Qwen3-VL-235B-A22B-Thinking.i1-IQ3_XS.gguf -ot "blk.(?:[0-9]|[1-6][0-9]|[8][0-4]).ffn.*_exps.*=CPU" -b 8192 -ub 4096 -ctk q8_0 -ctv q8_0 --threads 8 -ngl 95 -amb 512 -p x -n x -mqkv 1 -muge (0 or 1)

build/bin/llama-bench -m models/MiniMax-M2.1-IQ3_M.gguf -ot "blk.(?:[0-9]|[1-4][0-9]|[5][0-4]).ffn.*_exps.*=CPU" -b 8192 -ub 8192 -ctk q8_0 -ctv q8_0 --threads 8 -ngl 95 -amb 512 -p x -n x -mqkv 1 -muge (0 or 1) -r 3
```
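The `-ot` flag takes a `<tensor-name regex>=<backend>` pair. As a sanity check on which layers the Qwen3-VL pattern actually routes to the CPU, here is a short sketch, assuming the usual `blk.<i>.ffn_up_exps.weight` tensor naming and that the pattern's `*` characters were eaten by markdown rendering (the unescaped `.` acts as a regex wildcard, which is harmless here):

```python
import re

pat = re.compile(r"blk.(?:[0-9]|[1-6][0-9]|[8][0-4]).ffn.*_exps")
offloaded = [i for i in range(95) if pat.match(f"blk.{i}.ffn_up_exps.weight")]
print(offloaded)  # blocks 0-69 and 80-84; 70-79 and 85-94 stay on the GPU
```

Note the gap: blocks 70-79 are not matched by any alternative, so they remain on the GPU along with 85-94.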
