
Mistral 4 support #1450

Merged
ikawrakow merged 3 commits into main from ik/mistral4
Mar 18, 2026

Conversation

@ikawrakow
Owner

ik_llama.cpp is not important or famous, so we cannot have day-0 support for new models. But in this particular case, we get day-1.

CPU and CUDA are supported, including flash attention for the new head-size combination of 320, 256.

Tested with Unsloth's UD-IQ2_XXS quant (because I wanted to have full GPU offload on a 2x3090 system).

CUDA performance

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 128 | 0 | 0.717 | 2855.61 | 0.911 | 140.53 |
| 2048 | 128 | 2048 | 0.689 | 2972.12 | 0.948 | 135.08 |
| 2048 | 128 | 4096 | 0.743 | 2756.42 | 1.017 | 125.91 |
| 2048 | 128 | 6144 | 0.799 | 2562.77 | 1.031 | 124.18 |
| 2048 | 128 | 8192 | 0.854 | 2397.46 | 1.052 | 121.72 |
| 2048 | 128 | 10240 | 0.912 | 2244.70 | 1.062 | 120.49 |
| 2048 | 128 | 12288 | 0.965 | 2122.12 | 1.081 | 118.38 |
| 2048 | 128 | 14336 | 1.025 | 1998.21 | 1.099 | 116.44 |
| 2048 | 128 | 16384 | 1.079 | 1898.76 | 1.117 | 114.58 |
| 2048 | 128 | 18432 | 1.130 | 1813.10 | 1.132 | 113.09 |
| 2048 | 128 | 20480 | 1.189 | 1722.30 | 1.143 | 111.97 |
| 2048 | 128 | 22528 | 1.246 | 1644.04 | 1.171 | 109.29 |
| 2048 | 128 | 24576 | 1.305 | 1569.44 | 1.235 | 103.65 |
| 2048 | 128 | 26624 | 1.359 | 1506.80 | 1.243 | 103.00 |
| 2048 | 128 | 28672 | 1.417 | 1445.34 | 1.256 | 101.89 |
| 2048 | 128 | 30720 | 1.471 | 1392.16 | 1.258 | 101.77 |
| 2048 | 128 | 32768 | 1.527 | 1340.94 | 1.275 | 100.39 |
| 2048 | 128 | 34816 | 1.587 | 1290.35 | 1.276 | 100.31 |
| 2048 | 128 | 36864 | 1.642 | 1247.46 | 1.282 | 99.82 |
| 2048 | 128 | 38912 | 1.702 | 1203.28 | 1.293 | 99.02 |
| 2048 | 128 | 40960 | 1.754 | 1167.65 | 1.294 | 98.89 |
| 2048 | 128 | 43008 | 1.821 | 1124.48 | 1.295 | 98.87 |
| 2048 | 128 | 45056 | 1.873 | 1093.54 | 1.299 | 98.51 |
| 2048 | 128 | 47104 | 1.943 | 1053.90 | 1.310 | 97.70 |
| 2048 | 128 | 49152 | 1.996 | 1026.10 | 1.360 | 94.13 |
| 2048 | 128 | 51200 | 2.054 | 997.25 | 1.370 | 93.41 |
| 2048 | 128 | 53248 | 2.116 | 967.69 | 1.389 | 92.12 |
| 2048 | 128 | 55296 | 2.166 | 945.53 | 1.396 | 91.68 |
| 2048 | 128 | 57344 | 2.223 | 921.16 | 1.412 | 90.63 |
| 2048 | 128 | 59392 | 2.289 | 894.66 | 1.416 | 90.39 |
| 2048 | 128 | 61440 | 2.339 | 875.77 | 1.428 | 89.64 |
| 2048 | 128 | 63488 | 2.408 | 850.43 | 1.433 | 89.32 |
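A note on reading these tables: N_KV is the number of tokens already in the KV cache when the measurement starts, and the throughput columns appear to be derived directly from the token counts and elapsed times, i.e. S_PP = PP / T_PP and S_TG = TG / T_TG. A quick sanity check against the first and last rows above (small deviations come from the times being rounded to three decimals):

```python
# Sanity check: throughput columns should equal tokens / elapsed time.
# Values copied from the first and last rows of the CUDA table above.
rows = [
    # (PP, TG, T_PP, S_PP, T_TG, S_TG)
    (2048, 128, 0.717, 2855.61, 0.911, 140.53),
    (2048, 128, 2.408, 850.43, 1.433, 89.32),
]

for pp, tg, t_pp, s_pp, t_tg, s_tg in rows:
    # Agreement within 1% — the residual is rounding of T_PP / T_TG.
    assert abs(pp / t_pp - s_pp) < s_pp * 0.01
    assert abs(tg / t_tg - s_tg) < s_tg * 0.01
```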

llama.cpp does not have CUDA FA support as of this writing, so we cannot get very far with context length. Here is as far as it gets on the 2x3090 system:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 128 | 0 | 0.818 | 2505.14 | 1.253 | 102.18 |
| 2048 | 128 | 2048 | 0.975 | 2101.28 | 1.783 | 71.78 |
| 2048 | 128 | 4096 | 1.158 | 1768.20 | 2.290 | 55.90 |
| 2048 | 128 | 6144 | 1.328 | 1542.32 | 2.813 | 45.50 |
| 2048 | 128 | 8192 | 1.503 | 1362.18 | 3.300 | 38.78 |

CPU performance

Running on a Ryzen-3995WX CPU.

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 64 | 0 | 9.891 | 207.07 | 1.588 | 40.31 |
| 2048 | 64 | 2048 | 11.430 | 179.18 | 1.817 | 35.22 |
| 2048 | 64 | 4096 | 12.539 | 163.32 | 1.878 | 34.08 |
| 2048 | 64 | 6144 | 12.365 | 165.63 | 1.969 | 32.50 |
| 2048 | 64 | 8192 | 15.007 | 136.47 | 2.109 | 30.35 |
| 2048 | 64 | 10240 | 14.743 | 138.92 | 2.104 | 30.42 |
| 2048 | 64 | 12288 | 17.252 | 118.71 | 2.140 | 29.91 |
| 2048 | 64 | 14336 | 17.025 | 120.30 | 2.235 | 28.64 |
| 2048 | 64 | 16384 | 17.007 | 120.42 | 2.219 | 28.84 |

And here is what we get with llama.cpp. On the CPU, FA is enabled.

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 64 | 0 | 19.789 | 103.49 | 2.479 | 25.82 |
| 2048 | 64 | 2048 | 23.348 | 87.72 | 3.013 | 21.24 |
| 2048 | 64 | 4096 | 26.811 | 76.39 | 3.266 | 19.60 |
| 2048 | 64 | 6144 | 29.218 | 70.09 | 3.378 | 18.94 |
| 2048 | 64 | 8192 | 32.176 | 63.65 | 3.513 | 18.22 |
| 2048 | 64 | 10240 | 34.809 | 58.83 | 3.683 | 17.38 |
| 2048 | 64 | 12288 | 37.803 | 54.18 | 3.834 | 16.69 |
| 2048 | 64 | 14336 | 40.979 | 49.98 | 3.993 | 16.03 |
| 2048 | 64 | 16384 | 43.620 | 46.95 | 4.108 | 15.58 |

@ikawrakow
Owner Author

Hmm, interesting. It seems that for Mistral 4 my original indirect matrix multiplication implementation is faster for prompt processing than the default. The heuristic used to determine which implementation to use is

- If `u-batch <= mmq-id-size * n_experts`, use the new implementation
- Else, use the original implementation

Here the default value for `mmq-id-size` is 32, so for Mistral 4 (128 experts) the new implementation is used up to a u-batch size of 4096. `mmq-id-size` can be changed via the command line argument `-cuda mmq-id-size=X`, where X is an integer. The graph below compares PP-2048 with u-batch = 2048 between the default and `-cuda mmq-id-size=1`. The difference is not big, but the advantage of `mmq-id-size=1` over the default persists down to a u-batch size of 256.
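The dispatch rule above can be sketched as follows (the function and parameter names here are illustrative, not the actual ik_llama.cpp identifiers):

```python
# Sketch of the described dispatch heuristic: use the new indirect MoE
# matrix-multiplication path when the u-batch is small relative to the
# number of experts, otherwise fall back to the original implementation.
def use_new_mmq_id(u_batch: int, n_experts: int, mmq_id_size: int = 32) -> bool:
    return u_batch <= mmq_id_size * n_experts

# Mistral 4 has 128 experts, so with the default mmq-id-size of 32 the
# new implementation is selected up to a u-batch size of 32 * 128 = 4096:
assert use_new_mmq_id(4096, 128)
assert not use_new_mmq_id(8192, 128)

# With -cuda mmq-id-size=1 the switch point drops to 128, so any u-batch
# of 256 or more takes the original (faster, per the graph) path:
assert not use_new_mmq_id(256, 128, mmq_id_size=1)
```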

*(graph `m4_pp`: PP-2048 performance, default vs. `-cuda mmq-id-size=1`)*

ikawrakow merged commit 56477c7 into main on Mar 18, 2026
@vvverily

ik_llama.cpp is important to me :3 With this PR I can run an IQ3 quant of the model at 600 t/s prefill and 11 t/s decode on my 8 GB VRAM / 64 GB RAM laptop. Thank you for your work!

@dinerburger

Yeah, gotta say, I'm seeing 2x speed uplift over mainline on Qwen3.5-27B for PP. ik_llama.cpp remains undefeated and incredibly important.

@aviallon

@ikawrakow would you be interested in making a CPU WASM target? Given how far ahead of mainline you are for CPU inference, it could make in-browser small agents much better (or even feasible).
