
Mistral 4 support #1450

Merged
ikawrakow merged 3 commits into main from ik/mistral4
Mar 18, 2026

Conversation

@ikawrakow
Owner

ik_llama.cpp is not important or famous, so we cannot have day-0 support for new models. But in this particular case, we get day-1.

CPU and CUDA are supported, including flash attention for the new head-size combination of 320, 256.

Tested with Unsloth's UD-IQ2_XXS quant (because I wanted to have full GPU offload on a 2x3090 system).

CUDA performance

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 128 | 0 | 0.717 | 2855.61 | 0.911 | 140.53 |
| 2048 | 128 | 2048 | 0.689 | 2972.12 | 0.948 | 135.08 |
| 2048 | 128 | 4096 | 0.743 | 2756.42 | 1.017 | 125.91 |
| 2048 | 128 | 6144 | 0.799 | 2562.77 | 1.031 | 124.18 |
| 2048 | 128 | 8192 | 0.854 | 2397.46 | 1.052 | 121.72 |
| 2048 | 128 | 10240 | 0.912 | 2244.70 | 1.062 | 120.49 |
| 2048 | 128 | 12288 | 0.965 | 2122.12 | 1.081 | 118.38 |
| 2048 | 128 | 14336 | 1.025 | 1998.21 | 1.099 | 116.44 |
| 2048 | 128 | 16384 | 1.079 | 1898.76 | 1.117 | 114.58 |
| 2048 | 128 | 18432 | 1.130 | 1813.10 | 1.132 | 113.09 |
| 2048 | 128 | 20480 | 1.189 | 1722.30 | 1.143 | 111.97 |
| 2048 | 128 | 22528 | 1.246 | 1644.04 | 1.171 | 109.29 |
| 2048 | 128 | 24576 | 1.305 | 1569.44 | 1.235 | 103.65 |
| 2048 | 128 | 26624 | 1.359 | 1506.80 | 1.243 | 103.00 |
| 2048 | 128 | 28672 | 1.417 | 1445.34 | 1.256 | 101.89 |
| 2048 | 128 | 30720 | 1.471 | 1392.16 | 1.258 | 101.77 |
| 2048 | 128 | 32768 | 1.527 | 1340.94 | 1.275 | 100.39 |
| 2048 | 128 | 34816 | 1.587 | 1290.35 | 1.276 | 100.31 |
| 2048 | 128 | 36864 | 1.642 | 1247.46 | 1.282 | 99.82 |
| 2048 | 128 | 38912 | 1.702 | 1203.28 | 1.293 | 99.02 |
| 2048 | 128 | 40960 | 1.754 | 1167.65 | 1.294 | 98.89 |
| 2048 | 128 | 43008 | 1.821 | 1124.48 | 1.295 | 98.87 |
| 2048 | 128 | 45056 | 1.873 | 1093.54 | 1.299 | 98.51 |
| 2048 | 128 | 47104 | 1.943 | 1053.90 | 1.310 | 97.70 |
| 2048 | 128 | 49152 | 1.996 | 1026.10 | 1.360 | 94.13 |
| 2048 | 128 | 51200 | 2.054 | 997.25 | 1.370 | 93.41 |
| 2048 | 128 | 53248 | 2.116 | 967.69 | 1.389 | 92.12 |
| 2048 | 128 | 55296 | 2.166 | 945.53 | 1.396 | 91.68 |
| 2048 | 128 | 57344 | 2.223 | 921.16 | 1.412 | 90.63 |
| 2048 | 128 | 59392 | 2.289 | 894.66 | 1.416 | 90.39 |
| 2048 | 128 | 61440 | 2.339 | 875.77 | 1.428 | 89.64 |
| 2048 | 128 | 63488 | 2.408 | 850.43 | 1.433 | 89.32 |
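A note on reading these tables: N_KV is the number of tokens already in the KV cache when the measurement starts, and the throughput columns appear to be derived directly from the token counts and elapsed times, i.e. S_PP = PP / T_PP and S_TG = TG / T_TG. A quick sanity check against the first and last rows above (small deviations come from the times being rounded to three decimals):

```python
# Sanity check: throughput columns should equal tokens / elapsed time.
# Values copied from the first and last rows of the CUDA table above.
rows = [
    # (PP, TG, T_PP, S_PP, T_TG, S_TG)
    (2048, 128, 0.717, 2855.61, 0.911, 140.53),
    (2048, 128, 2.408, 850.43, 1.433, 89.32),
]

for pp, tg, t_pp, s_pp, t_tg, s_tg in rows:
    # Agreement within 1% — the residual is rounding of T_PP / T_TG.
    assert abs(pp / t_pp - s_pp) < s_pp * 0.01
    assert abs(tg / t_tg - s_tg) < s_tg * 0.01
```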

llama.cpp does not have CUDA FA support as of this writing, so we cannot get very far with context length. Here is as far as it gets on the 2x3090 system:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 128 | 0 | 0.818 | 2505.14 | 1.253 | 102.18 |
| 2048 | 128 | 2048 | 0.975 | 2101.28 | 1.783 | 71.78 |
| 2048 | 128 | 4096 | 1.158 | 1768.20 | 2.290 | 55.90 |
| 2048 | 128 | 6144 | 1.328 | 1542.32 | 2.813 | 45.50 |
| 2048 | 128 | 8192 | 1.503 | 1362.18 | 3.300 | 38.78 |

CPU performance

Running on a Ryzen-3995WX CPU.

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 64 | 0 | 9.891 | 207.07 | 1.588 | 40.31 |
| 2048 | 64 | 2048 | 11.430 | 179.18 | 1.817 | 35.22 |
| 2048 | 64 | 4096 | 12.539 | 163.32 | 1.878 | 34.08 |
| 2048 | 64 | 6144 | 12.365 | 165.63 | 1.969 | 32.50 |
| 2048 | 64 | 8192 | 15.007 | 136.47 | 2.109 | 30.35 |
| 2048 | 64 | 10240 | 14.743 | 138.92 | 2.104 | 30.42 |
| 2048 | 64 | 12288 | 17.252 | 118.71 | 2.140 | 29.91 |
| 2048 | 64 | 14336 | 17.025 | 120.30 | 2.235 | 28.64 |
| 2048 | 64 | 16384 | 17.007 | 120.42 | 2.219 | 28.84 |

And here is what we get with llama.cpp. On the CPU, FA is enabled.

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 64 | 0 | 19.789 | 103.49 | 2.479 | 25.82 |
| 2048 | 64 | 2048 | 23.348 | 87.72 | 3.013 | 21.24 |
| 2048 | 64 | 4096 | 26.811 | 76.39 | 3.266 | 19.60 |
| 2048 | 64 | 6144 | 29.218 | 70.09 | 3.378 | 18.94 |
| 2048 | 64 | 8192 | 32.176 | 63.65 | 3.513 | 18.22 |
| 2048 | 64 | 10240 | 34.809 | 58.83 | 3.683 | 17.38 |
| 2048 | 64 | 12288 | 37.803 | 54.18 | 3.834 | 16.69 |
| 2048 | 64 | 14336 | 40.979 | 49.98 | 3.993 | 16.03 |
| 2048 | 64 | 16384 | 43.620 | 46.95 | 4.108 | 15.58 |

@ikawrakow
Owner Author

Hmm, interesting. It seems that for Mistral 4 my original indirect matrix multiplication implementation is faster for prompt processing than the default. The heuristic used to determine which implementation to use is

- If `u-batch <= mmq-id-size * n_experts`, use the new implementation
- Else, use the original implementation

Here the default value for `mmq-id-size` is 32, so for Mistral 4 (128 experts) the new implementation is used up to a u-batch size of 4096. `mmq-id-size` can be changed via the command line argument `-cuda mmq-id-size=X`, where X is an integer. The graph below compares PP-2048 with u-batch = 2048 between the default and `-cuda mmq-id-size=1`. The difference is not big, but the advantage of `mmq-id-size=1` over the default persists down to a u-batch size of 256.
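The dispatch rule above can be sketched as follows (the function and parameter names here are illustrative, not the actual ik_llama.cpp identifiers):

```python
# Sketch of the described dispatch heuristic: use the new indirect MoE
# matrix-multiplication path when the u-batch is small relative to the
# number of experts, otherwise fall back to the original implementation.
def use_new_mmq_id(u_batch: int, n_experts: int, mmq_id_size: int = 32) -> bool:
    return u_batch <= mmq_id_size * n_experts

# Mistral 4 has 128 experts, so with the default mmq-id-size of 32 the
# new implementation is selected up to a u-batch size of 32 * 128 = 4096:
assert use_new_mmq_id(4096, 128)
assert not use_new_mmq_id(8192, 128)

# With -cuda mmq-id-size=1 the switch point drops to 128, so any u-batch
# of 256 or more takes the original (faster, per the graph) path:
assert not use_new_mmq_id(256, 128, mmq_id_size=1)
```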

*(graph `m4_pp`: PP-2048 performance, default vs. `-cuda mmq-id-size=1`)*

ikawrakow merged commit 56477c7 into main on Mar 18, 2026
@vvverily

ik_llama.cpp is important to me :3 With this PR I can run an IQ3 quant of the model at 600 t/s prefill and 11 t/s decode on my 8 GB VRAM / 64 GB RAM laptop. Thank you for your work!

@dinerburger

Yeah, gotta say, I'm seeing 2x speed uplift over mainline on Qwen3.5-27B for PP. ik_llama.cpp remains undefeated and incredibly important.

@aviallon

@ikawrakow would you be interested in making a CPU WASM target? Given how far ahead of mainline you are for CPU inference, it could make in-browser small agents much better (or even feasible).
