Step-3.5-Flash support#1231

Merged
ikawrakow merged 10 commits into main from ik/step35
Feb 5, 2026

Conversation

@ikawrakow
Owner

@ikawrakow ikawrakow commented Feb 4, 2026

This PR adds support for the Step-3.5-Flash model.

I'm observing very peculiar PP performance, and I haven't been able to figure out the root cause. The graph shows the effect on an 8x3090 system with full offload. The strange drop in performance somewhere between 16k and 20k context may have something to do with GPU cache sizes, but the jump in performance for u_batch = 512 seems really mysterious. I'll keep trying to sort out what is going on, but in the meantime I'm putting the PR out there for testing.

Update: See this comment for the explanation of the observed "mystery".

[graph: PP performance vs. context length, 8x3090 with full offload]

Caveat: ik_llama.cpp does not implement KV cache size reduction for SWA models. Step-3.5-Flash is a heavy SWA user, with 3 out of 4 layers using SWA with a window size of 512. Hence, the KV cache in ik_llama.cpp will be significantly larger than in mainline. On the other hand, one does not need to worry about taking KV snapshots to be able to rewind/reuse the KV cache.
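To put the caveat in rough numbers, here is a back-of-the-envelope sketch. The layer count, per-token KV width, and context length below are illustrative placeholders, not the real Step-3.5-Flash hyperparameters; only the 3-out-of-4 SWA ratio and the 512 window come from this PR.

```python
# Back-of-the-envelope KV cache size comparison for a SWA-heavy model.
# All model dimensions are illustrative placeholders, NOT the real
# Step-3.5-Flash config.

N_LAYERS = 45        # placeholder layer count
KV_DIM   = 1024      # placeholder n_head_kv * head_dim
BYTES    = 2         # f16 per element, for K and for V
WINDOW   = 512       # SWA window size from the PR description
N_CTX    = 65536     # example context length

n_swa  = (N_LAYERS * 3) // 4          # 3 out of 4 layers use SWA
n_full = N_LAYERS - n_swa

def cache_bytes(ctx_full, ctx_swa):
    per_tok = 2 * KV_DIM * BYTES      # K + V per token per layer
    return (n_full * ctx_full + n_swa * ctx_swa) * per_tok

full = cache_bytes(N_CTX, N_CTX)      # no SWA reduction (ik_llama.cpp)
swa  = cache_bytes(N_CTX, WINDOW)     # SWA layers keep only the window

print(f"no reduction  : {full / 2**30:.2f} GiB")
print(f"with reduction: {swa / 2**30:.2f} GiB")
print(f"ratio: {full / swa:.1f}x")
```

With these placeholder numbers the full cache is roughly 3-4x the SWA-reduced one; the real ratio depends on the actual head dimensions and context length.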

Closes #1230

@ikawrakow
Owner Author

Where is the excitement?

@ikawrakow ikawrakow merged commit 9c1c74a into main Feb 5, 2026
@leflakk

leflakk commented Feb 5, 2026

Where is the excitement?

Thank you so much! Tested with opencode + 7x3090 and it looks good. Noob question: I see a drop in token generation at > 50k tokens (compared to Minimax); is this due to sm graph not being supported?

@ikawrakow
Owner Author

OK, performance drop mystery solved: For this model and sufficiently long context, the FA calculation results in NaNs due to f16 range overflow (see #1196, #1198). Once this happens, and we have NaNs in the KV cache, from there on the measured performance becomes meaningless because we pick an unrealistic set of experts, etc.
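The failure mode can be illustrated in a few lines. This is a toy model of an f16 accumulator, not the actual CUDA kernel; the exact place the overflow happens in the FA computation is described in #1196/#1198.

```python
import numpy as np

# Toy illustration of the f16 overflow: f16 saturates at 65504, so an
# un-normalized flash-attention accumulator like sum_i exp(s_i - m) * v_i,
# if held in f16, can walk past the representable range and become inf.
# The final normalization then yields inf/inf = nan, and the nan written
# into the attention output poisons the KV cache on subsequent steps.
acc = np.float16(0.0)
for _ in range(4096):                  # 4096 positions, each term ~ 32
    acc = np.float16(acc + np.float16(32.0))

print(np.isinf(acc))                   # True: the f16 sum overflowed
with np.errstate(invalid="ignore"):
    print(acc / acc)                   # nan
```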

Running sweep-bench with -cuda fa-offset=0.6931 solves the performance drop. Here is what I get with that on a 6x3090 system:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 128 | 0 | 1.272 | 1610.60 | 1.469 | 87.12 |
| 2048 | 128 | 2048 | 1.278 | 1602.86 | 1.469 | 87.16 |
| 2048 | 128 | 4096 | 1.301 | 1574.14 | 1.502 | 85.24 |
| 2048 | 128 | 6144 | 1.327 | 1543.89 | 1.519 | 84.27 |
| 2048 | 128 | 8192 | 1.349 | 1518.55 | 1.549 | 82.62 |
| 2048 | 128 | 10240 | 1.368 | 1496.88 | 1.568 | 81.63 |
| 2048 | 128 | 12288 | 1.405 | 1458.06 | 1.585 | 80.75 |
| 2048 | 128 | 14336 | 1.423 | 1439.60 | 1.598 | 80.08 |
| 2048 | 128 | 16384 | 1.456 | 1406.86 | 1.621 | 78.96 |
| 2048 | 128 | 18432 | 1.480 | 1383.69 | 1.642 | 77.96 |
| 2048 | 128 | 20480 | 1.507 | 1358.65 | 1.665 | 76.89 |
| 2048 | 128 | 22528 | 1.530 | 1338.95 | 1.681 | 76.15 |
| 2048 | 128 | 24576 | 1.568 | 1305.96 | 1.703 | 75.15 |
| 2048 | 128 | 26624 | 1.594 | 1284.64 | 1.741 | 73.53 |
| 2048 | 128 | 28672 | 1.621 | 1263.60 | 1.752 | 73.06 |
| 2048 | 128 | 30720 | 1.650 | 1241.36 | 1.767 | 72.45 |
| 2048 | 128 | 32768 | 1.686 | 1214.81 | 1.781 | 71.89 |
| 2048 | 128 | 34816 | 1.710 | 1197.76 | 1.810 | 70.71 |
| 2048 | 128 | 36864 | 1.745 | 1173.67 | 1.838 | 69.62 |
| 2048 | 128 | 38912 | 1.765 | 1160.33 | 1.849 | 69.24 |
| 2048 | 128 | 40960 | 1.795 | 1140.91 | 1.876 | 68.23 |
| 2048 | 128 | 43008 | 1.822 | 1123.83 | 1.892 | 67.67 |
| 2048 | 128 | 45056 | 1.856 | 1103.68 | 1.914 | 66.89 |
| 2048 | 128 | 47104 | 1.885 | 1086.42 | 1.941 | 65.96 |
| 2048 | 128 | 49152 | 1.913 | 1070.80 | 1.947 | 65.75 |
| 2048 | 128 | 51200 | 1.931 | 1060.52 | 1.974 | 64.86 |
| 2048 | 128 | 53248 | 1.964 | 1042.67 | 2.005 | 63.85 |
| 2048 | 128 | 55296 | 2.221 | 922.01 | 2.034 | 62.93 |
| 2048 | 128 | 57344 | 2.013 | 1017.28 | 2.034 | 62.92 |
| 2048 | 128 | 59392 | 2.046 | 1000.93 | 2.062 | 62.07 |
| 2048 | 128 | 61440 | 2.066 | 991.46 | 2.083 | 61.44 |
| 2048 | 128 | 63488 | 2.084 | 982.74 | 2.108 | 60.71 |

This is now the second model where we observe the FA overflow, so I guess I'll change the default FA offset to 0.6931.
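For what it's worth, 0.6931 is ln(2): subtracting it inside the exponent halves every exp term, doubling the headroom of the accumulated sums, while the normalized softmax is unchanged because numerator and denominator scale by the same factor. A minimal sketch of that invariance (plain softmax, not the FA kernel):

```python
import math

# 0.6931 ~ ln(2). Subtracting it inside exp halves every term, so the
# un-normalized accumulator z shrinks 2x (more f16 headroom), while the
# normalized probabilities are mathematically identical.
OFFSET = 0.6931

def softmax(scores, offset=0.0):
    m = max(scores)
    e = [math.exp(s - m - offset) for s in scores]
    z = sum(e)
    return [x / z for x in e], z

scores = [3.0, 1.0, 0.5, 2.5]
p0, z0 = softmax(scores)
p1, z1 = softmax(scores, OFFSET)

print(max(abs(a - b) for a, b in zip(p0, p1)))  # ~0: probabilities unchanged
print(z0 / z1)                                  # ~2: accumulator halved
```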

@ubergarm
Contributor

ubergarm commented Feb 5, 2026

Thanks for adding initial support for this one and all the work you've been doing lately, ik! I'm slowly catching up after a rough couple weeks of life, and sorry I missed everyone at FOSDEM26.

Looking into quantizing this Step3.5-Flash model now. Two things came up, as the unmerged mainline implementation is still in flux in their PR ggml-org/llama.cpp#19283 (comment).

Specifically:

  1. mainline now supports sliding window pattern GGUF metadata of type arr[bool, ...], where before we typically saw arr[i32, ...], e.g. step35.attention.sliding_window_pattern arr[bool,45] = [false, true, true, true, false, true... (model-loader : support bool array sliding window pattern ggml-org/llama.cpp#18850)
  2. Even after casting the above to int, I hit another snag with error loading model hyperparameters: key not found in model: step35.rope.dimension_count_per_layer, likely due to disabling KV shift here: Support Step3.5-Flash ggml-org/llama.cpp#19283 (comment)
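The two encodings in item 1 carry the same information, so a loader (or a conversion script) can normalize one to the other. A sketch, not the actual gguf tooling; the 45-entry pattern below just extends the `[false, true, true, true, ...]` prefix shown above and is an assumption:

```python
# Normalize a sliding-window pattern read from GGUF metadata:
# arr[bool, N] (newer mainline) and arr[i32, N] (older files) carry the
# same per-layer is-SWA flags. Hypothetical helper, not real gguf API.

def normalize_swa_pattern(pattern):
    # bools become 1/0; ints pass through unchanged
    return [int(bool(x)) if isinstance(x, bool) else int(x) for x in pattern]

# Assumed 45-entry layout extending the prefix quoted in the comment above.
bool_pattern = [False, True, True, True] * 11 + [False]
int_pattern = normalize_swa_pattern(bool_pattern)
print(len(int_pattern), int_pattern[:8])   # 45 [0, 1, 1, 1, 0, 1, 1, 1]
```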

I'll hold off on releasing ik quants today to avoid potential future incompatibilities.

@gapeleon
Contributor

gapeleon commented Feb 6, 2026

Where is the excitement?

I've been (quietly) excited by a lot of things you've been doing in this project for a while now!
I just didn't want to spam comments if I don't have any useful data to provide haha.

Off the top of my head recently:

-> The Seed-OSS support merge + graph parallel is really awesome; I've been using that model since you added these.
It's an underrated gem I'd completely missed.
-> I couldn't find quants for Step-3.5 and my /full_models drive is 94% full
-> That --k-cache-hadamard feature you added is great because I'm able to use a larger --ctx-size

Thank you for accepting my control-vector API PR; I wasn't sure if it was too niche a feature.

@ikawrakow
Owner Author

@ubergarm

Thank you for letting me know about these incompatibilities.

I used this model for development and testing. I see they have uploaded a new version; not sure I want to re-download before the dust has settled. But I did check the latest version of the mainline PR (2c0bba974d4674bbe785f75327965873a21b25e6), and I'm finding that

  • The PR is still using uint32_t for the step35.attention.sliding_window_pattern arr and reading it as a uint32_t array, so I'm not sure why this is causing issues for you.
  • They did indeed remove step35.rope.dimension_count_per_layer. Haha, a project that is otherwise striving for full generality changed the fully general way of setting the RoPE rotation length per layer via an array to the hacky solution of n_rot_l = is_swa ? hparams.n_rot : (hparams.n_rot / 2) (where there is absolutely zero theoretical foundation that that's how it needs to be). Anyway, I think I can easily adjust.
  • They also renamed the clamps applied to the up and gate activations in the FFN part from %s.swiglu_limits and %s.swiglu_limits_shared to %s.swiglu_clamp_exp and %s.swiglu_clamp_shexp.

These all seem to be very minor cosmetic changes that can be easily handled. Maybe I'll push a PR later today.
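The array-vs-derived-rule equivalence in the second point can be sketched as follows. The n_rot value and the 45-entry SWA pattern are illustrative placeholders, not the real Step-3.5-Flash hyperparameters:

```python
# Two ways to obtain the per-layer RoPE rotation length discussed above.
# Placeholder hyperparameters, NOT the real Step-3.5-Flash config.
n_rot = 64                                              # placeholder hparams.n_rot
swa_pattern = [False, True, True, True] * 11 + [False]  # per-layer is_swa, 45 entries

# Removed mainline approach: a fully general per-layer array
# (step35.rope.dimension_count_per_layer in GGUF metadata).
n_rot_per_layer = [n_rot if is_swa else n_rot // 2 for is_swa in swa_pattern]

# New mainline approach: derive it on the fly from the SWA flag,
# mirroring n_rot_l = is_swa ? hparams.n_rot : (hparams.n_rot / 2).
def n_rot_l(il):
    return n_rot if swa_pattern[il] else n_rot // 2

# For any fixed halving rule the two agree; the array is strictly
# more general, since it can encode per-layer values the rule cannot.
assert all(n_rot_l(il) == n_rot_per_layer[il] for il in range(len(swa_pattern)))
print(n_rot_per_layer[:5])   # [32, 64, 64, 64, 32]
```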

@saood06
Collaborator

saood06 commented Feb 6, 2026

Seed-OSS support merge
It's an underrated gem I'd completely missed.

I agree, I was really surprised nobody ever opened an issue requesting support or just added it. GLM 4.7 Flash reminded me that I like having a model fully offloaded (to my single 3090), but it left a bad taste in my mouth with its quality. I remembered Seed had come out and had caught my curiosity, but I had never tried it, which is why I added support (with the only downside being the limited context window due to not using MLA) so that I could give it a try.

There are other architectures that aren't here yet that I have on my radar to add.
