Step-3.5-Flash support#1231

Merged
ikawrakow merged 10 commits into main from ik/step35
Feb 5, 2026

Conversation

@ikawrakow
Owner

@ikawrakow ikawrakow commented Feb 4, 2026

This PR adds support for the Step-3.5-Flash model.

I'm observing very peculiar PP performance, and I haven't been able to figure out the root cause. The graph shows the effect on an 8x3090 system with full offload. The strange drop in performance somewhere between 16k and 20k context may have something to do with GPU cache sizes, but the jump in performance for u_batch = 512 seems really mysterious. I'll keep trying to sort out what is going on, but in the meantime I'm putting the PR out there for testing.

Update: See this comment for the explanation of the observed "mystery".

[graph: PP performance vs. context length, 8x3090 with full offload]

Caveat: ik_llama.cpp does not implement KV cache size reduction for SWA models. Step-3.5-Flash is a heavy SWA user, with 3 out of 4 layers using SWA with a window size of 512. Hence, the KV cache in ik_llama.cpp will be significantly larger than in mainline. On the other hand, one does not need to worry about taking KV snapshots to be able to rewind/reuse the KV cache.
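To put the caveat in rough numbers, here is a back-of-the-envelope sketch. The layer count, per-token KV width, and context length below are illustrative placeholders, not the real Step-3.5-Flash hyperparameters; only the 3-out-of-4 SWA ratio and the 512 window come from this PR.

```python
# Back-of-the-envelope KV cache size comparison for a SWA-heavy model.
# All model dimensions are illustrative placeholders, NOT the real
# Step-3.5-Flash config.

N_LAYERS = 45        # placeholder layer count
KV_DIM   = 1024      # placeholder n_head_kv * head_dim
BYTES    = 2         # f16 per element, for K and for V
WINDOW   = 512       # SWA window size from the PR description
N_CTX    = 65536     # example context length

n_swa  = (N_LAYERS * 3) // 4          # 3 out of 4 layers use SWA
n_full = N_LAYERS - n_swa

def cache_bytes(ctx_full, ctx_swa):
    per_tok = 2 * KV_DIM * BYTES      # K + V per token per layer
    return (n_full * ctx_full + n_swa * ctx_swa) * per_tok

full = cache_bytes(N_CTX, N_CTX)      # no SWA reduction (ik_llama.cpp)
swa  = cache_bytes(N_CTX, WINDOW)     # SWA layers keep only the window

print(f"no reduction  : {full / 2**30:.2f} GiB")
print(f"with reduction: {swa / 2**30:.2f} GiB")
print(f"ratio: {full / swa:.1f}x")
```

With these placeholder numbers the full cache is roughly 3-4x the SWA-reduced one; the real ratio depends on the actual head dimensions and context length.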

Closes #1230

@ikawrakow
Owner Author

Where is the excitement?

@ikawrakow ikawrakow merged commit 9c1c74a into main Feb 5, 2026
@leflakk

leflakk commented Feb 5, 2026

Where is the excitement?

Thank you so much! Tested with opencode + 7x3090 and it looks good. Noob question: I see a drop in token generation at > 50k tokens (compared to Minimax); is this due to sm graph not being supported?

@ikawrakow
Owner Author

OK, performance drop mystery solved: For this model and sufficiently long context, the FA calculation results in NaNs due to f16 range overflow (see #1196, #1198). Once this happens, and we have NaNs in the KV cache, from there on the measured performance becomes meaningless because we pick an unrealistic set of experts, etc.
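The failure mode can be illustrated in a few lines. This is a toy model of an f16 accumulator, not the actual CUDA kernel; the exact place the overflow happens in the FA computation is described in #1196/#1198.

```python
import numpy as np

# Toy illustration of the f16 overflow: f16 saturates at 65504, so an
# un-normalized flash-attention accumulator like sum_i exp(s_i - m) * v_i,
# if held in f16, can walk past the representable range and become inf.
# The final normalization then yields inf/inf = nan, and the nan written
# into the attention output poisons the KV cache on subsequent steps.
acc = np.float16(0.0)
for _ in range(4096):                  # 4096 positions, each term ~ 32
    acc = np.float16(acc + np.float16(32.0))

print(np.isinf(acc))                   # True: the f16 sum overflowed
with np.errstate(invalid="ignore"):
    print(acc / acc)                   # nan
```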

Running sweep-bench with -cuda fa-offset=0.6931 solves the performance drop. Here is what I get with that on a 6x3090 system:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 128 | 0 | 1.272 | 1610.60 | 1.469 | 87.12 |
| 2048 | 128 | 2048 | 1.278 | 1602.86 | 1.469 | 87.16 |
| 2048 | 128 | 4096 | 1.301 | 1574.14 | 1.502 | 85.24 |
| 2048 | 128 | 6144 | 1.327 | 1543.89 | 1.519 | 84.27 |
| 2048 | 128 | 8192 | 1.349 | 1518.55 | 1.549 | 82.62 |
| 2048 | 128 | 10240 | 1.368 | 1496.88 | 1.568 | 81.63 |
| 2048 | 128 | 12288 | 1.405 | 1458.06 | 1.585 | 80.75 |
| 2048 | 128 | 14336 | 1.423 | 1439.60 | 1.598 | 80.08 |
| 2048 | 128 | 16384 | 1.456 | 1406.86 | 1.621 | 78.96 |
| 2048 | 128 | 18432 | 1.480 | 1383.69 | 1.642 | 77.96 |
| 2048 | 128 | 20480 | 1.507 | 1358.65 | 1.665 | 76.89 |
| 2048 | 128 | 22528 | 1.530 | 1338.95 | 1.681 | 76.15 |
| 2048 | 128 | 24576 | 1.568 | 1305.96 | 1.703 | 75.15 |
| 2048 | 128 | 26624 | 1.594 | 1284.64 | 1.741 | 73.53 |
| 2048 | 128 | 28672 | 1.621 | 1263.60 | 1.752 | 73.06 |
| 2048 | 128 | 30720 | 1.650 | 1241.36 | 1.767 | 72.45 |
| 2048 | 128 | 32768 | 1.686 | 1214.81 | 1.781 | 71.89 |
| 2048 | 128 | 34816 | 1.710 | 1197.76 | 1.810 | 70.71 |
| 2048 | 128 | 36864 | 1.745 | 1173.67 | 1.838 | 69.62 |
| 2048 | 128 | 38912 | 1.765 | 1160.33 | 1.849 | 69.24 |
| 2048 | 128 | 40960 | 1.795 | 1140.91 | 1.876 | 68.23 |
| 2048 | 128 | 43008 | 1.822 | 1123.83 | 1.892 | 67.67 |
| 2048 | 128 | 45056 | 1.856 | 1103.68 | 1.914 | 66.89 |
| 2048 | 128 | 47104 | 1.885 | 1086.42 | 1.941 | 65.96 |
| 2048 | 128 | 49152 | 1.913 | 1070.80 | 1.947 | 65.75 |
| 2048 | 128 | 51200 | 1.931 | 1060.52 | 1.974 | 64.86 |
| 2048 | 128 | 53248 | 1.964 | 1042.67 | 2.005 | 63.85 |
| 2048 | 128 | 55296 | 2.221 | 922.01 | 2.034 | 62.93 |
| 2048 | 128 | 57344 | 2.013 | 1017.28 | 2.034 | 62.92 |
| 2048 | 128 | 59392 | 2.046 | 1000.93 | 2.062 | 62.07 |
| 2048 | 128 | 61440 | 2.066 | 991.46 | 2.083 | 61.44 |
| 2048 | 128 | 63488 | 2.084 | 982.74 | 2.108 | 60.71 |

This is now the second model where we observe the FA overflow, so I guess I'll change the default FA offset to 0.6931.
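For what it's worth, 0.6931 is ln(2): subtracting it inside the exponent halves every exp term, doubling the headroom of the accumulated sums, while the normalized softmax is unchanged because numerator and denominator scale by the same factor. A minimal sketch of that invariance (plain softmax, not the FA kernel):

```python
import math

# 0.6931 ~ ln(2). Subtracting it inside exp halves every term, so the
# un-normalized accumulator z shrinks 2x (more f16 headroom), while the
# normalized probabilities are mathematically identical.
OFFSET = 0.6931

def softmax(scores, offset=0.0):
    m = max(scores)
    e = [math.exp(s - m - offset) for s in scores]
    z = sum(e)
    return [x / z for x in e], z

scores = [3.0, 1.0, 0.5, 2.5]
p0, z0 = softmax(scores)
p1, z1 = softmax(scores, OFFSET)

print(max(abs(a - b) for a, b in zip(p0, p1)))  # ~0: probabilities unchanged
print(z0 / z1)                                  # ~2: accumulator halved
```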

@ubergarm
Contributor

ubergarm commented Feb 5, 2026

Thanks for adding initial support for this one and all the work you've been doing lately, ik! I'm slowly catching up after a rough couple weeks of life, and sorry I missed everyone at FOSDEM26.

Looking into quantizing this Step3.5-Flash model now. Two things came up, as the unmerged mainline implementation is still in flux in their PR ggml-org/llama.cpp#19283 (comment).

Specifically:

  1. mainline now supports sliding window pattern GGUF metadata of type arr[bool, ...], where before we typically saw arr[i32, ...], e.g. step35.attention.sliding_window_pattern arr[bool,45] = [false, true, true, true, false, true... (model-loader : support bool array sliding window pattern ggml-org/llama.cpp#18850)
  2. Even after casting the above to int, I hit another snag with error loading model hyperparameters: key not found in model: step35.rope.dimension_count_per_layer, likely due to disabling KV shift here: Support Step3.5-Flash ggml-org/llama.cpp#19283 (comment)
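The two encodings in item 1 carry the same information, so a loader (or a conversion script) can normalize one to the other. A sketch, not the actual gguf tooling; the 45-entry pattern below just extends the `[false, true, true, true, ...]` prefix shown above and is an assumption:

```python
# Normalize a sliding-window pattern read from GGUF metadata:
# arr[bool, N] (newer mainline) and arr[i32, N] (older files) carry the
# same per-layer is-SWA flags. Hypothetical helper, not real gguf API.

def normalize_swa_pattern(pattern):
    # bools become 1/0; ints pass through unchanged
    return [int(bool(x)) if isinstance(x, bool) else int(x) for x in pattern]

# Assumed 45-entry layout extending the prefix quoted in the comment above.
bool_pattern = [False, True, True, True] * 11 + [False]
int_pattern = normalize_swa_pattern(bool_pattern)
print(len(int_pattern), int_pattern[:8])   # 45 [0, 1, 1, 1, 0, 1, 1, 1]
```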

I'll hold off on releasing ik quants today to avoid potential future incompatibilities.

@gapeleon
Contributor

gapeleon commented Feb 6, 2026

Where is the excitement?

I've been (quietly) excited by a lot of things you've been doing in this project for a while now!
I just didn't want to spam comments if I don't have any useful data to provide haha.

Off the top of my head recently:

-> The Seed-OSS support merge + graph parallel is really awesome; I've been using that model since you added these.
It's an underrated gem I'd completely missed.
-> I couldn't find quants for Step-3.5 and my /full_models drive is 94% full
-> That --k-cache-hadamard feature you added is great because I'm able to use a larger --ctx-size

Thank you for accepting my control-vector API PR; I wasn't sure if it was too niche a feature.

@ikawrakow
Owner Author

@ubergarm

Thank you for letting me know about these incompatibilities.

I used this model for development and testing. I see they have uploaded a new version; not sure I want to re-download before the dust has settled. But I did check the latest version of the mainline PR (2c0bba974d4674bbe785f75327965873a21b25e6), and I'm finding that

  • The PR is still using uint32_t for the step35.attention.sliding_window_pattern arr and reading it as a uint32_t array, so I'm not sure why this is causing issues for you.
  • They did indeed remove step35.rope.dimension_count_per_layer. Haha, a project that is otherwise striving for full generality changed the fully general way of setting the RoPE rotation length per layer via an array to the hacky solution of n_rot_l = is_swa ? hparams.n_rot : (hparams.n_rot / 2) (where there is absolutely zero theoretical foundation that that's how it needs to be). Anyway, I think I can easily adjust.
  • They also renamed the clamps applied to the up and gate activations in the FFN part from %s.swiglu_limits and %s.swiglu_limits_shared to %s.swiglu_clamp_exp and %s.swiglu_clamp_shexp.

These all seem to be very minor cosmetic changes that can be easily handled. Maybe I'll push a PR later today.
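The array-vs-derived-rule equivalence in the second point can be sketched as follows. The n_rot value and the 45-entry SWA pattern are illustrative placeholders, not the real Step-3.5-Flash hyperparameters:

```python
# Two ways to obtain the per-layer RoPE rotation length discussed above.
# Placeholder hyperparameters, NOT the real Step-3.5-Flash config.
n_rot = 64                                              # placeholder hparams.n_rot
swa_pattern = [False, True, True, True] * 11 + [False]  # per-layer is_swa, 45 entries

# Removed mainline approach: a fully general per-layer array
# (step35.rope.dimension_count_per_layer in GGUF metadata).
n_rot_per_layer = [n_rot if is_swa else n_rot // 2 for is_swa in swa_pattern]

# New mainline approach: derive it on the fly from the SWA flag,
# mirroring n_rot_l = is_swa ? hparams.n_rot : (hparams.n_rot / 2).
def n_rot_l(il):
    return n_rot if swa_pattern[il] else n_rot // 2

# For any fixed halving rule the two agree; the array is strictly
# more general, since it can encode per-layer values the rule cannot.
assert all(n_rot_l(il) == n_rot_per_layer[il] for il in range(len(swa_pattern)))
print(n_rot_per_layer[:5])   # [32, 64, 64, 64, 32]
```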

@saood06
Collaborator

saood06 commented Feb 6, 2026

Seed-OSS support merge
It's an underrated gem I'd completely missed.

I agree, I was really surprised nobody ever opened an issue requesting support or just added it. GLM 4.7 Flash reminded me that I like having a model fully offloaded (to my single 3090), but it left a bad taste in my mouth with its quality. I remembered Seed had come out and had caught my curiosity, but I had never tried it, which is why I added support (with the only downside being the limited context window due to not using MLA) so that I could give it a try.

There are other architectures that aren't here yet that I have on my radar to add.
