Where is the excitement?
Thank you so much! Tested with opencode + 7x3090 and it looks good. Noob question: I see a drop in token generation speed at > 50k tokens (compared to Minimax); is this due to sm graph not being supported?
OK, performance drop mystery solved: for this model and sufficiently long context, the FA calculation results in NaNs due to overflow.
It is now the second model where we observe the FA overflow, so I guess I'll change the FA offset.
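For background (this is not the actual `ik_llama.cpp` kernel, just an illustration of the failure mode): flash attention accumulates `exp(logit - offset)` terms, and with a poorly chosen offset, large logits overflow the floating-point range, after which the normalization produces Inf/NaN. A minimal Python sketch of the same numerical issue and the max-subtraction fix:

```python
import math

def softmax_naive(xs):
    # exponentiates the raw logits; overflows for large values
    es = [math.exp(x) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def softmax_stable(xs, offset=None):
    # subtract the max (or a caller-chosen offset) before exponentiating,
    # so every exp() argument is <= 0 and cannot overflow
    m = max(xs) if offset is None else offset
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

logits = [100.0, 500.0, 800.0]

overflowed = False
try:
    softmax_naive(logits)   # math.exp(800.0) overflows even in float64
except OverflowError:
    overflowed = True

probs = softmax_stable(logits)  # well-defined; the largest logit dominates
```

In fp16 the usable range is far smaller (`exp()` already overflows around x > 11), which is why an attention kernel accumulating in half precision is much more sensitive to the choice of offset than this float64 toy.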
Thanks for adding initial support for this one and all the work you've been doing lately, ik! I'm slowly catching up after a rough couple of weeks of life, and sorry I missed everyone at FOSDEM26. I'm looking into quantizing this Step3.5-Flash model now. Two things came up, as the unmerged mainline implementation is still in flux in their PR ggml-org/llama.cpp#19283 (comment). Specifically:
I'll hold off on releasing ik quants today to avoid potential future incompatibilities.
I've been (quietly) excited by a lot of the things you've been doing in this project for a while now! Off the top of my head, recently: -> The Seed-OSS support merge + graph parallel is really awesome; I've been using that model since you added these. Thank you for accepting my control-vector API PR; I wasn't sure if it was too niche a feature.
Thank you for letting me know about these incompatibilities. I used this model for development and testing. I see they have uploaded a new version; not sure I want to re-download before the dust has settled. But I did check the latest version of the mainline PR.
These all seem to be very minor cosmetic changes that can be easily handled. Maybe I'll push a PR later today.
I agree, I was really surprised nobody ever opened an issue requesting support, or just added it. GLM 4.7 Flash reminded me that I like having a fully offloaded (to my single 3090) model to use, but it left a bad taste in my mouth with its quality. I remembered that Seed had come out and had caught my curiosity, but I had never tried it, which is why I added support (the only downside being the limited context window due to not using MLA) so that I could give it a try. There are other architectures that aren't here yet that I have on my radar to add.
This PR adds support for the Step-3.5-Flash model.
I'm observing very peculiar PP performance, and I wasn't able to figure out the root cause. The graph shows the effect on an 8x3090 system with full offload. The strange drop in performance somewhere between 16k and 20k context may have something to do with GPU cache sizes, but the jump in performance for `u_batch = 512` seems really mysterious. I'll keep trying to sort out what is going on, but in the meantime I'm putting the PR out there for testing.
Update: See this comment for the explanation of the observed "mystery".
Caveat: `ik_llama.cpp` does not implement KV cache size reduction for SWA models. Step-3.5-Flash is a heavy SWA user, with 3 out of 4 layers using SWA with a window size of 512. Hence, the KV cache in `ik_llama.cpp` will be significantly larger than in mainline. On the other hand, one does not need to worry about KV snapshots to be able to rewind/reuse the KV cache.

Closes #1230
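To make the caveat concrete, here is a back-of-the-envelope sketch of the cache-size difference. The layer count, context size, and layer pattern below are made-up illustration values, not Step-3.5-Flash's real config; the only facts taken from the PR are the 3-out-of-4 SWA ratio and the 512-token window:

```python
def kv_cache_tokens(n_layers, swa_every, window, n_ctx, swa_reduction):
    """Total cached tokens summed across layers.

    Layers whose index is not a multiple of `swa_every` use sliding-window
    attention; with SWA cache reduction those layers only keep `window`
    tokens, otherwise every layer stores the full context.
    (The layer pattern here is an assumption for illustration.)
    """
    total = 0
    for layer in range(n_layers):
        is_swa = layer % swa_every != 0   # 3 out of 4 layers when swa_every=4
        if is_swa and swa_reduction:
            total += min(window, n_ctx)
        else:
            total += n_ctx
    return total

# hypothetical config: 32 layers, 512-token window, 64k context
full    = kv_cache_tokens(32, 4, 512, 65536, swa_reduction=False)
reduced = kv_cache_tokens(32, 4, 512, 65536, swa_reduction=True)
# without reduction the total cache is roughly 3.9x larger at this context
```

The saving grows with context: the SWA layers' cost is constant at `window` tokens under reduction, while the full-cache version scales linearly in every layer, which is why the gap matters most for long-context use.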