
Step-3.5: llama.cpp compatibility changes #1240

Merged

ikawrakow merged 2 commits into main from ik/step35_compat on Feb 7, 2026

Conversation

@ikawrakow
Owner

Since I implemented Step-3.5-Flash support in #1231, there have been changes to the metadata present in the Step-3.5 GGUFs, so apparently ik_llama.cpp no longer works with the latest Step-3.5-Flash GGUFs (see here and here).

Mainline devs have never had inhibitions about making people re-download hundreds of gigabytes of models for the sake of changing a few metadata key-value pairs. I, on the other hand, do mind downloading 111 GB (the size of the Q4_K_S model) just to test that this PR works with the latest and greatest Step-3.5-Flash GGUFs. Hence, I would appreciate it if someone who has already downloaded the latest version could confirm that the PR works.

cc: @ubergarm

@leflakk

leflakk commented Feb 6, 2026

I downloaded the Q4_K_S files yesterday (after the file name changes from part-xxx to .gguf); the initial support worked, and this PR works as well.

@ubergarm
Contributor

ubergarm commented Feb 6, 2026

Thanks!

With this PR I'm running the BF16 that was converted yesterday with mainline (using the int32 array instead of the bool array).

Given that that PR has seen some more activity, I'll re-convert again now just to be sure and report back. With luck I'll have some quants released before too long.

@Nexesenex
Contributor

Nexesenex commented Feb 6, 2026

Sadly, the server doesn't start once the model is loaded. I already had this problem with Minimax 2.x.

I'm stuck here, on both llama-server and llama-perplexity.

llama_kv_cache_init: CUDA_Split KV buffer size =    90.07 MiB
llama_kv_cache_init: KV cache size per device:
    Device 0:  45.25 MiB
    Device 1:  44.75 MiB
llama_new_context_with_model: KV self size  =   90.00 MiB, K (f16):   45.00 MiB, V (f16):   45.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   263.75 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =    92.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    93.01 MiB
llama_new_context_with_model: graph nodes  = 3849
llama_new_context_with_model: graph splits = 325
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
XXXXXXXX Split Mode Graph Scheduling is FORCED despite tensor overrides due to user choice.
XXXXXXXX It may or might NOT infer properly due to unsupported combinations between SMGS and every possible tensor overrides.

I'm using this GGUF model, the latest version uploaded (https://huggingface.co/stepfun-ai/Step-3.5-Flash-Int4/commit/e982fcf54de9b898e652a0a1cf902aef9115a40f): stepfun-ai/Step-3.5-Flash-Int4

Please tell me if I'm using an incorrect one. (Note: of course, I'm using this PR.)

@ubergarm
Contributor

ubergarm commented Feb 6, 2026

Okay, I re-converted using mainline PR19283 (pull/19283/head @ 402fc2e4e), plus casting step35.attention.sliding_window_pattern to [INT32], since for some reason it defaults to [BOOL] for me otherwise.
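
(For illustration only: the actual cast lives in the Python converter, but writing the same metadata as an [INT32] array with ggml's C gguf API would look roughly like the sketch below. The key name is from the logs; the pattern values and their meaning are assumptions.)

// sketch: emit step35.attention.sliding_window_pattern as [INT32] rather than
// [BOOL]; the real fix was a cast in mainline's convert_hf_to_gguf.py
#include "ggml.h"  // gguf API lives in ggml.h in this tree (gguf.h in newer ggml)
#include <cstdint>

int main() {
    struct gguf_context * ctx = gguf_init_empty();
    // assumed meaning, matching the logged pattern: 0 = full attention, 1 = SWA
    const int32_t pattern[] = {0, 1, 1, 1, 0, 1, 1, 1};
    gguf_set_arr_data(ctx, "step35.attention.sliding_window_pattern",
                      GGUF_TYPE_INT32, pattern, 8);
    gguf_write_to_file(ctx, "meta-only.gguf", /*only_meta =*/ true);
    gguf_free(ctx);
    return 0;
}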

Using this PR, I can run imatrix against the bf16, and I can run inference on the pure q8_0 quantized with ik/step35_compat@9a0b5e80.

However, when actually testing the pure Q8_0, it seems to get stuck in a thinking loop. Testing the same Q8_0 with mainline seems to work okay and returns the final answer quickly. I'm running both compiled CPU-only, using llama-server's built-in web interface.

👈 Details and Screenshot
$ cd ik_llama.cpp
$ ./build/bin/llama-server --version
version: 4183 (9a0b5e80)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

$ numactl -N "$SOCKET" -m "$SOCKET" \
./build/bin/llama-server \
    --model "$model"\
    --alias ubergarm/Step-3.5-Flash \
    --ctx-size 65536 \
    -ctk q8_0 -ctv q8_0 \
    -ub 4096 -b 4096 \
    --parallel 1 \
    --threads 96 \
    --threads-batch 128 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap \
    --jinja \
    --validate-quants
(screenshot: Step-3.5-Flash Q8_0 stuck in a thinking loop)

I'll do some more testing and look closer at the debug logs to see if something obvious is happening. Going to hold off on releasing the imatrix.dat file until I get it straightened out.

Thanks!

UPDATE: I tried the latest stepfun-ai/Step-3.5-Flash-Int4 that @Nexesenex used, and that seems to be working fine for me running CPU-only.

I do see a few differences in the metadata between the two; here is a non-exhaustive table of observed differences:

| Field | Broken Q8_0 | Working Official |
|---|---|---|
| general.file_type | u32: 7 | u32: 14 |
| step35.rope.freq_base_swa | f32: 10000.000000 | |
| step35.rope.freq_base_per_layer | | arr[f32,45]: [5000000.000000, 10000.000000, 10000....] |
| step35.rope.dimension_count_per_layer | | arr[i32,45]: [64, 128, 128, 128, 64, 128, 128, 128...] |
| step35.swiglu_clamp_exp | | arr[f32,45]: [0.000000, 0.000000, 0.000000, 0.0000...] |
| step35.swiglu_clamp_shexp | | arr[f32,45]: [0.000000, 0.000000, 0.000000, 0.0000...] |
| imatrix used? | no | yes (wiki.256.raw) |
| chat template | slightly different, but I tried using the official one with --chat-template-file | |
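
(For anyone wanting to produce such a diff themselves, here's a minimal sketch using ggml's C gguf API; the key names come from the table above, everything else is illustrative.)

// dump type/shape of selected metadata keys from a GGUF file
#include "ggml.h"  // gguf API lives in ggml.h in this tree (gguf.h in newer ggml)
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }
    struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) { fprintf(stderr, "failed to open %s\n", argv[1]); return 1; }
    const char * keys[] = {
        "general.file_type",
        "step35.rope.freq_base_swa",
        "step35.rope.freq_base_per_layer",
        "step35.attention.sliding_window_pattern",
    };
    for (const char * key : keys) {
        const int id = gguf_find_key(ctx, key);
        if (id < 0) { printf("%-45s (missing)\n", key); continue; }
        const enum gguf_type t = gguf_get_kv_type(ctx, id);
        if (t == GGUF_TYPE_ARRAY) {
            printf("%-45s arr[%s,%d]\n", key,
                   gguf_type_name(gguf_get_arr_type(ctx, id)), (int) gguf_get_arr_n(ctx, id));
        } else {
            printf("%-45s %s\n", key, gguf_type_name(t));
        }
    }
    gguf_free(ctx);
    return 0;
}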

@ikawrakow
Owner Author

Yes, there is definitely something wrong still. The model I have counts just fine in French:

> Count from 1 to 10 in French

<think>
Okay, the user asked to count from 1 to 10 in French. This is a straightforward request with a clear answer. I should list the numbers in order with their French spellings and pronunciations. 

I recall that French numbers have some unique pronunciations, so including phonetic approximations in parentheses would be helpful. The user might be learning French, so accuracy matters. 

Let me verify each number: un, deux, trois, quatre, cinq, six, sept, huit, neuf, dix. Yes, that's correct. The pronunciations: "œn" for un, "dø" for deux, etc. I should note that the final consonants in some numbers are silent, like in "trois" and "six". 

The response should be simple and direct, without extra explanation unless needed. Since the user didn't ask for anything else, just the list is sufficient.
</think>
1. un (œn)  
2. deux (dø)  
3. trois (trwa)  
4. quatre (katr)  
5. cinq (sank)  
6. six (sis)  
7. sept (set)  
8. huit (ɥit)  
9. neuf (nœf)  
10. dix (dis)<|im_end|>

@CISC
Contributor

CISC commented Feb 6, 2026

> Yes, there is definitely something wrong still.

It's because you have not set hparams.rope_freq_base_train_swa correctly.

@ubergarm
Contributor

ubergarm commented Feb 6, 2026

I uploaded a mainline compatible IQ4_XS 100.53 GiB (4.38 BPW) test quant (without imatrix) here: https://huggingface.co/ubergarm/Step-3.5-Flash-GGUF/tree/main/IQ4_XS

Converted from the original safetensors using the final closed version of the mainline PR. Quantized with ik_llama.cpp using the recipe given on the model card.

It seems to run okay on mainline, but I'm still having the issue here, so I'll hold off on making the imatrix and releasing ik quants for the night.

@Nexesenex
Contributor

Nexesenex commented Feb 7, 2026

@ubergarm:

On my side, I'm good to go.

> Count to 10 in French.

<think>Okay, the user asked to count to 10 in French. Let me start by recalling the numbers. Un, deux, trois... that's straightforward. But wait, should I include the pronunciation? The user didn't specify, but maybe they want the written form. Let me list them clearly. Also, make sure to spell them correctly. French numbers have some quirks like "quatre" and "huit." Double-check those. No need for extra explanation since the request is simple. Just provide the list as requested.</think>"Un, deux, trois, quatre, cinq, six, sept, huit, neuf, dix."

Correction from mainline added to my fork of IKL:
Nexesenex@f748b8a

Then, I edited this model: https://huggingface.co/stepfun-ai/Step-3.5-Flash-Int4/tree/main

With the GGUF editor here: https://huggingface.co/spaces/CISCai/gguf-editor

to add the missing key on the first GGUF split: step35.rope.freq_base_swa, FLOAT32, 10000.

I downloaded the file step3p5_flash_Q4_K_S-00001-of-00012.gguf that I had edited.

And it loads and counts properly.
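
(The same edit, sketched with ggml's C gguf API instead of the online editor. Illustrative only: this loads the whole split into RAM and rewrites it, which the gguf-editor avoids by patching the metadata in place.)

#include "ggml.h"  // gguf API lives in ggml.h in this tree (gguf.h in newer ggml)
#include <cstdio>

int main() {
    // load metadata plus tensor data so a complete patched copy can be written
    struct ggml_context * data_ctx = nullptr;
    struct gguf_init_params params = { /*.no_alloc =*/ false, /*.ctx =*/ &data_ctx };
    struct gguf_context * in = gguf_init_from_file("step3p5_flash_Q4_K_S-00001-of-00012.gguf", params);
    if (!in) { fprintf(stderr, "failed to load split\n"); return 1; }

    struct gguf_context * out = gguf_init_empty();
    gguf_set_kv(out, in); // copy all existing key-value pairs
    // the key that was missing: step35.rope.freq_base_swa = 10000 (FLOAT32)
    gguf_set_val_f32(out, "step35.rope.freq_base_swa", 10000.0f);

    // re-register the tensors so their data is written back out unchanged
    for (int i = 0; i < gguf_get_n_tensors(in); ++i) {
        gguf_add_tensor(out, ggml_get_tensor(data_ctx, gguf_get_tensor_name(in, i)));
    }
    gguf_write_to_file(out, "step3p5_flash_Q4_K_S-00001-of-00012.patched.gguf", /*only_meta =*/ false);

    gguf_free(out);
    gguf_free(in);
    ggml_free(data_ctx);
    return 0;
}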

Now, with your IQ4_XS release:

<think>Hmm, the user asked to count to 10 in French. That's straightforward. I need to respond as the character, keeping in mind the guidelines: stay in character, be creative and verbose, but this is a simple request so maybe add some flavor.

The character is a director/actor, so I could make it playful or dramatic. Since it's just counting, I can do it plainly but with a touch of theatrical flair. Maybe start with a thought about the simplicity of the task, then deliver the numbers with some exaggerated pronunciation or commentary.

I should avoid overcomplicating since it's a basic request, but still adhere to the "brutally raw, unhinged" vibe if possible. Perhaps mock the simplicity while doing it. No need for OOC notes here unless I want to break character, but better stay in.

The response should be in paragraphs with actions and dialogue. I'll think of the character sighing at the mundane task, then counting with dramatic emphasis. Use French numbers correctly: un, deux, trois... up to dix.

Make sure to keep it engaging but not excessive. The user might be testing language skills or just messing around. Either way, deliver with character.</think>*Ah, a simple request. Counting. The bedrock of civilization, the first lesson in any schoolhouse from Paris to the farthest outpost of New France. My throat feels dry just thinking about the monotony of it. But the director must obey the script, even when the script is a child's alphabet.*

"Un," I enunciate, tapping my thumb against my index finger with the precision of a stage manager cueing a spotlight. "Deux." A pause for effect, my eyes rolling heavenward as if beseeching the muses for strength. "Trois. Quatre." The words tumble out, each one a brick in the wall of this absurd exercise. "Cinq. Six." My voice drops to a conspiratorial whisper. "Sept." I clear my throat dramatically. "Huit. Neuf." Finally, with the relief of a prisoner seeing the end of his sentence, I thrust a hand toward the imaginary audience. "Dix. There. The decimal system, laid bare. Are you satisfied? Can we move on to something with a little *drame*, perhaps? A soliloquy? Anything but this arithmetic drudgery."

xD

Here's my pre-release for Windows, with and without the fix: https://github.com/Nexesenex/ik_llama.cpp.nxs/releases/tag/PR1241%2B1202%2B1240%2B1243%2B1244

And thank you for the IQ4_XS GGUF!

@ikawrakow
Owner Author

@CISC

> Yes, there is definitely something wrong still.
>
> It's because you have not set hparams.rope_freq_base_train_swa correctly.

Ah, that's because you changed the per-layer array of rope frequencies present in the original PR and GGUF to a single value, and I had missed that change. I think the code related to this particular parameter is quite funny in llama.cpp, but thanks to your review the STEP35 arch has now been aligned with it.
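
(A minimal sketch of what the fix amounts to, with assumed names rather than the actual diff: read the new scalar key when present, and fall back to the 10000 value the Step-3.5 GGUFs carry when only the old per-layer array exists.)

#include "ggml.h"  // gguf API

// assumed helper; the real change lives in the STEP35 hparams loading
static float step35_rope_freq_base_swa(const struct gguf_context * ctx) {
    const int id = gguf_find_key(ctx, "step35.rope.freq_base_swa");
    // absent in older GGUFs, which stored a per-layer array instead
    return id >= 0 ? gguf_get_val_f32(ctx, id) : 10000.0f;
}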

@ikawrakow
Owner Author

The last change should address the issue observed by @ubergarm, so I'll just merge it. If there are still issues left, we can address them separately.

ikawrakow merged commit 90d7499 into main on Feb 7, 2026
@ubergarm
Contributor

ubergarm commented Feb 7, 2026

> The last change should address the issue observed by @ubergarm, so I'll just merge it.

Just tested my released IQ4_XS on the tip of main@82c4f273, compiled CPU-only. Looks like it is running well now in the llama-server web interface! Thanks!

I'll try to get some quants out tomorrow with imatrix and see how they look in quality and speed benchmarks.

@ubergarm
Contributor

ubergarm commented Feb 7, 2026

While searching along the PPL curve and releasing quants this morning, I decided to benchmark the updated "official" stepfun-ai/Step-3.5-Flash-Int4 against my releases...

They are apparently now using a bool array for the sliding-window metadata:

llama_model_loader: - kv  16:    step35.attention.sliding_window_pattern arr[bool,45]     = [false, true, true, true, false, true...

This throws an error when loading here on ik:

llama_model_load: error loading model: error loading model hyperparameters: step35.attention.sliding_window_pattern is not a float32, int32 array
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/mnt/data/models/stepfun-ai/Step-3.5-Flash-Int4/step3p5_flash_Q4_K_S-00001-of-00012.gguf'
main: error: unable to load model

I had patched the mainline convert_hf_to_gguf.py to cast it explicitly to int, so it would be an int32 array for my quants, and those work fine.

Anyway, sorry to say, but it seems bool-array sliding-window GGUFs are in the wild now.

This is the mainline PR adding support for that: ggml-org/llama.cpp#18850

@ikawrakow
Owner Author

@ubergarm

Does #1252 solve it?

@ubergarm
Contributor

ubergarm commented Feb 7, 2026

@ikawrakow

I'll give your patch a try now. I just vibe coded (with kimi-k2.5-q4_x) this patch, which seems to be working so far:

diff --git a/src/llama-model-loader.cpp b/src/llama-model-loader.cpp
index 20416f79..ebd49011 100644
--- a/src/llama-model-loader.cpp
+++ b/src/llama-model-loader.cpp
@@ -579,19 +579,28 @@ bool llama_model_loader::get_arr(const std::string & key, std::array<T, N_MAX> &
         GGUFMeta::GKV<GGUFMeta::ArrayInfo>::get_kv(meta, kid);
 
     switch (arr_info.gt) {
-        case GGUF_TYPE_FLOAT32: GGML_ASSERT((std::is_same<T, float>::value)); break;
+        case GGUF_TYPE_BOOL:
         case GGUF_TYPE_INT32:   GGML_ASSERT(
                                         (std::is_same<T,  int32_t>::value) ||
-                                        (std::is_same<T, uint32_t>::value));  break;
+                                        (std::is_same<T, uint32_t>::value) ||
+                                        (std::is_same<T,     bool>::value));  break;
+        case GGUF_TYPE_FLOAT32: GGML_ASSERT((std::is_same<T, float>::value)); break;
         default:
-                                throw std::runtime_error(format("%s is not a float32, int32 array", key.c_str()));
+                                throw std::runtime_error(format("%s is not a float32, int32 or bool array", key.c_str()));
     }
 
     if (arr_info.length > N_MAX) {
         throw std::runtime_error(format("array length %u for key %s exceeds max %u", (uint32_t) arr_info.length, key.c_str(), (uint32_t) N_MAX));
     }
 
-    std::copy((const T*)arr_info.data, (const T *)arr_info.data + arr_info.length, result.begin());
+    if (arr_info.gt == GGUF_TYPE_BOOL) {
+        // bool arrays are stored as 1-byte bools, convert to target type T
+        std::transform((const bool *)arr_info.data, (const bool *)arr_info.data + arr_info.length, result.begin(), [](bool x) {
+            return static_cast<T>(x);
+        });
+    } else {
+        std::copy((const T*)arr_info.data, (const T *)arr_info.data + arr_info.length, result.begin());
+    }
 
     return true;
 }

Pulling yours and testing now.
