Step-3.5: llama.cpp compatibility changes #1240
Conversation
I downloaded the Q4_K_S files yesterday (after the file naming changed from part-xxx to gguf); the initial support worked and this PR works as well.
Thanks! With this PR I'm running the BF16 that was converted yesterday from mainline (with the int32 array instead of the bool array). Given that PR has had some more action, I'll re-convert again now just to be sure and report back. With luck I'll have some quants released before too long.
Sadly, the server doesn't start once the model is loaded. I already had this problem with Minimax 2.x. I'm stuck here, on both llama-server and llama-perplexity. I'm using this gguf model, the latest version uploaded with ( https://huggingface.co/stepfun-ai/Step-3.5-Flash-Int4/commit/e982fcf54de9b898e652a0a1cf902aef9115a40f ): stepfun-ai/Step-3.5-Flash-Int4. Please tell me if I'm using an incorrect one. (Note: of course, I'm using this PR.)
Okay, I re-converted using mainline. Using this PR, I can run imatrix against the bf16 and can inference on the pure q8_0 quantized with … However, actually testing the pure …

Details and Screenshot:

```shell
$ cd ik_llama.cpp
$ ./build/bin/llama-server --version
version: 4183 (9a0b5e80)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
$ numactl -N "$SOCKET" -m "$SOCKET" \
    ./build/bin/llama-server \
      --model "$model" \
      --alias ubergarm/Step-3.5-Flash \
      --ctx-size 65536 \
      -ctk q8_0 -ctv q8_0 \
      -ub 4096 -b 4096 \
      --parallel 1 \
      --threads 96 \
      --threads-batch 128 \
      --numa numactl \
      --host 127.0.0.1 \
      --port 8080 \
      --no-mmap \
      --jinja \
      --validate-quants
```
I'll do some more testing and look closer at the debug logs to see if something obvious is happening. Going to hold off on releasing the imatrix.dat file until I get it straightened out. Thanks!

UPDATE: I tried the latest … I do see a few differences in the metadata between the two, and added a non-exhaustive table of observed differences:
Yes, there is definitely something wrong still. The model I have counts just fine in French:
It's because you have not set …
I uploaded a mainline-compatible … Converted from the original safetensors using the final closed version of the mainline PR. Quantized with ik_llama.cpp using the recipe given on the model card. It seems to run okay on mainline, but I'm still having the issue here, so I'll hold off on making the imatrix and releasing ik quants for the night.
On my side, I'm good to go. The correction from mainline has been added to my fork of IKL. Then I edited this model: https://huggingface.co/stepfun-ai/Step-3.5-Flash-Int4/tree/main with the GGUF editor here: https://huggingface.co/spaces/CISCai/gguf-editor to add the missing key on the first GGUF split: `step35.rope.freq_base_swa`, FLOAT32, 10000. I downloaded the file step3p5_flash_Q4_K_S-00001-of-00012.gguf that I edited, and it loads and counts properly. Now, with your IQ4_XS release: xD Here's my pre-release for Windows, with and without the fix: https://github.com/Nexesenex/ik_llama.cpp.nxs/releases/tag/PR1241%2B1202%2B1240%2B1243%2B1244 And thank you for the IQ4_XS GGUF!
Ah, that's because you changed the array of rope frequencies per layer present in the original PR and GGUF to a single value, and I missed that change. I think the code related to this particular parameter is quite funny in …
The last change should address the issue observed by @ubergarm, so I'll just merge it. If there are still issues left, we can address them separately.
Just tested my released … I'll try to get some quants out tomorrow with imatrix and see how it is looking for quality and speed benchmarks.
While searching along the PPL curve and releasing quants this morning, I decided to benchmark the updated "official" stepfun-ai/Step-3.5-Flash-Int4 against my releases... They apparently are now getting a bool array for the sliding-window metadata, which throws an error loading here on ik. I had patched the mainline … Anyway, sorry to say, but it seems bool-array sliding-window GGUFs are in the wild now. This is the mainline PR adding support for that: ggml-org/llama.cpp#18850
I'll give your patch a try now. I just vibe coded (with kimi-k2.5-q4_x) this patch, which seems to be working currently:

```diff
diff --git a/src/llama-model-loader.cpp b/src/llama-model-loader.cpp
index 20416f79..ebd49011 100644
--- a/src/llama-model-loader.cpp
+++ b/src/llama-model-loader.cpp
@@ -579,19 +579,28 @@ bool llama_model_loader::get_arr(const std::string & key, std::array<T, N_MAX> &
         GGUFMeta::GKV<GGUFMeta::ArrayInfo>::get_kv(meta, kid);

     switch (arr_info.gt) {
-        case GGUF_TYPE_FLOAT32: GGML_ASSERT((std::is_same<T, float>::value)); break;
+        case GGUF_TYPE_BOOL:
         case GGUF_TYPE_INT32:   GGML_ASSERT(
                                         (std::is_same<T,  int32_t>::value) ||
-                                        (std::is_same<T, uint32_t>::value)); break;
+                                        (std::is_same<T, uint32_t>::value) ||
+                                        (std::is_same<T, bool>::value)); break;
+        case GGUF_TYPE_FLOAT32: GGML_ASSERT((std::is_same<T, float>::value)); break;
         default:
-            throw std::runtime_error(format("%s is not a float32, int32 array", key.c_str()));
+            throw std::runtime_error(format("%s is not a float32, int32 or bool array", key.c_str()));
     }

     if (arr_info.length > N_MAX) {
         throw std::runtime_error(format("array length %u for key %s exceeds max %u", (uint32_t) arr_info.length, key.c_str(), (uint32_t) N_MAX));
     }

-    std::copy((const T*)arr_info.data, (const T *)arr_info.data + arr_info.length, result.begin());
+    if (arr_info.gt == GGUF_TYPE_BOOL) {
+        // bool arrays are stored as 1-byte bools, convert to target type T
+        std::transform((const bool *)arr_info.data, (const bool *)arr_info.data + arr_info.length, result.begin(), [](bool x) {
+            return static_cast<T>(x);
+        });
+    } else {
+        std::copy((const T*)arr_info.data, (const T *)arr_info.data + arr_info.length, result.begin());
+    }

     return true;
 }
```

Pulling yours and testing now.

Since I implemented Step-3.5-Flash support in #1231, there have been changes to the metadata present in the Step-3.5 GGUFs, so apparently ik_llama.cpp no longer works with the latest Step-3.5-Flash GGUFs (see here and here). Mainline devs have never had inhibitions about making people re-download hundreds of gigabytes of models for the sake of changing a few metadata key-value pairs. Me, on the other hand, I do mind downloading 111 GB (the size of the Q4_K_S model) just to test that this PR works with the latest and greatest Step-3.5-Flash GGUFs. Hence, I would appreciate it if someone who has already downloaded the latest version confirms that the PR works.

cc: @ubergarm