Conversation
Thireus
commented
Jul 29, 2025
- I have read the contributing guidelines
- Self-reported review complexity:
  - Low
  - Medium
  - High
Manually-specified variables were not used by the project:
GGML_BACKEND_DL
GGML_CPU
(╯°□°)╯︵ ┻━┻
…F16=1 -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
Bump to latest ik_llama.cpp
Bump ik_llama to 8ffad18
Check if ffn_up and ffn_gate are of the same type before using fmoe
This reverts commit ff0c368.
|
Did this work? Was the main issue the fact that this PR pulls in unrelated stuff? |
|
@saood06, there are more changes to be made, and this was meant to be a pull request on my fork until tested and working. |
|
Thanks for looking into this one, as I've heard some good early reports from folks at the TheBeaverAIClub discord. I saw one tip on the mainline llama.cpp PR linked to the SmallThinker PR, where possibly converting the safetensors to |
|
@ubergarm @saood06 - I have implemented the llama.cpp PR here: https://github.com/Thireus/ik_llama.cpp/tree/glm-4.5 but it has a known issue: it starts talking nonsense after some time. See: https://github.com/Thireus/ik_llama.cpp/releases/tag/glm-4.5-b4021-83d2bb3 |
|
Thanks, I'm going to hold off on GLM4.5 for now given the mainline lcpp PR still seems to be having issues. In the meantime you can keep busy with these: https://www.deepcogito.com/research/cogito-v2-preview 😆 😭 so many models this week!!! |
|
@ubergarm - yeah, I'm going to hold off for now; there are too many models and I still haven't finished calibrating the ones I already sharded. I got myself 2x RTX PRO 6000, so I already have new toys to play with. :D |
|
@ubergarm, @saood06 - I think it's working fine actually. The issue may have been my broken prompt. Could you guys check in the web UI once the download has finished? Make sure you use https://github.com/Thireus/ik_llama.cpp/tree/83d2bb3e2d8a0a630f77b225515a52c48d4fe16b (there is also a release build for Windows). Example of output: Answer: Also tested on coding abilities and I'm seeing no issues so far. More examples: ggml-org/llama.cpp#14939 (comment) |
|
@ubergarm, would you be able to produce an imatrix for https://huggingface.co/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/tree/main please (not the Q8_0) 🙏🙏🙏? |
|
Hi @Thireus, I notice you're using |
Hi @ddh0, that's a command line parameter I often use when dealing with DeepSeek models and forgot to remove; nevertheless, you will see it gets disabled when you launch llama-server: |
|
It would be nice to get confirmation from someone else that I'm not hallucinating anything here and that the model indeed produces good answers. |
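If it helps with cross-checking, here is a minimal sketch of how someone else could hit the same server and compare answers, assuming a llama-server instance listening on the default 127.0.0.1:8080; the prompt and sampling settings are just examples, not what was used in the reports above:
#!/usr/bin/env bash
# Send one chat request to a running llama-server and print the raw JSON response.
# Host, port, prompt and sampling parameters are illustrative placeholders.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Write a short bash snippet that counts the lines in every .txt file in the current directory."}],
        "temperature": 0.6,
        "max_tokens": 256
      }'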
|
Thank you very much. I am testing it now with the Q6_K from your example. No broken responses so far with a few prompts in the llama.cpp webui and in Cline with create, edit and command-line use. |
|
I see some more chatter on the mainline lcpp PR linking a Hugging Face issue regarding . Does that mean your ? But you have already done that step and uploaded converted GGUF BF16s here: https://huggingface.co/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/tree/main which can be quantized and run using your fork? So you want me to download that BF16 GGUF and run imatrix on it (without converting to Q8_0 first), if I understand correctly? Just catching up slowly and drinking my coffee, so much action this week lol. |
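For the download step itself, a minimal sketch with huggingface-cli, assuming the split files can simply be pulled into one local folder (the target path is only an example):
#!/usr/bin/env bash
# Fetch the BF16 SPECIAL_SPLIT repo from Hugging Face; the local directory is an example path.
pip install -U "huggingface_hub[cli]"
huggingface-cli download Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT \
  --local-dir /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT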
|
@ubergarm, yes please use the BF16 I have uploaded for computing the imatrix. You will need to use ulimit -n 9999 to lift your OS's maximum open files limit, in the same terminal where you run the llama command that computes the imatrix. Thank you so much! |
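Since ulimit only affects the shell it is set in, a minimal sketch of the sequence looks like this; the model and corpus paths here are placeholders, not the exact command used later in the thread:
#!/usr/bin/env bash
ulimit -n        # print the current soft limit on open file descriptors
ulimit -Hn       # print the hard limit; the soft limit cannot be raised above this
ulimit -n 9999   # raise the soft limit for this shell and its child processes
# The llama command must run from this same shell so it inherits the new limit.
./build/bin/llama-imatrix -m /path/to/GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01762.gguf -f calibration.txt -o imatrix.dat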
|
Okay, I'm downloading your BF16 and will try to get your fork going. Is there a different PR to add this architecture here, given this one seems closed? |
|
I'll need to open a new PR with clean code containing only the changes relevant to this model. |
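Not sure how you plan to split it, but one possible way to carve the model-relevant changes onto a fresh branch is sketched below; the branch names and file paths are placeholders, not the actual layout of the fork:
#!/usr/bin/env bash
# Start a clean branch from the upstream default branch and copy over only the relevant files.
git fetch origin
git checkout -b glm-4.5-clean origin/main
# Pull just the GLM-4.5 changes from the working branch (paths are examples).
git checkout glm-4.5 -- src/llama.cpp gguf-py/
git commit -m "Add GLM-4.5 support"
That keeps unrelated commits (version bumps, reverts) off the new PR.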
|
Okay, so I have a Q8_0 up and running now for testing with some folks, here is how I got there:
git clone git@github.com:Thireus/ik_llama.cpp.git
cd ik_llama.cpp
git checkout 83d2bb3e2d8a0a630f77b225515a52c48d4fe16b
# compile as usual, CPU-only in my case
# quantize a Q8_0 from this BF16 GGUF: https://huggingface.co/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/tree/main
$ cat myscripts/quantize-GLM-4.5-v01.sh
#!/usr/bin/env bash
ulimit -n 9999
#numactl -N 0 -m 0 \
# --pure quantizes every tensor to the target type; positional args: input GGUF, output GGUF, quant type, threads
./build/bin/llama-quantize \
--pure \
/mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01762.gguf \
/mnt/raid/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-Thireus-Q8_0.gguf \
Q8_0 \
192
# run llama-server
$ cat myscripts/api-server-GLM-4.5.sh
#!/usr/bin/env bash
ulimit -n 9999
model=/mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-Thireus-Q8_0.gguf
numactl -N 0 -m 0 \
./build/bin/llama-server \
--model "$model"\
--alias Thireus/GLM-4.5-Thireus-Q8_0.gguf \
--ctx-size 196608 \
-fa -fmoe \
-ctk q8_0 -ctv q8_0 \
-ub 4096 -b 4096 \
--parallel 3 \
--threads 128 \
--threads-batch 192 \
--numa numactl \
--host 127.0.0.1 \
--port 8080 \
--no-mmap
Next I'll try to make imatrix with this from the bf16 GGUF directly:
#!/usr/bin/env bash
ulimit -n 9999
# echo 0 | sudo tee /proc/sys/kernel/numa_balancing
# sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
model=/mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01762.gguf
#Only the best for Thireus, don't use Q8_0 haha
#model=/mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-Thireus-Q8_0.gguf
numactl -N 1 -m 1 \
./build/bin/llama-imatrix \
-m "$model" \
-f ubergarm-imatrix-calibration-corpus-v02.txt \
-o /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat \
--verbosity 1 \
--layer-similarity \
--seed 1337 \
--ctx-size 512 \
-ub 4096 -b 4096 \
--numa numactl \
--threads 128 \
--threads-batch 192 \
--no-mmap
...
save_imatrix: entry ' blk.48.ffn_up_exps.weight' has partial data (98.75%) 2 out of 160 experts are missing data Storing **but be aware**
save_imatrix: warning: storing only 1000 out of 1012 entries
save_imatrix: stored collected data after 10 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[10]37447.5358,[11]38232.5777,[12]39631.3289,[13]41582.0199,[14]44141.9806,[15]43651.3243,
EDIT: hrmm those perplexities are looking super high for the imatrix... I'll restart it and try again using |
|
Yes, I don't know what is up with llama-perplexity; I've seen the same. Could be fa, please let us know. |
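Whether flash attention was actually in play for the imatrix run isn't clear from the script above, but a quick A/B on the same few chunks would settle it. A sketch, reusing the model and corpus paths from that script (output paths are just examples):
#!/usr/bin/env bash
ulimit -n 9999
model=/mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01762.gguf
# Run 1: with flash attention enabled
./build/bin/llama-imatrix -m "$model" -f ubergarm-imatrix-calibration-corpus-v02.txt \
  -o /tmp/imatrix-fa.dat --ctx-size 512 -fa --threads 128
# Run 2: identical except flash attention is left off; compare the per-chunk PPL lines of both runs
./build/bin/llama-imatrix -m "$model" -f ubergarm-imatrix-calibration-corpus-v02.txt \
  -o /tmp/imatrix-nofa.dat --ctx-size 512 --threads 128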
|
Yes. Did you remove ? I'll let this finish running and upload the resulting imatrix.dat, but I'm not releasing any quants until we have a PR that is looking good and some more testing. Thanks! |
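Once the imatrix.dat is up, the quantization command from earlier would presumably just gain an --imatrix argument, roughly like this; the output name and the IQ4_K target type are only examples, not a planned release:
#!/usr/bin/env bash
ulimit -n 9999
# Quantize from the BF16 split using the computed imatrix; output path and quant type are examples.
./build/bin/llama-quantize \
  --imatrix /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat \
  /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01762.gguf \
  /mnt/raid/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-Thireus-IQ4_K.gguf \
  IQ4_K \
  192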
|
Yeah I think the issue regarding the ik_llama.cpp fork version is that we want to remove the Vcur reshaping in llama.cpp -> build_glm4_moe() to fix the non
// reshape for multi-head
Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens);
Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);
Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head_kv, n_tokens); // <--- delete this line
This has nothing to do with any issues possibly still going on with mainline, and I'm not really sure what is different between your implementation and mainline's yet either. |
|
Thank you, I'll give it a go; you are far more knowledgeable than me in this domain, as I don't know what Vcur is. Edit: I can confirm this fix works. |
Suggested by @ubergarm - ikawrakow#662 (comment)
|
I suggest we move this conversation over to #668 |