Conversation
Thireus
commented
Jul 29, 2025
- I have read the contributing guidelines
- Self-reported review complexity:
  - Low
  - Medium
  - High
Manually-specified variables were not used by the project:
GGML_BACKEND_DL
GGML_CPU
(╯°□°)╯︵ ┻━┻
…F16=1 -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
Bump to latest ik_llama.cpp
Bump ik_llama to 8ffad18
Check if ffn_up and ffn_gate are of the same type before using fmoe
This reverts commit ff0c368.
|
Did this work? Was the main issue the fact that this PR pulls in unrelated stuff? |
|
@saood06, there are more changes to be made, and this was meant to be a pull request on my fork until tested and working. |
|
Thanks for looking into this one, as I've heard some good early reports from folks at the TheBeaverAIClub discord. I saw one tip on the mainline llama.cpp PR linked to the SmallThinker PR, where possibly converting the safetensors to |
|
@ubergarm @saood06 - I have implemented the llama.cpp PR here: https://github.com/Thireus/ik_llama.cpp/tree/glm-4.5 but it has a known issue: it starts talking nonsense after some time. See: https://github.com/Thireus/ik_llama.cpp/releases/tag/glm-4.5-b4021-83d2bb3 |
|
Thanks, I'm going to hold off on GLM4.5 for now given the mainline lcpp PR still seems to be having issues. In the meantime you can keep busy with these: https://www.deepcogito.com/research/cogito-v2-preview 😆 😭 so many models this week!!! |
|
@ubergarm - yeah, I'm going to hold off for now; there are too many models and I still haven't finished calibrating the ones I already sharded. I got myself 2x RTX PRO 6000, so I already have new toys to play with. :D |
|
@ubergarm, @saood06 - I think it's working fine actually. The issue may have been my broken prompt. Could you guys check in the web UI once the download has finished? Make sure you use https://github.com/Thireus/ik_llama.cpp/tree/83d2bb3e2d8a0a630f77b225515a52c48d4fe16b (there is also a release build for Windows). Example of output: Answer: Also tested on coding abilities and I'm seeing no issues so far. More examples: ggml-org/llama.cpp#14939 (comment) |
|
@ubergarm, would you be able to produce an imatrix for https://huggingface.co/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/tree/main please (not the Q8_0) 🙏🙏🙏? |
|
Hi @Thireus, I notice you're using |
Hi @ddh0, that's a command line parameter I often use when dealing with DeepSeek models and forgot to remove; nevertheless, you will see it gets disabled when you launch llama-server: |
|
It would be nice to get confirmation from someone else that I'm not hallucinating anything here and that the model indeed produces good answers. |
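If it helps with cross-checking, here is a minimal sketch of how someone else could hit the same server and compare answers, assuming a llama-server instance listening on the default 127.0.0.1:8080; the prompt and sampling settings are just examples, not what was used in the reports above:
#!/usr/bin/env bash
# Send one chat request to a running llama-server and print the raw JSON response.
# Host, port, prompt and sampling parameters are illustrative placeholders.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Write a short bash snippet that counts the lines in every .txt file in the current directory."}],
        "temperature": 0.6,
        "max_tokens": 256
      }'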
|
Thank you very much. I am testing it now with the Q6_K from your example. No broken responses so far with a few prompts in the llama.cpp webui and in Cline with create, edit and command-line use. |
|
I see some more chatter on the mainline lcpp PR linking a Hugging Face issue regarding . Does that mean your ? But you have already done that step and uploaded converted GGUF BF16s here: https://huggingface.co/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/tree/main which can be quantized and run using your fork? So you want me to download that BF16 GGUF and run imatrix on it (without converting to Q8_0 first), if I understand correctly? Just catching up slowly and drinking my coffee, so much action this week lol. |
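For the download step itself, a minimal sketch with huggingface-cli, assuming the split files can simply be pulled into one local folder (the target path is only an example):
#!/usr/bin/env bash
# Fetch the BF16 SPECIAL_SPLIT repo from Hugging Face; the local directory is an example path.
pip install -U "huggingface_hub[cli]"
huggingface-cli download Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT \
  --local-dir /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT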
|
@ubergarm, yes please use the BF16 I have uploaded for computing the imatrix. You will need to use ulimit -n 9999 to lift your OS's maximum open files limit, in the same terminal where you run the llama command that computes the imatrix. Thank you so much! |
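Since ulimit only affects the shell it is set in, a minimal sketch of the sequence looks like this; the model and corpus paths here are placeholders, not the exact command used later in the thread:
#!/usr/bin/env bash
ulimit -n        # print the current soft limit on open file descriptors
ulimit -Hn       # print the hard limit; the soft limit cannot be raised above this
ulimit -n 9999   # raise the soft limit for this shell and its child processes
# The llama command must run from this same shell so it inherits the new limit.
./build/bin/llama-imatrix -m /path/to/GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01762.gguf -f calibration.txt -o imatrix.dat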
|
Okay, I'm downloading your BF16 and will try to get your fork going. Is there a different PR to add this architecture here, given this one seems closed? |
|
I'll need to open a new PR with clean code containing only the changes relevant to this model. |
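Not sure how you plan to split it, but one possible way to carve the model-relevant changes onto a fresh branch is sketched below; the branch names and file paths are placeholders, not the actual layout of the fork:
#!/usr/bin/env bash
# Start a clean branch from the upstream default branch and copy over only the relevant files.
git fetch origin
git checkout -b glm-4.5-clean origin/main
# Pull just the GLM-4.5 changes from the working branch (paths are examples).
git checkout glm-4.5 -- src/llama.cpp gguf-py/
git commit -m "Add GLM-4.5 support"
That keeps unrelated commits (version bumps, reverts) off the new PR.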
|
Okay, so I have a Q8_0 up and running now for testing with some folks, here is how I got there:
git clone git@github.com:Thireus/ik_llama.cpp.git
cd ik_llama.cpp
git checkout 83d2bb3e2d8a0a630f77b225515a52c48d4fe16b
# compile as usual, CPU-only in my case
# quantize a Q8_0 from this BF16 GGUF: https://huggingface.co/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/tree/main
$ cat myscripts/quantize-GLM-4.5-v01.sh
#!/usr/bin/env bash
ulimit -n 9999
#numactl -N 0 -m 0 \
# --pure quantizes every tensor to the target type; positional args: input GGUF, output GGUF, quant type, threads
./build/bin/llama-quantize \
--pure \
/mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01762.gguf \
/mnt/raid/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-Thireus-Q8_0.gguf \
Q8_0 \
192
# run llama-server
$ cat myscripts/api-server-GLM-4.5.sh
#!/usr/bin/env bash
ulimit -n 9999
model=/mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-Thireus-Q8_0.gguf
numactl -N 0 -m 0 \
./build/bin/llama-server \
--model "$model"\
--alias Thireus/GLM-4.5-Thireus-Q8_0.gguf \
--ctx-size 196608 \
-fa -fmoe \
-ctk q8_0 -ctv q8_0 \
-ub 4096 -b 4096 \
--parallel 3 \
--threads 128 \
--threads-batch 192 \
--numa numactl \
--host 127.0.0.1 \
--port 8080 \
--no-mmap
Next I'll try to make imatrix with this from the bf16 GGUF directly:
#!/usr/bin/env bash
ulimit -n 9999
# echo 0 | sudo tee /proc/sys/kernel/numa_balancing
# sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
model=/mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01762.gguf
#Only the best for Thireus, don't use Q8_0 haha
#model=/mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-Thireus-Q8_0.gguf
numactl -N 1 -m 1 \
./build/bin/llama-imatrix \
-m "$model" \
-f ubergarm-imatrix-calibration-corpus-v02.txt \
-o /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat \
--verbosity 1 \
--layer-similarity \
--seed 1337 \
--ctx-size 512 \
-ub 4096 -b 4096 \
--numa numactl \
--threads 128 \
--threads-batch 192 \
--no-mmap
...
save_imatrix: entry ' blk.48.ffn_up_exps.weight' has partial data (98.75%) 2 out of 160 experts are missing data Storing **but be aware**
save_imatrix: warning: storing only 1000 out of 1012 entries
save_imatrix: stored collected data after 10 chunks in /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat
[10]37447.5358,[11]38232.5777,[12]39631.3289,[13]41582.0199,[14]44141.9806,[15]43651.3243,
EDIT: hrmm those perplexities are looking super high for the imatrix... I'll restart it and try again using |
|
Yes, I don't know what is up with llama-perplexity; I've seen the same. Could be fa, please let us know. |
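Whether flash attention was actually in play for the imatrix run isn't clear from the script above, but a quick A/B on the same few chunks would settle it. A sketch, reusing the model and corpus paths from that script (output paths are just examples):
#!/usr/bin/env bash
ulimit -n 9999
model=/mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01762.gguf
# Run 1: with flash attention enabled
./build/bin/llama-imatrix -m "$model" -f ubergarm-imatrix-calibration-corpus-v02.txt \
  -o /tmp/imatrix-fa.dat --ctx-size 512 -fa --threads 128
# Run 2: identical except flash attention is left off; compare the per-chunk PPL lines of both runs
./build/bin/llama-imatrix -m "$model" -f ubergarm-imatrix-calibration-corpus-v02.txt \
  -o /tmp/imatrix-nofa.dat --ctx-size 512 --threads 128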
|
Yes. Did you remove ? I'll let this finish running and upload the resulting imatrix.dat, but I'm not releasing any quants until we have a PR that is looking good and some more testing. Thanks! |
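Once the imatrix.dat is up, the quantization command from earlier would presumably just gain an --imatrix argument, roughly like this; the output name and the IQ4_K target type are only examples, not a planned release:
#!/usr/bin/env bash
ulimit -n 9999
# Quantize from the BF16 split using the computed imatrix; output path and quant type are examples.
./build/bin/llama-quantize \
  --imatrix /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/imatrix-GLM-4.5-BF16.dat \
  /mnt/data/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-00001-of-01762.gguf \
  /mnt/raid/models/Thireus/GLM-4.5-THIREUS-BF16-SPECIAL_SPLIT/GLM-4.5-Thireus-IQ4_K.gguf \
  IQ4_K \
  192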
|
Yeah I think the issue regarding the ik_llama.cpp fork version is that we want to remove the Vcur reshaping in llama.cpp -> build_glm4_moe() to fix the non
// reshape for multi-head
Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens);
Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);
Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head_kv, n_tokens); // <--- delete this line
This has nothing to do with any issues possibly still going on with mainline, and I'm not really sure what is different between your implementation and mainline's yet either. |
|
Thank you, I'll give it a go; you are far more knowledgeable than me in this domain, as I don't know what Vcur is. Edit: I can confirm this fix works. |
Suggested by @ubergarm - ikawrakow#662 (comment)
|
I suggest we move this conversation over to #668 |