Add MTP decoding support for GLM-4.x MoE#1270
Conversation
Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`. Changes include:
- Updating `llama_batch` to support `mtp_params`.
- Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft).
- Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`).
- Adapting the embedding extraction logic to skip MTP update passes.
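As a rough sketch of the shape of these changes: only `llama_set_draft_input_hidden_state` is named above, so the enum and field names below are hypothetical stand-ins, not the PR's actual definitions.

```cpp
#include <cassert>

// Hypothetical sketch of the MTP batch extension described above.
// The real names in the PR may differ; this only illustrates the idea
// of tagging a batch with the kind of MTP pass being run.
enum class mtp_op {
    NONE,    // regular decode, no MTP involvement
    WARMUP,  // populate the MTP layer's state for the prompt
    UPDATE,  // refresh MTP state after accepted tokens
    DRAFT,   // run the MTP head to propose draft tokens
};

struct mtp_params {
    mtp_op op = mtp_op::NONE;
};

// The embedding-extraction logic is adapted to skip MTP update passes,
// since those runs only maintain draft state and produce no
// user-visible embeddings.
static bool should_extract_embeddings(const mtp_params & p) {
    return p.op != mtp_op::UPDATE;
}
```

The point of the flag is that `llama_decode_internal` can branch on a single field instead of threading separate code paths for warmup, update, and draft runs through the decoder.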
…or Draft Model).
Even if it is, it has always been basically useless, see #199.
Sorry for the delay reviewing this. I have been focusing on Qwen3-Next optimizations in the last few days. Is there a GLM-4.5-Air model I can download for testing that does not have crippled MTP tensors included? (Normally they are either excluded or quantized with very low-bpw quants to reduce model size.)
Unsloth and Ubergarm preserve the MTP layer, but it is quantized at the same level as the base model. I typically use IQ4_XS from Unsloth so I can test in both projects (meaning the MTP layer is also at the Q4 level), but I believe using GLM-4.5-Air-Q8_0 would be better for testing.
The PR LGTM. Can you resolve the conflicts? Thanks.

I played a bit with it. It looks like it is working, but performance goes down.
Sorry for the delay. The self-speculative logic changed quite a bit, so I took the opportunity to refactor MTP to fit the current
Definitely not. Upstream typically sees a 60-70% acceptance rate for a single token, and I am seeing a lower rate here. I have a strong suspicion about the cause (likely related to the embeddings input), but debugging it will take some time. It might require adjustments to how embeddings are extracted during

I doubt anyone would want to use MTP in its current state given the performance hit, but the architectural changes in this PR are solid. Feel free to merge if you want the structural support in place while I investigate the acceptance rate issue in a follow-up PR.
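For intuition on why a lower acceptance rate can turn the extra MTP pass into a net slowdown, here is a back-of-the-envelope throughput model. It assumes each drafted token is accepted independently with probability `p` (acceptance stops at the first rejection) and that one draft pass costs a fraction `c` of a full target pass; these are simplifying assumptions, not measurements from this PR, and real batching behavior will differ.

```cpp
// Expected number of accepted draft tokens when drafting k tokens,
// each accepted with probability p, stopping at the first rejection:
// E = p + p^2 + ... + p^k.
static double expected_accepted(double p, int k) {
    double e = 0.0, pp = 1.0;
    for (int i = 0; i < k; ++i) {
        pp *= p;
        e  += pp;
    }
    return e;
}

// Tokens generated per unit of target-model compute, relative to plain
// decoding. Each speculative step emits 1 + E(accepted) tokens and
// costs one target pass plus k draft passes at relative cost c each.
static double relative_speed(double p, int k, double c) {
    return (1.0 + expected_accepted(p, k)) / (1.0 + k * c);
}
```

With `k = 2` drafts, dropping `p` from ~0.65 to ~0.40 shrinks the expected bonus tokens from about 1.07 to 0.56 per step, so even a modest per-draft cost can push `relative_speed` below 1.0.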
|
I see this warning when building:

Well, yes. One shouldn't bitwise-or two unrelated things.
Funnily enough, this came from the Kimi port; the logic there is the same, so I'm curious whether the warning is simply being ignored/suppressed upstream. Let's not complicate this: I pushed a commit to fix the warning and also to fix another problem with the mtp param in common that would crash the
ikawrakow
left a comment
@firecoperana Do you want to look at it?
This reverts commit fd5c6a4.
…1270)"" This reverts commit 70833dd4316b55937c88647a0ffd68901012dc1c.
This is a follow-up to discussion #1228 that allows using MTP for the GLM 4.5/4.6/4.7 family. This PR focuses primarily on the implementation logic, as performance currently regresses somewhat due to the need to run an additional small layer as a secondary model.
Currently it supports four arguments:

- `-mtp` or `--multi-token-prediction`
- `-no-mtp` or `--no-multi-token-prediction`
- `--draft-max` and `--draft-p-min` (same as used for speculative decoding)

It took some time to understand the differences between upstream and ik_llama; most of the effort went into adapting the KV cache logic for MTP. The most notable difference is the acceptance rate, which ranges from 30-50% here, compared to 50-70% upstream. I don't know exactly what is causing the degradation, so I need to investigate further.
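For illustration, the flags would be combined like this (the binary name and model path are placeholders, not taken from the PR):

```shell
# Hypothetical invocation enabling MTP self-speculation;
# adjust binary, model path, and other options to your setup.
./llama-server -m GLM-4.5-Air-Q8_0.gguf \
    -mtp \
    --draft-max 2 \
    --draft-p-min 0.85
```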
As small examples, this is what I get using hybrid spec with layer mode in creative tasks, where MTP hits its worst-case scenario:
1) GLM 4.5 Air IQ4_XS
a) Without mtp
b) -mtp --draft-max 1 --draft-p-min 0.85
c) -mtp --draft-max 2 --draft-p-min 0.85
2) GLM 4.7 IQ2_KS
a) Without mtp
b) -mtp --draft-max 1 --draft-p-min 0.85
c) -mtp --draft-max 2 --draft-p-min 0.85
d) -mtp --draft-max 3 --draft-p-min 0.85
Some observations: first, in 4.5 Air I noticed that `blk.46.nextn.embed_tokens.weight` is constantly being copied between backends. I don't know exactly how to fix this; I suspected `cb()` could handle it, but that appears incorrect.

Also, is `system_prompt_update` in `server-context.cpp` still used? I couldn't trigger it, so I'm unsure whether it needs a small update for MTP to run there as well.