support GLM-4.5V and GLM-4.1V vision models #16600
base: master
Conversation
need `clip.vision.rope.freq_base` for GLM-4.5V
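For reference, emitting such a per-model key is straightforward with the gguf-py writer. A minimal sketch follows; the key name comes from the comment above, while the output path, `arch` string, and the `10000.0` value are placeholders, not GLM-4.5V's actual configuration or the real conversion-script plumbing:

```python
# Minimal sketch: writing a custom vision RoPE base into GGUF metadata with gguf-py.
# The file name, arch string, and value are placeholders, not GLM-4.5V's real config.
import gguf

writer = gguf.GGUFWriter("mmproj-placeholder.gguf", arch="clip")
writer.add_float32("clip.vision.rope.freq_base", 10000.0)  # placeholder value
writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.close()
```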
So, it turns out that vision in this model is based on Qwen3-VL, which still needs support from llama.cpp. I am pretty familiar with llama.cpp in general but not with […]. I also just saw this thread (#16207), in which someone has posted a patch to get Qwen3-VL kinda-sorta working in llama.cpp. I will take a look at that too and see if it is helpful - it might make more sense to get Qwen3-VL to a working state in llama.cpp first and only then start working on this PR on top of that. Not sure, just thinking out loud.
Thanks for your work! @ddh0
Thank you @rujialiu! I suspect your understanding of the […]. Also cc @ngxson (llama vision expert :))
I have 0 understanding of […].
@ddh0 I asked Claude Sonnet 4.5 to carefully inspect […].

It's so similar to Qwen2.5-VL - but then why does the code re-use qwen3_vl_moe? It's because Qwen2.5-VL doesn't have an MoE version 😄 So I guess it's ok to resume the work directly, based on https://github.com/FMayran/llama.cpp/tree/QwenVL-causal-fix. It should be easy to adapt to whatever "llama_batch improvement" is merged into […].

BTW: can we make sure the dense version (GLM-4.1V-9B-Thinking, #14495) is working first? It's much smaller, so it's easier to compare results with […].
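To make the Qwen2.5-VL similarity concrete, here is a rough sketch of how that family's M-RoPE assigns (temporal, height, width) position ids to text tokens versus image tokens. The offsets and bookkeeping are simplified assumptions and may differ from the actual GLM-4.5V / Qwen2.5-VL implementations; this only illustrates the idea:

```python
# Illustrative only: Qwen2.5-VL-style M-RoPE position ids for text tokens followed by
# one image of grid_h x grid_w patches. Offset handling is simplified and may differ
# from the real HF code for GLM-4.5V / Qwen2.5-VL.
import torch

def mrope_position_ids(text_len: int, grid_h: int, grid_w: int) -> torch.Tensor:
    """Return a (3, seq_len) tensor of [temporal, height, width] position ids."""
    # Text tokens: all three components advance together, like ordinary 1-D RoPE.
    text_ids = torch.arange(text_len).unsqueeze(0).expand(3, -1)

    # Image tokens: temporal stays constant; height/width follow the patch grid.
    t = torch.full((grid_h * grid_w,), text_len)
    h = torch.arange(grid_h).repeat_interleave(grid_w) + text_len
    w = torch.arange(grid_w).repeat(grid_h) + text_len
    image_ids = torch.stack([t, h, w])

    return torch.cat([text_ids, image_ids], dim=1)

print(mrope_position_ids(text_len=2, grid_h=2, grid_w=3))
```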
Thank you @rujialiu, that's all very helpful. I will take a look at GLM-4.1V-9B-Thinking and see if it can be incorporated into this PR. Is there a PR associated with the branch you linked (QwenVL-causal-fix)?
Of course! Hopefully @ngxson will find some time to fix the general problem (adding an internal token index for the causal check). Since you're familiar with the LLM part, you can take a look at our discussion in #15474 (the quickest way is to read it in bottom-up order until you understand). The issue and solution are conceptually very simple, but I'm not brave/skillful enough to touch […].
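The problem discussed in #15474 can be restated with a toy example (this is an illustration, not llama.cpp's actual logic): with M-RoPE, position values can repeat or jump for image tokens, so a causal mask derived from RoPE positions no longer matches one derived from a strictly increasing internal token index.

```python
# Toy illustration (not llama.cpp code): a causal mask built from M-RoPE positions can
# disagree with one built from a strictly increasing token index, because positions may
# repeat for image tokens. Whether image tokens *should* see each other is a separate
# modelling choice; the point is only that the two orderings differ.
import torch

def mask_from(order: torch.Tensor) -> torch.Tensor:
    """mask[i, j] is True when token i is allowed to attend to token j."""
    return order.unsqueeze(1) >= order.unsqueeze(0)

rope_pos    = torch.tensor([0, 1, 2, 2])  # two text tokens, then two image tokens sharing a position
token_index = torch.tensor([0, 1, 2, 3])  # internal index never repeats

print(mask_from(rope_pos))     # the two image tokens can attend to each other in both directions
print(mask_from(token_index))  # strictly causal: token 2 cannot see token 3
```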
Now there is: #16745
multimodal projector is identical between the models
Add support for zai-org/GLM-4.5V and zai-org/GLM-4.1V-9B-Thinking vision models to llama.cpp. I currently only plan to support images + text, no video inputs in this PR.
The architecture is `Glm4vMoeForConditionalGeneration` (`"model_type": "glm4v_moe"`) / `Glm4vForConditionalGeneration` (`"model_type": "glm4v"`). Internally, these consist of an LLM (text model) and a ViT (vision adapter / multimodal projector):

- LLM
  - lives under `model.language_model.`
  - `apply_multimodal_rotary_pos_emb`: applies rotary embeddings across temporal, height, and width dimensions for visual tokens
- ViT
  - `Aimv2VisionModel`-based, lives under `model.visual.`
  - `Glm4vMoeVisionEmbeddings` module to handle varied image resolutions (see the sketch at the end of this description)
  - vision-side rotary embeddings (`apply_rotary_pos_emb_vision`)

Other notes:
References:
See also:
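Regarding the varied-image-resolution handling mentioned in the ViT list above: a common approach in HF vision towers is to interpolate the learned position-embedding table to the incoming patch grid. Whether `Glm4vMoeVisionEmbeddings` does exactly this is an assumption; the sketch below only illustrates the general technique, with made-up grid sizes and embedding width.

```python
# Sketch of resizing a learned ViT position-embedding table to an arbitrary patch grid
# via bicubic interpolation. The 24x24 training grid and 1152-dim width are made up;
# this is not claimed to be the exact Glm4vMoeVisionEmbeddings implementation.
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, new_h: int, new_w: int) -> torch.Tensor:
    """pos_embed: (old_h * old_w, dim) table learned on a square grid."""
    num_patches, dim = pos_embed.shape
    old = int(num_patches ** 0.5)
    grid = pos_embed.reshape(1, old, old, dim).permute(0, 3, 1, 2)   # (1, dim, old, old)
    grid = F.interpolate(grid, size=(new_h, new_w), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(new_h * new_w, dim)

table = torch.randn(24 * 24, 1152)
print(resize_pos_embed(table, 16, 40).shape)  # torch.Size([640, 1152])
```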