mtmd: add GLM4V multimodal model with conversion support #17998
eelbaz wants to merge 1 commit into ggml-org:master from eelbaz:glm4v-complete
Conversation
Adds complete support for GLM-4.6V-Flash and related models, including runtime inference and HuggingFace-to-GGUF conversion scripts.

Architecture:
- Vision encoder with dual Conv2D patch embedding and M-RoPE
- GLM4-based LLM with M-RoPE position encoding
- 2x2 patch merger with SwiGLU FFN
- Reuses existing ggml_rope_multi() infrastructure

Conversion support:
- GLM4VisionModel class for vision encoder conversion
- Handles Conv3D to Conv2D split for patch embeddings
- Lazy tensor evaluation and all GLM4V-specific tensors

Testing (zai-org/GLM-4.6V-Flash):
- Text model: 18.8GB, 523 tensors (bf16)
- Vision encoder: 1.7GB, 182 tensors (f16)
- Inference: correct image descriptions

Peer-coded with Claude for debugging.
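The "2x2 patch merger with SwiGLU FFN" from the architecture list can be illustrated with a minimal NumPy sketch: adjacent 2x2 groups of vision-encoder patch embeddings are concatenated into one token and projected through a SwiGLU FFN. All shapes, layer sizes, and weight names below are illustrative assumptions, not the actual GLM4V implementation.

```python
# Illustrative sketch only; dimensions and weights are made-up toy values.
import numpy as np

def silu(x):
    # SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def merge_2x2(patches, grid_h, grid_w):
    # Concatenate each 2x2 block of patch embeddings into one token:
    # (grid_h * grid_w, d) -> (grid_h//2 * grid_w//2, 4*d)
    d = patches.shape[-1]
    x = patches.reshape(grid_h // 2, 2, grid_w // 2, 2, d)
    x = x.transpose(0, 2, 1, 3, 4)  # group each 2x2 neighborhood together
    return x.reshape(-1, 4 * d)

rng = np.random.default_rng(0)
dim, hidden = 8, 16                          # toy sizes
patches = rng.standard_normal((4 * 4, dim))  # a 4x4 grid of patch embeddings

merged = merge_2x2(patches, 4, 4)            # (4, 32): 4 merged tokens
w_gate = rng.standard_normal((4 * dim, hidden))
w_up   = rng.standard_normal((4 * dim, hidden))
w_down = rng.standard_normal((hidden, dim))

# SwiGLU FFN: down( silu(x @ gate) * (x @ up) ), projecting merged
# tokens back down to the embedding width
out = (silu(merged @ w_gate) * (merged @ w_up)) @ w_down
print(out.shape)  # (4, 8)
```

The merge reduces the token count by 4x before the tokens enter the language model, which is why the merger's FFN input width is 4x the patch embedding width.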
In my testing, GLM-4.6V-Flash cannot correctly recognize image content (specifically text in images). The model on the official website recognizes the same images correctly.
@IIIIIllllIIIIIlllll - this is the output I got when testing:

> Got it, let's analyze the image. The user is asking what we see in this famous painting. Firstly recognize that "Mona Lisa" by Leonardo da Vinci—so key elements: a woman (the Mona Lisa) with long brown hair styled down over her shoulders; she has hands crossed at waist level or lower? Wait, looking closely—the pose of the figure. The background is an outdoor landscape scene in soft blues and greens. So describe it: This image depicts Leonardo da Vinci's famous painting "Mona Lisa" (also known as La Gioconda). In this artwork:....

With text: here's the output for a text image (https://i.sstatic.net/IvV2y.png)

Can you share the command you're testing with? I want to see if I can replicate.
I copied your command and tested it again, but the result was the same.
This PR won't work as-is because it misses many small things. I'm superseding it with #18042.

Adds complete GLM-4.6V-Flash support, including runtime inference and conversion scripts.
(This pull is a working, take-it-or-leave-it implementation for GLM-4.6V models until official support is provided by the maintainer.)
Usage
Convert from HuggingFace:
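The conversion step above was left without a concrete command. A plausible sketch, assuming this PR wires GLM4V into llama.cpp's standard `convert_hf_to_gguf.py` flow (the script name and the `--mmproj`/`--outfile` flags exist in mainline llama.cpp; the local model path is a placeholder, and this PR's actual conversion entry point may differ):

```shell
# Assumes the HF checkpoint has been downloaded to ./GLM-4.6V-Flash
# Text model -> model.gguf
python convert_hf_to_gguf.py ./GLM-4.6V-Flash --outfile model.gguf

# Vision encoder -> mmproj.gguf (mainline uses --mmproj for mtmd projectors)
python convert_hf_to_gguf.py ./GLM-4.6V-Flash --mmproj --outfile mmproj.gguf
```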
Run inference:
```shell
llama-mtmd-cli -m model.gguf --mmproj mmproj.gguf \
  --ctx-size 256 --temp 0.8 --top-p 0.6 --top-k 1.1 \
  --repeat_penalty 1.9 -fa on --jinja \
  -p "In English Only: Describe what you see in the image." \
  --image image.jpg
```

Note: peer-coded with Claude for debugging.