Port mtmd from mainline + Qwen2/2.5-VL support by ikawrakow · Pull Request #798 · ikawrakow/ik_llama.cpp

ikawrakow · 2025-09-25T12:49:50Z

This PR is a port of mainline's mtmd library and multi-modal command line tool llama-mtmd-cli, along with an implementation of Qwen2/2.5-VL support.

Based on my own testing it seems fully functional.

Please test and provide feedback!

Original WIP description

This is WIP to port mtmd and mtmd-cli from mainline.

Current state: ~~compiles, but not functional (missing several ggml ops)~~

mtmd-related ops have been added to the CPU and CUDA back-ends
The mtmd library and mtmd-cli tools have been ported
Support for Qwen2/2.5-VL was added
Tests with the example image examples/mtmd/test-1.jpeg produce a meaningful response

Here is an example run with the CPU back-end:

./bin/llama-mtmd-cli -m Qwen2.5Vl-7.6B-Q9_0gguf --mmproj mmproj-qwen2.5vl --image ../examples/mtmd/test-1.jpeg -c 8192 -n 8192 -s 5678 -t 16 -fa -p " "
...
load_hparams: model size:         1291.40 MiB
load_hparams: metadata size:      0.18 MiB
alloc_compute_meta:        CPU compute buffer size =     3.60 MiB
encoding image slice...
image slice encoded in 2042 ms
decoding image batch 1/1, n_tokens_batch = 414
image decoded (batch 1/1) in 1210 ms
The image shows the front page of The New York Times from July 21, 1969. The headline reads
"MEN WALK ON MOON: ASTRONAUTS LAND ON PLAIN; COLLECT ROCKS, PLANT FLAG."
This historic front page announces the Apollo 11 moon landing, a significant event in human history.
The page includes a photograph of the lunar module and a report on the astronauts' activities on
the moon. The New York Times was one of the first major newspapers to cover this event, providing
detailed reports and images of the historic mission.

llama_print_timings:        load time =     766.57 ms
llama_print_timings:      sample time =       4.76 ms /   122 runs   (    0.04 ms per token, 25603.36 tokens per second)
llama_print_timings: prompt eval time =    3494.36 ms /   425 tokens (    8.22 ms per token,   121.62 tokens per second)
llama_print_timings:        eval time =   15474.56 ms /   121 runs   (  127.89 ms per token,     7.82 tokens per second)
llama_print_timings:       total time =   19437.51 ms /   546 tokens

The same thing with mainline:

load_hparams: model size:         1291.40 MiB
load_hparams: metadata size:      0.18 MiB
alloc_compute_meta:        CPU compute buffer size =     3.60 MiB
main: loading model: ../../ik_llama.cpp/ncuda/junk.bin
encoding image slice...
image slice encoded in 4553 ms
decoding image batch 1/1, n_tokens_batch = 414
image decoded (batch 1/1) in 2729 ms

The image shows the front page of The New York Times from July 21, 1969. The headline reads
"MEN WALK ON MOON" in large, bold letters, indicating a significant event. Below the headline,
there are two photographs, though the details of the images are not clear from this description.
The article discusses the historic landing of astronauts on the moon, mentioning that they landed
on a plain, collected rocks, and planted a flag. The article is dated Monday, July 21, 1969, which
corresponds to the day of the Apollo 11 moon landing. The newspaper is priced at 10 cents, which
was the cost of a New York Times in 1969.


llama_perf_context_print:        load time =     909.82 ms
llama_perf_context_print: prompt eval time =    7653.99 ms /   425 tokens (   18.01 ms per token,    55.53 tokens per second)
llama_perf_context_print:        eval time =   19592.87 ms /   153 runs   (  128.06 ms per token,     7.81 tokens per second)
llama_perf_context_print:       total time =   27709.67 ms /   578 tokens
llama_perf_context_print:    graphs reused =          0

Interesting to see that image encoding/decoding in ik_llama.cpp is about 2X the speed of mainline (2042 ms vs 4553 ms to encode, 1210 ms vs 2729 ms to decode) without me having done anything related to this part of the calculation.
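As a quick check on the "2X" figure, the ratios can be computed directly from the timings in the two logs above. This is only an arithmetic sketch: the dictionary keys and variable names are illustrative, and the numbers are copied from the log lines.

```python
# Timings in milliseconds, copied from the two logs above.
ik = {"image encode": 2042, "image decode": 1210, "prompt eval": 3494.36}
mainline = {"image encode": 4553, "image decode": 2729, "prompt eval": 7653.99}

# Speedup = mainline time / ik_llama.cpp time for the same stage.
speedup = {stage: mainline[stage] / ik[stage] for stage in ik}

for stage, ratio in speedup.items():
    print(f"{stage}: {ratio:.2f}x")
```

All three ratios come out a little above 2. Token generation, by contrast, is essentially identical in the two runs (127.89 ms vs 128.06 ms per token).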

TODO

Testing by others
~~The conversation mode does not seem to be working~~ Works, it was just a matter of LOG vs LOG_TEE
Implement more vision models - preferences?

More observations

Tested with a bunch of additional images
Seems to work fine on CUDA
Interestingly enough, given a portion of a photograph of my passport, it correctly recognizes my place of birth even though only 5 out of 12 letters are visible. It hallucinates my birth year, which is not visible on the photograph.
~~Something is not quite right on the CPU when the number of image tokens is larger than the u-batch size. For the passport photo it generates 1036 image tokens, and I get gibberish unless I set u-batch = batch = 2048. This is strange because if something were wrong with setting the token positions (which is done differently than usual for Qwen2-VL), it shouldn't be working on CUDA either, but it does.~~ Fixed with last commit
~~Something is not quite right also on CUDA. The first image works fine, but any attempt to add a second image to the conversation leads to gibberish.~~ Fixed with last commit
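To make the u-batch observation above concrete, here is a sketch of how a run of image tokens gets cut into micro-batches. The `split_into_ubatches` helper is hypothetical and not part of ik_llama.cpp, and the u-batch size of 512 is an assumption about the default. Only the token counts (414 and 1036) and the 2048 setting come from the runs described above.

```python
def split_into_ubatches(n_tokens: int, n_ubatch: int) -> list[int]:
    """Return the sizes of the micro-batches needed to process n_tokens."""
    if n_tokens < 0 or n_ubatch <= 0:
        raise ValueError("n_tokens must be >= 0 and n_ubatch must be > 0")
    full, rest = divmod(n_tokens, n_ubatch)
    return [n_ubatch] * full + ([rest] if rest else [])


ASSUMED_UBATCH = 512  # assumption: default u-batch size, not stated in this PR

for label, n_tokens in [("test-1.jpeg", 414), ("passport photo", 1036)]:
    for n_ubatch in (ASSUMED_UBATCH, 2048):
        chunks = split_into_ubatches(n_tokens, n_ubatch)
        print(f"{label}: {n_tokens} tokens, u-batch {n_ubatch} -> {chunks}")
```

Under these assumptions only the passport photo spans more than one micro-batch, which matches the case that produced gibberish on the CPU before the fix.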

Ph0rk0z · 2025-09-27T14:35:27Z

So does using it with the server require more implementations?

ikawrakow · 2025-09-27T14:40:09Z

Nothing has been done for the server. You have a command line tool. You can use it like this:

./bin/llama-mtmd-cli -m model --mmproj mmproj $other_args
...
> /image some_image.jpg
> Describe the image
...

(see the mtmd documentation)

For the server I'm hoping someone else will do a PR.

Ph0rk0z · 2025-09-27T16:44:19Z

Ok. That clears it up. I have still yet to use the command line tools :P

We have progress!

mcm007 · 2025-09-28T14:52:00Z

Impressive!

Tested: Qwen2.5-VL-7B, gemma-3-12b and pixtral-12b on two systems with CPU+iGPU.

All 3 models are working OK.
Vulkan build is working as well; encoding image slice... (with -ngl 0...) is 2x faster than the CPU build.
.png and .jpg are working.
mmproj-f16.gguf is accepted but not mmproj-Q8_0.gguf

version: 3899 (3d4977cb)

ikawrakow · 2025-09-28T17:11:53Z

> mmproj-f16.gguf is accepted but not mmproj-Q8_0.gguf

Does mainline support quantized mmproj files? To me it looked like the convolution and im2col kernels only work with f16/f32 tensors.

mcm007 · 2025-09-28T18:49:21Z

Yes, mmproj-Qwen2.5-VL-7B-Instruct-Q8_0.gguf produces good results in mainline. It seems to need --no-mmproj-offload, otherwise it produces random text each time.

ikawrakow mentioned this pull request Sep 25, 2025

Feature Request: Add vision / multi-modality support #792

Closed

4 tasks

Iwan Kawrakow added 16 commits September 26, 2025 10:00

Add mtmd: the beginning

7829a60

Add mtmd: mtmd.cpp compiles

31a9ddb

Add mtmd: clip initialization compiles

5913317

Add mtmd: clip.cpp compiles

6b0c8e0

Add mtmd: builds successfully

24618e3

Add CPU implementation for GGML_OP_GLU

c86dadd

Add CUDA implementation for GGML_OP_GLU

292c934

Add CPU implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW

8732eeb

Add CUDA implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW

879201c

Add mtmd: refresh CPU rope

293a59d

Add mtmd: refresh CUDA rope

ae9ac97

Add mtmd: add Qwen2-VL

f2a094d

Add mtmd: Qwen2.5-VL text seems to work with this change

933b99c

Add mtmd: fix swiglu

e7ddefc

Add mtmd: use LOG_TEE so generated tokens show up in terminal

042f595

Add mtmd: do not attempt to load a GPU backend if none are available

dbcc01b

ikawrakow force-pushed the ik/add_mtmd branch from 97fb051 to dbcc01b Compare September 26, 2025 07:02

Iwan Kawrakow added 5 commits September 26, 2025 11:20

GLU, not GPU

7e6a1fd

Fix typo

09b3381

Fix new/free mismatch

e3e572f

LOG stuff

f629952

Add mtmd: this fixes gibberish on second image

be7eb79

ikawrakow marked this pull request as ready for review September 26, 2025 15:34

ikawrakow changed the title ~~WIP: port mtmd from mainline~~ Port mtmd from mainline + Qwen2/2.5-VL support Sep 26, 2025

This was referenced Sep 26, 2025

Feature Request: add support for vision model InternVL3_5 #730

Open

Feature Request: port no-mmproj-offload #614

Closed

ikawrakow merged commit 87e4762 into main Sep 27, 2025

saood06 mentioned this pull request Feb 16, 2026

Feature Request: Support Kimi K2.5 #1264

Closed

4 tasks

ikawrakow merged 21 commits into main from ik/add_mtmd