
Implement LFM2.5-VL#105

Merged
JamePeng merged 4 commits into JamePeng:main from TAO71-AI:lfm2.5-vl
Apr 6, 2026
Conversation

@alcoftTAO

Implemented the chat handler for LFM2.5-VL.

Right now it may not work, since I'm getting this error: LFM25VLChatHandler(_init_mtmd_context): Failed to load mtmd context from: .../mmproj-LFM2.5-VL-1.6b-BF16.gguf.

@JamePeng
Owner

JamePeng commented Apr 4, 2026

Right now it may not work, since I'm getting this error: LFM25VLChatHandler(_init_mtmd_context): Failed to load mtmd context from: .../mmproj-LFM2.5-VL-1.6b-BF16.gguf.

Let me see what the problem is.

@JamePeng
Owner

JamePeng commented Apr 4, 2026

Works fine for me:

class LFM25VLChatHandler(MTMDChatHandler):
    # Aligned with LFM2.5-VL tokenizer_config
    LFM25VL_BOS_TOKEN = "<|startoftext|>"
    LFM25VL_EOS_TOKEN = "<|im_end|>"
    LFM25VL_PAD_TOKEN = "<|pad|>"

    # Image specific tokens
    LFM25VL_IMAGE_TOKEN = "<image>"
    LFM25VL_IMAGE_START_TOKEN = "<|image_start|>"
    LFM25VL_IMAGE_END_TOKEN = "<|image_end|>"
    LFM25VL_IMAGE_THUMBNAIL = "<|img_thumbnail|>"

    CHAT_FORMAT = (
        "{{- bos_token -}}\n"
        "{%- set keep_past_thinking = keep_past_thinking | default(false) -%}\n"
        "{%- set ns = namespace(system_prompt='', content='') -%}\n"
        "{%- if messages[0]['role'] == 'system' -%}\n"
        "    {%- set ns.system_prompt = messages[0]['content'] -%}\n"
        "    {%- set messages = messages[1:] -%}\n"
        "{%- endif -%}\n"
        "{%- if tools -%}\n"
        "    {%- set ns.system_prompt = ns.system_prompt + ('\\n' if ns.system_prompt else '') + 'List of tools: [' -%}\n"
        "    {%- for tool in tools -%}\n"
        "        {%- if tool is not string -%}\n"
        "            {%- set tool = tool | tojson -%}\n"
        "        {%- endif -%}\n"
        "        {%- set ns.system_prompt = ns.system_prompt + tool -%}\n"
        "        {%- if not loop.last -%}\n"
        "            {%- set ns.system_prompt = ns.system_prompt + ', ' -%}\n"
        "        {%- endif -%}\n"
        "    {%- endfor -%}\n"
        "    {%- set ns.system_prompt = ns.system_prompt + ']' -%}\n"
        "{%- endif -%}\n"
        "{%- if ns.system_prompt -%}\n"
        "    {{- '<|im_start|>system\\n' + ns.system_prompt + '<|im_end|>\\n' -}}\n"
        "{%- endif -%}\n"
        "{%- set ns.last_assistant_index = -1 -%}\n"
        "{%- for message in messages -%}\n"
        "    {%- if message['role'] == 'assistant' -%}\n"
        "        {%- set ns.last_assistant_index = loop.index0 -%}\n"
        "    {%- endif -%}\n"
        "{%- endfor -%}\n"
        "{%- for message in messages -%}\n"
        "    {{- '<|im_start|>' + message['role'] + '\\n' -}}\n"
        "    {%- set content = message['content'] -%}\n"
        "    {%- if content is not string -%}\n"
        "        {%- set ns.content = '' -%}\n"
        "        {#- MTMD-style Multimodal Injection (Audio stripped for VL model) -#}\n"
        "        {%- for item in content -%}\n"
        "            {%- if item['type'] == 'image_url' -%}\n"
        "                {%- set img_val = item['image_url'] if item['image_url'] is string else item['image_url']['url'] -%}\n"
        "                {%- set ns.content = ns.content + img_val -%}\n"
        "            {%- elif item['type'] == 'text' -%}\n"
        "                {%- set ns.content = ns.content + item['text'] -%}\n"
        "            {%- else -%}\n"
        "                {%- set ns.content = ns.content + (item | tojson) -%}\n"
        "            {%- endif -%}\n"
        "        {%- endfor -%}\n"
        "        {%- set content = ns.content -%}\n"
        "    {%- endif -%}\n"
        "    {%- if message['role'] == 'assistant' and not keep_past_thinking and loop.index0 != ns.last_assistant_index -%}\n"
        "        {%- if '</think>' in content -%}\n"
        "            {%- set content = content.split('</think>')[-1] | trim -%}\n"
        "        {%- endif -%}\n"
        "    {%- endif -%}\n"
        "    {{- content + '<|im_end|>\\n' -}}\n"
        "{%- endfor -%}\n"
        "{%- if add_generation_prompt -%}\n"
        "    {{- '<|im_start|>assistant\\n' -}}\n"
        "{%- endif -%}\n"
    )

    def __init__(self, keep_past_thinking: bool = False, **kwargs):
        self.keep_past_thinking = keep_past_thinking
        super().__init__(**kwargs)


    def __call__(self, **kwargs):
        self.extra_template_arguments["keep_past_thinking"] = self.keep_past_thinking

        kwargs['stop'] = [self.LFM25VL_EOS_TOKEN]

        if self.verbose:
            print(f"{self.log_prefix}(keep_past_thinking={self.keep_past_thinking}) - Start processing")
        return super().__call__(**kwargs)

However, I found that a 512x512 image is required for it to be recognized as a single image.
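The `</think>`-stripping behavior encoded in `CHAT_FORMAT` can be exercised in isolation. This is a minimal sketch using jinja2 with a trimmed-down template (no tools, system prompt, or image handling; the messages are made up), showing that only the final assistant turn keeps its reasoning block when `keep_past_thinking` is false:

```python
from jinja2 import Environment

# Trimmed sketch of the "strip <think> from past assistant turns" logic.
template_src = (
    "{%- set ns = namespace(last=-1) -%}"
    "{%- for m in messages -%}"
    "{%- if m['role'] == 'assistant' -%}{%- set ns.last = loop.index0 -%}{%- endif -%}"
    "{%- endfor -%}"
    "{%- for m in messages -%}"
    "{%- set content = m['content'] -%}"
    "{%- if m['role'] == 'assistant' and loop.index0 != ns.last and '</think>' in content -%}"
    "{%- set content = content.split('</think>')[-1] | trim -%}"
    "{%- endif -%}"
    "{{- '<|im_start|>' + m['role'] + '\\n' + content + '<|im_end|>\\n' -}}"
    "{%- endfor -%}"
)

out = Environment().from_string(template_src).render(messages=[
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "<think>scratch work</think>hello"},
    {"role": "user", "content": "and now?"},
    {"role": "assistant", "content": "<think>kept</think>bye"},
])
print(out)  # earlier assistant turn is reduced to "hello"; last turn keeps its <think> block
```

Rendering confirms the past assistant message loses its `<think>...</think>` prefix while the final one is passed through untouched.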

@JamePeng
Owner

JamePeng commented Apr 5, 2026

How are the LFM2.5VL chat template tests going? I'll probably start compiling Wheel after the upstream llama.cpp has a few more gemma4 fixes and stability commits merged.

@alcoftTAO
Author

I've updated the code, but I'm still getting the same error. However, with verbose=True I get the following information:

...
clip_model_loader: tensor[436]: n_dims = 2, name = v.blk.9.attn_q.weight, tensor_size=2654208, offset=850415040, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[437]: n_dims = 1, name = v.blk.9.attn_v.bias, tensor_size=4608, offset=853069248, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[438]: n_dims = 2, name = v.blk.9.attn_v.weight, tensor_size=2654208, offset=853073856, shape:[1152, 1152, 1, 1], type = bf16
clip_model_loader: tensor[439]: n_dims = 1, name = v.post_ln.bias, tensor_size=4608, offset=855728064, shape:[1152, 1, 1, 1], type = f32
clip_model_loader: tensor[440]: n_dims = 1, name = v.post_ln.weight, tensor_size=4608, offset=855732672, shape:[1152, 1, 1, 1], type = f32
clip_ctx: CLIP using CUDA0 backend
clip_init: failed to load model '../mmproj-LFM2.5-VL-1.6b-BF16.gguf': load_hparams: image_max_pixels (262144) is less than image_min_pixels (1048576)

mtmd_init_from_file: error: Failed to load CLIP model from ../mmproj-LFM2.5-VL-1.6b-BF16.gguf

I've checked my code and I've set image_min_tokens=1024 and image_max_tokens=-1.

@JamePeng
Owner

JamePeng commented Apr 5, 2026

Could it be a problem with your mmproj model?

@alcoftTAO
Author

alcoftTAO commented Apr 5, 2026

I've downloaded this one. Also tried this one.

However, everything works fine when image_min_tokens=256 or image_min_tokens=-1.

I've read the GGUF metadata from the mmproj and found that clip.vision.image_size is 256. I think that when image_max_tokens=-1, the effective value falls back to the clip.vision.image_size parameter from the metadata.

This could explain why I was having these errors, since I was setting image_min_tokens=1024.
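For what it's worth, the two pixel figures in the clip_init error line are consistent with a fixed conversion of 32x32 = 1024 pixels per image token. That factor is my inference from the logged numbers, not something verified against the mtmd source:

```python
# Back-of-the-envelope check of the load_hparams numbers in the log above.
# Assumption (mine): mtmd converts token budgets to pixel budgets at
# 32 * 32 = 1024 pixels per image token.
PIXELS_PER_TOKEN = 32 * 32           # hypothetical conversion factor

image_min_tokens = 1024              # the value I had set
default_max_tokens = 256             # effective ceiling when image_max_tokens=-1

image_min_pixels = image_min_tokens * PIXELS_PER_TOKEN    # matches 1048576 in the log
image_max_pixels = default_max_tokens * PIXELS_PER_TOKEN  # matches 262144 in the log

# min > max is exactly the condition clip_init rejects
print(image_min_pixels, image_max_pixels, image_min_pixels > image_max_pixels)
```

Under that assumption, setting image_min_tokens=1024 against a default 256-token maximum yields a minimum pixel budget above the maximum, which is the inconsistency load_hparams reports.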

@JamePeng
Owner

JamePeng commented Apr 5, 2026

[image] "min_image_tokens": 64, "max_image_tokens": 256

@alcoftTAO alcoftTAO marked this pull request as ready for review April 6, 2026 01:03
@JamePeng JamePeng merged commit 9e749f9 into JamePeng:main Apr 6, 2026
12 checks passed