feat: add multimodal image support for vision-capable LLMs #981
Closed
lcolok wants to merge 2 commits into sipeed:main
bd5cae9 to 8edf3ba
This was referenced Mar 3, 2026
Enable channels (e.g. Telegram) to pass user-attached images through to vision-capable LLMs. Previously, downloaded images were stored locally but never forwarded to the provider: the media pipeline was broken at the agent layer.

Changes:
- Add ImageURL type and ContentParts field to protocoltypes.Message
- New providers/media.go: LoadMediaAsContentParts converts local files to base64 data URLs (supports jpg/png/gif/webp, ≤5MB per image)
- Agent loop resolves media:// refs to local paths and passes them through processOptions to BuildMessages
- BuildMessages attaches ContentParts (text + image_url blocks) when media is present, preserving pure-text behavior when absent
- OpenAI compat: Content field becomes `any` (string or []contentPart) following the OpenAI Vision API spec
- Anthropic: ContentParts mapped to NewImageBlockBase64 via parseDataURL

Backward compatible: no existing behavior changes when no media is attached. The Content (string) field is always preserved as fallback.
862fe1a to 152529a
Summary
Enable channels (e.g. Telegram) to forward user-attached images to vision-capable LLMs. Previously, images were downloaded and stored locally but never passed to the provider: the media pipeline was broken at the agent layer (`msg.Media` was discarded, `BuildMessages` received a hardcoded `nil` for media, and `Message` had no image field).
Changes
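As a rough illustration of the OpenAI-compatible side of this change, the sketch below builds a user message and serializes its content either as a plain string or as an array of parts. The struct layouts and the helpers `buildUserMessage` and `toWire` are assumptions for illustration, not the PR's actual definitions. Only the `text` and `image_url` part shapes follow the OpenAI vision message format.

```go
package main

import "encoding/json"

// ImageURL and ContentPart follow the OpenAI vision content-part shape.
// The exact struct layout in the PR is not shown, so this is an assumption.
type ImageURL struct {
	URL string `json:"url"`
}

type ContentPart struct {
	Type     string    `json:"type"`
	Text     string    `json:"text,omitempty"`
	ImageURL *ImageURL `json:"image_url,omitempty"`
}

// Message is a simplified stand-in for the internal message type: the plain
// Content string is always set, and ContentParts is only set when media exists.
type Message struct {
	Role         string
	Content      string
	ContentParts []ContentPart
}

// wireMessage is what gets sent to the provider; Content is either a string
// or a []ContentPart.
type wireMessage struct {
	Role    string `json:"role"`
	Content any    `json:"content"`
}

// buildUserMessage (hypothetical helper) attaches a text part plus one
// image_url part per data URL, and leaves text-only messages untouched.
func buildUserMessage(text string, dataURLs []string) Message {
	msg := Message{Role: "user", Content: text}
	if len(dataURLs) == 0 {
		return msg
	}
	msg.ContentParts = append(msg.ContentParts, ContentPart{Type: "text", Text: text})
	for _, u := range dataURLs {
		msg.ContentParts = append(msg.ContentParts,
			ContentPart{Type: "image_url", ImageURL: &ImageURL{URL: u}})
	}
	return msg
}

// toWire (hypothetical helper) serializes ContentParts when present and
// falls back to the plain string otherwise.
func toWire(m Message) ([]byte, error) {
	w := wireMessage{Role: m.Role, Content: m.Content}
	if len(m.ContentParts) > 0 {
		w.Content = m.ContentParts
	}
	return json.Marshal(w)
}
```

In this sketch, `buildUserMessage("hi", nil)` serializes with a string `content`, so requests without attachments look the same on the wire as before. This is the backward-compatibility property the description claims.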
Design
Data Flow
Test Plan