
feat: add multimodal image support for vision-capable LLMs#981

Closed
lcolok wants to merge 2 commits into sipeed:main from lcolok:feat/multimodal-image-support

Conversation

lcolok commented Mar 2, 2026

Summary

Enable channels (e.g. Telegram) to forward user-attached images to vision-capable LLMs. Previously, images were downloaded and stored locally but never passed to the provider: the media pipeline was broken at the agent layer (`msg.Media` was discarded, `BuildMessages` was hard-coded to receive `nil` for media, and `Message` had no image field).

Changes

| File | Change |
| --- | --- |
| `protocoltypes/types.go` | Add `ImageURL` type, `ContentBlock.ImageURL` field, and `Message.ContentParts` field |
| `providers/media.go` | New file: `LoadMediaAsContentParts()` converts local images to base64 data URLs (jpg/png/gif/webp, ≤5 MB each) |
| `agent/loop.go` | Add `Media` to `processOptions`, resolve `media://` refs to local paths, pass to `BuildMessages` |
| `agent/context.go` | `BuildMessages` attaches `ContentParts` (text + image blocks) when media is present |
| `openai_compat/provider.go` | `Content` field becomes `any` (string or `[]contentPart`), following the OpenAI Vision API spec |
| `anthropic/provider.go` | Map `ContentParts` to `NewImageBlockBase64()` via `parseDataURL()` |
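To make the type extension concrete, here is a minimal sketch of what the extended `protocoltypes` shapes could look like. Field names come from the table above; the exact struct tags and layout in `protocoltypes/types.go` may differ.

```go
package main

import "fmt"

// ImageURL carries an image reference, here as a base64 data URL.
type ImageURL struct {
	URL string `json:"url"`
}

// ContentBlock is one part of a multimodal message: either text or an image.
type ContentBlock struct {
	Type     string    `json:"type"` // "text" or "image_url"
	Text     string    `json:"text,omitempty"`
	ImageURL *ImageURL `json:"image_url,omitempty"`
}

// Message keeps the plain-string Content as a fallback and adds
// ContentParts for vision-capable providers to serialize their own way.
type Message struct {
	Role         string         `json:"role"`
	Content      string         `json:"content"`
	ContentParts []ContentBlock `json:"-"`
}

func main() {
	msg := Message{
		Role:    "user",
		Content: "what is this?",
		ContentParts: []ContentBlock{
			{Type: "text", Text: "what is this?"},
			{Type: "image_url", ImageURL: &ImageURL{URL: "data:image/jpeg;base64,..."}},
		},
	}
	fmt.Println(len(msg.ContentParts)) // 2
}
```

Keeping `Content` as a plain string alongside `ContentParts` is what preserves backward compatibility: text-only providers keep reading `Content` and never see the new field.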

Design

  • Minimal & upstream-friendly: only extends existing types, no breaking changes
  • Follows the OpenAI Vision API spec: the `content` field is `string | []ContentPart`
  • Backward compatible: when no media is attached, all code paths are unchanged; `Content` (string) is always preserved as a fallback
  • Both providers supported: OpenAI-compatible (content array) and Anthropic (base64 image blocks)

Data Flow

Channel downloads image → MediaStore.Store() → media://ref
  → processMessage: Resolve(ref) → /tmp/xxx.jpg
    → processOptions.Media → BuildMessages
      → LoadMediaAsContentParts → data:image/jpeg;base64,...
        → ContentParts on user Message
          → OpenAI: []openaiContentPart (Vision API format)
          → Anthropic: NewImageBlockBase64()

Test Plan

  • Compiles with `CGO_ENABLED=0 go build ./cmd/picoclaw/...`
  • Telegram: send image + text "这是什么" ("what is this?") → AI describes the image
  • Telegram: send image only → AI describes the image
  • Telegram: send text only → normal reply, no regression
  • Send non-image file → no crash, text part processed normally

lcolok added 2 commits March 3, 2026 16:48
Enable channels (e.g. Telegram) to pass user-attached images through
to vision-capable LLMs. Previously, downloaded images were stored locally
but never forwarded to the provider — the media pipeline was broken at
the agent layer.

Changes:
- Add ImageURL type and ContentParts field to protocoltypes.Message
- New providers/media.go: LoadMediaAsContentParts converts local files
  to base64 data URLs (supports jpg/png/gif/webp, ≤5MB per image)
- Agent loop resolves media:// refs to local paths and passes them
  through processOptions to BuildMessages
- BuildMessages attaches ContentParts (text + image_url blocks) when
  media is present, preserving pure-text behavior when absent
- OpenAI compat: Content field becomes `any` (string or []contentPart)
  following the OpenAI Vision API spec
- Anthropic: ContentParts mapped to NewImageBlockBase64 via parseDataURL

Backward compatible — no existing behavior changes when no media is
attached. The Content (string) field is always preserved as fallback.
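The Anthropic side reverses the data URL before handing the payload to `NewImageBlockBase64`, which takes a media type and raw base64 rather than a URL. A minimal sketch of what `parseDataURL` could look like (the version in `anthropic/provider.go` may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// parseDataURL splits "data:<mediaType>;base64,<payload>" back into the
// media type and base64 payload expected by NewImageBlockBase64.
func parseDataURL(u string) (mediaType, b64 string, err error) {
	rest, ok := strings.CutPrefix(u, "data:")
	if !ok {
		return "", "", fmt.Errorf("not a data URL")
	}
	meta, payload, ok := strings.Cut(rest, ",")
	if !ok {
		return "", "", fmt.Errorf("malformed data URL")
	}
	mediaType, ok = strings.CutSuffix(meta, ";base64")
	if !ok {
		return "", "", fmt.Errorf("expected base64 encoding")
	}
	return mediaType, payload, nil
}

func main() {
	mt, b64, err := parseDataURL("data:image/jpeg;base64,/9j/4AAQ")
	fmt.Println(mt, b64, err) // image/jpeg /9j/4AAQ <nil>
}
```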
lcolok force-pushed the feat/multimodal-image-support branch from 862fe1a to 152529a on March 3, 2026, 08:48
lcolok (author) commented Mar 4, 2026

Closing in favor of #1020 which was merged with a more complete implementation (streaming base64, filetype detection, configurable size limit).

One piece that #1020 doesn't cover: Anthropic native provider vision support (anthropic/provider.go). I'll submit that as a separate, focused PR.

lcolok closed this Mar 4, 2026