
feat: add multimodal image support for vision-capable LLMs#981

Closed
lcolok wants to merge 2 commits into sipeed:main from lcolok:feat/multimodal-image-support

Conversation

lcolok commented Mar 2, 2026

Summary

Enable channels (e.g. Telegram) to forward user-attached images to vision-capable LLMs. Previously, images were downloaded and stored locally but never passed to the provider: the media pipeline was broken at the agent layer (`msg.Media` was discarded, `BuildMessages` was hard-coded to receive `nil` for media, and `Message` had no image field).

Changes

| File | Change |
| --- | --- |
| `protocoltypes/types.go` | Add `ImageURL` type, `ContentBlock.ImageURL` field, and `Message.ContentParts` field |
| `providers/media.go` | New file: `LoadMediaAsContentParts()` converts local images to base64 data URLs (jpg/png/gif/webp, ≤5 MB each) |
| `agent/loop.go` | Add `Media` to `processOptions`, resolve `media://` refs to local paths, pass to `BuildMessages` |
| `agent/context.go` | `BuildMessages` attaches `ContentParts` (text + image blocks) when media is present |
| `openai_compat/provider.go` | `Content` field becomes `any` (string or `[]contentPart`), following the OpenAI Vision API spec |
| `anthropic/provider.go` | Map `ContentParts` to `NewImageBlockBase64()` via `parseDataURL()` |
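To make the type extension concrete, here is a minimal sketch of what the extended `protocoltypes` shapes could look like. Field names come from the table above; the exact struct tags and layout in `protocoltypes/types.go` may differ.

```go
package main

import "fmt"

// ImageURL carries an image reference, here as a base64 data URL.
type ImageURL struct {
	URL string `json:"url"`
}

// ContentBlock is one part of a multimodal message: either text or an image.
type ContentBlock struct {
	Type     string    `json:"type"` // "text" or "image_url"
	Text     string    `json:"text,omitempty"`
	ImageURL *ImageURL `json:"image_url,omitempty"`
}

// Message keeps the plain-string Content as a fallback and adds
// ContentParts for vision-capable providers to serialize their own way.
type Message struct {
	Role         string         `json:"role"`
	Content      string         `json:"content"`
	ContentParts []ContentBlock `json:"-"`
}

func main() {
	msg := Message{
		Role:    "user",
		Content: "what is this?",
		ContentParts: []ContentBlock{
			{Type: "text", Text: "what is this?"},
			{Type: "image_url", ImageURL: &ImageURL{URL: "data:image/jpeg;base64,..."}},
		},
	}
	fmt.Println(len(msg.ContentParts)) // 2
}
```

Keeping `Content` as a plain string alongside `ContentParts` is what preserves backward compatibility: text-only providers keep reading `Content` and never see the new field.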

Design

  • Minimal & upstream-friendly: only extends existing types, no breaking changes
  • Follows the OpenAI Vision API spec: the `content` field is `string | []ContentPart`
  • Backward compatible: when no media is attached, all code paths are unchanged; `Content` (string) is always preserved as a fallback
  • Both providers supported: OpenAI-compatible (content array) and Anthropic (base64 image blocks)

Data Flow

Channel downloads image → MediaStore.Store() → media://ref
  → processMessage: Resolve(ref) → /tmp/xxx.jpg
    → processOptions.Media → BuildMessages
      → LoadMediaAsContentParts → data:image/jpeg;base64,...
        → ContentParts on user Message
          → OpenAI: []openaiContentPart (Vision API format)
          → Anthropic: NewImageBlockBase64()

Test Plan

  • Compiles with `CGO_ENABLED=0 go build ./cmd/picoclaw/...`
  • Telegram: send image + text "这是什么" ("what is this?") → AI describes the image
  • Telegram: send image only → AI describes the image
  • Telegram: send text only → normal reply, no regression
  • Send non-image file → no crash, text part processed normally

lcolok added 2 commits March 3, 2026 16:48
Enable channels (e.g. Telegram) to pass user-attached images through
to vision-capable LLMs. Previously, downloaded images were stored locally
but never forwarded to the provider — the media pipeline was broken at
the agent layer.

Changes:
- Add ImageURL type and ContentParts field to protocoltypes.Message
- New providers/media.go: LoadMediaAsContentParts converts local files
  to base64 data URLs (supports jpg/png/gif/webp, ≤5MB per image)
- Agent loop resolves media:// refs to local paths and passes them
  through processOptions to BuildMessages
- BuildMessages attaches ContentParts (text + image_url blocks) when
  media is present, preserving pure-text behavior when absent
- OpenAI compat: Content field becomes `any` (string or []contentPart)
  following the OpenAI Vision API spec
- Anthropic: ContentParts mapped to NewImageBlockBase64 via parseDataURL

Backward compatible — no existing behavior changes when no media is
attached. The Content (string) field is always preserved as fallback.
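The Anthropic side reverses the data URL before handing the payload to `NewImageBlockBase64`, which takes a media type and raw base64 rather than a URL. A minimal sketch of what `parseDataURL` could look like (the version in `anthropic/provider.go` may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// parseDataURL splits "data:<mediaType>;base64,<payload>" back into the
// media type and base64 payload expected by NewImageBlockBase64.
func parseDataURL(u string) (mediaType, b64 string, err error) {
	rest, ok := strings.CutPrefix(u, "data:")
	if !ok {
		return "", "", fmt.Errorf("not a data URL")
	}
	meta, payload, ok := strings.Cut(rest, ",")
	if !ok {
		return "", "", fmt.Errorf("malformed data URL")
	}
	mediaType, ok = strings.CutSuffix(meta, ";base64")
	if !ok {
		return "", "", fmt.Errorf("expected base64 encoding")
	}
	return mediaType, payload, nil
}

func main() {
	mt, b64, err := parseDataURL("data:image/jpeg;base64,/9j/4AAQ")
	fmt.Println(mt, b64, err) // image/jpeg /9j/4AAQ <nil>
}
```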
lcolok force-pushed the feat/multimodal-image-support branch from 862fe1a to 152529a on March 3, 2026, 08:48
lcolok (author) commented Mar 4, 2026

Closing in favor of #1020 which was merged with a more complete implementation (streaming base64, filetype detection, configurable size limit).

One piece that #1020 doesn't cover: Anthropic native provider vision support (anthropic/provider.go). I'll submit that as a separate, focused PR.

lcolok closed this Mar 4, 2026