
feat(agent): add vision/image support with streaming base64 and filetype detection #1020

Merged
yinwm merged 7 commits into sipeed:main from shikihane:feat/agent-vision-pipeline-v2
Mar 3, 2026

Conversation

@shikihane
Contributor

Description

Add vision/image support to the agent pipeline so that channels can pass media (images) from users to vision-capable LLMs.

This is a rewrite of #990 (which was reverted in #1010 due to dispute over file processing method). All three review comments from @mengzhuo on #990 have been addressed in this version.

Changes from reverted #990

| Issue raised by @mengzhuo | #990 (reverted) | This PR |
| --- | --- | --- |
| Use h2non/filetype for MIME detection | `mimeFromExtension()` — relied on file extension | `filetype.MatchFile()` — magic-bytes detection, no extension dependency |
| Use file handler + encoder to avoid 2x memory | `os.ReadFile` + `base64.EncodeToString` — peak 2.6x file size in memory | `os.Open` → `base64.NewEncoder` → `io.Copy` — streaming, peak ~1.33x (base64 buffer only) |
| Bake size limitation into config | `maxMediaFileSize` hardcoded const (20MB) | `config.MaxMediaSize` JSON/env configurable, 20MB default via `GetMaxMediaSize()` |

Implementation

New file: `pkg/agent/loop_media.go` with a `resolveMediaRefs()` function (extracted from loop.go for clarity)

  • Streaming base64: `os.Open` → `base64.NewEncoder` → `bytes.Buffer` → data URL
  • MIME detection: prefer MediaMeta.ContentType from store, fallback to filetype.MatchFile() (magic bytes)
  • Size guard: skip files exceeding config.MaxMediaSize with warning log
  • Immutable: returns new slice, never mutates original messages
  • Graceful degradation: skips unresolvable refs, unknown MIME types, oversized files

Modified: `pkg/providers/openai_compat/provider.go`, where `serializeMessages()` replaces `stripSystemParts()`

  • Plain text messages: unchanged behavior (string content)
  • Messages with Media: OpenAI vision multi-part format (image_url content parts with data URLs)
  • Preserves reasoning_content, tool_calls, tool_call_id

Pipeline wiring:

  • Message.Media []string field on protocol types
  • processOptions.Media threads media refs from InboundMessage through to BuildMessages()
  • resolveMediaRefs() called before LLM to convert media:// refs → data: URLs

Data flow

Channel receives image
  → stores in MediaStore → media://uuid ref
  → InboundMessage.Media carries ref through bus
  → processOptions.Media → BuildMessages() attaches to Message.Media
  → resolveMediaRefs(): file → filetype detect → streaming base64 → data URL
  → serializeMessages(): Message.Media → OpenAI multi-part content
  → LLM sees the image

Test coverage

  • resolveMediaRefs: 6 tests (resolve to base64, skip oversized, skip unknown type, passthrough non-media URLs, immutability, metadata content-type preference)
  • serializeMessages: 4 tests (plain text, with media, media + tool_call_id, system message stripping)

E2E verification

Tested on a Radxa Cubie A7A (arm64) with Feishu channel → Dashscope LLM (kimi-k2.5). Sent an image via Feishu; the LLM correctly identified the image content.

Type of Change

  • ✨ New feature (non-breaking change which adds functionality)

AI Code Generation

  • 🛠️ Mostly AI-generated (AI draft, Human verified/modified)

Test Environment

  • Hardware: Radxa Cubie A7A (arm64, 4GB RAM)
  • OS: Debian 11 (bullseye)
  • Model/Provider: Dashscope (kimi-k2.5, vision-capable)
  • Channel: Feishu (WebSocket mode)

Checklist

shikihane and others added 6 commits March 3, 2026 16:27
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…etype detection

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Separate third-party imports from local module imports (gci)
- Fix byte slice literal formatting (gofumpt)
- Rename shadowed err variable to ftErr (govet)
- Remove trailing blank lines in test files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yinwm
Collaborator

yinwm commented Mar 3, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code



@yinwm yinwm left a comment


LGTM

@yinwm yinwm merged commit a65ccc0 into sipeed:main Mar 3, 2026
2 checks passed
@yinwm
Collaborator

yinwm commented Mar 3, 2026

thanks for the pr

iamhitarth added a commit to iamhitarth/picoclaw that referenced this pull request Mar 4, 2026
Add multipart content support for messages with Media attachments in
CodexProvider (OAuth). This enables vision-capable models like gpt-5.2
to process images when using OAuth authentication.

Changes:
- Check for msg.Media in user messages
- Build ResponseInputMessageContentListParam with text + image parts
- Use ResponseInputImageParam with ImageURL and auto detail level

This mirrors the existing vision support in openai_compat/provider.go
and completes the vision pipeline for OAuth users (PR sipeed#1020 only
updated openai_compat).

Fixes: OAuth users couldn't use vision despite PR sipeed#1020 adding support
@Orgmar
Contributor

Orgmar commented Mar 4, 2026

@shikihane Solid rewrite of the vision support. Switching to filetype.MatchFile() for magic-bytes detection and streaming base64 via io.Copy to keep memory at ~1.33x instead of 2.6x are both really clean improvements. Nice that you also made the size limit configurable through config.MaxMediaSize instead of keeping it hardcoded. The fact that you verified it end-to-end on actual hardware (Radxa Cubie A7A with Feishu + Dashscope) is a great touch.

We are building the PicoClaw Dev Group on Discord for contributors to connect and collaborate. If you are interested, send an email to support@sipeed.com with the subject [Join PicoClaw Dev Group] + Your GitHub account and we will send you the invite link!

lcolok added a commit to lcolok/picoclaw that referenced this pull request Mar 4, 2026
The merged sipeed#1020 added vision support to the openai_compat provider but
left the native Anthropic provider without image handling. This patch
closes the gap: when a Message carries Media (base64 data URLs), the
Anthropic adapter now builds NewImageBlockBase64 content blocks using
the official SDK, giving users of `anthropic/` prefixed models the
same vision capability.

Changes:
- buildParams: handle msg.Media in user messages, convert data URLs
  to anthropic.NewImageBlockBase64 content blocks
- parseDataURL: extract media type and base64 data from data URLs
- Tests for parseDataURL (8 cases) and media message building
hyperwd pushed a commit to hyperwd/picoclaw that referenced this pull request Mar 5, 2026
…ine-v2

feat(agent): add vision/image support with streaming base64 and filetype detection
Pluckypan pushed a commit to Pluckypan/picoclaw that referenced this pull request Mar 6, 2026
…ine-v2

feat(agent): add vision/image support with streaming base64 and filetype detection

3 participants