
feat(agent): add vision/image support with streaming base64 and filetype detection #1020

Merged
yinwm merged 7 commits into sipeed:main from shikihane:feat/agent-vision-pipeline-v2
Mar 3, 2026

Conversation

@shikihane
Contributor

Description

Add vision/image support to the agent pipeline so that channels can pass media (images) from users to vision-capable LLMs.

This is a rewrite of #990 (which was reverted in #1010 due to dispute over file processing method). All three review comments from @mengzhuo on #990 have been addressed in this version.

Changes from reverted #990

| Issue raised by @mengzhuo | #990 (reverted) | This PR |
| --- | --- | --- |
| Use h2non/filetype for MIME detection | `mimeFromExtension()` — relied on file extension | `filetype.MatchFile()` — magic-bytes detection, no extension dependency |
| Use file handler + encoder to avoid 2x memory | `os.ReadFile` + `base64.EncodeToString` — peak 2.6x file size in memory | `os.Open` → `base64.NewEncoder` → `io.Copy` — streaming, peak ~1.33x (base64 buffer only) |
| Bake size limitation into config | `maxMediaFileSize` hardcoded const (20MB) | `config.MaxMediaSize` JSON/env configurable, 20MB default via `GetMaxMediaSize()` |

Implementation

New file: `pkg/agent/loop_media.go` with a `resolveMediaRefs()` function (extracted from loop.go for clarity)

  • Streaming base64: `os.Open` → `base64.NewEncoder` → `bytes.Buffer` → data URL
  • MIME detection: prefer MediaMeta.ContentType from store, fallback to filetype.MatchFile() (magic bytes)
  • Size guard: skip files exceeding config.MaxMediaSize with warning log
  • Immutable: returns new slice, never mutates original messages
  • Graceful degradation: skips unresolvable refs, unknown MIME types, oversized files

Modified: `pkg/providers/openai_compat/provider.go`, where `serializeMessages()` replaces `stripSystemParts()`

  • Plain text messages: unchanged behavior (string content)
  • Messages with Media: OpenAI vision multi-part format (image_url content parts with data URLs)
  • Preserves reasoning_content, tool_calls, tool_call_id

Pipeline wiring:

  • Message.Media []string field on protocol types
  • processOptions.Media threads media refs from InboundMessage through to BuildMessages()
  • resolveMediaRefs() called before LLM to convert media:// refs → data: URLs

Data flow

Channel receives image
  → stores in MediaStore → media://uuid ref
  → InboundMessage.Media carries ref through bus
  → processOptions.Media → BuildMessages() attaches to Message.Media
  → resolveMediaRefs(): file → filetype detect → streaming base64 → data URL
  → serializeMessages(): Message.Media → OpenAI multi-part content
  → LLM sees the image

Test coverage

  • resolveMediaRefs: 6 tests (resolve to base64, skip oversized, skip unknown type, passthrough non-media URLs, immutability, metadata content-type preference)
  • serializeMessages: 4 tests (plain text, with media, media + tool_call_id, system message stripping)

E2E verification

Tested on a Radxa Cubie A7A (arm64) with Feishu channel → Dashscope LLM (kimi-k2.5). Sent an image via Feishu; the LLM correctly identified the image content.

Type of Change

  • ✨ New feature (non-breaking change which adds functionality)

AI Code Generation

  • 🛠️ Mostly AI-generated (AI draft, Human verified/modified)

Test Environment

  • Hardware: Radxa Cubie A7A (arm64, 4GB RAM)
  • OS: Debian 11 (bullseye)
  • Model/Provider: Dashscope (kimi-k2.5, vision-capable)
  • Channel: Feishu (WebSocket mode)

Checklist

shikihane and others added 6 commits March 3, 2026 16:27
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…etype detection

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Separate third-party imports from local module imports (gci)
- Fix byte slice literal formatting (gofumpt)
- Rename shadowed err variable to ftErr (govet)
- Remove trailing blank lines in test files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yinwm
Collaborator

yinwm commented Mar 3, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code



@yinwm yinwm left a comment


LGTM

@yinwm yinwm merged commit a65ccc0 into sipeed:main Mar 3, 2026
2 checks passed
@yinwm
Collaborator

yinwm commented Mar 3, 2026

thanks for the pr

iamhitarth added a commit to iamhitarth/picoclaw that referenced this pull request Mar 4, 2026
Add multipart content support for messages with Media attachments in
CodexProvider (OAuth). This enables vision-capable models like gpt-5.2
to process images when using OAuth authentication.

Changes:
- Check for msg.Media in user messages
- Build ResponseInputMessageContentListParam with text + image parts
- Use ResponseInputImageParam with ImageURL and auto detail level

This mirrors the existing vision support in openai_compat/provider.go
and completes the vision pipeline for OAuth users (PR sipeed#1020 only
updated openai_compat).

Fixes: OAuth users couldn't use vision despite PR sipeed#1020 adding support
@Orgmar
Contributor

Orgmar commented Mar 4, 2026

@shikihane Solid rewrite of the vision support. Switching to filetype.MatchFile() for magic-bytes detection and streaming base64 via io.Copy to keep memory at ~1.33x instead of 2.6x are both really clean improvements. Nice that you also made the size limit configurable through config.MaxMediaSize instead of keeping it hardcoded. The fact that you verified it end-to-end on actual hardware (Radxa Cubie A7A with Feishu + Dashscope) is a great touch.

We are building the PicoClaw Dev Group on Discord for contributors to connect and collaborate. If you are interested, send an email to support@sipeed.com with the subject [Join PicoClaw Dev Group] + Your GitHub account and we will send you the invite link!

lcolok added a commit to lcolok/picoclaw that referenced this pull request Mar 4, 2026
The merged sipeed#1020 added vision support to the openai_compat provider but
left the native Anthropic provider without image handling. This patch
closes the gap: when a Message carries Media (base64 data URLs), the
Anthropic adapter now builds NewImageBlockBase64 content blocks using
the official SDK, giving users of `anthropic/` prefixed models the
same vision capability.

Changes:
- buildParams: handle msg.Media in user messages, convert data URLs
  to anthropic.NewImageBlockBase64 content blocks
- parseDataURL: extract media type and base64 data from data URLs
- Tests for parseDataURL (8 cases) and media message building
hyperwd pushed a commit to hyperwd/picoclaw that referenced this pull request Mar 5, 2026
…ine-v2

feat(agent): add vision/image support with streaming base64 and filetype detection
Pluckypan pushed a commit to Pluckypan/picoclaw that referenced this pull request Mar 6, 2026
…ine-v2

feat(agent): add vision/image support with streaming base64 and filetype detection

3 participants