feat(agent): add vision/image support to agent pipeline by shikihane · Pull Request #990 · sipeed/picoclaw

shikihane · 2026-03-02T13:48:16Z

Description

Add vision/image support to the agent pipeline so that channels can pass media (images) from users to vision-capable LLMs.

Changes

- Message.Media field (protocoltypes/types.go): new []string field for media URLs/refs
- serializeMessages() (openai_compat/provider.go): converts messages with Media into the OpenAI vision API multi-part format (image_url content parts). Also preserves reasoning_content for thinking models.
- Pipeline wiring (agent/loop.go, agent/context.go): threads Media from InboundMessage through processOptions → BuildMessages() → LLM call
- resolveMediaRefs() (agent/loop.go): converts media:// refs (from MediaStore) to data:image/...;base64,... data URLs before the LLM call, so sessions store lightweight refs while the LLM receives the full image data

How it works

Channel receives image → stores in MediaStore → media://uuid ref
    → BuildMessages() attaches ref to Message.Media
    → resolveMediaRefs() reads file, encodes base64, replaces ref
    → serializeMessages() formats as OpenAI vision multi-part content
    → LLM sees the image
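The ref-resolution step in the flow above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `blob` type and the map-based `store` are hypothetical stand-ins for MediaStore, and the behavior for unknown refs (skip them) is an assumption based on the commit message.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// blob is a hypothetical stand-in for what a media store returns for a ref.
type blob struct {
	mime string
	data []byte
}

// dataURL encodes raw bytes as a base64 data URL.
func dataURL(mime string, data []byte) string {
	return "data:" + mime + ";base64," + base64.StdEncoding.EncodeToString(data)
}

// resolveRefs replaces media:// refs with data URLs.
// Entries without the media:// prefix pass through unchanged;
// refs missing from the store are skipped (assumed behavior).
func resolveRefs(refs []string, store map[string]blob) []string {
	out := make([]string, 0, len(refs))
	for _, ref := range refs {
		if !strings.HasPrefix(ref, "media://") {
			out = append(out, ref)
			continue
		}
		b, ok := store[ref]
		if !ok {
			continue
		}
		out = append(out, dataURL(b.mime, b.data))
	}
	return out
}

func main() {
	store := map[string]blob{
		"media://1234": {mime: "image/png", data: []byte("hi")},
	}
	fmt.Println(resolveRefs([]string{"media://1234", "media://missing"}, store))
}
```

Because the conversion happens only at call time, the session history keeps the short `media://` ref rather than the base64 payload.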

Relation to PR #555

This PR covers similar ground to #555 by @as3k (with attribution in commit messages). We needed this functionality urgently for our Feishu channel deployment and #555 has been inactive with merge conflicts, so we went ahead with our own implementation. Key additions beyond #555:

- resolveMediaRefs() for MediaStore integration (without this, media:// refs are sent raw to the LLM and rejected)
- reasoning_content preservation in serializeMessages()

Type of Change

✨ New feature (non-breaking change which adds functionality)

AI Code Generation

🛠️ Mostly AI-generated (AI draft, Human verified/modified)

Test Environment

Hardware: Radxa Cubie A7A (arm64, 4GB RAM)
OS: Debian 11 (bullseye)
Model/Provider: Dashscope (kimi-k2.5, vision-capable)
Channel: Feishu (WebSocket mode)

Checklist

My code follows the style of this project
I have performed a self-review of my own changes
Backward compatible — messages without Media work exactly as before
All existing tests pass (make test on affected packages)

…es for vision API support

- Add Media []string field to Message struct for image/media URLs
- Implement serializeMessages() to format messages with image_url content parts
- Enables OpenAI-compatible vision APIs to receive image attachments

@as3k

…#555)

Add Media field to processOptions, pass msg.Media from inbound messages through to BuildMessages and serializeMessages so vision-capable LLMs receive image_url content parts. Based on work by @as3k in sipeed#555.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

The serializeMessages() function was not preserving the reasoning_content field when serializing messages for vision API calls. This caused the TestProviderChat_PreservesReasoningContentInHistory test to fail. This fix ensures reasoning_content is included in both text-only messages and vision messages with media attachments.

Co-authored-by: Zachary Guerrero <[email protected]>

…data URLs

Without this function, media:// refs stored by MediaStore are passed directly to the LLM API, which rejects them as invalid URLs. resolveMediaRefs() runs after BuildMessages() and before the LLM call, converting each media:// ref to a data:image/...;base64,... URL that vision-capable models can process.

Also adds mimeFromExtension() helper for MIME type inference from file extensions when ContentType metadata is not available.

yinwm · 2026-03-02T15:28:00Z

Code Review: Vision/Image Support for Agent Pipeline

Thanks for the contribution! The overall architecture is solid, but I found a few issues that should be addressed before merging.

🔴 Must Fix

1. `serializeMessages` drops `ToolCallID` and `ToolCalls` when Media is present

```go
// When Media is present, ToolCallID and ToolCalls are ignored:
msg := map[string]interface{}{
	"role":    m.Role,
	"content": parts,
}
if m.ReasoningContent != "" {
	msg["reasoning_content"] = m.ReasoningContent
}
// ❌ Missing ToolCallID and ToolCalls handling
```

While "tool calls with images" may be rare, this inconsistency could cause hard-to-debug issues. Suggest adding:

```go
if m.ToolCallID != "" {
	msg["tool_call_id"] = m.ToolCallID
}
if len(m.ToolCalls) > 0 {
	msg["tool_calls"] = m.ToolCalls
}
```
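One way to avoid this kind of drift is to build the `content` value first and then set the optional fields once, on a path shared by text-only and media messages. The sketch below illustrates that shape; the `Message` and `ToolCall` types are simplified stand-ins, not the project's actual definitions, and `serializeMessage` is a hypothetical single-message helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ToolCall and Message are simplified stand-ins for the project's types.
type ToolCall struct {
	ID   string `json:"id"`
	Type string `json:"type"`
}

type Message struct {
	Role             string
	Content          string
	ReasoningContent string
	ToolCallID       string
	Media            []string
	ToolCalls        []ToolCall
}

// serializeMessage builds one OpenAI-style chat message.
// Messages with Media use the multi-part content format
// (text part followed by image_url parts).
func serializeMessage(m Message) map[string]any {
	msg := map[string]any{"role": m.Role}
	if len(m.Media) == 0 {
		msg["content"] = m.Content
	} else {
		parts := make([]map[string]any, 0, len(m.Media)+1)
		if m.Content != "" {
			parts = append(parts, map[string]any{"type": "text", "text": m.Content})
		}
		for _, u := range m.Media {
			parts = append(parts, map[string]any{
				"type":      "image_url",
				"image_url": map[string]any{"url": u},
			})
		}
		msg["content"] = parts
	}
	// Optional fields are set once, for both the text and media paths.
	if m.ReasoningContent != "" {
		msg["reasoning_content"] = m.ReasoningContent
	}
	if m.ToolCallID != "" {
		msg["tool_call_id"] = m.ToolCallID
	}
	if len(m.ToolCalls) > 0 {
		msg["tool_calls"] = m.ToolCalls
	}
	return msg
}

func main() {
	m := Message{
		Role:       "tool",
		Content:    "screenshot attached",
		ToolCallID: "call_1",
		Media:      []string{"data:image/png;base64,aGk="},
	}
	b, err := json.MarshalIndent(serializeMessage(m), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

With a single tail section for optional fields, adding a new field later cannot accidentally apply to only one of the two branches.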

2. Missing Unit Tests

The PR mentions `All existing tests pass`, but the new functions have no test coverage:

- `serializeMessages()` - needs tests for plain text, with Media, with ToolCalls scenarios
- `resolveMediaRefs()` - needs tests for `media://` resolution, error handling, MIME inference
- `mimeFromExtension()` - simple but should have basic coverage

This is critical since `serializeMessages` directly affects the API request format.

🟡 Suggestions

3. Memory Risk in `resolveMediaRefs`

```go
data, err := os.ReadFile(localPath)
```

This reads the entire file into memory. For high-resolution images (4K screenshots can be 10MB+), this could cause OOM under concurrent load.

Suggestion: Add file size limit:
```go
const maxMediaSize = 20 * 1024 * 1024 // 20MB

info, err := os.Stat(localPath)
if err != nil || info.Size() > maxMediaSize {
	// handle error
}
```

4. Silent Failure on Media Resolution Errors

When image resolution fails, only a warning is logged - users get no feedback:

```go
if err != nil {
	logger.WarnCF(...) // User doesn't know their image wasn't sent
	continue
}
```

Suggestion: Consider adding a hint in the response or propagating the error.
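As one possible shape for the "hint" option: collect the refs that failed and turn them into a short note that the caller can append to the prompt or the reply. Everything here (`resolveAll`, `failureHint`, `errNotFound`, and the wording of the note) is hypothetical and only meant to illustrate the idea.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a placeholder error used by the demo resolver.
var errNotFound = errors.New("media not found")

// resolveAll applies resolve to each ref and separates successes from
// failures instead of silently dropping the failures.
func resolveAll(refs []string, resolve func(string) (string, error)) (resolved, failed []string) {
	for _, ref := range refs {
		url, err := resolve(ref)
		if err != nil {
			failed = append(failed, ref)
			continue
		}
		resolved = append(resolved, url)
	}
	return resolved, failed
}

// failureHint returns a user-visible note, or "" when nothing failed.
func failureHint(failed []string) string {
	if len(failed) == 0 {
		return ""
	}
	return fmt.Sprintf("[note: %d attached image(s) could not be loaded and were not sent to the model]", len(failed))
}

func main() {
	resolve := func(ref string) (string, error) {
		if ref == "media://broken" {
			return "", errNotFound
		}
		return "data:image/png;base64,aGk=", nil
	}
	resolved, failed := resolveAll([]string{"media://ok", "media://broken"}, resolve)
	fmt.Println(len(resolved), "resolved;", failureHint(failed))
}
```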

5. `mimeFromExtension` Default Return Value

```go
default:
	return "image/jpeg" // Returns jpeg for .txt, .pdf, etc.
```

Suggestion: For unknown extensions, skip or return an error instead of guessing jpeg.
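A minimal sketch of the "return nothing for unknown extensions" variant. The set of extensions listed here is an assumption for illustration; the PR's actual list may differ.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// mimeFromExtension maps a file extension (with leading dot) to an image
// MIME type. It returns "" for anything it does not recognize, so the
// caller can skip the file instead of mislabeling it as JPEG.
func mimeFromExtension(ext string) string {
	switch strings.ToLower(ext) {
	case ".jpg", ".jpeg":
		return "image/jpeg"
	case ".png":
		return "image/png"
	case ".gif":
		return "image/gif"
	case ".webp":
		return "image/webp"
	default:
		return ""
	}
}

func main() {
	for _, name := range []string{"photo.PNG", "notes.txt"} {
		mime := mimeFromExtension(filepath.Ext(name))
		if mime == "" {
			fmt.Println(name, "-> skipped (unknown type)")
			continue
		}
		fmt.Println(name, "->", mime)
	}
}
```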

✅ What's Done Well

- Backward compatible - Messages without Media use the original path
- Clear documentation - PR description and code comments explain the data flow
- Smart `media://` ref design - Sessions store lightweight refs, resolved to base64 only at call time
- Preserves `reasoning_content` - Compatible with thinking models

Summary

| Category | Status |
|---|---|
| Feature completeness | ✅ |
| Architecture | ✅ |
| Test coverage | ❌ Missing |
| Bug (ToolCallID) | ❌ Needs fix |
| Edge cases | ⚠️ Suggestions |

Recommendation: Please add unit tests (at minimum for `serializeMessages`) and fix the ToolCallID/ToolCalls bug before merging.

- serializeMessages: preserve ToolCallID/ToolCalls when Media is present
- resolveMediaRefs: add 20MB file size limit to prevent OOM
- mimeFromExtension: return empty string for unknown extensions
- Add 11 unit tests for serializeMessages, resolveMediaRefs, mimeFromExtension

Co-Authored-By: Claude Opus 4.6 <[email protected]>

mengzhuo · 2026-03-03T03:38:23Z

pkg/agent/loop.go

+
+// mimeFromExtension returns a MIME type for common image extensions.
+// Returns empty string for unrecognized extensions.
+func mimeFromExtension(ext string) string {


Please use github.com/h2non/filetype for file type detection
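To illustrate content-based detection (as opposed to trusting the extension), here is a sketch that stays within the standard library: it swaps in `http.DetectContentType` rather than the third-party package named above, so it shows the general technique and not that library's API. `sniffImageMIME` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// sniffImageMIME inspects the leading bytes of a file and returns an
// image MIME type, or "" when the content does not look like an image.
// http.DetectContentType considers at most the first 512 bytes.
func sniffImageMIME(head []byte) string {
	mime := http.DetectContentType(head)
	if strings.HasPrefix(mime, "image/") {
		return mime
	}
	return ""
}

func main() {
	pngHeader := []byte("\x89PNG\r\n\x1a\n")
	fmt.Printf("png header -> %q\n", sniffImageMIME(pngHeader))
	fmt.Printf("plain text -> %q\n", sniffImageMIME([]byte("hello")))
}
```

Sniffing the content means a mislabeled file (for example a text file renamed to `.jpg`) is rejected rather than sent to the model with a wrong MIME type.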

mengzhuo · 2026-03-03T03:40:12Z

pkg/agent/loop.go

+				continue
+			}
+
+			dataURL := "data:" + mime + ";base64," + base64.StdEncoding.EncodeToString(data)


Could we use a file handle and an encoder for later use, instead of allocating 2x the memory for media files?

mengzhuo · 2026-03-03T03:41:08Z

pkg/agent/loop.go

+
+// maxMediaFileSize is the maximum file size (20 MB) for media resolution.
+// Files larger than this are skipped to prevent OOM under concurrent load.
+const maxMediaFileSize = 20 * 1024 * 1024


Bake the size limit into the config.

as3k and others added 4 commits March 2, 2026 17:18

shikihane mentioned this pull request Mar 2, 2026

feat: add vision/image support to agent pipeline #555

Closed

sipeed-bot bot added type: enhancement New feature or request domain: agent domain: provider priority: high labels Mar 3, 2026

Orgmar merged commit 12d4570 into sipeed:main Mar 3, 2026

imguoguo mentioned this pull request Mar 3, 2026

revert: "feat(agent): add vision/image support to agent pipeline" #1010

Merged

shikihane mentioned this pull request Mar 3, 2026

feat(agent): add vision/image support with streaming base64 and filetype detection #1020

Merged

iamhitarth mentioned this pull request Mar 4, 2026

feat(providers): add vision/image support to CodexProvider (OAuth) #1041

Open

hyperwd pushed a commit to hyperwd/picoclaw that referenced this pull request Mar 5, 2026

Merge pull request sipeed#990 from shikihane/feat/agent-vision-pipeline

1b8dab2

feat(agent): add vision/image support to agent pipeline

Pluckypan pushed a commit to Pluckypan/picoclaw that referenced this pull request Mar 6, 2026

Merge pull request sipeed#990 from shikihane/feat/agent-vision-pipeline

452b169

feat(agent): add vision/image support to agent pipeline

Orgmar merged 5 commits into sipeed:main from shikihane:feat/agent-vision-pipeline