feat(agent): add vision/image support to agent pipeline#990

Merged
Orgmar merged 5 commits into sipeed:main from shikihane:feat/agent-vision-pipeline
Mar 3, 2026

@shikihane
Contributor

Description

Add vision/image support to the agent pipeline so that channels can pass media (images) from users to vision-capable LLMs.

Changes

  1. Message.Media field (protocoltypes/types.go) — new []string field for media URLs/refs
  2. serializeMessages() (openai_compat/provider.go) — converts messages with Media into OpenAI vision API multi-part format (image_url content parts). Also preserves reasoning_content for thinking models.
  3. Pipeline wiring (agent/loop.go, agent/context.go) — threads Media from InboundMessage through processOptions → BuildMessages() → LLM call
  4. resolveMediaRefs() (agent/loop.go) — converts media:// refs (from MediaStore) to data:image/...;base64,... data URLs before the LLM call, so sessions store lightweight refs while the LLM receives full image data

How it works

Channel receives image → stores in MediaStore → media://uuid ref
    → BuildMessages() attaches ref to Message.Media
    → resolveMediaRefs() reads file, encodes base64, replaces ref
    → serializeMessages() formats as OpenAI vision multi-part content
    → LLM sees the image
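
The ref-resolution step in this flow can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `mediaFile`, `resolveRefs`, and the `lookup` callback are hypothetical stand-ins for the MediaStore integration, and unresolved refs are simply dropped here.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// mediaFile is a hypothetical stand-in for what a MediaStore lookup returns.
type mediaFile struct {
	ContentType string
	Data        []byte
}

// resolveRefs replaces each media:// ref with a data URL and passes other
// entries (for example https:// URLs) through unchanged. Refs that the
// lookup cannot resolve are dropped.
func resolveRefs(media []string, lookup func(ref string) (mediaFile, bool)) []string {
	out := make([]string, 0, len(media))
	for _, ref := range media {
		if !strings.HasPrefix(ref, "media://") {
			out = append(out, ref)
			continue
		}
		f, ok := lookup(ref)
		if !ok {
			continue
		}
		encoded := base64.StdEncoding.EncodeToString(f.Data)
		out = append(out, "data:"+f.ContentType+";base64,"+encoded)
	}
	return out
}

func main() {
	// An in-memory map plays the role of the store in this sketch.
	store := map[string]mediaFile{
		"media://abc": {ContentType: "image/png", Data: []byte("hi")},
	}
	lookup := func(ref string) (mediaFile, bool) {
		f, ok := store[ref]
		return f, ok
	}
	fmt.Println(resolveRefs([]string{"media://abc", "https://example.com/a.png"}, lookup))
}
```

Because only the short ref is stored in the session, the base64 payload exists just for the duration of the LLM request.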

Relation to PR #555

This PR covers similar ground to #555 by @as3k (with attribution in commit messages). We needed this functionality urgently for our Feishu channel deployment and #555 has been inactive with merge conflicts, so we went ahead with our own implementation. Key additions beyond #555:

  • resolveMediaRefs() for MediaStore integration (without this, media:// refs are sent raw to the LLM and rejected)
  • reasoning_content preservation in serializeMessages()

Type of Change

  • ✨ New feature (non-breaking change which adds functionality)

AI Code Generation

  • 🛠️ Mostly AI-generated (AI draft, Human verified/modified)

Test Environment

  • Hardware: Radxa Cubie A7A (arm64, 4GB RAM)
  • OS: Debian 11 (bullseye)
  • Model/Provider: Dashscope (kimi-k2.5, vision-capable)
  • Channel: Feishu (WebSocket mode)

Checklist

  • My code follows the style of this project
  • I have performed a self-review of my own changes
  • Backward compatible — messages without Media work exactly as before
  • All existing tests pass (make test on affected packages)

as3k and others added 4 commits March 2, 2026 17:18
…es for vision API support

- Add Media []string field to Message struct for image/media URLs
- Implement serializeMessages() to format messages with image_url content parts
- Enables OpenAI-compatible vision APIs to receive image attachments
…#555)

Add Media field to processOptions, pass msg.Media from inbound
messages through to BuildMessages and serializeMessages so
vision-capable LLMs receive image_url content parts.

Based on work by @as3k in sipeed#555.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The serializeMessages() function was not preserving the reasoning_content
field when serializing messages for vision API calls. This caused the
TestProviderChat_PreservesReasoningContentInHistory test to fail.

This fix ensures reasoning_content is included in both text-only messages
and vision messages with media attachments.

Co-authored-by: Zachary Guerrero <[email protected]>
…data URLs

Without this function, media:// refs stored by MediaStore are passed
directly to the LLM API, which rejects them as invalid URLs.

resolveMediaRefs() runs after BuildMessages() and before the LLM call,
converting each media:// ref to a data:image/...;base64,... URL that
vision-capable models can process.

Also adds mimeFromExtension() helper for MIME type inference from
file extensions when ContentType metadata is not available.
@yinwm
Collaborator

yinwm commented Mar 2, 2026

Code Review: Vision/Image Support for Agent Pipeline

Thanks for the contribution! The overall architecture is solid, but I found a few issues that should be addressed before merging.


🔴 Must Fix

1. `serializeMessages` drops `ToolCallID` and `ToolCalls` when Media is present

```go
// When Media is present, ToolCallID and ToolCalls are ignored:
msg := map[string]interface{}{
	"role":    m.Role,
	"content": parts,
}
if m.ReasoningContent != "" {
	msg["reasoning_content"] = m.ReasoningContent
}
// ❌ Missing ToolCallID and ToolCalls handling
```

While "tool calls with images" may be rare, this inconsistency could cause hard-to-debug issues. Suggest adding:

```go
if m.ToolCallID != "" {
	msg["tool_call_id"] = m.ToolCallID
}
if len(m.ToolCalls) > 0 {
	msg["tool_calls"] = m.ToolCalls
}
```
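
For illustration, one way to avoid this class of bug is to build the `content` value first and then apply the optional fields on a single shared path. The sketch below assumes simplified, hypothetical `Message` and `ToolCall` types and a per-message helper named `serializeMessage`; the project's real types and function shape may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ToolCall and Message are simplified, hypothetical stand-ins for the
// project's protocol types.
type ToolCall struct {
	ID   string `json:"id"`
	Type string `json:"type"`
}

type Message struct {
	Role             string
	Content          string
	Media            []string
	ReasoningContent string
	ToolCallID       string
	ToolCalls        []ToolCall
}

// serializeMessage builds one OpenAI-style message map. Only the shape of
// "content" depends on Media; every optional field is handled once, below.
func serializeMessage(m Message) map[string]interface{} {
	msg := map[string]interface{}{"role": m.Role}
	if len(m.Media) == 0 {
		msg["content"] = m.Content
	} else {
		parts := make([]map[string]interface{}, 0, len(m.Media)+1)
		if m.Content != "" {
			parts = append(parts, map[string]interface{}{"type": "text", "text": m.Content})
		}
		for _, u := range m.Media {
			parts = append(parts, map[string]interface{}{
				"type":      "image_url",
				"image_url": map[string]interface{}{"url": u},
			})
		}
		msg["content"] = parts
	}
	if m.ReasoningContent != "" {
		msg["reasoning_content"] = m.ReasoningContent
	}
	if m.ToolCallID != "" {
		msg["tool_call_id"] = m.ToolCallID
	}
	if len(m.ToolCalls) > 0 {
		msg["tool_calls"] = m.ToolCalls
	}
	return msg
}

func main() {
	b, err := json.Marshal(serializeMessage(Message{
		Role:    "user",
		Content: "What is in this picture?",
		Media:   []string{"data:image/png;base64,aGk="},
	}))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

With a single tail for optional fields, adding a new field later cannot be forgotten on one of the two branches.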

2. Missing Unit Tests

The PR mentions `All existing tests pass`, but the new functions have no test coverage:

  • `serializeMessages()` - needs tests for plain text, with Media, with ToolCalls scenarios
  • `resolveMediaRefs()` - needs tests for `media://` resolution, error handling, MIME inference
  • `mimeFromExtension()` - simple but should have basic coverage

This is critical since `serializeMessages` directly affects the API request format.


🟡 Suggestions

3. Memory Risk in `resolveMediaRefs`

```go
data, err := os.ReadFile(localPath)
```

This reads the entire file into memory. For high-resolution images (4K screenshots can exceed 10 MB), this could cause OOM under concurrent load.

Suggestion: Add file size limit:
```go
const maxMediaSize = 20 * 1024 * 1024 // 20MB

info, err := os.Stat(localPath)
if err != nil || info.Size() > maxMediaSize {
	// handle error
}
```

4. Silent Failure on Media Resolution Errors

When image resolution fails, only a warning is logged - users get no feedback:

```go
if err != nil {
	logger.WarnCF(...) // User doesn't know their image wasn't sent
	continue
}
```

Suggestion: Consider adding a hint in the response or propagating the error.

5. `mimeFromExtension` Default Return Value

```go
default:
	return "image/jpeg" // Returns jpeg for .txt, .pdf, etc.
```

Suggestion: For unknown extensions, skip or return an error instead of guessing jpeg.
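
One possible shape for this, shown under a different name (`imageMIMEForExt`) to make clear it is an illustrative variant rather than the PR's helper: return an explicit `ok` flag so the caller must decide what to do with unknown extensions. The set of extensions listed is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// imageMIMEForExt maps a file extension (with or without the leading dot,
// any case) to an image MIME type. Unknown extensions return ok == false
// instead of a guessed type.
func imageMIMEForExt(ext string) (mime string, ok bool) {
	switch strings.ToLower(strings.TrimPrefix(ext, ".")) {
	case "jpg", "jpeg":
		return "image/jpeg", true
	case "png":
		return "image/png", true
	case "gif":
		return "image/gif", true
	case "webp":
		return "image/webp", true
	default:
		return "", false
	}
}

func main() {
	for _, ext := range []string{".PNG", "jpeg", ".pdf"} {
		mime, ok := imageMIMEForExt(ext)
		fmt.Printf("%s -> %q (known: %v)\n", ext, mime, ok)
	}
}
```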


✅ What's Done Well

  1. Backward compatible - Messages without Media use the original path
  2. Clear documentation - PR description and code comments explain the data flow
  3. Smart `media://` ref design - Sessions store lightweight refs, resolved to base64 only at call time
  4. Preserves `reasoning_content` - Compatible with thinking models

Summary

| Category | Status |
|---|---|
| Feature completeness | |
| Architecture | |
| Test coverage | ❌ Missing |
| Bug (ToolCallID) | ❌ Needs fix |
| Edge cases | ⚠️ Suggestions |

Recommendation: Please add unit tests (at minimum for `serializeMessages`) and fix the ToolCallID/ToolCalls bug before merging.

- serializeMessages: preserve ToolCallID/ToolCalls when Media is present
- resolveMediaRefs: add 20MB file size limit to prevent OOM
- mimeFromExtension: return empty string for unknown extensions
- Add 11 unit tests for serializeMessages, resolveMediaRefs, mimeFromExtension

Co-Authored-By: Claude Opus 4.6 <[email protected]>

// mimeFromExtension returns a MIME type for common image extensions.
// Returns empty string for unrecognized extensions.
func mimeFromExtension(ext string) string {
Collaborator

Please use `github.com/h2non/filetype` for file type detection.
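
To illustrate the general idea of detecting the type from file contents rather than the extension, here is a sketch using the standard library's `http.DetectContentType`. This is a stand-in technique chosen so the example has no external dependencies; it is not the `h2non/filetype` API that the comment asks for, and `sniffImageMIME` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// sniffImageMIME inspects the leading bytes of data and returns the detected
// MIME type only when it is an image type.
func sniffImageMIME(data []byte) (string, bool) {
	mime := http.DetectContentType(data) // considers at most the first 512 bytes
	if !strings.HasPrefix(mime, "image/") {
		return "", false
	}
	return mime, true
}

func main() {
	pngHeader := []byte("\x89PNG\r\n\x1a\n")
	fmt.Println(sniffImageMIME(pngHeader))
	fmt.Println(sniffImageMIME([]byte("just some text")))
}
```

Content-based detection also catches files whose extension is missing or wrong, which extension matching cannot do.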

continue
}

dataURL := "data:" + mime + ";base64," + base64.StdEncoding.EncodeToString(data)
Collaborator

Could we use a file handle and a streaming encoder here, instead of allocating 2x the memory for each media file?


// maxMediaFileSize is the maximum file size (20 MB) for media resolution.
// Files larger than this are skipped to prevent OOM under concurrent load.
const maxMediaFileSize = 20 * 1024 * 1024
Collaborator

Please make the size limit configurable by moving it into config.
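
A hedged sketch of how a configurable limit might look, keeping 20 MB as the fallback when the setting is absent. The `MediaConfig` type, its `max_file_size_bytes` key, and the `maxFileSize` method are all hypothetical; the project's actual config layout is not shown in this thread.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// defaultMaxMediaSize is used when the config leaves the limit unset.
const defaultMaxMediaSize int64 = 20 * 1024 * 1024

// MediaConfig is a hypothetical config section for media handling.
type MediaConfig struct {
	MaxFileSizeBytes int64 `json:"max_file_size_bytes"`
}

// maxFileSize returns the configured limit, falling back to the default
// when the value is zero or negative.
func (c MediaConfig) maxFileSize() int64 {
	if c.MaxFileSizeBytes <= 0 {
		return defaultMaxMediaSize
	}
	return c.MaxFileSizeBytes
}

func main() {
	var cfg MediaConfig
	if err := json.Unmarshal([]byte(`{"max_file_size_bytes": 5242880}`), &cfg); err != nil {
		panic(err)
	}
	fmt.Println(cfg.maxFileSize())           // 5242880
	fmt.Println(MediaConfig{}.maxFileSize()) // 20971520
}
```

Treating zero as "unset" keeps existing config files working without changes.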

hyperwd pushed a commit to hyperwd/picoclaw that referenced this pull request Mar 5, 2026
feat(agent): add vision/image support to agent pipeline
Pluckypan pushed a commit to Pluckypan/picoclaw that referenced this pull request Mar 6, 2026
feat(agent): add vision/image support to agent pipeline