Skip to content

feat: add vision/image support to agent pipeline#555

Closed
as3k wants to merge 2 commits intosipeed:mainfrom
as3k:fix/vision-image-support
Closed

feat: add vision/image support to agent pipeline#555
as3k wants to merge 2 commits intosipeed:mainfrom
as3k:fix/vision-image-support

Conversation

@as3k
Copy link
Contributor

@as3k as3k commented Feb 20, 2026

Summary

This PR adds end-to-end vision/image support to the picoclaw agent pipeline, enabling vision-capable models like Gemini 2.0 Flash to receive and process image attachments.

Problem

Images extracted from Discord messages into InboundMessage.Media were being dropped (nil passed to BuildMessages), preventing vision-capable models from accessing image content.

Solution

  1. Add Media field to Message struct (pkg/providers/protocoltypes/types.go)

    • Added Media []string field to store image/media URLs
  2. Implement serializeMessages() (pkg/providers/openai_compat/provider.go)

    • Formats messages with image_url content parts for OpenAI-compatible vision APIs
    • Handles both text and image content in a single message
  3. Wire Media through agent pipeline (pkg/agent/context.go and pkg/agent/loop.go)

    • Add Media field to processOptions struct
    • Pass msg.Media from InboundMessage through the pipeline
    • Update BuildMessages to attach media to user message
    • Pass opts.Media to BuildMessages instead of nil

Testing

The changes enable:

  • Discord image attachments to flow through the entire agent pipeline
  • Vision models to receive image_url content parts in the correct OpenAI format
  • Backward compatibility (messages without media work as before)

Commits

  • 42725e0 - feat: add Media field to Message struct and implement serializeMessages for vision API support
  • 4b58912 - feat: wire Media through agent pipeline to enable vision model support

as3k added 2 commits February 20, 2026 14:32
…es for vision API support

- Add Media []string field to Message struct for image/media URLs
- Implement serializeMessages() to format messages with image_url content parts
- Enables OpenAI-compatible vision APIs to receive image attachments
- Add Media field to processOptions struct
- Pass msg.Media from InboundMessage to processOptions in processMessage
- Update BuildMessages to attach media to user message
- Pass opts.Media to BuildMessages instead of nil in runAgentLoop
- Enables vision-capable models (Gemini 2.0 Flash, etc.) to receive image_url content parts
Copilot AI review requested due to automatic review settings February 20, 2026 22:33
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds end-to-end vision/image support to the picoclaw agent pipeline, enabling vision-capable models like Gemini 2.0 Flash to process image attachments from Discord messages. The implementation follows the OpenAI-compatible vision API format with image_url content parts.

Changes:

  • Added Media []string field to the Message struct to store image/media URLs
  • Implemented serializeMessages() function to format messages with both text and image content for OpenAI-compatible APIs
  • Wired the Media field through the agent pipeline from InboundMessage to BuildMessages

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
pkg/providers/protocoltypes/types.go Added Media field to Message struct with appropriate JSON tag and documentation
pkg/providers/openai_compat/provider.go Implemented serializeMessages to format messages with image_url content parts for vision APIs
pkg/agent/loop.go Added Media field to processOptions and wired it through the message processing pipeline
pkg/agent/context.go Updated BuildMessages to accept and attach media to user messages, with proper condition handling

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +171 to +174
result = append(result, map[string]interface{}{
"role": m.Role,
"content": parts,
})
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a message contains Media, the tool_calls and tool_call_id fields are not serialized. If a message has both media attachments and tool calls (or is a tool response with media), this information would be lost. Consider handling these fields even when Media is present, or document if this combination is intentionally not supported.

Suggested change
result = append(result, map[string]interface{}{
"role": m.Role,
"content": parts,
})
msg := map[string]interface{}{
"role": m.Role,
"content": parts,
}
if m.ToolCallID != "" {
msg["tool_call_id"] = m.ToolCallID
}
if len(m.ToolCalls) > 0 {
msg["tool_calls"] = m.ToolCalls
}
result = append(result, msg)

Copilot uses AI. Check for mistakes.
Comment on lines +138 to +177
func serializeMessages(messages []Message) []map[string]interface{} {
result := make([]map[string]interface{}, 0, len(messages))
for _, m := range messages {
if len(m.Media) == 0 {
msg := map[string]interface{}{
"role": m.Role,
"content": m.Content,
}
if m.ToolCallID != "" {
msg["tool_call_id"] = m.ToolCallID
}
if len(m.ToolCalls) > 0 {
msg["tool_calls"] = m.ToolCalls
}
result = append(result, msg)
continue
}

parts := make([]map[string]interface{}, 0, 1+len(m.Media))
if m.Content != "" {
parts = append(parts, map[string]interface{}{
"type": "text",
"text": m.Content,
})
}
for _, mediaURL := range m.Media {
parts = append(parts, map[string]interface{}{
"type": "image_url",
"image_url": map[string]interface{}{
"url": mediaURL,
},
})
}
result = append(result, map[string]interface{}{
"role": m.Role,
"content": parts,
})
}
return result
}
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The serializeMessages function lacks test coverage. Given the comprehensive test suite for other provider functionality (e.g., TestProviderChat_ParsesToolCalls, TestProviderChat_StripsMoonshotPrefixAndNormalizesKimiTemperature), consider adding tests for message serialization with media attachments to ensure correct OpenAI API format compliance and prevent regressions. Test cases should cover: messages with media only, messages with both text and media, and messages without media (backward compatibility).

Copilot uses AI. Check for mistakes.
summary,
opts.UserMessage,
nil,
opts.Media,
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Media field contains URLs that are passed directly to external LLM APIs without validation. Consider adding validation to ensure URLs are from trusted sources (e.g., Discord CDN for Discord messages) and don't expose internal network resources. While the URLs originate from InboundMessage.Media which likely comes from Discord, explicit validation would prevent potential SSRF vulnerabilities if the source changes or if there are bugs in upstream URL extraction.

Copilot uses AI. Check for mistakes.
@shikihane
Copy link
Contributor

Hi @as3k 👋

We've implemented Feishu inbound image download on our side — when a user sends an image in Feishu, we download it via the SDK and pass it through mediaPaths to HandleMessage().

Your PR is the missing piece that would make it end-to-end: once Media flows through BuildMessages() and serializeMessages() formats the image_url parts, vision-capable LLMs will actually see the images.

We've opened a draft PR at #951 to track our work. Just wanted to let you know so we don't duplicate effort, and to confirm that our inbound download → your pipeline wiring → LLM is the intended flow.

shikihane added a commit to shikihane/picoclaw that referenced this pull request Mar 2, 2026
…#555)

Add Media field to processOptions, pass msg.Media from inbound
messages through to BuildMessages and serializeMessages so
vision-capable LLMs receive image_url content parts.

Based on work by @as3k in sipeed#555.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@shikihane
Copy link
Contributor

Hi @as3k, thanks for the original work on this — your vision pipeline design is clean and we've been referencing it.

We have a Feishu inbound image feature that depends on this pipeline, and it's been a blocker for our deployment for a while. Since this PR has merge conflicts with the latest main and has been inactive, we submitted our own implementation in #990, based on your approach (with attribution in our commit messages).

We also added resolveMediaRefs() to convert media:// refs to base64 data URLs before the LLM call (without this, MediaStore refs get rejected by the API), and fixed a reasoning_content preservation issue in serializeMessages().

If your PR gets updated and merged first, we're happy to close ours. Thanks again for the groundwork!

wj-xiao pushed a commit to wj-xiao/picoclaw that referenced this pull request Mar 3, 2026
…#555)

Add Media field to processOptions, pass msg.Media from inbound
messages through to BuildMessages and serializeMessages so
vision-capable LLMs receive image_url content parts.

Based on work by @as3k in sipeed#555.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
hyperwd pushed a commit to hyperwd/picoclaw that referenced this pull request Mar 5, 2026
…#555)

Add Media field to processOptions, pass msg.Media from inbound
messages through to BuildMessages and serializeMessages so
vision-capable LLMs receive image_url content parts.

Based on work by @as3k in sipeed#555.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Pluckypan pushed a commit to Pluckypan/picoclaw that referenced this pull request Mar 6, 2026
…#555)

Add Media field to processOptions, pass msg.Media from inbound
messages through to BuildMessages and serializeMessages so
vision-capable LLMs receive image_url content parts.

Based on work by @as3k in sipeed#555.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yinwm
Copy link
Collaborator

yinwm commented Mar 11, 2026

Closing this PR as it has been superseded by #990 which has already been merged.

The vision/image support feature has been implemented in #990 with additional improvements:

  • Added resolveMediaRefs() to convert media:// refs to base64 data URLs before LLM calls
  • Fixed reasoning_content preservation issue in serializeMessages()
  • Proper attribution to @as3k's original work has been included in the merged commits

Thanks @as3k for the groundwork on this feature!

@yinwm yinwm closed this Mar 11, 2026
dome pushed a commit to domeclaw/jibclaw that referenced this pull request Mar 22, 2026
Add Media field to processOptions, pass msg.Media from inbound
messages through to BuildMessages and serializeMessages so
vision-capable LLMs receive image_url content parts.

Based on work by @as3k in sipeed/picoclaw#555.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants