feat: add vision/image support to agent pipeline by as3k · Pull Request #555 · sipeed/picoclaw

as3k · 2026-02-20T22:33:38Z

Summary

This PR adds end-to-end vision/image support to the picoclaw agent pipeline, enabling vision-capable models like Gemini 2.0 Flash to receive and process image attachments.

Problem

Images extracted from Discord messages into InboundMessage.Media were being dropped (nil passed to BuildMessages), preventing vision-capable models from accessing image content.

Solution

Add Media field to Message struct (pkg/providers/protocoltypes/types.go)
- Added Media []string field to store image/media URLs
Implement serializeMessages() (pkg/providers/openai_compat/provider.go)
- Formats messages with image_url content parts for OpenAI-compatible vision APIs
- Handles both text and image content in a single message
Wire Media through agent pipeline (pkg/agent/context.go and pkg/agent/loop.go)
- Add Media field to processOptions struct
- Pass msg.Media from InboundMessage through the pipeline
- Update BuildMessages to attach media to user message
- Pass opts.Media to BuildMessages instead of nil

Testing

The changes enable:

Discord image attachments to flow through the entire agent pipeline
Vision models to receive image_url content parts in the correct OpenAI format
Backward compatibility (messages without media work as before)

Commits

42725e0 - feat: add Media field to Message struct and implement serializeMessages for vision API support
4b58912 - feat: wire Media through agent pipeline to enable vision model support

…es for vision API support - Add Media []string field to Message struct for image/media URLs - Implement serializeMessages() to format messages with image_url content parts - Enables OpenAI-compatible vision APIs to receive image attachments

- Add Media field to processOptions struct - Pass msg.Media from InboundMessage to processOptions in processMessage - Update BuildMessages to attach media to user message - Pass opts.Media to BuildMessages instead of nil in runAgentLoop - Enables vision-capable models (Gemini 2.0 Flash, etc.) to receive image_url content parts

Copilot

Pull request overview

This PR adds end-to-end vision/image support to the picoclaw agent pipeline, enabling vision-capable models like Gemini 2.0 Flash to process image attachments from Discord messages. The implementation follows the OpenAI-compatible vision API format with image_url content parts.

Changes:

Added Media []string field to the Message struct to store image/media URLs
Implemented serializeMessages() function to format messages with both text and image content for OpenAI-compatible APIs
Wired the Media field through the agent pipeline from InboundMessage to BuildMessages

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
pkg/providers/protocoltypes/types.go	Added Media field to Message struct with appropriate JSON tag and documentation
pkg/providers/openai_compat/provider.go	Implemented serializeMessages to format messages with image_url content parts for vision APIs
pkg/agent/loop.go	Added Media field to processOptions and wired it through the message processing pipeline
pkg/agent/context.go	Updated BuildMessages to accept and attach media to user messages, with proper condition handling

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-20T22:37:27Z

pkg/providers/openai_compat/provider.go

+		result = append(result, map[string]interface{}{
+			"role":    m.Role,
+			"content": parts,
+		})


When a message contains Media, the tool_calls and tool_call_id fields are not serialized. If a message has both media attachments and tool calls (or is a tool response with media), this information would be lost. Consider handling these fields even when Media is present, or document if this combination is intentionally not supported.

Suggested change

result = append(result, map[string]interface{}{

"role": m.Role,

"content": parts,

})

msg := map[string]interface{}{

"role": m.Role,

"content": parts,

}

if m.ToolCallID != "" {

msg["tool_call_id"] = m.ToolCallID

}

if len(m.ToolCalls) > 0 {

msg["tool_calls"] = m.ToolCalls

}

result = append(result, msg)

Copilot · 2026-02-20T22:37:28Z

pkg/providers/openai_compat/provider.go

+func serializeMessages(messages []Message) []map[string]interface{} {
+	result := make([]map[string]interface{}, 0, len(messages))
+	for _, m := range messages {
+		if len(m.Media) == 0 {
+			msg := map[string]interface{}{
+				"role":    m.Role,
+				"content": m.Content,
+			}
+			if m.ToolCallID != "" {
+				msg["tool_call_id"] = m.ToolCallID
+			}
+			if len(m.ToolCalls) > 0 {
+				msg["tool_calls"] = m.ToolCalls
+			}
+			result = append(result, msg)
+			continue
+		}
+
+		parts := make([]map[string]interface{}, 0, 1+len(m.Media))
+		if m.Content != "" {
+			parts = append(parts, map[string]interface{}{
+				"type": "text",
+				"text": m.Content,
+			})
+		}
+		for _, mediaURL := range m.Media {
+			parts = append(parts, map[string]interface{}{
+				"type": "image_url",
+				"image_url": map[string]interface{}{
+					"url": mediaURL,
+				},
+			})
+		}
+		result = append(result, map[string]interface{}{
+			"role":    m.Role,
+			"content": parts,
+		})
+	}
+	return result
+}


The serializeMessages function lacks test coverage. Given the comprehensive test suite for other provider functionality (e.g., TestProviderChat_ParsesToolCalls, TestProviderChat_StripsMoonshotPrefixAndNormalizesKimiTemperature), consider adding tests for message serialization with media attachments to ensure correct OpenAI API format compliance and prevent regressions. Test cases should cover: messages with media only, messages with both text and media, and messages without media (backward compatibility).

Copilot · 2026-02-20T22:37:28Z

pkg/agent/loop.go

 		summary,
 		opts.UserMessage,
-		nil,
+		opts.Media,


The Media field contains URLs that are passed directly to external LLM APIs without validation. Consider adding validation to ensure URLs are from trusted sources (e.g., Discord CDN for Discord messages) and don't expose internal network resources. While the URLs originate from InboundMessage.Media which likely comes from Discord, explicit validation would prevent potential SSRF vulnerabilities if the source changes or if there are bugs in upstream URL extraction.

shikihane · 2026-03-01T09:51:03Z

Hi @as3k 👋

We've implemented Feishu inbound image download on our side — when a user sends an image in Feishu, we download it via the SDK and pass it through mediaPaths to HandleMessage().

Your PR is the missing piece that would make it end-to-end: once Media flows through BuildMessages() and serializeMessages() formats the image_url parts, vision-capable LLMs will actually see the images.

We've opened a draft PR at #951 to track our work. Just wanted to let you know so we don't duplicate effort, and to confirm that our inbound download → your pipeline wiring → LLM is the intended flow.

@as3k

…#555) Add Media field to processOptions, pass msg.Media from inbound messages through to BuildMessages and serializeMessages so vision-capable LLMs receive image_url content parts. Based on work by @as3k in sipeed#555. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

shikihane · 2026-03-02T13:48:41Z

Hi @as3k, thanks for the original work on this — your vision pipeline design is clean and we've been referencing it.

We have a Feishu inbound image feature that depends on this pipeline, and it's been a blocker for our deployment for a while. Since this PR has merge conflicts with the latest main and has been inactive, we submitted our own implementation in #990, based on your approach (with attribution in our commit messages).

We also added resolveMediaRefs() to convert media:// refs to base64 data URLs before the LLM call (without this, MediaStore refs get rejected by the API), and fixed a reasoning_content preservation issue in serializeMessages().

If your PR gets updated and merged first, we're happy to close ours. Thanks again for the groundwork!

@as3k

…#555) Add Media field to processOptions, pass msg.Media from inbound messages through to BuildMessages and serializeMessages so vision-capable LLMs receive image_url content parts. Based on work by @as3k in sipeed#555. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@as3k

…#555) Add Media field to processOptions, pass msg.Media from inbound messages through to BuildMessages and serializeMessages so vision-capable LLMs receive image_url content parts. Based on work by @as3k in sipeed#555. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

CLAassistant · 2026-03-05T15:08:42Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

@as3k

…#555) Add Media field to processOptions, pass msg.Media from inbound messages through to BuildMessages and serializeMessages so vision-capable LLMs receive image_url content parts. Based on work by @as3k in sipeed#555. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

yinwm · 2026-03-11T18:58:31Z

Closing this PR as it has been superseded by #990 which has already been merged.

The vision/image support feature has been implemented in #990 with additional improvements:

Added resolveMediaRefs() to convert media:// refs to base64 data URLs before LLM calls
Fixed reasoning_content preservation issue in serializeMessages()
Proper attribution to @as3k's original work has been included in the merged commits

Thanks @as3k for the groundwork on this feature!

@as3k

Add Media field to processOptions, pass msg.Media from inbound messages through to BuildMessages and serializeMessages so vision-capable LLMs receive image_url content parts. Based on work by @as3k in sipeed/picoclaw#555. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

as3k added 2 commits February 20, 2026 14:32

Copilot AI review requested due to automatic review settings February 20, 2026 22:33

Copilot started reviewing on behalf of as3k February 20, 2026 22:34 View session

Copilot AI reviewed Feb 20, 2026

View reviewed changes

xiaket requested a review from yinwm February 22, 2026 05:18

shikihane mentioned this pull request Mar 1, 2026

feat(feishu): add inbound image message support #951

Closed

6 tasks

shikihane mentioned this pull request Mar 2, 2026

feat(agent): add vision/image support to agent pipeline #990

Merged

6 tasks

sipeed-bot bot added type: enhancement New feature or request domain: agent domain: provider labels Mar 3, 2026

yinwm closed this Mar 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add vision/image support to agent pipeline#555

feat: add vision/image support to agent pipeline#555
as3k wants to merge 2 commits intosipeed:mainfrom
as3k:fix/vision-image-support

as3k commented Feb 20, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Feb 20, 2026

Uh oh!

Copilot AI Feb 20, 2026

Uh oh!

Copilot AI Feb 20, 2026

Uh oh!

shikihane commented Mar 1, 2026

Uh oh!

shikihane commented Mar 2, 2026

Uh oh!

CLAassistant commented Mar 5, 2026

Uh oh!

yinwm commented Mar 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

as3k commented Feb 20, 2026

Summary

Problem

Solution

Testing

Commits

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

Uh oh!

shikihane commented Mar 1, 2026

Uh oh!

shikihane commented Mar 2, 2026

Uh oh!

CLAassistant commented Mar 5, 2026

Uh oh!

yinwm commented Mar 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants