feat(agent): add vision/image support to agent pipeline#990
feat(agent): add vision/image support to agent pipeline#990Orgmar merged 5 commits intosipeed:mainfrom
Conversation
…es for vision API support - Add Media []string field to Message struct for image/media URLs - Implement serializeMessages() to format messages with image_url content parts - Enables OpenAI-compatible vision APIs to receive image attachments
…#555) Add Media field to processOptions, pass msg.Media from inbound messages through to BuildMessages and serializeMessages so vision-capable LLMs receive image_url content parts. Based on work by @as3k in sipeed#555. Co-Authored-By: Claude Opus 4.6 <[email protected]>
The serializeMessages() function was not preserving the reasoning_content field when serializing messages for vision API calls. This caused the TestProviderChat_PreservesReasoningContentInHistory test to fail. This fix ensures reasoning_content is included in both text-only messages and vision messages with media attachments. Co-authored-by: Zachary Guerrero <[email protected]>
…data URLs Without this function, media:// refs stored by MediaStore are passed directly to the LLM API, which rejects them as invalid URLs. resolveMediaRefs() runs after BuildMessages() and before the LLM call, converting each media:// ref to a data:image/...;base64,... URL that vision-capable models can process. Also adds mimeFromExtension() helper for MIME type inference from file extensions when ContentType metadata is not available.
Code Review: Vision/Image Support for Agent PipelineThanks for the contribution! The overall architecture is solid, but I found a few issues that should be addressed before merging. 🔴 Must Fix1. `serializeMessages` drops `ToolCallID` and `ToolCalls` when Media is present```go While "tool calls with images" may be rare, this inconsistency could cause hard-to-debug issues. Suggest adding: ```go 2. Missing Unit TestsThe PR mentions `All existing tests pass`, but the new functions have no test coverage:
This is critical since `serializeMessages` directly affects the API request format. 🟡 Suggestions3. Memory Risk in `resolveMediaRefs````go Reading entire files into memory. For high-res images (4K screenshots can be 10MB+), this could cause OOM under concurrent load. Suggestion: Add file size limit: info, err := os.Stat(localPath) 4. Silent Failure on Media Resolution ErrorsWhen image resolution fails, only a warning is logged - users get no feedback: ```go Suggestion: Consider adding a hint in the response or propagating the error. 5. `mimeFromExtension` Default Return Value```go Suggestion: For unknown extensions, skip or return an error instead of guessing jpeg. ✅ What's Done Well
Summary
Recommendation: Please add unit tests (at minimum for `serializeMessages`) and fix the ToolCallID/ToolCalls bug before merging. |
- serializeMessages: preserve ToolCallID/ToolCalls when Media is present - resolveMediaRefs: add 20MB file size limit to prevent OOM - mimeFromExtension: return empty string for unknown extensions - Add 11 unit tests for serializeMessages, resolveMediaRefs, mimeFromExtension Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
|
||
| // mimeFromExtension returns a MIME type for common image extensions. | ||
| // Returns empty string for unrecognized extensions. | ||
| func mimeFromExtension(ext string) string { |
There was a problem hiding this comment.
Please use github.com/h2non/filetype for file type detection
| continue | ||
| } | ||
|
|
||
| dataURL := "data:" + mime + ";base64," + base64.StdEncoding.EncodeToString(data) |
There was a problem hiding this comment.
Could we use file handler and encoder for later use, instead of allocation 2X of memory for media files.
|
|
||
| // maxMediaFileSize is the maximum file size (20 MB) for media resolution. | ||
| // Files larger than this are skipped to prevent OOM under concurrent load. | ||
| const maxMediaFileSize = 20 * 1024 * 1024 |
There was a problem hiding this comment.
Bake size limitation into config.
feat(agent): add vision/image support to agent pipeline
feat(agent): add vision/image support to agent pipeline
Description
Add vision/image support to the agent pipeline so that channels can pass media (images) from users to vision-capable LLMs.
Changes
Message.Mediafield (protocoltypes/types.go) — new[]stringfield for media URLs/refsserializeMessages()(openai_compat/provider.go) — converts messages with Media into OpenAI vision API multi-part format (image_urlcontent parts). Also preservesreasoning_contentfor thinking models.agent/loop.go,agent/context.go) — threadsMediafromInboundMessagethroughprocessOptions→BuildMessages()→ LLM callresolveMediaRefs()(agent/loop.go) — convertsmedia://refs (from MediaStore) todata:image/...;base64,...data URLs before the LLM call, so sessions store lightweight refs while the LLM receives full image dataHow it works
Relation to PR #555
This PR covers similar ground to #555 by @as3k (with attribution in commit messages). We needed this functionality urgently for our Feishu channel deployment and #555 has been inactive with merge conflicts, so we went ahead with our own implementation. Key additions beyond #555:
resolveMediaRefs()for MediaStore integration (without this,media://refs are sent raw to the LLM and rejected)reasoning_contentpreservation inserializeMessages()Type of Change
AI Code Generation
Test Environment
Checklist
make teston affected packages)