feat(agent): add vision/image support with streaming base64 and filetype detection #1020
Conversation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Separate third-party imports from local module imports (gci)
- Fix byte slice literal formatting (gofumpt)
- Rename shadowed `err` variable to `ftErr` (govet)
- Remove trailing blank lines in test files
Code review: No issues found. Checked for bugs and CLAUDE.md compliance. 🤖 Generated with Claude Code

thanks for the pr
Add multipart content support for messages with Media attachments in CodexProvider (OAuth). This enables vision-capable models like gpt-5.2 to process images when using OAuth authentication.

Changes:
- Check for `msg.Media` in user messages
- Build `ResponseInputMessageContentListParam` with text + image parts
- Use `ResponseInputImageParam` with `ImageURL` and auto detail level

This mirrors the existing vision support in openai_compat/provider.go and completes the vision pipeline for OAuth users (PR sipeed#1020 only updated openai_compat).

Fixes: OAuth users couldn't use vision despite PR sipeed#1020 adding support.
@shikihane Solid rewrite of the vision support. Switching to …

We are building the PicoClaw Dev Group on Discord for contributors to connect and collaborate. If you are interested, send an email to …
The merged sipeed#1020 added vision support to the openai_compat provider but left the native Anthropic provider without image handling. This patch closes the gap: when a Message carries Media (base64 data URLs), the Anthropic adapter now builds NewImageBlockBase64 content blocks using the official SDK, giving users of `anthropic/`-prefixed models the same vision capability.

Changes:
- `buildParams`: handle `msg.Media` in user messages, convert data URLs to `anthropic.NewImageBlockBase64` content blocks
- `parseDataURL`: extract media type and base64 data from data URLs
- Tests for `parseDataURL` (8 cases) and media message building
…ine-v2 feat(agent): add vision/image support with streaming base64 and filetype detection
Description
Add vision/image support to the agent pipeline so that channels can pass media (images) from users to vision-capable LLMs.
This is a rewrite of #990 (which was reverted in #1010 due to dispute over file processing method). All three review comments from @mengzhuo on #990 have been addressed in this version.
Changes from reverted #990
| Area | Reverted #990 | This PR (using `h2non/filetype` for MIME detection) |
| --- | --- | --- |
| MIME detection | `mimeFromExtension()` — relied on file extension | `filetype.MatchFile()` — magic-bytes detection, no extension dependency |
| Base64 encoding | `os.ReadFile` + `base64.EncodeToString` — peak 2.6x file size in memory | `os.Open` → `base64.NewEncoder` → `io.Copy` — streaming, peak ~1.33x (base64 buffer only) |
| Size limit | `maxMediaFileSize` hardcoded const (20MB) | `config.MaxMediaSize` — JSON/env configurable, 20MB default via `GetMaxMediaSize()` |

Implementation
New file:
`pkg/agent/loop_media.go`
- `resolveMediaRefs()` function (extracted from loop.go for clarity)
- `os.Open` → `base64.NewEncoder` → `bytes.Buffer` → data URL
- `MediaMeta.ContentType` from store, fallback to `filetype.MatchFile()` (magic bytes)
- Enforces `config.MaxMediaSize` with warning log

Modified:
`pkg/providers/openai_compat/provider.go`
- `serializeMessages()` replaces `stripSystemParts()`
- Builds multipart user content (`image_url` content parts with data URLs)
- Preserves `reasoning_content`, `tool_calls`, `tool_call_id`

Pipeline wiring:
- `Message.Media []string` field on protocol types
- `processOptions.Media` threads media refs from `InboundMessage` through to `BuildMessages()`
- `resolveMediaRefs()` called before LLM to convert `media://` refs → `data:` URLs

Data flow
Test coverage
- `resolveMediaRefs`: 6 tests (resolve to base64, skip oversized, skip unknown type, passthrough non-media URLs, immutability, metadata content-type preference)
- `serializeMessages`: 4 tests (plain text, with media, media + tool_call_id, system message stripping)

E2E verification
Tested on Radxa Cubie A7A (arm64) with Feishu channel → Dashscope LLM (kimi-k2.5). Sent image via Feishu, LLM correctly identified image content.
Type of Change
AI Code Generation
Test Environment
Checklist
All tests pass (`make test`)