fix(agent): prevent history compression from orphaning tool_call/tool_result pairs#665
fix(agent): prevent history compression from orphaning tool_call/tool_result pairs#665nikolasdehor wants to merge 2 commits intosipeed:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR aims to prevent provider (notably Anthropic) 400s during history compression/truncation by sanitizing conversation history so tool_call ↔ tool_result pairs aren’t orphaned after forceCompression() and summarizeSession() modify session history.
Changes:
- Added
sanitizeToolPairs()helper to remove/adjust orphaned tool-related messages after history cuts. - Integrated sanitization into
forceCompression()and thesummarizeSession()truncation path. - Added unit tests covering a variety of tool-pair boundary scenarios.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| pkg/agent/sanitize.go | Introduces sanitizeToolPairs() and logging around removed/modified messages. |
| pkg/agent/sanitize_test.go | Adds unit tests for sanitization behavior across multiple tool-pair scenarios. |
| pkg/agent/loop.go | Calls sanitization after forced compression and after truncating history post-summary. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
pkg/agent/loop.go
Outdated
| if len(sanitized) != len(remaining) { | ||
| agent.Sessions.SetHistory(sessionKey, sanitized) | ||
| } |
There was a problem hiding this comment.
In summarizeSession, the sanitized history is only written back when the slice length changes. sanitizeToolPairs can modify messages without changing length (e.g., stripping ToolCalls but keeping assistant text), so this condition can silently keep invalid history and still trigger provider 400s. Consider always calling SetHistory after sanitization, or have sanitizeToolPairs return a (sanitized, changed) flag and check that instead of len().
| if len(sanitized) != len(remaining) { | |
| agent.Sessions.SetHistory(sessionKey, sanitized) | |
| } | |
| agent.Sessions.SetHistory(sessionKey, sanitized) |
There was a problem hiding this comment.
You are right -- the len() comparison misses in-place modifications. sanitizeToolPairs can strip ToolCalls from an assistant message while preserving the text content, which changes the message without changing the slice length. Will fix this to always call SetHistory after sanitization, or track a modified flag alongside the length check.
pkg/agent/sanitize.go
Outdated
| // Build set of tool_call IDs present in assistant messages | ||
| toolCallIDs := make(map[string]bool) | ||
| for _, m := range messages { | ||
| if m.Role == "assistant" && len(m.ToolCalls) > 0 { | ||
| for _, tc := range m.ToolCalls { | ||
| if tc.ID != "" { | ||
| toolCallIDs[tc.ID] = true | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // Build set of tool_result IDs present | ||
| toolResultIDs := make(map[string]bool) | ||
| for _, m := range messages { | ||
| if m.Role == "tool" && m.ToolCallID != "" { | ||
| toolResultIDs[m.ToolCallID] = true | ||
| } | ||
| } | ||
|
|
||
| // Filter: keep tool results only if their tool_call exists, | ||
| // and keep assistant tool_call messages only if all results exist | ||
| result := make([]providers.Message, 0, len(messages)) | ||
| removed := 0 | ||
|
|
||
| for _, m := range messages { | ||
| switch { | ||
| case m.Role == "tool" && m.ToolCallID != "": | ||
| // Keep tool result only if its tool_call is present | ||
| if toolCallIDs[m.ToolCallID] { | ||
| result = append(result, m) | ||
| } else { | ||
| removed++ | ||
| logger.DebugCF("agent", "sanitizeToolPairs: removing orphaned tool result", | ||
| map[string]interface{}{"tool_call_id": m.ToolCallID}) | ||
| } |
There was a problem hiding this comment.
sanitizeToolPairs currently decides whether to keep a tool result by checking if the ToolCallID exists anywhere in the input history (toolCallIDs), not whether the corresponding assistant ToolCalls message is retained and appears before the tool result in the sanitized output. This can keep tool results that come before their tool_call (after truncation) and can also keep tool results for an assistant tool_call message that the function later drops/strips, producing new orphans mid-history and still violating the provider contract. Recommend making the sanitization order-aware (pair tool results only with a preceding retained assistant tool_calls message) and ensuring that if an assistant tool_calls message is removed/stripped, any tool results for its IDs are removed as well (or alternatively keep only the subset of ToolCalls that actually have results and drop the rest consistently).
There was a problem hiding this comment.
This is the most substantive feedback -- thank you. You are correct that the current implementation is not fully order-aware: it builds a set of all tool_call IDs across the entire history and then uses membership to decide whether to keep tool results. After truncation, a tool result could appear before its corresponding tool_call, or the tool_call message could be removed while its results are kept.
I will refactor sanitizeToolPairs to be a single forward pass that:
- Tracks which tool_call IDs have been seen from retained assistant messages
- Only keeps tool results whose IDs match a previously-seen, retained assistant tool_call
- If an assistant tool_call message is stripped/dropped, also drops any subsequent tool results for those IDs
This makes the algorithm order-aware and eliminates the category of bugs you described.
pkg/agent/sanitize.go
Outdated
| } else if m.Content != "" { | ||
| // Keep the text content but strip the tool calls | ||
| removed++ | ||
| logger.DebugCF("agent", "sanitizeToolPairs: stripping orphaned tool_calls from assistant message, keeping text content", | ||
| map[string]interface{}{"tool_call_count": len(m.ToolCalls)}) | ||
| result = append(result, providers.Message{ | ||
| Role: "assistant", | ||
| Content: m.Content, | ||
| }) |
There was a problem hiding this comment.
The removed_count bookkeeping is inaccurate: in the branch that strips ToolCalls but keeps the assistant message text, removed++ is incremented even though no message is removed (only modified). This makes the warning log misleading and complicates debugging/monitoring. Consider tracking removed vs modified separately (or only incrementing removed when the message is actually dropped).
There was a problem hiding this comment.
Fair point. The removed_count is misleading when the message is modified in-place rather than dropped. Will split the logging into removed (message actually dropped from the slice) vs modified (message kept but ToolCalls stripped) for clearer diagnostics.
| // The assistant msg is dropped (not all tool_calls have results). | ||
| // call_1's tool result stays because toolCallIDs includes call_1 | ||
| // from the original scan. This is a secondary orphan handled by | ||
| // the existing Diegox-17 fix in BuildMessages (strips leading tool msgs). | ||
| if len(result) != 3 { | ||
| t.Errorf("expected 3 messages, got %d", len(result)) | ||
| } | ||
| if result[0].Role != "user" { | ||
| t.Errorf("expected first message to be user, got %s", result[0].Role) | ||
| } |
There was a problem hiding this comment.
TestSanitizeToolPairs_MultiToolCallPartialResults encodes behavior where the assistant tool_call message is dropped but tool results for its IDs are still kept (comment says it relies on the older BuildMessages leading-tool-strip). This leaves an orphaned tool result in the middle of history, which is exactly what sanitizeToolPairs is intended to prevent, so the test currently allows the core bug to persist/regress. After fixing sanitizeToolPairs to remove tool results when their corresponding tool_call message is removed/stripped (and to be order-aware), update this test to assert the fully sanitized output rather than depending on BuildMessages’ unrelated leading-tool cleanup.
| // The assistant msg is dropped (not all tool_calls have results). | |
| // call_1's tool result stays because toolCallIDs includes call_1 | |
| // from the original scan. This is a secondary orphan handled by | |
| // the existing Diegox-17 fix in BuildMessages (strips leading tool msgs). | |
| if len(result) != 3 { | |
| t.Errorf("expected 3 messages, got %d", len(result)) | |
| } | |
| if result[0].Role != "user" { | |
| t.Errorf("expected first message to be user, got %s", result[0].Role) | |
| } | |
| // The assistant msg is dropped because not all tool_calls have results. | |
| // Any tool results corresponding to that assistant's tool_calls should | |
| // also be dropped, so no orphaned tool messages remain. | |
| if len(result) != 2 { | |
| t.Errorf("expected 2 messages after sanitization, got %d", len(result)) | |
| } | |
| if result[0].Role != "user" || result[0].Content != "Do two things" { | |
| t.Errorf("unexpected first message: role=%s content=%q", result[0].Role, result[0].Content) | |
| } | |
| if result[1].Role != "user" || result[1].Content != "OK" { | |
| t.Errorf("unexpected second message: role=%s content=%q", result[1].Role, result[1].Content) | |
| } |
There was a problem hiding this comment.
Agreed -- this test currently encodes the "leaky" behavior where orphaned tool results are left for BuildMessages to clean up. After refactoring sanitizeToolPairs to be order-aware (per the comment above), this test should assert that the tool result is also dropped when its corresponding assistant tool_call message is removed, rather than relying on a downstream cleanup layer.
Will update the test expectations as part of the refactor.
pkg/agent/sanitize_test.go
Outdated
| t.Errorf("expected 4 messages, got %d", len(result)) | ||
| } |
There was a problem hiding this comment.
TestSanitizeToolPairs_CompletePairs only asserts the resulting length, so it would still pass if sanitizeToolPairs accidentally strips ToolCalls or drops the tool result while keeping four messages. Adding assertions that the assistant ToolCalls and matching tool ToolCallID are preserved would make this test actually validate the intended invariant.
| t.Errorf("expected 4 messages, got %d", len(result)) | |
| } | |
| t.Fatalf("expected 4 messages, got %d", len(result)) | |
| } | |
| // Verify roles and contents are preserved in order | |
| if result[0].Role != "user" || result[0].Content != "What is the weather?" { | |
| t.Errorf("unexpected first message: %#v", result[0]) | |
| } | |
| // Assistant tool_call message should still have the same ToolCalls | |
| assistantWithTool := result[1] | |
| if assistantWithTool.Role != "assistant" { | |
| t.Errorf("expected second message role=assistant, got %s", assistantWithTool.Role) | |
| } | |
| if len(assistantWithTool.ToolCalls) != 1 { | |
| t.Fatalf("expected second message to have 1 ToolCall, got %d", len(assistantWithTool.ToolCalls)) | |
| } | |
| if assistantWithTool.ToolCalls[0].ID != "call_1" { | |
| t.Errorf("expected ToolCall ID 'call_1', got %q", assistantWithTool.ToolCalls[0].ID) | |
| } | |
| if assistantWithTool.ToolCalls[0].Name != "get_weather" { | |
| t.Errorf("expected ToolCall Name 'get_weather', got %q", assistantWithTool.ToolCalls[0].Name) | |
| } | |
| // Tool result should still be present and linked via ToolCallID | |
| toolMsg := result[2] | |
| if toolMsg.Role != "tool" { | |
| t.Errorf("expected third message role=tool, got %s", toolMsg.Role) | |
| } | |
| if toolMsg.ToolCallID != "call_1" { | |
| t.Errorf("expected ToolCallID 'call_1', got %q", toolMsg.ToolCallID) | |
| } | |
| if toolMsg.Content != `{"temp": 72}` { | |
| t.Errorf("unexpected tool content: %q", toolMsg.Content) | |
| } | |
| // Final assistant response should be preserved | |
| finalAssistant := result[3] | |
| if finalAssistant.Role != "assistant" { | |
| t.Errorf("expected fourth message role=assistant, got %s", finalAssistant.Role) | |
| } | |
| if finalAssistant.Content != "The temperature is 72 degrees." { | |
| t.Errorf("unexpected final assistant content: %q", finalAssistant.Content) | |
| } |
There was a problem hiding this comment.
Good point -- length-only assertions are brittle. Will add full content assertions (role, content, ToolCalls, ToolCallID) to TestSanitizeToolPairs_CompletePairs to verify the complete-pair scenario preserves all fields correctly.
|
Update: PR #730 (merged by winterfx) addresses the same class of bug in
This PR and #730 are complementary layers — one prevents corruption at write time, the other at read time. |
…_result pairs When forceCompression() or summarizeSession() truncates conversation history, the cut point can fall between a tool_call and its tool_result, creating orphaned pairs that cause Anthropic API 400 errors. Add sanitizeToolPairs() to remove orphaned tool pairs after compression: - Tool results without matching tool_calls are removed - Assistant messages with tool_calls missing results are stripped or removed - Assistant messages with both text content and orphaned tool_calls keep the text but drop the tool_calls Applied in both forceCompression() and summarizeSession() paths. Closes sipeed#475
- Track retained tool_call IDs to eliminate secondary orphans: when an assistant message is dropped, its tool results are also dropped - Split removed/modified counters for accurate logging - Always call SetHistory after sanitization (not conditional on len) - Add full field assertions to CompletePairs test - Fix MultiToolCallPartialResults test: secondary orphans now cleaned
67409f3 to
52e510b
Compare
|
|
…and safe compression Separate context_window from max_tokens — they serve different purposes (input capacity vs output generation limit). The previous conflation caused premature summarization or missed compression triggers. Changes: - Add context_window field to AgentDefaults config (default: 4x max_tokens) - Extract boundary-safe truncation helpers (isSafeBoundary, findSafeBoundary) into context_budget.go — pure functions with no AgentLoop dependency - forceCompression: align split to safe boundary so tool-call sequences (assistant+ToolCalls → tool results) are never torn apart - summarizeSession: use findSafeBoundary instead of hardcoded keep-last-4 - estimateTokens: count ToolCalls arguments and ToolCallID metadata, not just Content — fixes systematic undercounting in tool-heavy sessions - Add proactive context budget check before LLM call in runAgentLoop, preventing 400 context-length errors instead of reacting to them - Add estimateToolDefsTokens for tool definition token cost Closes sipeed#556, closes sipeed#665 Ref sipeed#1439
Summary
sanitizeToolPairs()helper that removes orphaned tool_call/tool_result messages after history compressionforceCompression()andsummarizeSession()pathstool_use ids were provided that do not have a tool_use block)Root Cause
When conversation history exceeds context limits,
forceCompression()splits at the midpoint (mid := len(conversation) / 2) andsummarizeSession()keeps only the last 4 messages viaTruncateHistory(key, 4). Neither checks whether the cut point falls between atool_calland itstool_result, creating orphaned pairs that violate Anthropic's API contract.The existing partial fix by Diegox-17 in
context.goonly strips orphanedtoolmessages from the very start of history, missing orphanedassistantmessages withToolCallsat the end or broken pairs created by compression.Fix
sanitizeToolPairs()scans for tool_call IDs and tool_result IDs, then filters:Test Plan
sanitizeToolPairs()covering 9 scenariosgo test ./pkg/... -count=1)go build ./pkg/...succeedsCloses #475