Skip to content

fix(agent): prevent history compression from orphaning tool_call/tool_result pairs#665

Closed
nikolasdehor wants to merge 2 commits intosipeed:mainfrom
nikolasdehor:fix/tool-pair-orphan-475
Closed

fix(agent): prevent history compression from orphaning tool_call/tool_result pairs#665
nikolasdehor wants to merge 2 commits intosipeed:mainfrom
nikolasdehor:fix/tool-pair-orphan-475

Conversation

@nikolasdehor
Copy link

Summary

  • Adds sanitizeToolPairs() helper that removes orphaned tool_call/tool_result messages after history compression
  • Applied in both forceCompression() and summarizeSession() paths
  • Prevents Anthropic API 400 errors caused by orphaned tool pairs (tool_use ids were provided that do not have a tool_use block)

Root Cause

When conversation history exceeds context limits, forceCompression() splits at the midpoint (mid := len(conversation) / 2) and summarizeSession() keeps only the last 4 messages via TruncateHistory(key, 4). Neither checks whether the cut point falls between a tool_call and its tool_result, creating orphaned pairs that violate Anthropic's API contract.

The existing partial fix by Diegox-17 in context.go only strips orphaned tool messages from the very start of history, missing orphaned assistant messages with ToolCalls at the end or broken pairs created by compression.

Fix

sanitizeToolPairs() scans for tool_call IDs and tool_result IDs, then filters:

  • Orphaned tool results (no matching tool_call) are removed
  • Orphaned assistant tool_calls (missing results) with text content keep the text but drop the tool_calls
  • Orphaned assistant tool_calls without text content are removed entirely
  • Complete pairs are preserved unchanged

Test Plan

  • Unit tests for sanitizeToolPairs() covering 9 scenarios
  • Normal messages without tool pairs pass through unchanged
  • Complete tool pairs are preserved
  • Orphaned tool_result at boundary is removed
  • Orphaned tool_call at boundary is removed
  • Orphaned tool_call with text content keeps text, strips tool_calls
  • Multiple consecutive complete pairs are preserved
  • Multi-tool-call with partial results handled correctly
  • Mixed orphaned and complete pairs handled correctly
  • Empty messages handled correctly
  • All existing tests pass (go test ./pkg/... -count=1)
  • go build ./pkg/... succeeds

Closes #475

Copilot AI review requested due to automatic review settings February 23, 2026 00:45
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to prevent provider (notably Anthropic) 400s during history compression/truncation by sanitizing conversation history so tool_calltool_result pairs aren’t orphaned after forceCompression() and summarizeSession() modify session history.

Changes:

  • Added sanitizeToolPairs() helper to remove/adjust orphaned tool-related messages after history cuts.
  • Integrated sanitization into forceCompression() and the summarizeSession() truncation path.
  • Added unit tests covering a variety of tool-pair boundary scenarios.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
pkg/agent/sanitize.go Introduces sanitizeToolPairs() and logging around removed/modified messages.
pkg/agent/sanitize_test.go Adds unit tests for sanitization behavior across multiple tool-pair scenarios.
pkg/agent/loop.go Calls sanitization after forced compression and after truncating history post-summary.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +933 to +935
if len(sanitized) != len(remaining) {
agent.Sessions.SetHistory(sessionKey, sanitized)
}
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In summarizeSession, the sanitized history is only written back when the slice length changes. sanitizeToolPairs can modify messages without changing length (e.g., stripping ToolCalls but keeping assistant text), so this condition can silently keep invalid history and still trigger provider 400s. Consider always calling SetHistory after sanitization, or have sanitizeToolPairs return a (sanitized, changed) flag and check that instead of len().

Suggested change
if len(sanitized) != len(remaining) {
agent.Sessions.SetHistory(sessionKey, sanitized)
}
agent.Sessions.SetHistory(sessionKey, sanitized)

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right -- the len() comparison misses in-place modifications. sanitizeToolPairs can strip ToolCalls from an assistant message while preserving the text content, which changes the message without changing the slice length. Will fix this to always call SetHistory after sanitization, or track a modified flag alongside the length check.

Comment on lines +16 to +51
// Build set of tool_call IDs present in assistant messages
toolCallIDs := make(map[string]bool)
for _, m := range messages {
if m.Role == "assistant" && len(m.ToolCalls) > 0 {
for _, tc := range m.ToolCalls {
if tc.ID != "" {
toolCallIDs[tc.ID] = true
}
}
}
}

// Build set of tool_result IDs present
toolResultIDs := make(map[string]bool)
for _, m := range messages {
if m.Role == "tool" && m.ToolCallID != "" {
toolResultIDs[m.ToolCallID] = true
}
}

// Filter: keep tool results only if their tool_call exists,
// and keep assistant tool_call messages only if all results exist
result := make([]providers.Message, 0, len(messages))
removed := 0

for _, m := range messages {
switch {
case m.Role == "tool" && m.ToolCallID != "":
// Keep tool result only if its tool_call is present
if toolCallIDs[m.ToolCallID] {
result = append(result, m)
} else {
removed++
logger.DebugCF("agent", "sanitizeToolPairs: removing orphaned tool result",
map[string]interface{}{"tool_call_id": m.ToolCallID})
}
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sanitizeToolPairs currently decides whether to keep a tool result by checking if the ToolCallID exists anywhere in the input history (toolCallIDs), not whether the corresponding assistant ToolCalls message is retained and appears before the tool result in the sanitized output. This can keep tool results that come before their tool_call (after truncation) and can also keep tool results for an assistant tool_call message that the function later drops/strips, producing new orphans mid-history and still violating the provider contract. Recommend making the sanitization order-aware (pair tool results only with a preceding retained assistant tool_calls message) and ensuring that if an assistant tool_calls message is removed/stripped, any tool results for its IDs are removed as well (or alternatively keep only the subset of ToolCalls that actually have results and drop the rest consistently).

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the most substantive feedback -- thank you. You are correct that the current implementation is not fully order-aware: it builds a set of all tool_call IDs across the entire history and then uses membership to decide whether to keep tool results. After truncation, a tool result could appear before its corresponding tool_call, or the tool_call message could be removed while its results are kept.

I will refactor sanitizeToolPairs to be a single forward pass that:

  1. Tracks which tool_call IDs have been seen from retained assistant messages
  2. Only keeps tool results whose IDs match a previously-seen, retained assistant tool_call
  3. If an assistant tool_call message is stripped/dropped, also drops any subsequent tool results for those IDs

This makes the algorithm order-aware and eliminates the category of bugs you described.

Comment on lines +64 to +72
} else if m.Content != "" {
// Keep the text content but strip the tool calls
removed++
logger.DebugCF("agent", "sanitizeToolPairs: stripping orphaned tool_calls from assistant message, keeping text content",
map[string]interface{}{"tool_call_count": len(m.ToolCalls)})
result = append(result, providers.Message{
Role: "assistant",
Content: m.Content,
})
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removed_count bookkeeping is inaccurate: in the branch that strips ToolCalls but keeps the assistant message text, removed++ is incremented even though no message is removed (only modified). This makes the warning log misleading and complicates debugging/monitoring. Consider tracking removed vs modified separately (or only incrementing removed when the message is actually dropped).

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. The removed_count is misleading when the message is modified in-place rather than dropped. Will split the logging into removed (message actually dropped from the slice) vs modified (message kept but ToolCalls stripped) for clearer diagnostics.

Comment on lines +150 to +159
// The assistant msg is dropped (not all tool_calls have results).
// call_1's tool result stays because toolCallIDs includes call_1
// from the original scan. This is a secondary orphan handled by
// the existing Diegox-17 fix in BuildMessages (strips leading tool msgs).
if len(result) != 3 {
t.Errorf("expected 3 messages, got %d", len(result))
}
if result[0].Role != "user" {
t.Errorf("expected first message to be user, got %s", result[0].Role)
}
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestSanitizeToolPairs_MultiToolCallPartialResults encodes behavior where the assistant tool_call message is dropped but tool results for its IDs are still kept (comment says it relies on the older BuildMessages leading-tool-strip). This leaves an orphaned tool result in the middle of history, which is exactly what sanitizeToolPairs is intended to prevent, so the test currently allows the core bug to persist/regress. After fixing sanitizeToolPairs to remove tool results when their corresponding tool_call message is removed/stripped (and to be order-aware), update this test to assert the fully sanitized output rather than depending on BuildMessages’ unrelated leading-tool cleanup.

Suggested change
// The assistant msg is dropped (not all tool_calls have results).
// call_1's tool result stays because toolCallIDs includes call_1
// from the original scan. This is a secondary orphan handled by
// the existing Diegox-17 fix in BuildMessages (strips leading tool msgs).
if len(result) != 3 {
t.Errorf("expected 3 messages, got %d", len(result))
}
if result[0].Role != "user" {
t.Errorf("expected first message to be user, got %s", result[0].Role)
}
// The assistant msg is dropped because not all tool_calls have results.
// Any tool results corresponding to that assistant's tool_calls should
// also be dropped, so no orphaned tool messages remain.
if len(result) != 2 {
t.Errorf("expected 2 messages after sanitization, got %d", len(result))
}
if result[0].Role != "user" || result[0].Content != "Do two things" {
t.Errorf("unexpected first message: role=%s content=%q", result[0].Role, result[0].Content)
}
if result[1].Role != "user" || result[1].Content != "OK" {
t.Errorf("unexpected second message: role=%s content=%q", result[1].Role, result[1].Content)
}

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed -- this test currently encodes the "leaky" behavior where orphaned tool results are left for BuildMessages to clean up. After refactoring sanitizeToolPairs to be order-aware (per the comment above), this test should assert that the tool result is also dropped when its corresponding assistant tool_call message is removed, rather than relying on a downstream cleanup layer.

Will update the test expectations as part of the refactor.

Comment on lines +44 to +45
t.Errorf("expected 4 messages, got %d", len(result))
}
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestSanitizeToolPairs_CompletePairs only asserts the resulting length, so it would still pass if sanitizeToolPairs accidentally strips ToolCalls or drops the tool result while keeping four messages. Adding assertions that the assistant ToolCalls and matching tool ToolCallID are preserved would make this test actually validate the intended invariant.

Suggested change
t.Errorf("expected 4 messages, got %d", len(result))
}
t.Fatalf("expected 4 messages, got %d", len(result))
}
// Verify roles and contents are preserved in order
if result[0].Role != "user" || result[0].Content != "What is the weather?" {
t.Errorf("unexpected first message: %#v", result[0])
}
// Assistant tool_call message should still have the same ToolCalls
assistantWithTool := result[1]
if assistantWithTool.Role != "assistant" {
t.Errorf("expected second message role=assistant, got %s", assistantWithTool.Role)
}
if len(assistantWithTool.ToolCalls) != 1 {
t.Fatalf("expected second message to have 1 ToolCall, got %d", len(assistantWithTool.ToolCalls))
}
if assistantWithTool.ToolCalls[0].ID != "call_1" {
t.Errorf("expected ToolCall ID 'call_1', got %q", assistantWithTool.ToolCalls[0].ID)
}
if assistantWithTool.ToolCalls[0].Name != "get_weather" {
t.Errorf("expected ToolCall Name 'get_weather', got %q", assistantWithTool.ToolCalls[0].Name)
}
// Tool result should still be present and linked via ToolCallID
toolMsg := result[2]
if toolMsg.Role != "tool" {
t.Errorf("expected third message role=tool, got %s", toolMsg.Role)
}
if toolMsg.ToolCallID != "call_1" {
t.Errorf("expected ToolCallID 'call_1', got %q", toolMsg.ToolCallID)
}
if toolMsg.Content != `{"temp": 72}` {
t.Errorf("unexpected tool content: %q", toolMsg.Content)
}
// Final assistant response should be preserved
finalAssistant := result[3]
if finalAssistant.Role != "assistant" {
t.Errorf("expected fourth message role=assistant, got %s", finalAssistant.Role)
}
if finalAssistant.Content != "The temperature is 72 degrees." {
t.Errorf("unexpected final assistant content: %q", finalAssistant.Content)
}

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point -- length-only assertions are brittle. Will add full content assertions (role, content, ToolCalls, ToolCallID) to TestSanitizeToolPairs_CompletePairs to verify the complete-pair scenario preserves all fields correctly.

@nikolasdehor
Copy link
Author

Update: PR #730 (merged by winterfx) addresses the same class of bug in sanitizeHistoryForProvider() at the context.go layer. Our fix remains valuable because:

  1. Defense in depth: PR fix: multi-tool-call results incorrectly dropped causing LLM API 400 #730 sanitizes at query time (BuildMessages), but corrupted history is still persisted to the session store. Our fix prevents the corruption from being stored in the first place.
  2. Covers summarizeSession(): Neither fix: multi-tool-call results incorrectly dropped causing LLM API 400 #730 nor the new PR fix: preserve tool call/response pairing in forceCompression #871 addresses the TruncateHistory(key, 4) path in summarizeSession(), which can equally orphan tool pairs.
  3. Handles partial assistant messages: Our sanitizeToolPairs() correctly handles assistant messages with both text content AND tool_calls (strips tool_calls, preserves text). PR fix: preserve tool call/response pairing in forceCompression #871's removeOrphanedToolMessages() only strips leading tool messages and does not handle orphaned assistant messages with ToolCalls.

This PR and #730 are complementary layers — one prevents corruption at write time, the other at read time.

…_result pairs

When forceCompression() or summarizeSession() truncates conversation
history, the cut point can fall between a tool_call and its tool_result,
creating orphaned pairs that cause Anthropic API 400 errors.

Add sanitizeToolPairs() to remove orphaned tool pairs after compression:
- Tool results without matching tool_calls are removed
- Assistant messages with tool_calls missing results are stripped or removed
- Assistant messages with both text content and orphaned tool_calls keep
  the text but drop the tool_calls

Applied in both forceCompression() and summarizeSession() paths.

Closes sipeed#475
- Track retained tool_call IDs to eliminate secondary orphans: when an
  assistant message is dropped, its tool results are also dropped
- Split removed/modified counters for accurate logging
- Always call SetHistory after sanitization (not conditional on len)
- Add full field assertions to CompletePairs test
- Fix MultiToolCallPartialResults test: secondary orphans now cleaned
@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

is-Xiaoen added a commit to is-Xiaoen/picoclaw that referenced this pull request Mar 13, 2026
…and safe compression

Separate context_window from max_tokens — they serve different purposes
(input capacity vs output generation limit). The previous conflation caused
premature summarization or missed compression triggers.

Changes:
- Add context_window field to AgentDefaults config (default: 4x max_tokens)
- Extract boundary-safe truncation helpers (isSafeBoundary, findSafeBoundary)
  into context_budget.go — pure functions with no AgentLoop dependency
- forceCompression: align split to safe boundary so tool-call sequences
  (assistant+ToolCalls → tool results) are never torn apart
- summarizeSession: use findSafeBoundary instead of hardcoded keep-last-4
- estimateTokens: count ToolCalls arguments and ToolCallID metadata,
  not just Content — fixes systematic undercounting in tool-heavy sessions
- Add proactive context budget check before LLM call in runAgentLoop,
  preventing 400 context-length errors instead of reacting to them
- Add estimateToolDefsTokens for tool definition token cost

Closes sipeed#556, closes sipeed#665
Ref sipeed#1439
@yinwm yinwm closed this in 9c82b0b Mar 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

domain: agent type: bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG]: History compression orphans tool_call/tool_result pairs, causing Anthropic API errors

3 participants