Incomplete stream can drop custom tool outputs and break thread resume #16255

@wjhuang2016

Summary

If a Responses stream errors or closes after emitting a custom_tool_call but before response.completed, Codex can persist a session history that contains the custom_tool_call without the matching custom_tool_call_output.

Later, resuming the thread fails during local history normalization, before any new model request is sent. In practice the UI either stays stuck in the "Working" state or fails immediately when restoring the thread.

Impact

  • A session JSONL can be left in an internally inconsistent state on disk.
  • Resuming the thread fails locally during history reconstruction.
  • Retry requests can also be built from stale history and omit already-completed tool outputs from the previous attempt.

Reproduction outline

  1. Start a turn where the model emits a custom_tool_call (for example apply_patch or another custom tool).
  2. Let the tool complete successfully.
  3. Terminate the SSE stream before response.completed arrives.
  4. Let Codex retry the stream, or resume the thread later.
  5. Observe that history may contain a custom_tool_call with no matching custom_tool_call_output, and thread resume fails.

Expected behavior

  • Error paths should still drain in-flight tool futures and persist completed tool outputs.
  • Retry attempts should rebuild the prompt from the latest session history so previously completed tool outputs are included.
  • A successfully completed tool call should not leave the session unrecoverable just because the stream failed before response.completed.

Actual behavior

There are two bugs involved:

  1. In try_run_sampling_request(), some stream/tool error paths return early before drain_in_flight() runs, so completed custom tool outputs are never written into history.
  2. In run_sampling_request(), the prompt is built once before the retry loop, so retries can reuse stale history even if the session state has been updated.

That combination can leave persisted history inconsistent and also cause retries to omit the prior tool output.

Suspected root cause

Relevant locations in the current tree:

  • codex-rs/core/src/codex.rs
  • codex-rs/core/src/context_manager/normalize.rs
  • codex-rs/core/src/context_manager/history.rs

History normalization correctly enforces the invariant that every custom_tool_call must have a matching custom_tool_call_output. The real issue is that sampling-request cleanup/retry logic can violate that invariant.
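To make the invariant concrete, here is a minimal sketch of the kind of check normalization performs. The `Item` enum and `unmatched_calls` function are hypothetical stand-ins I made up for illustration, not the actual types in normalize.rs:

```rust
use std::collections::HashSet;

// Hypothetical, simplified stand-in for a session history item.
#[derive(Debug, Clone, PartialEq)]
enum Item {
    Message(String),
    CustomToolCall { call_id: String },
    CustomToolCallOutput { call_id: String },
}

/// Returns the call_ids of custom tool calls that have no matching output,
/// in the order the calls appear in history.
fn unmatched_calls(history: &[Item]) -> Vec<String> {
    // Collect every call_id that has an output record.
    let outputs: HashSet<&str> = history
        .iter()
        .filter_map(|item| match item {
            Item::CustomToolCallOutput { call_id } => Some(call_id.as_str()),
            _ => None,
        })
        .collect();

    // Any call whose id is not in that set violates the invariant.
    history
        .iter()
        .filter_map(|item| match item {
            Item::CustomToolCall { call_id } if !outputs.contains(call_id.as_str()) => {
                Some(call_id.clone())
            }
            _ => None,
        })
        .collect()
}
```

A history left behind by the failure described here would yield a non-empty result from a check like this, which is why resume fails locally rather than at the API.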

More specifically:

  • A stream item error path like Some(res) => res? can exit try_run_sampling_request() before cleanup.
  • An error from handle_output_item_done(...) can also exit early before drain_in_flight().
  • Because build_prompt(...) is outside the retry loop, the next retry may still send a prompt built from stale history.
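The early-return problem can be illustrated with a simplified model. This is a sketch under assumptions, not the actual Codex code: `Session`, `Event`, and `run_with_early_return` are hypothetical, and a finished tool whose output has not yet been persisted is modeled as an entry in `in_flight`. Only the `drain_in_flight` name comes from the real code.

```rust
// Hypothetical, simplified model of a sampling request.
enum Event {
    ToolCallDone(String), // call_id of a tool call whose tool has finished
    Completed,            // stands in for response.completed
}

struct Session {
    history: Vec<String>,   // persisted records
    in_flight: Vec<String>, // finished tool outputs not yet persisted
}

impl Session {
    // Moves every pending tool output into persisted history.
    fn drain_in_flight(&mut self) {
        let drained: Vec<String> = self.in_flight.drain(..).collect();
        self.history.extend(drained);
    }
}

fn run_with_early_return(
    sess: &mut Session,
    stream: Vec<Result<Event, String>>,
) -> Result<(), String> {
    for item in stream {
        // `?` returns immediately on a stream error, skipping the cleanup below.
        let event = item?;
        match event {
            Event::ToolCallDone(id) => {
                sess.history.push(format!("call:{}", id));
                sess.in_flight.push(format!("output:{}", id));
            }
            Event::Completed => break,
        }
    }
    // Only reached when the loop ends without an error.
    sess.drain_in_flight();
    Ok(())
}
```

Feeding this a stream that yields one tool call and then an error leaves `call:a` in history while `output:a` stays stranded in `in_flight`, which matches the inconsistent state described above.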

Likely introduction timeline

From local tracing/blame:

  • The missing-output persistence bug appears to be at least as old as f2555422b (Simplify parallel, 2025-10-07), where early stream-error returns bypassed later cleanup.
  • The stale-prompt-on-retry bug appears older and seems to date back to the initial Rust implementation (31d0d7a30, 2025-04-24), where the prompt was already built outside the retry loop.
  • The history-normalization check that exposes the bad state is present in 1a89f7001 (refactor Conversation history file into its own directory, 2025-11-05).

Suggested fix

  • In try_run_sampling_request(), convert early error-path returns into break Err(err) so the function still reaches the shared cleanup path and drain_in_flight().
  • In run_sampling_request(), rebuild the prompt inside the retry loop on every attempt, from a fresh sess.clone_history().for_prompt(...) call.
  • Add a regression test covering:
    • custom_tool_call
    • incomplete stream before response.completed
    • retry
    • assertion that the retried request contains the previous custom_tool_call_output
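A rough sketch of both changes together, again on a simplified model rather than the real code. `Session`, `Event`, `try_once`, and `run_with_retries` are hypothetical names, and the "prompt" is just a snapshot of history; only `drain_in_flight` and the `break Err(err)` shape come from the suggestions above.

```rust
// Hypothetical, simplified model; not the actual Codex implementation.
enum Event {
    ToolCallDone(String), // call_id of a tool call whose tool has finished
    Completed,            // stands in for response.completed
}

struct Session {
    history: Vec<String>,   // persisted records
    in_flight: Vec<String>, // finished tool outputs not yet persisted
}

impl Session {
    fn drain_in_flight(&mut self) {
        let drained: Vec<String> = self.in_flight.drain(..).collect();
        self.history.extend(drained);
    }
}

// One attempt: every exit path breaks out of the loop with a value,
// so the shared cleanup always runs before returning.
fn try_once(sess: &mut Session, stream: Vec<Result<Event, String>>) -> Result<(), String> {
    let mut events = stream.into_iter();
    let outcome = loop {
        match events.next() {
            None => break Err("stream closed before response.completed".to_string()),
            Some(Err(err)) => break Err(err), // was an early return via `?`
            Some(Ok(Event::Completed)) => break Ok(()),
            Some(Ok(Event::ToolCallDone(id))) => {
                sess.history.push(format!("call:{}", id));
                sess.in_flight.push(format!("output:{}", id));
            }
        }
    };
    sess.drain_in_flight();
    outcome
}

// Retry loop: the prompt is rebuilt from current history on each attempt.
// Returns the final result plus the prompt used for every attempt.
fn run_with_retries(
    sess: &mut Session,
    attempts: Vec<Vec<Result<Event, String>>>,
) -> (Result<(), String>, Vec<Vec<String>>) {
    let mut prompts = Vec::new();
    let mut last: Result<(), String> = Err("no attempts".to_string());
    for stream in attempts {
        prompts.push(sess.history.clone()); // fresh snapshot, not a stale one
        last = try_once(sess, stream);
        if last.is_ok() {
            break;
        }
    }
    (last, prompts)
}
```

With a first attempt that fails after a tool call and a second attempt that completes, the second prompt in this model contains both the call and its output, which is the property the regression test should assert against the real request.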

Notes

I reproduced this with a custom tool call that completed successfully before the stream ended early. The tool's effect was visible, but the corresponding output record was missing from session history, which strongly suggests a persistence/cleanup bug rather than a tool-execution failure.

Labels

bug (Something isn't working), tool-calls (Issues related to tool calling)
