Summary
If a Responses stream errors or closes after emitting a custom_tool_call but before response.completed, Codex can persist a session history that contains the custom_tool_call without the matching custom_tool_call_output.
Later, resuming the thread fails during local history normalization before any new model request is sent. In practice this can look like the UI staying in Working or failing immediately when restoring the thread.
Impact
- A session JSONL can be left in an internally inconsistent state on disk.
- Resuming the thread fails locally during history reconstruction.
- Retry requests can also be built from stale history and omit already-completed tool outputs from the previous attempt.
Reproduction outline
- Start a turn where the model emits a custom_tool_call (for example apply_patch or another custom tool).
- Let the tool complete successfully.
- Terminate the SSE stream before response.completed arrives.
- Let Codex retry the stream, or resume the thread later.
- Observe that history may contain a custom_tool_call with no matching custom_tool_call_output, and thread resume fails.
Expected behavior
- Error paths should still drain in-flight tool futures and persist completed tool outputs.
- Retry attempts should rebuild the prompt from the latest session history so previously completed tool outputs are included.
- A successfully completed tool call should not leave the session unrecoverable just because the stream failed before response.completed.
Actual behavior
There are two bugs involved:
- In try_run_sampling_request(), some stream/tool error paths return early before drain_in_flight() runs, so completed custom tool outputs are never written into history.
- In run_sampling_request(), the prompt is built once before the retry loop, so retries can reuse stale history even if the session state has been updated.
That combination can leave persisted history inconsistent and also cause retries to omit the prior tool output.
Suspected root cause
Relevant locations in the current tree:
codex-rs/core/src/codex.rs
codex-rs/core/src/context_manager/normalize.rs
codex-rs/core/src/context_manager/history.rs
History normalization correctly enforces the invariant that every custom_tool_call must have a matching custom_tool_call_output. The real issue is that sampling-request cleanup/retry logic can violate that invariant.
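As an illustration of the invariant only, here is a minimal sketch of the pairing check. The Item enum and unmatched_calls() are hypothetical stand-ins written for this report, not the actual types or functions in normalize.rs or history.rs.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified history item (not the real Codex type).
#[derive(Debug, Clone, PartialEq)]
enum Item {
    Message,
    CustomToolCall { call_id: String },
    CustomToolCallOutput { call_id: String },
}

/// Returns the call_ids of custom tool calls that have no matching output.
/// An empty result means the pairing invariant holds for this history.
fn unmatched_calls(history: &[Item]) -> Vec<String> {
    // Collect every call_id that has an output record.
    let outputs: HashSet<&str> = history
        .iter()
        .filter_map(|item| match item {
            Item::CustomToolCallOutput { call_id } => Some(call_id.as_str()),
            _ => None,
        })
        .collect();

    // Report calls whose call_id never appears among the outputs.
    history
        .iter()
        .filter_map(|item| match item {
            Item::CustomToolCall { call_id } if !outputs.contains(call_id.as_str()) => {
                Some(call_id.clone())
            }
            _ => None,
        })
        .collect()
}
```

A history that ends with a CustomToolCall and no CustomToolCallOutput for the same call_id is the state this issue describes being left on disk.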
More specifically:
- A stream item error path like Some(res) => res? can exit try_run_sampling_request() before cleanup.
- An error from handle_output_item_done(...) can also exit early before drain_in_flight().
- Because build_prompt(...) is outside the retry loop, the next retry may still send a prompt built from stale history.
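The first two points can be sketched in isolation. The following is a simplified, synchronous model and not the real code: Turn, run_with_early_return(), and run_with_shared_cleanup() are hypothetical names, and strings stand in for response items. It only shows how a `?` inside the loop skips the cleanup step, while `break Err(err)` still reaches it.

```rust
/// Hypothetical stand-in for one sampling request's state.
struct Turn {
    /// Outputs of tools that already completed but are not yet persisted.
    in_flight: Vec<String>,
    /// Items persisted to session history.
    history: Vec<String>,
}

impl Turn {
    /// Moves every completed tool output into history.
    fn drain_in_flight(&mut self) {
        let drained: Vec<String> = self.in_flight.drain(..).collect();
        self.history.extend(drained);
    }
}

/// Buggy shape: `?` returns from the function before cleanup runs.
fn run_with_early_return(
    turn: &mut Turn,
    events: Vec<Result<String, String>>,
) -> Result<(), String> {
    for event in events {
        let item = event?; // a stream error exits here, skipping the drain
        turn.history.push(item);
    }
    turn.drain_in_flight();
    Ok(())
}

/// Fixed shape: errors break out of the loop with a value, so the
/// shared cleanup below runs on both the success and the error path.
fn run_with_shared_cleanup(
    turn: &mut Turn,
    events: Vec<Result<String, String>>,
) -> Result<(), String> {
    let mut events = events.into_iter();
    let outcome = loop {
        match events.next() {
            Some(Ok(item)) => turn.history.push(item),
            Some(Err(err)) => break Err(err),
            None => break Ok(()),
        }
    };
    turn.drain_in_flight();
    outcome
}
```

Both functions return the same error to the caller. The difference is only in what has been written to history by the time they return.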
Likely introduction timeline
From local tracing/blame:
- The missing-output persistence bug appears to be at least as old as f2555422b (Simplify parallel, 2025-10-07), where early stream-error returns bypassed later cleanup.
- The stale-prompt-on-retry bug appears older and seems to date back to the initial Rust implementation (31d0d7a30, 2025-04-24), where the prompt was already built outside the retry loop.
- The history-normalization check that exposes the bad state is present in 1a89f7001 (refactor Conversation history file into its own directory, 2025-11-05).
Suggested fix
- In try_run_sampling_request(), convert early error-path returns into break Err(err) so the function still reaches the shared cleanup path and drain_in_flight().
- In run_sampling_request(), rebuild the prompt inside the retry loop from fresh sess.clone_history().for_prompt(...).
- Add a regression test covering:
  - custom_tool_call
  - incomplete stream before response.completed
  - retry
  - assertion that the retried request contains the previous custom_tool_call_output
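A rough sketch of the retry-loop change and of what the regression test would assert, under heavy simplification: Session, build_prompt(), and run_with_retries() here are hypothetical stand-ins that use plain strings, not the real session, prompt, or sampling APIs.

```rust
/// Hypothetical session holding only the persisted history.
struct Session {
    history: Vec<String>,
}

/// Hypothetical prompt builder: a snapshot of the current history.
fn build_prompt(sess: &Session) -> Vec<String> {
    sess.history.clone()
}

/// Runs up to `max_attempts` attempts and rebuilds the prompt from the
/// latest history at the top of every attempt. Returns the prompts that
/// were sent, so a test can inspect what each retry contained.
fn run_with_retries<F>(
    sess: &mut Session,
    max_attempts: usize,
    mut attempt: F,
) -> Vec<Vec<String>>
where
    F: FnMut(&mut Session, &[String], usize) -> Result<(), String>,
{
    let mut sent = Vec::new();
    for n in 0..max_attempts {
        // Built inside the loop, so it reflects anything persisted by
        // the previous (failed) attempt.
        let prompt = build_prompt(sess);
        sent.push(prompt.clone());
        if attempt(sess, &prompt, n).is_ok() {
            break;
        }
    }
    sent
}
```

In a test, the first attempt would persist a custom_tool_call and its custom_tool_call_output and then fail as if the stream had closed before response.completed. The assertion is that the second prompt contains the custom_tool_call_output. With the prompt built once before the loop, that assertion would fail.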
Notes
I reproduced this with a custom tool call that completed successfully before the stream ended early. The tool's effect was visible, but the corresponding output record was missing from session history, which strongly suggests a persistence/cleanup bug rather than a tool-execution failure.