Incomplete stream can drop custom tool outputs and break thread resume

## Summary

If a Responses stream errors or closes after emitting a `custom_tool_call` but before `response.completed`, Codex can persist a session history that contains the `custom_tool_call` without the matching `custom_tool_call_output`.

Later, resuming the thread fails during local history normalization before any new model request is sent. In practice this can look like the UI staying in `Working` or failing immediately when restoring the thread.

## Impact

- A session JSONL can be left in an internally inconsistent state on disk.
- Resuming the thread fails locally during history reconstruction.
- Retry requests can also be built from stale history and omit already-completed tool outputs from the previous attempt.

## Reproduction outline

1. Start a turn where the model emits a `custom_tool_call` (for example `apply_patch` or another custom tool).
2. Let the tool complete successfully.
3. Terminate the SSE stream before `response.completed` arrives.
4. Let Codex retry the stream, or resume the thread later.
5. Observe that history may contain a `custom_tool_call` with no matching `custom_tool_call_output`, and thread resume fails.
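The inconsistency in step 5 can be checked mechanically. The following is a minimal sketch, not the actual normalization code: the `HistoryItem` enum, its `call_id` field, and `unmatched_custom_tool_calls` are hypothetical stand-ins for whatever the session history really stores.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified history record (not the real Codex type).
#[derive(Debug)]
enum HistoryItem {
    CustomToolCall { call_id: String },
    CustomToolCallOutput { call_id: String },
    Message(String),
}

/// Returns the ids of custom tool calls that have no matching output,
/// in the order the calls appear in history.
fn unmatched_custom_tool_calls(history: &[HistoryItem]) -> Vec<String> {
    // Collect every call id that has an output record.
    let outputs: HashSet<&str> = history
        .iter()
        .filter_map(|item| match item {
            HistoryItem::CustomToolCallOutput { call_id } => Some(call_id.as_str()),
            _ => None,
        })
        .collect();

    // Keep only the calls whose id is missing from that set.
    history
        .iter()
        .filter_map(|item| match item {
            HistoryItem::CustomToolCall { call_id }
                if !outputs.contains(call_id.as_str()) =>
            {
                Some(call_id.clone())
            }
            _ => None,
        })
        .collect()
}
```

A non-empty result corresponds to the state described above: a call was persisted but its output was not.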

## Expected behavior

- Error paths should still drain in-flight tool futures and persist completed tool outputs.
- Retry attempts should rebuild the prompt from the latest session history so previously completed tool outputs are included.
- A successfully completed tool call should not leave the session unrecoverable just because the stream failed before `response.completed`.

## Actual behavior

There are two bugs involved:

1. In `try_run_sampling_request()`, some stream/tool error paths return early before `drain_in_flight()` runs, so completed custom tool outputs are never written into history.
2. In `run_sampling_request()`, the prompt is built once before the retry loop, so retries can reuse stale history even if the session state has been updated.

That combination can leave persisted history inconsistent and also cause retries to omit the prior tool output.

## Suspected root cause

Relevant locations in the current tree:

- `codex-rs/core/src/codex.rs`
- `codex-rs/core/src/context_manager/normalize.rs`
- `codex-rs/core/src/context_manager/history.rs`

History normalization correctly enforces the invariant that every `custom_tool_call` must have a matching `custom_tool_call_output`. The real issue is that sampling-request cleanup/retry logic can violate that invariant.

More specifically:

- A stream item error path like `Some(res) => res?` can exit `try_run_sampling_request()` before cleanup.
- An error from `handle_output_item_done(...)` can also exit early before `drain_in_flight()`.
- Because `build_prompt(...)` is outside the retry loop, the next retry may still send a prompt built from stale history.
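The first two points can be illustrated with a reduced model of the control flow. This is a sketch under assumptions, not the code in `codex.rs`: `Item`, `run_with_early_return`, `run_with_break`, and this synchronous `drain_in_flight` are hypothetical, and the real function is async and considerably more involved.

```rust
/// Hypothetical, simplified history record (not the real Codex type).
#[derive(Debug, PartialEq)]
enum Item {
    CustomToolCall(String),
    CustomToolCallOutput(String),
}

/// Stand-in for draining in-flight tool futures: every tool that was
/// started contributes its output record to history.
fn drain_in_flight(in_flight: &mut Vec<String>, history: &mut Vec<Item>) {
    for call_id in in_flight.drain(..) {
        history.push(Item::CustomToolCallOutput(call_id));
    }
}

/// Buggy shape: `?` returns on a stream error and skips the drain.
fn run_with_early_return(
    events: Vec<Result<String, String>>,
    history: &mut Vec<Item>,
) -> Result<(), String> {
    let mut in_flight = Vec::new();
    for event in events {
        let call_id = event?; // early return: the cleanup below never runs
        history.push(Item::CustomToolCall(call_id.clone()));
        in_flight.push(call_id);
    }
    drain_in_flight(&mut in_flight, history);
    Ok(())
}

/// Fixed shape: the loop yields its outcome through `break`, so the
/// shared cleanup runs on both the success and the error path.
fn run_with_break(
    events: Vec<Result<String, String>>,
    history: &mut Vec<Item>,
) -> Result<(), String> {
    let mut in_flight = Vec::new();
    let mut events = events.into_iter();
    let outcome = loop {
        match events.next() {
            None => break Ok(()),
            Some(Err(err)) => break Err(err),
            Some(Ok(call_id)) => {
                history.push(Item::CustomToolCall(call_id.clone()));
                in_flight.push(call_id);
            }
        }
    };
    drain_in_flight(&mut in_flight, history);
    outcome
}
```

Both functions report the stream error to the caller. Only the second leaves history with a matching output for the call that was already started.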

## Likely introduction timeline

From local tracing/blame:

- The missing-output persistence bug appears to be at least as old as `f2555422b` (`Simplify parallel`, 2025-10-07), where early stream-error returns bypassed later cleanup.
- The stale-prompt-on-retry bug appears older and seems to date back to the initial Rust implementation (`31d0d7a30`, 2025-04-24), where the prompt was already built outside the retry loop.
- The history-normalization check that exposes the bad state is present in `1a89f7001` (`refactor Conversation history file into its own directory`, 2025-11-05).

## Suggested fix

- In `try_run_sampling_request()`, convert early error-path returns into `break Err(err)` so the function still reaches the shared cleanup path and `drain_in_flight()`.
- In `run_sampling_request()`, rebuild the prompt inside the retry loop on every attempt, from a fresh `sess.clone_history().for_prompt(...)` call.
- Add a regression test covering:
  - `custom_tool_call`
  - incomplete stream before `response.completed`
  - retry
  - assertion that the retried request contains the previous `custom_tool_call_output`
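The retry-loop change and the regression assertion could look roughly like this. It is a hedged sketch only: `Session`, `build_prompt`, and `run_with_retries` are hypothetical names, history is modeled as plain strings, and the real code would go through `sess.clone_history().for_prompt(...)` in an async context.

```rust
/// Hypothetical session whose history is a list of plain strings.
struct Session {
    history: Vec<String>,
}

impl Session {
    /// Stand-in for building a prompt from the current history.
    fn build_prompt(&self) -> Vec<String> {
        self.history.clone()
    }
}

/// Runs `attempt` up to `max_attempts` times. The prompt is rebuilt
/// inside the loop, so each attempt sees history written by earlier ones.
fn run_with_retries<F>(
    sess: &mut Session,
    max_attempts: usize,
    mut attempt: F,
) -> Result<(), String>
where
    F: FnMut(&mut Session, &[String]) -> Result<(), String>,
{
    let mut last_err = String::from("no attempts made");
    for _ in 0..max_attempts {
        // Rebuilt per attempt rather than once before the loop.
        let prompt = sess.build_prompt();
        match attempt(sess, &prompt) {
            Ok(()) => return Ok(()),
            Err(err) => last_err = err,
        }
    }
    Err(last_err)
}
```

A regression test along these lines would have the first attempt record a tool output and then fail, and assert that the prompt passed to the second attempt contains that output.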

## Notes

I reproduced this with a custom tool call that completed successfully before the stream ended early. The tool's effect was visible, but the corresponding output record was missing from session history, which strongly suggests a persistence/cleanup bug rather than a tool-execution failure.


Incomplete stream can drop custom tool outputs and break thread resume #16255
