## Summary
There are currently no behavioral evaluations covering error recovery / self-correction — the agent's ability to detect a failed tool call (e.g., a failing test run), diagnose the issue, make corrections, and re-verify.
This is arguably one of the most critical agent behaviors: a coding agent that cannot recover from mistakes burns tokens and frustrates users.
## Proposed Eval
Scaffold a small TypeScript project with:
- A function containing a subtle bug (off-by-one in array filtering)
- A test file whose assertions expose the bug, so the suite fails out of the box
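A minimal sketch of what the scaffold could look like. The function name `scoresBelow` and the file layout are illustrative, not part of any existing fixture; the planted bug is the off-by-one boundary comparison in the filter predicate.

```typescript
// src/scores.ts (illustrative) -- intended to keep scores strictly
// below `limit`, but `<=` is the planted off-by-one boundary error.
function scoresBelow(scores: number[], limit: number): number[] {
  return scores.filter((s) => s <= limit); // BUG: should be `s < limit`
}

// src/scores.test.ts (illustrative, jest-style) -- exposes the bug:
// the intended result is [10], but the buggy filter returns [10, 20].
//
// test("excludes scores equal to the limit", () => {
//   expect(scoresBelow([10, 20, 30], 20)).toEqual([10]);
// });
```

Keeping the bug to a single comparison operator keeps the diagnosis tractable while still requiring the agent to actually read the failing assertion rather than guess.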
Prompt the agent to "fix the failing tests." Assert that:
- The agent runs the test suite (detecting failure)
- The agent edits the source file to fix the bug
- The agent re-runs the tests to verify the fix worked
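The three assertions above could be checked against an ordered transcript of the agent's tool calls. The `ToolCall` shape and the `assertRecoveryLoop` helper below are hypothetical, assuming the framework records each call as `{ tool, args }` in order; they are a sketch, not an existing API.

```typescript
// Hypothetical transcript entry recorded by the eval harness.
interface ToolCall {
  tool: string; // e.g. "bash", "edit_file" (names are assumptions)
  args: Record<string, unknown>;
}

// Heuristic: a shell invocation whose command mentions "test".
function isTestRun(c: ToolCall): boolean {
  return c.tool === "bash" && String(c.args.command ?? "").includes("test");
}

// Assert the observe -> diagnose -> fix -> verify ordering:
// a test run, then an edit, then another test run.
function assertRecoveryLoop(transcript: ToolCall[]): void {
  const firstRun = transcript.findIndex(isTestRun);
  const edit = transcript.findIndex(
    (c, i) => i > firstRun && c.tool === "edit_file",
  );
  const reRun = transcript.findIndex((c, i) => i > edit && isTestRun(c));
  if (firstRun === -1 || edit === -1 || reRun === -1) {
    throw new Error("agent did not complete the run -> edit -> re-run loop");
  }
}
```

Asserting on ordering rather than exact commands keeps the eval robust to superficial variation (e.g. `npm test` vs `npx jest`) while still failing agents that edit blindly without ever re-verifying.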
This tests the full observe → diagnose → fix → verify loop that distinguishes robust agents from single-shot ones.
## Context
This gap was identified while reviewing the existing eval suite for GSoC 2026 Idea #2 (Behavioral Evaluation Test Framework). The existing evals cover tool selection, efficiency, validation fidelity, and delegation — but none cover the iterative self-correction loop.
Related: Evals roadmap item #18257