## Summary
There are currently no behavioral evaluations covering error recovery / self-correction — the agent's ability to detect a failed tool call (e.g., a failing test run), diagnose the issue, make corrections, and re-verify.
This is arguably one of the most critical agent behaviors: a coding agent that cannot recover from mistakes burns tokens and frustrates users.
## Proposed Eval
Scaffold a small TypeScript project with:
- A function containing a subtle bug (off-by-one in array filtering)
- A test file whose assertions expose the bug, so the suite fails out of the box
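A minimal sketch of what the scaffold could look like. The function name `scoresBelow` and the file layout are illustrative, not part of any existing fixture; the planted bug is the off-by-one boundary comparison in the filter predicate.

```typescript
// src/scores.ts (illustrative) -- intended to keep scores strictly
// below `limit`, but `<=` is the planted off-by-one boundary error.
function scoresBelow(scores: number[], limit: number): number[] {
  return scores.filter((s) => s <= limit); // BUG: should be `s < limit`
}

// src/scores.test.ts (illustrative, jest-style) -- exposes the bug:
// the intended result is [10], but the buggy filter returns [10, 20].
//
// test("excludes scores equal to the limit", () => {
//   expect(scoresBelow([10, 20, 30], 20)).toEqual([10]);
// });
```

Keeping the bug to a single comparison operator keeps the diagnosis tractable while still requiring the agent to actually read the failing assertion rather than guess.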
Prompt the agent to "fix the failing tests." Assert that:
- The agent runs the test suite (detecting failure)
- The agent edits the source file to fix the bug
- The agent re-runs the tests to verify the fix worked
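The three assertions above could be checked against an ordered transcript of the agent's tool calls. The `ToolCall` shape and the `assertRecoveryLoop` helper below are hypothetical, assuming the framework records each call as `{ tool, args }` in order; they are a sketch, not an existing API.

```typescript
// Hypothetical transcript entry recorded by the eval harness.
interface ToolCall {
  tool: string; // e.g. "bash", "edit_file" (names are assumptions)
  args: Record<string, unknown>;
}

// Heuristic: a shell invocation whose command mentions "test".
function isTestRun(c: ToolCall): boolean {
  return c.tool === "bash" && String(c.args.command ?? "").includes("test");
}

// Assert the observe -> diagnose -> fix -> verify ordering:
// a test run, then an edit, then another test run.
function assertRecoveryLoop(transcript: ToolCall[]): void {
  const firstRun = transcript.findIndex(isTestRun);
  const edit = transcript.findIndex(
    (c, i) => i > firstRun && c.tool === "edit_file",
  );
  const reRun = transcript.findIndex((c, i) => i > edit && isTestRun(c));
  if (firstRun === -1 || edit === -1 || reRun === -1) {
    throw new Error("agent did not complete the run -> edit -> re-run loop");
  }
}
```

Asserting on ordering rather than exact commands keeps the eval robust to superficial variation (e.g. `npm test` vs `npx jest`) while still failing agents that edit blindly without ever re-verifying.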
This tests the full observe → diagnose → fix → verify loop that distinguishes robust agents from single-shot ones.
## Context
This gap was identified while reviewing the existing eval suite for GSoC 2026 Idea #2 (Behavioral Evaluation Test Framework). The existing evals cover tool selection, efficiency, validation fidelity, and delegation — but none cover the iterative self-correction loop.
Related: Evals roadmap item #18257