
feat(evals): add behavioral eval for error recovery and self-correction #21990

@PeterWadie

Description

Summary

There are currently no behavioral evaluations covering error recovery / self-correction — the agent's ability to detect a failed tool call (e.g., a failing test run), diagnose the issue, make corrections, and re-verify.

This is arguably one of the most critical agent behaviors: a coding agent that cannot recover from mistakes burns tokens and frustrates users.

Proposed Eval

Scaffold a small TypeScript project with:

  • A function containing a subtle bug (off-by-one in array filtering)
  • A test file that exposes the bug (the suite fails as scaffolded)

Prompt the agent to "fix the failing tests." Assert that:

  1. The agent runs the test suite (detecting failure)
  2. The agent edits the source file to fix the bug
  3. The agent re-runs the tests to verify the fix worked

This tests the full observe → diagnose → fix → verify loop that distinguishes robust agents from single-shot ones.

Context

This gap was identified while reviewing the existing eval suite for GSoC 2026 Idea #2 (Behavioral Evaluation Test Framework). The existing evals cover tool selection, efficiency, validation fidelity, and delegation — but none cover the iterative self-correction loop.

Related: Evals roadmap item #18257

Metadata


Assignees

No one assigned

    Labels

    area/platform: Issues related to Build infra, Release mgmt, Testing, Eval infra, Capacity, Quota mgmt
    status/need-triage: Issues that need to be triaged by the triage automation.
