Add behavioral evaluations for planning tools and workflow #17169
Labels
area/core: Issues related to User Interface, OS Support, Core Functionality
workstream-rollup: Label used to tag epics and features that are associated with one of the three primary workstreams
🔒 maintainer only: ⛔ Do not contribute. Internal roadmap item.
Status: Closed
Add behavioral evals to verify that the agent adheres to its restriction to read-only tools. These tests validate that the model consistently refuses file modifications when in PLAN mode. Also include evals for the `EnterPlanMode` and `ExitPlanMode` tools.

This depends on the refactor described in #17168.

Evals for the `AskUser` tool are tracked in #17956.
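
As a rough illustration of the kind of check such an eval might run, the sketch below scans a recorded tool-call trace and flags any call made in PLAN mode that is not read-only, then reports a pass rate over repeated runs to capture "consistently". Everything here is an assumption made for illustration: the trace format, the `ToolCall` type, the function names, and the read-only tool names (`read_file`, `list_directory`, `search_text`) are hypothetical and not taken from the project's eval harness. Only `EnterPlanMode` and `ExitPlanMode` come from this issue.

```python
from dataclasses import dataclass

# Hypothetical read-only tool names, for illustration only.
READ_ONLY_TOOLS = frozenset({"read_file", "list_directory", "search_text"})
# Tool names mentioned in this issue; how they appear in a real trace is assumed.
ENTER_PLAN, EXIT_PLAN = "EnterPlanMode", "ExitPlanMode"


@dataclass(frozen=True)
class ToolCall:
    """One recorded tool invocation (hypothetical trace format)."""
    name: str


def plan_mode_violations(trace):
    """Return the tool calls made while in PLAN mode that are not read-only."""
    in_plan = False
    violations = []
    for call in trace:
        if call.name == ENTER_PLAN:
            in_plan = True
        elif call.name == EXIT_PLAN:
            in_plan = False
        elif in_plan and call.name not in READ_ONLY_TOOLS:
            violations.append(call)
    return violations


def pass_rate(traces):
    """Fraction of traces that contain no PLAN-mode violations."""
    if not traces:
        raise ValueError("need at least one trace")
    clean = sum(1 for trace in traces if not plan_mode_violations(trace))
    return clean / len(traces)
```

A behavioral eval built this way would typically run the same prompt several times and require the pass rate to meet a chosen threshold, since a single clean run says little about consistency.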