Commit 002b0dc

Behavioral evals best practices docs.
1 parent 07f51c3 commit 002b0dc

File tree: 1 file changed (+19, −1 lines)


evals/README.md

Lines changed: 19 additions & 1 deletion
@@ -3,7 +3,7 @@
 Behavioral evaluations (evals) are tests designed to validate the agent's
 behavior in response to specific prompts. They serve as a critical feedback loop
 for changes to system prompts, tool definitions, and other model-steering
-mechanisms.
+mechanisms, as well as a tool for assessing per-model feature reliability and preventing regressions.
 
 ## Why Behavioral Evals?
 
@@ -30,6 +30,24 @@ CLI's features.
 those that are generally reliable but might occasionally vary
 (`USUALLY_PASSES`).
 
+## Best Practices
+
+When designing behavioral evals, aim for scenarios that accurately reflect real-world usage while remaining small and maintainable.
+
+- **Realistic Complexity**: Evals should be complicated enough to be realistic. They should operate on actual files in a source directory, mirroring how a real agent interacts with a workspace. The agent may behave differently in a larger codebase, so avoid scenarios that are too simple to be representative.
+  - *Good*: An eval that provides a small, functional React component and asks the agent to add a specific feature, requiring it to read the file, understand the context, and write the correct changes.
+  - *Bad*: An eval that simply asks the agent a trivia question, or asks it to write a generic script without providing any local workspace context.
+- **Maintainable Size**: Evals should be small enough to reason about and maintain. Checking in an entire repository as a test case is impractical, though over time these evals should mature into increasingly realistic scenarios.
+  - *Good*: A test setup with 2-3 files (e.g., a source file, a config file, and a test file) that isolates the specific behavior being evaluated.
+  - *Bad*: A test setup containing dozens of files from a complex framework, where the setup logic itself is prone to breaking.
+- **Unambiguous and Reliable Assertions**: Assertions must be clear and specific to ensure the test passes for the right reason.
+  - *Good*: Checking that a modified file contains a specific AST node or exact string, or verifying that a tool was called with the right parameters.
+  - *Bad*: Only checking that a tool was called at all (which could happen for an unrelated reason), or expecting specific LLM output verbatim.
+- **Fail First**: Write tests that failed before your prompt or tool change; be sure the test fails before your "fix". It is easy to accidentally create a passing test that asserts behavior we get for free. In general, every eval should be accompanied by a prompt change, and most prompt changes should be accompanied by an eval.
+  - *Good*: Observing a failure, writing an eval that reliably reproduces it, modifying the prompt or tool, and then verifying that the eval passes.
+  - *Bad*: Writing an eval that passes on the first run and assuming your new prompt change was responsible.
+- **Less is More**: Prefer fewer, more realistic tests that assert the major paths over many narrow, unit-test-like checks. These are evals, so the value lies in testing how the agent works in a semi-realistic scenario.
+
 ## Creating an Evaluation
 
 Evaluations are located in the `evals` directory. Each evaluation is a Vitest
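The "Unambiguous and Reliable Assertions" guidance added above can be sketched in TypeScript. This is a minimal standalone illustration, not the repository's actual eval API: the `ToolCall` shape, the tool names, and the transcript contents are all hypothetical (real evals here are Vitest tests, which would use `expect` instead of boolean helpers).

```typescript
// Hypothetical record of one tool invocation captured from an agent run.
// This shape is illustrative only, not this repo's actual eval API.
interface ToolCall {
  name: string;
  params: Record<string, string>;
}

// Good: assert that a *specific* tool was called with *specific* parameters,
// not merely that some tool ran.
function calledWith(
  transcript: ToolCall[],
  name: string,
  params: Record<string, string>,
): boolean {
  return transcript.some(
    (c) =>
      c.name === name &&
      Object.entries(params).every(([k, v]) => c.params[k] === v),
  );
}

// Good: assert that an exact string landed in the modified file.
function fileContains(contents: string, needle: string): boolean {
  return contents.includes(needle);
}

// Made-up transcript and file contents for demonstration.
const transcript: ToolCall[] = [
  { name: "read_file", params: { path: "src/Button.tsx" } },
  { name: "write_file", params: { path: "src/Button.tsx" } },
];
const modified =
  "export function Button({ disabled }: { disabled?: boolean }) { /* ... */ }";

console.log(calledWith(transcript, "write_file", { path: "src/Button.tsx" })); // true
console.log(fileContains(modified, "disabled?: boolean")); // true

// Bad-style check, for contrast: any tool call at all passes, even an
// unrelated one, so a green result proves nothing specific.
console.log(transcript.length > 0); // true, but not for the right reason
```

The contrast at the end is the point: the loose check stays green even if the agent only read the file and never wrote it, while `calledWith` fails unless the exact write happened.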
