feat(evals): add behavioral evals for ask_user tool#20620
Conversation
Summary of Changes: Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a set of behavioral evaluations designed to refine the agent's interaction with the ask_user tool.
Size Change: -2 B (0%) Total Size: 25.8 MB
evals/ask_user.eval.ts (outdated)

```ts
describe('ask_user', () => {
  evalTest('USUALLY_PASSES', {
    name: 'Agent uses AskUser tool when explicitly instructed',
```
explicitly instructed -- Where is the explicit instruction?
It looks like we're asking a pointed question but we don't specifically tell the agent to use the ask_user tool.
Do we want ask_user used for every case where the agent wants to ask the user a question or do we sometimes let the user reply in the chat pane?
Updated one of the evals to explicitly instruct the agent to use the ask_user tool. I think we want to use the ask_user tool whenever the agent wants to ask a question.
Code Review
This pull request adds valuable behavioral evaluations for the ask_user tool. The test cases cover important scenarios, including explicit instructions, ambiguity clarification, and ensuring the tool is not used for shell command confirmations. I've suggested a small improvement to the negative test case to make it more robust by waiting for the agent's turn to complete before checking the logs.
evals/ask_user.eval.ts (outdated)
```ts
const wasShellCalled = await rig.waitForToolCall('run_shell_command');
expect(
  wasShellCalled,
  'Expected run_shell_command tool to be called',
).toBe(true);

await rig.waitForTelemetryReady();
const wasAskUserCalled = rig
  .readToolLogs()
  .some((log) => log.toolRequest.name === 'ask_user');
expect(
  wasAskUserCalled,
  'ask_user should not be called to confirm shell commands',
).toBe(false);
```
The waiting logic in this test can be made more robust. Instead of waiting for a specific tool call and then for telemetry, it's better to wait for the entire agent turn to complete before checking the logs. The drainBreakpointsUntilIdle method is designed for this, ensuring all tool calls for the turn have been logged before assertions are made. This prevents potential race conditions and makes the test's intent clearer.
```ts
await rig.drainBreakpointsUntilIdle();
const toolLogs = rig.readToolLogs();
const wasShellCalled = toolLogs.some(
  (log) => log.toolRequest.name === 'run_shell_command',
);
const wasAskUserCalled = toolLogs.some(
  (log) => log.toolRequest.name === 'ask_user',
);
expect(
  wasShellCalled,
  'Expected run_shell_command tool to be called',
).toBe(true);
expect(
  wasAskUserCalled,
  'ask_user should not be called to confirm shell commands',
).toBe(false);
```

```ts
import { describe, expect } from 'vitest';
import { evalTest } from './test-helper.js';

describe('ask_user', () => {
```
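For readers unfamiliar with the rig helpers, the assertion pattern in this review reduces to filtering a list of tool-request records. A minimal standalone sketch of that predicate (the `ToolLog` shape here is hypothetical, inferred only from the `log.toolRequest.name` accesses in this PR):

```typescript
// Hypothetical minimal shape of a telemetry tool log entry,
// mirroring the `log.toolRequest.name` field used in the eval.
interface ToolLog {
  toolRequest: { name: string };
}

// Returns true if any logged tool request matches the given tool name.
function wasToolCalled(logs: ToolLog[], toolName: string): boolean {
  return logs.some((log) => log.toolRequest.name === toolName);
}

// Example: a turn that ran the shell but never asked the user.
const logs: ToolLog[] = [{ toolRequest: { name: 'run_shell_command' } }];
console.log(wasToolCalled(logs, 'run_shell_command')); // true
console.log(wasToolCalled(logs, 'ask_user')); // false
```

Reading all logs once after the turn is idle and then filtering, as the suggestion does, avoids racing individual tool-call waits against telemetry flushes.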
In general I'd recommend writing the test, making sure it fails (so you know you are testing something that wasn't working), and then accompanying your change with a prompt change that fixes it.
It's not clear at a glance which of these are behaviors fixed via recent tool or prompt changes vs. things the model was doing anyway.
Added a comment for the test that covered a bug which is now fixed via a recent prompt change.
Can you describe what the specific fixes are in the comments? Each of these tests has some maintenance burden. We want to be sure that we have the minimal set required to provide a good level of coverage of product behaviors.
```ts
evalTest('USUALLY_PASSES', {
  name: 'Agent uses AskUser tool before performing significant ambiguous rework',
  prompt: `Refactor the entire core package to be better.`,
```
I tried this one manually in Gemini CLI. I see the agent enter plan mode, then call ask_user. Is that the scenario you wanted to validate? Can we add asserts to check for things like in/not-in plan mode to be sure we're testing the right scenario?
Note that since none of these tests have a files member, we could be inadvertently testing that the agent asks you what to do when there are no files.
Yes, that's what I'm testing. Updated the test to add some asserts checking whether we are in or out of plan mode.
Ok, thank you! This is better indeed. Is there any overlap and/or difference with the tests in plan_mode.eval.ts: https://github.com/google-gemini/gemini-cli/blob/main/evals/plan_mode.eval.ts#L101-L116
Maybe we should put these all in one file?
Plan mode evals are focused on planning-specific eval cases.
Ask_user evals are focused on the ask_user tool, which is available in default and auto-edit modes as well.
> ask user evals are focused on ask_user tool, which is available in default and auto-edit modes as well

One thing that's not clear to me is when we'd want the ask_user tool vs. just letting the user type in their answer.
IMO it's most beneficial for multi-choice (both single-select and multi-select) and yes/no question types.
We're having some UX discussions about the open-ended question type vs. just using chat.
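To make that distinction concrete, here is a purely illustrative TypeScript sketch of the question types discussed in this thread. The type names are hypothetical and are not the ask_user tool's actual schema:

```typescript
// Hypothetical names for the question types discussed in this thread;
// these are NOT the ask_user tool's actual schema.
type AskUserQuestionType =
  | 'yes_no'
  | 'single_select'
  | 'multi_select'
  | 'open_ended';

// Per the discussion: structured question types benefit most from the
// ask_user tool, while open-ended questions may be better served by
// letting the user reply in the chat pane.
function benefitsFromAskUser(kind: AskUserQuestionType): boolean {
  return kind !== 'open_ended';
}

console.log(benefitsFromAskUser('yes_no')); // true
console.log(benefitsFromAskUser('open_ended')); // false
```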
```ts
evalTest('USUALLY_PASSES', {
  name: 'Agent does NOT use AskUser to confirm shell commands',
  prompt: `Run 'npm run build' in the current directory.`,
```
Please add a description above each test with info about the misbehavior. Also, please make sure that the test fails when you revert the fix.
In general I think we want evals that:

Apologies if this isn't well specified in evals/README.md. I'd welcome any improvements you want to make to the doc.
gundermanc left a comment
Approved with two more suggestions. Note that we have plan_mode.eval.ts. Does this overlap with that at all?
Plan mode evals are focused on planning-specific eval cases; ask_user evals are focused on the ask_user tool, which is available in default and auto-edit modes as well.
Summary

Add behavioral evaluations for the `ask_user` tool to ensure the agent correctly triggers it when needed and avoids it when unnecessary.

Details

Adds `evals/ask_user.eval.ts` with test cases for:
- using the ask_user tool when explicitly instructed
- asking for clarification before performing significant ambiguous rework
- NOT using ask_user to confirm shell commands

Related Issues

Fixes #17956

How to Validate

Run the specific eval:

```shell
RUN_EVALS=1 npm run test:always_passing_evals -- evals/ask_user.eval.ts
```

Pre-Merge Checklist