feat(evals): add behavioral evals for ask_user tool #20620
```ts
/**
 * @license
 * Copyright 2026 Google LLC
 * SPDX-License-Identifier: Apache-2.0
 */

import { describe, expect } from 'vitest';
import { evalTest } from './test-helper.js';

describe('ask_user', () => {
  evalTest('USUALLY_PASSES', {
    name: 'Agent uses AskUser tool to present multiple choice options',
    prompt: `Use the ask_user tool to ask me what my favorite color is. Provide 3 options: red, green, or blue.`,
    assert: async (rig) => {
      const wasToolCalled = await rig.waitForToolCall('ask_user');
      expect(wasToolCalled, 'Expected ask_user tool to be called').toBe(true);
    },
  });

  evalTest('USUALLY_PASSES', {
    name: 'Agent uses AskUser tool to clarify ambiguous requirements',
    files: {
      'package.json': JSON.stringify({ name: 'my-app', version: '1.0.0' }),
    },
    prompt: `I want to build a new feature in this app. Ask me questions to clarify the requirements before proceeding.`,
    assert: async (rig) => {
      const wasToolCalled = await rig.waitForToolCall('ask_user');
      expect(wasToolCalled, 'Expected ask_user tool to be called').toBe(true);
    },
  });

  evalTest('USUALLY_PASSES', {
    name: 'Agent uses AskUser tool before performing significant ambiguous rework',
    files: {
      'packages/core/src/index.ts': '// index\nexport const version = "1.0.0";',
      'packages/core/src/util.ts': '// util\nexport function help() {}',
      'packages/core/package.json': JSON.stringify({
        name: '@google/gemini-cli-core',
      }),
      'README.md': '# Gemini CLI',
    },
    prompt: `Refactor the entire core package to be better.`,
```
**Member:** I tried this one manually in Gemini CLI. I see the agent enter plan mode, then call ask_user. Is that the scenario you wanted to validate? Can we add asserts to check for things like in/not-in plan mode to be sure we're testing the right scenario?

**Member:** Note that since none of these tests have a …

**Contributor (author):** Yes, that's what I'm testing. Updated the test to add some asserts checking whether we are in or not in plan mode.

**Member:** Ok, thank you! This is better indeed. Is there any overlap and/or difference with the tests in plan_mode.eval.test: https://github.com/google-gemini/gemini-cli/blob/main/evals/plan_mode.eval.ts#L101-L116 Maybe we should put these all in one file?

**Contributor:** Plan mode evals are focused on planning-specific eval cases; ask_user evals are focused on the ask_user tool, which is available in default and auto-edit modes as well.

**Member:** One thing that's not clear to me is when we'd want the ask_user tool vs. just letting the user type in their answer.

**Contributor:** IMO it's most beneficial for multi-choice (both single-select and multi-select) and yes/no question types. We're having some UX discussions about the open-ended question type vs. just using chat.
```ts
    assert: async (rig) => {
      const wasPlanModeCalled = await rig.waitForToolCall('enter_plan_mode');
      expect(wasPlanModeCalled, 'Expected enter_plan_mode to be called').toBe(
        true,
      );

      const wasAskUserCalled = await rig.waitForToolCall('ask_user');
      expect(
        wasAskUserCalled,
        'Expected ask_user tool to be called to clarify the significant rework',
      ).toBe(true);
    },
  });

  // --- Regression Tests for Recent Fixes ---

  // Regression test for issue #20177: Ensure the agent does not use `ask_user` to
  // confirm shell commands. Fixed via prompt refinements and tool definition
  // updates to clarify that shell command confirmation is handled by the UI.
  // See fix: https://github.com/google-gemini/gemini-cli/pull/20504
  evalTest('USUALLY_PASSES', {
    name: 'Agent does NOT use AskUser to confirm shell commands',
    files: {
      'package.json': JSON.stringify({
        scripts: { build: 'echo building' },
      }),
    },
    prompt: `Run 'npm run build' in the current directory.`,
```
**Member:** Was this a bug at one time?

**Contributor (author):** Yes, this was the bug #20177

**Member:** Please add a description above each test with info about the misbehavior. Also, please make sure that the test fails when you revert the fix.
```ts
    assert: async (rig) => {
      await rig.waitForTelemetryReady();

      const toolLogs = rig.readToolLogs();
      const wasShellCalled = toolLogs.some(
        (log) => log.toolRequest.name === 'run_shell_command',
      );
      const wasAskUserCalled = toolLogs.some(
        (log) => log.toolRequest.name === 'ask_user',
      );

      expect(
        wasShellCalled,
        'Expected run_shell_command tool to be called',
      ).toBe(true);
      expect(
        wasAskUserCalled,
        'ask_user should not be called to confirm shell commands',
      ).toBe(false);
    },
  });
});
```
**Member:** In general I'd recommend writing the test, being sure it fails (so you know you are testing something that wasn't working), and then accompanying your change with a prompt change that fixes it. It's not clear at a glance which of these are behaviors fixed via recent tool or prompt changes vs. things the model was doing anyway.

**Contributor (author):** Added a comment for the test that was a bug before and is now fixed via a recent prompt change.

**Member:** Can you describe what the specific fixes are in the comments? Each of these tests has some maintenance burden. We want to be sure that we have the minimal set required to provide a good level of coverage of product behaviors.
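The review asks for explicit in/not-in plan-mode assertions. One way to express both positive and negative assertions over the telemetry tool logs is to collect the called tool names into a set first. This is a sketch only: the `toolRequest.name` log shape is assumed from the `rig.readToolLogs()` usage in the diff, and `calledTools` is a hypothetical helper, not part of the eval rig.

```typescript
// Minimal shape of a tool-call log entry, assumed from the diff's
// `log.toolRequest.name` accesses.
interface ToolLog {
  toolRequest: { name: string };
}

// Collect the distinct tool names that appear in the logs, so assertions
// like "ask_user was called" and "enter_plan_mode was not" are one-liners.
function calledTools(logs: ToolLog[]): Set<string> {
  return new Set(logs.map((log) => log.toolRequest.name));
}

// Example logs resembling the shell-command regression scenario.
const logs: ToolLog[] = [
  { toolRequest: { name: 'run_shell_command' } },
  { toolRequest: { name: 'ask_user' } },
];

const tools = calledTools(logs);
console.log(tools.has('run_shell_command')); // true
console.log(tools.has('enter_plan_mode')); // false
```

A set-based helper like this also avoids repeated `toolLogs.some(...)` scans when a single test asserts on several tools at once.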