Merged
92 changes: 92 additions & 0 deletions evals/ask_user.eval.ts
@@ -0,0 +1,92 @@
/**
* @license
* Copyright 2026 Google LLC
* SPDX-License-Identifier: Apache-2.0
*/

import { describe, expect } from 'vitest';
import { evalTest } from './test-helper.js';

describe('ask_user', () => {
Member:

In general I'd recommend writing the test, being sure it fails (so you know you are testing something that wasn't working), and then accompanying your change with a prompt change that fixes it.

It's not clear at a glance which of these are behaviors fixed via recent tool or prompt changes vs. things the model was doing anyway.

Contributor Author:

Added a comment for the test that was a bug before and is now fixed via a recent prompt change.

Member:

Can you describe what the specific fixes are in the comments? Each of these tests has some maintenance burden. We want to be sure that we have the minimal set required to provide a good level of coverage of product behaviors.

  evalTest('USUALLY_PASSES', {
    name: 'Agent uses AskUser tool to present multiple choice options',
    prompt: `Use the ask_user tool to ask me what my favorite color is. Provide 3 options: red, green, or blue.`,
    assert: async (rig) => {
      const wasToolCalled = await rig.waitForToolCall('ask_user');
      expect(wasToolCalled, 'Expected ask_user tool to be called').toBe(true);
    },
  });

  evalTest('USUALLY_PASSES', {
    name: 'Agent uses AskUser tool to clarify ambiguous requirements',
    files: {
      'package.json': JSON.stringify({ name: 'my-app', version: '1.0.0' }),
    },
    prompt: `I want to build a new feature in this app. Ask me questions to clarify the requirements before proceeding.`,
    assert: async (rig) => {
      const wasToolCalled = await rig.waitForToolCall('ask_user');
      expect(wasToolCalled, 'Expected ask_user tool to be called').toBe(true);
    },
  });

  evalTest('USUALLY_PASSES', {
    name: 'Agent uses AskUser tool before performing significant ambiguous rework',
    files: {
      'packages/core/src/index.ts': '// index\nexport const version = "1.0.0";',
      'packages/core/src/util.ts': '// util\nexport function help() {}',
      'packages/core/package.json': JSON.stringify({
        name: '@google/gemini-cli-core',
      }),
      'README.md': '# Gemini CLI',
    },
    prompt: `Refactor the entire core package to be better.`,
Member:

I tried this one manually in Gemini CLI. I see the agent enter plan mode, then call ask_user. Is that the scenario you wanted to validate? Can we add asserts to check for things like in/not-in plan mode to be sure we're testing the right scenario?


Member:

Note that since none of these tests have a files member, we could be inadvertently testing that the agent asks you what to do when there are no files.


Contributor Author:

Yes, that's what I'm testing. Updated the test to add some asserts checking whether we are in/not-in plan mode.


Member:

Ok, thank you! This is better indeed. Is there any overlap and/or difference with the tests in plan_mode.eval.ts: https://github.com/google-gemini/gemini-cli/blob/main/evals/plan_mode.eval.ts#L101-L116

Maybe we should put these all in one file?


Contributor:

Plan mode evals are focused on planning-specific eval cases.

Ask user evals are focused on the ask_user tool, which is available in default and auto-edit modes as well.


Member:

> Ask user evals are focused on the ask_user tool, which is available in default and auto-edit modes as well.

One thing that's not clear to me is when we'd want the ask_user tool vs. just letting the user type in their answer.


Contributor:

IMO it's most beneficial for multi-choice (both single-select and multi-select) and yes/no question types.

We're having some UX discussions about the open-ended question type vs. just using chat.


    assert: async (rig) => {
      const wasPlanModeCalled = await rig.waitForToolCall('enter_plan_mode');
      expect(wasPlanModeCalled, 'Expected enter_plan_mode to be called').toBe(
        true,
      );

      const wasAskUserCalled = await rig.waitForToolCall('ask_user');
      expect(
        wasAskUserCalled,
        'Expected ask_user tool to be called to clarify the significant rework',
      ).toBe(true);
    },
  });
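The thread above asks for asserts that distinguish in/not-in plan mode. `waitForToolCall` only confirms that each call happened at some point; it does not check that the agent entered plan mode *before* asking the user. A minimal sketch of a pure ordering check over the same `toolRequest.name` log shape used later in this file — the `ToolLog` interface and `calledBefore` helper are illustrative assumptions, not part of the rig API:

```typescript
// Hypothetical shape mirroring entries from rig.readToolLogs().
interface ToolLog {
  toolRequest: { name: string };
}

// Returns true only if both tools were called and `first` appears
// before `second` in the log order.
function calledBefore(logs: ToolLog[], first: string, second: string): boolean {
  const i = logs.findIndex((log) => log.toolRequest.name === first);
  const j = logs.findIndex((log) => log.toolRequest.name === second);
  return i !== -1 && j !== -1 && i < j;
}

// Example: verify enter_plan_mode was requested before ask_user.
const logs: ToolLog[] = [
  { toolRequest: { name: 'enter_plan_mode' } },
  { toolRequest: { name: 'ask_user' } },
];
console.log(calledBefore(logs, 'enter_plan_mode', 'ask_user')); // true
console.log(calledBefore(logs, 'ask_user', 'enter_plan_mode')); // false
```

An assert built on something like this would fail if the agent asked for clarification without first entering plan mode, tightening the scenario the reviewer wanted validated.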

  // --- Regression Tests for Recent Fixes ---

  // Regression test for issue #20177: Ensure the agent does not use `ask_user` to
  // confirm shell commands. Fixed via prompt refinements and tool definition
  // updates to clarify that shell command confirmation is handled by the UI.
  // See fix: https://github.com/google-gemini/gemini-cli/pull/20504
  evalTest('USUALLY_PASSES', {
    name: 'Agent does NOT use AskUser to confirm shell commands',
    files: {
      'package.json': JSON.stringify({
        scripts: { build: 'echo building' },
      }),
    },
    prompt: `Run 'npm run build' in the current directory.`,
Member:

Was this a bug at one time?


Contributor Author:

Yes, this was bug #20177.


Member:

Please add a description above each test with info about the misbehavior. Also, please make sure that the test fails when you revert the fix.


    assert: async (rig) => {
      await rig.waitForTelemetryReady();

      const toolLogs = rig.readToolLogs();
      const wasShellCalled = toolLogs.some(
        (log) => log.toolRequest.name === 'run_shell_command',
      );
      const wasAskUserCalled = toolLogs.some(
        (log) => log.toolRequest.name === 'ask_user',
      );

      expect(
        wasShellCalled,
        'Expected run_shell_command tool to be called',
      ).toBe(true);
      expect(
        wasAskUserCalled,
        'ask_user should not be called to confirm shell commands',
      ).toBe(false);
    },
  });
});
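The regression test above repeats the same `toolLogs.some(...)` lookup for each tool name, and the earlier question-type thread suggests more negative assertions of this kind may be added. A small sketch of how that lookup could be factored into a reusable predicate — the `ToolLog` type and `wasToolInvoked` helper are assumptions for illustration, not existing rig utilities:

```typescript
// Hypothetical shape mirroring entries from rig.readToolLogs().
interface ToolLog {
  toolRequest: { name: string };
}

// True if any telemetry entry records a request for the named tool.
function wasToolInvoked(logs: ToolLog[], name: string): boolean {
  return logs.some((log) => log.toolRequest.name === name);
}

// Example mirroring the shell-command regression check: the shell ran,
// and ask_user was never used to confirm it.
const logs: ToolLog[] = [{ toolRequest: { name: 'run_shell_command' } }];
console.log(wasToolInvoked(logs, 'run_shell_command')); // true
console.log(wasToolInvoked(logs, 'ask_user')); // false
```

Centralizing the predicate would keep future regression tests (e.g. for other tools that must not trigger ask_user) down to one positive and one negative assertion each.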