
feat(evals): add behavioral evals for ask_user tool #20620

Merged
Adib234 merged 7 commits into main from adibakm/eval-ask-user
Mar 3, 2026

Conversation

Contributor

@Adib234 Adib234 commented Feb 27, 2026

Summary

Add behavioral evaluations for the ask_user tool to ensure the agent correctly triggers it when needed and avoids it when unnecessary.

Details

  • Added evals/ask_user.eval.ts with test cases for:
    • Explicit instruction to ask user.
    • Ambiguous requirements clarification.
    • Significant rework clarification.
    • Ensuring it's NOT used for shell command confirmation.

Related Issues

Fixes #17956

How to Validate

Run the specific eval:
RUN_EVALS=1 npm run test:always_passing_evals -- evals/ask_user.eval.ts

Pre-Merge Checklist

  • Updated relevant documentation and README (if needed)
  • Added/updated tests (if needed)
  • Noted breaking changes (if any)
  • Validated on required platforms/methods:
    • MacOS
      • npm run
      • npx
      • Docker
      • Podman
      • Seatbelt

@Adib234 Adib234 requested a review from a team as a code owner February 27, 2026 22:18
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a set of behavioral evaluations designed to refine the agent's interaction with the ask_user tool. The primary goal is to enhance the agent's intelligence in discerning when user clarification is genuinely necessary versus when it should proceed autonomously, particularly concerning shell command execution. This improvement aims to make the agent more intuitive and efficient in its operations.

Highlights

  • New Behavioral Evaluations: Added a new evaluation file, evals/ask_user.eval.ts, to test the behavioral aspects of the ask_user tool.
  • Tool Triggering Scenarios: Included test cases to ensure the agent correctly triggers the ask_user tool when explicitly instructed, for clarifying ambiguous requirements, and before performing significant, ambiguous rework.
  • Tool Avoidance Scenarios: Implemented a test case to verify that the ask_user tool is NOT used to confirm shell commands, ensuring appropriate tool usage.
Changelog
  • evals/ask_user.eval.ts
    • Added test case for explicit instruction to use ask_user.
    • Added test case for clarifying ambiguous requirements using ask_user.
    • Added test case for using ask_user before significant ambiguous rework.
    • Added test case to ensure ask_user is not used for shell command confirmation.

@Adib234 Adib234 self-assigned this Feb 27, 2026

github-actions bot commented Feb 27, 2026

Size Change: -2 B (0%)

Total Size: 25.8 MB

| Filename | Size | Change |
| --- | --- | --- |
| ./bundle/gemini.js | 25.3 MB | -2 B (0%) |
| ./bundle/node_modules/@google/gemini-cli-devtools/dist/client/main.js | 221 kB | 0 B |
| ./bundle/node_modules/@google/gemini-cli-devtools/dist/src/_client-assets.js | 227 kB | 0 B |
| ./bundle/node_modules/@google/gemini-cli-devtools/dist/src/index.js | 11.5 kB | 0 B |
| ./bundle/node_modules/@google/gemini-cli-devtools/dist/src/types.js | 132 B | 0 B |
| ./bundle/sandbox-macos-permissive-open.sb | 890 B | 0 B |
| ./bundle/sandbox-macos-permissive-proxied.sb | 1.31 kB | 0 B |
| ./bundle/sandbox-macos-restrictive-open.sb | 3.36 kB | 0 B |
| ./bundle/sandbox-macos-restrictive-proxied.sb | 3.56 kB | 0 B |
| ./bundle/sandbox-macos-strict-open.sb | 4.82 kB | 0 B |
| ./bundle/sandbox-macos-strict-proxied.sb | 5.02 kB | 0 B |

compressed-size-action


describe('ask_user', () => {
evalTest('USUALLY_PASSES', {
name: 'Agent uses AskUser tool when explicitly instructed',

Member

"explicitly instructed" -- Where is the explicit instruction?

It looks like we're asking a pointed question but we don't specifically tell the agent to use the ask_user tool.

Do we want ask_user used for every case where the agent wants to ask the user a question or do we sometimes let the user reply in the chat pane?

Contributor Author

Updated one of the evals to explicitly instruct the agent to use the ask_user tool. I think we want to use the ask_user tool whenever the agent wants to ask a question.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds valuable behavioral evaluations for the ask_user tool. The test cases cover important scenarios, including explicit instructions, ambiguity clarification, and ensuring the tool is not used for shell command confirmations. I've suggested a small improvement to the negative test case to make it more robust by waiting for the agent's turn to complete before checking the logs.

Comment on lines +45 to +58
const wasShellCalled = await rig.waitForToolCall('run_shell_command');
expect(
  wasShellCalled,
  'Expected run_shell_command tool to be called',
).toBe(true);

await rig.waitForTelemetryReady();
const wasAskUserCalled = rig
  .readToolLogs()
  .some((log) => log.toolRequest.name === 'ask_user');
expect(
  wasAskUserCalled,
  'ask_user should not be called to confirm shell commands',
).toBe(false);

Contributor

Severity: high

The waiting logic in this test can be made more robust. Instead of waiting for a specific tool call and then for telemetry, it's better to wait for the entire agent turn to complete before checking the logs. The drainBreakpointsUntilIdle method is designed for this, ensuring all tool calls for the turn have been logged before assertions are made. This prevents potential race conditions and makes the test's intent clearer.

      await rig.drainBreakpointsUntilIdle();

      const toolLogs = rig.readToolLogs();
      const wasShellCalled = toolLogs.some(
        (log) => log.toolRequest.name === 'run_shell_command'
      );
      const wasAskUserCalled = toolLogs.some(
        (log) => log.toolRequest.name === 'ask_user'
      );

      expect(
        wasShellCalled,
        'Expected run_shell_command tool to be called'
      ).toBe(true);
      expect(
        wasAskUserCalled,
        'ask_user should not be called to confirm shell commands'
      ).toBe(false);
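As an aside, the assertion logic in the suggestion above reduces to a single scan over the drained logs. Here is a self-contained sketch of just that filtering step, runnable outside the eval harness; the `ToolLog` interface is a hypothetical minimal stand-in for whatever `rig.readToolLogs()` actually returns, and `summarizeToolCalls` is an illustrative helper, not part of the eval rig:

```typescript
// Hypothetical minimal shape of a tool log entry, mirroring the
// `log.toolRequest.name` access used in the assertions above.
interface ToolLog {
  toolRequest: { name: string };
}

// Illustrative helper (not part of the eval rig): scan a drained set
// of logs once and report which tools were invoked during the turn.
function summarizeToolCalls(logs: ToolLog[]) {
  const called = new Set(logs.map((log) => log.toolRequest.name));
  return {
    wasShellCalled: called.has('run_shell_command'),
    wasAskUserCalled: called.has('ask_user'),
  };
}

// Example: a turn that ran the build but never asked the user.
const logs: ToolLog[] = [{ toolRequest: { name: 'run_shell_command' } }];
const { wasShellCalled, wasAskUserCalled } = summarizeToolCalls(logs);
console.log(wasShellCalled, wasAskUserCalled); // true false
```

Collecting names into a Set makes the intent (membership, not ordering) explicit and keeps each assertion independent of when a tool was called within the turn.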

@gemini-cli gemini-cli bot added the area/core, area/platform, and 🔒 maintainer only labels Feb 27, 2026
import { describe, expect } from 'vitest';
import { evalTest } from './test-helper.js';

describe('ask_user', () => {

Member

In general I'd recommend writing the test, being sure it fails (so you know you are testing something that wasn't working), and then accompany your change with a prompt change that fixes it.

It's not clear at a glance which of these are behaviors fixed via recent tool or prompt changes vs. things the model was doing anyway.

Contributor Author

Added a comment to the test covering the behavior that was a bug before and is now fixed via a recent prompt change.

Member

Can you describe what the specific fixes are in the comments? Each of these tests has some maintenance burden. We want to be sure that we have the minimal set required to provide a good level of coverage of product behaviors.


evalTest('USUALLY_PASSES', {
name: 'Agent uses AskUser tool before performing significant ambiguous rework',
prompt: `Refactor the entire core package to be better.`,

Member

I tried this one manually in Gemini CLI. I see the agent enter plan mode, then call ask_user. Is that the scenario you wanted to validate? Can we add asserts to check for things like in/not-in plan mode to be sure we're testing the right scenario?

Member

Note that since none of these tests have a files member, we could be inadvertently testing that the agent asks you what to do when there are no files.

Contributor Author

Yes, that's what I'm testing. Updated the test to add some asserts checking whether we are in plan mode or not.

Member

Ok, thank you! This is better indeed. Is there any overlap and/or difference with the tests in plan_mode.eval.ts: https://github.com/google-gemini/gemini-cli/blob/main/evals/plan_mode.eval.ts#L101-L116

Maybe we should put these all in one file?

Contributor

plan mode evals are focused on planning specific eval cases

ask user evals are focused on ask_user tool, which is available in default and auto-edit modes as well

Member

ask user evals are focused on ask_user tool, which is available in default and auto-edit modes as well

One thing that's not clear to me is when we'd want ask_user tool vs. just letting the user type in their answer.

Contributor

imo it's most beneficial for multi-choice (both single-select and multi-select) and yes/no question types

we're having some UX discussions about the open-ended question type vs just using chat


evalTest('USUALLY_PASSES', {
name: 'Agent does NOT use AskUser to confirm shell commands',
prompt: `Run 'npm run build' in the current directory.`,

Member

Was this a bug at one time?

Contributor Author

Yes, this was the bug #20177

Member

Please add a description above each test with info about the misbehavior. Also, please make sure that the test fails when you revert the fix.

@gundermanc
Member

In general I think we want evals that:

  • Are complicated enough to be "realistic" -- have files and a source directory, like a real agent would.
  • Are small enough to reason about and be maintainable -- we probably can't check in an entire repo as a test case, though maybe in the future we can reference them by URI.
  • Have asserts that are fairly unambiguous -- we want to make sure the test passes for the right reason.
  • Have tests that failed before your prompt or tool change -- we want to be sure the test fails before your "fix". It's pretty easy to accidentally create a passing test that asserts behaviors we get for free.
  • Less is more -- prefer 2 fairly realistic tests that assert the major paths vs. 5 that are more unit-test like. These are evals, so the value is in testing how the agent works in a semi-realistic scenario.

Apologies if this isn't well specified in evals/README.md. I'd welcome any improvements you want to make to the doc.

@jerop jerop requested a review from gundermanc March 2, 2026 21:49

Member

@gundermanc gundermanc left a comment

Approved with two more suggestions. Note that we have plan_mode.eval.ts. Does this overlap with that at all?

@jerop
Contributor

jerop commented Mar 2, 2026

Note that we have plan_mode.eval.ts. Does this overlap with that at all?

plan mode evals are focused on planning specific eval cases

ask user evals are focused on ask_user tool, which is available in default and auto-edit modes as well

@Adib234 Adib234 enabled auto-merge March 3, 2026 14:25
@Adib234 Adib234 added this pull request to the merge queue Mar 3, 2026
Merged via the queue into main with commit fe332bb Mar 3, 2026
27 checks passed
@Adib234 Adib234 deleted the adibakm/eval-ask-user branch March 3, 2026 18:03
BryanBradfo pushed a commit to BryanBradfo/gemini-cli that referenced this pull request Mar 5, 2026
struckoff pushed a commit to struckoff/gemini-cli that referenced this pull request Mar 6, 2026
liamhelmer pushed a commit to badal-io/gemini-cli that referenced this pull request Mar 12, 2026

Labels

  • area/core: Issues related to User Interface, OS Support, Core Functionality
  • area/platform: Issues related to Build infra, Release mgmt, Testing, Eval infra, Capacity, Quota mgmt
  • 🔒 maintainer only: ⛔ Do not contribute. Internal roadmap item.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add behavioral evals for AskUser tool

3 participants