
feat(evals): add behavioral evals for ask_user tool #20620

Merged
Adib234 merged 7 commits into main from adibakm/eval-ask-user
Mar 3, 2026

Conversation

Contributor

@Adib234 Adib234 commented Feb 27, 2026

Summary

Add behavioral evaluations for the ask_user tool to ensure the agent correctly triggers it when needed and avoids it when unnecessary.

Details

  • Added evals/ask_user.eval.ts with test cases for:
    • Explicit instruction to ask user.
    • Ambiguous requirements clarification.
    • Significant rework clarification.
    • Ensuring it's NOT used for shell command confirmation.

Related Issues

Fixes #17956

How to Validate

Run the specific eval:
RUN_EVALS=1 npm run test:always_passing_evals -- evals/ask_user.eval.ts

Pre-Merge Checklist

  • Updated relevant documentation and README (if needed)
  • Added/updated tests (if needed)
  • Noted breaking changes (if any)
  • Validated on required platforms/methods:
    • MacOS
      • npm run
      • npx
      • Docker
      • Podman
      • Seatbelt

@Adib234 Adib234 requested a review from a team as a code owner February 27, 2026 22:18
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a set of behavioral evaluations designed to refine the agent's interaction with the ask_user tool. The primary goal is to enhance the agent's intelligence in discerning when user clarification is genuinely necessary versus when it should proceed autonomously, particularly concerning shell command execution. This improvement aims to make the agent more intuitive and efficient in its operations.

Highlights

  • New Behavioral Evaluations: Added a new evaluation file, evals/ask_user.eval.ts, to test the behavioral aspects of the ask_user tool.
  • Tool Triggering Scenarios: Included test cases to ensure the agent correctly triggers the ask_user tool when explicitly instructed, for clarifying ambiguous requirements, and before performing significant, ambiguous rework.
  • Tool Avoidance Scenarios: Implemented a test case to verify that the ask_user tool is NOT used to confirm shell commands, ensuring appropriate tool usage.
Changelog
  • evals/ask_user.eval.ts
    • Added test case for explicit instruction to use ask_user.
    • Added test case for clarifying ambiguous requirements using ask_user.
    • Added test case for using ask_user before significant ambiguous rework.
    • Added test case to ensure ask_user is not used for shell command confirmation.

@Adib234 Adib234 self-assigned this Feb 27, 2026

github-actions bot commented Feb 27, 2026

Size Change: -2 B (0%)

Total Size: 25.8 MB

| Filename | Size | Change |
| --- | --- | --- |
| ./bundle/gemini.js | 25.3 MB | -2 B (0%) |
| ./bundle/node_modules/@google/gemini-cli-devtools/dist/client/main.js | 221 kB | 0 B |
| ./bundle/node_modules/@google/gemini-cli-devtools/dist/src/_client-assets.js | 227 kB | 0 B |
| ./bundle/node_modules/@google/gemini-cli-devtools/dist/src/index.js | 11.5 kB | 0 B |
| ./bundle/node_modules/@google/gemini-cli-devtools/dist/src/types.js | 132 B | 0 B |
| ./bundle/sandbox-macos-permissive-open.sb | 890 B | 0 B |
| ./bundle/sandbox-macos-permissive-proxied.sb | 1.31 kB | 0 B |
| ./bundle/sandbox-macos-restrictive-open.sb | 3.36 kB | 0 B |
| ./bundle/sandbox-macos-restrictive-proxied.sb | 3.56 kB | 0 B |
| ./bundle/sandbox-macos-strict-open.sb | 4.82 kB | 0 B |
| ./bundle/sandbox-macos-strict-proxied.sb | 5.02 kB | 0 B |

compressed-size-action


describe('ask_user', () => {
evalTest('USUALLY_PASSES', {
name: 'Agent uses AskUser tool when explicitly instructed',

Member

"explicitly instructed" -- Where is the explicit instruction?

It looks like we're asking a pointed question but we don't specifically tell the agent to use the ask_user tool.

Do we want ask_user used for every case where the agent wants to ask the user a question or do we sometimes let the user reply in the chat pane?

Contributor Author

Updated one of the evals to explicitly instruct the agent to use the ask_user tool. I think we want to use the ask_user tool whenever the agent wants to ask a question.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds valuable behavioral evaluations for the ask_user tool. The test cases cover important scenarios, including explicit instructions, ambiguity clarification, and ensuring the tool is not used for shell command confirmations. I've suggested a small improvement to the negative test case to make it more robust by waiting for the agent's turn to complete before checking the logs.

Comment on lines +45 to +58
const wasShellCalled = await rig.waitForToolCall('run_shell_command');
expect(
  wasShellCalled,
  'Expected run_shell_command tool to be called',
).toBe(true);

await rig.waitForTelemetryReady();
const wasAskUserCalled = rig
  .readToolLogs()
  .some((log) => log.toolRequest.name === 'ask_user');
expect(
  wasAskUserCalled,
  'ask_user should not be called to confirm shell commands',
).toBe(false);

Contributor

Severity: high

The waiting logic in this test can be made more robust. Instead of waiting for a specific tool call and then for telemetry, it's better to wait for the entire agent turn to complete before checking the logs. The drainBreakpointsUntilIdle method is designed for this, ensuring all tool calls for the turn have been logged before assertions are made. This prevents potential race conditions and makes the test's intent clearer.

      await rig.drainBreakpointsUntilIdle();

      const toolLogs = rig.readToolLogs();
      const wasShellCalled = toolLogs.some(
        (log) => log.toolRequest.name === 'run_shell_command'
      );
      const wasAskUserCalled = toolLogs.some(
        (log) => log.toolRequest.name === 'ask_user'
      );

      expect(
        wasShellCalled,
        'Expected run_shell_command tool to be called'
      ).toBe(true);
      expect(
        wasAskUserCalled,
        'ask_user should not be called to confirm shell commands'
      ).toBe(false);
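As an aside, the assertion logic in the suggestion above reduces to a single scan over the drained logs. Here is a self-contained sketch of just that filtering step, runnable outside the eval harness; the `ToolLog` interface is a hypothetical minimal stand-in for whatever `rig.readToolLogs()` actually returns, and `summarizeToolCalls` is an illustrative helper, not part of the eval rig:

```typescript
// Hypothetical minimal shape of a tool log entry, mirroring the
// `log.toolRequest.name` access used in the assertions above.
interface ToolLog {
  toolRequest: { name: string };
}

// Illustrative helper (not part of the eval rig): scan a drained set
// of logs once and report which tools were invoked during the turn.
function summarizeToolCalls(logs: ToolLog[]) {
  const called = new Set(logs.map((log) => log.toolRequest.name));
  return {
    wasShellCalled: called.has('run_shell_command'),
    wasAskUserCalled: called.has('ask_user'),
  };
}

// Example: a turn that ran the build but never asked the user.
const logs: ToolLog[] = [{ toolRequest: { name: 'run_shell_command' } }];
const { wasShellCalled, wasAskUserCalled } = summarizeToolCalls(logs);
console.log(wasShellCalled, wasAskUserCalled); // true false
```

Collecting names into a Set makes the intent (membership, not ordering) explicit and keeps each assertion independent of when a tool was called within the turn.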

@gemini-cli gemini-cli bot added the area/core, area/platform, and 🔒 maintainer only labels Feb 27, 2026
import { describe, expect } from 'vitest';
import { evalTest } from './test-helper.js';

describe('ask_user', () => {

Member

In general I'd recommend writing the test, being sure it fails (so you know you are testing something that wasn't working), and then accompany your change with a prompt change that fixes it.

It's not clear at a glance which of these are behaviors fixed via recent tool or prompt changes vs. things the model was doing anyway.

Contributor Author

Added a comment to the test covering the behavior that was a bug before and is now fixed via a recent prompt change.

Member

Can you describe what the specific fixes are in the comments? Each of these tests has some maintenance burden. We want to be sure that we have the minimal set required to provide a good level of coverage of product behaviors.


evalTest('USUALLY_PASSES', {
name: 'Agent uses AskUser tool before performing significant ambiguous rework',
prompt: `Refactor the entire core package to be better.`,

Member

I tried this one manually in Gemini CLI. I see the agent enter plan mode, then call ask_user. Is that the scenario you wanted to validate? Can we add asserts to check for things like in/not-in plan mode to be sure we're testing the right scenario?

Member

Note that since none of these tests have a files member, we could be inadvertently testing that the agent asks you what to do when there are no files.

Contributor Author

Yes, that's what I'm testing. Updated the test to add some asserts checking whether we are in plan mode or not.

Member

Ok, thank you! This is better indeed. Is there any overlap and/or difference with the tests in plan_mode.eval.ts: https://github.com/google-gemini/gemini-cli/blob/main/evals/plan_mode.eval.ts#L101-L116

Maybe we should put these all in one file?

Contributor

plan mode evals are focused on planning specific eval cases

ask user evals are focused on ask_user tool, which is available in default and auto-edit modes as well

Member

ask user evals are focused on ask_user tool, which is available in default and auto-edit modes as well

One thing that's not clear to me is when we'd want ask_user tool vs. just letting the user type in their answer.

Contributor

imo it's most beneficial for multi-choice (both single-select and multi-select) and yes/no question types

we're having some UX discussions about the open-ended question type vs just using chat


evalTest('USUALLY_PASSES', {
name: 'Agent does NOT use AskUser to confirm shell commands',
prompt: `Run 'npm run build' in the current directory.`,

Member

Was this a bug at one time?

Contributor Author

Yes, this was the bug #20177

Member

Please add a description above each test with info about the misbehavior. Also, please make sure that the test fails when you revert the fix.

@gundermanc
Member

In general I think we want evals that:

  • Are complicated enough to be "realistic" -- have files and a source directory, like a real agent would.
  • Are small enough to reason about and be maintainable -- we probably can't check in an entire repo as a test case, though maybe in the future we can reference them by URI.
  • Have asserts that are fairly unambiguous -- we want to make sure the test passes for the right reason.
  • Have tests that failed before your prompt or tool change -- we want to be sure the test fails before your "fix". It's pretty easy to accidentally create a passing test that asserts behaviors we get for free.
  • Less is more -- prefer 2 fairly realistic tests that assert the major paths vs. 5 that are more unit-test like. These are evals, so the value is in testing how the agent works in a semi-realistic scenario.

Apologies if this isn't well specified in evals/README.md. I'd welcome any improvements you want to make to the doc.

@jerop jerop requested a review from gundermanc March 2, 2026 21:49

Member

@gundermanc gundermanc left a comment

Approved with two more suggestions. Note that we have plan_mode.eval.ts. Does this overlap with that at all?

@jerop
Contributor

jerop commented Mar 2, 2026

Note that we have plan_mode.eval.ts. Does this overlap with that at all?

plan mode evals are focused on planning specific eval cases

ask user evals are focused on ask_user tool, which is available in default and auto-edit modes as well

@Adib234 Adib234 enabled auto-merge March 3, 2026 14:25
@Adib234 Adib234 added this pull request to the merge queue Mar 3, 2026
Merged via the queue into main with commit fe332bb Mar 3, 2026
27 checks passed
@Adib234 Adib234 deleted the adibakm/eval-ask-user branch March 3, 2026 18:03
BryanBradfo pushed a commit to BryanBradfo/gemini-cli that referenced this pull request Mar 5, 2026
struckoff pushed a commit to struckoff/gemini-cli that referenced this pull request Mar 6, 2026
liamhelmer pushed a commit to badal-io/gemini-cli that referenced this pull request Mar 12, 2026

Labels

  • area/core: Issues related to User Interface, OS Support, Core Functionality
  • area/platform: Issues related to Build infra, Release mgmt, Testing, Eval infra, Capacity, Quota mgmt
  • 🔒 maintainer only: ⛔ Do not contribute. Internal roadmap item.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add behavioral evals for AskUser tool

3 participants