Add behavioral evals for tracker by anj-s · Pull Request #20069 · google-gemini/gemini-cli

anj-s · 2026-02-23T20:48:33Z

Summary

This PR introduces behavioral evaluations for the Task Tracker in evals/tracker.eval.ts. These tests ensure the model correctly utilizes tracker tools (tracker_create_task, tracker_update_task) in both explicit and implicit scenarios when ApprovalMode.YOLO is enabled.

Details

Explicit Management Eval: Validates that the model can follow instructions to create a task, perform a fix, and then close the task in the tracker.
Implicit Organization Eval: Verifies that the model autonomously deduces when to use tracker tools to organize a complex implementation plan, even when not explicitly prompted to use the tracker.
Safety Verification: Ensures the model respects "plan-only" prompts by confirming no code modifications are made during the planning phase.
Test Setup: Properly configures the evaluation rig by injecting experimental.taskTracker = true into the model settings.
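
As a rough illustration of the checks listed above, the following self-contained sketch asserts over a recorded list of tool calls. The `ToolCall` shape, the helper names, and the file-modifying tool names are assumptions made for this example; they are not taken from evals/tracker.eval.ts or the eval rig.

```typescript
// Hypothetical shape of one recorded tool call; the real rig's log format may differ.
interface ToolCall {
  toolName: string;
  args: Record<string, unknown>;
}

// Settings fragment described in the PR: the tracker sits behind an experimental flag.
const settings = { experimental: { taskTracker: true } };

// Tools treated as "code modifications" (assumed names, for illustration only).
const WRITE_TOOLS: string[] = ["write_file", "replace"];

// Explicit management: a task is created and a later call updates it.
function createdThenUpdated(calls: ToolCall[]): boolean {
  const createIdx = calls.findIndex((c) => c.toolName === "tracker_create_task");
  if (createIdx === -1) return false;
  return calls
    .slice(createIdx + 1)
    .some((c) => c.toolName === "tracker_update_task");
}

// Plan-only safety: tracker calls are allowed, file-modifying calls are not.
function respectsPlanOnly(calls: ToolCall[]): boolean {
  return calls.every((c) => WRITE_TOOLS.indexOf(c.toolName) === -1);
}

// Example logs for the two scenarios.
const explicitRun: ToolCall[] = [
  { toolName: "tracker_create_task", args: { title: "Fix bug" } },
  { toolName: "replace", args: { file_path: "src/app.ts" } },
  { toolName: "tracker_update_task", args: { status: "closed" } },
];
const planOnlyRun: ToolCall[] = [
  { toolName: "tracker_create_task", args: { title: "Step 1" } },
  { toolName: "tracker_create_task", args: { title: "Step 2" } },
];
```

Checking the order of calls rather than only their presence keeps the explicit eval from passing when the model updates a task it never created.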

Related Issues

Fixes #19965

How to Validate

Run the evaluation tests locally:

npm run test:all_evals -- evals/tracker.eval.ts

Pre-Merge Checklist

Updated relevant documentation and README (if needed)
Added/updated tests (if needed)
Noted breaking changes (if any)
Validated on required platforms/methods:
- MacOS
  - npm run


gemini-code-assist · 2026-02-23T20:48:47Z

Summary of Changes

Hello @anj-s, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive suite of behavioral evaluations for the tracker tool. These evaluations are designed to verify that the tool correctly responds to both explicit user commands and implicitly understands when to engage its functionalities, such as task initialization, creation, listing, visualization, and status updates, ensuring robust and intelligent interaction with the model.

Highlights

New Behavioral Evaluations: Added a new suite of behavioral evaluation tests specifically for the tracker tool, ensuring its functionality is robustly tested.
Explicit Tracker Usage Tests: Included tests that verify the tracker tool responds correctly to explicit user commands, such as initializing the tracker, creating tasks, listing/visualizing tasks, and updating task statuses.
Implicit Tracker Usage Tests: Implemented evaluations to confirm the tracker tool can implicitly understand user intent and proactively engage its functionalities, like creating tasks for feature plans or initializing for new projects, without direct instructions.

Changelog

evals/tracker.eval.ts
- Added behavioral tests for the tracker tool.
- Included explicit tests for tracker_init, tracker_create_task, tracker_list_tasks, tracker_visualize, and tracker_update_task.
- Added implicit tests for tracker_create_task and tracker_init based on user intent.

Activity

No human activity has been recorded on this pull request yet.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-cli · 2026-02-23T20:48:50Z

Hi @anj-s, thank you so much for your contribution to Gemini CLI! We really appreciate the time and effort you've put into this.

We're making some updates to our contribution process to improve how we track and review changes. Please take a moment to review our recent discussion post: Improving Our Contribution Process & Introducing New Guidelines.

Key Update: Starting January 26, 2026, the Gemini CLI project will require all pull requests to be associated with an existing issue. Any pull requests not linked to an issue by that date will be automatically closed.

Thank you for your understanding and for being a part of our community!

gemini-cli · 2026-02-23T20:48:52Z

Hi there! Thank you for your contribution to Gemini CLI.

To improve our contribution process and better track changes, we now require all pull requests to be associated with an existing issue, as announced in our recent discussion and as detailed in our CONTRIBUTING.md.

This pull request is being closed because it is not currently linked to an issue. Once you have updated the description of this PR to link an issue (e.g., by adding Fixes #123 or Related to #123), it will be automatically reopened.

How to link an issue:
Add a keyword followed by the issue number (e.g., Fixes #123) in the description of your pull request. For more details on supported keywords and how linking works, please refer to the GitHub Documentation on linking pull requests to issues.

Thank you for your understanding and for being a part of our community!
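
The issue-linking rule described in this comment can be sketched as a small check over the PR description. This is only an illustration, not the bot's actual implementation: the function name is hypothetical, the keyword list is GitHub's documented set of closing keywords, and "Related to" is included only because the comment above mentions it.

```typescript
// GitHub's documented closing keywords.
const CLOSING_KEYWORDS: string[] = [
  "close", "closes", "closed",
  "fix", "fixes", "fixed",
  "resolve", "resolves", "resolved",
];

// Returns every issue number referenced as "<keyword> #<number>".
// "related to" is an extra phrase taken from the bot comment, not a GitHub keyword.
function findLinkedIssues(description: string): number[] {
  const pattern = new RegExp(
    `\\b(?:${CLOSING_KEYWORDS.join("|")}|related to)\\s+#(\\d+)`,
    "gi",
  );
  const issues: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(description)) !== null) {
    issues.push(Number(match[1]));
  }
  return issues;
}

// A PR with no linked issue would be closed under the rule above.
function hasLinkedIssue(description: string): boolean {
  return findLinkedIssues(description).length > 0;
}
```

For this PR, `findLinkedIssues("Fixes #19965")` yields the single issue number from the Related Issues section.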

github-actions · 2026-03-09T19:02:14Z

Size Change: -4 B (0%)

Total Size: 26.2 MB

ℹ️ View Unchanged

Filename	Size	Change
`./bundle/gemini.js`	25.7 MB	-4 B (0%)
`./bundle/node_modules/@google/gemini-cli-devtools/dist/client/main.js`	221 kB	0 B
`./bundle/node_modules/@google/gemini-cli-devtools/dist/src/_client-assets.js`	227 kB	0 B
`./bundle/node_modules/@google/gemini-cli-devtools/dist/src/index.js`	11.5 kB	0 B
`./bundle/node_modules/@google/gemini-cli-devtools/dist/src/types.js`	132 B	0 B
`./bundle/sandbox-macos-permissive-open.sb`	890 B	0 B
`./bundle/sandbox-macos-permissive-proxied.sb`	1.31 kB	0 B
`./bundle/sandbox-macos-restrictive-open.sb`	3.36 kB	0 B
`./bundle/sandbox-macos-restrictive-proxied.sb`	3.56 kB	0 B
`./bundle/sandbox-macos-strict-open.sb`	4.82 kB	0 B
`./bundle/sandbox-macos-strict-proxied.sb`	5.02 kB	0 B


gemini-code-assist

Code Review

This pull request introduces valuable behavioral evaluation tests for the task tracker feature, covering both explicit and implicit tool usage scenarios. The tests are well-structured and the prompts are clear. The identified logical issue in an assertion was valid, and a suggestion has been provided to correct it.

evals/tracker.eval.ts

gundermanc · 2026-03-09T19:36:33Z

FYI: Ran these through the nightly run: https://github.com/google-gemini/gemini-cli/actions/runs/22870896466

Looks like they pass 66-100% of the time, depending on model, though it fails at 0% for some models.

Not necessarily blocking.

evals/tracker.eval.ts

anj-s · 2026-03-09T21:41:28Z

> FYI: Ran these through the nightly run: https://github.com/google-gemini/gemini-cli/actions/runs/22870896466
>
> Looks like they pass 66-100% of the time, depending on model, though it fails at 0% for some models.
>
> Not necessarily blocking.

Got it! What does no numbers in the table imply?

gundermanc · 2026-03-10T18:30:15Z

> Got it! What does no numbers in the table imply?

The report lists the results from the last ~7 runs. The rightmost column is the current run. The columns left of that are from previous runs. No numbers means the test did not run in that run, in this case, because it didn't exist.
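
To make the table layout described above concrete, here is a minimal sketch of how one row could be rendered, with an empty cell for runs in which the test did not exist. The types, function names, rounding, and the `-` placeholder are assumptions for illustration; the nightly report's real code is not shown in this thread.

```typescript
// One cell per nightly run, oldest first; null means the test did not run.
type RunResult = { passed: number; total: number } | null;

// Renders a pass rate, or a placeholder when there is nothing to report.
function formatCell(run: RunResult): string {
  if (run === null || run.total === 0) return "-";
  // Integer percentage, rounded down (an assumption about the report's rounding).
  return `${Math.floor((run.passed * 100) / run.total)}%`;
}

// Renders one test's row: name first, then one cell per run.
function formatRow(testName: string, runs: RunResult[]): string {
  return [testName, ...runs.map(formatCell)].join(" | ");
}

// A test added just before the current (rightmost) run has no earlier results.
const row = formatRow("tracker explicit", [
  null,
  null,
  { passed: 2, total: 3 },
]);
```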

anj-s added 24 commits February 18, 2026 12:03

feat(core): implement task tracker foundation and service (Phase 1)

b5db5d3

feat(core,cli): implement task tracker tools and feature flag (Phase 2)

f737126

chore(core): improve ID generation and add runtime task validation

448af0a

fix: address code review comments from bot in trackerService.ts

2d6ee8f

- Use cryptographically secure ID generation with node:crypto
- Implement runtime validation for JSON parsing using Zod
- Optimize circular dependency validation to avoid N+1 file reads
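
The last point in this commit message, avoiding N+1 file reads during circular dependency validation, can be sketched as follows: load every task once into a map, then walk dependencies in memory. The `Task` shape and function name are hypothetical, since trackerService.ts is not shown in this PR.

```typescript
// Hypothetical task record; the real tracker schema may differ.
interface Task {
  id: string;
  dependencies: string[];
}

// Returns true if following dependencies from startId leads back onto the current path.
// `allTasks` is loaded once up front, so the walk itself performs no file reads.
function hasCircularDependency(
  allTasks: Map<string, Task>,
  startId: string,
): boolean {
  const visiting = new Set<string>(); // tasks on the current depth-first path
  const done = new Set<string>(); // tasks fully explored and known cycle-free

  const visit = (id: string): boolean => {
    if (visiting.has(id)) return true; // back edge: cycle found
    if (done.has(id)) return false;
    const task = allTasks.get(id);
    if (task === undefined) return false; // unknown dependency: nothing to follow
    visiting.add(id);
    for (const dep of task.dependencies) {
      if (visit(dep)) return true;
    }
    visiting.delete(id);
    done.add(id);
    return false;
  };

  return visit(startId);
}

// Example: a -> b -> c has no cycle until c is made to depend on a.
const taskMap = new Map<string, Task>([
  ["a", { id: "a", dependencies: ["b"] }],
  ["b", { id: "b", dependencies: ["c"] }],
  ["c", { id: "c", dependencies: [] }],
]);
```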

Merge branch 'u/anj/task-tracker-phase-1' into u/anj/task-tracker-phase-2

9618f8f

docs: update implementation plan for Phase 2

6a24077

feat(tracker): move tracker storage to project temp directory

ae96477

Merge branch 'u/anj/task-tracker-phase-1' into u/anj/task-tracker-phase-2

91e7881

feat(tracker): integrate dynamic storage path in Config and tools

e6b68e7

feat(tracker): simplify tracker storage path

ba73124

Merge branch 'u/anj/task-tracker-phase-1' into u/anj/task-tracker-phase-2

e32d8a2

feat(tracker): update config to use simplified tracker path

1e743ea

feat(tracker): restore session-specific nested storage path

23194dd

Merge branch 'u/anj/task-tracker-phase-1' into u/anj/task-tracker-phase-2

de0ce2c

feat(tracker): restore nested tracker path in Config

eed1ca6

feat(tracker): simplify tracker storage path and flatten directory structure

eb33db5

fix(tracker): lazily initialize tracker directory

daf9da2

chore(tracker): remove plans configuration directory from git tracking

41cf395

remove .gitignore changes

fc298e3

remove .gitignore changes

30fe8ff

si changes: task tracker prep implementation

5d31761

si changes

53755c5

test: add explicit and implicit behavioral evals for tracker

b53aa80

behavioral evals for tracker

1a345a2

gemini-cli bot closed this Feb 23, 2026

anj-s reopened this Feb 23, 2026

anj-s closed this Feb 23, 2026

Base automatically changed from u/anj/task-tracker-phase-3 to main March 6, 2026 00:29

update behavioral evals

441ab08

anj-s added 3 commits March 9, 2026 12:05

merge

c03d3db

Merge branch 'main' into anj/tracker-evals

8ac7fac

address review comments

f50b80b

anj-s requested a review from gundermanc March 9, 2026 19:18

anj-s marked this pull request as ready for review March 9, 2026 19:18

gemini-code-assist bot reviewed Mar 9, 2026

evals/tracker.eval.ts (resolved)

gundermanc reviewed Mar 9, 2026

evals/tracker.eval.ts (outdated, resolved)

gundermanc reviewed Mar 9, 2026

evals/tracker.eval.ts (outdated, resolved)

gundermanc reviewed Mar 9, 2026

evals/tracker.eval.ts (resolved)

gundermanc approved these changes Mar 9, 2026

anj-s and others added 2 commits March 10, 2026 11:26

addressed comments

b10223b

Merge branch 'main' into anj/tracker-evals

b5b62f5

anj-s enabled auto-merge March 10, 2026 18:28

anj-s added this pull request to the merge queue Mar 10, 2026

Merged via the queue into main with commit 2dd0376 Mar 10, 2026
27 checks passed

anj-s deleted the anj/tracker-evals branch March 10, 2026 19:05

gemini-code-assist bot mentioned this pull request Mar 11, 2026

Changelog for v0.34.0-preview.0 #21965

Merged

JaisalJain pushed a commit to JaisalJain/gemini-cli that referenced this pull request Mar 11, 2026

Add behavioral evals for tracker (google-gemini#20069)

98acdf5

kunal-10-cloud pushed a commit to kunal-10-cloud/gemini-cli that referenced this pull request Mar 12, 2026

Add behavioral evals for tracker (google-gemini#20069)

719ac56

liamhelmer pushed a commit to badal-io/gemini-cli that referenced this pull request Mar 12, 2026

Add behavioral evals for tracker (google-gemini#20069)

d7b2371

yashodipmore pushed a commit to yashodipmore/geemi-cli that referenced this pull request Mar 21, 2026

Add behavioral evals for tracker (google-gemini#20069)

777f92f

SUNDRAM07 pushed a commit to SUNDRAM07/gemini-cli that referenced this pull request Mar 30, 2026

Add behavioral evals for tracker (google-gemini#20069)

b17b273

warrenzhu25 pushed a commit to warrenzhu25/gemini-cli that referenced this pull request Apr 9, 2026

Add behavioral evals for tracker (google-gemini#20069)

6bd07d2

anj-s merged 30 commits into main from anj/tracker-evals
