feat(core): add audio and video token estimation to tokenCalculation by himanshu748 · Pull Request #20542 · google-gemini/gemini-cli

himanshu748 · 2026-02-27T10:29:15Z

Fixes #20655

Summary

Adds proper audio and video token estimation to estimateMediaTokens() in tokenCalculation.ts. Previously, audio and video Part objects fell through to a generic JSON.stringify(part).length / 4 fallback, which estimated tokens from the raw base64 string length rather than the actual media content duration.

Problem

The estimateMediaTokens() function handled images (3,000 tokens) and PDFs (25,800 tokens) but had no handling for audio/* or video/* MIME types. When audio or video content was processed (e.g., via @file references or MCP tool responses), the token count was estimated from the length of the base64-encoded data string, which is wildly inaccurate because:

A 30-second MP3 clip (~480 KB base64) would estimate as ~120,000 tokens via JSON.stringify fallback
The Gemini API actually tokenizes that same clip at ~960 tokens (32 tokens/sec x 30s)

This caused the token budget to be massively over-counted for audio/video content, potentially leading to unnecessary context truncation.
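The size of the gap can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted above (a ~480 KB base64 payload, a 30-second clip, 32 tokens/second); the variable names are illustrative and are not taken from tokenCalculation.ts.

```typescript
// Figures quoted in the example above.
const base64Length = 480_000; // ~480 KB of base64 characters
const clipSeconds = 30;
const audioTokensPerSecond = 32;

// Old generic fallback: roughly one token per four characters of serialized JSON.
// (The JSON wrapper around the data adds a few more characters; ignored here.)
const fallbackEstimate = Math.floor(base64Length / 4);

// Duration-based estimate at the documented audio rate.
const durationEstimate = clipSeconds * audioTokensPerSecond;

// How many times larger the fallback estimate is.
const overcountFactor = fallbackEstimate / durationEstimate;

console.log({ fallbackEstimate, durationEstimate, overcountFactor });
```

With these inputs the fallback comes out two orders of magnitude above the duration-based figure, which is the over-count described above.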

Solution

Adds duration-based token estimation following the Gemini API's documented rates:

Audio: ~32 tokens/second (docs)
- Estimates duration from base64 data size using 128 kbps compressed audio bitrate
- Default fallback: 3,840 tokens (~2 minutes) when base64 data unavailable
Video: ~290 tokens/second (258 tokens/frame at 1fps + 32 audio tokens/sec) (docs)
- Estimates duration from base64 data size using ~2 Mbps compressed video bitrate
- Default fallback: 17,400 tokens (~1 minute) when base64 data unavailable
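A minimal sketch of how the size-based estimates described above could be computed. The function and constant names follow the ones listed in this PR, but the bodies are an assumption: in particular, the base64-to-byte conversion (3 bytes per 4 characters) and the rounding with Math.ceil are illustrative choices, not confirmed details of tokenCalculation.ts.

```typescript
// Rates and bitrates quoted in this PR description.
const AUDIO_TOKENS_PER_SECOND = 32;
const VIDEO_TOKENS_PER_SECOND = 258 + 32; // 258 tokens/frame at 1 fps + audio
const COMPRESSED_AUDIO_BYTES_PER_SECOND = 16_000; // 128 kbps
const COMPRESSED_VIDEO_BYTES_PER_SECOND = 250_000; // ~2 Mbps
const DEFAULT_AUDIO_TOKEN_ESTIMATE = 120 * AUDIO_TOKENS_PER_SECOND; // ~2 min
const DEFAULT_VIDEO_TOKEN_ESTIMATE = 60 * VIDEO_TOKENS_PER_SECOND; // ~1 min

// Assumption: 4 base64 characters decode to 3 bytes; padding is ignored.
function approximateDecodedBytes(base64Data: string): number {
  return (base64Data.length * 3) / 4;
}

// Assumption: round up, so any non-empty clip costs at least one token.
function estimateAudioTokens(base64Data?: string): number {
  if (base64Data === undefined) return DEFAULT_AUDIO_TOKEN_ESTIMATE;
  const seconds =
    approximateDecodedBytes(base64Data) / COMPRESSED_AUDIO_BYTES_PER_SECOND;
  return Math.ceil(seconds * AUDIO_TOKENS_PER_SECOND);
}

function estimateVideoTokens(base64Data?: string): number {
  if (base64Data === undefined) return DEFAULT_VIDEO_TOKEN_ESTIMATE;
  const seconds =
    approximateDecodedBytes(base64Data) / COMPRESSED_VIDEO_BYTES_PER_SECOND;
  return Math.ceil(seconds * VIDEO_TOKENS_PER_SECOND);
}

// 640,000 base64 characters ~ 480,000 bytes ~ 30 s of 128 kbps audio.
console.log(estimateAudioTokens("A".repeat(640_000)));
// No inline data (e.g. a fileData reference): the default applies.
console.log(estimateVideoTokens(undefined));
```

Because the bitrates are fixed guesses, the result is only as good as the assumption that the clip was encoded near 128 kbps (audio) or 2 Mbps (video).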

When the countTokens API is available (the primary path), it is still used for exact counts. These heuristics only apply as the sync fallback estimation path.
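The relationship between the two paths can be sketched as follows. Everything here is hypothetical: the TokenCounter interface and the countWithFallback helper are illustrative stand-ins, not the actual calculateRequestTokenCount implementation or the real client type.

```typescript
// Hypothetical minimal shape of a client that exposes a token-counting call.
interface TokenCounter {
  countTokens(request: unknown): Promise<{ totalTokens?: number }>;
}

// Try the exact API count first; use the sync heuristic only if that fails
// or returns no usable number.
async function countWithFallback(
  counter: TokenCounter,
  request: unknown,
  estimateSync: (request: unknown) => number,
): Promise<number> {
  try {
    const response = await counter.countTokens(request);
    if (typeof response.totalTokens === "number") {
      return response.totalTokens;
    }
  } catch {
    // API unavailable or failed: fall through to the heuristic below.
  }
  return estimateSync(request);
}
```

Under this arrangement the heuristics only influence the result when the exact count cannot be obtained, so their inaccuracy is bounded to the fallback case.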

Changes

packages/core/src/utils/tokenCalculation.ts: Added estimateAudioTokens(), estimateVideoTokens(), and extended estimateMediaTokens() to handle audio/* and video/* MIME types
packages/core/src/utils/tokenCalculation.test.ts: Added 14 new tests covering audio/video estimation
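The MIME-type dispatch that estimateMediaTokens() is extended with might look roughly like the sketch below. The MediaPart interface is a simplified local stand-in for the SDK's Part type, classifyMediaPart is a hypothetical helper, and matching PDFs on the exact string application/pdf is an assumption.

```typescript
// Simplified stand-in for the fields of a Part that carry media.
interface MediaPart {
  inlineData?: { mimeType?: string; data?: string };
  fileData?: { mimeType?: string; fileUri?: string };
}

type MediaKind = "image" | "pdf" | "audio" | "video" | "other";

// Hypothetical helper: pick the estimation branch from the MIME type.
function classifyMediaPart(part: MediaPart): MediaKind {
  const mimeType = part.inlineData?.mimeType ?? part.fileData?.mimeType ?? "";
  if (mimeType.startsWith("image/")) return "image";
  if (mimeType === "application/pdf") return "pdf";
  if (mimeType.startsWith("audio/")) return "audio"; // new in this PR
  if (mimeType.startsWith("video/")) return "video"; // new in this PR
  return "other"; // caller keeps using the generic fallback
}

console.log(classifyMediaPart({ inlineData: { mimeType: "audio/mpeg", data: "" } }));
console.log(classifyMediaPart({ fileData: { mimeType: "video/quicktime", fileUri: "file.mov" } }));
```

Prefix matching covers the whole audio/* and video/* families (mpeg, wav, ogg, webm, quicktime, and so on) without enumerating each subtype.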

Testing

All 28 tests pass (14 existing + 14 new):

Audio inlineData with base64 data (data-size-based estimation)
Audio fileData without base64 data (default estimate)
Various audio MIME types (mpeg, wav, ogg, flac, aac, mp4)
Small audio clips (~1 second)
Video inlineData with base64 data
Video fileData without base64 data
Various video MIME types (mp4, webm, quicktime, x-msvideo)
Mixed multimodal content (text + audio + image)
Nested Gemini 3 audio parts in functionResponse
Async calculateRequestTokenCount API and fallback paths for both audio and video

gemini-code-assist · 2026-02-27T10:29:32Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the accuracy of token estimation within the tokenCalculation utility by introducing specific, duration-based heuristics for audio and video content. Previously, these media types were inaccurately estimated based on the raw base64 string length, leading to substantial overcounting. The updated logic now aligns with documented Gemini API rates, using conservative bitrate estimations to derive more realistic token counts. This enhancement ensures more efficient and accurate management of token budgets for multimodal inputs, preventing unnecessary context truncation and optimizing interactions with generative models.

Highlights

Accurate Audio Token Estimation: Implemented duration-based token estimation for audio content, using a rate of ~32 tokens per second and a conservative bitrate for base64 data size conversion. A default estimate of 3,840 tokens (~2 minutes) is used when base64 data is unavailable.
Accurate Video Token Estimation: Introduced duration-based token estimation for video content, accounting for ~290 tokens per second (258 tokens/frame at 1fps + 32 audio tokens/sec). A default estimate of 17,400 tokens (~1 minute) is applied when base64 data is not provided.
Enhanced estimateMediaTokens Function: The estimateMediaTokens function was updated to correctly identify and apply the new audio and video token estimation logic based on MIME types, moving away from the inaccurate generic JSON.stringify fallback.
Comprehensive Test Coverage: Added 14 new test cases to tokenCalculation.test.ts to validate the new audio and video token estimation, covering scenarios with inlineData, fileData, various MIME types, small clips, mixed multimodal content, and nested parts within function responses.

Changelog

packages/core/src/utils/tokenCalculation.test.ts
- Added tests to verify calculateRequestTokenCount correctly uses the countTokens API for audio requests.
- Added tests to verify calculateRequestTokenCount falls back to duration-based estimation for audio when the API fails.
- Added tests to verify calculateRequestTokenCount correctly uses the countTokens API for video requests.
- Added tests to verify calculateRequestTokenCount falls back to duration-based estimation for video when the API fails.
- Included a new test suite for audio token estimation within estimateTokenCountSync, covering inlineData, fileData, various MIME types, and small audio clips.
- Included a new test suite for video token estimation within estimateTokenCountSync, covering inlineData, fileData, and various MIME types.
- Added tests for mixed multimodal content to ensure correct token summation for combinations of text, audio, and image parts.
- Added a test to handle Gemini 3 nested audio parts within functionResponse.
packages/core/src/utils/tokenCalculation.ts
- Defined new constants for audio token estimation, including AUDIO_TOKENS_PER_SECOND, COMPRESSED_AUDIO_BYTES_PER_SECOND, and DEFAULT_AUDIO_TOKEN_ESTIMATE.
- Defined new constants for video token estimation, including VIDEO_TOKENS_PER_SECOND, COMPRESSED_VIDEO_BYTES_PER_SECOND, and DEFAULT_VIDEO_TOKEN_ESTIMATE.
- Introduced the estimateAudioTokens function to calculate audio tokens based on base64 data size or return a default value.
- Introduced the estimateVideoTokens function to calculate video tokens based on base64 data size or return a default value.
- Updated the estimateMediaTokens function to incorporate calls to estimateAudioTokens and estimateVideoTokens for audio/* and video/* MIME types, respectively.


Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request adds token estimation for audio and video content by implementing duration-based heuristics as a fallback to the countTokens API. The changes include new estimation logic in tokenCalculation.ts and corresponding tests. My review focuses on improving the maintainability of the newly added constants to ensure they are derived from base rates, making the implementation more robust.

gemini-code-assist · 2026-02-27T10:31:53Z

packages/core/src/utils/tokenCalculation.ts

+// Used to convert raw file size to an approximate duration.
+const COMPRESSED_AUDIO_BYTES_PER_SECOND = 16_000;
+// Default audio token estimate when base64 data is unavailable (~2 min).
+const DEFAULT_AUDIO_TOKEN_ESTIMATE = 3840;


To improve maintainability, the default audio token estimate should be calculated from the per-second token rate and a duration constant, rather than being hardcoded. This makes the relationship between the constants explicit. If AUDIO_TOKENS_PER_SECOND is ever updated, this value will update automatically, preventing potential inconsistencies.

Suggested change

-const DEFAULT_AUDIO_TOKEN_ESTIMATE = 3840;
+const DEFAULT_AUDIO_TOKEN_ESTIMATE = 120 * AUDIO_TOKENS_PER_SECOND;

gemini-code-assist · 2026-02-27T10:31:53Z

packages/core/src/utils/tokenCalculation.ts

+// Conservative bitrate for compressed video duration estimation (~2 Mbps).
+const COMPRESSED_VIDEO_BYTES_PER_SECOND = 250_000;
+// Default video token estimate when base64 data is unavailable (~1 min).
+const DEFAULT_VIDEO_TOKEN_ESTIMATE = 17_400;


To improve maintainability and consistency, the default video token estimate should be calculated from the per-second token rate and a duration constant, rather than being hardcoded. This follows the pattern already used for VIDEO_TOKENS_PER_SECOND and makes the relationship between the constants explicit. If VIDEO_TOKENS_PER_SECOND is ever updated, this value will update automatically, preventing potential inconsistencies.

Suggested change

-const DEFAULT_VIDEO_TOKEN_ESTIMATE = 17_400;
+const DEFAULT_VIDEO_TOKEN_ESTIMATE = 60 * VIDEO_TOKENS_PER_SECOND;

Previously, estimateMediaTokens() only handled images (3000 tokens) and PDFs (25800 tokens). Audio and video parts fell through to a generic JSON.stringify fallback, which estimated tokens from the base64 string length -- wildly inaccurate for media content.

This adds proper estimation based on the Gemini API's documented token rates:
- Audio: ~32 tokens/second, estimated from base64 data size using a 128 kbps compressed-audio bitrate heuristic
- Video: ~290 tokens/second (258 tokens/frame at 1 fps + 32 audio tokens/sec), estimated from base64 data size using a 2 Mbps bitrate heuristic

When base64 data is unavailable (e.g. fileData URI references), fixed default estimates are used (~2 min for audio, ~1 min for video).

Adds 14 new tests covering audio/video estimation with inlineData, fileData, various MIME types, small clips, mixed multimodal content, and nested Gemini 3 functionResponse audio parts.

Address review feedback: replace hardcoded DEFAULT_AUDIO_TOKEN_ESTIMATE (3840) and DEFAULT_VIDEO_TOKEN_ESTIMATE (17400) with expressions that derive from AUDIO_TOKENS_PER_SECOND and VIDEO_TOKENS_PER_SECOND, making the relationship explicit and auto-updating if base rates change.
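One way to make the derivation fully explicit is to name the assumed durations as well. This is a sketch only: the two DURATION constants are hypothetical names (the review suggests the literals 120 and 60), and writing VIDEO_TOKENS_PER_SECOND as 258 + 32 is inferred from the rates quoted in the PR description.

```typescript
const AUDIO_TOKENS_PER_SECOND = 32;
const VIDEO_TOKENS_PER_SECOND = 258 + 32; // frame tokens at 1 fps + audio tokens

// Hypothetical named durations for the "no base64 data" case.
const DEFAULT_AUDIO_DURATION_SECONDS = 120; // ~2 min
const DEFAULT_VIDEO_DURATION_SECONDS = 60; // ~1 min

// Derived defaults: they track the base rates automatically.
const DEFAULT_AUDIO_TOKEN_ESTIMATE =
  DEFAULT_AUDIO_DURATION_SECONDS * AUDIO_TOKENS_PER_SECOND;
const DEFAULT_VIDEO_TOKEN_ESTIMATE =
  DEFAULT_VIDEO_DURATION_SECONDS * VIDEO_TOKENS_PER_SECOND;

console.log(DEFAULT_AUDIO_TOKEN_ESTIMATE, DEFAULT_VIDEO_TOKEN_ESTIMATE);
```

The derived values match the previously hardcoded 3840 and 17400, so the change is behavior-preserving at the current rates.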

…okens Replace truthy check (!base64Data) with explicit undefined check (base64Data === undefined) so empty-string base64 data correctly yields 0 tokens instead of the default estimate. Added edge case tests for both audio and video.
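The difference between the two checks comes down to how JavaScript treats the empty string. The helper names below are hypothetical and exist only to contrast the two conditions described in this commit.

```typescript
// Original condition: an empty string is falsy, so it is treated as missing
// data and would receive the default estimate.
function usesDefaultWithTruthyCheck(base64Data?: string): boolean {
  return !base64Data;
}

// Revised condition: only a genuinely absent value triggers the default;
// an empty string proceeds to the size-based path and yields 0 tokens.
function usesDefaultWithUndefinedCheck(base64Data?: string): boolean {
  return base64Data === undefined;
}

console.log(usesDefaultWithTruthyCheck("")); // true
console.log(usesDefaultWithUndefinedCheck("")); // false
console.log(usesDefaultWithUndefinedCheck(undefined)); // true
```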

gemini-cli · 2026-02-28T02:47:43Z

Hi there! Thank you for your contribution to Gemini CLI.

To improve our contribution process and better track changes, we now require all pull requests to be associated with an existing issue, as announced in our recent discussion and as detailed in our CONTRIBUTING.md.

This pull request is being closed because it is not currently linked to an issue. Once you have updated the description of this PR to link an issue (e.g., by adding Fixes #123 or Related to #123), it will be automatically reopened.

How to link an issue:
Add a keyword followed by the issue number (e.g., Fixes #123) in the description of your pull request. For more details on supported keywords and how linking works, please refer to the GitHub Documentation on linking pull requests to issues.

Thank you for your understanding and for being a part of our community!

himanshu748 requested a review from a team as a code owner February 27, 2026 10:29

gemini-code-assist bot reviewed Feb 27, 2026

gemini-cli bot added the status/need-issue Pull requests that need to have an associated issue. label Feb 27, 2026

himanshu748 added 3 commits February 27, 2026 21:27

himanshu748 force-pushed the feat/add-audio-video-token-estimation branch from 3dbb750 to 24c2246 Compare February 27, 2026 16:01

gemini-cli bot closed this Feb 28, 2026

github-actions bot mentioned this pull request Feb 28, 2026

📊 AI CLI 工具社区动态日报 2026-02-28 duanyytop/agents-radar#27

Open

himanshu748 wants to merge 3 commits into google-gemini:main from himanshu748:feat/add-audio-video-token-estimation
