feat(core): add audio and video token estimation to tokenCalculation #20542

Closed
himanshu748 wants to merge 3 commits into google-gemini:main from himanshu748:feat/add-audio-video-token-estimation

Conversation

@himanshu748 himanshu748 commented Feb 27, 2026

Fixes #20655

Summary

Adds proper audio and video token estimation to estimateMediaTokens() in tokenCalculation.ts. Previously, audio and video Part objects fell through to a generic JSON.stringify(part).length / 4 fallback, which estimated tokens from the raw base64 string length rather than the actual media content duration.

Problem

The estimateMediaTokens() function handled images (3,000 tokens) and PDFs (25,800 tokens) but had no handling for audio/* or video/* MIME types. When audio or video content was processed (e.g., via @file references or MCP tool responses), the token count was estimated from the base64-encoded data string -- wildly inaccurate because:

  • A 30-second MP3 clip (~480 KB base64) would be estimated at ~120,000 tokens via the JSON.stringify fallback (480,000 characters / 4)
  • The Gemini API actually tokenizes that same clip at ~960 tokens (32 tokens/sec × 30 s), a roughly 125x difference

This caused the token budget to be massively over-counted for audio/video content, potentially leading to unnecessary context truncation.
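
The arithmetic behind the two bullets above can be checked with a small sketch. The helper names and the rounding with `Math.ceil` are assumptions for illustration, not code from this PR; only the 4-characters-per-token divisor and the 32 tokens/sec rate come from the description.

```typescript
// Rate and divisor quoted in the PR description.
const AUDIO_TOKENS_PER_SECOND = 32;
const CHARS_PER_TOKEN = 4;

// Generic fallback: the estimate scales with the serialized string length.
// (Hypothetical helper name; rounding is an assumption.)
function fallbackEstimate(serializedLength: number): number {
  return Math.ceil(serializedLength / CHARS_PER_TOKEN);
}

// Duration-based estimate: scales with playback time only.
function durationEstimate(seconds: number): number {
  return Math.ceil(seconds * AUDIO_TOKENS_PER_SECOND);
}

const fallbackTokens = fallbackEstimate(480_000); // ~480 KB of base64 text
const durationTokens = durationEstimate(30); // 30-second clip
const overcountFactor = fallbackTokens / durationTokens;

console.log({ fallbackTokens, durationTokens, overcountFactor });
```

With the figures from the description this gives 120,000 versus 960 tokens, an over-count of 125x.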

Solution

Adds duration-based token estimation following the Gemini API's documented rates:

  • Audio: ~32 tokens/second (docs)
    • Estimates duration from base64 data size using 128 kbps compressed audio bitrate
    • Default fallback: 3,840 tokens (~2 minutes) when base64 data unavailable
  • Video: ~290 tokens/second (258 tokens/frame at 1fps + 32 audio tokens/sec) (docs)
    • Estimates duration from base64 data size using ~2 Mbps compressed video bitrate
    • Default fallback: 17,400 tokens (~1 minute) when base64 data unavailable
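
A minimal sketch of the size-to-duration heuristic described in the list above. The constant names match those listed in the changelog, but the shared helper `estimateFromBase64`, the 3/4 base64-to-bytes conversion, and the `Math.ceil` rounding are assumptions; the PR's actual function bodies are not shown in this conversation.

```typescript
const AUDIO_TOKENS_PER_SECOND = 32;
const VIDEO_TOKENS_PER_SECOND = 258 + 32; // 258 tokens/frame at 1 fps + audio
const COMPRESSED_AUDIO_BYTES_PER_SECOND = 16_000; // 128 kbps
const COMPRESSED_VIDEO_BYTES_PER_SECOND = 250_000; // ~2 Mbps
const DEFAULT_AUDIO_TOKEN_ESTIMATE = 120 * AUDIO_TOKENS_PER_SECOND; // ~2 min
const DEFAULT_VIDEO_TOKEN_ESTIMATE = 60 * VIDEO_TOKENS_PER_SECOND; // ~1 min

// Hypothetical shared helper: base64 length -> raw bytes -> seconds -> tokens.
function estimateFromBase64(
  base64Data: string | undefined,
  bytesPerSecond: number,
  tokensPerSecond: number,
  defaultEstimate: number,
): number {
  if (base64Data === undefined) {
    return defaultEstimate; // e.g. fileData URI references carry no bytes
  }
  // Assumption: 4 base64 characters encode 3 raw bytes (padding ignored).
  const rawBytes = base64Data.length * 0.75;
  const seconds = rawBytes / bytesPerSecond;
  return Math.ceil(seconds * tokensPerSecond);
}

function estimateAudioTokens(base64Data?: string): number {
  return estimateFromBase64(
    base64Data,
    COMPRESSED_AUDIO_BYTES_PER_SECOND,
    AUDIO_TOKENS_PER_SECOND,
    DEFAULT_AUDIO_TOKEN_ESTIMATE,
  );
}

function estimateVideoTokens(base64Data?: string): number {
  return estimateFromBase64(
    base64Data,
    COMPRESSED_VIDEO_BYTES_PER_SECOND,
    VIDEO_TOKENS_PER_SECOND,
    DEFAULT_VIDEO_TOKEN_ESTIMATE,
  );
}
```

Under these assumptions, 640,000 base64 characters of audio decode to about 480,000 bytes, or 30 seconds at 128 kbps, which maps to 960 tokens.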

When the countTokens API is available (the primary path), it is still used for exact counts. These heuristics only apply as the sync fallback estimation path.
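
The relationship between the primary path and the heuristic path might look roughly like the following. This is a sketch only: `countWithFallback` and its two callback parameters are hypothetical stand-ins, not the signature of calculateRequestTokenCount.

```typescript
// Hypothetical wrapper: prefer the exact API count, fall back to a
// synchronous heuristic when the API call fails or returns nothing.
async function countWithFallback(
  countViaApi: () => Promise<number | undefined>,
  estimateSync: () => number,
): Promise<number> {
  try {
    const exact = await countViaApi();
    if (exact !== undefined) {
      return exact;
    }
  } catch {
    // API unavailable or errored: fall through to the heuristic.
  }
  return estimateSync();
}
```

The heuristic is only evaluated when the API path does not produce a number, so its inaccuracy never affects requests that the API can count exactly.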

Changes

  • packages/core/src/utils/tokenCalculation.ts: Added estimateAudioTokens(), estimateVideoTokens(), and extended estimateMediaTokens() to handle audio/* and video/* MIME types
  • packages/core/src/utils/tokenCalculation.test.ts: Added 14 new tests covering audio/video estimation

Testing

All 28 tests pass (14 existing + 14 new):

  • Audio inlineData with base64 data (data-size-based estimation)
  • Audio fileData without base64 data (default estimate)
  • Various audio MIME types (mpeg, wav, ogg, flac, aac, mp4)
  • Small audio clips (~1 second)
  • Video inlineData with base64 data
  • Video fileData without base64 data
  • Various video MIME types (mp4, webm, quicktime, x-msvideo)
  • Mixed multimodal content (text + audio + image)
  • Nested Gemini 3 audio parts in functionResponse
  • Async calculateRequestTokenCount API and fallback paths for both audio and video

@himanshu748 himanshu748 requested a review from a team as a code owner February 27, 2026 10:29
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the accuracy of token estimation within the tokenCalculation utility by introducing specific, duration-based heuristics for audio and video content. Previously, these media types were inaccurately estimated based on the raw base64 string length, leading to substantial overcounting. The updated logic now aligns with documented Gemini API rates, using conservative bitrate estimations to derive more realistic token counts. This enhancement ensures more efficient and accurate management of token budgets for multimodal inputs, preventing unnecessary context truncation and optimizing interactions with generative models.

Highlights

  • Accurate Audio Token Estimation: Implemented duration-based token estimation for audio content, using a rate of ~32 tokens per second and a conservative bitrate for base64 data size conversion. A default estimate of 3,840 tokens (~2 minutes) is used when base64 data is unavailable.
  • Accurate Video Token Estimation: Introduced duration-based token estimation for video content, accounting for ~290 tokens per second (258 tokens/frame at 1fps + 32 audio tokens/sec). A default estimate of 17,400 tokens (~1 minute) is applied when base64 data is not provided.
  • Enhanced estimateMediaTokens Function: The estimateMediaTokens function was updated to correctly identify and apply the new audio and video token estimation logic based on MIME types, moving away from the inaccurate generic JSON.stringify fallback.
  • Comprehensive Test Coverage: Added 14 new test cases to tokenCalculation.test.ts to validate the new audio and video token estimation, covering scenarios with inlineData, fileData, various MIME types, small clips, mixed multimodal content, and nested parts within function responses.

Changelog

  • packages/core/src/utils/tokenCalculation.test.ts
    • Added tests to verify calculateRequestTokenCount correctly uses the countTokens API for audio requests.
    • Added tests to verify calculateRequestTokenCount falls back to duration-based estimation for audio when the API fails.
    • Added tests to verify calculateRequestTokenCount correctly uses the countTokens API for video requests.
    • Added tests to verify calculateRequestTokenCount falls back to duration-based estimation for video when the API fails.
    • Included a new test suite for audio token estimation within estimateTokenCountSync, covering inlineData, fileData, various MIME types, and small audio clips.
    • Included a new test suite for video token estimation within estimateTokenCountSync, covering inlineData, fileData, and various MIME types.
    • Added tests for mixed multimodal content to ensure correct token summation for combinations of text, audio, and image parts.
    • Added a test to handle Gemini 3 nested audio parts within functionResponse.
  • packages/core/src/utils/tokenCalculation.ts
    • Defined new constants for audio token estimation, including AUDIO_TOKENS_PER_SECOND, COMPRESSED_AUDIO_BYTES_PER_SECOND, and DEFAULT_AUDIO_TOKEN_ESTIMATE.
    • Defined new constants for video token estimation, including VIDEO_TOKENS_PER_SECOND, COMPRESSED_VIDEO_BYTES_PER_SECOND, and DEFAULT_VIDEO_TOKEN_ESTIMATE.
    • Introduced the estimateAudioTokens function to calculate audio tokens based on base64 data size or return a default value.
    • Introduced the estimateVideoTokens function to calculate video tokens based on base64 data size or return a default value.
    • Updated the estimateMediaTokens function to incorporate calls to estimateAudioTokens and estimateVideoTokens for audio/* and video/* MIME types, respectively.
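
The MIME-type dispatch described in the last changelog entry could be sketched as below. `classifyMimeType` and `MediaKind` are hypothetical names used for illustration; the real estimateMediaTokens() may structure its branches differently.

```typescript
type MediaKind = 'image' | 'pdf' | 'audio' | 'video' | 'other';

// Hypothetical classifier: decides which estimator a part should use.
function classifyMimeType(mimeType: string | undefined): MediaKind {
  if (!mimeType) return 'other';
  const normalized = mimeType.toLowerCase();
  if (normalized.startsWith('image/')) return 'image'; // fixed 3,000 tokens
  if (normalized === 'application/pdf') return 'pdf'; // fixed 25,800 tokens
  if (normalized.startsWith('audio/')) return 'audio'; // duration-based
  if (normalized.startsWith('video/')) return 'video'; // duration-based
  return 'other'; // left to the generic JSON.stringify fallback
}
```

Matching on the `audio/` and `video/` prefixes rather than on individual subtypes is what lets one branch cover mpeg, wav, ogg, flac, webm, quicktime, and the other types exercised in the tests.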

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds token estimation for audio and video content by implementing duration-based heuristics as a fallback to the countTokens API. The changes include new estimation logic in tokenCalculation.ts and corresponding tests. My review focuses on improving the maintainability of the newly added constants to ensure they are derived from base rates, making the implementation more robust.

// Used to convert raw file size to an approximate duration.
const COMPRESSED_AUDIO_BYTES_PER_SECOND = 16_000;
// Default audio token estimate when base64 data is unavailable (~2 min).
const DEFAULT_AUDIO_TOKEN_ESTIMATE = 3840;

high

To improve maintainability, the default audio token estimate should be calculated from the per-second token rate and a duration constant, rather than being hardcoded. This makes the relationship between the constants explicit. If AUDIO_TOKENS_PER_SECOND is ever updated, this value will update automatically, preventing potential inconsistencies.

Suggested change
- const DEFAULT_AUDIO_TOKEN_ESTIMATE = 3840;
+ const DEFAULT_AUDIO_TOKEN_ESTIMATE = 120 * AUDIO_TOKENS_PER_SECOND;

// Conservative bitrate for compressed video duration estimation (~2 Mbps).
const COMPRESSED_VIDEO_BYTES_PER_SECOND = 250_000;
// Default video token estimate when base64 data is unavailable (~1 min).
const DEFAULT_VIDEO_TOKEN_ESTIMATE = 17_400;

high

To improve maintainability and consistency, the default video token estimate should be calculated from the per-second token rate and a duration constant, rather than being hardcoded. This follows the pattern already used for VIDEO_TOKENS_PER_SECOND and makes the relationship between the constants explicit. If VIDEO_TOKENS_PER_SECOND is ever updated, this value will update automatically, preventing potential inconsistencies.

Suggested change
- const DEFAULT_VIDEO_TOKEN_ESTIMATE = 17_400;
+ const DEFAULT_VIDEO_TOKEN_ESTIMATE = 60 * VIDEO_TOKENS_PER_SECOND;
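
Taking the two suggestions one step further, the default durations can also be named. In this sketch `DEFAULT_AUDIO_DURATION_SECONDS` and `DEFAULT_VIDEO_DURATION_SECONDS` are hypothetical constants that do not appear in the PR; the other names come from the changelog.

```typescript
const AUDIO_TOKENS_PER_SECOND = 32;
const VIDEO_TOKENS_PER_SECOND = 258 + 32; // frame tokens at 1 fps + audio

// Hypothetical named durations, so no magic numbers remain.
const DEFAULT_AUDIO_DURATION_SECONDS = 120; // ~2 min
const DEFAULT_VIDEO_DURATION_SECONDS = 60; // ~1 min

// Defaults derived from the base rates; they track any future rate change.
const DEFAULT_AUDIO_TOKEN_ESTIMATE =
  DEFAULT_AUDIO_DURATION_SECONDS * AUDIO_TOKENS_PER_SECOND;
const DEFAULT_VIDEO_TOKEN_ESTIMATE =
  DEFAULT_VIDEO_DURATION_SECONDS * VIDEO_TOKENS_PER_SECOND;

console.log(DEFAULT_AUDIO_TOKEN_ESTIMATE, DEFAULT_VIDEO_TOKEN_ESTIMATE);
```

The derived values match the previously hardcoded 3,840 and 17,400.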

@gemini-cli gemini-cli bot added the status/need-issue label Feb 27, 2026
Previously, estimateMediaTokens() only handled images (3000 tokens) and
PDFs (25800 tokens). Audio and video parts fell through to a generic
JSON.stringify fallback, which estimated tokens from the base64 string
length -- wildly inaccurate for media content.

This adds proper estimation based on the Gemini API's documented token
rates:
- Audio: ~32 tokens/second, estimated from base64 data size using a
  128 kbps compressed-audio bitrate heuristic
- Video: ~290 tokens/second (258 tokens/frame at 1 fps + 32 audio
  tokens/sec), estimated from base64 data size using a 2 Mbps bitrate
  heuristic

When base64 data is unavailable (e.g. fileData URI references), fixed
default estimates are used (~2 min for audio, ~1 min for video).

Adds 14 new tests covering audio/video estimation with inlineData,
fileData, various MIME types, small clips, mixed multimodal content,
and nested Gemini 3 functionResponse audio parts.

Address review feedback: replace hardcoded DEFAULT_AUDIO_TOKEN_ESTIMATE
(3840) and DEFAULT_VIDEO_TOKEN_ESTIMATE (17400) with expressions that
derive from AUDIO_TOKENS_PER_SECOND and VIDEO_TOKENS_PER_SECOND,
making the relationship explicit and auto-updating if base rates change.
…okens

Replace truthy check (!base64Data) with explicit undefined check
(base64Data === undefined) so empty-string base64 data correctly
yields 0 tokens instead of the default estimate. Added edge case
tests for both audio and video.
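
The difference between the two checks in the commit message above is easiest to see side by side. Both functions are illustrative sketches (hypothetical names, assumed 3/4 base64-to-bytes conversion), not the code from the commit.

```typescript
const AUDIO_TOKENS_PER_SECOND = 32;
const COMPRESSED_AUDIO_BYTES_PER_SECOND = 16_000;
const DEFAULT_AUDIO_TOKEN_ESTIMATE = 120 * AUDIO_TOKENS_PER_SECOND;

// Assumed conversion: base64 length -> raw bytes -> seconds -> tokens.
function tokensFromBase64(base64Data: string): number {
  const seconds =
    (base64Data.length * 0.75) / COMPRESSED_AUDIO_BYTES_PER_SECOND;
  return Math.ceil(seconds * AUDIO_TOKENS_PER_SECOND);
}

// Before: '' is falsy, so empty data is treated like missing data.
function estimateWithTruthyCheck(base64Data?: string): number {
  if (!base64Data) return DEFAULT_AUDIO_TOKEN_ESTIMATE;
  return tokensFromBase64(base64Data);
}

// After: only a truly absent value takes the default.
function estimateWithUndefinedCheck(base64Data?: string): number {
  if (base64Data === undefined) return DEFAULT_AUDIO_TOKEN_ESTIMATE;
  return tokensFromBase64(base64Data);
}

console.log(estimateWithTruthyCheck('')); // default estimate
console.log(estimateWithUndefinedCheck('')); // zero tokens
```

With the truthy check an empty payload is billed as roughly two minutes of audio; with the explicit check it contributes nothing, while an absent payload still gets the default in both versions.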
@himanshu748 himanshu748 force-pushed the feat/add-audio-video-token-estimation branch from 3dbb750 to 24c2246 Compare February 27, 2026 16:01
@gemini-cli
Contributor

gemini-cli bot commented Feb 28, 2026

Hi there! Thank you for your contribution to Gemini CLI.

To improve our contribution process and better track changes, we now require all pull requests to be associated with an existing issue, as announced in our recent discussion and as detailed in our CONTRIBUTING.md.

This pull request is being closed because it is not currently linked to an issue. Once you have updated the description of this PR to link an issue (e.g., by adding Fixes #123 or Related to #123), it will be automatically reopened.

How to link an issue:
Add a keyword followed by the issue number (e.g., Fixes #123) in the description of your pull request. For more details on supported keywords and how linking works, please refer to the GitHub Documentation on linking pull requests to issues.

Thank you for your understanding and for being a part of our community!

Labels

status/need-issue Pull requests that need to have an associated issue.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Token estimation falls through to generic fallback for audio/video parts

1 participant