feat(core): add audio and video token estimation to tokenCalculation #20542
himanshu748 wants to merge 3 commits into google-gemini:main
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the accuracy of token estimation.
Code Review
This pull request adds token estimation for audio and video content by implementing duration-based heuristics as a fallback to the countTokens API. The changes include new estimation logic in tokenCalculation.ts and corresponding tests. My review focuses on improving the maintainability of the newly added constants to ensure they are derived from base rates, making the implementation more robust.
// Used to convert raw file size to an approximate duration.
const COMPRESSED_AUDIO_BYTES_PER_SECOND = 16_000;
// Default audio token estimate when base64 data is unavailable (~2 min).
const DEFAULT_AUDIO_TOKEN_ESTIMATE = 3840;
To improve maintainability, the default audio token estimate should be calculated from the per-second token rate and a duration constant, rather than being hardcoded. This makes the relationship between the constants explicit. If AUDIO_TOKENS_PER_SECOND is ever updated, this value will update automatically, preventing potential inconsistencies.
Suggested change
-const DEFAULT_AUDIO_TOKEN_ESTIMATE = 3840;
+const DEFAULT_AUDIO_TOKEN_ESTIMATE = 120 * AUDIO_TOKENS_PER_SECOND;
// Conservative bitrate for compressed video duration estimation (~2 Mbps).
const COMPRESSED_VIDEO_BYTES_PER_SECOND = 250_000;
// Default video token estimate when base64 data is unavailable (~1 min).
const DEFAULT_VIDEO_TOKEN_ESTIMATE = 17_400;
To improve maintainability and consistency, the default video token estimate should be calculated from the per-second token rate and a duration constant, rather than being hardcoded. This follows the pattern already used for VIDEO_TOKENS_PER_SECOND and makes the relationship between the constants explicit. If VIDEO_TOKENS_PER_SECOND is ever updated, this value will update automatically, preventing potential inconsistencies.
Suggested change
-const DEFAULT_VIDEO_TOKEN_ESTIMATE = 17_400;
+const DEFAULT_VIDEO_TOKEN_ESTIMATE = 60 * VIDEO_TOKENS_PER_SECOND;
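The two suggestions above could also name the fallback durations, so that every number has a single source. This is only a sketch: DEFAULT_AUDIO_DURATION_SECONDS and DEFAULT_VIDEO_DURATION_SECONDS are hypothetical names that do not appear in the PR, and the expression for VIDEO_TOKENS_PER_SECOND is assumed from the rates quoted in this thread (258 tokens per frame at 1 fps plus 32 audio tokens per second).

```typescript
// Per-second rates as quoted in this PR.
const AUDIO_TOKENS_PER_SECOND = 32;
// Assumed derivation: 258 tokens per frame at 1 fps, plus the audio track.
const VIDEO_TOKENS_PER_SECOND = 258 + AUDIO_TOKENS_PER_SECOND;

// Hypothetical constants naming the fallback durations (~2 min audio, ~1 min video).
const DEFAULT_AUDIO_DURATION_SECONDS = 120;
const DEFAULT_VIDEO_DURATION_SECONDS = 60;

// Defaults derived from the rates, so they track any future rate change.
const DEFAULT_AUDIO_TOKEN_ESTIMATE =
  DEFAULT_AUDIO_DURATION_SECONDS * AUDIO_TOKENS_PER_SECOND;
const DEFAULT_VIDEO_TOKEN_ESTIMATE =
  DEFAULT_VIDEO_DURATION_SECONDS * VIDEO_TOKENS_PER_SECOND;
```

With the rates above, the derived values match the previously hardcoded 3840 and 17_400.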
Previously, estimateMediaTokens() only handled images (3000 tokens) and PDFs (25800 tokens). Audio and video parts fell through to a generic JSON.stringify fallback, which estimated tokens from the base64 string length, a wildly inaccurate measure for media content.

This adds proper estimation based on the Gemini API's documented token rates:

- Audio: ~32 tokens/second, estimated from base64 data size using a 128 kbps compressed-audio bitrate heuristic
- Video: ~290 tokens/second (258 tokens/frame at 1 fps + 32 audio tokens/sec), estimated from base64 data size using a 2 Mbps bitrate heuristic

When base64 data is unavailable (e.g. fileData URI references), fixed default estimates are used (~2 min for audio, ~1 min for video).

Adds 14 new tests covering audio/video estimation with inlineData, fileData, various MIME types, small clips, mixed multimodal content, and nested Gemini 3 functionResponse audio parts.
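The size-to-duration-to-tokens chain described in this commit message can be sketched as follows. This is not the PR's actual code: the function signatures, the base64ToByteLength helper, and the Math.ceil rounding are assumptions made for illustration, and the fixed defaults for missing data are left out.

```typescript
// Rates and bitrates as quoted in the commit message.
const AUDIO_TOKENS_PER_SECOND = 32;
const VIDEO_TOKENS_PER_SECOND = 258 + AUDIO_TOKENS_PER_SECOND; // ~290
const COMPRESSED_AUDIO_BYTES_PER_SECOND = 16_000; // 128 kbps / 8
const COMPRESSED_VIDEO_BYTES_PER_SECOND = 250_000; // 2 Mbps / 8

// Hypothetical helper: every 4 base64 characters carry about 3 raw bytes
// (padding is ignored, which is accurate enough for an estimate).
function base64ToByteLength(base64Data: string): number {
  return Math.floor((base64Data.length * 3) / 4);
}

// Assumed shape: bytes -> seconds via the bitrate, seconds -> tokens via the rate.
function estimateAudioTokens(base64Data: string): number {
  const seconds =
    base64ToByteLength(base64Data) / COMPRESSED_AUDIO_BYTES_PER_SECOND;
  return Math.ceil(seconds * AUDIO_TOKENS_PER_SECOND);
}

function estimateVideoTokens(base64Data: string): number {
  const seconds =
    base64ToByteLength(base64Data) / COMPRESSED_VIDEO_BYTES_PER_SECOND;
  return Math.ceil(seconds * VIDEO_TOKENS_PER_SECOND);
}
```

Under these assumptions, a 64,000-character base64 audio payload decodes to about 48,000 bytes, or 3 seconds at 128 kbps, which gives 96 tokens.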
Address review feedback: replace hardcoded DEFAULT_AUDIO_TOKEN_ESTIMATE (3840) and DEFAULT_VIDEO_TOKEN_ESTIMATE (17400) with expressions that derive from AUDIO_TOKENS_PER_SECOND and VIDEO_TOKENS_PER_SECOND, making the relationship explicit and auto-updating if base rates change.
…okens Replace truthy check (!base64Data) with explicit undefined check (base64Data === undefined) so empty-string base64 data correctly yields 0 tokens instead of the default estimate. Added edge case tests for both audio and video.
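The difference between the two checks can be illustrated with a small sketch. The function names and the sizeBasedEstimate stand-in are hypothetical; only the two conditions, !base64Data and base64Data === undefined, come from the commit message.

```typescript
const DEFAULT_AUDIO_TOKEN_ESTIMATE = 3840;

// Hypothetical stand-in for the size-based audio estimate
// (base64 length -> bytes -> seconds at 16,000 bytes/s -> 32 tokens/s).
function sizeBasedEstimate(base64Data: string): number {
  const bytes = Math.floor((base64Data.length * 3) / 4);
  return Math.ceil((bytes / 16_000) * 32);
}

// Before: an empty string is falsy, so it is treated as "no data available".
function estimateWithTruthyCheck(base64Data?: string): number {
  if (!base64Data) return DEFAULT_AUDIO_TOKEN_ESTIMATE;
  return sizeBasedEstimate(base64Data);
}

// After: only a genuinely missing value takes the default.
function estimateWithUndefinedCheck(base64Data?: string): number {
  if (base64Data === undefined) return DEFAULT_AUDIO_TOKEN_ESTIMATE;
  return sizeBasedEstimate(base64Data);
}
```

With an empty string, the first variant returns the 3840-token default while the second returns 0; both return the default when the argument is undefined.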
3dbb750 to 24c2246
Hi there! Thank you for your contribution to Gemini CLI. To improve our contribution process and better track changes, we now require all pull requests to be associated with an existing issue, as announced in our recent discussion and as detailed in our CONTRIBUTING.md.

This pull request is being closed because it is not currently linked to an issue. Once you have updated the description of this PR to link an issue (e.g., by adding

How to link an issue:

Thank you for your understanding and for being a part of our community!
Fixes #20655
Summary
Adds proper audio and video token estimation to estimateMediaTokens() in tokenCalculation.ts. Previously, audio and video Part objects fell through to a generic JSON.stringify(part).length / 4 fallback, which estimated tokens from the raw base64 string length rather than the actual media content duration.

Problem

The estimateMediaTokens() function handled images (3,000 tokens) and PDFs (25,800 tokens) but had no handling for audio/* or video/* MIME types. When audio or video content was processed (e.g., via @file references or MCP tool responses), the token count was estimated from the base64-encoded data string, which is wildly inaccurate for media content.

This caused the token budget to be massively over-counted for audio/video content, potentially leading to unnecessary context truncation.
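To give a sense of the scale of the over-count, here is a worked example under stated assumptions: a hypothetical 60-second audio clip encoded at 128 kbps, sent as an inlineData part. The clip, its MIME type, and the placeholder payload are illustrative only; the length / 4 formula and the 32 tokens/second rate are the ones quoted in this PR.

```typescript
// Hypothetical clip: 60 s of audio at 128 kbps (16,000 bytes per second).
const clipSeconds = 60;
const rawBytes = clipSeconds * 16_000; // 960,000 bytes

// Base64 emits 4 characters for every 3 raw bytes.
const base64Length = Math.ceil(rawBytes / 3) * 4; // 1,280,000 characters

// Placeholder payload of the right length; real data would be actual base64.
const part = {
  inlineData: { mimeType: "audio/mp3", data: "A".repeat(base64Length) },
};

// Old generic fallback: characters of the serialized part divided by 4.
const fallbackEstimate = JSON.stringify(part).length / 4;

// Duration-based estimate at 32 tokens per second.
const durationEstimate = clipSeconds * 32; // 1,920 tokens

const overcountFactor = fallbackEstimate / durationEstimate;
```

For this clip the generic fallback lands at roughly 320,000 tokens against 1,920 from the duration-based rate, an over-count of more than a hundredfold.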
Solution
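Adds duration-based token estimation following the Gemini API's documented rates:

- Audio: ~32 tokens/second, with duration estimated from base64 data size using a 128 kbps bitrate heuristic
- Video: ~290 tokens/second (258 tokens/frame at 1 fps + 32 audio tokens/sec), with duration estimated from base64 data size using a 2 Mbps bitrate heuristic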
When the countTokens API is available (the primary path), it is still used for exact counts. These heuristics only apply as the sync fallback estimation path.

Changes
- packages/core/src/utils/tokenCalculation.ts: Added estimateAudioTokens(), estimateVideoTokens(), and extended estimateMediaTokens() to handle audio/* and video/* MIME types
- packages/core/src/utils/tokenCalculation.test.ts: Added 14 new tests covering audio/video estimation

Testing
All 28 tests pass (14 existing + 14 new):

- calculateRequestTokenCount API and fallback paths for both audio and video
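The split between the primary API path and the sync fallback described under Solution might look roughly like the sketch below. Everything here is hypothetical: CountTokensFn, countWithFallback, and the rule that any API failure triggers the local estimate are assumptions for illustration, not the actual calculateRequestTokenCount implementation or the real client signature.

```typescript
// Hypothetical shape of an API counting call; the real client may differ.
type CountTokensFn = (contents: string) => Promise<{ totalTokens?: number }>;

// Try the exact count first; fall back to a synchronous local estimate.
async function countWithFallback(
  contents: string,
  countTokens: CountTokensFn,
  estimateSync: (contents: string) => number,
): Promise<number> {
  try {
    const { totalTokens } = await countTokens(contents);
    if (totalTokens !== undefined) {
      return totalTokens; // primary path: exact count from the API
    }
  } catch {
    // Assumption: any API failure falls through to the heuristic below.
  }
  return estimateSync(contents); // fallback path: local heuristic estimate
}
```

Passing the counting call and the estimator in as parameters keeps both paths easy to exercise in tests without a live API.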