Conversation
spacewander left a comment
According to the OpenAI doc: https://platform.openai.com/docs/api-reference/chat/object#chat-object-usage-total_tokens
- total = prompt + completion
- reasoning_tokens are counted as part of completion_tokens
This is correct for OpenAI/Anthropic. The Gemini API did count the thoughts in the total, but …
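To make the accounting concrete, here is a small illustrative sketch of that convention (the numbers are made up; only the relationship between the fields follows the OpenAI doc cited above):

```lua
-- Hypothetical usage block for a chat completion that used reasoning tokens.
-- Per the OpenAI doc above: total_tokens = prompt_tokens + completion_tokens,
-- and reasoning tokens are already folded into completion_tokens.
local usage = {
  prompt_tokens = 100,
  completion_tokens = 250,  -- includes the 150 reasoning tokens below
  total_tokens = 350,       -- 100 + 250
  completion_tokens_details = {
    reasoning_tokens = 150,
  },
}

assert(usage.total_tokens == usage.prompt_tokens + usage.completion_tokens)
```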
```lua
-- excerpt from the change under review
end

if response_object.usage.total_tokens then
  request_analytics_plugin[log_entry_keys.USAGE_CONTAINER][log_entry_keys.TOTAL_TOKENS] = response_object.usage.total_tokens
  ai_plugin_o11y.metrics_set("llm_total_tokens_count", response_object.usage.total_tokens)
```
Let's update normalize-sse-chunk.lua and parse-json-response.lua too.
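If it helps, a rough sketch of what that could look like, assuming those filters set the metric the same way the driver hunk above does (the helper name is made up, and the surrounding code in normalize-sse-chunk.lua / parse-json-response.lua will differ):

```lua
-- Sketch only: mirrors the guard from the hunk above.
-- `ai_plugin_o11y` is the same observability module used in that hunk.
local function set_total_tokens_metric(usage)
  if usage.total_tokens then
    -- prefer the provider-reported total when it is present
    ai_plugin_o11y.metrics_set("llm_total_tokens_count", usage.total_tokens)
  elseif usage.prompt_tokens and usage.completion_tokens then
    -- otherwise keep the old derived value
    ai_plugin_o11y.metrics_set("llm_total_tokens_count",
      usage.prompt_tokens + usage.completion_tokens)
  end
end
```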
```diff
-function _M.metrics_get(key)
+function _M.metrics_get(key, skip_calculation)
```
We can keep it simple: no need to add `skip_calculation`; just skip the calculation if the key already exists.
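A minimal sketch of that simpler approach (the `metrics` table and the prompt/completion key names below are placeholders, not the real module internals):

```lua
-- Sketch only: "skip the calculation if the key already exists".
local _M = {}
local metrics = {}   -- stand-in for wherever the module stores counters

function _M.metrics_set(key, value)
  metrics[key] = value
end

function _M.metrics_get(key)
  -- an explicitly set value (e.g. from usage.total_tokens) wins
  if metrics[key] ~= nil then
    return metrics[key]
  end

  -- otherwise fall back to deriving the total, as before
  if key == "llm_total_tokens_count" then
    return (metrics["llm_prompt_tokens_count"] or 0)
         + (metrics["llm_completion_tokens_count"] or 0)
  end

  return 0
end

return _M
```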
@git-hulk
@spacewander Thanks for your prompt and kind reply. For this fix, I'm wondering if it would be better to 'correct' the …
@fffonion Pro: this behavior follows the OpenAI one: the completion token count is the number of response tokens.
I think this fix is not quite correct. We should correct the candidates' token count by adding the thinking token count to it. Another way is to add a thoughts token count field for …

And counting the thoughts/tool_use token count as part of the candidates (completions) token count won't cause a billing usage issue, since reasoning tokens share the same price as candidates tokens; see [1]. Instead, Kong no longer counts the thoughts/tool_use token part, which might confuse users because the billed usage would be higher than the tokens recorded on the Kong side.

[1] https://cloud.google.com/vertex-ai/generative-ai/pricing
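To illustrate the Gemini side of this (field names follow the Gemini/Vertex AI usageMetadata shape as I understand it; the numbers are made up), the correction being suggested would look roughly like:

```lua
-- Hypothetical Gemini usageMetadata for a response that used "thinking".
-- Here candidatesTokenCount does NOT include the thoughts, but
-- totalTokenCount does, so prompt + candidates < total.
local usage_metadata = {
  promptTokenCount     = 100,
  candidatesTokenCount = 200,
  thoughtsTokenCount   = 150,
  totalTokenCount      = 450,  -- 100 + 200 + 150
}

-- The suggested normalization: fold the thoughts into the completion count so
-- that prompt + completion == total again, matching the OpenAI convention.
local prompt_tokens     = usage_metadata.promptTokenCount
local completion_tokens = usage_metadata.candidatesTokenCount
                        + (usage_metadata.thoughtsTokenCount or 0)

assert(prompt_tokens + completion_tokens == usage_metadata.totalTokenCount)
```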
Let's ignore Kong's behaviour for now. If I understand correctly, currently if a user takes the total token count from Gemini and multiplies it by the token price, the number they get is not the same as what Google actually charges. And we are trying to fix this behaviour, right?
@fffonion, the price of prompt and completion tokens differs, so we cannot simply multiply the total count by a single price. From my side, the main issue is that the …

@aprameyak I'm not sure if you're suffering from the same issue.
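On the pricing point above, a toy example (the rates are made up, not actual Vertex AI prices) of why a single multiplier cannot reproduce the bill:

```lua
-- Illustrative only: made-up per-token prices to show why a single
-- "price * total_tokens" calculation cannot match the real charge.
local prompt_tokens, completion_tokens = 1000, 500
local total_tokens = prompt_tokens + completion_tokens

local price_per_prompt_token     = 0.000001   -- hypothetical input rate
local price_per_completion_token = 0.000004   -- hypothetical output rate

local actual_cost = prompt_tokens * price_per_prompt_token
                  + completion_tokens * price_per_completion_token   -- 0.003
local naive_cost  = total_tokens * price_per_prompt_token            -- 0.0015

-- the two only agree when the input and output rates happen to be equal
print(actual_cost, naive_cost)
```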
This PR is intentionally scoped to the original issue (#14816), which is that llm_total_tokens_count ignored the explicit usage.total_tokens value and instead recomputed it as prompt + completion. Normalizing completion token semantics across providers (e.g. Gemini thoughts/tool-use vs candidates) is a related but separate concern and wasn’t part of the issue being addressed here. I think that’s worth discussing separately if we want to change how completion tokens are defined. |
@fffonion, @spacewander, @git-hulk I wanted to ask if there is anything else I should do for this PR. I'm not clear on whether any further changes are needed at the moment.
Summary
Fix the AI Proxy plugin so that the `llm_total_tokens_count` Prometheus metric respects explicit `total_tokens` values returned by LLM providers.

Previously, the metric was incorrectly calculated as `prompt_tokens + completion_tokens`, which underreported token usage for models that use reasoning tokens.

This fix:
- Sets `llm_total_tokens_count` in the driver when `response_object.usage.total_tokens` exists.
- Keeps the `prompt + completion` fallback for backward compatibility (see the sketch below).
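As a quick illustration of the behaviour change (the numbers are made up; the shape mirrors the Gemini case discussed above, where the reported total exceeds the sum):

```lua
-- Made-up usage block where the provider-reported total exceeds
-- prompt + completion (e.g. thoughts tokens counted only in the total).
local usage = {
  prompt_tokens = 100,
  completion_tokens = 200,
  total_tokens = 450,
}

-- before this fix: the metric was always recomputed
local old_metric = usage.prompt_tokens + usage.completion_tokens   -- 300 (underreported)

-- after this fix: the explicit value wins, with the old sum as a fallback
local new_metric = usage.total_tokens
    or (usage.prompt_tokens + usage.completion_tokens)             -- 450

print(old_metric, new_metric)
```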
Checklist
- `skip-changelog` added

Issue reference
Fix #14816
Verification / QA
- `total_tokens` now correctly reported in the Prometheus metric
- Fallback to `prompt + completion` still used when `total_tokens` is missing
- `_M.metrics_get` returns the explicit `total_tokens`