
bugfix #14817

Open

aprameyak wants to merge 2 commits into Kong:master from aprameyak:master

Conversation


@aprameyak commented Jan 16, 2026

Summary

Fix the AI Proxy plugin so that the llm_total_tokens_count Prometheus metric respects explicit total_tokens values returned by LLM providers.
Previously, the metric was always recomputed as prompt_tokens + completion_tokens, which underreported token usage for models that use reasoning tokens.

This fix:

  1. Emits llm_total_tokens_count in the driver when response_object.usage.total_tokens exists.
  2. Updates observability logic to prefer the explicit total, falling back to prompt + completion for backward compatibility (sketched below).
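
A minimal sketch of the two changes in Lua (the llm_prompt_tokens_count / llm_completion_tokens_count key names and the surrounding structure are illustrative assumptions, not the exact plugin code):

-- Driver side: emit the provider-supplied total when it exists.
local usage = response_object.usage
if usage and usage.total_tokens then
  ai_plugin_o11y.metrics_set("llm_total_tokens_count", usage.total_tokens)
end

-- Observability side: prefer the explicit total, fall back to prompt + completion.
local total = ai_plugin_o11y.metrics_get("llm_total_tokens_count")
if not total or total == 0 then
  total = (ai_plugin_o11y.metrics_get("llm_prompt_tokens_count") or 0)
        + (ai_plugin_o11y.metrics_get("llm_completion_tokens_count") or 0)
end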

Checklist

  • The Pull Request has tests (or tests are prepared to be added in a follow-up PR if required)
  • Changelog updated or skip-changelog added
  • User-facing docs PR linked (if needed)

Issue reference

Fix #14816


Verification / QA

  • Explicit total_tokens now correctly reported in Prometheus metric
  • Fallback calculation works when total_tokens is missing
  • No recursion occurs in _M.metrics_get
  • Existing tests pass, no linting errors
  • Backward compatible for providers without total_tokens


CLAassistant commented Jan 16, 2026

CLA assistant check
All committers have signed the CLA.


Contributor

@spacewander left a comment


According to the OpenAI doc: https://platform.openai.com/docs/api-reference/chat/object#chat-object-usage-total_tokens

total = prompt + completion

reasoning_tokens is counted as completion_tokens

@git-hulk
Contributor

According to the OpenAI doc: https://platform.openai.com/docs/api-reference/chat/object#chat-object-usage-total_tokens

total = prompt + completion

reasoning_tokens is counted as completion_tokens

This is correct for OpenAI/Anthropic. The Gemini API does count the thoughts in the total, but thoughts and tool-use prompt tokens are NOT included in the candidates (completion) token count.
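
For reference, the two usage shapes being compared look roughly like this (field names come from the public OpenAI and Gemini API docs; the numbers are invented):

-- OpenAI chat completions usage: total is prompt + completion, and reasoning
-- tokens are already included in completion_tokens.
local openai_usage = {
  prompt_tokens = 100,
  completion_tokens = 350,                                 -- includes the 300 reasoning tokens
  completion_tokens_details = { reasoning_tokens = 300 },
  total_tokens = 450,                                      -- 100 + 350
}

-- Gemini generateContent usageMetadata: thoughts are NOT part of
-- candidatesTokenCount, but they ARE part of totalTokenCount.
local gemini_usage_metadata = {
  promptTokenCount = 100,
  candidatesTokenCount = 50,
  thoughtsTokenCount = 300,
  totalTokenCount = 450,                                   -- 100 + 50 + 300
}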

end
if response_object.usage.total_tokens then
  request_analytics_plugin[log_entry_keys.USAGE_CONTAINER][log_entry_keys.TOTAL_TOKENS] = response_object.usage.total_tokens
  ai_plugin_o11y.metrics_set("llm_total_tokens_count", response_object.usage.total_tokens)
Contributor


Let's update normalize-sse-chunk.lua and parse-json-response.lua too.
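
A rough sketch of what mirroring that change in those filters might look like (the surrounding variable names are assumptions; the actual filter code may expose the usage object differently):

-- In each response filter, prefer the provider's explicit total when present.
local usage = response_object and response_object.usage
if usage and usage.total_tokens then
  ai_plugin_o11y.metrics_set("llm_total_tokens_count", usage.total_tokens)
end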



- function _M.metrics_get(key)
+ function _M.metrics_get(key, skip_calculation)
Contributor


We can keep it simple: no need to add skip_calculation, just skip the calculation if the key already exists.
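
A minimal sketch of that simpler approach, assuming the metrics live in a per-request table behind a hypothetical get_request_metrics() accessor (not the module's actual internals):

function _M.metrics_get(key)
  local metrics = get_request_metrics()  -- hypothetical accessor for the per-request store
  local value = metrics[key]
  if value ~= nil then
    -- an explicitly set value (e.g. a provider-supplied total) wins; no recalculation
    return value
  end

  if key == "llm_total_tokens_count" then
    -- backward-compatible fallback for providers that do not send total_tokens
    return (metrics["llm_prompt_tokens_count"] or 0)
         + (metrics["llm_completion_tokens_count"] or 0)
  end

  return 0
end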

@spacewander
Contributor

@git-hulk
Yes, I checked it, and you are right.

@git-hulk
Contributor

@git-hulk Yes, I checked it, and you are right.

@spacewander Thanks for your prompt and kind reply. For this fix, I'm wondering if it would be better to 'correct' the completion token count if the total count exists.

@spacewander
Contributor

@git-hulk Yes, I checked it, and you are right.

@spacewander Thanks for your prompt and kind reply. For this fix, I'm wondering if it would be better to 'correct' the completion token count if the total count exists.

@fffonion
What do you think?

Pro: this behavior follows the OpenAI one: the completion token count is the number of response tokens.
Con: the result doesn't match the usage reported by Gemini. People may think we are doing it wrong when they check the bill.

@git-hulk
Contributor

git-hulk commented Jan 28, 2026

@git-hulk Yes, I checked it, and you are right.

@spacewander Thanks for your prompt and kind reply. For this fix, I'm wondering if it would be better to 'correct' the completion token count if the total count exists.

@fffonion What do you think?

Pro: this behavior follows the OpenAI one: the completion token count is the number of response tokens. Con: the result doesn't match the usage reported by Gemini. People may think we are doing it wrong when they check the bill.

I think this fix is not quite correct. We should correct the candidates' token count by adding the thinking token count to it. Another way would be to add a thoughts token count field to the usage, but I don't think that's a good approach because it's Gemini-specific behavior.

And counting the thoughts/tool_use tokens as candidates (completion) tokens won't cause a billing issue, since reasoning tokens share the same price as candidates tokens, see [1]. As it stands, Kong doesn't count the thoughts/tool_use tokens at all, which might confuse users because the billed usage would be higher than the token counts recorded on the Kong side.

[1] https://cloud.google.com/vertex-ai/generative-ai/pricing
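
For concreteness, a rough illustration of the correction described above, applied where a Gemini response's usageMetadata gets normalized into an OpenAI-style usage table (a sketch under that assumption, not the actual Kong driver code):

-- Fold Gemini's thoughts tokens into the completion count so that
-- total == prompt + completion holds, matching the OpenAI convention.
local meta = response.usageMetadata or {}
local prompt     = meta.promptTokenCount or 0
local completion = (meta.candidatesTokenCount or 0) + (meta.thoughtsTokenCount or 0)
local usage = {
  prompt_tokens     = prompt,
  completion_tokens = completion,
  total_tokens      = meta.totalTokenCount or (prompt + completion),
}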

@fffonion
Contributor

Let's ignore Kong's behaviour for now. If I understand correctly, currently if a user takes the total token count from Gemini and multiplies it by the token price, the number the user gets is not the same as what Google actually charges. And we are trying to fix that behaviour, right?

@git-hulk
Contributor

Let's ignore Kong's behaviour for now. If I understand correctly, currently if a user takes the total token count from Gemini and multiplies it by the token price, the number the user gets is not the same as what Google actually charges. And we are trying to fix that behaviour, right?

@fffonion, the prices of prompt and completion tokens differ, so we cannot simply multiply the total count by a single price. From my side, the main issue is that, for Gemini, the thoughts token count isn't counted in the completion token count. So the cost computed on the Kong side will be lower than the actual billed usage.

@aprameyak I'm not sure if you're suffering the same issue.
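
A tiny worked example of the undercount (the per-token prices are invented purely for illustration):

-- Invented prices: $1 per 1M input tokens, $10 per 1M output tokens.
-- Gemini reports: prompt = 100, candidates = 50, thoughts = 300 (billed at the output rate).
local actual_cost = (100 * 1 + (50 + 300) * 10) / 1e6  -- 0.0036, what Google bills
local kong_cost   = (100 * 1 +  50        * 10) / 1e6  -- 0.0006, thoughts dropped on the Kong side
-- The Kong-side cost/usage ends up lower than the real bill.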

@aprameyak
Author

This PR is intentionally scoped to the original issue (#14816), which is that llm_total_tokens_count ignored the explicit usage.total_tokens value and instead recomputed it as prompt + completion.

Normalizing completion token semantics across providers (e.g. Gemini thoughts/tool-use vs candidates) is a related but separate concern and wasn’t part of the issue being addressed here. I think that’s worth discussing separately if we want to change how completion tokens are defined.

@aprameyak
Author

@fffonion, @spacewander, @git-hulk I wanted to ask if there is anything else I should do for this PR. I'm not clear on whether any further changes are expected at the moment.



Successfully merging this pull request may close these issues.

[AI Proxy] Incorrect llm_total_tokens_count metric for models with reasoning/hidden tokens
