Improve: Implement accurate token estimation for cost calculations #52

@evansantos

Description

GitSniff finding from PR #49 (Warning level - non-blocking)

The current implementation uses a rough approximation: Math.ceil(text.length / 4)

Different embedding models use different tokenization schemes, so this character-based approximation may lead to:

  • Inaccurate cost tracking for non-English texts
  • Misleading budget alerts
  • Discrepancies with actual API billing

Suggested improvements:

  1. Integrate proper tokenizer (tiktoken for OpenAI models)
  2. Add configurable charsPerToken ratio per model
  3. Document limitation clearly in config schema

Priority: Medium - impacts cost tracking accuracy

See: https://github.com/openai/tiktoken
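Suggestion (2) could be sketched roughly as below. The model names and ratios here are illustrative assumptions, not measured values; for exact counts, a real tokenizer (tiktoken, linked above) would replace the ratio lookup.

```typescript
// Illustrative per-model chars-per-token table; ratios are assumptions,
// not calibrated values, and would need tuning per model and language.
const CHARS_PER_TOKEN: Record<string, number> = {
  "text-embedding-3-small": 4,
  "text-embedding-3-large": 4,
};

// Fallback matches the current behavior: ~4 characters per token.
const DEFAULT_CHARS_PER_TOKEN = 4;

function estimateTokens(text: string, model?: string): number {
  const ratio =
    (model !== undefined ? CHARS_PER_TOKEN[model] : undefined) ??
    DEFAULT_CHARS_PER_TOKEN;
  return Math.ceil(text.length / ratio);
}
```

This keeps the cheap heuristic as the default while letting the config schema document the limitation and override the ratio per model (suggestion 3).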

Metadata


Labels: enhancement (New feature or request)
