@stared stared commented Nov 7, 2025

Results:
- BabaIsAI: 50.0% ± 4.6%
- Rank: #2 on leaderboard (competitive with Gemini-2.5-Pro)
- Episodes: 120 (40 tasks × 3 episodes)
- Cost: ~$90 (29.6M input + 0.1M output tokens)
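
As a sanity check on the numbers above, here is a minimal sketch that reproduces the ± figure and the cost. It rests on two assumptions not stated in this PR: that each episode is scored as a binary success, so the ± value is a binomial standard error, and that pricing is $3 per million input tokens and $15 per million output tokens. The function names are illustrative and not part of the benchmark code.

```python
import math


def binomial_standard_error(success_rate: float, n_episodes: int) -> float:
    """Standard error of a proportion, assuming independent binary episodes."""
    return math.sqrt(success_rate * (1.0 - success_rate) / n_episodes)


def estimated_cost_usd(
    input_mtok: float,
    output_mtok: float,
    usd_per_input_mtok: float,
    usd_per_output_mtok: float,
) -> float:
    """Cost from token counts given in millions and per-million-token prices."""
    return input_mtok * usd_per_input_mtok + output_mtok * usd_per_output_mtok


# Reported values: 50.0% over 120 episodes, 29.6M input + 0.1M output tokens.
se = binomial_standard_error(0.50, 120)

# Prices below are ASSUMED for illustration; they are not given in this PR.
cost = estimated_cost_usd(29.6, 0.1, usd_per_input_mtok=3.0, usd_per_output_mtok=15.0)

print(f"standard error: {100 * se:.1f} percentage points")
print(f"estimated cost: ${cost:.2f}")
```

Under those assumptions the result is consistent with the reported ±4.6% and ~$90. If episodes yield partial progress scores rather than pass/fail, the standard error would need to be computed from the per-episode scores instead.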

Configuration:
- Agent: naive
- Temperature: 1.0
- Max tokens: 4096
- Workers: 16 parallel

Key findings:
- 7.9% improvement over Claude 3.5 Sonnet (Oct 2024)
- Tighter error bars than Gemini-2.5-Pro (±4.6 vs ±8.2)
- BabaIsAI is 7.5× more token-intensive than estimated
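
To put the token-intensity finding in per-episode terms, the following rough sketch works backwards from the reported totals. It assumes the "7.5×" factor applies to the total token count of the run; the original estimate itself is not stated in this PR, so the implied figure is only a back-of-the-envelope value based on rounded inputs.

```python
# Reported totals (rounded in the PR): 29.6M input + 0.1M output tokens, 120 episodes.
total_tokens_m = 29.6 + 0.1
episodes = 120

# Average tokens per episode, in thousands.
tokens_per_episode_k = total_tokens_m * 1000 / episodes

# ASSUMPTION: "7.5x more token-intensive than estimated" refers to total tokens.
overrun_factor = 7.5
implied_estimate_m = total_tokens_m / overrun_factor

print(f"tokens per episode: ~{tokens_per_episode_k:.1f}K")
print(f"implied original estimate: ~{implied_estimate_m:.2f}M tokens")
```

Nearly all of the usage is input tokens, which would be consistent with long prompts being re-sent at each step rather than with long model outputs, though the PR does not break this down.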

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>