# BALROG Benchmark Experiment Notes

**Date**: 2025-11-06
**Model**: Claude Sonnet 4.5 (claude-sonnet-4-5-20250929)
**Environment**: BabaIsAI

## Setup

### Migration: Conda → UV
- Created `.python-version` (Python 3.10)
- Migrated dependencies from `setup.py` to `pyproject.toml`
- Updated all documentation (README, docs/)
- Updated Dockerfile to use uv instead of Miniconda
- Fixed Python requirement: `>=3.8` → `>=3.10` (matching the Python version the original conda setup recommended)
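
The migrated `pyproject.toml` roughly takes this shape; the package name, dependency list, and entry-point module path below are illustrative, not copied from the repo:

```toml
# Hypothetical sketch of the migrated pyproject.toml.
[project]
name = "balrog"
requires-python = ">=3.10"
dependencies = [
    # ...dependencies moved over from setup.py's install_requires
]

[project.scripts]
# Exposes the balrog-post-install command; module path is illustrative.
balrog-post-install = "balrog.post_install:main"
```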

### Installation Command
```bash
uv sync
source .venv/bin/activate
balrog-post-install
```

## Benchmark Execution

### Parameters (Official Fair Benchmark)
```bash
uv run python eval.py \
agent.type=naive \
agent.max_image_history=0 \
agent.max_text_history=16 \
eval.num_workers=16 \
client.client_name=claude \
'client.model_id=claude-sonnet-4-5-20250929' \
'client.generate_kwargs.temperature=1.0' \
'client.generate_kwargs.max_tokens=4096' \
envs.names=babaisai
```

### Configuration
- **Agent**: naive (zero-shot, no reasoning)
- **Temperature**: 1.0 (as recommended by BALROG docs)
- **Episodes**: 120 (40 tasks × 3 episodes)
- **Runtime**: ~2.5 hours
- **Workers**: 16 parallel

## Results

### Performance
- **BabaIsAI Score**: 50.0% ± 4.6%
- **Leaderboard Rank**: #2 (hypothetical; results not submitted)
- **Comparison**:
- Grok-4: 62.9% (#1)
- Gemini-2.5-Pro: 49.2% (#3)
- Claude 3.5 Sonnet (Oct 2024): 42.1% (#4)

### Token Usage & Cost
- **Input tokens**: 29,563,019 (29.6M)
- **Output tokens**: 99,126 (0.1M)
- **Total cost**: ~$90
- Input: $88.68 ($3/M)
- Output: $1.49 ($15/M)
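
The reported costs can be re-derived from the raw token counts, using the per-million rates in the bullets above:

```python
# Sanity-check the reported costs from the raw token counts.
# Rates assumed from the bullets above: $3/M input, $15/M output.
input_tokens = 29_563_019
output_tokens = 99_126

input_cost = input_tokens / 1_000_000 * 3.0     # dollars for input
output_cost = output_tokens / 1_000_000 * 15.0  # dollars for output
total = input_cost + output_cost

print(f"input ${input_cost:.2f} + output ${output_cost:.2f} = ${total:.2f}")
```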

### Cost Analysis
- **Estimated**: $12
- **Actual**: $90
- **Overrun**: 7.5× the estimate

**Why?**
- BabaIsAI has very complex, verbose state descriptions
- Each episode uses ~246K tokens (vs. estimated ~12.5K)
- Context history (16 turns) accumulates ~24K tokens per step
- Longer episodes than expected
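
A quick check of the per-episode figure, using the totals from the Token Usage section above:

```python
# Average input tokens per episode, from the run's totals.
total_input_tokens = 29_563_019
episodes = 120  # 40 tasks x 3 episodes each

tokens_per_episode = total_input_tokens / episodes
print(f"~{tokens_per_episode / 1000:.0f}K input tokens per episode")
```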

## Key Learnings

1. **Claude Sonnet 4.5 shows significant improvement** (+7.9%) over Claude 3.5 Sonnet
2. **Competitive with Gemini-2.5-Pro** (essentially tied at 50% vs 49.2%)
3. **Lower variance** (±4.6% vs Claude 3.5 Sonnet's ±8.2%) suggests more consistent performance
4. **BabaIsAI is token-intensive** - complex environments cost 5-10× more than simple gridworlds
5. **UV migration successful** - clean, reproducible setup with lockfile

## Naive Agent Behavior

The "naive" agent is the zero-shot baseline:
- Emits an action directly from the observation history
- No chain-of-thought reasoning
- No planning or strategy
- In effect, it answers "What should I do next?" with no deliberation step
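
A minimal sketch of what such a step looks like, assuming a hypothetical `client` interface; this is illustrative, not BALROG's actual agent code:

```python
# Hypothetical naive-agent step -- NOT BALROG's implementation.
# The client object and prompt format are illustrative only.

def naive_agent_step(client, history, max_text_history=16):
    """Pick the next action directly from recent observations."""
    # Keep only the most recent turns, mirroring agent.max_text_history=16.
    recent = history[-max_text_history:]
    prompt = "\n".join(recent) + "\nWhat action should you take next?"
    # Single completion call: no chain-of-thought, no planning step.
    return client.complete(prompt)
```

The only moving part is the history truncation, which mirrors the `agent.max_text_history=16` override in the eval command above.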

## Files & Results

**Results location**: `results/2025-11-06_21-09-03_naive_claude-sonnet-4-5-20250929/`
- `summary.json` - aggregate stats
- `babaisai/` - 120 episode trajectories
- `eval.log` - execution log

**Branch**: `uv`
**Commits**:
- `ab829df` - Migrate from conda to uv
- `26e742d` - Set Python minimum to 3.10

## Next Steps

Potential follow-ups:
- [ ] Run full 6-environment benchmark (~$500-600)
- [ ] Submit BabaIsAI results for verification
- [ ] Test with chain_of_thought agent (likely higher scores, more tokens)
- [ ] Compare cost/performance with Gemini Flash (much cheaper)
- [ ] Analyze failure cases in episode trajectories