diff --git a/EXPERIMENT_NOTES.md b/EXPERIMENT_NOTES.md
new file mode 100644
index 00000000..0bcb1d52
--- /dev/null
+++ b/EXPERIMENT_NOTES.md
@@ -0,0 +1,109 @@
+# BALROG Benchmark Experiment Notes
+
+**Date**: 2025-11-06
+**Model**: Claude Sonnet 4.5 (claude-sonnet-4-5-20250929)
+**Environment**: BabaIsAI
+
+## Setup
+
+### Migration: Conda → UV
+- Created `.python-version` (Python 3.10)
+- Migrated dependencies from `setup.py` to `pyproject.toml`
+- Updated all documentation (README, docs/)
+- Updated Dockerfile to use uv instead of Miniconda
+- Fixed Python requirement: `>=3.8` → `>=3.10` (matching the conda recommendation)
+
+### Installation Command
+```bash
+uv sync
+source .venv/bin/activate
+balrog-post-install
+```
+
+## Benchmark Execution
+
+### Parameters (Official Fair Benchmark)
+```bash
+uv run python eval.py \
+  agent.type=naive \
+  agent.max_image_history=0 \
+  agent.max_text_history=16 \
+  eval.num_workers=16 \
+  client.client_name=claude \
+  'client.model_id=claude-sonnet-4-5-20250929' \
+  'client.generate_kwargs.temperature=1.0' \
+  'client.generate_kwargs.max_tokens=4096' \
+  envs.names=babaisai
+```
+
+### Configuration
+- **Agent**: naive (zero-shot, no reasoning)
+- **Temperature**: 1.0 (as recommended by the BALROG docs)
+- **Episodes**: 120 (40 tasks × 3 episodes)
+- **Runtime**: ~2.5 hours
+- **Workers**: 16 parallel
+
+## Results
+
+### Performance
+- **BabaIsAI Score**: 50.0% ± 4.6%
+- **Leaderboard Rank**: #2 (if submitted)
+- **Comparison**:
+  - Grok-4: 62.9% (#1)
+  - Gemini-2.5-Pro: 49.2% (#3)
+  - Claude 3.5 Sonnet (Oct 2024): 42.1% (#4)
+
+### Token Usage & Cost
+- **Input tokens**: 29,563,019 (29.6M)
+- **Output tokens**: 99,126 (0.1M)
+- **Total cost**: ~$90
+  - Input: $88.69 ($3/M)
+  - Output: $1.49 ($15/M)
+
+### Cost Analysis
+- **Estimated**: $12
+- **Actual**: ~$90
+- **Variance**: 7.5× higher than estimated
+
+**Why?**
+- BabaIsAI has very complex, verbose state descriptions
+- Each episode uses ~246K tokens (vs. an estimated ~12.5K)
+- The 16-turn context history accumulates to ~24K tokens per step
+- Episodes ran longer than expected
+
+## Key Learnings
+
+1. **Claude Sonnet 4.5 shows a significant improvement** (+7.9 points) over Claude 3.5 Sonnet
+2. **Competitive with Gemini-2.5-Pro** (essentially tied: 50.0% vs. 49.2%)
+3. **Tighter error bars** (±4.6 vs. ±8.2) suggest more consistent performance
+4. **BabaIsAI is token-intensive** - complex environments cost 5-10× more than simple gridworlds
+5. **UV migration successful** - clean, reproducible setup with a lockfile
+
+## Naive Agent Behavior
+
+The "naive" agent is the baseline approach:
+- Direct action output from the observation history
+- No chain-of-thought reasoning
+- No planning or strategy
+- Simply "What should I do next?" with no thinking time
+
+## Files & Results
+
+**Results location**: `results/2025-11-06_21-09-03_naive_claude-sonnet-4-5-20250929/`
+- `summary.json` - aggregate stats
+- `babaisai/` - 120 episode trajectories
+- `eval.log` - execution log
+
+**Branch**: `uv`
+**Commits**:
+- `ab829df` - Migrate from conda to uv
+- `26e742d` - Set Python minimum to 3.10
+
+## Next Steps
+
+Potential follow-ups:
+- [ ] Run the full 6-environment benchmark (~$500-600)
+- [ ] Submit BabaIsAI results for verification
+- [ ] Test the chain_of_thought agent (likely higher scores, more tokens)
+- [ ] Compare cost/performance with Gemini Flash (much cheaper)
+- [ ] Analyze failure cases in episode trajectories
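The cost figures in the notes above can be cross-checked with a few lines of arithmetic. This is a minimal sketch, assuming the stated $3/M input and $15/M output pricing and the 120-episode run; the per-episode sanity check uses the total input tokens divided by the episode count:

```python
# Cross-check of the token/cost figures from the notes above,
# assuming $3/M input and $15/M output pricing.
INPUT_TOKENS = 29_563_019
OUTPUT_TOKENS = 99_126
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00
EPISODES = 120

input_cost = INPUT_TOKENS / 1_000_000 * INPUT_PRICE_PER_M     # ~$88.69
output_cost = OUTPUT_TOKENS / 1_000_000 * OUTPUT_PRICE_PER_M  # ~$1.49
total_cost = input_cost + output_cost                         # ~$90.18

# Sanity check: total input tokens / episodes ~ 246K tokens per episode,
# matching the "~246K tokens per episode" estimate above.
tokens_per_episode = INPUT_TOKENS / EPISODES

print(f"total ~= ${total_cost:.2f}, tokens/episode ~= {tokens_per_episode:,.0f}")
```

Note that input tokens dominate the bill (~98% of the total), which is why the 16-turn text history setting is the main cost lever for this environment.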