# BALROG Benchmark Experiment Notes

**Date**: 2025-11-06
**Model**: Claude Sonnet 4.5 (claude-sonnet-4-5-20250929)
**Environment**: BabaIsAI

## Setup

### Migration: Conda → UV
- Created `.python-version` (Python 3.10)
- Migrated dependencies from `setup.py` to `pyproject.toml`
- Updated all documentation (README, docs/)
- Updated Dockerfile to use uv instead of Miniconda
- Fixed Python requirement: `>=3.8` → `>=3.10` (matching the Python version the original conda setup recommended)
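
The migrated `pyproject.toml` roughly takes this shape; the package name, dependency list, and entry-point module path below are illustrative, not copied from the repo:

```toml
# Hypothetical sketch of the migrated pyproject.toml.
[project]
name = "balrog"
requires-python = ">=3.10"
dependencies = [
    # ...dependencies moved over from setup.py's install_requires
]

[project.scripts]
# Exposes the balrog-post-install command; module path is illustrative.
balrog-post-install = "balrog.post_install:main"
```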

### Installation Command
```bash
uv sync
source .venv/bin/activate
balrog-post-install
```

## Benchmark Execution

### Parameters (Official Fair Benchmark)
```bash
uv run python eval.py \
agent.type=naive \
agent.max_image_history=0 \
agent.max_text_history=16 \
eval.num_workers=16 \
client.client_name=claude \
'client.model_id=claude-sonnet-4-5-20250929' \
'client.generate_kwargs.temperature=1.0' \
'client.generate_kwargs.max_tokens=4096' \
envs.names=babaisai
```

### Configuration
- **Agent**: naive (zero-shot, no reasoning)
- **Temperature**: 1.0 (as recommended by BALROG docs)
- **Episodes**: 120 (40 tasks × 3 episodes)
- **Runtime**: ~2.5 hours
- **Workers**: 16 parallel

## Results

### Performance
- **BabaIsAI Score**: 50.0% ± 4.6%
- **Leaderboard Rank**: #2 (hypothetical; results not submitted)
- **Comparison**:
- Grok-4: 62.9% (#1)
- Gemini-2.5-Pro: 49.2% (#3)
- Claude 3.5 Sonnet (Oct 2024): 42.1% (#4)

### Token Usage & Cost
- **Input tokens**: 29,563,019 (29.6M)
- **Output tokens**: 99,126 (0.1M)
- **Total cost**: ~$90
- Input: $88.68 ($3/M)
- Output: $1.49 ($15/M)
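
The reported costs can be re-derived from the raw token counts, using the per-million rates in the bullets above:

```python
# Sanity-check the reported costs from the raw token counts.
# Rates assumed from the bullets above: $3/M input, $15/M output.
input_tokens = 29_563_019
output_tokens = 99_126

input_cost = input_tokens / 1_000_000 * 3.0     # dollars for input
output_cost = output_tokens / 1_000_000 * 15.0  # dollars for output
total = input_cost + output_cost

print(f"input ${input_cost:.2f} + output ${output_cost:.2f} = ${total:.2f}")
```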

### Cost Analysis
- **Estimated**: $12
- **Actual**: $90
- **Overrun**: 7.5× the estimate

**Why?**
- BabaIsAI has very complex, verbose state descriptions
- Each episode uses ~246K tokens (vs. estimated ~12.5K)
- Context history (16 turns) accumulates ~24K tokens per step
- Longer episodes than expected
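
A quick check of the per-episode figure, using the totals from the Token Usage section above:

```python
# Average input tokens per episode, from the run's totals.
total_input_tokens = 29_563_019
episodes = 120  # 40 tasks x 3 episodes each

tokens_per_episode = total_input_tokens / episodes
print(f"~{tokens_per_episode / 1000:.0f}K input tokens per episode")
```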

## Key Learnings

1. **Claude Sonnet 4.5 shows significant improvement** (+7.9%) over Claude 3.5 Sonnet
2. **Competitive with Gemini-2.5-Pro** (essentially tied at 50% vs 49.2%)
3. **Lower variance** (±4.6% vs Claude 3.5 Sonnet's ±8.2%) suggests more consistent performance
4. **BabaIsAI is token-intensive** - complex environments cost 5-10× more than simple gridworlds
5. **UV migration successful** - clean, reproducible setup with lockfile

## Naive Agent Behavior

The "naive" agent is the zero-shot baseline:
- Emits an action directly from the observation history
- No chain-of-thought reasoning
- No planning or strategy
- In effect, it answers "What should I do next?" with no deliberation step
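
A minimal sketch of what such a step looks like, assuming a hypothetical `client` interface; this is illustrative, not BALROG's actual agent code:

```python
# Hypothetical naive-agent step -- NOT BALROG's implementation.
# The client object and prompt format are illustrative only.

def naive_agent_step(client, history, max_text_history=16):
    """Pick the next action directly from recent observations."""
    # Keep only the most recent turns, mirroring agent.max_text_history=16.
    recent = history[-max_text_history:]
    prompt = "\n".join(recent) + "\nWhat action should you take next?"
    # Single completion call: no chain-of-thought, no planning step.
    return client.complete(prompt)
```

The only moving part is the history truncation, which mirrors the `agent.max_text_history=16` override in the eval command above.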

## Files & Results

**Results location**: `results/2025-11-06_21-09-03_naive_claude-sonnet-4-5-20250929/`
- `summary.json` - aggregate stats
- `babaisai/` - 120 episode trajectories
- `eval.log` - execution log

**Branch**: `uv`
**Commits**:
- `ab829df` - Migrate from conda to uv
- `26e742d` - Set Python minimum to 3.10

## Next Steps

Potential follow-ups:
- [ ] Run full 6-environment benchmark (~$500-600)
- [ ] Submit BabaIsAI results for verification
- [ ] Test with chain_of_thought agent (likely higher scores, more tokens)
- [ ] Compare cost/performance with Gemini Flash (much cheaper)
- [ ] Analyze failure cases in episode trajectories