Rollout is a framework for evaluating and optimizing AI agents in container environments. It manages the complexity of defining containerized tasks, executing agent trials, and collecting results at scale.
- Modular architecture: Pluggable environment providers (Docker, Modal), agents, and verifiers
- Concurrent execution: Run multiple trials in parallel with configurable concurrency
- Registry support: Load tasks from local paths or remote git-based registries
- Resource management: Configure CPU, memory, and storage limits per task
- Structured results: JSON output with timing, rewards, and error details
```shell
go install github.com/spachava753/rollout/cmd/rollout@latest
```

Or build from source:

```shell
git clone https://github.com/spachava753/rollout.git
cd rollout
go build -o rollout ./cmd/rollout
```

A task is a directory with:
```
my-task/
├── task.toml          # Configuration and metadata
├── instruction.md     # Instructions for the agent
├── environment/
│   └── Dockerfile     # Container environment definition
├── solution/
│   └── solve.sh       # Oracle solution (optional)
└── tests/
    └── test.sh        # Verification script
```
Example task.toml:
```toml
version = "1.0"

[metadata]
difficulty = "easy"
category = "programming"

[verifier]
timeout_sec = 120.0

[agent]
timeout_sec = 300.0

[environment]
cpus = 2
memory_mb = 4096
```

Define a job:

```yaml
# job.yaml
name: my-evaluation
jobs_dir: jobs
n_attempts: 3
n_concurrent_trials: 4
environment:
  type: docker
agents:
  - name: oracle
datasets:
  - path: ./my-task
```

Run it:

```shell
rollout job.yaml
```

Results are saved to `jobs/<job-name>/` with per-trial logs and a summary JSON.
| Field | Type | Default | Description |
|---|---|---|---|
| `name` | string | timestamp | Job name (used for output directory) |
| `jobs_dir` | string | `jobs` | Base directory for job output |
| `n_attempts` | int | 1 | Number of attempts per agent-task pair |
| `n_concurrent_trials` | int | 1 | Maximum parallel trial executions |
| `timeout_multiplier` | float | 1.0 | Multiplier for all timeouts |
| `instruction_path` | string | `/tmp/instruction.md` | Path where the instruction is placed in the container |
| `environment.type` | string | `docker` | Environment provider (`docker` or `modal`) |
| `environment.preserve_env` | string | `never` | When to keep environments (`never`, `always`, `on_failure`) |
| `verifier.override_timeout_sec` | float | - | Override verifier timeout for all tasks |
| `verifier.max_timeout_sec` | float | - | Maximum allowed verifier timeout |
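As an illustrative combination of the fields above (names match the table; the specific values are examples, not recommendations):

```yaml
# job.yaml (illustrative values)
name: nightly-eval
jobs_dir: jobs
n_attempts: 5
n_concurrent_trials: 8
timeout_multiplier: 2.0          # doubles every agent and verifier timeout
environment:
  type: docker
  preserve_env: on_failure       # keep containers from failed trials for debugging
verifier:
  max_timeout_sec: 600.0
agents:
  - name: oracle
datasets:
  - path: ./my-tasks
```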
| Field | Type | Default | Description |
|---|---|---|---|
| `version` | string | `1.0` | Task format version |
| `environment.docker_image` | string | - | Pre-built image (skips Dockerfile build) |
| `environment.cpus` | int | 1 | CPU cores |
| `environment.memory_mb` | int | 2048 | Memory in MB |
| `environment.storage_mb` | int | 10240 | Storage in MB |
| `environment.build_timeout_sec` | float | 600 | Image build timeout |
| `agent.timeout_sec` | float | 600 | Agent execution timeout |
| `agent.install_timeout_sec` | float | 300 | Agent install script timeout |
| `verifier.timeout_sec` | float | 600 | Verification timeout |
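For example, a task that uses a pre-built image and raises the resource limits from their defaults might declare (the image name is a placeholder):

```toml
version = "1.0"

[environment]
docker_image = "myregistry/my-task-env:v1"  # skips the Dockerfile build
cpus = 4
memory_mb = 8192
storage_mb = 20480

[agent]
timeout_sec = 900.0

[verifier]
timeout_sec = 300.0
```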
Uses local Docker daemon. Requires Docker to be installed and running.
```yaml
environment:
  type: docker
```

Uses Modal cloud sandboxes for remote execution.
```yaml
environment:
  type: modal
  provider_config:
    region: us-east
```

Requirements:

- Modal CLI installed and authenticated (`pip install modal && modal setup`)
- Image builder version 2025.06+: `modal config set image_builder_version 2025.06`

Limitations:

- Dockerfiles cannot use `COPY` or `ADD` with local files (no build context support)
- Environment names are truncated to 64 characters
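Since there is no build context, one workaround is to fetch task files at build time instead of copying them in; a sketch (the URL is a placeholder):

```dockerfile
FROM ubuntu:24.04

# Fetch task files over the network instead of COPY/ADD from a build context
RUN apt-get update && apt-get install -y curl \
    && rm -rf /var/lib/apt/lists/* \
    && curl -fsSL https://example.com/task-data.tar.gz -o /tmp/task-data.tar.gz \
    && mkdir -p /app/data \
    && tar -xzf /tmp/task-data.tar.gz -C /app/data

WORKDIR /app
```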
The built-in oracle agent executes the task's solution/solve.sh script. Use it to validate that tasks are solvable.
```yaml
agents:
  - name: oracle
```

Define custom agents with install and run scripts:

```yaml
agents:
  - name: my-agent
    install_script: |
      pip install my-agent-package
    run_script: |
      my-agent solve --instruction $ROLLOUT_INSTRUCTION_PATH
```

Load all tasks from a directory:
```yaml
datasets:
  - path: ./my-tasks
```

Load specific tasks from a remote registry:

```yaml
datasets:
  - registry:
      url: https://example.com/registry.json
      name: dataset-name
      version: "1.0"
```

Registry format (`registry.json`):
```json
{
  "datasets": [
    {
      "name": "my-dataset",
      "version": "1.0",
      "tasks": [
        {
          "name": "task-1",
          "git_url": "https://github.com/org/repo.git",
          "git_commit_id": "abc123",
          "path": "tasks/task-1"
        }
      ]
    }
  ]
}
```

Task directory layout:

```
my-task/
├── task.toml          # Required: configuration
├── instruction.md     # Required: agent instructions
├── environment/
│   └── Dockerfile     # Required: container definition
├── solution/
│   └── solve.sh       # Optional: oracle solution
└── tests/
    └── test.sh        # Required: verification script
```
The instruction.md file is what the agent sees. Write clear, unambiguous instructions:
```markdown
# Task: Create a greeting file

Create a file called `hello.txt` in the current directory.
The file should contain exactly: `Hello, world!` (with a newline).
Do not include any additional text or formatting.
```

Tips:
- Be explicit about expected output format
- Specify file paths (absolute or relative to working directory)
- Include constraints and edge cases
- Avoid ambiguous language
Rollout expects OS-like container images with standard utilities. The Dockerfile should:
- Use a full base image (e.g., `ubuntu:24.04`, not Alpine or distroless)
- Ensure `bash` is available (used for script execution)
- Set `WORKDIR` explicitly
- Pre-install task-specific dependencies
```dockerfile
FROM ubuntu:24.04

# Install dependencies needed for the task
RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
```

Reserved paths in the container:
| Path | Purpose |
|---|---|
| `/tmp/instruction.md` | Task instruction (configurable via `instruction_path`) |
| `/tests/` | Verification scripts (copied at runtime) |
| `/oracle/` | Solution files (copied by the oracle agent) |
| `/logs/agent/` | Agent can write logs here |
| `/logs/verifier/` | Verifier writes the reward here |
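As a sketch of how a verifier might use the reserved log path, here is a hypothetical helper (`write_reward` is not part of Rollout) that clamps the score to the required 0.0-1.0 range and creates the directory before writing:

```python
import os


def write_reward(score: float, path: str = "/logs/verifier/reward.txt") -> float:
    """Clamp score to [0.0, 1.0] and write it where Rollout reads the reward."""
    reward = max(0.0, min(1.0, score))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(str(reward))
    return reward
```

The default path matches the reserved `/logs/verifier/` location; pass a different path when testing outside a container.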
The solution/solve.sh script is executed by the oracle agent to validate the task is solvable:
```bash
#!/bin/bash
set -e

# Your solution here
echo "Hello, world!" > hello.txt
```

Requirements:

- Must be a bash script named `solve.sh`
- Should complete within the agent timeout
- Can include additional files in the `solution/` directory
The verifier checks if the agent succeeded. It must write a reward value (0.0-1.0) to /logs/verifier/reward.txt.
Simple bash verifier:
```bash
#!/bin/bash
# tests/test.sh
mkdir -p /logs/verifier

if [ -f /app/hello.txt ] && grep -q "Hello, world!" /app/hello.txt; then
    echo "1.0" > /logs/verifier/reward.txt
    echo "PASS: File exists with correct content"
else
    echo "0.0" > /logs/verifier/reward.txt
    echo "FAIL: File missing or incorrect content"
    exit 1
fi
```

Pytest verifier with partial credit:
```python
# tests/test_state.py
import os

import pytest


def test_file_exists():
    assert os.path.exists("/app/hello.txt")


def test_content_correct():
    with open("/app/hello.txt") as f:
        assert f.read().strip() == "Hello, world!"


@pytest.fixture(scope="session", autouse=True)
def write_reward(request):
    yield
    failed = request.session.testsfailed
    total = request.session.testscollected
    reward = (total - failed) / total if total > 0 else 0.0
    os.makedirs("/logs/verifier", exist_ok=True)
    with open("/logs/verifier/reward.txt", "w") as f:
        f.write(str(reward))
```

Run pytest tests with a wrapper script:
```bash
#!/bin/bash
# tests/test.sh
cd /tests
python3 -m pytest test_state.py -v
```

For faster iteration or complex environments, use a pre-built image instead of building from a Dockerfile:
```toml
[environment]
docker_image = "myregistry/my-task-env:v1"
```

The image is pulled if not present locally. The `environment/Dockerfile` is ignored when `docker_image` is set.
1. Validate with the oracle agent:

   ```shell
   rollout job.yaml   # with oracle agent
   ```

2. Check the output:
   - `result.json` should show `reward: 1.0`
   - `verifier.log` should show passing tests

3. Debug failures:
   - Set `preserve_env: on_failure` to keep failed containers
   - Check `agent.log` and `verifier.log`
Each trial executes through 6 phases:
- Setup: Build or pull container image, create environment
- Install: Run agent's install script (if defined)
- Execute: Run agent against the task instruction
- Verify: Copy tests and run verification script
- Collect: Download logs and artifacts from container
- Teardown: Destroy the environment
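The sequence above can be sketched as follows. This is a simplified illustration, not Rollout's actual implementation; the phase objects and the teardown-always behavior are assumptions based on the list above:

```python
def run_trial(env, agent, task):
    """Illustrative trial loop: teardown runs even if an earlier phase fails."""
    phases_run = []
    try:
        for name, phase in [
            ("setup", env.setup),        # build/pull image, create environment
            ("install", agent.install),  # agent install script, if any
            ("execute", agent.execute),  # run agent against the instruction
            ("verify", task.verify),     # copy tests, run verification
            ("collect", env.collect),    # download logs and artifacts
        ]:
            phases_run.append(name)
            phase()
    finally:
        phases_run.append("teardown")
        env.teardown()                   # always destroy the environment
    return phases_run
```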
```
jobs/
└── my-evaluation/
    ├── config.json              # Job configuration snapshot
    ├── results.json             # Aggregate results
    └── oracle/
        └── my-dataset/
            └── my-task__1/
                ├── result.json  # Trial result
                ├── agent.log    # Agent stdout/stderr
                └── verifier.log # Verifier output
```
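The layout above is easy to post-process. This sketch (not a built-in command) assumes each `result.json` contains a numeric `reward` field, as shown in the validation steps:

```python
import json
import os


def mean_reward(job_dir: str) -> float:
    """Walk a job's output directory and average the per-trial rewards."""
    rewards = []
    for root, _dirs, files in os.walk(job_dir):
        if "result.json" in files:
            with open(os.path.join(root, "result.json")) as f:
                rewards.append(float(json.load(f)["reward"]))
    return sum(rewards) / len(rewards) if rewards else 0.0
```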
The verifier script must write a reward value (0.0-1.0) to /logs/verifier/reward.txt:
```bash
#!/bin/bash
# tests/test.sh
mkdir -p /logs/verifier
if [ -f /app/hello.txt ]; then
    echo "1.0" > /logs/verifier/reward.txt
else
    echo "0.0" > /logs/verifier/reward.txt
fi
```

Or use pytest with the reward file:
```python
# tests/test_state.py
import os

import pytest


def test_solution():
    with open("/app/hello.txt") as f:
        assert f.read().strip() == "Hello, world!"


@pytest.fixture(scope="session", autouse=True)
def write_reward(request):
    yield
    failed = request.session.testsfailed
    total = request.session.testscollected
    reward = (total - failed) / total if total > 0 else 0.0
    os.makedirs("/logs/verifier", exist_ok=True)
    with open("/logs/verifier/reward.txt", "w") as f:
        f.write(str(reward))
```

- SPEC.md - Product specification with detailed concepts
- ARCHITECTURE.md - Technical architecture and interfaces
```shell
# Run unit tests
go test -short ./...

# Run all tests (requires Docker)
go test -v ./...

# Build
go build -o rollout ./cmd/rollout
```

MIT