Rollout is a framework for evaluating and optimizing agents in container environments. It can be used in many ways, such as running custom agent evals, optimizing prompts, and testing agents in CI/CD.
Defining and managing containerized tasks at scale is hard. Rollout is built to make it easy.
Rollout provides:
- Simple, modular interfaces for environments, agents, and tasks, making it easy to extend
- All popular CLI agents pre-integrated
- A registry of popular benchmarks and datasets
Rollout has the following core concepts:
A task is a single instruction, container environment, and test script. Tasks are used to evaluate agents. A task is implemented as a directory of files in the following task format:
- `instruction.md`: A markdown file that contains the task's instruction.

- `task.toml`: The `task.toml` file contains the task's configuration and metadata. Metadata is arbitrary and can consist of any information a task developer wants. Config params are nested into their respective sections rather than flat. An example is shown below:

```toml
version = "1.0"

[metadata]
author_name = "Steve Jobs"
author_email = "[email protected]"
difficulty = "easy"
category = "programming"
tags = ["trivial"]

[verifier]
timeout_sec = 120.0

[agent]
install_timeout_sec = 300.0
timeout_sec = 120.0

[environment]
build_timeout_sec = 600.0
docker_image = "some-org/some-name:some-tag"
cpus = 1
memory_mb = 2048
storage_mb = 10240
```
Configuration parameters are defined by the following JSON Schema:
{ "$schema": "http://json-schema.org/draft-07/schema#", "title": "Task Configuration", "type": "object", "required": ["version"], "properties": { "version": { "type": "string", "default": "1.0", "description": "Version of the task configuration format." }, "source": { "type": ["string", "null"], "default": null, "description": "Optional source string for task provenance." }, "metadata": { "type": "object", "description": "Arbitrary metadata provided by the task author.", "additionalProperties": true }, "verifier": { "type": "object", "properties": { "timeout_sec": { "type": "number", "default": 600.0, "description": "Timeout in seconds for the verifier." } } }, "agent": { "type": "object", "properties": { "install_timeout_sec": { "type": "number", "default": 300.0, "description": "Timeout in seconds for agent installation." }, "timeout_sec": { "type": "number", "default": 600.0, "description": "Timeout in seconds for agent execution." } } }, "environment": { "type": "object", "properties": { "build_timeout_sec": { "type": "number", "default": 600.0, "description": "Timeout in seconds for environment build." }, "docker_image": { "type": ["string", "null"], "default": null, "description": "A pre-built Docker image to use for the environment. When set, this image is used instead of building from environment/Dockerfile. If the image is not present locally, Rollout pulls it from the registry before starting the container." }, "cpus": { "type": "integer", "default": 1, "description": "Number of CPUs available to the environment." }, "memory_mb": { "type": "integer", "default": 2048, "description": "Amount of RAM available to the environment in MB." }, "memory_mb": { "type": "integer", "default": 10240, "description": "Amount of storage available to the environment." } } } } }The
cpusfield is an integer number of cores. Thememory_mbandstorage_mbfields are integers representing Megabytes (MB). Rollout translates these values to the appropriate configuration format expected by each environment provider. -
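For illustration (a sketch, not necessarily how Rollout invokes Docker), a Docker-based provider could map the values above onto `docker run` resource flags:

```bash
# Hypothetical translation of the task.toml resource values into docker run flags.
# Note: --storage-opt size= is only honored by certain storage drivers.
docker run -d \
  --cpus 1 \
  --memory 2048m \
  --storage-opt size=10240m \
  some-org/some-name:some-tag \
  sleep infinity
```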
- `environment/`: The environment definition is placed in an `environment/` folder. Rollout expects OS-like container images with standard utilities available (bash, coreutils, etc.). Since tasks evaluate full-featured agents that may need to install dependencies, run build tools, or execute complex scripts, tasks typically build from base images like `ubuntu:24.04` or similar. Rollout assumes `bash` is available and uses it to execute agent install/execute scripts and the verifier. Rollout does not require any specific file to exist in this directory; which files are required depends on the environment type used for the evaluation. For example, to use Docker, we check that an `environment/Dockerfile` is present. Other environment types may require different files (e.g., an Apptainer environment could check for an `image.def` file). Most cloud sandbox providers only support `Dockerfile`-defined environments and not Docker Compose.

There are a few special paths in the environment's filesystem:

| Path | Description |
|---|---|
| `/logs/verifier/` | Contains the reward file produced by the verifier. |
| `/logs/agent/` | A directory agents can use to store logs from their runs. |
| `/oracle/` | The solution folder is copied here by the Oracle agent at runtime and executed from the working directory. |
| `/tests/` | The tests folder is copied here by the Rollout harness at runtime and executed from the working directory. |

The `/logs/` directory is downloaded to the host after the agent/verifier run and is often useful for debugging and analysis.

Working Directory: The working directory for agent and verifier execution is determined by the task's `environment/Dockerfile`. If not explicitly set via `WORKDIR`, it inherits from the base image. Task authors should set `WORKDIR` explicitly to avoid ambiguity.
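For instance, a minimal `environment/Dockerfile` following these recommendations might look like this (the package list is illustrative):

```dockerfile
FROM ubuntu:24.04

# Install common utilities an agent is likely to need; adjust for your task.
RUN apt-get update && apt-get install -y curl git build-essential

# Set the working directory explicitly so agent and verifier runs are unambiguous.
WORKDIR /app
```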
- `solution/`: The solution folder must contain a `solution/solve.sh` script. Other dependencies are allowed. This folder is copied to `/oracle` by the Oracle agent at runtime and executed from the working directory. If no solution is provided, the Oracle agent cannot be used to sanity check the task.
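A minimal sketch of a `solution/solve.sh` for a hypothetical task (the file it writes is purely illustrative):

```bash
#!/bin/bash
# Hypothetical oracle solution: do exactly what instruction.md asks for.
set -euo pipefail

echo "hello world" > /app/hello.txt
```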
- `tests/`: The tests folder must contain a `tests/test.sh` script. The test script should install test dependencies and verify the agent completed the instruction. The test may be anything, from executing `pytest` to using an LLM as a judge to verify task completion.

Other dependencies are allowed in the `tests/` folder. This folder is copied to `/tests` by Rollout at runtime and executed from the working directory (e.g., `bash /tests/test.sh` is executed from `/app` in many cases). We recommend using absolute paths in your test script to avoid relative path issues.
Importantly, the test script must produce a reward file in the `/logs/verifier/` directory and exit with code 0. This is the file that the verifier will read to determine if the task was successful. If the script exits with a non-zero code, the trial is marked as failed regardless of whether a reward file was produced.

| Reward File | Format | Description |
|---|---|---|
| `/logs/verifier/reward.txt` | Plain text (e.g. `1`, `0`, `0.5`) | A plain text file containing a single integer or float value, typically `1` for success or `0` for failure. |

Your test script must output `reward.txt`. Often, the reward can be determined by the exit code of a unit test command. For example, `tests/test.sh`:

```bash
#!/bin/bash
uvx pytest /tests/test.py
if [ $? -eq 0 ]; then
  echo 1 > /logs/verifier/reward.txt
else
  echo 0 > /logs/verifier/reward.txt
fi
```
A dataset is a collection of tasks. Datasets are used to evaluate agents. Usually, a dataset corresponds to a benchmark (e.g. Terminal-Bench, SWE-Bench Verified, SWE-Bench Pro, etc.). Datasets can optionally be distributed via a registry.
Example structure of a dataset:

```
.
├── adaptive-rejection-sampler
│ ├── environment
│ │ ├── Dockerfile
│ │ └── warriors
│ │ ├── g2-clear.red
│ │ ├── paper.red
│ │ ├── snake.red
│ │ ├── stone.red
│ │ └── vampire.red
│ ├── instruction.md
│ ├── solution
│ │ └── solve.sh
│ ├── task.toml
│ └── tests
│ ├── test_outputs.py
│ └── test.sh
└── write-compressor
├── environment
│ ├── data.txt
│ ├── decomp.c
│ ├── Dockerfile
│ └── main.rs
├── instruction.md
├── solution
│ └── solve.sh
├── task.toml
└── tests
├── original-data.txt
├── original-decomp.c
├── test_outputs.py
    └── test.sh
```

A dataset may also be defined in a `registry.json` file, which looks like:

```json
[
{
"name": "hello-world",
"version": "head",
"description": "A single, simple task for debugging.",
"tasks": [
{
"name": "hello-world",
"git_url": "https://github.com/laude-institute/harbor.git",
"path": "examples/tasks/hello-world"
}
]
},
{
"name": "terminal-bench",
"version": "2.0",
"description": "Version 2.0 of Terminal-Bench, a benchmark for testing agents in terminal environments. More tasks, harder, and higher quality than 1.0.",
"tasks": [
{
"name": "adaptive-rejection-sampler",
"git_url": "https://github.com/laude-institute/terminal-bench-2.git",
"git_commit_id": "69671fbaac6d67a7ef0dfec016cc38a64ef7a77c",
"path": "adaptive-rejection-sampler"
},
{
"name": "bn-fit-modify",
"git_url": "https://github.com/laude-institute/terminal-bench-2.git",
"git_commit_id": "69671fbaac6d67a7ef0dfec016cc38a64ef7a77c",
"path": "bn-fit-modify"
},
...
]
}
]
```

A `registry.json` may contain multiple datasets, and each dataset may contain tasks from different repositories. Each task entry may also specify a path within the repository and a git commit id.
Note that the dataset version field does not map to a specific artifact like a git SHA or image digest; at the moment it merely acts as metadata. If there are multiple versions of a dataset, each version has to be defined explicitly and separately in the registry, like so:

```json
[
{
"name": "example multi version dataset",
"version": "1.0",
"description": "A single, simple task for debugging.",
"tasks": [
{
"name": "hello-world",
"git_url": "https://github.com/laude-institute/harbor.git",
"path": "examples/tasks/hello-world"
}
]
},
{
"name": "example multi version dataset",
"version": "2.0",
"description": "A single, simple task for debugging.",
"tasks": [
{
"name": "hello-world",
"git_url": "https://github.com/laude-institute/harbor.git",
"path": "examples/tasks/hello-world"
}
]
},
{
"name": "example multi version dataset",
"version": "alpha",
"description": "A single, simple task for debugging.",
"tasks": [
{
"name": "hello-world",
"git_url": "https://github.com/laude-institute/harbor.git",
"path": "examples/tasks/hello-world"
}
]
}
]
```

An agent is a program that completes tasks. Agents are defined in `job.yaml`. The agents block has the following structure:

```yaml
# ... other parts of job.yaml
agents:
- name: Name of agent (Claude Code, CPE, Codex, Cline, etc.)
description: Description of agent
install: |
Bash script to install agent
execute: |
Bash script to execute agent
env: # key-value env vars that will be passed to the install and execute steps
MY_API_KEY: ${MY_API_KEY} # can use env var from system
# ... other parts of job.yaml
```

When executing the agent, the environment variable `$ROLLOUT_TASK_INSTRUCTION` is set to the path where `instruction.md` has been copied in the container (default: `/tmp/instruction.md`). This path is configurable via `instruction_path` in `job.yaml`.
Here is an example agents definition for installing CPE:

```yaml
# ... other parts of job.yaml
agents:
- name: cpe
description: A custom, minimalistic harness that is heavily customizable
install: |
#!/bin/bash
set -euo pipefail
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"
}
log "Starting CPE installation..."
log "Updating apt packages..."
apt-get update
apt-get install -y curl git
log "✓ apt packages installed"
log "Downloading Go 1.25.5..."
# Force IPv4, increase retries/timeouts, handle transient network failures
curl -4 -fSL --retry 10 --retry-delay 5 --retry-all-errors --connect-timeout 30 --max-time 300 \
-o /tmp/go.tar.gz https://go.dev/dl/go1.25.5.linux-amd64.tar.gz || {
log "Primary download failed, trying alternate mirror..."
curl -4 -fSL --retry 10 --retry-delay 5 --retry-all-errors --connect-timeout 30 --max-time 300 \
-o /tmp/go.tar.gz https://dl.google.com/go/go1.25.5.linux-amd64.tar.gz
}
# Verify download integrity
if ! gzip -t /tmp/go.tar.gz 2>/dev/null; then
log "ERROR: Downloaded file is corrupt"
exit 1
fi
log "✓ Go tarball downloaded"
log "Extracting Go..."
tar -C /usr/local -xzf /tmp/go.tar.gz
rm /tmp/go.tar.gz
log "✓ Go installed to /usr/local/go"
export PATH="/usr/local/go/bin:/root/go/bin:$PATH"
echo 'export PATH="/usr/local/go/bin:/root/go/bin:$PATH"' >> ~/.bashrc
log "✓ PATH configured"
log "Installing CPE..."
go install github.com/spachava753/cpe@latest
log "✓ CPE installed"
log "Creating config directory..."
mkdir -p ~/.config/cpe
log "✓ Config directory created"
log "Writing CPE configuration..."
cat > ~/.config/cpe/cpe.yaml << 'EOF'
version: "1.0"
models:
- ref: sonnet
display_name: "Claude Sonnet 4.5"
id: claude-sonnet-4-5-20250929
type: anthropic
base_url: https://api.anthropic.com/
api_key_env: ANTHROPIC_API_KEY
context_window: 200000
max_output: 64000
input_cost_per_million: 3
output_cost_per_million: 15
generationDefaults:
temperature: 1.0
maxGenerationTokens: 64000
- ref: opus
display_name: "Claude Opus 4.5"
id: claude-opus-4-5-20251101
type: anthropic
base_url: https://api.anthropic.com/
api_key_env: ANTHROPIC_API_KEY
context_window: 200000
max_output: 64000
input_cost_per_million: 5
output_cost_per_million: 25
generationDefaults:
temperature: 1.0
maxGenerationTokens: 64000
- ref: haiku
display_name: "Claude Haiku 4.5"
id: claude-haiku-4-5-20251001
type: anthropic
base_url: https://api.anthropic.com/
api_key_env: ANTHROPIC_API_KEY
context_window: 200000
max_output: 64000
input_cost_per_million: 1
output_cost_per_million: 5
generationDefaults:
temperature: 1.0
maxGenerationTokens: 64000
defaults:
systemPromptPath: "/root/.config/cpe/agent_instructions.prompt"
model: sonnet
timeout: "30m"
noStream: true
codeMode:
enabled: true
EOF
log "✓ CPE configuration written"
log "Downloading system prompt template..."
curl -fsSL --retry 5 --retry-delay 2 --retry-connrefused \
https://raw.githubusercontent.com/spachava753/cpe/main/examples/agent_instructions.prompt \
-o ~/.config/cpe/agent_instructions.prompt
log "✓ System prompt downloaded"
log "CPE installation complete!"
execute: |
#!/bin/bash
set -euo pipefail
export PATH="/root/go/bin:/usr/local/go/bin:$PATH" && cpe -n -G --skip-stdin ${ROLLOUT_TASK_INSTRUCTION} 2>&1 | tee /logs/agent/cpe.txt
env:
ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
```

Note that when defining a job, you may provide environment variables that are set before running the installation and execution scripts. This is especially helpful if you need certain inputs to execute the agent based on the job definition, such as a model id or generation parameters.
In addition, oracle is a reserved agent name with special behavior. When an agent named oracle is specified, Rollout copies the task's solution/ folder to /oracle in the container and executes bash /oracle/solve.sh from the working directory. No install or execute scripts are needed for the oracle agent—specifying - name: oracle in the agents list is sufficient.
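For example, adding the oracle to a job requires only its name:

```yaml
agents:
  - name: oracle # runs bash /oracle/solve.sh from the working directory
```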
Environments in Rollout are containers, typically defined as Docker images via a Dockerfile in a task's environment/ folder, along with any artifacts used while building the image (such as dependencies or archive files copied into the image). Task execution in the container follows this flow:

1. Image building: Build the image using the `environment/Dockerfile`. Some platforms like Modal and Fly can build on their platform, whereas if the `environment.type` in `job.yaml` is `docker`, the image is built locally. Built images are cached by default for the Docker environment type; for cloud sandbox providers, caching depends on provider support and our implementation. The `force_build` option in `job.yaml` forces a rebuild, bypassing the cache. When `force_build` is set, Rollout builds from the task's `environment/Dockerfile` even if `environment.docker_image` is specified in `task.toml`.

2. Start environment: Start a container from the built image on the target platform (create a sandbox via API calls with the image, or deploy to Kubernetes as a Pod with a sleep command) and keep it running. Copy the task's `instruction.md` to the configured path (default: `/tmp/instruction.md`) and set the `$ROLLOUT_TASK_INSTRUCTION` environment variable to this path.

3. Install agent: Copy the agent install script into the container and execute it.

4. Execute agent: Copy the agent execute script into the container and execute it. If the script exits with a non-zero exit code, the trial is marked as failed and the verifier is skipped. We cannot be certain of the filesystem state after a failed agent execution, so verification results would be unreliable. Additionally, verifiers can be expensive to run, so we only execute them when there are meaningful results to verify.

5. Verify: Copy the task's `tests/` folder into the container at `/tests` and execute `/tests/test.sh`.

6. Create trial output: Collect all outputs before stopping the container:
   - Copy the container's `/logs` directory to `<trial>/logs/` on the host
   - Write captured stdout/stderr from agent install to `<trial>/setup/`
   - Write captured stdout/stderr from agent execute to `<trial>/command/`
   - Generate `<trial>/result.json` with timing, cost, reward, and error information

7. Stop environment: Stop the running container.

8. Clean up (optional): Environment cleanup is controlled by the `preserve_env` setting:
   - `"never"` (default): Always clean up environment resources after task completion.
   - `"always"`: Never clean up; preserve all environments for inspection.
   - `"on_failure"`: Only preserve environments for failed trials (e.g., a reward of 0 or an error). This is recommended for debugging, as it preserves failed environments while cleaning up successful ones to save resources.

   Cleanup is provider-specific: for Docker, the container is removed (`docker rm <container-id>`); for Modal, the Modal app is deleted; for Kubernetes, related resources are deleted (e.g., if executing as a Job, the Job resource is removed). Note that this is separate from `force_build`, which controls image caching rather than environment cleanup.
Error handling: If a fatal error occurs at any phase (environment build, agent install/execution, verification, or teardown), the error details are written to <trial>/error.txt and the error field is populated in result.json. See Error Types for all possible error types.
We aim to support many cloud providers and platforms out of the box, including Fly, Modal, and Kubernetes.
Provider-specific configuration: The environment.provider_config field accepts provider-specific settings. Each environment type may expose its own configuration knobs. Rollout provides sensible defaults where possible, but users may need to configure provider-specific options such as:
- Docker: Custom network, runtime (e.g., nvidia for GPU), volume mounts
- Kubernetes: Namespace, service account, node selectors, tolerations, resource classes
- Modal: App name prefix/suffix, region, secrets references
- Fly: Organization, region, VM size presets
Provider implementations should document their supported configuration options.
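As a sketch (the option names are illustrative and depend on the provider implementation), a Docker `provider_config` might look like:

```yaml
environment:
  type: docker
  provider_config:
    network: rollout-net # custom Docker network for trial containers
    runtime: nvidia      # container runtime, e.g. for GPU tasks
```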
A trial is an agent's attempt at completing a task. A trial is spawned from a job definition (not explicitly defined by the user) and produces a structured output.
Trial enumeration: Trials are generated as the Cartesian product of (agent, task, attempt). For a job with n_attempts attempts, multiple agents, and multiple datasets containing tasks, Rollout generates one trial for each combination. Since trials typically run concurrently and complete in varying times, results are produced in non-deterministic order.
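As an illustration, a job with 2 agents, 2 tasks, and `n_attempts: 2` yields 2 × 2 × 2 = 8 trials. A bash sketch of the enumeration (the agent and task names are hypothetical):

```bash
# Hypothetical enumeration of the (agent, task, attempt) Cartesian product.
for agent in cpe oracle; do
  for task in hello-world write-compressor; do
    for attempt in 1 2; do
      echo "trial: agent=${agent} task=${task} attempt=${attempt}"
    done
  done
done
```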
Result generation: Trial results are written to disk as each trial completes, rather than being collected and written at the end of the job. This ensures that partial results are preserved if a job is cancelled or encounters a fatal error.
Trial outputs include:
- Reward: The score produced by the verifier (from `/logs/verifier/reward.txt`)
- Timing: Durations and timestamps for each phase (environment setup, agent setup, agent execution, verification). Note that "environment setup" encompasses image build/pull, container start, and file copy operations.
- Cost: Resource costs incurred during the trial. Cost is provider-specific and calculated based on the environment type selected (e.g., Modal pricing for Modal environments, compute costs for Kubernetes, etc.)
- Error: Any errors that occurred during execution
Trial results are stored in <trial>/result.json. See Job Output for the full schema and directory structure.
Trials can fail at various phases of execution. The `error` field in `result.json` uses a structured format to indicate the failure mode:

```json
{
"type": "error_type_identifier",
"message": "Human-readable description of what went wrong"
}
```

The following error types are defined:
| Error Type | Phase | Description |
|---|---|---|
| `environment_build_failed` | Environment Build | Dockerfile build failed (syntax error, failed command, missing base image) |
| `environment_build_timeout` | Environment Build | Build exceeded `environment.build_timeout_sec` |
| `environment_image_pull_failed` | Environment Build | Failed to pull pre-built image specified in `environment.docker_image` |
| `environment_start_failed` | Environment Start | Container failed to start after image was built/pulled |
| `environment_resource_allocation_failed` | Environment Start | Platform could not allocate requested CPU/memory/storage |
| `agent_install_failed` | Agent Install | Agent install script returned non-zero exit code |
| `agent_install_timeout` | Agent Install | Agent install script exceeded `agent.install_timeout_sec` |
| `agent_execution_failed` | Agent Execution | Agent execute script returned non-zero exit code |
| `agent_execution_timeout` | Agent Execution | Agent execution exceeded `agent.timeout_sec` |
| `verifier_failed` | Verification | Test script returned non-zero exit code |
| `verifier_timeout` | Verification | Test script exceeded `verifier.timeout_sec` |
| `verifier_reward_missing` | Verification | Test script completed but no `/logs/verifier/reward.txt` was produced |
| `verifier_reward_invalid` | Verification | Reward file exists but contains an invalid format (not a number) |
| `environment_teardown_failed` | Teardown | Failed to stop container or clean up resources |
| `task_invalid` | Pre-execution | Task structure is invalid (missing required files like `instruction.md` or `tests/test.sh`) |
| `task_not_found` | Pre-execution | Task does not exist at the specified path |
| `internal_error` | Any | Unexpected Rollout internal error |
Error handling behavior:
- If an error occurs during environment build/start, the trial is marked as failed and no further phases execute.
- If an error occurs during agent install/execution (including non-zero exit codes), the verifier is skipped and Rollout proceeds directly to teardown. This is intentional: the filesystem state after agent failure is uncertain, making verification unreliable.
- If the verifier fails or times out, the `reward` field is set to `null`.
- If the verifier script exits with a non-zero exit code, it is always treated as an error (`verifier_failed`), even if a `reward.txt` file was produced. The reward file is ignored in this case.
- Environment teardown errors are recorded but do not affect the trial's reward.
- For `task_invalid` and `task_not_found` errors, no container is started.
Cancellation behavior: Users may cancel a running job at any time (e.g., via SIGINT/Ctrl+C). When cancelled:
- Rollout stops accepting new trials for execution.
- Currently running trials are terminated: environments are stopped regardless of their execution phase.
- Trials that were queued but not yet started are recorded as skipped in the job results.
- The job-level `result.json` is written with the `cancelled` field set to `true` and `skipped_trials` recording the trials that were not executed.
- Rollout exits with a non-zero exit code.
A job is a collection of trials. Jobs are used to evaluate agents across datasets, and can be configured via a job.yaml or job.json file.

The structure of a job looks like this:

```yaml
name: my-eval-run # Optional. If not provided, defaults to YYYY-MM-DD__HH-mm-ss timestamp. Used as the output directory name.
jobs_dir: jobs # the output directory to store trial results
n_attempts: 1 # number of attempts per (agent, task) pair
n_concurrent_trials: 4 # number of concurrent trials
timeout_multiplier: 1.0 # multiplier applied to ALL configurable timeouts: task-level timeouts (agent.timeout_sec, agent.install_timeout_sec, verifier.timeout_sec, environment.build_timeout_sec), job-level override timeouts (verifier.override_timeout_sec, verifier.max_timeout_sec), and any other timeout settings. Useful when running on slower hardware
retry:
max_attempts: 3 # maximum number of retry attempts for transient infrastructure failures (default: 3)
initial_delay_ms: 1000 # initial delay in milliseconds before first retry (default: 1000)
max_delay_ms: 30000 # maximum delay in milliseconds between retries (default: 30000)
multiplier: 2.0 # exponential backoff multiplier (default: 2.0)
log_level: error # log level for rollout
instruction_path: /tmp/instruction.md # path where instruction.md is copied in the container; $ROLLOUT_TASK_INSTRUCTION will contain this path
environment:
type: "docker" # or k8s, or modal, etc.
force_build: false # if true, bypass image cache and force a rebuild. When set, builds from Dockerfile even if task specifies environment.docker_image
preserve_env: "on_failure" # One of "always", "never", "on_failure". Default is "never". Controls when to preserve the environment after task completion.
provider_config: # Provider-specific configuration. Structure depends on environment.type.
# Example for Modal:
# app_name: "rollout-my-app" # Optional: Name of the Modal app to use
# region: "us-east" # Optional: Single region to run in
# regions: ["us-east", "us-west"] # Optional: Multiple regions
# verbose: true # Optional: Enable detailed sandbox logging
# Example for Kubernetes:
# namespace: "rollout-evals"
# service_account: "rollout-runner"
# node_selector:
# gpu: "true"
# Example for Docker:
# network: "rollout-net"
# runtime: "nvidia"
override_cpus: 1 # if set, override task specific cpu config
override_memory_mb: 2048 # if set, override task specific memory config
override_storage_mb: 30720 # if set, override task specific storage config
verifier:
override_timeout_sec: 0 # if set, override task specific verifier timeout
max_timeout_sec: 0 # if set, sets the ceiling of timeouts for verifiers
disable: false # if set, disables executing verifier. Useful if you just want to collect trial execution traces
metrics:
- type: "mean" # can be one of sum, min, max, or mean, will print the metrics as jobs are executing
agents:
- name: "some-agent" # name of agent
description: "description of some-agent" # name of agent
install: "install bash script" # bash script to install agent
execute: "execute bash script" # bash script to execute agent
env:
MY_API_KEY: ${MY_API_KEY}
- name: oracle
datasets:
- path: ./path-to-dataset # path to a local dataset, a folder that contains tasks
tasks: # optional: filter to run only specific tasks (omit to run all tasks)
- task-name-1
- task-name-2
- registry:
path: ./path-to-registry # path to a registry defining datasets
name: "dataset-name" # name of dataset as found in registry
version: "dataset-version" # version of dataset as found in registry
tasks: # optional: filter to run only specific tasks
- specific-task-a
- specific-task-b
- registry:
url: https://raw.githubusercontent.com/laude-institute/harbor/refs/heads/main/registry.json # url to a registry defining datasets
name: "dataset-name" # name of dataset as found in registry
version: "dataset-version" # version of dataset as found in registry
# tasks field omitted = run all tasks (backward compatible)
```

Task filtering: The `tasks` field allows you to selectively run specific tasks from a dataset instead of executing all tasks. This is useful for debugging specific tasks, running subsets for faster iteration, or focusing on particular areas of a benchmark.
Task filter behavior:
- When `tasks` is omitted (nil), all tasks in the dataset are executed (backward compatible).
- When `tasks` is specified, only the listed tasks are executed.
- Task names are case-sensitive (they match directory names on the filesystem).
- Task names must match the pattern `^[a-zA-Z0-9][a-zA-Z0-9_-]*$` (start with an alphanumeric character; contain only alphanumeric characters, underscores, and hyphens).
- Duplicate task names in the filter are silently deduplicated (first occurrence preserved).
- If any specified task does not exist in the dataset, the job fails with an error listing the missing tasks.
- Filtered tasks are executed in the order specified in the `tasks` array.
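You can sanity-check a task name against this pattern with bash, for example:

```bash
# Prints "valid" if the name matches Rollout's task-name pattern.
name="adaptive-rejection-sampler"
[[ "$name" =~ ^[a-zA-Z0-9][a-zA-Z0-9_-]*$ ]] && echo valid || echo invalid
```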
Example task filtering configurations:

```yaml
# Run only two specific tasks from a local dataset
datasets:
- path: ./my-benchmark
tasks:
- hello-world
- complex-task
# Run specific tasks from a registry dataset
datasets:
- registry:
url: https://example.com/registry.json
name: "terminal-bench"
version: "2.0"
tasks:
- adaptive-rejection-sampler
- write-compressor
# Run all tasks (default behavior when tasks is omitted)
datasets:
- path: ./my-benchmark
# tasks field omitted = run all tasks
```

Retry policy: The retry configuration enables automatic retries for transient infrastructure failures such as HTTP 500 errors from cloud sandbox providers, network timeouts during image pulls, or temporary resource allocation failures. Retries use exponential backoff: the delay between attempts starts at `initial_delay_ms` and multiplies by `multiplier` after each attempt, capped at `max_delay_ms`. Only infrastructure-level failures are retried; agent execution failures and verifier failures are not retried, as they represent legitimate task outcomes.
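With the defaults above, the waits before successive retries are 1000 ms, 2000 ms, 4000 ms, and so on, capped at 30000 ms. A small bash sketch of the schedule (illustrative, not Rollout's implementation):

```bash
# Illustrative exponential backoff: the delay doubles after each retry, capped at max_delay.
delay=1000
max_delay=30000
for attempt in 1 2 3; do
  echo "retry ${attempt}: waiting ${delay} ms"
  delay=$(( delay * 2 ))
  (( delay > max_delay )) && delay=$max_delay
done
```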
Running a job produces a structured output directory containing all trial results, logs, and metadata. The output is organized as follows:
| Path | Description |
|---|---|
| `jobs/` | The jobs folder, as specified by `jobs_dir` in `job.yaml` |
| `jobs/<job-name>/` | Directory for the job execution. If `name` is not provided in `job.yaml`, defaults to a `YYYY-MM-DD__HH-mm-ss` timestamp |
| `jobs/<job-name>/config.json` | Copy of `job.yaml` in JSON format |
| `jobs/<job-name>/result.json` | Aggregate results of the job execution |
| `jobs/<job-name>/<agent-name>/` | Results grouped by agent |
| `jobs/<job-name>/<agent-name>/<dataset-name>/` | Results grouped by dataset within agent |
| `jobs/<job-name>/<agent-name>/<dataset-name>/<task-name>__<n>/` | Trial-specific directory. `<task-name>` is the task folder name, `<n>` is the attempt number |
| `jobs/<job-name>/<agent-name>/<dataset-name>/<task-name>__<n>/result.json` | Trial result |
| `jobs/<job-name>/<agent-name>/<dataset-name>/<task-name>__<n>/error.txt` | Error details if environment build/destroy or other fatal errors occurred |
| `jobs/<job-name>/<agent-name>/<dataset-name>/<task-name>__<n>/setup/` | Rollout-captured logs from agent installation |
| `jobs/<job-name>/<agent-name>/<dataset-name>/<task-name>__<n>/setup/stderr.txt` | Stderr from agent installation |
| `jobs/<job-name>/<agent-name>/<dataset-name>/<task-name>__<n>/setup/stdout.txt` | Stdout from agent installation |
| `jobs/<job-name>/<agent-name>/<dataset-name>/<task-name>__<n>/command/` | Rollout-captured logs from agent execution |
| `jobs/<job-name>/<agent-name>/<dataset-name>/<task-name>__<n>/command/stderr.txt` | Stderr from agent execution |
| `jobs/<job-name>/<agent-name>/<dataset-name>/<task-name>__<n>/command/stdout.txt` | Stdout from agent execution |
| `jobs/<job-name>/<agent-name>/<dataset-name>/<task-name>__<n>/logs/` | Copied from the container's `/logs` directory |
| `jobs/<job-name>/<agent-name>/<dataset-name>/<task-name>__<n>/logs/agent/` | Agent-produced logs (from `/logs/agent/` in the container) |
| `jobs/<job-name>/<agent-name>/<dataset-name>/<task-name>__<n>/logs/verifier/` | Verifier logs and results |
| `jobs/<job-name>/<agent-name>/<dataset-name>/<task-name>__<n>/logs/verifier/reward.txt` | Reward file produced by the tests |
| `jobs/<job-name>/<agent-name>/<dataset-name>/<task-name>__<n>/logs/verifier/stderr.txt` | Stderr from test execution |
| `jobs/<job-name>/<agent-name>/<dataset-name>/<task-name>__<n>/logs/verifier/stdout.txt` | Stdout from test execution |
Note: Since a job may contain multiple datasets, it is possible for different datasets to contain tasks with the same name. To handle this, trial results are namespaced under the dataset name, allowing the same task name to exist in different datasets without conflict.
Dataset naming: For registry-defined datasets, the dataset name from the registry is used. For local datasets specified via path, the base name of the directory is used as the dataset name (e.g., a dataset at ./benchmarks/my-tasks would use my-tasks as the dataset name in the output hierarchy).
The trial-level `result.json` contains timing, cost, reward, and error information for a single trial. The `task_git_commit_id` field captures the Git commit SHA of the task source at execution time, enabling reproducibility and comparison across job runs. For registry-defined tasks, this is the `git_commit_id` from the registry entry; for local tasks, Rollout resolves the current HEAD commit of the task's repository (or `null` if the task is not in a Git repository).
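For a local task, this resolution is roughly equivalent to running (the path is illustrative):

```bash
# Resolve the HEAD commit of the repository containing a local task directory.
git -C ./path-to-dataset/hello-world rev-parse HEAD
```

An example trial result:

```json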
{
"task_name": "hello-world",
"dataset_name": "terminal-bench",
"agent_name": "cpe",
"attempt": 1,
"task_git_commit_id": "a1b2c3d4e5f6789012345678901234567890abcd",
"reward": 1.0,
"cost": 0.0342,
"error": null,
"durations": {
"total_sec": 245.3,
"environment_setup_sec": 120.5,
"agent_setup_sec": 45.2,
"agent_execution_sec": 65.1,
"verifier_sec": 14.5
},
"timestamps": {
"started_at": "2025-01-15T10:30:00Z",
"environment_setup_started_at": "2025-01-15T10:30:00Z",
"environment_setup_ended_at": "2025-01-15T10:32:00Z",
"agent_setup_started_at": "2025-01-15T10:32:00Z",
"agent_setup_ended_at": "2025-01-15T10:32:45Z",
"agent_execution_started_at": "2025-01-15T10:32:45Z",
"agent_execution_ended_at": "2025-01-15T10:33:50Z",
"verifier_started_at": "2025-01-15T10:33:50Z",
"verifier_ended_at": "2025-01-15T10:34:05Z",
"ended_at": "2025-01-15T10:34:05Z"
}
}
```

The `error` field, when present, describes the failure type (see Error Types for all possible values):

```json
{
"task_name": "complex-task",
"dataset_name": "terminal-bench",
"agent_name": "cpe",
"attempt": 1,
"task_git_commit_id": "a1b2c3d4e5f6789012345678901234567890abcd",
"reward": null,
"cost": 0.012,
"error": {
"type": "agent_execution_timeout",
"message": "Agent execution exceeded timeout of 600 seconds"
},
"durations": {
"total_sec": 720.0,
"environment_setup_sec": 115.2,
"agent_setup_sec": 4.8,
"agent_execution_sec": 600.0,
"verifier_sec": null
},
"timestamps": {
"started_at": "2025-01-15T11:00:00Z",
"environment_setup_started_at": "2025-01-15T11:00:00Z",
"environment_setup_ended_at": "2025-01-15T11:01:55Z",
"agent_setup_started_at": "2025-01-15T11:01:55Z",
"agent_setup_ended_at": "2025-01-15T11:02:00Z",
"agent_execution_started_at": "2025-01-15T11:02:00Z",
"agent_execution_ended_at": "2025-01-15T11:12:00Z",
"verifier_started_at": null,
"verifier_ended_at": null,
"ended_at": "2025-01-15T11:12:00Z"
}
}
```

The job-level `result.json` contains aggregate metrics across all trials:

```json
{
"job_name": "my-eval-run",
"cancelled": false,
"total_trials": 20,
"completed_trials": 18,
"failed_trials": 2,
"skipped_trials": 0,
"pass_rate": 0.85,
"mean_reward": 0.85,
"total_cost": 1.234,
"total_duration_sec": 3600.5,
"started_at": "2025-01-15T10:00:00Z",
"ended_at": "2025-01-15T11:00:00Z",
"agents": {
"cpe": {
"total_trials": 10,
"completed_trials": 9,
"failed_trials": 1,
"pass_rate": 0.9,
"mean_reward": 0.9,
"total_cost": 0.617
},
"oracle": {
"total_trials": 10,
"completed_trials": 9,
"failed_trials": 1,
"pass_rate": 0.8,
"mean_reward": 0.8,
"total_cost": 0.617
}
},
"results": [
{
"task_name": "hello-world",
"dataset_name": "terminal-bench",
"agent_name": "cpe",
"attempt": 1,
"reward": 1.0
},
{
"task_name": "hello-world",
"dataset_name": "terminal-bench",
"agent_name": "oracle",
"attempt": 1,
"reward": 1.0
},
{
"task_name": "complex-task",
"dataset_name": "terminal-bench",
"agent_name": "cpe",
"attempt": 1,
"reward": 0.0
},
{
"task_name": "complex-task",
"dataset_name": "terminal-bench",
"agent_name": "oracle",
"attempt": 1,
"reward": 1.0
}
]
}
```

Field definitions:
- `cancelled`: Whether the job was cancelled before completion
- `skipped_trials`: Number of trials that were queued but not executed due to cancellation
- `completed_trials`: Trials where the verifier ran and produced a reward (regardless of reward value)
- `failed_trials`: Trials where an error prevented the verifier from producing a reward (any error type except `environment_teardown_failed`)
- `pass_rate`: Proportion of completed trials with reward 1.0
- `mean_reward`: Average reward across completed trials only (failed trials excluded)
The user interacts with Rollout via a CLI, simply providing a job.yaml as an argument:

```bash
rollout job.yaml
```

The following features are currently out of scope but may be considered for future versions:
- Job resumption: Resuming stopped or cancelled jobs from where they left off. Currently, cancelled jobs must be restarted from scratch.
- Advanced caching strategies: Baking images with agent installations pre-applied to reduce trial startup time. Currently, agent installation runs fresh for each trial.
- Network isolation: Restricting agent network access during task execution. Currently, agents have full network access, which may be necessary for accessing documentation or downloading dependencies. Users should verify agents have not used unauthorized resources by reviewing agent logs.
- Anonymous datasets: Executing a specified list of task paths directly without defining a dataset. Currently, tasks must be organized into a dataset structure.