This repository contains the harness used in our paper for evaluating coding agents with different repository-level context settings (NONE, LLM, HUMAN) on:
- AGENTBench datasets (for example `eth-sri/agentbench`)
- SWE-Bench Lite datasets (for example `SWE-bench/SWE-bench_Lite`)
The design follows the paper pipeline (Figure 1): generate context/instructions, run an autonomous coding agent, evaluate produced patches with tests, then analyze traces and costs.
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```

You also need Docker available locally for benchmark execution.
You must either export provider API keys or run local vLLM endpoints.
API-key path (examples):
```bash
export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GEMINI_API_KEY=...
export OPENROUTER_API_KEY=...
```

Local vLLM path (Qwen + GPT-OSS):
```bash
cd scripts
bash local_lms/vllm_qwen.sh 0.0.0.0 4000
bash local_lms/vllm_gpt-oss.sh 0.0.0.0 4001
```

Model configs for local models live in `src/configs/model_constants.py` and already point to `http://localhost:4000/v1` and `http://localhost:4001/v1`.
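As an illustration of how such endpoints get wired up, a local-model config entry might look like the sketch below. All names here (`LOCAL_MODELS`, the field names, `endpoint_for`) are hypothetical, not the harness's actual schema; the real entries live in `src/configs/model_constants.py`:

```python
# Hypothetical sketch of local-model config entries; the real schema in
# src/configs/model_constants.py may differ.
LOCAL_MODELS = {
    "qwen-local": {
        "base_url": "http://localhost:4000/v1",  # vllm_qwen.sh endpoint
        "api_key": "EMPTY",  # vLLM's OpenAI-compatible server ignores the key
    },
    "gpt-oss-local": {
        "base_url": "http://localhost:4001/v1",  # vllm_gpt-oss.sh endpoint
        "api_key": "EMPTY",
    },
}

def endpoint_for(name: str) -> str:
    """Return the OpenAI-compatible base URL for a configured local model."""
    return LOCAL_MODELS[name]["base_url"]
```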
The harness has three entrypoints under `scripts/agentbench/run_harness/`:

**generate.py** runs planning and coding-agent generation over benchmark instances:
- Loads model/planner/generator/benchmark configs from `src/configs/`
- Creates benchmark instances (`AgentbenchBenchmark` or `SweBench`)
- For each instance: sets up the Docker environment, runs the planner (which writes `AGENTS.md`/`CLAUDE.md`), runs the coding agent, and saves the trajectory and patch
- Writes predictions to `preds.json` in the computed run directory
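The resulting predictions file can be inspected with a small helper. This is only a sketch: it assumes SWE-bench-style records with `instance_id` and `model_patch` keys, which may not match this harness's exact schema:

```python
import json
from pathlib import Path

def load_predictions(run_dir: str) -> dict[str, str]:
    """Map instance_id -> patch text from a run's preds.json.

    Assumes each record carries "instance_id" and "model_patch" keys,
    as in SWE-bench-style prediction files; adjust to the real schema.
    """
    records = json.loads(Path(run_dir, "preds.json").read_text())
    return {r["instance_id"]: r.get("model_patch", "") for r in records}
```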
Example:
```bash
python scripts/agentbench/run_harness/generate.py \
  --plan_type codex_planner \
  --exec_model gpt-5.2-codex \
  --generator codex \
  --plan_model gpt-5.2-codex \
  --plan_generator codex \
  --benchmark agentbench \
  --dataset_name eth-sri/agentbench \
  --output_dir output/agentbench_v1_runs \
  --workers 8 \
  --run_id 0
```

**evaluate.py** evaluates generated patches by running the benchmark's test logic:
- Reads `preds.json` from the generation output
- Dispatches to the benchmark's `solve(...)`
- For SWE-Bench, calls the official evaluation runner
- For AGENTBench, replays each instance with repo tests + instance tests
- Writes per-instance `report.json` files
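The per-instance reports can then be tallied with a few lines. A sketch, assuming each `report.json` exposes a boolean `resolved` field (the real report layout may nest this differently):

```python
import json
from pathlib import Path

def tally_reports(run_dir: str) -> tuple[int, int]:
    """Count (resolved, total) across per-instance report.json files.

    Assumes each report carries a top-level boolean "resolved" field;
    adjust the lookup if the real layout nests it.
    """
    resolved = total = 0
    for report in Path(run_dir).rglob("report.json"):
        data = json.loads(report.read_text())
        total += 1
        resolved += bool(data.get("resolved"))
    return resolved, total
```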
Example:
```bash
python scripts/agentbench/run_harness/evaluate.py \
  --plan_type codex_planner \
  --exec_model gpt-5.2-codex \
  --generator codex \
  --plan_model gpt-5.2-codex \
  --plan_generator codex \
  --benchmark agentbench \
  --dataset_name eth-sri/agentbench \
  --output_dir output/agentbench_v1_runs \
  --workers 8 \
  --run_id 0
```

**analyze.py** aggregates execution/evaluation artifacts into a CSV:
- Loads `report.json`, trajectories (`*.traj.json`), and optional cached plans
- Extracts metrics: resolved/pass-fail tests, cost, number of steps, tool calls, patch sizes, token stats
- Appends one row per instance to a CSV (default `results.csv`)
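Once rows are in the CSV, a quick summary can be computed with the standard `csv` module. The column names `resolved` and `cost` below are assumptions for illustration; check the header of your `results.csv` for the actual names:

```python
import csv
from statistics import mean

def summarize(csv_path: str) -> dict[str, float]:
    """Compute resolve rate and mean cost from an analyze.py CSV.

    Column names ("resolved", "cost") are assumed for illustration.
    """
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    return {
        # DictReader yields strings, so compare against the literal "True"
        "resolve_rate": mean(r["resolved"] == "True" for r in rows),
        "mean_cost": mean(float(r["cost"]) for r in rows),
    }
```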
Example:
```bash
python scripts/agentbench/run_harness/analyze.py \
  --plan_type codex_planner \
  --exec_model gpt-5.2-codex \
  --generator codex \
  --plan_model gpt-5.2-codex \
  --plan_generator codex \
  --dataset_name eth-sri/agentbench \
  --output_dir output/agentbench_v1_runs \
  --csv experiment1_reasoning.csv \
  --run_id 0
```

If you want a detailed architecture and internals guide for the harness implementation, read `src/agentbench/README.md`.
If you want to build or publish new AGENTBench-style datasets (PR mining, triage, setup, statement generation, test generation, filtering), start with `src/benchmark_generator/README.md`.
If you want to reproduce the paper experiments exactly, use the curated scripts in `final_scripts/` and read `final_scripts/README.md` for the experiment-to-figure mapping.
Some code in this repository was forked from the mini-swe-agent repo. We thank the authors of mini-SWE-agent and recommend checking out their work!
If you find our work useful, please cite our paper:
```bibtex
@misc{gloaguen2026evaluatingagentsmdrepositorylevelcontext,
  title={Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?},
  author={Thibaud Gloaguen and Niels Mündler and Mark Müller and Veselin Raychev and Martin Vechev},
  year={2026},
  eprint={2602.11988},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2602.11988},
}
```