IssueBenchKit

Turn a real GitHub issue, pull request, or local bug into a small coding-agent benchmark task.

SWE-bench is great when you want a public leaderboard. Most teams need something smaller: a repeatable task built from the bugs they actually care about, with a clear test command and a report that says whether a candidate patch really fixed it.

IssueBenchKit is that local builder. It does not try to invent tests for you. It packages the issue context, base commit, reproduction command, and scoring result so you can evaluate coding agents on your own repositories.

Quick Start

pip install issuebenchkit

Create a benchmark task:

issuebench init tasks/qwen-copy \
  --repo ./qwen-code \
  --issue https://github.com/QwenLM/qwen-code/issues/4716 \
  --base 8b4f3b2 \
  --test "npm test -- copyCommand.test.ts"

Or generate runnable demos first:

issuebench demo demo-task
issuebench run demo-task/task --repo demo-task/buggy_repo --out before.json
issuebench run demo-task/task --repo demo-task/fixed_repo --out after.json
issuebench score demo-task/task --before before.json --after after.json
issuebench validate demo-task/task --before-repo demo-task/buggy_repo --after-repo demo-task/fixed_repo --out validation.md

The built-in demos cover more than a toy Python case:

issuebench demo demo-python --kind python
issuebench demo demo-js --kind javascript
issuebench demo demo-mcp --kind mcp-pr
issuebench demo demo-gallery --all

python: a small pytest task around a division-by-zero behavior bug.
javascript: a Node-based slugification bug that drops numeric version suffixes.
mcp-pr: a distilled real contribution around rejecting duplicate MCP initialize calls.

Run the task against a candidate checkout:

issuebench run tasks/qwen-copy --repo ./candidate-qwen-code --out after.json

Compare before and after:

issuebench score tasks/qwen-copy --before before.json --after after.json

Export a report:

issuebench export tasks/qwen-copy --format html --out report.html

Create a Markdown context pack for a coding agent (the second command hands it to PatchContext, a separate CLI):

issuebench context tasks/qwen-copy --result after.json --out qwen-copy-context.md
patchcontext scan --repo ./qwen-code --issue qwen-copy-context.md

What It Stores

Each task directory contains one issuebench.json manifest, which records:

source repo path and optional GitHub issue URL
base commit or version marker
reproduction / validation command
expected signal, notes, and tags
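
To make the manifest contents concrete, here is a minimal sketch of how such a file could be modeled and written. The field names (repo, base, test_command, issue_url, expected_signal, notes, tags) and the helper functions are assumptions chosen to mirror the list above; they are not the tool's confirmed schema.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import List, Optional


@dataclass
class TaskManifest:
    """Hypothetical manifest shape; the real issuebench.json keys may differ."""

    repo: str                        # source repo path
    base: str                        # base commit or version marker
    test_command: str                # reproduction / validation command
    issue_url: Optional[str] = None  # optional GitHub issue URL
    expected_signal: str = ""        # what a failing run should look like
    notes: str = ""
    tags: List[str] = field(default_factory=list)


def write_manifest(task_dir: Path, manifest: TaskManifest) -> Path:
    """Serialize the manifest to <task_dir>/issuebench.json and return its path."""
    task_dir.mkdir(parents=True, exist_ok=True)
    path = task_dir / "issuebench.json"
    path.write_text(json.dumps(asdict(manifest), indent=2) + "\n", encoding="utf-8")
    return path


def read_manifest(task_dir: Path) -> TaskManifest:
    """Load <task_dir>/issuebench.json back into a TaskManifest."""
    data = json.loads((task_dir / "issuebench.json").read_text(encoding="utf-8"))
    return TaskManifest(**data)
```

Keeping the manifest as one flat JSON object means a task can be reviewed in a pull request like any other small config file.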

Run results are plain JSON files with exit code, duration, command, stdout tail, stderr tail, and the pass/fail verdict. They are easy to archive, diff, or attach to a PR.
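
As an illustration of what a run result could look like, the sketch below runs a shell command and collects the fields listed above. The key names (exit_code, duration_seconds, stdout_tail, stderr_tail, passed), the 20-line tail length, and the rule that exit code 0 means "pass" are all assumptions for this example, not the tool's documented behavior.

```python
import json
import subprocess
import time
from typing import Any, Dict


def tail(text: str, max_lines: int = 20) -> str:
    """Return at most the last max_lines lines of text."""
    return "\n".join(text.splitlines()[-max_lines:])


def run_task_command(command: str, cwd: str = ".") -> Dict[str, Any]:
    """Run a shell test command and return a JSON-serializable result record."""
    start = time.monotonic()
    proc = subprocess.run(
        command, shell=True, cwd=cwd, capture_output=True, text=True
    )
    duration = time.monotonic() - start
    return {
        "command": command,
        "exit_code": proc.returncode,
        "duration_seconds": round(duration, 3),
        "stdout_tail": tail(proc.stdout),
        "stderr_tail": tail(proc.stderr),
        # Assumption: a zero exit code is treated as a passing run.
        "passed": proc.returncode == 0,
    }


def save_result(result: Dict[str, Any], out_path: str) -> None:
    """Write the result record as pretty-printed JSON."""
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(result, fh, indent=2)
        fh.write("\n")
```

Storing only the tail of stdout and stderr keeps each result small enough to diff or attach to a PR, at the cost of losing early output from long test runs.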

Why Not Just Use SWE-bench?

Use SWE-bench for public comparison. Use IssueBenchKit when you need:

a benchmark task for a private or small repo
a tiny task that can run in CI
a before/after report for one real bug
a dataset of issues that reflects your own engineering workflow

Current Scope

The first version is intentionally small:

generic shell test commands
built-in runnable demo workspaces for Python, JavaScript, and a distilled real MCP PR
JSON manifest files
before/after scoring
task validation that checks the before run fails, the after run passes, and both runs used the same task command
JSONL and single-file HTML export
Markdown context packs for coding agents and PatchContext
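
The three validation checks in the list above can be sketched as a small function over two result records. This is an illustrative reading of those checks, not the tool's actual code; the "passed" and "command" keys are assumed names for the verdict and command fields of a run result.

```python
from typing import Any, Dict, List


def validate_pair(before: Dict[str, Any], after: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the task is valid.

    Assumes each record has a boolean "passed" and a string "command".
    """
    problems: List[str] = []
    if before["passed"]:
        problems.append("before run passed: the task does not reproduce the bug")
    if not after["passed"]:
        problems.append("after run failed: the fixed checkout does not pass")
    if before["command"] != after["command"]:
        problems.append("before and after runs used different commands")
    return problems
```

For example, `validate_pair({"passed": False, "command": "pytest"}, {"passed": True, "command": "pytest"})` returns an empty list, while swapping the two records reports both a passing before run and a failing after run.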

It does not generate tests automatically, mutate repositories, or claim that one command can evaluate every language ecosystem.

Related projects

AgentProbe — a pytest plugin for regression-testing AI agents
LiteBench — a pip-installable benchmark runner for LLMs and agents
CodeJoust — pit coding agents against the same bug and score the patches
agentcikit — CLI tools for AI-agent, MCP, and CI evidence and safety

License

MIT
