Multi-Agent Debate Arena — Let AI Models Fight It Out
Run the same task through multiple model agents, freeze a shared context bundle, generate independent plans, run an evidence-first debate, and produce a judge-backed verdict.
🌐 Language / 언어 / 语言: English · 한국어 · 中文
🏛️ Fair · 🔍 Traceable · 💰 Cost-Controlled · 📊 Evidence-First · 🔌 Extensible
Not just another chatbot UI — Colosseum is a structured debate platform designed for real workflows.
| Problem | AI Colosseum's Answer |
|---|---|
| "Which model gives a better plan?" | Run them side by side on the same frozen context |
| "How do I compare fairly?" | Independent plan generation — no agent sees another's plan first |
| "Debates go in circles forever" | Bounded rounds with novelty checks, convergence detection, and budget limits |
| "I can't trace how a decision was made" | Full artifact trail: plans, rounds, judge agendas, adopted arguments, verdicts |
| "I want control over judging" | Choose automated, AI judge, or human judge mode |
| "I need a code review, not just a debate" | Multi-phase code review with 6 configurable review phases |
| "I want multiple AI agents to QA my project in parallel" | QA ensemble mode — gladiators run in parallel on disjoint GPU slices, then a judge merges their findings into one canonical, deduplicated report |
- 🧊 Frozen context: Every agent gets the exact same input — text, files, directories, URLs, and images — frozen before planning begins.
- 🔌 Multi-provider: Claude · Codex · Gemini · Ollama · Custom CLIs. Mix and match providers in the same debate.
- 🎭 Personas: 20+ built-in personas (Karpathy, Andrew Ng, Elon Musk, and more), generate personas from surveys, or write custom ones.
- 🔍 Code review: 6 configurable review phases: project rules, implementation, architecture, security/performance, test coverage, and red team adversarial testing.
- 🤝 QA ensemble: Multiple gladiators run the target project's own `.claude/skills/qa` skill in parallel, and a judge merges their findings.
- ⚖️ Judge modes: Automated heuristic judge, AI-powered judge (any model), or human judge with pause/resume flow.
- 📊 Evidence-first: Claims must be grounded. Unsupported assertions are penalized. The judge tracks evidence quality per round.
- 📝 Reports: AI-synthesized final reports with key conclusions, verdict explanations, and debate highlights. Export to PDF or Markdown.
- 💰 Cost tracking: Real token counts from provider output with per-agent cost breakdown. Always-on cost display in CLI results.
- 🖥️ Live monitoring: tmux-based live monitor panel for watching debates and QA runs in real time. QA mode auto-spawns one watcher pane per gladiator.
- 🧙 Wizard skills: Four Claude Code wizard skills auto-installed under `~/.claude/skills/`.
```shell
colosseum debate \
  -t "Should we use microservices or monolith for a 10-person startup?" \
  -g claude:claude-sonnet-4-6 gemini:gemini-2.5-pro \
  -j claude:claude-opus-4-6
```

Both models receive the exact same frozen context and generate independent plans before seeing each other's work. The judge tracks novelty and evidence quality per round — no circular debates.
```shell
colosseum debate \
  -t "Best database for real-time analytics?" \
  -g ollama:llama3.3 ollama:qwen2.5 \
  --depth 2
```

Colosseum auto-detects your GPU, checks model fit via `llmfit`, and manages the Ollama daemon. Fully offline, fully free.

```shell
colosseum serve
```

Opens at http://127.0.0.1:8000/ — pick models, assign personas, set judge mode, and watch the debate unfold live with real-time SSE streaming.
```shell
colosseum qa \
  -t "Pre-release regression sweep" \
  --target /path/to/your/target-project \
  -g claude:claude-opus-4-6 claude:claude-sonnet-4-6 \
  -j claude:claude-opus-4-6 \
  --gpus-per-gladiator 2
```

Each gladiator runs as a real `claude --print` subprocess with its own disjoint slice of GPUs (no collisions). Non-Claude gladiators run via a mediated executor. After all finish, the judge merges their reports into one canonical, REPRODUCED-only QA report. Inside tmux, watcher panes auto-spawn — one per gladiator.
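Conceptually, `--gpus-per-gladiator` partitions the detected GPU indices into disjoint slices and hands each gladiator subprocess its own `CUDA_VISIBLE_DEVICES`. A minimal sketch of that idea (hypothetical helpers, not the shipped scheduler):

```python
import os

def gpu_slices(gpu_ids, per_gladiator):
    """Partition GPU indices into disjoint, equal-sized slices,
    one slice per gladiator. Illustrative helper only."""
    if per_gladiator <= 0:
        raise ValueError("per_gladiator must be positive")
    return [gpu_ids[i:i + per_gladiator]
            for i in range(0, len(gpu_ids) - per_gladiator + 1, per_gladiator)]

def env_for(slice_ids):
    """Environment overlay so a gladiator subprocess sees only its slice."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in slice_ids)
    return env
```

Because the slices never overlap, two gladiators can never contend for the same device, which is what makes parallel QA runs collision-free.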
| Other tools | AI Colosseum |
|---|---|
| Models see each other's output before responding | Frozen context — every agent plans independently from the same snapshot |
| Debates run until someone gives up | Bounded rounds with novelty checks, convergence detection, and budget limits |
| Verdicts are vibes-based | Evidence-first judging — unsupported assertions are penalized; adopted arguments are logged |
| No way to reproduce a result | Full artifact trail: plans, round transcripts, judge agendas, adopted arguments, verdict |
| One judge, one mode | Three judge modes: heuristic automated, any-model AI judge, or human pause/resume |
| QA tools test sequentially with one agent | QA ensemble — multiple gladiators in parallel on disjoint GPU slices, judge dedups findings into one report |
- vs ChatGPT Arena / LMSYS: Those platforms route a single prompt to two models and ask humans to vote. AI Colosseum runs a structured multi-round debate on a topic you define, with your own context, and produces a traceable verdict with evidence.
- Personas built-in: Assign Karpathy, Andrew Ng, a security researcher, or your own custom persona to each gladiator — voices that meaningfully shift argument framing.
- Code review mode: Six configurable phases (conventions → implementation → architecture → security → tests → red team) turn the debate engine into a multi-reviewer code audit.
- QA ensemble mode: Drive any project's own `.claude/skills/qa` skill from N gladiators in parallel and merge the union of findings — cooperative, not competitive. Claude gladiators dispatch their own sub-agents natively; Gemini/Codex run via the mediated executor.
- Your infra: Use cloud APIs or local Ollama models interchangeably. No data leaves your machine unless you choose a cloud provider.
If AI Colosseum has been useful to you, a ⭐ on GitHub goes a long way.
- Bug reports & feature requests → GitHub Issues
- Contributions welcome — PRs for new provider adapters, personas, judge modes, QA executors, and UI improvements are appreciated. Check `docs/architecture/overview.md` before diving in.
The README is the product-facing overview. The canonical engineering docs live in docs/.
| Document | Description |
|---|---|
| `docs/colosseum_spec.md` | Specification index and entry point |
| `docs/architecture/overview.md` | Layered architectural model |
| `docs/architecture/design-philosophy.md` | Core design principles and non-goals |
| `docs/specs/runtime-protocol.md` | Run lifecycle, streaming contract, cost tracking |
| `docs/specs/agent-governance.md` | Agent, persona, and provider boundaries |
| `docs/specs/persona-authoring.md` | Persona file formats and validation |
```shell
# Install in editable mode
python -m pip install -e .

# With dev tools
python -m pip install -e '.[dev]'
```

```shell
# Interactive setup — install & authenticate all supported CLI providers
# Also auto-installs the four bundled wizard skills under ~/.claude/skills/
colosseum setup

# Set up specific providers only
colosseum setup claude codex

# Verify installed tools
colosseum check
```

On the very first run of any colosseum command, four Claude Code wizard skills are silently installed under `~/.claude/skills/` so you can call them from anywhere:
| Skill | Trigger | Purpose |
|---|---|---|
| `/colosseum` | "colosseum debate" | Interactive debate wizard |
| `/colosseum_code_review` | "colosseum code review" | Multi-phase code review wizard |
| `/colosseum_qa` | "colosseum qa" / "QA ensemble" | QA ensemble wizard |
| `/update_docs` | "update docs" | Project documentation refresh wizard |
If you ever need to refresh or force-overwrite them:
```shell
colosseum install-skills          # install only the missing ones
colosseum install-skills --force  # overwrite even user-customized SKILL.md
```

```shell
colosseum serve
```

Open http://127.0.0.1:8000/ and you're ready to go.
```shell
# Quick mock debate (no real providers needed)
colosseum debate --topic "Should we refactor the provider layer?" --mock --depth 1

# Real multi-model debate
colosseum debate \
  --topic "Best migration strategy for a vendor-neutral provider layer" \
  -g claude:claude-sonnet-4-6 codex:o3 ollama:llama3.3

# With an AI judge and live monitoring
colosseum debate \
  --topic "Monolithic vs microservices" \
  -g claude:claude-sonnet-4-6 gemini:gemini-2.5-pro \
  -j claude:claude-opus-4-6 --monitor

# With human judge
colosseum debate \
  --topic "Database migration strategy" \
  -g claude:claude-sonnet-4-6 codex:o4-mini \
  -j human
```

```shell
# Multi-phase code review with default phases (A-E)
colosseum review \
  -t "OAuth implementation review" \
  -g claude:claude-sonnet-4-6 gemini:gemini-2.5-pro \
  --dir ./src

# Include red team phase and specific files
colosseum review \
  -t "Payment module security review" \
  -g claude:claude-sonnet-4-6 codex:o3 \
  --phases A B C D E F \
  -f src/payment.py src/auth.py
```

The target project must contain a `.claude/skills/qa/SKILL.md` describing how it wants to be QA'd. Each gladiator runs that skill in parallel on its own GPU slice.
```shell
# 2 Claude gladiators on disjoint GPU slices, judge merges the union
colosseum qa \
  -t "Pre-release regression sweep" \
  --target /path/to/your/target-project \
  -g claude:claude-opus-4-6 claude:claude-sonnet-4-6 \
  -j claude:claude-opus-4-6 \
  --gpus-per-gladiator 2

# Cross-vendor ensemble: Claude (native subagents) + Gemini/Codex (mediated)
colosseum qa \
  -t "Cross-vendor QA pass" \
  --target /path/to/your/target-project \
  -g claude:claude-opus-4-6 gemini:gemini-2.5-pro codex:gpt-5.4 \
  -j claude:claude-opus-4-6 \
  --gpus-per-gladiator 1

# Brief mode (code analysis only, no GPU execution)
colosseum qa -t "Quick smoke" --target /path/to/target -g claude:claude-haiku-4-5-20251001 --brief
```

Inside tmux, watcher panes auto-spawn — one per gladiator showing live progress. The synthesized canonical report lands at `.colosseum/qa/<run_id>/synthesized_report.md`.
```
colosseum setup [providers...]      Install & authenticate CLI providers (also installs wizard skills)
colosseum install-skills [--force]  Install bundled wizard skills under ~/.claude/skills/
colosseum serve                     Start the web UI server
colosseum debate                    Run a debate from the terminal
colosseum review                    Run a multi-phase code review
colosseum qa                        Run a QA ensemble against a target project
colosseum monitor [run_id]          Open live tmux monitor for an active debate
colosseum models                    List available models across all providers
colosseum personas                  List available personas
colosseum history                   List past battles
colosseum show <run_id>             Show a past battle result
colosseum delete <run_id|all>       Delete battle run(s)
colosseum check                     Verify CLI tool availability
colosseum local-runtime status      Inspect managed local-model runtime state
```
| Flag | Description |
|---|---|
| `-t, --topic` | Debate topic (required) |
| `-g` | Gladiators in `provider:model` format (min 2) |
| `-j, --judge` | Judge model (`provider:model` or `human`) |
| `-d, --depth` | Debate depth 1-5 (default: 3) |
| `--dir` | Project directory for context |
| `-f` | Specific files for context |
| `--mock` | Use mock providers (free, for testing) |
| `--monitor` | Launch tmux monitor panel |
| `--timeout` | Per-phase timeout in seconds |
| Flag | Description |
|---|---|
| `-t, --topic` | Review target description (required) |
| `-g` | Reviewer agents in `provider:model` format (min 2) |
| `--phases` | Review phases to run (default: A B C D E) |
| `-j, --judge` | Judge model |
| `-d, --depth` | Per-phase debate depth (default: 2) |
| `--dir` | Project directory to review |
| `-f` | Specific files to review |
| `--diff` | Include recent git diff as context |
| `--lang` | Response language (ko, en, ja, etc.) |
| `--rules` | Path to project rules file |
| `--timeout` | Per-phase timeout in seconds |
| Flag | Description |
|---|---|
| `-t, --topic` | One-line QA run description (required) |
| `--target` | Path to target project (must contain `.claude/skills/qa/SKILL.md`) (required) |
| `--qa-args` | Args forwarded to the target's `/qa` skill |
| `-g` | Gladiator specs in `provider:model` format. Claude → real `claude --print` subprocess; non-Claude → mediated executor |
| `-j, --judge` | Judge model used to merge gladiator findings |
| `--gpus` | Comma-separated GPU indices to force (default: auto-detect) |
| `--gpus-per-gladiator` | GPUs per gladiator slice (default: even split) |
| `--sequential` | Run gladiators one at a time instead of parallel disjoint slices |
| `--max-budget-usd` | Hard per-gladiator spend cap (default: $25) |
| `--max-gladiator-minutes` | Soft timeout per gladiator (default: 90) |
| `--stall-timeout-minutes` | Stall detection threshold (default: 10) |
| `--brief` | Code analysis only, no GPU execution |
| `--monitor / --no-monitor` | Auto-spawn tmux watcher panes (default: on inside tmux) |
| `--spec` | Forward `--spec NAME` to the `/qa` skill |
| `--lang` | Response language |
| `--allow-dirty-target` | Skip the dirty-worktree warning |
| `--no-stash-safety` | Skip the git stash safety net |
```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ 📋 Task     │───▶│ 🧊 Freeze   │───▶│ 📝 Plan     │───▶│ ⭐ Score    │
│    Intake   │    │    Context  │    │  Generation │    │    Plans    │
└─────────────┘    └─────────────┘    └─────────────┘    └──────┬──────┘
                                                                │
       ┌────────────────────────────────────────────────────────┘
       ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ 🎯 Judge    │───▶│ 💬 Debate   │───▶│ ⚖️ Adopt    │───▶│ 🏆 Verdict  │
│    Agenda   │    │    Round    │    │  Arguments  │    │  & Report   │
└──────┬──────┘    └─────────────┘    └─────────────┘    └─────────────┘
       │                                     │
       └──────── 🔄 Next issue ◀─────────────┘
```
The orchestrator uses bounded debate rather than open-ended chat. The judge can stop early if plans are already well separated, if novelty collapses, or if budget pressure is too high.
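The early-stop logic above can be sketched as a single predicate. The threshold names and defaults below are illustrative only — the real checks live in the judge service:

```python
def should_stop(round_num, novelty, convergence, spent_usd,
                novelty_floor=0.18, convergence_target=0.75,
                budget_usd=25.0, min_rounds=1, max_rounds=5):
    """Hypothetical bounded-debate stop check: finalize when novelty
    collapses, plans converge, budget is exhausted, or rounds run out."""
    if round_num < min_rounds:
        return False          # never stop before the minimum round count
    if round_num >= max_rounds:
        return True           # hard cap on rounds
    if novelty < novelty_floor:
        return True           # agents are repeating themselves
    if convergence >= convergence_target:
        return True           # positions are already well separated/settled
    if spent_usd >= budget_usd:
        return True           # budget pressure
    return False
```

The point of expressing it this way: every exit condition is explicit and checkable, which is what makes the debate bounded rather than open-ended chat.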
Each round is agenda-driven, not open-ended:
| Step | Description |
|---|---|
| 1 | Judge selects one concrete issue |
| 2 | Every agent answers from its own plan |
| 3 | Agents must rebut or accept specific peer arguments |
| 4 | Judge adopts the strongest evidence-backed arguments |
| 5 | Judge either advances to the next issue or finalizes |
critique → rebuttal → synthesis → final_comparison → targeted_revision
Each round records: the judge's agenda, all agent messages, adopted arguments, and what remained unresolved.
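A round artifact can be pictured as a small record like this. The field names are illustrative; the actual schema lives in `core/models.py`:

```python
from dataclasses import dataclass, field

@dataclass
class RoundRecord:
    """Illustrative shape of a per-round debate artifact."""
    agenda: str                                      # the one issue the judge selected
    messages: dict = field(default_factory=dict)     # agent name -> that agent's answer
    adopted: list = field(default_factory=list)      # evidence-backed arguments the judge adopted
    unresolved: list = field(default_factory=list)   # points carried to the next issue
```

Because every round is persisted in this form, a verdict can always be traced back to the agenda and the specific arguments that were adopted.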
| Depth | Name | Novelty Threshold | Convergence | Notes |
|---|---|---|---|---|
| 1 | Quick | 5% | 40% | Eager finalization |
| 2 | Brief | 10% | 55% | |
| 3 | Standard | 18% | 75% | Default |
| 4 | Thorough | 25% | 85% | Min 2 rounds |
| 5 | Deep Dive | 30% | 92% | Min 2 rounds, hard stop |
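The depth table translates directly into a small profile map. The sketch below mirrors the table's values; the `min_rounds` of 1 for depths 1-3 is an assumption (the table only states minimums for depths 4 and 5), and the real profiles live in `core/config.py`:

```python
# Depth profiles from the table above (min_rounds for depths 1-3 assumed).
DEPTH_PROFILES = {
    1: {"name": "Quick",     "novelty": 0.05, "convergence": 0.40, "min_rounds": 1},
    2: {"name": "Brief",     "novelty": 0.10, "convergence": 0.55, "min_rounds": 1},
    3: {"name": "Standard",  "novelty": 0.18, "convergence": 0.75, "min_rounds": 1},
    4: {"name": "Thorough",  "novelty": 0.25, "convergence": 0.85, "min_rounds": 2},
    5: {"name": "Deep Dive", "novelty": 0.30, "convergence": 0.92, "min_rounds": 2},
}
```

Higher depths demand more novelty per round and a higher convergence bar before the judge may finalize, so deeper runs naturally last longer.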
| Mode | Description |
|---|---|
| 🤖 Automated | Heuristic judge with budget, novelty, convergence, and evidence checks |
| 🧠 AI | Provider-backed judge — choose any available model as the judge |
| 👤 Human | Pause after planning or after rounds; wait for explicit human action |
The final verdict can be: one winning plan, a merged plan, or a targeted revision request.
| Phase | Name | Focus |
|---|---|---|
| A | Project Rules | Coding conventions, naming, linter/formatter rules |
| B | Implementation | Functional correctness, edge cases, error handling |
| C | Architecture | Design patterns, module separation, dependencies, extensibility |
| D | Security/Performance | Vulnerabilities, memory leaks, performance bottlenecks, concurrency |
| E | Test Coverage | Unit tests, integration tests, test structure |
| F | Red Team | Adversarial inputs, auth bypass, information leakage, privilege escalation (opt-in) |
Each phase runs a mini-debate among the reviewer agents. Results are aggregated into a comprehensive review report (Markdown export available).
| Source Kind | Description |
|---|---|
| `inline_text` | Raw text passed directly |
| `local_file` | Single file from disk |
| `local_directory` | Entire directory snapshot |
| `external_reference` | URL frozen as metadata |
| `inline_image` | Base64-encoded image data |
| `local_image` | Image file from disk |
Large text bundles are clipped to a prompt budget (28,000 chars max). Image bytes are preserved in the frozen bundle but not dumped into text prompts.
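The 28,000-character clip can be pictured as a simple truncation with an explicit marker. This is an illustrative sketch, not the actual `context_bundle.py` logic:

```python
PROMPT_CHAR_BUDGET = 28_000  # prompt budget stated above

def clip_bundle_text(text: str, budget: int = PROMPT_CHAR_BUDGET) -> str:
    """Clip a context bundle's text to the prompt budget, marking the
    truncation point so agents know material was elided. Sketch only."""
    if len(text) <= budget:
        return text
    marker = "\n…[clipped to prompt budget]…"
    return text[: budget - len(marker)] + marker
```

The marker matters: an agent that sees an explicit truncation point can hedge about missing context instead of arguing from a silently incomplete file.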
| Provider | Type | Notes |
|---|---|---|
| Claude | CLI wrapper | Requires claude CLI. Models: opus-4-6, sonnet-4-6, haiku-4-5 |
| Codex | CLI wrapper | Requires codex CLI. Models: gpt-5.4, o3, o4-mini |
| Gemini | CLI wrapper | Requires gemini CLI. Models: 2.5-pro, 3.1-pro, 3-flash |
| Ollama | Local | Requires ollama daemon. Auto-discovers installed models |
| Mock | Built-in | Deterministic outputs for tests |
| Custom | CLI command | Bring your own model/command |
Custom models can be marked as free or paid, tied into the persona flow, and participate in the same debate process as built-in agents.
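As a mental model, a custom CLI provider boils down to a subprocess call: prompt in on stdin, completion out on stdout. The sketch below is a hypothetical minimal adapter, not the real `providers/command.py`:

```python
import subprocess

def run_custom_provider(command, prompt, timeout=120):
    """Invoke a custom model CLI: write the prompt to stdin, return stdout.
    Hypothetical adapter sketch -- the shipped provider layer also handles
    streaming, cost envelopes, and retries."""
    proc = subprocess.run(command, input=prompt, capture_output=True,
                          text=True, timeout=timeout)
    if proc.returncode != 0:
        raise RuntimeError(f"provider failed: {proc.stderr.strip()}")
    return proc.stdout.strip()
```

Any command that obeys this stdin/stdout contract can, in principle, act as a gladiator alongside the built-in providers.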
Colosseum manages a local Ollama runtime with:
- GPU device detection (NVIDIA, AMD, CPU)
- Per-GPU model fit checking via `llmfit`
- Auto-start/stop daemon management
- Model download orchestration

Inspect the runtime with `colosseum local-runtime status`.

| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Health check |
| GET | `/setup/status` | Provider install/auth status |
| GET | `/models` | List available models |
| POST | `/models/refresh` | Force model re-probe |
| GET | `/cli-versions` | CLI version info |
| POST | `/setup/auth/{tool_name}` | Launch provider login |
| POST | `/setup/install/{tool_name}` | Install a provider tool |
| Method | Endpoint | Description |
|---|---|---|
| GET | `/local-runtime/status` | Ollama/llmfit status (`?ensure_ready=false`) |
| POST | `/local-runtime/config` | Update local runtime settings |
| POST | `/local-models/download` | Download a local model |
| GET | `/local-models/fit-check` | llmfit hardware fit check (`?model=...`) |
| Method | Endpoint | Description |
|---|---|---|
| POST | `/runs` | Create a run (blocking) |
| POST | `/runs/stream` | Create a run (streaming SSE) |
| GET | `/runs` | List all runs |
| GET | `/runs/{run_id}` | Get run details |
| POST | `/runs/{run_id}/skip-round` | Skip current debate round |
| POST | `/runs/{run_id}/cancel` | Cancel active debate |
| GET | `/runs/{run_id}/pdf` | Download PDF report |
| GET | `/runs/{run_id}/markdown` | Download Markdown report |
| POST | `/runs/{run_id}/judge-actions` | Submit human judge action |
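Consuming `POST /runs/stream` amounts to reading Server-Sent Events off the response. The helper below parses a single SSE event block per the SSE wire format; the JSON payload shape inside `data:` is an assumption — only the endpoint itself comes from the table above:

```python
import json

def parse_sse_event(block: str):
    """Parse one SSE event block ("event:" / "data:" lines) into
    (event_name, payload). Payload schema here is illustrative."""
    event, data_lines = None, []
    for line in block.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            # Multi-line data fields are joined per the SSE spec.
            data_lines.append(line[len("data:"):].strip())
    data = json.loads("\n".join(data_lines)) if data_lines else None
    return event, data
```

Pairing this with any streaming HTTP client (reading blocks separated by blank lines) is enough to watch a debate progress event by event.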
| Method | Endpoint | Description |
|---|---|---|
| GET | `/personas` | List all personas |
| POST | `/personas/generate` | Generate from survey |
| GET | `/personas/{id}` | Get persona details |
| POST | `/personas` | Create custom persona |
| DELETE | `/personas/{id}` | Delete a persona |
| Method | Endpoint | Description |
|---|---|---|
| GET | `/provider-quotas` | Get quota status |
| PUT | `/provider-quotas` | Update quotas |
| Route | Description |
|---|---|
| `GET /` | Arena / run setup screen |
| `GET /reports/{run_id}` | Battle report screen |
```
src/colosseum/
├── main.py                    # FastAPI app factory and server entry
├── cli.py                     # Terminal interface and live debate UX
├── monitor.py                 # tmux-based live monitoring
├── bootstrap.py               # Dependency injection and app init
│
├── api/                       # FastAPI routes
│   ├── routes.py              # Router composition
│   ├── routes_runs.py         # Run CRUD, streaming, judge actions
│   ├── routes_setup.py        # Setup, discovery, local runtime
│   ├── routes_personas.py     # Persona CRUD and generation
│   ├── routes_quotas.py       # Provider quota management
│   ├── sse.py                 # SSE payload serialization
│   ├── validation.py          # Shared request validation
│   └── signals.py             # Lifecycle signal registry
│
├── core/                      # Domain types and configuration
│   ├── models.py              # Typed runtime schemas and requests
│   └── config.py              # Enums, defaults, depth profiles, review phases
│
├── providers/                 # Provider abstraction layer
│   ├── base.py                # Abstract provider interface
│   ├── factory.py             # Provider instantiation and pricing
│   ├── command.py             # Generic CLI command provider
│   ├── cli_wrapper.py         # CLI envelope parser and adapter
│   ├── cli_adapters.py        # Claude, Codex, Gemini CLI adapters
│   ├── mock.py                # Deterministic mock provider
│   └── presets.py             # Model presets
│
├── services/                  # Core business logic
│   ├── orchestrator.py        # Run lifecycle composition
│   ├── debate.py              # Round execution and prompt assembly
│   ├── judge.py               # Plan scoring, agenda, adjudication, verdicts
│   ├── report_synthesizer.py  # Final report generation
│   ├── review_orchestrator.py # Multi-phase code review workflow
│   ├── review_prompts.py      # Review phase prompt templates
│   ├── context_bundle.py      # Frozen context construction
│   ├── context_media.py       # Image extraction and summarization
│   ├── provider_runtime.py    # Provider execution and quota
│   ├── local_runtime.py       # Managed Ollama/llmfit runtime
│   ├── repository.py          # File-backed run persistence
│   ├── budget.py              # Budget ledger tracking
│   ├── event_bus.py           # Event publishing
│   ├── normalizers.py         # Data normalization utilities
│   ├── prompt_contracts.py    # Prompt asset contracts
│   ├── pdf_report.py          # PDF export
│   └── markdown_report.py     # Markdown report export
│
├── personas/                  # Persona system
│   ├── registry.py            # Typed persona metadata and parsing
│   ├── loader.py              # Load, cache, resolve personas
│   ├── generator.py           # Generate personas from surveys
│   ├── prompting.py           # Persona prompt rendering
│   ├── builtin/               # 20 built-in personas
│   └── custom/                # User-created personas
│
└── web/                       # Static web UI assets
    ├── index.html             # Arena setup UI
    ├── report.html            # Battle report display
    ├── app.js                 # Main UI logic
    ├── report.js              # Report rendering
    └── styles.css             # Styling
```
```
docs/
├── colosseum_spec.md          # Specification index
├── architecture/
│   ├── overview.md            # Layered architectural model
│   └── design-philosophy.md   # Core design principles
└── specs/
    ├── runtime-protocol.md    # Run lifecycle and streaming contract
    ├── agent-governance.md    # Agent, persona, provider boundaries
    └── persona-authoring.md   # Persona file formats and validation

examples/
└── demo_run.json              # Mock-provider smoke test payload

tests/                         # Test suite
```
```shell
# Run the full test suite
PYTHONPATH=src PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest -q

# Quick syntax validation
python -m compileall src tests
```

- URL sources are metadata-only unless fetched upstream before run creation
- Paid quota tracking is local/manual, not provider-synchronized
- Builtin vendor CLI wrappers are thinner than full SDK integrations
- Image-aware debates are best supported through custom command providers
- Artifact persistence is file-based, not database-backed
- Token counting falls back to a `len // 4` estimate when real counts are unavailable
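That fallback is the classic characters-over-four heuristic:

```python
def estimate_tokens(text: str) -> int:
    """Fallback token estimate used when a provider reports no counts:
    roughly one token per four characters (the len // 4 heuristic above)."""
    return len(text) // 4
```

It is deliberately crude — it undercounts dense code and overcounts whitespace-heavy prose — but it keeps cost displays populated when real usage data is missing.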
⚔️ Let the models fight. Let the evidence win. ⚔️
Built for people who want structured answers, not chat noise.