Route every AI call to the cheapest model that can handle it. 45 tools · 20+ providers · Claude Code, VS Code, Cursor, Codex, and more.
Average savings: 60–80% vs running everything on Claude Opus.
```bash
# One command to start saving
uvx claude-code-llm-router install

# Or: guided 5-minute setup
uvx claude-code-llm-router quickstart
```

| Host | One-line install |
|---|---|
| Claude Code | `llm-router install` |
| VS Code | `llm-router install --host vscode` |
| Cursor | `llm-router install --host cursor` |
| Codex | `llm-router install --host codex` |
LLM Router is an MCP server and hook set that intercepts prompts and routes them to the cheapest model that can handle the task.
It is built for a common failure mode in AI coding tools: using your best model for everything. In Claude Code, that burns quota on simple explanations, file lookups, small edits, and repetitive prompts. In other MCP clients, it means paying premium-model prices for work that never needed them.
The goal is simple: keep cheap work on cheap or free models, keep hard work on Claude or other premium models, and remove the need to micromanage model selection. Works in Claude Code, Cursor, VS Code, Codex, Windsurf, Zed, claw-code, and Agno.
Most sessions contain a lot of low-value turns: quick questions, repo lookups, boilerplate edits, and small follow-ups. Those are exactly the prompts that quietly burn through premium models.
LLM Router offloads that work first, then escalates when the task actually needs more capability.
- Cheap work stays cheap.
- Hard work still gets the best model.
- Your workflow stays the same.
It does not try to replace Claude or force weak models onto hard tasks. It removes the waste around them.
```bash
pipx install claude-code-llm-router && llm-router install
```

`llm-router install` registers the MCP server and installs hooks so prompt routing starts automatically.
If you use Claude Code Pro/Max, you can start with zero API keys. Otherwise add GEMINI_API_KEY for a cheap free-tier fallback.
```bash
GEMINI_API_KEY=AIza...                  # optional free-tier fallback
LLM_ROUTER_CLAUDE_SUBSCRIPTION=true
```

The router handles every prompt in four steps:

- Intercept the prompt before your default premium model sees it.
- Classify the task and its complexity.
- Try the cheapest capable route first.
- Escalate or fall back when the task needs more capability.
Under the hood, every prompt goes through a UserPromptSubmit hook before your top-tier model sees it:
| Tier | Stage | Cost / latency | What it catches |
|---|---|---|---|
| 0 | Context inherit | instant, free | "yes/ok/go ahead" reuses the prior turn's route |
| 1 | Heuristic scoring | instant, free | high-confidence patterns route immediately |
| 2 | Ollama local LLM | free, ~1s | catches what heuristics miss |
| 3 | Cheap API | ~$0.0001 | Gemini Flash / GPT-4o-mini fallback |
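A minimal sketch of that cascade, assuming hypothetical helper names and a placeholder default — the real hook's functions, patterns, and signatures may differ:

```python
import re

# Illustrative only: a tiered classification cascade like the one above.
# Function names, patterns, and the fallthrough default are assumptions.
AFFIRMATIONS = {"yes", "ok", "go ahead", "sure", "do it"}

def heuristic_classify(prompt: str) -> str | None:
    """Tier 1: instant, free pattern matching for high-confidence cases."""
    if re.search(r"\bwhat (is|does|are)\b", prompt, re.IGNORECASE):
        return "query/simple"
    if re.search(r"\b(design|architect)\b", prompt, re.IGNORECASE):
        return "code/complex"
    return None  # low confidence — fall through to the next tier

def classify(prompt: str, prior_route: str | None = None) -> str:
    # Tier 0: short affirmations inherit the previous turn's route.
    if prior_route and prompt.strip().lower() in AFFIRMATIONS:
        return prior_route
    # Tier 1: heuristics.
    if (route := heuristic_classify(prompt)):
        return route
    # Tiers 2–3 (local Ollama model, then a ~$0.0001 cheap-API call)
    # would go here; this sketch just returns a moderate default.
    return "code/moderate"

print(classify("What does os.path.join do?"))        # → query/simple
print(classify("ok", prior_route="code/moderate"))   # → code/moderate
```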
| Prompt | Classified as | Routed to |
|---|---|---|
| "What does os.path.join do?" | query/simple | Gemini Flash ($0.000001) |
| "Fix the bug in auth.py" | code/moderate | Haiku / Sonnet |
| "Design the full auth system" | code/complex | Sonnet / Opus |
| "Research latest AI funding" | research | Perplexity Sonar Pro |
| "Generate a hero image" | image | Flux Pro via fal.ai |
Free-first chain (subscription mode): Ollama → Codex (free via OpenAI sub) → paid API
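On hosts without hooks, you can also call the router's MCP tools directly. A minimal sketch using the official `mcp` Python SDK to launch the server over stdio and invoke `llm_route` (listed in the tool tables below); the `prompt` argument name is an assumption — inspect `session.list_tools()` for the actual schema:

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch the router as a stdio MCP server, as the host configs below do.
    server = StdioServerParameters(command="llm-router", args=[])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # The argument name "prompt" is assumed; check the tool schema
            # via session.list_tools() for the real parameters.
            result = await session.call_tool(
                "llm_route", {"prompt": "What does os.path.join do?"}
            )
            print(result.content)

asyncio.run(main())
```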
45 tools across 6 categories:
| Tool | What it does |
|---|---|
| `llm_route` | Auto-classify prompt → route to best model |
| `llm_auto` | Route + server-side savings tracking — designed for hook-less hosts (Codex CLI, Claude Desktop, Copilot) |
| `llm_classify` | Classify complexity + recommend model |
| `llm_select_agent` | Pick agent CLI (claude_code / codex) + model for a session |
| `llm_stream` | Stream LLM response for long-running tasks |
| `llm_reroute` | Correct a bad routing decision in-session and train the router |
| Tool | What it does |
|---|---|
| `llm_query` | General questions — routed to cheapest capable model |
| `llm_research` | Web-grounded answers via Perplexity Sonar |
| `llm_generate` | Creative writing, summaries, brainstorming |
| `llm_analyze` | Deep reasoning — analysis, debugging, design review |
| `llm_code` | Code generation, refactoring, algorithms |
| `llm_edit` | Route edit reasoning to cheap model → returns `{file, old, new}` patch pairs |
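The `{file, old, new}` patch-pair shape makes edits cheap to apply host-side. A sketch of one way to apply them, assuming exact-match string replacement — the router's actual application logic may differ:

```python
from pathlib import Path

def apply_patches(patches: list[dict[str, str]]) -> None:
    """Apply {file, old, new} patch pairs by exact string replacement."""
    for patch in patches:
        path = Path(patch["file"])
        text = path.read_text()
        if patch["old"] not in text:
            raise ValueError(f"stale patch: {patch['old']!r} not in {path}")
        # Replace only the first occurrence to keep the edit targeted.
        path.write_text(text.replace(patch["old"], patch["new"], 1))

# Hypothetical example of the shape returned by llm_edit / llm_fs_edit_many:
apply_patches([
    {"file": "auth.py", "old": "verify(token)", "new": "verify(token, aud)"}
])
```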
| Tool | What it does |
|---|---|
| `llm_fs_find` | Describe files to find → cheap model returns glob/grep commands |
| `llm_fs_rename` | Describe a rename → returns mv/git mv commands (dry_run by default) |
| `llm_fs_edit_many` | Bulk edits across files → returns all patch pairs |
| `llm_fs_analyze_context` | Summarise workspace context for smarter routing |
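Because `llm_fs_rename` returns commands rather than executing them, and defaults to dry-run, a host can always preview before anything touches disk. A sketch using the same `mcp` SDK pattern as above; the argument names (`description`, `dry_run`) are assumptions, not the documented schema:

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def preview_rename() -> None:
    server = StdioServerParameters(command="llm-router", args=[])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Argument names are assumptions — check session.list_tools().
            result = await session.call_tool(
                "llm_fs_rename",
                {"description": "rename utils.py to helpers.py", "dry_run": True},
            )
            # Returns proposed mv / git mv commands to review; nothing runs.
            print(result.content)

asyncio.run(preview_rename())
```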
| Tool | What it does |
|---|---|
| `llm_image` | Image generation — Flux, DALL-E, Gemini Imagen |
| `llm_video` | Video generation — Runway, Kling, Veo 2 |
| `llm_audio` | TTS/voice — ElevenLabs, OpenAI |
| Tool | What it does |
|---|---|
| `llm_orchestrate` | Multi-step pipeline across multiple models |
| `llm_pipeline_templates` | List available pipeline templates |
| Tool | What it does |
|---|---|
| `llm_usage` | Unified dashboard — Claude sub, Codex, APIs, savings |
| `llm_savings` | Cross-session savings breakdown by period, host, and task type |
| `llm_check_usage` | Live Claude subscription usage (session %, weekly %) |
| `llm_health` | Provider availability + circuit breaker status |
| `llm_providers` | List all configured providers and models |
| `llm_set_profile` | Switch profile: budget / balanced / premium |
| `llm_setup` | Interactive provider wizard — add keys, validate, install hooks |
| `llm_quality_report` | Routing accuracy, savings metrics, classifier stats |
| `llm_rate` | Rate last response 👍/👎 — logged for quality tracking |
| `llm_codex` | Route task to local Codex desktop agent (free) |
| `llm_save_session` | Persist session summary for cross-session context |
| `llm_cache_stats` | Cache hit rate, entries, evictions |
| `llm_cache_clear` | Clear classification cache |
| `llm_refresh_claude_usage` | Force-refresh subscription data via OAuth |
| `llm_update_usage` | Feed usage data from claude.ai into the router |
| `llm_track_usage` | Report Claude Code token usage for budget tracking |
| `llm_dashboard` | Open web dashboard at localhost:7337 |
| `llm_team_report` | Team-wide routing savings report |
| `llm_team_push` | Push local savings data to shared team store |
| `llm_policy` | Show active org/repo routing policy + last 10 policy decisions |
| `llm_digest` | Savings digest with spend-spike detection; push to Slack/Discord webhook |
| `llm_benchmark` | Per-task-type routing accuracy from llm_rate feedback |
| `llm_session_spend` | Real-time API spend breakdown for the current session |
| `llm_approve_route` | Approve or reject a pending high-cost routing call |
Three profiles — switch anytime with `llm_set_profile`:

| Profile | Use case | Chain |
|---|---|---|
| `budget` | Dev, drafts, exploration | Ollama → Haiku → Gemini Flash |
| `balanced` | Production work (default) | Codex → Sonnet → GPT-4o |
| `premium` | Critical tasks, max quality | Codex → Opus → o3 |
Profile is overridden by complexity: simple prompts always use the budget chain, complex ones escalate to premium, regardless of the active profile setting.
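A sketch of that precedence rule — the chains come from the table above, while the resolution function itself is an illustration, not the router's actual code:

```python
# Chains as listed in the profiles table; resolution logic is illustrative.
CHAINS = {
    "budget":   ["ollama", "haiku", "gemini-flash"],
    "balanced": ["codex", "sonnet", "gpt-4o"],
    "premium":  ["codex", "opus", "o3"],
}

def effective_chain(profile: str, complexity: str) -> list[str]:
    # Complexity wins over the active profile:
    if complexity == "simple":
        return CHAINS["budget"]      # simple prompts always go cheap
    if complexity == "complex":
        return CHAINS["premium"]     # complex prompts always escalate
    return CHAINS[profile]           # moderate work follows the profile

assert effective_chain("premium", "simple") == CHAINS["budget"]
assert effective_chain("budget", "complex") == CHAINS["premium"]
```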
| Provider | Models | Free tier | Best for |
|---|---|---|---|
| Ollama | Any local model | Yes (forever) | Privacy, zero cost, offline |
| Google Gemini | 2.5 Flash, 2.5 Pro | Yes (1M tokens/day) | Generation, long context |
| Groq | Llama 3.3, Mixtral | Yes | Ultra-fast inference |
| OpenAI | GPT-4o, o3, DALL-E | No | Code, reasoning, images |
| Perplexity | Sonar, Sonar Pro | No | Research, current events |
| Anthropic | Haiku, Sonnet, Opus | No | Writing, analysis, safety |
| DeepSeek | V3, Reasoner | Limited | Cost-effective reasoning |
| Mistral | Large, Small | Limited | Multilingual |
| fal.ai | Flux, Kling, Veo | No | Images, video, audio |
| ElevenLabs | Voice models | Limited | High-quality TTS |
| Runway | Gen-3 | No | Professional video |
Full setup guides: docs/PROVIDERS.md
Auto-installed by `llm-router install`. Hooks intercept every prompt — you never need to call tools manually unless you want explicit control.
```bash
pipx install claude-code-llm-router && llm-router install
```

Live status bar shows routing stats before every prompt and in the persistent bottom statusline:

```
📊 CC 13%s · 24%w │ sub:0 · free:305 · paid:27 │ $1.59 saved (35%)
```
Add to `~/.claw-code/mcp.json`:

```json
{
  "mcpServers": {
    "llm-router": { "command": "llm-router", "args": [] }
  }
}
```

Every API call in claw-code is paid — the free-first chain (Ollama → Codex → Gemini Flash) saves more here than in Claude Code.
Add to your IDE's MCP config:

```json
{
  "mcpServers": {
    "llm-router": { "command": "llm-router", "args": [] }
  }
}
```

Two integration modes:
Option 1 — RouteredModel (v2.0+): use llm-router as a first-class Agno model. Every agent call is automatically routed to the cheapest capable provider.

```bash
pip install "claude-code-llm-router[agno]"
```

```python
from agno.agent import Agent
from llm_router.integrations.agno import RouteredModel, RouteredTeam

# Single agent — routes each call intelligently
coder = Agent(
    model=RouteredModel(task_type="code", profile="balanced"),
    instructions="You are a coding assistant.",
)
coder.print_response("Write a Python quicksort.")

# Second team member, defined the same way (task_type="research" is an
# assumption; the original snippet used `researcher` without defining it)
researcher = Agent(
    model=RouteredModel(task_type="research", profile="balanced"),
    instructions="You are a research assistant.",
)

# Multi-agent team with shared $20/month budget cap
# Automatically downshifts to 'budget' profile at 80% spend
team = RouteredTeam(
    members=[coder, researcher],
    monthly_budget_usd=20.0,
    downshift_at=0.80,
)
```

Option 2 — MCP tools: use llm-router's 45 tools in any Agno agent:
```python
from agno.agent import Agent
from agno.models.anthropic import Claude
from agno.tools.mcp import MCPTools

agent = Agent(
    model=Claude(id="claude-sonnet-4-6"),
    tools=[MCPTools(command="llm-router")],
    instructions="Use llm_research for web searches, llm_code for coding tasks.",
)
```

All installs are idempotent — run any command twice safely.
```bash
llm-router install --host codex
```

Writes `~/.codex/config.yaml`, `~/.codex/hooks.json` (PostToolUse), and `~/.codex/instructions.md`.
```bash
codex plugin install llm-router   # or via Codex marketplace
```

```bash
llm-router install --host opencode
```

Writes `~/.config/opencode/config.json` (MCP block), PostToolUse hook, and routing rules.
```bash
llm-router install --host gemini-cli
```

Writes `~/.gemini/settings.json`, creates the llm-router extension with `gemini-extension.json` + `hooks.json`, and appends routing rules.
```bash
llm-router install --host copilot-cli
```

Writes `~/.config/gh/copilot/mcp.json` and routing rules.
```bash
llm-router install --host openclaw
```

Writes `~/.openclaw/mcp.json` and routing rules.
```bash
llm-router install --host trae
```

Writes the platform-appropriate Trae config (`~/Library/Application Support/Trae/mcp.json` on macOS) and a `.rules` file in the current directory.
Factory Droid natively supports the Claude Code plugin format (`.claude-plugin/`) — no extra setup needed:

```bash
factory plugin install ypollak2/llm-router
# or via Factory marketplace search: llm-router
```

The dedicated `.factory-plugin/` manifest is included for Factory marketplace discovery.
```bash
llm-router install --host desktop
```

Prints the snippet for `claude_desktop_config.json`. No hooks in Desktop — use `llm_auto` for savings tracking.
```bash
llm-router install --host copilot
```

Prints the snippet for `.vscode/mcp.json` and a `copilot-instructions.md` template.
```bash
llm-router install --host all   # installs/prints all hosts
```

For Docker or headless environments:

```dockerfile
RUN pip install claude-code-llm-router && llm-router install --headless
# Pass keys at runtime: docker run -e GEMINI_API_KEY=... your-image
```

```bash
# API keys — at least one required
GEMINI_API_KEY=AIza... # free tier at aistudio.google.com
OPENAI_API_KEY=sk-proj-...
PERPLEXITY_API_KEY=pplx-...
ANTHROPIC_API_KEY=sk-ant-... # skip if using Claude Code subscription
DEEPSEEK_API_KEY=...
GROQ_API_KEY=gsk_...
FAL_KEY=... # images, video, audio via fal.ai
ELEVENLABS_API_KEY=...
# Router
LLM_ROUTER_PROFILE=balanced # budget | balanced | premium
LLM_ROUTER_MONTHLY_BUDGET=0 # USD, 0 = unlimited
LLM_ROUTER_CLAUDE_SUBSCRIPTION=false # true = Claude Code Pro/Max
LLM_ROUTER_ENFORCE=enforce # shadow | suggest | enforce (default: enforce)
# Ollama (local models)
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_BUDGET_MODELS=gemma4:latest,qwen3.5:latest
# Spend limits
LLM_ROUTER_DAILY_SPEND_LIMIT=5.00 # USD, 0 = disabled
```

Choose how strict routing should be. The easiest way is `llm-router onboard`, which lets you pick a mode interactively.
| Mode | Behaviour | Best for |
|---|---|---|
| `shadow` | Observes routing decisions, never blocks | Safest first install |
| `suggest` | Shows route hints, allows direct answers | Low-friction adoption |
| `enforce` | Blocks routing violations until the route is followed | Maximum savings |
`LLM_ROUTER_ENFORCE=hard` is the strict compatibility alias for `enforce`. Legacy `soft` and `off` values are still supported for direct CLI or env-based control.
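A sketch of how a prompt hook might act on each mode — illustrative only; the router's real hook protocol, function names, and return shapes are assumptions here:

```python
# Illustrative decision logic for shadow / suggest / enforce.
# The hook I/O shape is an assumption, not the router's actual protocol.
def log(msg: str) -> None:
    print(f"[llm-router] {msg}")

def on_prompt(mode: str, route: str) -> dict:
    if mode == "shadow":
        # Observe and log the decision; never interfere with the turn.
        log(f"would route to {route}")
        return {"allow": True}
    if mode == "suggest":
        # Surface a hint but let the turn proceed unchanged.
        return {"allow": True, "hint": f"cheaper route available: {route}"}
    # enforce: block the premium call until the suggested route is followed.
    return {"allow": False, "reason": f"blocked — use {route} instead"}

print(on_prompt("shadow", "gemini-flash"))
print(on_prompt("enforce", "ollama/qwen3.5:latest"))
```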
Commit a routing policy alongside your code — no env vars required:
```yaml
profile: balanced
enforce: suggest          # shadow | suggest | enforce

block_providers:
  - openai                # never use OpenAI in this repo

routing:
  code:
    model: ollama/qwen3.5:latest   # always use local model for code tasks
  research:
    provider: perplexity           # always use Perplexity for research

daily_caps:
  _total: 2.00            # global $2/day cap
  code: 0.50              # code tasks capped at $0.50/day
```

User-level overrides live in `~/.llm-router/routing.yaml` (same schema). Repo config wins.
Full reference: `.env.example`

```bash
LLM_ROUTER_MONTHLY_BUDGET=50   # raises BudgetExceededError when exceeded
```

```
llm_usage("month")
→ Calls: 142 | Tokens: 320k | Cost: $3.42 | Budget: 6.8% of $50
```
The router tracks spend in SQLite across all providers and blocks calls when the monthly cap is reached.
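A minimal sketch of SQLite-backed spend tracking with a monthly cap, in the spirit of the description above — the table name, schema, and `BudgetExceededError` wiring are assumptions:

```python
import datetime
import sqlite3

class BudgetExceededError(RuntimeError):
    """Raised when the monthly cap would be exceeded (name from the README)."""

# Table name and schema are assumptions for this sketch.
db = sqlite3.connect("usage.db")
db.execute("CREATE TABLE IF NOT EXISTS spend (ts TEXT, provider TEXT, usd REAL)")

def record_call(provider: str, usd: float, monthly_cap: float) -> None:
    month = datetime.date.today().strftime("%Y-%m")
    spent = db.execute(
        "SELECT COALESCE(SUM(usd), 0) FROM spend WHERE ts LIKE ?", (month + "%",)
    ).fetchone()[0]
    if monthly_cap and spent + usd > monthly_cap:
        raise BudgetExceededError(
            f"${spent + usd:.2f} would exceed the ${monthly_cap:.2f} cap"
        )
    db.execute("INSERT INTO spend VALUES (?, ?, ?)",
               (datetime.date.today().isoformat(), provider, usd))
    db.commit()

record_call("gemini-flash", 0.0001, monthly_cap=50.0)
```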
```bash
llm-router dashboard   # opens localhost:7337
```

Live view of routing decisions, cost trends, model distribution, and subscription pressure. Auto-refreshes every 30s.
At session end the router prints a breakdown:

```
Free models   305 calls · $0.52 saved   (Ollama / Codex)
External       27 calls · $0.006        (Gemini Flash, GPT-4o)
💡 Saved ~$0.53 this session
```
Share your savings:

```bash
llm-router share   # copies savings card to clipboard + opens tweet
```

Positioning: Route every AI coding task to the cheapest capable model. Works across Claude Code, Cursor, VS Code, Codex, Gemini CLI, and more.
| Version | Headline |
|---|---|
| v1.3–v2.0 | Foundation, dashboard, enforcement, Agno adapter |
| v2.1 | Route Simulator — llm-router test "<prompt>" dry-run + llm_savings dashboard |
| v2.2 | Explainable Routing — LLM_ROUTER_EXPLAIN=1, "why not Opus?", per-decision reasoning |
| v2.3 | Zero-Friction Activation — onboarding wizard, shadow/suggest/enforce modes, yearly savings projection |
| Version | Headline |
|---|---|
| v2.4 | Repo-Aware YAML Config — .llm-router.yml committed with the codebase, block_providers, model pins |
| v2.5 | Context-Aware Routing — "yes/ok/go ahead" inherits prior turn's route, zero classifier latency |
| v2.6 | Latency + Personalized Routing — p95 latency scoring, per-user acceptance signals |
| Version | Headline |
|---|---|
| v3.0 | Team Dashboard — shared savings across the whole team |
| v3.1 | Multi-Host + Cross-Session Savings — llm_auto, Codex/Desktop/Copilot adapters, persistent savings across sessions |
| v3.2 | Policy Engine — org/project/user routing policy, spend caps, audit log |
| v3.3 | Slack Digests + Codex Plugin — weekly savings digest, spend-spike alerts, Codex marketplace plugin |
| Version | Headline |
|---|---|
| v3.4 | Agent-Context Routing — subscription-first chain reordering when Codex or Claude Code is active |
| v3.5 | Multi-Agent CLI Compatibility — OpenCode, Gemini CLI, Copilot CLI, OpenClaw, Factory Droid, Trae |
| v3.6 | VS Code + Cursor IDE Support — native MCP config, routing rules, idempotent install |
| v4.0 | Token Efficiency + Real-Time Spend — tool slim mode, session spend meter, reroute learning, quickstart wizard |
| v4.1 | Playwright DOM Compression + Enforcement Fixes — DOM compression hook, PostToolUse MCP matcher fix, smart enforcement default |
| v4.2 | Quota-Aware Routing + Context-Aware Classification — Ollama-first CC-mode for simple tasks, qwen3.5:32b in BALANCED chains, short code follow-up context inheritance |
| Version | Headline | Status |
|---|---|---|
| v4.3 | OTEL / Prometheus Export — metrics endpoint for routing decisions, cost, and fallback rates | 📅 Planned |
| v5.0 | Learned Routing — self-training classifier from llm_rate feedback; personal routing patterns | 📅 Planned |
```bash
uv sync --extra dev
uv run pytest tests/ -q --ignore=tests/test_integration.py
uv run ruff check src/ tests/
```

See CLAUDE.md for architecture and module layout.
See CONTRIBUTING.md. Key areas: new provider integrations, routing intelligence, MCP client testing.