Semantic diff for LLM prompts — compare prompt versions like git diff, but for behavior.
You changed your system prompt. Did it make things better or worse? PromptDiff runs both versions against your test cases, compares the outputs semantically, and tells you exactly what changed.
Prompt engineering is iterative. You tweak a word, add an instruction, restructure the format — but how do you know if it actually helped? Manual A/B testing is slow and error-prone. PromptDiff automates the comparison:
- Run both prompt versions against the same test inputs through any OpenAI-compatible API
- Semantic comparison using sentence embeddings (or lexical fallback) to detect behavioral changes
- LLM-as-judge (optional) to classify changes as improvements or regressions
- CI-friendly — exit code 1 on regressions, JSON output for automation
- Error-aware gating - fail CI when either prompt version errors before trusting the diff
- Rich terminal reports with color-coded diffs, similarity scores, latency/token deltas
pip install promptdiff
# with semantic similarity (recommended)
pip install "promptdiff[semantic]"

Create two prompt files and a test cases file:
# prompt_v1.txt
You are a helpful coding assistant. Answer clearly and concisely.
# prompt_v2.txt
You are a senior engineer. Answer step by step. Always include code examples.
# test_cases.jsonl
{"input": "How do I reverse a string in Python?"}
{"input": "What's the difference between a list and a tuple?"}
{"input": "Explain closures."}

Run the comparison:

promptdiff compare prompt_v1.txt prompt_v2.txt test_cases.jsonl

Output:
┌─────────────────── PromptDiff Summary ───────────────────┐
│ 3 cases: 1 unchanged, 2 regressed │
│ avg similarity: 72.31% | avg latency delta: +45ms | │
│ avg token delta: +38 │
└──────────────────────────────────────────────────────────┘
# │ │ Input │ Similarity │ Latency │ Tokens
2 │ - │ What's the difference between a li... │ 65.2% │ +120ms │ +52
3 │ - │ Explain closures. │ 71.8% │ +30ms │ +41
promptdiff compare prompt_a.txt prompt_b.txt tests.jsonl

promptdiff validate prompt_a.txt tests.jsonl --min-cases 5

This checks that the prompt is non-empty and that JSON/JSONL/YAML test cases have valid input fields before a CI job spends money on model calls.
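As a rough illustration of the kind of pre-flight check described above, the sketch below validates a prompt string and JSONL test cases without making any model calls. It is not PromptDiff's implementation: `validate_inputs` and its messages are hypothetical names chosen for this example, and only the JSONL format is covered.

```python
import json


def validate_inputs(prompt_text, jsonl_text, min_cases=1):
    """Return a list of problems; an empty list means the inputs look usable.

    Hypothetical helper for illustration, not part of the promptdiff package.
    """
    problems = []
    if not prompt_text.strip():
        problems.append("prompt is empty")

    valid_cases = 0
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines between cases
        try:
            case = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {line_no}: not valid JSON")
            continue
        value = case.get("input") if isinstance(case, dict) else None
        if not isinstance(value, str) or not value.strip():
            problems.append(f"line {line_no}: missing or empty 'input' field")
            continue
        valid_cases += 1

    if valid_cases < min_cases:
        problems.append(
            f"only {valid_cases} valid case(s), need at least {min_cases}"
        )
    return problems
```

A CI step built on a check like this would stop as soon as the returned list is non-empty, before any paid API request is sent.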
When outputs differ, use an LLM judge to decide if the change is an improvement or regression:
promptdiff compare prompt_a.txt prompt_b.txt tests.jsonl --judge

Works with any OpenAI-compatible API (Ollama, vLLM, LiteLLM, Together, etc.):
promptdiff compare prompt_a.txt prompt_b.txt tests.jsonl \
--model llama-3.1-8b \
--base-url http://localhost:11434/v1

Fail the build if any regressions are detected:
promptdiff compare prompt_a.txt prompt_b.txt tests.jsonl \
--fail-on-regression --fail-on-error --json-output results.json

Set practical budgets when a small number of changes is acceptable but cost or latency drift is not:
promptdiff compare prompt_a.txt prompt_b.txt tests.jsonl \
--max-regression-rate 0.05 \
--min-avg-similarity 0.90 \
--max-error-rate 0.01 \
--max-avg-latency-increase 150 \
--max-avg-token-increase 20 \
--json-output results.json

The command exits with code 1 when any configured budget is exceeded, and writes the gate result into JSON output.
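The budget gate can be pictured as a handful of aggregate checks over the per-case results. The following is a minimal sketch under stated assumptions, not PromptDiff's actual code: the function name `check_budgets` and the per-case keys (`status`, `similarity`, `latency_delta_ms`, `token_delta`) are hypothetical, and the error-rate budget is left out for brevity.

```python
from statistics import mean


def check_budgets(cases, *, max_regression_rate=None, min_avg_similarity=None,
                  max_avg_latency_increase=None, max_avg_token_increase=None):
    """Return the names of exceeded budgets (an empty list means the gate passes).

    `cases` is a list of dicts with hypothetical keys: "status", "similarity",
    "latency_delta_ms" and "token_delta". Budgets left as None are not checked.
    """
    if not cases:
        raise ValueError("no cases to evaluate")

    # Aggregate the per-case numbers once, then compare each against its budget.
    regression_rate = sum(c["status"] == "regressed" for c in cases) / len(cases)
    avg_similarity = mean(c["similarity"] for c in cases)
    avg_latency = mean(c["latency_delta_ms"] for c in cases)
    avg_tokens = mean(c["token_delta"] for c in cases)

    failures = []
    if max_regression_rate is not None and regression_rate > max_regression_rate:
        failures.append("max_regression_rate")
    if min_avg_similarity is not None and avg_similarity < min_avg_similarity:
        failures.append("min_avg_similarity")
    if max_avg_latency_increase is not None and avg_latency > max_avg_latency_increase:
        failures.append("max_avg_latency_increase")
    if max_avg_token_increase is not None and avg_tokens > max_avg_token_increase:
        failures.append("max_avg_token_increase")
    return failures
```

A wrapper script could then exit with status 1 whenever the returned list is non-empty, which mirrors the exit-code behavior described above.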
Turn a saved results file into a Markdown summary you can paste into a PR comment or attach as a CI artifact. It is offline and never calls the model:
promptdiff compare prompt_a.txt prompt_b.txt tests.jsonl -o results.json
promptdiff report results.json -o report.md

Without --output the report goes to stdout, which is convenient for piping into a gh pr comment step. The report has a summary table, the regression-budget verdict, and the worst cases ordered by severity. Use --top to cap how many cases are listed.
Need JUnit XML for a test-report dashboard instead? Pass --format junit to regenerate it from the same saved results — no model calls, so you don't pay to compare twice:
promptdiff report results.json --format junit -o junit.xml

The --check flag on report re-applies the regression budgets recorded at compare time and exits non-zero if they failed. This lets one CI job run the expensive compare and upload results.json, while a later cheap job posts the comment and gates the build offline:

promptdiff report results.json -o report.md --check # exits 1 if a budget failed

Each regression is also graded by severity so you can tell a near-miss from a rewrite at a glance. The grade is based on how far the output similarity fell below the threshold the run used: minor (just under), moderate, or major; errored cases are always major. The report shows the per-case grade plus a one-line breakdown like Severity: 1 major, 2 moderate. Because the threshold is recorded in the results JSON, report reproduces the same grades offline.
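To make the grading idea concrete, here is a small sketch that grades a case by the gap between the threshold and the observed similarity. The function name `grade_severity` and the cutoffs `minor_gap=0.05` and `moderate_gap=0.15` are assumptions picked for illustration; the cutoffs PromptDiff actually uses are not stated here.

```python
def grade_severity(similarity, threshold, errored=False,
                   minor_gap=0.05, moderate_gap=0.15):
    """Grade a regression by how far similarity fell below the threshold.

    minor_gap and moderate_gap are illustrative cutoffs, not PromptDiff's values.
    Returns None when the similarity is not below the threshold.
    """
    if errored:
        return "major"  # errored cases are always graded major
    gap = threshold - similarity
    if gap <= 0:
        return None  # at or above the threshold: nothing to grade
    if gap <= minor_gap:
        return "minor"
    if gap <= moderate_gap:
        return "moderate"
    return "major"
```

Because the function depends only on the similarity, the threshold, and the error flag, any tool that has those three values saved can recompute the same grade later without calling a model.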
A lower threshold is more permissive, so fewer cases are flagged as regressions:

promptdiff compare prompt_a.txt prompt_b.txt tests.jsonl --threshold 0.7

Terminal reports sort by severity by default: prompt run errors first, then the lowest-similarity regressions, then improvements and unchanged cases. To preserve the original test-case order instead:

promptdiff compare prompt_a.txt prompt_b.txt tests.jsonl --sort input

Options:
-m, --model TEXT Model for running prompts (default: gpt-4o-mini)
--base-url TEXT Custom API base URL
--api-key TEXT API key (default: OPENAI_API_KEY env)
-t, --threshold FLOAT Similarity threshold for 'unchanged' (default: 0.85)
--judge / --no-judge Use LLM-as-judge for changed cases
--judge-model TEXT Judge model (default: gpt-4o-mini)
-v, --verbose Show detailed output for changed cases
--show-unchanged Include unchanged cases in report
-o, --json-output PATH Write JSON results to file
-c, --concurrency INT Max concurrent API calls (default: 5)
--no-semantic Use lexical similarity instead of embeddings
--fail-on-regression Exit code 1 if regressions found
--fail-on-error Exit code 1 if any prompt run errors
--max-regression-rate FLOAT
--min-avg-similarity FLOAT
--max-error-rate FLOAT
--max-avg-latency-increase FLOAT
--max-avg-token-increase FLOAT
PromptDiff supports multiple formats for test inputs:
| Format | Example |
|---|---|
| .jsonl | {"input": "your question"} per line |
| .json | ["q1", "q2"] or [{"input": "q1"}] |
| .yaml | List of strings or objects with input key |
| .txt | One test case per line |
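The formats in the table could be normalized into a plain list of input strings roughly as follows. This is a sketch, not PromptDiff's loader: `load_test_inputs` is a hypothetical name, and YAML is omitted because parsing it needs a third-party library.

```python
import json
from pathlib import Path


def _as_input(item):
    """Accept either a bare string or an object with an "input" key."""
    if isinstance(item, str):
        return item
    if isinstance(item, dict) and isinstance(item.get("input"), str):
        return item["input"]
    raise ValueError(f"unsupported test case: {item!r}")


def load_test_inputs(path):
    """Load test inputs from a .jsonl, .json, or .txt file (hypothetical helper)."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    suffix = path.suffix.lower()
    if suffix == ".jsonl":
        # One JSON value per non-blank line.
        return [_as_input(json.loads(line))
                for line in text.splitlines() if line.strip()]
    if suffix == ".json":
        # A single JSON array of strings or objects.
        return [_as_input(item) for item in json.loads(text)]
    if suffix == ".txt":
        # One test case per non-blank line.
        return [line.strip() for line in text.splitlines() if line.strip()]
    raise ValueError(f"unsupported format: {suffix}")
```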
import asyncio

from promptdiff import PromptRunner, PromptDiff, DiffReport
from promptdiff.runner import RunConfig

config = RunConfig(model="gpt-4o-mini")
runner = PromptRunner(config)

prompt_a = "You are helpful."
prompt_b = "You are a senior engineer. Be detailed."
inputs = ["How do I sort a list in Python?", "What is a mutex?"]

# Run both prompt versions against the same inputs
results_a = asyncio.run(runner.run_batch(prompt_a, inputs))
results_b = asyncio.run(runner.run_batch(prompt_b, inputs))

# Compare the two sets of outputs using the similarity threshold
differ = PromptDiff(threshold=0.85)
diffs, summary = differ.compare_batch(results_a, results_b)

# Print the terminal report
report = DiffReport()
report.print_full(diffs, summary, verbose=True)

- Run: Both prompts are sent to the LLM with each test input (concurrently, with rate limiting)
- Compare: Outputs are compared using semantic similarity (sentence-transformers) or lexical similarity (Jaccard)
- Classify: Cases below the similarity threshold are marked as "changed". Optionally, an LLM judge decides if the change is an improvement or regression
- Report: Results are displayed with color-coded terminal output and optional JSON export
JSON output includes judge verdicts and per-side run errors, so CI jobs can fail loudly instead of hiding API failures behind an empty diff.
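For the lexical fallback mentioned in the Compare step, Jaccard similarity is the size of the intersection of two token sets divided by the size of their union. The sketch below shows the idea together with the threshold check from the Classify step. The tokenization (lowercased word characters) and the function names are assumptions for this example; PromptDiff's own tokenizer may differ.

```python
import re


def jaccard_similarity(text_a, text_b):
    """Jaccard similarity over lowercase word tokens: |A & B| / |A | B|."""
    tokens_a = set(re.findall(r"\w+", text_a.lower()))
    tokens_b = set(re.findall(r"\w+", text_b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0  # two empty outputs are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def classify(similarity, threshold=0.85):
    """Mark a case "changed" when similarity falls below the threshold."""
    return "unchanged" if similarity >= threshold else "changed"
```

Jaccard only measures word overlap, so two answers that say the same thing in different words can score low. That is the main reason the embedding-based comparison is the recommended mode.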
git clone https://github.com/he-yufeng/PromptDiff.git
cd PromptDiff
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,semantic]"
pytest

License: MIT