Motivation.
Executive Summary
Proposes regression testing for speculative decoding in vLLM, starting with Eagle3 and expanding to all variants.
Objectives:
- Prevent regressions in acceptance rates and speedup
- Detect performance degradation in CI
- Validate correctness across variants
- Manage CI costs
Impact:
- Detect regressions in hours vs. days
- Provide performance baselines
Table of Contents
- Background and Motivation
- Current State Analysis
- Proposal Overview
- Design Details
- Phased Implementation
- CI Resource Management
- Benchmarking Strategy
- Future Expansion
- Open Questions
- References
- Call to Action
Background and Motivation
Speculative decoding can reduce LLM decoding latency by 2-3x: a draft model proposes tokens ahead, and the target model verifies them in parallel. vLLM supports n-gram, EAGLE/Eagle3, Medusa, and MTP variants.
Problem: Small code changes can break acceptance rates and speedup without failing correctness tests.
Current Gaps:
- No acceptance rate tracking in tests/v1/e2e/test_spec_decode.py (only 66-80% match thresholds)
- Nightly benchmarks lack baselines and alerts
- Incomplete coverage for MTP variants
- No documented performance expectations per variant
Current State Analysis
Existing Test Infrastructure
1. Correctness Tests
Location: tests/v1/e2e/test_spec_decode.py
Current Coverage:
- N-gram: Basic correctness test with 100 mixed prompts (66% match threshold)
- EAGLE: Multiple model variants tested (Llama-3.1-8B, Llama-4-Scout-17B-16E, DeepSeek)
- Eagle3: Qwen3-8B, Llama-3.1-8B variants
- MTP: MiMo-7B, DeepSeek-V3 (80% match threshold)
Test Pattern:
# Reference implementation
ref_llm = LLM(model=model_name, max_model_len=2048)
ref_outputs = ref_llm.chat(test_prompts, sampling_config)
# Speculative implementation
spec_llm = LLM(
    model=model_name,
    speculative_config={
        "method": method,
        "model": spec_model_name,
        "num_speculative_tokens": 3,
    },
    max_model_len=2048,
)
spec_outputs = spec_llm.chat(test_prompts, sampling_config)
# Heuristic validation
matches = sum(ref.outputs[0].text == spec.outputs[0].text
              for ref, spec in zip(ref_outputs, spec_outputs))
assert matches >= int(0.66 * len(ref_outputs))  # 66% threshold
Limitations:
- Match rate threshold is arbitrary and doesn't track temporal trends
- No measurement of acceptance rate or actual speedup
- Tests run with limited prompt diversity
- No multimodal coverage for most variants
- Thresholds not scientifically derived
2. Buildkite CI Integration
Location: .buildkite/test-pipeline.yaml:307
Current Configuration:
- label: V1 Speculative Decoding Test (Integration)
  commands:
    - pytest -v -s v1/spec_decode
Example Offline Tests (lines 362-364):
python3 offline_inference/spec_decode.py --test \
    --method eagle --num_spec_tokens 3 \
    --dataset-name hf --dataset-path philschmid/mt-bench \
    --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 \
    --tp 1 --enable-chunked-prefill --max-model-len 2048
Limitations:
- Not part of fast_check CI (only runs on full CI)
- No performance metrics collected
- No trend analysis or regression detection
- Tests don't fail on performance degradation
3. Benchmarking Infrastructure
Location: benchmarks/benchmark_ngram_proposer.py
Capabilities:
- Microbenchmark for n-gram proposal latency
- Parameterized testing across different n-gram sizes
- Timing collection and statistical analysis
Nightly Benchmarks:
Location: .buildkite/nightly-benchmarks/tests/serving-tests.json:57
{
  "test_name": "serving_llama70B_tp4_sharegpt_specdecode",
  "qps_list": [2],
  "server_parameters": {
    "model": "meta-llama/Meta-Llama-3-70B-Instruct",
    "tensor_parallel_size": 4,
    "speculative_config": {
      "model": "ibm-fms/llama3-70b-accelerator",
      "num_speculative_tokens": 4
    }
  }
}
Limitations:
- Only covers one speculative decoding scenario
- No Eagle3-specific benchmarks
- Results published to PyTorch Performance Dashboard but no automated regression detection
- No acceptance rate or speedup tracking
4. Example Scripts
Location: examples/offline_inference/spec_decode.py
Features:
- Comprehensive CLI for testing different methods
- Support for multimodal prompts
- Configurable sampling parameters
- Metrics collection (vllm.v1.metrics.reader)
Gap: Examples are manual tools, not integrated into automated testing
What's Missing
- Regression Tracking Infrastructure
- No historical baseline storage
- No automated comparison against baselines
- No alerting on degradation
- Quality Metrics
- Acceptance rate not systematically measured
- Speedup not validated
- Draft model utilization not tracked
- Comprehensive Coverage
- MTP variants lack dedicated tests
- Multimodal speculative decoding undertested
- Tensor parallelism edge cases not covered
- Different hardware platforms (AMD, Intel) not systematically tested
- Performance Regression Detection
- No automated detection of latency increases
- No throughput regression tracking
- No memory usage monitoring
- Documentation
- No documented expected baselines
- No guidance on acceptable performance ranges
- No troubleshooting guides for failures
Proposal Overview
Core Principles
- Multi-layered testing: smoke (PR), nightly, weekly
- Empirical baselines for each variant
- Progressive rollout: Eagle3 → EAGLE/n-gram → MTP variants
Proposed Test Taxonomy
graph TD
A[Test Pyramid] --> B[Weekly Deep Dive]
A --> C[Nightly Regression]
A --> D[PR Smoke Tests]
B --> B1["🔬 Comprehensive Coverage<br/>─────────────────<br/>• All variants × all models<br/>• 1000+ prompts per test<br/>• Full hardware matrix<br/>• 3 hours runtime"]
C --> C1["🎯 Targeted Validation<br/>─────────────────<br/>• 4 core variants<br/>• Representative models<br/>• Baseline validation<br/>• 45 min runtime"]
D --> D1["⚡ Fast Check<br/>─────────────────<br/>• Eagle3 + Llama-3.1-8B<br/>• Single configuration<br/>• Quick correctness<br/>• 5 min runtime"]
style A fill:#e8eaf6,stroke:#3f51b5,stroke-width:3px,color:#000
style B fill:#ffebee,stroke:#c62828,stroke-width:2px,color:#000
style C fill:#fff3e0,stroke:#ef6c00,stroke-width:2px,color:#000
style D fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000
style B1 fill:#ffcdd2,stroke:#c62828,stroke-width:1px,color:#000
style C1 fill:#ffe0b2,stroke:#ef6c00,stroke-width:1px,color:#000
style D1 fill:#c8e6c9,stroke:#2e7d32,stroke-width:1px,color:#000
Success Criteria
Per-variant thresholds:
- Match rate ≥ baseline
- Acceptance rate within ±10%
- TTFT/TPOT within ±15%
- No crashes/OOMs
Design Details
Test Architecture Overview
Three test categories: correctness, quality, performance. All validate against baselines.
Note: Metrics collection/dashboards/alerting infrastructure TBD (see Open Questions).
1. Correctness Tests
Purpose: Verify that speculative decoding produces the same outputs as non-speculative decoding.
Test Pattern:
FOR each variant + model:
prompts = generate_test_prompts(100, diverse_categories)
ref_outputs = run_inference(speculative: disabled)
spec_outputs = run_inference(speculative: enabled)
match_rate = calculate_match_rate(ref_outputs, spec_outputs)
baseline = load_baseline(variant, model)
IF match_rate < baseline.expected - baseline.tolerance:
FAIL
Note: Match rate < 100% is acceptable due to sampling variance.
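For concreteness, a minimal sketch of what such a test could look like, reusing the chat-based pattern from tests/v1/e2e/test_spec_decode.py. The baseline path, file layout, and bound calculation are illustrative assumptions, not existing test code:

import json
from vllm import LLM, SamplingParams

# Hypothetical baseline location; the storage format is an open question.
BASELINE_PATH = "tests/v1/regression/baselines/eagle3_llama3.1_8b.json"

def test_eagle3_correctness(test_prompts):
    sampling = SamplingParams(temperature=0.0, max_tokens=128)

    ref_llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_model_len=2048)
    ref_outputs = ref_llm.chat(test_prompts, sampling)
    del ref_llm  # a real test would also free GPU memory before loading the next model

    spec_llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        speculative_config={
            "method": "eagle3",
            "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
            "num_speculative_tokens": 3,
        },
        max_model_len=2048,
    )
    spec_outputs = spec_llm.chat(test_prompts, sampling)

    matches = sum(ref.outputs[0].text == spec.outputs[0].text
                  for ref, spec in zip(ref_outputs, spec_outputs))
    match_rate = matches / len(ref_outputs)

    with open(BASELINE_PATH) as f:
        bl = json.load(f)["baselines"]["match_rate"]
    lower_bound = bl["expected"] * (1 - bl["tolerance_pct"] / 100)
    assert match_rate >= lower_bound, (
        f"match_rate {match_rate:.3f} below baseline lower bound {lower_bound:.3f}")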
2. Quality Regression Tests
Purpose: Track acceptance rate and draft efficiency.
Test Pattern:
FOR each variant + model:
results = run_inference_with_metrics(prompts, variant)
metrics = {
acceptance_rate: accepted / proposed,
avg_drafts_per_step: total_drafts / steps
}
baseline = load_baseline(variant, model)
FOR each metric:
IF metric NOT within baseline.tolerance:
FAIL
Thresholds: Acceptance rate 60-80% (±10%).
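A minimal sketch of how the acceptance rate could be collected, modeled on examples/offline_inference/spec_decode.py. The metric names and the get_metrics() flow are based on vLLM's current V1 metrics reader and should be treated as assumptions to verify:

from vllm import LLM, SamplingParams
from vllm.v1.metrics.reader import Counter

def measure_acceptance_rate(llm: LLM, prompts, sampling: SamplingParams) -> float:
    # The LLM is assumed to be constructed with disable_log_stats=False so
    # that engine metrics are exposed via llm.get_metrics().
    llm.chat(prompts, sampling)

    num_draft = num_accepted = 0
    for metric in llm.get_metrics():
        if isinstance(metric, Counter):
            # Metric names assumed from vLLM's spec-decode counters.
            if metric.name == "vllm:spec_decode_num_draft_tokens":
                num_draft = metric.value
            elif metric.name == "vllm:spec_decode_num_accepted_tokens":
                num_accepted = metric.value

    return num_accepted / num_draft if num_draft else 0.0

The measured value would then be checked against the per-variant baseline with the same two-sided tolerance logic described under Baseline Management below.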
3. Performance Regression Tests
Purpose: Monitor TTFT, TPOT, end-to-end latency.
Test Pattern:
FOR each variant + model + hardware:
latency_results = benchmark_latency(prompts, variant, hardware)
metrics = {
ttft_p50, ttft_p99,
tpot_p50, tpot_p99,
e2e_latency_p50
}
baseline = load_baseline(variant, model, hardware)
FOR each metric:
IF metric > baseline.expected + baseline.tolerance:
FAIL
Note: Use percentiles (p50, p99), hardware-specific baselines, ±15% tolerance.
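A minimal sketch of the one-sided percentile check implied by the pseudocode above, assuming per-request latency samples have already been collected by the benchmark harness:

import numpy as np

def check_latency_metric(samples_ms: list[float], expected_ms: float,
                         tolerance_pct: float, percentile: float = 50) -> None:
    observed = float(np.percentile(samples_ms, percentile))
    upper_bound = expected_ms * (1 + tolerance_pct / 100)
    # Latency checks are one-sided: only slower-than-baseline runs fail.
    assert observed <= upper_bound, (
        f"p{int(percentile)} latency {observed:.1f} ms exceeds baseline "
        f"upper bound {upper_bound:.1f} ms (expected {expected_ms} ±{tolerance_pct}%)")

For example, check_latency_metric(ttft_samples_ms, expected_ms=44.0, tolerance_pct=15) would enforce the TTFT p50 baseline, and the same helper applied to the p99 samples covers tail latency.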
4. Baseline Management
Requirements:
- Version controlled (easy PR review)
- Human readable (JSON/YAML/TOML)
- Per variant + model + hardware
Example Baseline:
{
  "variant": "eagle3",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "draft_model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
  "hardware": "L4",
  "baselines": {
    "match_rate": {"expected": 0.75, "tolerance_pct": 10},
    "acceptance_rate": {"expected": 0.73, "tolerance_pct": 10},
    "ttft_p50_ms": {"expected": 44.0, "tolerance_pct": 15}
  }
}
Validation:
FOR each metric:
tolerance = expected * (tolerance_pct / 100)
IF actual < expected - tolerance OR actual > expected + tolerance:
FAIL
Update Process: PR review required. Updates for intentional changes only, not regressions.
TBD: Storage format, directory structure, approval process (see Open Questions).
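Even with those details open, a minimal sketch of baseline loading and validation could look like the following; the directory layout and file-naming scheme are placeholders pending the open questions:

import json
from pathlib import Path

BASELINE_ROOT = Path("tests/v1/regression/baselines")  # hypothetical location

def load_baseline(variant: str, model: str, hardware: str) -> dict:
    # e.g. eagle3_llama-3.1-8b-instruct_l4.json (naming scheme is an assumption)
    name = f"{variant}_{model.split('/')[-1].lower()}_{hardware.lower()}.json"
    return json.loads((BASELINE_ROOT / name).read_text())

def validate_metrics(actual: dict[str, float], baseline: dict) -> list[str]:
    """Return human-readable failures; an empty list means all within tolerance."""
    failures = []
    for name, spec in baseline["baselines"].items():
        if name not in actual:
            continue  # metric not collected in this run
        tol = spec["expected"] * spec["tolerance_pct"] / 100
        low, high = spec["expected"] - tol, spec["expected"] + tol
        if not (low <= actual[name] <= high):
            failures.append(f"{name}: {actual[name]:.3f} outside [{low:.3f}, {high:.3f}]")
    return failures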
Phased Implementation
Phase 0: Foundation
Deliverables:
- Test infrastructure + baseline storage
- Eagle3 baseline (Llama-3.1-8B, 10-20 runs)
- One working regression test
- Test markers, optional nightly CI
Success: Eagle3 baseline established, one test passing, documented, extensible.
Phase 1: Eagle3 Comprehensive Coverage
Deliverables:
- Correctness, quality, performance tests for Eagle3
- Baselines for 3+ models (Llama-3.1-8B, Qwen3-8B, optionally larger)
- Nightly CI integration + alerting
- Documentation
Success: 3+ models tested, nightly tests running, documented.
Phase 2: Expand to Core Variants
Deliverables:
- EAGLE tests + baselines (2-3 models)
- N-gram tests + baselines
- Optional: Medusa tests
- Optional: Cross-variant analysis
Success: 3-4 variants tested, integrated into nightly CI.
Phase 3: MTP Variants and Advanced Scenarios
Deliverables:
- 2-3 MTP variants (DeepSeek MTP, MiMo MTP)
- Advanced scenarios: multimodal, long context, multi-GPU, or alt hardware
- Integration with existing benchmarks
Success: 2+ MTP variants tested, 1+ advanced scenario validated.
Phase 4: Optimization and Operationalization
Deliverables:
- Performance optimization (parallelization, caching, remove redundant tests)
- Reliability improvements (reduce false positives, retry logic, better diagnostics)
- Feedback loop and threshold tuning
Success: <10% false positives, <15min nightly runtime, docs complete.
CI Resource Management
Resource Usage
Test Tiers:
| Test Tier | Frequency | Variants | Models | GPU | Duration |
|---|---|---|---|---|---|
| PR Smoke | Per PR | 1 (Eagle3) | 1 | L4 | 5 min |
| Nightly | Daily | 4 | 3 | L4 + A100 | 45 min |
| Weekly Deep Dive | Weekly | 10 | 5 | L4 + A100 + H100 | 3 hours |
Optimization Strategies
- Smart Triggering
  - Only run spec decode tests when relevant files change:
      source_file_dependencies:
        - vllm/v1/spec_decode/
        - vllm/model_executor/models/*eagle*.py
        - vllm/model_executor/models/*medusa*.py
        - vllm/config/speculative.py
        - tests/v1/regression/spec_decode/
- Test Sharding
  - Parallelize test execution across multiple workers
  - Use Buildkite parallelism feature:
      - label: "Spec Decode Regression (Eagle3)"
        parallelism: 3
        command: pytest -v -s v1/regression/spec_decode/test_eagle3.py::shard_${BUILDKITE_PARALLEL_JOB}
- Progressive Coverage
- Start with 1-2 variants, expand gradually
- Prioritize high-value tests (most frequently used variants)
- Test Result Caching
- Skip tests if commit doesn't touch relevant code paths
- Cache test results for unchanged code
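The --shard-id/--num-shards flags used in the nightly commands below are not built into pytest; a minimal conftest.py hook along these lines could implement them. This is an illustrative sketch, not existing vLLM test infrastructure:

# conftest.py (sketch)
def pytest_addoption(parser):
    parser.addoption("--shard-id", type=int, default=0)
    parser.addoption("--num-shards", type=int, default=1)

def pytest_collection_modifyitems(config, items):
    num_shards = config.getoption("--num-shards")
    shard_id = config.getoption("--shard-id")
    if num_shards <= 1:
        return
    # Each CI worker keeps every num_shards-th collected test, offset by its shard id.
    items[:] = [item for i, item in enumerate(items) if i % num_shards == shard_id]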
Resource Allocation:
# .buildkite/test-pipeline.yaml additions
- label: "🦅 Spec Decode Smoke Test (Eagle3)"
  fast_check: true  # Run on all PRs
  timeout_in_minutes: 10
  gpu: "l4"  # Cheapest GPU
  source_file_dependencies:
    - vllm/v1/spec_decode/
    - vllm/model_executor/models/llama_eagle3.py
    - vllm/config/speculative.py
    - tests/v1/regression/spec_decode/
  commands:
    - pytest -v -s tests/v1/regression/spec_decode/test_eagle3_smoke.py

- label: "🦅 Spec Decode Nightly Regression"
  # Only on nightly schedule, not per-PR
  schedule: "0 2 * * *"  # 2 AM daily
  timeout_in_minutes: 60
  gpu: "a100"
  num_gpus: 1
  parallelism: 4  # Shard across 4 workers
  commands:
    - pytest -v -s tests/v1/regression/spec_decode/ --shard-id=${BUILDKITE_PARALLEL_JOB} --num-shards=${BUILDKITE_PARALLEL_JOB_COUNT}
  artifact_paths:
    - "test_results/*.json"  # Upload metrics
  notify:
    - slack: "#vllm-ci-alerts"
      if: build.state == "failed"

- label: "🦅 Spec Decode Weekly Deep Dive"
  schedule: "0 4 * * 0"  # Sunday 4 AM
  timeout_in_minutes: 240
  gpu: "a100"
  num_gpus: 4
  commands:
    - pytest -v -s tests/v1/regression/spec_decode/ --comprehensive --all-variants
  artifact_paths:
    - "test_results/*.json"
    - "benchmark_results/*.json"
Benchmarking Strategy
Integration with Existing Infrastructure
vLLM already has robust benchmarking infrastructure:
- benchmarks/benchmark_serving.py for online serving
- benchmarks/benchmark_throughput.py for offline throughput
- benchmarks/benchmark_latency.py for latency profiling
- .buildkite/nightly-benchmarks/ for continuous benchmarking
Proposal: Extend existing benchmarks with speculative decoding focus
Benchmark Suite
1. Latency Benchmarks
Script: benchmarks/benchmark_spec_decode_latency.py (new)
"""
Benchmark speculative decoding latency metrics.
Measures:
- TTFT (Time to First Token)
- TPOT (Time per Output Token)
- E2E Latency
- Per-step latency breakdown
Usage:
python benchmarks/benchmark_spec_decode_latency.py \
--method eagle3 \
--model meta-llama/Llama-3.1-8B-Instruct \
--draft-model yuhuili/EAGLE3-LLaMA3.1-Instruct-8B \
--num-prompts 100 \
--dataset sharegpt
"""Metrics:
- TTFT p50, p95, p99
- TPOT p50, p95, p99
- E2E latency distribution
- Latency overhead vs non-speculative baseline
2. Throughput Benchmarks
Script: Extend benchmarks/benchmark_throughput.py
Additions:
- Compare throughput with/without speculative decoding
- Measure throughput degradation at different batch sizes
- Identify optimal batch size for speculative decoding
3. Quality Benchmarks
Script: benchmarks/benchmark_spec_decode_quality.py (new)
Metrics:
- Acceptance rate across different prompt types
- Average drafts per step
- Draft efficiency (accepted tokens / proposed tokens)
- Speedup ratio vs non-speculative
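A minimal sketch of how the quality benchmark could derive these numbers from a timed speculative run and a non-speculative reference run; the inputs are assumed to come from the benchmark harness and the metrics reader:

def summarize_quality(spec_elapsed_s: float, base_elapsed_s: float,
                      accepted_tokens: int, proposed_tokens: int,
                      num_steps: int) -> dict[str, float]:
    return {
        # Wall-clock speedup of the speculative run vs. the reference run.
        "speedup": base_elapsed_s / spec_elapsed_s,
        # Fraction of proposed draft tokens accepted by the target model.
        "draft_efficiency": accepted_tokens / proposed_tokens,
        # Average number of draft tokens proposed per decoding step.
        "avg_drafts_per_step": proposed_tokens / num_steps,
    }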
4. Serving Benchmarks
Integration: Extend .buildkite/nightly-benchmarks/tests/serving-tests.json
Current Coverage:
{
  "test_name": "serving_llama70B_tp4_sharegpt_specdecode",
  "qps_list": [2],
  ...
}
Proposed Expansion:
[
  {
    "test_name": "serving_llama8B_eagle3_sharegpt",
    "qps_list": [1, 2, 5, 10],
    "server_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "tensor_parallel_size": 1,
      "speculative_config": {
        "method": "eagle3",
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 5
      }
    }
  },
  {
    "test_name": "serving_qwen8B_eagle3_mt_bench",
    "qps_list": [1, 2, 5],
    "server_parameters": {
      "model": "Qwen/Qwen3-8B",
      "tensor_parallel_size": 1,
      "speculative_config": {
        "method": "eagle3",
        "model": "AngelSlim/Qwen3-8B_eagle3",
        "num_speculative_tokens": 5
      }
    }
  },
  {
    "test_name": "serving_llama8B_ngram_sharegpt",
    "qps_list": [1, 5, 10],
    "server_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "speculative_config": {
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 5
      }
    }
  }
]
Baseline Tracking
Note: Uses the same simplified baseline format as regression tests (see section 4).
Storage: JSON files in tests/v1/regression/baselines/
Example: eagle3_llama3.1_8b_benchmark.json
{
  "variant": "eagle3",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "draft_model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
  "gpu_type": "L4",
  "dataset": "sharegpt",
  "baselines": {
    "ttft_p50_ms": {
      "expected": 44.0,
      "tolerance_pct": 15
    },
    "tpot_p50_ms": {
      "expected": 12.3,
      "tolerance_pct": 15
    },
    "acceptance_rate": {
      "expected": 0.73,
      "tolerance_pct": 10
    },
    "speedup": {
      "expected": 2.1,
      "tolerance_pct": 15
    }
  },
  "last_updated": "2025-11-05",
  "baseline_commit": "3481e4074",
  "notes": "Baseline from 10 benchmark runs on L4 GPU"
}
Metrics Collection and Reporting
Purpose: Track regression test results over time to identify trends and provide visibility.
Open Questions for Discussion:
- Metrics Storage:
- Where to store test run results? (CI artifacts, S3/GCS, time-series DB)
- How long to retain historical data?
- Should metrics be queryable? If so, what infrastructure?
- Visualization and Dashboards:
- Do we need real-time dashboards? (e.g., Grafana, custom web UI)
- Or is static reporting sufficient? (HTML reports, markdown summaries)
Minimal Requirements:
At minimum, the testing infrastructure should:
- Export test results as structured data (JSON, CSV, etc.)
- Provide clear pass/fail status for CI integration
- Generate human-readable summaries of test runs
- Store results as CI artifacts for debugging
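As a sketch of the minimal requirement, each test could write one structured record per variant/model as a CI artifact; the record schema and output path here are illustrative assumptions:

import json
from datetime import datetime, timezone
from pathlib import Path

def export_results(variant: str, model: str, metrics: dict[str, float],
                   failures: list[str], out_dir: str = "test_results") -> Path:
    record = {
        "variant": variant,
        "model": model,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "metrics": metrics,
        "failures": failures,
        "status": "pass" if not failures else "fail",
    }
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{variant}_{model.split('/')[-1]}.json"
    path.write_text(json.dumps(record, indent=2))
    return path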
Optional Enhancements:
If resources and infrastructure permit:
- Historical trend visualization (acceptance rate over time, latency trends)
- Comparative analysis (variant A vs variant B performance)
- Automated regression bisection (identify which commit introduced regression)
- Performance dashboards for community visibility
Example Minimal Report (Text Output):
Speculative Decoding Regression Test Summary
============================================
Date: 2025-11-05 | Commit: 3481e4074
Overall: ✓ PASSED (4/4 variants passing)
Eagle3 - Llama-3.1-8B-Instruct: ✓ PASS
acceptance_rate: 0.742 [0.657-0.803] ✓
speedup: 2.15x [1.79-2.42] ✓
EAGLE - Llama-3.1-8B-Instruct: ✓ PASS
acceptance_rate: 0.681 [0.594-0.726] ✓
speedup: 1.89x [1.53-2.07] ✓
N-gram - Llama-3.1-8B-Instruct: ✓ PASS
match_rate: 0.78 [0.66-0.86] ✓
MTP - MiMo-7B: ✓ PASS
acceptance_rate: 0.812 [0.72-0.88] ✓
Future Expansion
1. Hardware Diversity
Current Focus: NVIDIA L4, A100, H100
Implementation:
- Add hardware-specific baselines
- Validate performance characteristics per hardware
- Detect hardware-specific regressions
2. Multi-Modal Speculative Decoding
Current tests have limited multimodal coverage. Future expansion:
- Vision + text speculative decoding
- Audio + text speculative decoding
- Validate modality-specific acceptance rates
- Benchmark multimodal latency
Challenges:
- Multimodal models are larger (resource intensive)
- Acceptance rates may differ significantly from text-only
- Need diverse multimodal datasets
3. Long Context Scenarios
Test speculative decoding performance at extreme context lengths:
- 32k tokens
- 64k tokens
- 128k+ tokens
Metrics:
- Memory usage vs context length
- Acceptance rate degradation
- Latency scaling
Open Questions for Discussion
This section outlines key decisions that require team input and consensus.
1. Infrastructure and Tooling
Baseline Storage:
- Where should baselines be stored? (version control, external storage, database)
- What format? (JSON, YAML, TOML, other)
- How to organize? (by variant, by model, flat structure, hierarchical)
Metrics Collection:
- What infrastructure for collecting test metrics? (CI artifacts, time-series DB, custom solution)
- How long to retain historical data?
Dashboards and Reporting:
- Do we need real-time dashboards or are static reports sufficient?
- What should be visible to the community vs. internal only?
- Integration with existing performance dashboards?
2. Testing Strategy
Test Coverage Priorities:
- Which variants should be tested first? (prioritize by usage frequency)
- What model sizes to cover? (focus on 8B models or include 70B+)
- Hardware diversity: which GPUs to support? (NVIDIA only, or AMD/Intel too)
Tolerance Thresholds:
- How to set initial tolerance ranges? (conservative vs. tight)
- Should tolerances be metric-specific or uniform?
- How to handle inherent GPU performance variance?
Test Frequency:
- Which tests run per-PR, nightly, or weekly?
- How to balance coverage vs. CI cost?
- Should we have different test tiers (smoke, full, comprehensive)?
3. CI Failure Policy
- Should regression tests block merges?
- Hard fail vs. soft fail initially?
- Different policies for different test categories?
4. Resources
Resource Management:
- What CI budget is available?
- Coverage vs. resource usage trade-offs
- Optimization opportunities
5. Success Metrics
How do we measure success of this initiative?
- Regression detection rate?
- False positive rate?
- Time to detect regressions?
- Production incident reduction?
References
Code References
- Speculative Decoding Implementations:
  - Eagle3: vllm/model_executor/models/llama_eagle3.py
  - EAGLE: vllm/v1/spec_decode/eagle.py
  - Medusa: vllm/v1/spec_decode/medusa.py
  - N-gram: vllm/v1/spec_decode/ngram_proposer.py
  - Configuration: vllm/config/speculative.py
- Existing Tests:
  - E2E Correctness: tests/v1/e2e/test_spec_decode.py
  - Test Fixtures: tests/conftest.py
- Benchmarks:
  - N-gram Proposer: benchmarks/benchmark_ngram_proposer.py
  - Serving: benchmarks/benchmark_serving.py
  - Latency: benchmarks/benchmark_latency.py
  - Nightly Config: .buildkite/nightly-benchmarks/tests/serving-tests.json
- CI/CD:
  - Test Pipeline: .buildkite/test-pipeline.yaml (lines 307, 362-364)
  - AMD Pipeline: .buildkite/test-amd.yaml (line 346)
  - GitHub Actions: .github/workflows/reminder_comment.yml
- Examples:
  - Offline Inference: examples/offline_inference/spec_decode.py
Documentation References
- Speculative Decoding Docs: docs/features/spec_decode.md
- Contributing Guide: docs/contributing/README.md
- vLLM Documentation: https://docs.vllm.ai
External References
- Speculative Decoding Overview: https://x.com/karpathy/status/1697318534555336961
- EAGLE Paper: EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- Medusa Paper: Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
- MLP Speculator Blog: https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/
- PyTorch Performance Dashboard: https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm
Success Metrics
- Catch 80% of regressions before release
- Detect within 24 hours (nightly tests)
- <5% false positive rate
- 100% variant coverage by Phase 3
- Zero production incidents from undetected regressions
Appendix
Appendix A: Test Matrix
Complete test coverage matrix for all variants:
| Variant | Model | TP Size | GPU | Dataset | Metrics Tracked | Priority |
|---|---|---|---|---|---|---|
| Eagle3 | Llama-3.1-8B | 1 | L4 | ShareGPT | Correctness, Acceptance, TTFT, TPOT | P0 |
| Eagle3 | Qwen3-8B | 1 | L4 | mt-bench | Correctness, Acceptance, TTFT, TPOT | P0 |
| Eagle3 | Llama-4-Scout-17B | 4 | H100 | ShareGPT | Correctness, Acceptance | P1 |
| EAGLE | Llama-3.1-8B | 1 | L4 | ShareGPT | Correctness, Acceptance | P1 |
| EAGLE | DeepSeek-v3 | 1 | A100 | mt-bench | Correctness, Acceptance | P1 |
| N-gram | Llama-3.1-8B | 1 | L4 | ShareGPT | Correctness, Speedup | P0 |
| N-gram | Qwen3-8B | 1 | L4 | mt-bench | Correctness, Speedup | P1 |
| MTP | MiMo-7B | 1 | L4 | ShareGPT | Correctness, Acceptance | P1 |
| MTP | DeepSeek-V3 | 1 | A100 | mt-bench | Correctness, Acceptance | P1 |
| Medusa | [TBD] | 1 | L4 | ShareGPT | Correctness, Acceptance | P2 |
Priority Levels:
- P0: Critical, must have in Phase 1
- P1: Important, target for Phase 2
- P2: Nice to have, Phase 3 or later
Appendix B: Simplified Baseline JSON Schema
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Simplified Speculative Decoding Baseline",
  "type": "object",
  "required": ["variant", "model", "baselines"],
  "properties": {
    "variant": {
      "type": "string",
      "enum": ["eagle3", "eagle", "ngram", "medusa", "mtp"],
      "description": "Speculative decoding method"
    },
    "model": {
      "type": "string",
      "description": "Target model name (e.g., meta-llama/Llama-3.1-8B-Instruct)"
    },
    "draft_model": {
      "type": "string",
      "description": "Draft model name if applicable (omit for ngram)"
    },
    "baselines": {
      "type": "object",
      "description": "Metrics with expected values and tolerance percentages",
      "patternProperties": {
        "^.*$": {
          "type": "object",
          "required": ["expected", "tolerance_pct"],
          "properties": {
            "expected": {
              "type": "number",
              "description": "Expected baseline value"
            },
            "tolerance_pct": {
              "type": "number",
              "minimum": 0,
              "maximum": 100,
              "description": "Acceptable deviation percentage (e.g., 10 = ±10%)"
            }
          }
        }
      }
    },
    "last_updated": {
      "type": "string",
      "format": "date",
      "description": "Date baseline was last updated (YYYY-MM-DD)"
    },
    "baseline_commit": {
      "type": "string",
      "description": "Git commit SHA where baseline was established (optional)"
    },
    "notes": {
      "type": "string",
      "description": "Free-form notes about baseline establishment (optional)"
    }
  }
}
Example Baseline File:
See section 4 for a complete example of eagle3_llama3.1_8b.json
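To keep baseline files consistent with this schema, CI could validate them with a JSON Schema library. A minimal sketch using the third-party jsonschema package; the schema file path is an assumption:

import json
from pathlib import Path

import jsonschema  # third-party: pip install jsonschema

SCHEMA_PATH = Path("tests/v1/regression/baselines/schema.json")  # hypothetical location

def validate_baseline_file(path: str) -> None:
    schema = json.loads(SCHEMA_PATH.read_text())
    baseline = json.loads(Path(path).read_text())
    # Raises jsonschema.exceptions.ValidationError if the file does not conform.
    jsonschema.validate(instance=baseline, schema=schema)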
Appendix C: Example Test Output
===== Speculative Decoding Regression Test Results =====
Test: Eagle3 Regression - Llama-3.1-8B-Instruct
Status: ✓ PASSED
Duration: 4m 32s
Baseline: eagle3_llama3.1_8b.json
Metrics (all within tolerance):
match_rate: 0.765 within [0.675, 0.825] (expected: 0.75 ±10%) ✓
acceptance_rate: 0.741 within [0.657, 0.803] (expected: 0.73 ±10%) ✓
ttft_p50_ms: 44.8 within [37.4, 50.6] (expected: 44.0 ±15%) ✓
tpot_p50_ms: 12.9 within [10.5, 14.2] (expected: 12.3 ±15%) ✓
speedup: 2.08 within [1.79, 2.42] (expected: 2.10 ±15%) ✓
Details:
Test Prompts: 100 (mixed categories)
GPU: L4 (1x)
Dataset: ShareGPT
Commit: 3481e4074
Timestamp: 2025-11-05T14:32:15Z
---
Example FAILURE output:
Test: Eagle3 Regression - Llama-3.1-8B-Instruct
Status: ✗ FAILED
Duration: 4m 18s
Regression detected:
- acceptance_rate: 0.635 outside range [0.657, 0.803] (expected: 0.73 ±10%)
- speedup: 1.62 outside range [1.79, 2.42] (expected: 2.10 ±15%)
Baseline: eagle3_llama3.1_8b.json
Last updated: 2025-11-05
Commit: 3481e4074
Action Required:
1. Investigate recent changes to spec decode code
2. Check if this is an intentional change requiring baseline update
3. If regression is real, bisect to find problematic commit
=======================================================
Feedback Period.
2 Weeks
CC List.
@benchislett @njhill @DarkLight1337 @aarnphm
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.