
[RFC]: Robust End-to-End CI/CD Regression Testing for Speculative Decoding #28135

@rahul-tuli

Description


Motivation.

Executive Summary

Proposes regression testing for speculative decoding in vLLM, starting with Eagle3 and expanding to all variants.

Objectives:

  • Prevent regressions in acceptance rates and speedup
  • Detect performance degradation in CI
  • Validate correctness across variants
  • Manage CI costs

Impact:

  • Detect regressions in hours vs. days
  • Provide performance baselines

Table of Contents

  1. Background and Motivation
  2. Current State Analysis
  3. Proposal Overview
  4. Design Details
  5. Phased Implementation
  6. CI Resource Management
  7. Benchmarking Strategy
  8. Future Expansion
  9. Open Questions
  10. References
  11. Call to Action

Background and Motivation

Speculative decoding can speed up LLM inference by 2-3x: a draft model proposes several tokens ahead, and the target model verifies them in parallel. vLLM supports n-gram, EAGLE/Eagle3, Medusa, and MTP variants.

Problem: Small code changes can degrade acceptance rates and speedup without failing correctness tests.

Current Gaps:

  1. No acceptance rate tracking in tests/v1/e2e/test_spec_decode.py (only 66-80% match thresholds)
  2. Nightly benchmarks lack baselines and alerts
  3. Incomplete coverage for MTP variants
  4. No documented performance expectations per variant

Current State Analysis

Existing Test Infrastructure

1. Correctness Tests

Location: tests/v1/e2e/test_spec_decode.py

Current Coverage:

  • N-gram: Basic correctness test with 100 mixed prompts (66% match threshold)
  • EAGLE: Multiple model variants tested (Llama-3.1-8B, Llama-4-Scout-17B-16E, DeepSeek)
  • Eagle3: Qwen3-8B, Llama-3.1-8B variants
  • MTP: MiMo-7B, DeepSeek-V3 (80% match threshold)

Test Pattern:

# Pattern from tests/v1/e2e/test_spec_decode.py; model names, prompts, and
# sampling config come from test parametrization/fixtures.
from vllm import LLM

# Reference implementation
ref_llm = LLM(model=model_name, max_model_len=2048)
ref_outputs = ref_llm.chat(test_prompts, sampling_config)

# Speculative implementation
spec_llm = LLM(
    model=model_name,
    speculative_config={
        "method": method,
        "model": spec_model_name,
        "num_speculative_tokens": 3,
    },
    max_model_len=2048,
)
spec_outputs = spec_llm.chat(test_prompts, sampling_config)

# Heuristic validation
matches = sum(ref.outputs[0].text == spec.outputs[0].text
              for ref, spec in zip(ref_outputs, spec_outputs))
assert matches >= int(0.66 * len(ref_outputs))  # 66% threshold

Limitations:

  • Match rate threshold is arbitrary and doesn't track temporal trends
  • No measurement of acceptance rate or actual speedup
  • Tests run with limited prompt diversity
  • No multimodal coverage for most variants
  • Thresholds not scientifically derived

2. Buildkite CI Integration

Location: .buildkite/test-pipeline.yaml:307

Current Configuration:

- label: V1 Speculative Decoding Test (Integration)
  commands:
    - pytest -v -s v1/spec_decode

Example Offline Tests (lines 362-364):

python3 offline_inference/spec_decode.py --test \
  --method eagle --num_spec_tokens 3 \
  --dataset-name hf --dataset-path philschmid/mt-bench \
  --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 \
  --tp 1 --enable-chunked-prefill --max-model-len 2048

Limitations:

  • Not part of fast_check CI (only runs on full CI)
  • No performance metrics collected
  • No trend analysis or regression detection
  • Tests don't fail on performance degradation

3. Benchmarking Infrastructure

Location: benchmarks/benchmark_ngram_proposer.py

Capabilities:

  • Microbenchmark for n-gram proposal latency
  • Parameterized testing across different n-gram sizes
  • Timing collection and statistical analysis

Nightly Benchmarks:
Location: .buildkite/nightly-benchmarks/tests/serving-tests.json:57

{
  "test_name": "serving_llama70B_tp4_sharegpt_specdecode",
  "qps_list": [2],
  "server_parameters": {
    "model": "meta-llama/Meta-Llama-3-70B-Instruct",
    "tensor_parallel_size": 4,
    "speculative_config": {
      "model": "ibm-fms/llama3-70b-accelerator",
      "num_speculative_tokens": 4,
    }
  }
}

Limitations:

  • Only covers one speculative decoding scenario
  • No Eagle3-specific benchmarks
  • Results published to PyTorch Performance Dashboard but no automated regression detection
  • No acceptance rate or speedup tracking

4. Example Scripts

Location: examples/offline_inference/spec_decode.py

Features:

  • Comprehensive CLI for testing different methods
  • Support for multimodal prompts
  • Configurable sampling parameters
  • Metrics collection (vllm.v1.metrics.reader)

Gap: Examples are manual tools, not integrated into automated testing

What's Missing

  1. Regression Tracking Infrastructure

    • No historical baseline storage
    • No automated comparison against baselines
    • No alerting on degradation
  2. Quality Metrics

    • Acceptance rate not systematically measured
    • Speedup not validated
    • Draft model utilization not tracked
  3. Comprehensive Coverage

    • MTP variants lack dedicated tests
    • Multimodal speculative decoding undertested
    • Tensor parallelism edge cases not covered
    • Different hardware platforms (AMD, Intel) not systematically tested
  4. Performance Regression Detection

    • No automated detection of latency increases
    • No throughput regression tracking
    • No memory usage monitoring
  5. Documentation

    • No documented expected baselines
    • No guidance on acceptable performance ranges
    • No troubleshooting guides for failures

Proposal Overview

Core Principles

  1. Multi-layered testing: smoke (PR), nightly, weekly
  2. Empirical baselines for each variant
  3. Progressive rollout: Eagle3 → EAGLE/n-gram → MTP variants

Proposed Test Taxonomy

graph TD
    A[Test Pyramid] --> B[Weekly Deep Dive]
    A --> C[Nightly Regression]
    A --> D[PR Smoke Tests]

    B --> B1["🔬 Comprehensive Coverage<br/>─────────────────<br/>• All variants × all models<br/>• 1000+ prompts per test<br/>• Full hardware matrix<br/>• 3 hours runtime"]

    C --> C1["🎯 Targeted Validation<br/>─────────────────<br/>• 4 core variants<br/>• Representative models<br/>• Baseline validation<br/>• 45 min runtime"]

    D --> D1["⚡ Fast Check<br/>─────────────────<br/>• Eagle3 + Llama-3.1-8B<br/>• Single configuration<br/>• Quick correctness<br/>• 5 min runtime"]

    style A fill:#e8eaf6,stroke:#3f51b5,stroke-width:3px,color:#000
    style B fill:#ffebee,stroke:#c62828,stroke-width:2px,color:#000
    style C fill:#fff3e0,stroke:#ef6c00,stroke-width:2px,color:#000
    style D fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000
    style B1 fill:#ffcdd2,stroke:#c62828,stroke-width:1px,color:#000
    style C1 fill:#ffe0b2,stroke:#ef6c00,stroke-width:1px,color:#000
    style D1 fill:#c8e6c9,stroke:#2e7d32,stroke-width:1px,color:#000

Success Criteria

Per-variant thresholds:

  1. Match rate ≥ baseline
  2. Acceptance rate within ±10% of baseline
  3. TTFT/TPOT within ±15% of baseline
  4. No crashes/OOMs

Design Details

Test Architecture Overview

Three test categories: correctness, quality, performance. All validate against baselines.

Note: Metrics collection/dashboards/alerting infrastructure TBD (see Open Questions).

1. Correctness Tests

Purpose: Verify that outputs with speculative decoding enabled match the non-speculative reference outputs.

Test Pattern:

FOR each variant + model:
  prompts = generate_test_prompts(100, diverse_categories)
  ref_outputs = run_inference(speculative: disabled)
  spec_outputs = run_inference(speculative: enabled)
  match_rate = calculate_match_rate(ref_outputs, spec_outputs)

  baseline = load_baseline(variant, model)
  IF match_rate < baseline.expected - baseline.tolerance:
    FAIL

Note: Match rate < 100% is acceptable due to sampling variance.
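
To make the pattern concrete, here is a minimal pytest-style sketch. The helper load_baseline(), the baseline directory, and the prompt set are illustrative assumptions, not existing vLLM test code; the baseline file format is the one proposed under Baseline Management below.

# Minimal sketch, not existing vLLM test code: load_baseline(), the baseline
# directory, and the prompt set are illustrative assumptions.
import json
from pathlib import Path

from vllm import LLM, SamplingParams

BASELINE_DIR = Path("tests/v1/regression/baselines")  # hypothetical location


def load_baseline(name: str) -> dict:
    # Baseline format is the one proposed under "Baseline Management" below.
    return json.loads((BASELINE_DIR / f"{name}.json").read_text())


def test_eagle3_correctness_regression():
    baseline = load_baseline("eagle3_llama3.1_8b")
    prompts = [[{"role": "user", "content": f"Explain topic {i} in two sentences."}]
               for i in range(100)]
    sampling = SamplingParams(temperature=0.0, max_tokens=128)

    ref_llm = LLM(model=baseline["model"], max_model_len=2048)
    ref_outputs = ref_llm.chat(prompts, sampling)
    del ref_llm  # in practice, free GPU memory before loading the second engine

    spec_llm = LLM(
        model=baseline["model"],
        speculative_config={
            "method": "eagle3",
            "model": baseline["draft_model"],
            "num_speculative_tokens": 3,
        },
        max_model_len=2048,
    )
    spec_outputs = spec_llm.chat(prompts, sampling)

    matches = sum(r.outputs[0].text == s.outputs[0].text
                  for r, s in zip(ref_outputs, spec_outputs))
    match_rate = matches / len(ref_outputs)

    mr = baseline["baselines"]["match_rate"]
    tolerance = mr["expected"] * mr["tolerance_pct"] / 100
    assert match_rate >= mr["expected"] - tolerance, (
        f"match_rate {match_rate:.3f} fell below {mr['expected'] - tolerance:.3f}")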

2. Quality Regression Tests

Purpose: Track acceptance rate and draft efficiency.

Test Pattern:

FOR each variant + model:
  results = run_inference_with_metrics(prompts, variant)
  metrics = {
    acceptance_rate: accepted / proposed,
    avg_drafts_per_step: total_drafts / steps
  }

  baseline = load_baseline(variant, model)
  FOR each metric:
    IF metric NOT within baseline.tolerance:
      FAIL

Thresholds: Expected acceptance rates fall roughly in the 60-80% range depending on variant and model; each is checked against its baseline with ±10% tolerance.
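
A rough sketch of how these metrics could be gathered from an offline run. The metric names below are assumptions and should be confirmed against vllm.v1.metrics.reader and the existing examples/offline_inference/spec_decode.py script, which already collects these counters.

# Sketch only: the metric names are assumptions; see examples/offline_inference/
# spec_decode.py and vllm.v1.metrics.reader for the authoritative names.
from vllm import LLM


def collect_spec_decode_metrics(llm: LLM) -> dict:
    """Aggregate draft/accepted token counters exposed by the V1 metrics reader."""
    num_draft_tokens = num_accepted_tokens = num_drafts = 0
    for metric in llm.get_metrics():
        if metric.name == "vllm:spec_decode_num_draft_tokens":        # assumed name
            num_draft_tokens += metric.value
        elif metric.name == "vllm:spec_decode_num_accepted_tokens":   # assumed name
            num_accepted_tokens += metric.value
        elif metric.name == "vllm:spec_decode_num_drafts":            # assumed name
            num_drafts += metric.value
    return {
        "acceptance_rate": (num_accepted_tokens / num_draft_tokens
                            if num_draft_tokens else 0.0),
        "avg_draft_tokens_per_step": (num_draft_tokens / num_drafts
                                      if num_drafts else 0.0),
    }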

3. Performance Regression Tests

Purpose: Monitor TTFT, TPOT, end-to-end latency.

Test Pattern:

FOR each variant + model + hardware:
  latency_results = benchmark_latency(prompts, variant, hardware)
  metrics = {
    ttft_p50, ttft_p99,
    tpot_p50, tpot_p99,
    e2e_latency_p50
  }

  baseline = load_baseline(variant, model, hardware)
  FOR each metric:
    IF metric > baseline.expected + baseline.tolerance:
      FAIL

Note: Use percentiles (p50, p99), hardware-specific baselines, ±15% tolerance.
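
For the percentile check itself, a small sketch using Python's statistics module (sample values are made up; only the upper bound fails the test, since lower latency is never a regression):

# Sketch: compute a latency percentile from per-request samples and compare it
# against a hardware-specific baseline entry (upper-bound check only).
import statistics


def percentile(samples: list[float], pct: int) -> float:
    # Linear-interpolated percentile, e.g. pct=99 for p99 (requires >= 2 samples).
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[min(pct, 99) - 1]


def latency_within_baseline(samples_ms: list[float], expected_ms: float,
                            tolerance_pct: float, pct: int = 50) -> bool:
    # Faster than baseline is never a regression, so only the upper bound matters.
    return percentile(samples_ms, pct) <= expected_ms * (1 + tolerance_pct / 100)


# Example: p50 TTFT samples checked against a 44.0 ms baseline with ±15% tolerance.
ttft_samples_ms = [41.2, 43.1, 43.9, 44.4, 44.6, 44.8, 45.1, 46.0, 47.3, 50.2]
assert latency_within_baseline(ttft_samples_ms, expected_ms=44.0, tolerance_pct=15)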

4. Baseline Management

Requirements:

  • Version controlled (easy PR review)
  • Human readable (JSON/YAML/TOML)
  • Per variant + model + hardware

Example Baseline:

{
  "variant": "eagle3",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "draft_model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
  "hardware": "L4",
  "metrics": {
    "match_rate": {"expected": 0.75, "tolerance_pct": 10},
    "acceptance_rate": {"expected": 0.73, "tolerance_pct": 10},
    "ttft_p50_ms": {"expected": 44.0, "tolerance_pct": 15}
  }
}

Validation:

FOR each metric:
  tolerance = expected * (tolerance_pct / 100)
  IF actual < expected - tolerance OR actual > expected + tolerance:
    FAIL
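
A possible implementation of this loop, assuming the JSON layout shown above (the helper name and return convention are illustrative):

# Sketch: symmetric tolerance validation over every metric in a baseline file.
import json
from pathlib import Path


def validate_against_baseline(baseline_path: Path,
                              actual: dict[str, float]) -> list[str]:
    """Return human-readable failure messages; an empty list means pass."""
    baseline = json.loads(Path(baseline_path).read_text())
    failures = []
    for name, spec in baseline["baselines"].items():
        if name not in actual:
            continue  # metric not measured in this run
        tolerance = spec["expected"] * spec["tolerance_pct"] / 100
        lo, hi = spec["expected"] - tolerance, spec["expected"] + tolerance
        if not lo <= actual[name] <= hi:
            failures.append(
                f"{name}: {actual[name]:.3f} outside [{lo:.3f}, {hi:.3f}] "
                f"(expected {spec['expected']} ±{spec['tolerance_pct']}%)")
    return failures


# Usage in a test:
#   failures = validate_against_baseline(
#       Path("tests/v1/regression/baselines/eagle3_llama3.1_8b.json"),
#       {"acceptance_rate": 0.741, "speedup": 2.08})
#   assert not failures, "\n".join(failures)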

Update Process: Baselines are updated via PR (review required), and only for intentional changes, never to absorb regressions.

TBD: Storage format, directory structure, approval process (see Open Questions).


Phased Implementation

Phase 0: Foundation

Deliverables:

  1. Test infrastructure + baseline storage
  2. Eagle3 baseline (Llama-3.1-8B, 10-20 runs)
  3. One working regression test
  4. Test markers, optional nightly CI

Success: Eagle3 baseline established, one test passing, documented, extensible.


Phase 1: Eagle3 Comprehensive Coverage

Deliverables:

  1. Correctness, quality, performance tests for Eagle3
  2. Baselines for 3+ models (Llama-3.1-8B, Qwen3-8B, optionally larger)
  3. Nightly CI integration + alerting
  4. Documentation

Success: 3+ models tested, nightly tests running, documented.


Phase 2: Expand to Core Variants

Deliverables:

  1. EAGLE tests + baselines (2-3 models)
  2. N-gram tests + baselines
  3. Optional: Medusa tests
  4. Optional: Cross-variant analysis

Success: 3-4 variants tested, integrated into nightly CI.


Phase 3: MTP Variants and Advanced Scenarios

Deliverables:

  1. 2-3 MTP variants (DeepSeek MTP, MiMo MTP)
  2. Advanced scenarios: multimodal, long context, multi-GPU, or alt hardware
  3. Integration with existing benchmarks

Success: 2+ MTP variants tested, 1+ advanced scenario validated.


Phase 4: Optimization and Operationalization

Deliverables:

  1. Performance optimization (parallelization, caching, remove redundant tests)
  2. Reliability improvements (reduce false positives, retry logic, better diagnostics)
  3. Feedback loop and threshold tuning

Success: <10% false positives, <15min nightly runtime, docs complete.


CI Resource Management

Resource Usage

Test Tiers:

Test Tier        | Frequency | Variants   | Models | GPU              | Duration
PR Smoke         | Per PR    | 1 (Eagle3) | 1      | L4               | 5 min
Nightly          | Daily     | 4          | 3      | L4 + A100        | 45 min
Weekly Deep Dive | Weekly    | 10         | 5      | L4 + A100 + H100 | 3 hours

Optimization Strategies

  1. Smart Triggering

    • Only run spec decode tests when relevant files change:
      source_file_dependencies:
        - vllm/v1/spec_decode/
        - vllm/model_executor/models/*eagle*.py
        - vllm/model_executor/models/*medusa*.py
        - vllm/config/speculative.py
        - tests/v1/regression/spec_decode/
  2. Test Sharding

    • Parallelize test execution across multiple workers
    • Use Buildkite parallelism feature:
      - label: "Spec Decode Regression (Eagle3)"
        parallelism: 3
        command: pytest -v -s v1/regression/spec_decode/test_eagle3.py::shard_${BUILDKITE_PARALLEL_JOB}
  3. Progressive Coverage

    • Start with 1-2 variants, expand gradually
    • Prioritize high-value tests (most frequently used variants)
  4. Test Result Caching

    • Skip tests if commit doesn't touch relevant code paths
    • Cache test results for unchanged code

Resource Allocation:

# .buildkite/test-pipeline.yaml additions

- label: "🦅 Spec Decode Smoke Test (Eagle3)"
  fast_check: true  # Run on all PRs
  timeout_in_minutes: 10
  gpu: "l4"  # Cheapest GPU
  source_file_dependencies:
    - vllm/v1/spec_decode/
    - vllm/model_executor/models/llama_eagle3.py
    - vllm/config/speculative.py
    - tests/v1/regression/spec_decode/
  commands:
    - pytest -v -s tests/v1/regression/spec_decode/test_eagle3_smoke.py

- label: "🦅 Spec Decode Nightly Regression"
  # Only on nightly schedule, not per-PR
  schedule: "0 2 * * *"  # 2 AM daily
  timeout_in_minutes: 60
  gpu: "a100"
  num_gpus: 1
  parallelism: 4  # Shard across 4 workers
  commands:
    - pytest -v -s tests/v1/regression/spec_decode/ --shard-id=${BUILDKITE_PARALLEL_JOB} --num-shards=${BUILDKITE_PARALLEL_JOB_COUNT}
  artifact_paths:
    - "test_results/*.json"  # Upload metrics
  notify:
    - slack: "#vllm-ci-alerts"
      if: build.state == "failed"

- label: "🦅 Spec Decode Weekly Deep Dive"
  schedule: "0 4 * * 0"  # Sunday 4 AM
  timeout_in_minutes: 240
  gpu: "a100"
  num_gpus: 4
  commands:
    - pytest -v -s tests/v1/regression/spec_decode/ --comprehensive --all-variants
  artifact_paths:
    - "test_results/*.json"
    - "benchmark_results/*.json"

Benchmarking Strategy

Integration with Existing Infrastructure

vLLM already has robust benchmarking infrastructure:

  • benchmarks/benchmark_serving.py for online serving
  • benchmarks/benchmark_throughput.py for offline throughput
  • benchmarks/benchmark_latency.py for latency profiling
  • .buildkite/nightly-benchmarks/ for continuous benchmarking

Proposal: Extend existing benchmarks with speculative decoding focus

Benchmark Suite

1. Latency Benchmarks

Script: benchmarks/benchmark_spec_decode_latency.py (new)

"""
Benchmark speculative decoding latency metrics.

Measures:
- TTFT (Time to First Token)
- TPOT (Time per Output Token)
- E2E Latency
- Per-step latency breakdown

Usage:
    python benchmarks/benchmark_spec_decode_latency.py \
        --method eagle3 \
        --model meta-llama/Llama-3.1-8B-Instruct \
        --draft-model yuhuili/EAGLE3-LLaMA3.1-Instruct-8B \
        --num-prompts 100 \
        --dataset sharegpt
"""

Metrics:

  • TTFT p50, p95, p99
  • TPOT p50, p95, p99
  • E2E latency distribution
  • Latency overhead vs non-speculative baseline

2. Throughput Benchmarks

Script: Extend benchmarks/benchmark_throughput.py

Additions:

  • Compare throughput with/without speculative decoding
  • Measure throughput degradation at different batch sizes
  • Identify optimal batch size for speculative decoding
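
A rough offline sketch of the with/without comparison such an extension needs. Model and draft names reuse the baseline examples in this document; in a real benchmark the two engines would run in separate processes.

# Sketch: offline throughput with and without speculative decoding.
# Model/draft names reuse the examples above; the prompt set is illustrative.
import time

from vllm import LLM, SamplingParams


def measure_throughput(llm: LLM, prompts: list[str]) -> float:
    params = SamplingParams(temperature=0.0, max_tokens=256)
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    return generated / elapsed  # output tokens per second


prompts = ["Explain speculative decoding in one paragraph."] * 32

ref_llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_model_len=2048)
ref_tps = measure_throughput(ref_llm, prompts)
del ref_llm  # free GPU memory before loading the speculative engine

spec_llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=2048,
    speculative_config={
        "method": "eagle3",
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 5,
    },
)
spec_tps = measure_throughput(spec_llm, prompts)
print(f"Throughput ratio (spec / reference): {spec_tps / ref_tps:.2f}x")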

3. Quality Benchmarks

Script: benchmarks/benchmark_spec_decode_quality.py (new)

Metrics:

  • Acceptance rate across different prompt types
  • Average drafts per step
  • Draft efficiency (accepted tokens / proposed tokens)
  • Speedup ratio vs non-speculative

4. Serving Benchmarks

Integration: Extend .buildkite/nightly-benchmarks/tests/serving-tests.json

Current Coverage:

{
  "test_name": "serving_llama70B_tp4_sharegpt_specdecode",
  "qps_list": [2],
  ...
}

Proposed Expansion:

[
  {
    "test_name": "serving_llama8B_eagle3_sharegpt",
    "qps_list": [1, 2, 5, 10],
    "server_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "tensor_parallel_size": 1,
      "speculative_config": {
        "method": "eagle3",
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 5
      }
    }
  },
  {
    "test_name": "serving_qwen8B_eagle3_mt_bench",
    "qps_list": [1, 2, 5],
    "server_parameters": {
      "model": "Qwen/Qwen3-8B",
      "tensor_parallel_size": 1,
      "speculative_config": {
        "method": "eagle3",
        "model": "AngelSlim/Qwen3-8B_eagle3",
        "num_speculative_tokens": 5
      }
    }
  },
  {
    "test_name": "serving_llama8B_ngram_sharegpt",
    "qps_list": [1, 5, 10],
    "server_parameters": {
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "speculative_config": {
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 5
      }
    }
  }
]

Baseline Tracking

Note: Uses the same simplified baseline format as regression tests (see section 4).

Storage: JSON files in tests/v1/regression/baselines/

Example: eagle3_llama3.1_8b_benchmark.json

{
  "variant": "eagle3",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "draft_model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
  "gpu_type": "L4",
  "dataset": "sharegpt",
  "baselines": {
    "ttft_p50_ms": {
      "expected": 44.0,
      "tolerance_pct": 15
    },
    "tpot_p50_ms": {
      "expected": 12.3,
      "tolerance_pct": 15
    },
    "acceptance_rate": {
      "expected": 0.73,
      "tolerance_pct": 10
    },
    "speedup": {
      "expected": 2.1,
      "tolerance_pct": 15
    }
  },
  "last_updated": "2025-11-05",
  "baseline_commit": "3481e4074",
  "notes": "Baseline from 10 benchmark runs on L4 GPU"
}

Metrics Collection and Reporting

Purpose: Track regression test results over time to identify trends and provide visibility.

Open Questions for Discussion:

  1. Metrics Storage:

    • Where to store test run results? (CI artifacts, S3/GCS, time-series DB)
    • How long to retain historical data?
    • Should metrics be queryable? If so, what infrastructure?
  2. Visualization and Dashboards:

    • Do we need real-time dashboards? (e.g., Grafana, custom web UI)
    • Or is static reporting sufficient? (HTML reports, markdown summaries)

Minimal Requirements:

At minimum, the testing infrastructure should:

  • Export test results as structured data (JSON, CSV, etc.)
  • Provide clear pass/fail status for CI integration
  • Generate human-readable summaries of test runs
  • Store results as CI artifacts for debugging
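
A minimal sketch of that export path (file names and the result layout are illustrative):

# Sketch: dump structured results for CI artifacts and print a short summary.
import json
from datetime import datetime, timezone
from pathlib import Path


def export_results(results: dict[str, dict], out_dir: str = "test_results") -> bool:
    """Write one JSON artifact per run and return overall pass/fail."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    payload = {"timestamp": datetime.now(timezone.utc).isoformat(),
               "results": results}
    (out / "spec_decode_regression.json").write_text(json.dumps(payload, indent=2))

    passed = sum(r["passed"] for r in results.values())
    print(f"Overall: {'PASSED' if passed == len(results) else 'FAILED'} "
          f"({passed}/{len(results)} variants passing)")
    for name, r in results.items():
        print(f"  {name}: {'PASS' if r['passed'] else 'FAIL'}")
    return passed == len(results)


# Example:
# export_results({"eagle3/llama-3.1-8b": {"passed": True,
#                                         "metrics": {"acceptance_rate": 0.742}}})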

Optional Enhancements:

If resources and infrastructure permit:

  • Historical trend visualization (acceptance rate over time, latency trends)
  • Comparative analysis (variant A vs variant B performance)
  • Automated regression bisection (identify which commit introduced regression)
  • Performance dashboards for community visibility

Example Minimal Report (Text Output):

Speculative Decoding Regression Test Summary
============================================
Date: 2025-11-05 | Commit: 3481e4074

Overall: ✓ PASSED (4/4 variants passing)

Eagle3 - Llama-3.1-8B-Instruct: ✓ PASS
  acceptance_rate: 0.742 [0.657-0.803] ✓
  speedup: 2.15x [1.79-2.42] ✓

EAGLE - Llama-3.1-8B-Instruct: ✓ PASS
  acceptance_rate: 0.681 [0.594-0.726] ✓
  speedup: 1.89x [1.53-2.07] ✓

N-gram - Llama-3.1-8B-Instruct: ✓ PASS
  match_rate: 0.78 [0.66-0.86] ✓

MTP - MiMo-7B: ✓ PASS
  acceptance_rate: 0.812 [0.72-0.88] ✓

Future Expansion

1. Hardware Diversity

Current Focus: NVIDIA L4, A100, H100 (expansion to AMD and Intel accelerators is an open question; see Open Questions)

Implementation:

  • Add hardware-specific baselines
  • Validate performance characteristics per hardware
  • Detect hardware-specific regressions

2. Multi-Modal Speculative Decoding

Current tests have limited multimodal coverage. Future expansion:

  • Vision + text speculative decoding
  • Audio + text speculative decoding
  • Validate modality-specific acceptance rates
  • Benchmark multimodal latency

Challenges:

  • Multimodal models are larger (resource intensive)
  • Acceptance rates may differ significantly from text-only
  • Need diverse multimodal datasets

3. Long Context Scenarios

Test speculative decoding performance at extreme context lengths:

  • 32k tokens
  • 64k tokens
  • 128k+ tokens

Metrics:

  • Memory usage vs context length
  • Acceptance rate degradation
  • Latency scaling

Open Questions for Discussion

This section outlines key decisions that require team input and consensus.

1. Infrastructure and Tooling

Baseline Storage:

  • Where should baselines be stored? (version control, external storage, database)
  • What format? (JSON, YAML, TOML, other)
  • How to organize? (by variant, by model, flat structure, hierarchical)

Metrics Collection:

  • What infrastructure for collecting test metrics? (CI artifacts, time-series DB, custom solution)
  • How long to retain historical data?

Dashboards and Reporting:

  • Do we need real-time dashboards or are static reports sufficient?
  • What should be visible to the community vs. internal only?
  • Integration with existing performance dashboards?

2. Testing Strategy

Test Coverage Priorities:

  • Which variants should be tested first? (prioritize by usage frequency)
  • What model sizes to cover? (focus on 8B models or include 70B+)
  • Hardware diversity: which GPUs to support? (NVIDIA only, or AMD/Intel too)

Tolerance Thresholds:

  • How to set initial tolerance ranges? (conservative vs. tight)
  • Should tolerances be metric-specific or uniform?
  • How to handle inherent GPU performance variance?

Test Frequency:

  • Which tests run per-PR, nightly, or weekly?
  • How to balance coverage vs. CI cost?
  • Should we have different test tiers (smoke, full, comprehensive)?

3. CI Failure Policy

  • Should regression tests block merges?
  • Hard fail vs. soft fail initially?
  • Different policies for different test categories?

4. Resources

Resource Management:

  • What CI budget is available?
  • Coverage vs. resource usage trade-offs
  • Optimization opportunities

5. Success Metrics

How do we measure success of this initiative?

  • Regression detection rate?
  • False positive rate?
  • Time to detect regressions?
  • Production incident reduction?

References

Code References

  • Speculative Decoding Implementations:

    • Eagle3: vllm/model_executor/models/llama_eagle3.py
    • EAGLE: vllm/v1/spec_decode/eagle.py
    • Medusa: vllm/v1/spec_decode/medusa.py
    • N-gram: vllm/v1/spec_decode/ngram_proposer.py
    • Configuration: vllm/config/speculative.py
  • Existing Tests:

    • E2E Correctness: tests/v1/e2e/test_spec_decode.py
    • Test Fixtures: tests/conftest.py
  • Benchmarks:

    • N-gram Proposer: benchmarks/benchmark_ngram_proposer.py
    • Serving: benchmarks/benchmark_serving.py
    • Latency: benchmarks/benchmark_latency.py
    • Nightly Config: .buildkite/nightly-benchmarks/tests/serving-tests.json
  • CI/CD:

    • Test Pipeline: .buildkite/test-pipeline.yaml (lines 307 and 362-364)
    • AMD Pipeline: .buildkite/test-amd.yaml (line 346)
    • GitHub Actions: .github/workflows/reminder_comment.yml
  • Examples:

    • Offline Inference: examples/offline_inference/spec_decode.py

Documentation References

  • Speculative Decoding Docs: docs/features/spec_decode.md
  • Contributing Guide: docs/contributing/README.md
  • vLLM Documentation: https://docs.vllm.ai


Success Metrics

  1. Catch 80% of regressions before release
  2. Detect within 24 hours (nightly tests)
  3. <5% false positive rate
  4. 100% variant coverage by Phase 3
  5. Zero production incidents from undetected regressions

Appendix

Appendix A: Test Matrix

Complete test coverage matrix for all variants:

Variant | Model             | TP Size | GPU  | Dataset  | Metrics Tracked                     | Priority
Eagle3  | Llama-3.1-8B      | 1       | L4   | ShareGPT | Correctness, Acceptance, TTFT, TPOT | P0
Eagle3  | Qwen3-8B          | 1       | L4   | mt-bench | Correctness, Acceptance, TTFT, TPOT | P0
Eagle3  | Llama-4-Scout-17B | 4       | H100 | ShareGPT | Correctness, Acceptance             | P1
EAGLE   | Llama-3.1-8B      | 1       | L4   | ShareGPT | Correctness, Acceptance             | P1
EAGLE   | DeepSeek-v3       | 1       | A100 | mt-bench | Correctness, Acceptance             | P1
N-gram  | Llama-3.1-8B      | 1       | L4   | ShareGPT | Correctness, Speedup                | P0
N-gram  | Qwen3-8B          | 1       | L4   | mt-bench | Correctness, Speedup                | P1
MTP     | MiMo-7B           | 1       | L4   | ShareGPT | Correctness, Acceptance             | P1
MTP     | DeepSeek-V3       | 1       | A100 | mt-bench | Correctness, Acceptance             | P1
Medusa  | [TBD]             | 1       | L4   | ShareGPT | Correctness, Acceptance             | P2

Priority Levels:

  • P0: Critical, must have in Phase 1
  • P1: Important, target for Phase 2
  • P2: Nice to have, Phase 3 or later

Appendix B: Simplified Baseline JSON Schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Simplified Speculative Decoding Baseline",
  "type": "object",
  "required": ["variant", "model", "baselines"],
  "properties": {
    "variant": {
      "type": "string",
      "enum": ["eagle3", "eagle", "ngram", "medusa", "mtp"],
      "description": "Speculative decoding method"
    },
    "model": {
      "type": "string",
      "description": "Target model name (e.g., meta-llama/Llama-3.1-8B-Instruct)"
    },
    "draft_model": {
      "type": "string",
      "description": "Draft model name if applicable (omit for ngram)"
    },
    "baselines": {
      "type": "object",
      "description": "Metrics with expected values and tolerance percentages",
      "patternProperties": {
        "^.*$": {
          "type": "object",
          "required": ["expected", "tolerance_pct"],
          "properties": {
            "expected": {
              "type": "number",
              "description": "Expected baseline value"
            },
            "tolerance_pct": {
              "type": "number",
              "minimum": 0,
              "maximum": 100,
              "description": "Acceptable deviation percentage (e.g., 10 = ±10%)"
            }
          }
        }
      }
    },
    "last_updated": {
      "type": "string",
      "format": "date",
      "description": "Date baseline was last updated (YYYY-MM-DD)"
    },
    "baseline_commit": {
      "type": "string",
      "description": "Git commit SHA where baseline was established (optional)"
    },
    "notes": {
      "type": "string",
      "description": "Free-form notes about baseline establishment (optional)"
    }
  }
}

Example Baseline File:

See section 4 for a complete example of eagle3_llama3.1_8b.json
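
If this schema is adopted, baseline files could be validated in CI with the third-party jsonschema package; a sketch follows (file paths are assumptions):

# Sketch: validate every baseline file against the schema in a unit test.
# Paths are assumptions; requires the third-party jsonschema package.
import json
from pathlib import Path

import jsonschema
import pytest

SCHEMA_PATH = Path("tests/v1/regression/baseline_schema.json")
BASELINE_FILES = sorted(Path("tests/v1/regression/baselines").glob("*.json"))


@pytest.mark.parametrize("path", BASELINE_FILES, ids=lambda p: p.name)
def test_baseline_file_matches_schema(path: Path) -> None:
    schema = json.loads(SCHEMA_PATH.read_text())
    jsonschema.validate(instance=json.loads(path.read_text()), schema=schema)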

Appendix C: Example Test Output

===== Speculative Decoding Regression Test Results =====

Test: Eagle3 Regression - Llama-3.1-8B-Instruct
Status: ✓ PASSED
Duration: 4m 32s

Baseline: eagle3_llama3.1_8b.json

Metrics (all within tolerance):
  match_rate:       0.765 within [0.675, 0.825] (expected: 0.75 ±10%) ✓
  acceptance_rate:  0.741 within [0.657, 0.803] (expected: 0.73 ±10%) ✓
  ttft_p50_ms:      44.8  within [37.4, 50.6]   (expected: 44.0 ±15%) ✓
  tpot_p50_ms:      12.9  within [10.5, 14.2]   (expected: 12.3 ±15%) ✓
  speedup:          2.08  within [1.79, 2.42]   (expected: 2.10 ±15%) ✓

Details:
  Test Prompts: 100 (mixed categories)
  GPU: L4 (1x)
  Dataset: ShareGPT
  Commit: 3481e4074
  Timestamp: 2025-11-05T14:32:15Z

---

Example FAILURE output:

Test: Eagle3 Regression - Llama-3.1-8B-Instruct
Status: ✗ FAILED
Duration: 4m 18s

Regression detected:
  - acceptance_rate: 0.635 outside range [0.657, 0.803] (expected: 0.73 ±10%)
  - speedup: 1.62 outside range [1.79, 2.42] (expected: 2.10 ±15%)

Baseline: eagle3_llama3.1_8b.json
Last updated: 2025-11-05
Commit: 3481e4074

Action Required:
  1. Investigate recent changes to spec decode code
  2. Check if this is an intentional change requiring baseline update
  3. If regression is real, bisect to find problematic commit

=======================================================

Feedback Period.

2 Weeks

CC List.

@benchislett @njhill @DarkLight1337 @aarnphm

Any Other Things.

No response

