ai-dynamo · dagil-nvidia · Feb 5, 2026
diff --git a/docs/backends/vllm/README.md b/docs/backends/vllm/README.md
@@ -146,6 +146,8 @@ This setup demonstrates how to use Dynamo to create an instance using Eagle-base
 
 **Guide:** [Speculative Decoding Quickstart](./speculative_decoding.md)
 
+> **See also:** [Speculative Decoding Feature Overview](../../features/speculative_decoding/README.md) for cross-backend documentation.
+
 ### Kubernetes Deployment
 
 For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [vLLM Kubernetes Deployment Guide](../../../examples/backends/vllm/deploy/README.md)

diff --git a/docs/backends/vllm/speculative_decoding.md b/docs/backends/vllm/speculative_decoding.md
@@ -14,6 +14,11 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 -->
+
+> **Note**: This content has moved to [Speculative Decoding with vLLM](../../features/speculative_decoding/speculative_decoding_vllm.md).
+> See [Speculative Decoding Overview](../../features/speculative_decoding/README.md) for cross-backend documentation.
+> This file will be removed in a future release.
+
 # Running **Meta-Llama-3.1-8B-Instruct** with Speculative Decoding (Eagle3)
 
 This guide walks through how to deploy **Meta-Llama-3.1-8B-Instruct** using **aggregated speculative decoding** with **Eagle3** on a single node.

diff --git a/docs/conf.py b/docs/conf.py
@@ -95,6 +95,8 @@
     "backends/sglang/multimodal_epd": "../../multimodal/sglang.html",
     "backends/sglang/multimodal_sglang_guide": "../../multimodal/sglang.html",
     "multimodal/multimodal_intro": "index.html",
+    # Speculative decoding consolidation (PR speculative-migration)
+    "backends/vllm/speculative_decoding": "../../features/speculative_decoding/speculative_decoding_vllm.html",
 }
 
 # Custom extensions
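Note on the `docs/conf.py` hunk: the new redirect entry can be sanity-checked by resolving its target against the old page's location. The sketch below assumes the dictionary follows the convention the neighbouring entries appear to use (and that extensions such as `sphinx_reredirects` use), where each target is relative to the directory of the page being redirected. `REDIRECTS` and `resolve_redirect` are illustrative names, not identifiers from the repository.

```python
import posixpath

# Entries copied from the hunk above. Keys are docnames (no extension);
# values are assumed to be relative to the directory of the old page.
REDIRECTS = {
    "backends/sglang/multimodal_epd": "../../multimodal/sglang.html",
    "multimodal/multimodal_intro": "index.html",
    "backends/vllm/speculative_decoding": "../../features/speculative_decoding/speculative_decoding_vllm.html",
}


def resolve_redirect(docname: str, target: str) -> str:
    """Return the path, relative to the docs root, that a redirect lands on."""
    old_dir = posixpath.dirname(docname)
    return posixpath.normpath(posixpath.join(old_dir, target))


for docname, target in REDIRECTS.items():
    print(f"{docname} -> {resolve_redirect(docname, target)}")
```

Under that assumption the new entry resolves to `features/speculative_decoding/speculative_decoding_vllm.html`, which matches the location of the file added later in this diff.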

diff --git a/docs/features/speculative_decoding/README.md b/docs/features/speculative_decoding/README.md
@@ -0,0 +1,97 @@
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Speculative Decoding
+
+Speculative decoding is an optimization technique in which a smaller "draft" model proposes several tokens ahead and the main (target) model verifies them in parallel. Because one forward pass of the main model can confirm more than one token, this can significantly reduce latency for autoregressive generation.
+
+## Backend Support
+
+| Backend | Status | Notes |
+|---------|--------|-------|
+| vLLM | ✅ | Eagle3 draft model support |
+| SGLang | 🚧 | Not yet documented |
+| TensorRT-LLM | 🚧 | Not yet documented |
+
+## Overview
+
+Speculative decoding works by:
+
+1. **Draft phase**: A smaller, faster model generates candidate tokens
+2. **Verify phase**: The main model verifies these candidates in a single forward pass
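+3. **Accept/reject**: Tokens are accepted as long as they match what the main model would have generated; at the first mismatch, the main model's token is used and the remaining draft tokens are discarded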
+
+This approach trades additional compute for lower latency, because a single forward pass of the main model can yield multiple tokens.
+
+## Quick Start (vLLM + Eagle3)
+
+This guide walks through deploying **Meta-Llama-3.1-8B-Instruct** with **Eagle3** speculative decoding on a single GPU with at least 16GB VRAM.
+
+### Prerequisites
+
+1. Start infrastructure services:
+
+```bash
+docker compose -f deploy/docker-compose.yml up -d
+```
+
+2. Build and run the vLLM container:
+
+```bash
+./container/build.sh --framework VLLM
+./container/run.sh -it --framework VLLM --mount-workspace
+```
+
+3. Set up Hugging Face access (Meta-Llama-3.1-8B-Instruct is gated):
+
+```bash
+export HUGGING_FACE_HUB_TOKEN="your_token_here"
+export HF_TOKEN=$HUGGING_FACE_HUB_TOKEN
+```
+
+### Run Speculative Decoding
+
+```bash
+cd examples/backends/vllm
+bash launch/agg_spec_decoding.sh
+```
+
+### Test the Deployment
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+     "messages": [
+       {"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
+     ],
+     "max_tokens": 250
+   }'
+```
+
+## Backend-Specific Guides
+
+| Backend | Guide |
+|---------|-------|
+| vLLM | [speculative_decoding_vllm.md](./speculative_decoding_vllm.md) |
+
+## See Also
+
+- [vLLM Backend](../../backends/vllm/README.md) - Full vLLM deployment guide
+- [Disaggregated Serving](../../design_docs/disagg_serving.md) - Alternative optimization approach
+- [Meta-Llama-3.1-8B-Instruct on Hugging Face](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
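The draft, verify, and accept/reject steps listed in the Overview of the new README can be sketched in a few lines of Python. This is a toy illustration under greedy decoding, not how vLLM or Dynamo implement it: the two "models" are hypothetical functions that map a token prefix to the next token, and `speculative_step`, `generate`, and `greedy` are made-up names.

```python
from typing import Callable, List, Tuple

# A "model" is a hypothetical stand-in: prefix of token ids -> next token id.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Run one draft/verify/accept round and return the newly accepted tokens."""
    # Draft phase: the small model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # Verify phase: the target checks every drafted position. A real engine
    # does this in one batched forward pass; this loop only emulates it.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # Reject: keep the target's token, discard the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # All draft tokens matched, so the same pass also yields one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate tokens speculatively; also return the number of verify rounds."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        tokens += speculative_step(target, draft, tokens, k)
        rounds += 1
    return tokens[: len(prompt) + max_new_tokens], rounds


def greedy(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def toy_target(prefix: List[int]) -> int:
    # Toy target: counts upward modulo 10.
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    # Toy draft: agrees with the target except right after a 5.
    return 0 if prefix[-1] == 5 else (prefix[-1] + 1) % 10


output, rounds = generate(toy_target, toy_draft, [0], max_new_tokens=12, k=4)
print(output == greedy(toy_target, [0], 12), rounds)
```

With these toy models the speculative output is identical to plain greedy decoding from the target, but it takes 4 verification rounds rather than 12 sequential target calls. Production engines that sample instead of decoding greedily use a probabilistic acceptance rule rather than the exact-match check shown here.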
diff --git a/docs/features/speculative_decoding/speculative_decoding_vllm.md b/docs/features/speculative_decoding/speculative_decoding_vllm.md
@@ -0,0 +1,132 @@
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Speculative Decoding with vLLM
+
+This guide covers speculative decoding with the vLLM backend.
+
+> **See also**: [Speculative Decoding Overview](./README.md) for cross-backend documentation.
+
+## Prerequisites
+
+- vLLM container with Eagle3 support
+- GPU with at least 16GB VRAM
+- Hugging Face access token (for gated models)
+
+## Quick Start: Meta-Llama-3.1-8B-Instruct + Eagle3
+
+This guide walks through deploying **Meta-Llama-3.1-8B-Instruct** with **Eagle3** speculative decoding on a single node.
+
+### Step 1: Set Up Your Docker Environment
+
+First, initialize a Docker container using the vLLM backend. See the [vLLM Quickstart Guide](../../backends/vllm/README.md#vllm-quick-start) for details.
+
+```bash
+# Launch infrastructure services
+docker compose -f deploy/docker-compose.yml up -d
+
+# Build the container
+./container/build.sh --framework VLLM
+
+# Run the container
+./container/run.sh -it --framework VLLM --mount-workspace
+```
+
+### Step 2: Get Access to the Llama 3.1 Model
+
+The **Meta-Llama-3.1-8B-Instruct** model is gated. Request access on Hugging Face:
+[Meta-Llama-3.1-8B-Instruct repository](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
+
+Approval time varies depending on Hugging Face review traffic.
+
+Once approved, set your access token inside the container:
+
+```bash
+export HUGGING_FACE_HUB_TOKEN="insert_your_token_here"
+export HF_TOKEN=$HUGGING_FACE_HUB_TOKEN
+```
+
+### Step 3: Run Aggregated Speculative Decoding
+
+```bash
+# Requires only one GPU
+cd examples/backends/vllm
+bash launch/agg_spec_decoding.sh
+```
+
+Once the weights have been downloaded and the models have loaded, the server is ready for inference requests.
+
+### Step 4: Test the Deployment
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+     "messages": [
+       {"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
+     ],
+     "max_tokens": 250
+   }'
+```
+
+### Example Output
+
+```json
+{
+  "id": "cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8",
+  "choices": [
+    {
+      "message": {
+        "role": "assistant",
+        "content": "In cherry blossom's gentle breeze ... A delicate balance of life and death, as petals fade, and new life breathes."
+      },
+      "index": 0,
+      "finish_reason": "stop"
+    }
+  ],
+  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+  "usage": {
+    "prompt_tokens": 16,
+    "completion_tokens": 250,
+    "total_tokens": 266
+  }
+}
+```
+
+## Configuration
+
+This example runs speculative decoding in vLLM with Eagle3 as the draft model. The launch script configures:
+
+- Target model: `meta-llama/Meta-Llama-3.1-8B-Instruct`
+- Draft model: Eagle3 variant
+- Aggregated serving mode
+
+See `examples/backends/vllm/launch/agg_spec_decoding.sh` for the full configuration.
+
+## Limitations
+
+- Only Eagle3 is currently supported as the draft model
+- The target and draft models must have compatible architectures
+
+## See Also
+
+| Document | Path |
+|----------|------|
+| Speculative Decoding Overview | [README.md](./README.md) |
+| vLLM Backend Guide | [vLLM README](../../backends/vllm/README.md) |
+| Meta-Llama-3.1-8B-Instruct | [Hugging Face](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) |
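The curl request in Step 4 of the vLLM guide can also be issued from Python. The following is a minimal sketch using only the standard library, assuming the server from Step 3 is listening on `localhost:8000` as in the guide; `build_chat_request` and `send` are illustrative helper names.

```python
import json
import urllib.request

# Endpoint and model name are taken from the curl example in Step 4.
ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"


def build_chat_request(prompt: str, max_tokens: int = 250) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Building the request does not touch the network.
request = build_chat_request("Write a poem about why Sakura trees are beautiful.")
print(request.get_method(), request.full_url)
```

Once the server is up, `send(request)` returns the generated text, which corresponds to the `choices[0].message.content` field in the example output.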
diff --git a/docs/hidden_toctree.rst b/docs/hidden_toctree.rst
@@ -83,6 +83,9 @@
    backends/vllm/prompt-embeddings.md
    backends/vllm/speculative_decoding.md
 
+   features/speculative_decoding/README.md
+   features/speculative_decoding/speculative_decoding_vllm.md
+
    benchmarks/kv-router-ab-testing.md
 
    mocker/mocker.md