Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ Built in Rust for performance and Python for extensibility, Dynamo is fully open
|---|:----:|:----------:|:--:|
| **Best For** | High-throughput serving | Maximum performance | Broadest feature coverage |
| [**Disaggregated Serving**](docs/design_docs/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**KV-Aware Routing**](docs/router/kv_cache_routing.md) | ✅ | ✅ | ✅ |
| [**KV-Aware Routing**](docs/router/README.md) | ✅ | ✅ | ✅ |
| [**SLA-Based Planner**](docs/planner/sla_planner.md) | ✅ | ✅ | ✅ |
| [**KVBM**](docs/kvbm/README.md) | 🚧 | ✅ | ✅ |
| [**Multimodal**](docs/multimodal/index.md) | ✅ | ✅ | ✅ |
Expand Down Expand Up @@ -388,7 +388,7 @@ See [SGLang on Slurm](examples/backends/sglang/slurm_jobs/README.md) and [TRT-LL

<!-- Reference links for Feature Compatibility Matrix -->
[disagg]: docs/design_docs/disagg_serving.md
[kv-routing]: docs/router/kv_cache_routing.md
[kv-routing]: docs/router/README.md
[planner]: docs/planner/sla_planner.md
[kvbm]: docs/kvbm/README.md
[mm]: examples/multimodal/
Expand Down
4 changes: 2 additions & 2 deletions benchmarks/router/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -127,7 +127,7 @@ To see all available router arguments, run:
python -m dynamo.frontend --help
```

For detailed explanations of router arguments (especially KV cache routing parameters), see the [KV Cache Routing documentation](../../docs/router/kv_cache_routing.md).
For detailed explanations of router arguments (especially KV cache routing parameters), see the [Router Guide](../../docs/router/router_guide.md).

> [!Note]
> If you're unsure whether your backend engines correctly emit KV events for certain models (e.g., hybrid models like gpt-oss or nemotron nano 2), use the `--no-kv-events` flag to disable KV event tracking and use approximate KV indexing instead:
Expand All @@ -146,7 +146,7 @@ When you launch prefill workers using `run_engines.sh --prefill`, the frontend a
- Uses the same routing mode as the frontend's `--router-mode` setting
- Seamlessly integrates with your decode workers for token generation

No additional configuration is needed - simply launch both decode and prefill workers, and the system handles the rest. See the [KV Cache Routing documentation](../../docs/router/kv_cache_routing.md#disaggregated-serving-prefill-and-decode) for more details.
No additional configuration is needed - simply launch both decode and prefill workers, and the system handles the rest. See the [Router Guide](../../docs/router/router_guide.md#disaggregated-serving) for more details.

> [!Note]
> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh)
Expand Down
9 changes: 5 additions & 4 deletions components/src/dynamo/router/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@

# Standalone Router

A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see the [KV Cache Routing documentation](/docs/router/kv_cache_routing.md).
A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see the [Router Guide](/docs/router/router_guide.md).

## Overview

Expand All @@ -29,7 +29,7 @@ python -m dynamo.router \
- `--endpoint`: Full endpoint path for workers in the format `namespace.component.endpoint` (e.g., `dynamo.prefill.generate`)

**Router Configuration:**
For detailed descriptions of all KV router configuration options including `--block-size`, `--kv-overlap-score-weight`, `--router-temperature`, `--no-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, and `--no-track-active-blocks`, see the [KV Cache Routing documentation](/docs/router/kv_cache_routing.md).
For detailed descriptions of all KV router configuration options including `--block-size`, `--kv-overlap-score-weight`, `--router-temperature`, `--no-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, and `--no-track-active-blocks`, see the [Router Guide](/docs/router/router_guide.md).

## Architecture

Expand All @@ -43,7 +43,7 @@ Clients query the `find_best_worker` endpoint to determine which worker should p
## Example: Manual Disaggregated Serving (Alternative Setup)

> [!Note]
> **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See the [KV Cache Routing documentation](../../../../docs/router/kv_cache_routing.md#disaggregated-serving-prefill-and-decode) for the default setup.
> **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See the [Router Guide](/docs/router/router_guide.md#disaggregated-serving) for the default setup.
>
> Use this manual setup if you need explicit control over prefill routing configuration or want to manage prefill and decode routers separately.

Expand Down Expand Up @@ -103,6 +103,7 @@ See [`components/src/dynamo/vllm/handlers.py`](../vllm/handlers.py) for a refere

## See Also

- [KV Cache Routing Architecture](/docs/router/kv_cache_routing.md) - Detailed explanation of KV-aware routing
- [Router Guide](/docs/router/router_guide.md) - Configuration and tuning for KV-aware routing
- [Router Design](/docs/design_docs/router_design.md) - Architecture details and event transport modes
- [Frontend Router](../frontend/README.md) - Main HTTP frontend with integrated routing
- [Router Benchmarking](/benchmarks/router/README.md) - Performance testing and tuning
10 changes: 5 additions & 5 deletions deploy/inference-gateway/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -216,11 +216,11 @@ Common Vars for Routing Configuration:
- Set `DYN_ENFORCE_DISAGG=true` if you want to enforce every request being served in the disaggregated manner. By default it is false meaning if the the prefill worker is not available the request will be served in the aggregated manner.
- By default the Dynamo plugin uses KV routing. You can expose `DYN_USE_KV_ROUTING=false` in your [values.yaml](standalone/helm/dynamo-gaie/values.yaml) if you prefer to route in the round-robin fashion.
- If using kv-routing:
- Overwrite the `DYN_KV_BLOCK_SIZE` in your [values.yaml](standalone/helm/dynamo-gaie/values.yaml) to match your model's block size. The `DYN_KV_BLOCK_SIZE` env var is ***MANDATORY*** to prevent silent KV routing failures.
- Set `DYN_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes.
- Set `DYN_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
- Set `DYN_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing
- See the [KV cache routing design](../../docs/router/kv_cache_routing.md) for details.
- Overwrite the `DYN_KV_BLOCK_SIZE` in your [values-dynamo-epp.yaml](./values-dynamo-epp.yaml) to match your model's block size.The `DYN_KV_BLOCK_SIZE` env var is ***MANDATORY*** to prevent silent KV routing failures.
- Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes.
- Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
- Set `DYNAMO_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing
- See the [Router Guide](../../docs/router/router_guide.md) for details.


Stand-Alone installation only:
Expand Down
2 changes: 1 addition & 1 deletion docs/backends/sglang/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
|---------|--------|-------|
| [**Disaggregated Serving**](../../design_docs/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
| [**KV-Aware Routing**](../../router/kv_cache_routing.md) | ✅ | |
| [**KV-Aware Routing**](../../router/README.md) | ✅ | |
| [**SLA-Based Planner**](../../planner/sla_planner.md) | ✅ | |
| [**Multimodal Support**](../../multimodal/sglang.md) | ✅ | |
| [**KVBM**](../../kvbm/README.md) | ❌ | Planned |
Expand Down
4 changes: 2 additions & 2 deletions docs/backends/trtllm/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
|---------|--------------|-------|
| [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
| [**KV-Aware Routing**](../../../docs/router/kv_cache_routing.md) | ✅ | |
| [**KV-Aware Routing**](../../router/README.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | Planned |
| [**KVBM**](../../../docs/kvbm/README.md) | ✅ | |
Expand Down Expand Up @@ -113,7 +113,7 @@ apt-get update && apt-get -y install git git-lfs
> [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals.

For detailed information about the architecture and how KV-aware routing works, see the [KV Cache Routing documentation](../../router/kv_cache_routing.md).
For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../router/router_guide.md).

### Aggregated
```bash
Expand Down
4 changes: 2 additions & 2 deletions docs/backends/vllm/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
|---------|------|-------|
| [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
| [**KV-Aware Routing**](../../../docs/router/kv_cache_routing.md) | ✅ | |
| [**KV-Aware Routing**](../../router/README.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | WIP |
| [**KVBM**](../../../docs/kvbm/README.md) | ✅ | |
Expand Down Expand Up @@ -179,7 +179,7 @@ When using KV-aware routing, ensure deterministic hashing across processes to av
```bash
vllm serve ... --enable-prefix-caching --prefix-caching-algo sha256
```
See the high-level notes in [KV Cache Routing](../../../docs/router/kv_cache_routing.md) on deterministic event IDs.
See the high-level notes in [Router Design](../../design_docs/router_design.md#deterministic-event-ids) on deterministic event IDs.

## Request Migration

Expand Down
2 changes: 1 addition & 1 deletion docs/design_docs/architecture.md
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,7 @@ To address the growing demands of distributed inference serving, NVIDIA introduc
The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features:

- [Dynamo Disaggregated Serving](disagg_serving.md)
- [Dynamo Smart Router](../router/kv_cache_routing.md)
- [Dynamo Smart Router](../router/README.md)
- [Dynamo KV Cache Block Manager](../kvbm/kvbm_intro.rst)
- [Planner](../planner/planner_intro.rst)
- [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
Expand Down
Loading
Loading