diff --git a/README.md b/README.md index 7e2ac34cce8..5eedabd5f08 100644 --- a/README.md +++ b/README.md @@ -52,7 +52,7 @@ Built in Rust for performance and Python for extensibility, Dynamo is fully open |---|:----:|:----------:|:--:| | **Best For** | High-throughput serving | Maximum performance | Broadest feature coverage | | [**Disaggregated Serving**](docs/design_docs/disagg_serving.md) | ✅ | ✅ | ✅ | -| [**KV-Aware Routing**](docs/router/kv_cache_routing.md) | ✅ | ✅ | ✅ | +| [**KV-Aware Routing**](docs/router/README.md) | ✅ | ✅ | ✅ | | [**SLA-Based Planner**](docs/planner/sla_planner.md) | ✅ | ✅ | ✅ | | [**KVBM**](docs/kvbm/README.md) | 🚧 | ✅ | ✅ | | [**Multimodal**](docs/multimodal/index.md) | ✅ | ✅ | ✅ | @@ -388,7 +388,7 @@ See [SGLang on Slurm](examples/backends/sglang/slurm_jobs/README.md) and [TRT-LL [disagg]: docs/design_docs/disagg_serving.md -[kv-routing]: docs/router/kv_cache_routing.md +[kv-routing]: docs/router/README.md [planner]: docs/planner/sla_planner.md [kvbm]: docs/kvbm/README.md [mm]: examples/multimodal/ diff --git a/benchmarks/router/README.md b/benchmarks/router/README.md index 008872d572f..c009762caa7 100644 --- a/benchmarks/router/README.md +++ b/benchmarks/router/README.md @@ -127,7 +127,7 @@ To see all available router arguments, run: python -m dynamo.frontend --help ``` -For detailed explanations of router arguments (especially KV cache routing parameters), see the [KV Cache Routing documentation](../../docs/router/kv_cache_routing.md). +For detailed explanations of router arguments (especially KV cache routing parameters), see the [Router Guide](../../docs/router/router_guide.md). 
> [!Note] > If you're unsure whether your backend engines correctly emit KV events for certain models (e.g., hybrid models like gpt-oss or nemotron nano 2), use the `--no-kv-events` flag to disable KV event tracking and use approximate KV indexing instead: @@ -146,7 +146,7 @@ When you launch prefill workers using `run_engines.sh --prefill`, the frontend a - Uses the same routing mode as the frontend's `--router-mode` setting - Seamlessly integrates with your decode workers for token generation -No additional configuration is needed - simply launch both decode and prefill workers, and the system handles the rest. See the [KV Cache Routing documentation](../../docs/router/kv_cache_routing.md#disaggregated-serving-prefill-and-decode) for more details. +No additional configuration is needed - simply launch both decode and prefill workers, and the system handles the rest. See the [Router Guide](../../docs/router/router_guide.md#disaggregated-serving) for more details. > [!Note] > The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh) diff --git a/components/src/dynamo/router/README.md b/components/src/dynamo/router/README.md index c1571f0dd59..98183c49d46 100644 --- a/components/src/dynamo/router/README.md +++ b/components/src/dynamo/router/README.md @@ -3,7 +3,7 @@ # Standalone Router -A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see the [KV Cache Routing documentation](/docs/router/kv_cache_routing.md). +A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see the [Router Guide](/docs/router/router_guide.md). 
## Overview @@ -29,7 +29,7 @@ python -m dynamo.router \ - `--endpoint`: Full endpoint path for workers in the format `namespace.component.endpoint` (e.g., `dynamo.prefill.generate`) **Router Configuration:** -For detailed descriptions of all KV router configuration options including `--block-size`, `--kv-overlap-score-weight`, `--router-temperature`, `--no-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, and `--no-track-active-blocks`, see the [KV Cache Routing documentation](/docs/router/kv_cache_routing.md). +For detailed descriptions of all KV router configuration options including `--block-size`, `--kv-overlap-score-weight`, `--router-temperature`, `--no-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, and `--no-track-active-blocks`, see the [Router Guide](/docs/router/router_guide.md). ## Architecture @@ -43,7 +43,7 @@ Clients query the `find_best_worker` endpoint to determine which worker should p ## Example: Manual Disaggregated Serving (Alternative Setup) > [!Note] -> **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See the [KV Cache Routing documentation](../../../../docs/router/kv_cache_routing.md#disaggregated-serving-prefill-and-decode) for the default setup. +> **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See the [Router Guide](/docs/router/router_guide.md#disaggregated-serving) for the default setup. > > Use this manual setup if you need explicit control over prefill routing configuration or want to manage prefill and decode routers separately. 
@@ -103,6 +103,7 @@ See [`components/src/dynamo/vllm/handlers.py`](../vllm/handlers.py) for a refere ## See Also -- [KV Cache Routing Architecture](/docs/router/kv_cache_routing.md) - Detailed explanation of KV-aware routing +- [Router Guide](/docs/router/router_guide.md) - Configuration and tuning for KV-aware routing +- [Router Design](/docs/design_docs/router_design.md) - Architecture details and event transport modes - [Frontend Router](../frontend/README.md) - Main HTTP frontend with integrated routing - [Router Benchmarking](/benchmarks/router/README.md) - Performance testing and tuning diff --git a/deploy/inference-gateway/README.md b/deploy/inference-gateway/README.md index 4faa437a571..f4b08a0c5e1 100644 --- a/deploy/inference-gateway/README.md +++ b/deploy/inference-gateway/README.md @@ -216,11 +216,11 @@ Common Vars for Routing Configuration: - Set `DYN_ENFORCE_DISAGG=true` if you want to enforce every request being served in the disaggregated manner. By default it is false, meaning that if the prefill worker is not available, the request will be served in the aggregated manner. - By default the Dynamo plugin uses KV routing. You can expose `DYN_USE_KV_ROUTING=false` in your [values.yaml](standalone/helm/dynamo-gaie/values.yaml) if you prefer to route in round-robin fashion. - If using kv-routing: - - Overwrite the `DYN_KV_BLOCK_SIZE` in your [values.yaml](standalone/helm/dynamo-gaie/values.yaml) to match your model's block size. The `DYN_KV_BLOCK_SIZE` env var is ***MANDATORY*** to prevent silent KV routing failures. - - Set `DYN_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes. - - Set `DYN_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores.
Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration). - - Set `DYN_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing - - See the [KV cache routing design](../../docs/router/kv_cache_routing.md) for details. + - Overwrite the `DYN_KV_BLOCK_SIZE` in your [values-dynamo-epp.yaml](./values-dynamo-epp.yaml) to match your model's block size. The `DYN_KV_BLOCK_SIZE` env var is ***MANDATORY*** to prevent silent KV routing failures. + - Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes. + - Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration). + - Set `DYNAMO_USE_KV_EVENTS=false` if you want to disable workers from sending KV events while using KV routing + - See the [Router Guide](../../docs/router/router_guide.md) for details. 
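Putting the variables above together, a minimal sketch of the corresponding environment block (the surrounding key layout in the values file is an assumption; the variable names come from the list above, and the values shown are illustrative):

```yaml
env:
  - name: DYN_KV_BLOCK_SIZE            # mandatory: must match the model's KV block size
    value: "16"
  - name: DYNAMO_OVERLAP_SCORE_WEIGHT  # higher = favor workers with similar cached prefixes
    value: "1.0"
  - name: DYNAMO_ROUTER_TEMPERATURE    # 0.0 = always pick the top-scoring worker
    value: "0.0"
  - name: DYNAMO_USE_KV_EVENTS         # "false" disables worker KV events under KV routing
    value: "true"
```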
Stand-Alone installation only: diff --git a/docs/backends/sglang/README.md b/docs/backends/sglang/README.md index 45f6aa0124b..9f282391dcd 100644 --- a/docs/backends/sglang/README.md +++ b/docs/backends/sglang/README.md @@ -36,7 +36,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) |---------|--------|-------| | [**Disaggregated Serving**](../../design_docs/disagg_serving.md) | ✅ | | | [**Conditional Disaggregation**](../../design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) | -| [**KV-Aware Routing**](../../router/kv_cache_routing.md) | ✅ | | +| [**KV-Aware Routing**](../../router/README.md) | ✅ | | | [**SLA-Based Planner**](../../planner/sla_planner.md) | ✅ | | | [**Multimodal Support**](../../multimodal/sglang.md) | ✅ | | | [**KVBM**](../../kvbm/README.md) | ❌ | Planned | diff --git a/docs/backends/trtllm/README.md b/docs/backends/trtllm/README.md index 6c8b7b9db1a..b8429b67f1f 100644 --- a/docs/backends/trtllm/README.md +++ b/docs/backends/trtllm/README.md @@ -54,7 +54,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) |---------|--------------|-------| | [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ | | | [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet | -| [**KV-Aware Routing**](../../../docs/router/kv_cache_routing.md) | ✅ | | +| [**KV-Aware Routing**](../../router/README.md) | ✅ | | | [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | | | [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | Planned | | [**KVBM**](../../../docs/kvbm/README.md) | ✅ | | @@ -113,7 +113,7 @@ apt-get update && apt-get -y install git git-lfs > [!IMPORTANT] > Below we provide some simple shell scripts that run the components for each configuration. 
Each shell script is simply running the `python3 -m dynamo.frontend ` to start up the ingress and using `python3 -m dynamo.trtllm ` to start up the workers. You can easily take each command and run them in separate terminals. -For detailed information about the architecture and how KV-aware routing works, see the [KV Cache Routing documentation](../../router/kv_cache_routing.md). +For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../router/router_guide.md). ### Aggregated ```bash diff --git a/docs/backends/vllm/README.md b/docs/backends/vllm/README.md index d6c997c1066..794e4183fc8 100644 --- a/docs/backends/vllm/README.md +++ b/docs/backends/vllm/README.md @@ -37,7 +37,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) |---------|------|-------| | [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ | | | [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP | -| [**KV-Aware Routing**](../../../docs/router/kv_cache_routing.md) | ✅ | | +| [**KV-Aware Routing**](../../router/README.md) | ✅ | | | [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | | | [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | WIP | | [**KVBM**](../../../docs/kvbm/README.md) | ✅ | | @@ -179,7 +179,7 @@ When using KV-aware routing, ensure deterministic hashing across processes to av ```bash vllm serve ... --enable-prefix-caching --prefix-caching-algo sha256 ``` -See the high-level notes in [KV Cache Routing](../../../docs/router/kv_cache_routing.md) on deterministic event IDs. +See the high-level notes in [Router Design](../../design_docs/router_design.md#deterministic-event-ids) on deterministic event IDs. 
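To see why a content-based algorithm such as `sha256` matters here: Python's built-in `hash()` is salted per process (unless `PYTHONHASHSEED` is pinned), while a content hash gives every worker the same ID for the same block. A minimal illustration of the idea — the hashing scheme below is simplified for demonstration, not vLLM's actual implementation:

```python
import hashlib

def block_hash(token_ids, parent_hash=""):
    """Content-based hash for one KV block: stable across processes and workers.

    Chaining in the parent hash distinguishes identical token blocks that
    appear under different prefixes, which is what enables prefix matching.
    """
    payload = f"{parent_hash}|{','.join(map(str, token_ids))}".encode()
    return hashlib.sha256(payload).hexdigest()

# The same block of tokens hashes identically wherever it is computed.
a = block_hash([17, 42, 99], parent_hash="root")
b = block_hash([17, 42, 99], parent_hash="root")
assert a == b
# A different prefix yields a different block identifier.
assert block_hash([17, 42, 99], parent_hash="other") != a
```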
## Request Migration diff --git a/docs/design_docs/architecture.md b/docs/design_docs/architecture.md index b812cad802b..e4ec91bd4fb 100644 --- a/docs/design_docs/architecture.md +++ b/docs/design_docs/architecture.md @@ -53,7 +53,7 @@ To address the growing demands of distributed inference serving, NVIDIA introduc The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features: - [Dynamo Disaggregated Serving](disagg_serving.md) -- [Dynamo Smart Router](../router/kv_cache_routing.md) +- [Dynamo Smart Router](../router/README.md) - [Dynamo KV Cache Block Manager](../kvbm/kvbm_intro.rst) - [Planner](../planner/planner_intro.rst) - [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md) diff --git a/docs/design_docs/router_design.md b/docs/design_docs/router_design.md new file mode 100644 index 00000000000..a7fea649570 --- /dev/null +++ b/docs/design_docs/router_design.md @@ -0,0 +1,310 @@ + + +# Router Design + +This document describes the internal architecture of the Dynamo KV Router, including block tracking mechanisms, the KV cache optimization system, event handling, and transport modes. + +## KV Router Architecture + +The KV Router tracks two key metrics for each worker: + +1. **Potential Active Blocks**: The number of blocks that would be used for decoding if a request is routed to a worker. This includes both existing active blocks and new blocks from the incoming request. + +2. 
**Potential New Prefill Blocks**: The number of tokens that need to be computed from scratch on a worker, calculated as: + - New prefill tokens = Total input tokens - (Overlap blocks × Block size) + - Potential prefill blocks = New prefill tokens / Block size + +### Block Tracking Mechanisms + +The router maintains block information through two complementary systems: + +- **Active Decoding Blocks**: Tracked locally by the router throughout the request lifecycle: + - Incremented when adding a new request + - Updated during token generation + - Decremented upon request completion + +- **Cached Blocks**: Maintained globally by the KvIndexer using a prefix tree built from worker-reported KV events. This provides accurate overlap information for routing decisions. + +## KV Cache Router + +The leading Large Language Models (LLMs) today are auto-regressive and based on the [transformer architecture](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf). One key inference optimization technique is to cache the already computed keys and values and to reuse them for future tokens. This is called the [KV Cache](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/#key-value_caching). + +### KV Cache Routing and Load Balancing + +```mermaid +graph TD + T[Tokens] --> R[KV Aware Router] + + R -.-> W1["Worker 1
Cached: 2 blocks
Prefill: 8 blks
Decode: 10 blks"] + R ==>|Selected| W2["Worker 2
Cached: 5 blocks
Prefill: 5 blks
Decode: 5 blks"] + R -.-> W3["Worker 3
Cached: 8 blocks
Prefill: 2 blks
Decode: 9 blks"] + + style T fill:#fff3e0,stroke:#333,color:#333 + style R fill:#2e8b57,stroke:#333,color:#fff + style W1 fill:#f3e5f5,stroke:#333,color:#333 + style W2 fill:#c8e6c9,stroke:#333,color:#333 + style W3 fill:#f3e5f5,stroke:#333,color:#333 + + linkStyle 0,1,2,3 stroke:#8b4513,stroke-width:2px +``` + +The router uses a cost function that considers both the prefill cost (influenced by cached blocks) and the decode load to make optimal routing decisions. + +#### Cost Calculation + +1. **Prefill blocks**: Calculated by dividing the number of tokens requiring prefill processing by the block size. The system predicts this based on input tokens and available cached blocks per worker, updating the count when the first output token signals prefill completion. + +2. **Decode blocks**: Estimated from the request's input tokens and each worker's active sequences. The count updates when requests complete and their blocks are freed. + +3. **Cost formula**: `cost = overlap_score_weight * prefill_blocks + decode_blocks` + - Lower costs indicate better routing choices + - `overlap_score_weight` balances cache hit optimization against load distribution + - Higher weights favor cache reuse (improving TTFT), while lower weights prioritize even load distribution (improving ITL) + +#### Worker Selection + +The router selects the worker with the lowest cost. When `router_temperature` is set to a non-zero value, the router uses softmax sampling on the normalized cost logits to introduce randomness in the selection, which can help with load distribution. + +Example calculation with `overlap_score_weight = 1.0`: +- Worker 1: cost = 1.0 * 8 + 10 = 18 +- **Worker 2: cost = 1.0 * 5 + 5 = 10** (selected - lowest cost) +- Worker 3: cost = 1.0 * 2 + 9 = 11 + +### KV Cache Optimizations + +Every inference framework will have a KV Cache for each worker. 
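Stepping back to the cost formula and worker-selection example in the previous section, the selection logic can be sketched in a few lines of Python. This is a standalone illustration with made-up helper names, not the router's actual implementation (which is in Rust):

```python
import math
import random

def select_worker(workers, overlap_score_weight=1.0, temperature=0.0):
    """Pick a worker using cost = overlap_score_weight * prefill_blocks + decode_blocks.

    `workers` maps worker id -> (potential_prefill_blocks, potential_decode_blocks).
    With temperature 0.0 the lowest-cost worker always wins; a positive
    temperature softmax-samples over negated costs, letting higher-cost
    workers through occasionally (exploration).
    """
    costs = {w: overlap_score_weight * p + d for w, (p, d) in workers.items()}
    if temperature == 0.0:
        return min(costs, key=costs.get)  # deterministic: lowest cost wins
    logits = [-c / temperature for c in costs.values()]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]  # numerically stable softmax weights
    return random.choices(list(costs), weights=weights, k=1)[0]

# The worked example above, with overlap_score_weight = 1.0:
workers = {"worker1": (8, 10), "worker2": (5, 5), "worker3": (2, 9)}
print(select_worker(workers))  # -> worker2 (cost 10 beats 18 and 11)
```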
A popular inference framework library is [vLLM](https://github.com/vllm-project/vllm), where a key contribution was [PagedAttention](https://arxiv.org/abs/2309.06180), which allowed them to manage the KV Cache efficiently by chunking requests into blocks. + +Another popular inference framework, [SGLang](https://github.com/sgl-project/sglang), contributed [RadixAttention](https://arxiv.org/abs/2312.07104), which introduced a prefix tree that allows efficient matching, insertion, and eviction of KV Cache blocks. The prefix tree structure popularized KV Cache reuse. + +In Dynamo, we introduce a KVPublisher, which emits the KV Cache events that occur at each worker, and a KVIndexer, which keeps track of these events globally. + +### KV Block Management Flow + +To get a feel for how KV Cache management works on a single worker with KV Cache reuse turned on, and where the KVPublisher plugs in, we can walk through the KV Block management flow: + +1. **Request tokenization**: The incoming prompt is converted into tokens +2. **Block partitioning**: The token sequence is divided into fixed-size blocks (e.g., 16 or 64 tokens per block) +3. **Block hashing**: Each block of tokens is hashed to create a unique identifier +4. **Cache lookup**: + - For each block, the system checks if a matching block already exists in the KV cache + - If a match is found, the existing KV cache block is reused + - If no match is found, the system proceeds to the next step +5. **Resource allocation**: + - For blocks without matches, the system attempts to allocate new memory space + - If sufficient memory is available, allocate memory space and proceed to step 7 + - If memory is constrained, proceed to step 6
**Cache eviction** (when necessary): + - The system applies an eviction policy (e.g., LRU, LFU) to identify blocks for removal + - Selected blocks are evicted from the cache + - **KVPublisher emits a KV removed event notifying KVIndexer about the removed block.** + - Alternatively, some systems may offload less-frequently used blocks to CPU memory. +7. **KV computation**: + - For new blocks, the model computes key and value tensors + - These tensors are stored in the newly allocated cache blocks + - **KVPublisher emits a KV stored event notifying KVIndexer about newly stored blocks**. + +Further details can be found for [TRT-LLM](https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/), [vLLM](https://docs.vllm.ai/en/latest/design/automatic_prefix_caching.html#design-automatic-prefix-caching), and [SGLang](https://lmsys.org/blog/2024-01-17-sglang/). + +## Events + +### KVPublisher + +The KVPublisher can be initialized and then called in the inference framework wherever blocks are allocated and removed. + +The two types of events are: +- KV stored event +- KV removed event + +The publisher can be initialized and used through C bindings or Python bindings. + +### Deterministic Event IDs + +Engines do not need to emit deterministic block identifiers in KV events, as the router uses local block hashes (computed from token content) for tracking and matching blocks across workers. However, it is strongly preferred that engines do emit deterministic block identifiers, as this keeps the KvIndexer's internal lookup table smaller and more efficient. To ensure deterministic behavior, all workers should use identical engine versions/configuration. If your engine relies on Python's built-in `hash()` for any event IDs, set `PYTHONHASHSEED=0`; otherwise this setting has no effect. + +### KVIndexer + +The KVIndexer builds and maintains a global view of cached blocks in a prefix tree.
We modify the original prefix tree by also storing the worker ID on each node, so that the indexer can return the number of matched blocks for each worker. + +The KVIndexer has a method `find_matches_for_request`, which takes in tokens and returns a dictionary mapping worker IDs to the number of matched KV blocks. + +### Inter-Router Communication + +In distributed deployments with multiple routers, each router maintains visibility over only a portion of the total requests. To ensure consistent routing decisions, routers synchronize their states through three event types: + +1. **AddRequest**: Notifies other routers when a request is assigned to a worker. Includes the request ID, worker ID, token sequence blocks, and overlap score to track block usage across the system. + +2. **MarkPrefillCompleted**: Signals when a request moves from the prefill to the decode phase, allowing routers to update their worker load calculations by excluding completed prefill tokens. + +3. **Free**: Indicates request completion and resource release, enabling accurate block reference counting across all routers. + +Each event carries a unique router ID to prevent self-event processing. This asynchronous communication keeps KV cache state consistent across all routers, even as they handle different request streams. + +## Event Transport Modes + +The router supports two event transport modes for KV cache state synchronization: + +- **JetStream (default)**: Persistent event stream with durable consumers. State persists across router restarts via snapshots in the NATS object store. Best for production with multi-replica consistency. + +- **NATS Core with Local Indexer** (`--enable-local-indexer` on workers): Fire-and-forget pub/sub where workers maintain local radix trees. The router rebuilds state by querying workers on startup. Lower latency, simpler setup. + +### JetStream Mode + +KV events are sent to a persistent NATS JetStream.
Each KV router/indexer replica acts as a durable consumer, pulling messages from this shared stream. This architecture ensures consistency across router replicas and persistence across restarts. + +- **Best for**: Production deployments requiring durability and multi-replica router consistency +- **Tradeoffs**: Requires JetStream setup; slightly higher latency due to persistence guarantees + +```mermaid +graph TD + subgraph Engines + E1[Engine 1
KVPublisher] + E2[Engine 2
KVPublisher] + E3[Engine 3
KVPublisher] + end + + subgraph "NATS JetStream" + JS[(Persistent KV Events Stream
- Block created
- Block removed)] + end + + subgraph "NATS Object Store" + OS[(Radix Tree
State Snapshot)] + end + + subgraph "Router Replicas" + R1[Router 1
KVIndexer] + R2[Router 2
KVIndexer] + end + + E1 -->|Publish Events| JS + E2 -->|Publish Events| JS + E3 -->|Publish Events| JS + + JS -->|Consume as Durable Consumer| R1 + JS -->|Consume as Durable Consumer| R2 + JS -->|Periodic Snapshot| OS + + style JS fill:#e1f5fe,stroke:#333,color:#333 + style OS fill:#e1f5fe,stroke:#333,color:#333 + style E1 fill:#f3e5f5,stroke:#333,color:#333 + style E2 fill:#f3e5f5,stroke:#333,color:#333 + style E3 fill:#f3e5f5,stroke:#333,color:#333 + style R1 fill:#2e8b57,stroke:#333,color:#fff + style R2 fill:#2e8b57,stroke:#333,color:#fff + + linkStyle 0,1,2,3,4,5 stroke:#2196f3,stroke-width:2px +``` + +### NATS Core with Local Indexer + +When workers are started with `--enable-local-indexer`, each worker maintains its own local radix tree (local indexer) and publishes events over NATS Core (fire-and-forget pub/sub) instead of JetStream. Each worker assigns monotonically increasing event IDs to its events. The router detects gaps in event sequences and recovers missed events by querying the worker's local indexer directly. + +- **Best for**: Lower-latency setups; simpler deployments without JetStream; single-router scenarios +- **Tradeoffs**: State persists on workers (not centralized); recovery depends on workers being available +- **Enable with**: `--enable-local-indexer` flag on workers (vLLM, mocker) + +```mermaid +graph TD + subgraph Engines + E1[Engine 1
LocalKvIndexer] + E2[Engine 2
LocalKvIndexer] + E3[Engine 3
LocalKvIndexer] + end + + subgraph "NATS Core" + NC[KV Events Pub/Sub
- Block created
- Block removed] + end + + subgraph "Router Replicas" + R1[Router 1
KVIndexer] + R2[Router 2
KVIndexer] + end + + E1 -->|Publish Events| NC + E2 -->|Publish Events| NC + E3 -->|Publish Events| NC + + NC -->|Subscribe| R1 + NC -->|Subscribe| R2 + + style NC fill:#e1f5fe,stroke:#333,color:#333 + style E1 fill:#f3e5f5,stroke:#333,color:#333 + style E2 fill:#f3e5f5,stroke:#333,color:#333 + style E3 fill:#f3e5f5,stroke:#333,color:#333 + style R1 fill:#2e8b57,stroke:#333,color:#fff + style R2 fill:#2e8b57,stroke:#333,color:#fff + + linkStyle 0,1,2,3,4 stroke:#2196f3,stroke-width:2px +``` + +**How gap detection works:** +1. Each worker assigns monotonically increasing event IDs starting from 0 +2. The router tracks the last received event ID per worker +3. If an event arrives with `event_id > last_id + 1`, the router detects a gap +4. The router queries the worker's local indexer for the missing event range `[last_id+1, event_id-1]` +5. On worker discovery (Added event), the router dumps the worker's entire local indexer state + +**Startup behavior:** +- When a worker is discovered, the router queries and ingests its full local indexer state +- When a worker is removed, the router removes all its blocks from the global radix tree + +>[!Note] +> The router automatically selects the transport mode based on worker configuration. If all connected workers have `enable_local_indexer=true`, the router uses NATS Core mode. Otherwise, it uses JetStream mode. + +### Local Active Block Management with Replica Sync + +In addition to cached blocks, each router replica needs to track active blocks (blocks being used for ongoing generation) as load metrics. Since this information is highly time-sensitive, it should be predicted immediately when: +- The router receives and routes a request +- The first token is generated (prefill complete) +- The response ends (request freed) + +This is managed locally in each router via a "slot manager". To maintain consistency across the system, router replicas synchronize these local predictions with each other through NATS core messaging. 
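As a rough sketch of the bookkeeping behind these three prediction points (illustrative only — the real slot manager tracks token blocks per request and exchanges AddRequest, MarkPrefillCompleted, and Free events over NATS; the class and field names here are ours):

```python
class SlotManager:
    """Local prediction of active blocks per worker, kept in sync across replicas.

    Each replica applies the same three events in the same way, whether they
    originate locally or arrive from a peer router.
    """
    def __init__(self):
        self.active_blocks = {}  # worker_id -> predicted active block count
        self.requests = {}       # request_id -> (worker_id, prefill_blocks, decode_blocks)

    def add_request(self, request_id, worker_id, prefill_blocks, decode_blocks):
        # Request routed: count its predicted prefill and decode blocks as load.
        self.requests[request_id] = (worker_id, prefill_blocks, decode_blocks)
        self.active_blocks[worker_id] = (
            self.active_blocks.get(worker_id, 0) + prefill_blocks + decode_blocks
        )

    def mark_prefill_completed(self, request_id):
        # First token received: stop counting prefill blocks toward the load.
        worker_id, prefill_blocks, decode_blocks = self.requests[request_id]
        self.active_blocks[worker_id] -= prefill_blocks
        self.requests[request_id] = (worker_id, 0, decode_blocks)

    def free(self, request_id):
        # Response complete: release the remaining blocks.
        worker_id, prefill_blocks, decode_blocks = self.requests.pop(request_id)
        self.active_blocks[worker_id] -= prefill_blocks + decode_blocks
```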
+ +```mermaid +sequenceDiagram + participant C1 as Client 1 + participant R1 as Router 1
(Slot Manager) + participant R2 as Router 2
(Slot Manager) + participant C2 as Client 2 + + Note over R1,R2: Router Replica Sync Enabled + + C1->>R1: Request A + activate R1 + R1->>R1: Predict blocks & route to worker + R1-->>R2: Sync: AddRequest(A) + + C2->>R2: Request B + activate R2 + R2->>R2: Predict blocks & route to worker + R2-->>R1: Sync: AddRequest(B) + + R1->>R1: First token received
(prefill complete) + R1-->>R2: Sync: MarkPrefillCompleted(A) + R1->>C1: Stream response + + R2->>R2: First token received
(prefill complete) + R2-->>R1: Sync: MarkPrefillCompleted(B) + R2->>C2: Stream response + + R1->>R1: Response complete
(free blocks) + R1-->>R2: Sync: Free(A) + deactivate R1 + + R2->>R2: Response complete
(free blocks) + R2-->>R1: Sync: Free(B) + deactivate R2 + + Note over R1,R2: Both routers have consistent
view of active blocks +``` + +This dual-layer approach—persistent global KV cache state via JetStream and ephemeral active block synchronization via router replicas—enables the system to make optimal routing decisions that balance cache reuse with load distribution. + +## See Also + +- **[Router README](../router/README.md)**: Quick start guide for the KV Router +- **[Router Guide](../router/router_guide.md)**: Configuration, tuning, and production setup +- **[Router Examples](../router/router_examples.md)**: Python API usage and custom routing patterns +- **[KV Event Publishing for Custom Engines](../integrations/kv_events_custom_engines.md)**: Integrate custom inference engines with KV-aware routing diff --git a/docs/features/lora/README.md b/docs/features/lora/README.md index da4e8a3aaef..de22435c29a 100644 --- a/docs/features/lora/README.md +++ b/docs/features/lora/README.md @@ -311,4 +311,4 @@ kubectl logs deployment/my-worker | grep -i lora - [Feature Matrix](../../reference/feature-matrix.md) - Backend compatibility overview - [vLLM Backend](../../backends/vllm/README.md) - vLLM-specific configuration - [Dynamo Operator](../../kubernetes/dynamo_operator.md) - Kubernetes operator overview -- [KV-Aware Routing](../../router/kv_cache_routing.md) - LoRA-aware request routing +- [KV-Aware Routing](../../router/router_guide.md) - LoRA-aware request routing diff --git a/docs/hidden_toctree.rst b/docs/hidden_toctree.rst index 9036bb6e681..1408966ce8c 100644 --- a/docs/hidden_toctree.rst +++ b/docs/hidden_toctree.rst @@ -37,11 +37,11 @@ kubernetes/README.md reference/cli.md observability/metrics.md + integrations/kv_events_custom_engines.md agents/tool-calling.md development/jail_stream.md - router/kv_cache_routing.md - router/kv_events.md + router/router_examples.md planner/load_planner.md fault_tolerance/README.md fault_tolerance/request_migration.md @@ -75,6 +75,7 @@ backends/vllm/deepseek-r1.md backends/vllm/gpt-oss.md + integrations/lmcache_integration.md 
backends/vllm/multi-node.md backends/vllm/prometheus.md backends/vllm/prompt-embeddings.md diff --git a/docs/index.rst b/docs/index.rst index 70770b00b2b..5fe9ac14f9c 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -59,6 +59,7 @@ Quickstart :caption: User Guides KV Cache Offloading + KV Aware Routing Tool Calling Multimodality Support LoRA Adapters @@ -89,6 +90,7 @@ Quickstart Architecture Flow Disaggregated Serving Distributed Runtime + Router Design Request Plane Event Plane Planner Design diff --git a/docs/router/kv_events.md b/docs/integrations/kv_events_custom_engines.md similarity index 97% rename from docs/router/kv_events.md rename to docs/integrations/kv_events_custom_engines.md index e9ff7904c46..3b854a15c72 100644 --- a/docs/router/kv_events.md +++ b/docs/integrations/kv_events_custom_engines.md @@ -282,3 +282,9 @@ Each event in the payload is a dictionary with `type` field (`BlockStored`, `Blo 2. **Block size must match** your engine's actual `kv_block_size` 3. **`parent_hash` is required** for all blocks except the first in a sequence - it links blocks to enable prefix matching + +## See Also + +- **[Router README](../router/README.md)**: Quick start guide for the KV Router +- **[Router Guide](../router/router_guide.md)**: Configuration, tuning, and production setup +- **[Router Design](../design_docs/router_design.md)**: Architecture details and event transport modes diff --git a/docs/reference/feature-matrix.md b/docs/reference/feature-matrix.md index 399dce2124e..bdc22c150b9 100644 --- a/docs/reference/feature-matrix.md +++ b/docs/reference/feature-matrix.md @@ -119,7 +119,7 @@ TensorRT-LLM delivers maximum inference performance and optimization, with full [disagg]: docs/design_docs/disagg_serving.md -[kv-routing]: docs/router/kv_cache_routing.md +[kv-routing]: docs/router/README.md [planner]: docs/planner/planner_intro.rst [kvbm]: docs/kvbm/kvbm_intro.rst [migration]: docs/fault_tolerance/request_migration.md diff --git a/docs/router/README.md 
b/docs/router/README.md index a42abe9a279..d12b4db6746 100644 --- a/docs/router/README.md +++ b/docs/router/README.md @@ -3,11 +3,9 @@ SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. SPDX-License-Identifier: Apache-2.0 --> -# KV Router +# Router -## Overview - -The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups. +The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks), using KV cache overlap to minimize redundant computation. Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups. ## Quick Start @@ -24,14 +22,23 @@ This command: - Exposes the service on port 8000 (configurable) - Automatically handles all backend workers registered to the Dynamo endpoint -Backend workers register themselves using the `register_llm` API, after which the KV Router automatically: -- Tracks the state of all registered workers -- Makes routing decisions based on KV cache overlap -- Balances load across available workers +Backend workers register themselves using the `register_llm` API, after which the KV Router automatically tracks worker state and makes routing decisions based on KV cache overlap. 
+ +#### CLI Arguments + +| Argument | Default | Description | +|----------|---------|-------------| +| `--router-mode kv` | `round_robin` | Enable KV cache-aware routing | +| `--router-temperature ` | `0.0` | Controls routing randomness (0.0 = deterministic, higher = more random) | +| `--kv-cache-block-size ` | Backend-specific | KV cache block size (should match backend config) | +| `--kv-events` / `--no-kv-events` | `--kv-events` | Enable/disable real-time KV event tracking | +| `--kv-overlap-score-weight ` | `1.0` | Balance prefill vs decode optimization (higher = better TTFT) | + +For all available options: `python -m dynamo.frontend --help` ### Kubernetes Deployment -To enable the KV Router in a Kubernetes deployment, add the `DYN_ROUTER_MODE` environment variable to your frontend service: +To enable the KV Router in Kubernetes, add the `DYN_ROUTER_MODE` environment variable to your frontend service: ```yaml apiVersion: nvidia.com/v1alpha1 @@ -47,11 +54,6 @@ spec: envs: - name: DYN_ROUTER_MODE value: kv # Enable KV Smart Router - extraPodSpec: - mainContainer: - image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0 - Worker: - # ... worker configuration ... ``` **Key Points:** @@ -59,258 +61,43 @@ spec: - Workers automatically report KV cache events to the router - No worker-side configuration changes needed -**Complete K8s Examples:** -- [TRT-LLM aggregated router example](../../examples/backends/trtllm/deploy/agg_router.yaml) -- [vLLM aggregated router example](../../examples/backends/vllm/deploy/agg_router.yaml) -- [SGLang aggregated router example](../../examples/backends/sglang/deploy/agg_router.yaml) -- [Distributed inference tutorial](../../examples/basics/kubernetes/Distributed_Inference/agg_router.yaml) - -**For A/B Testing and Advanced K8s Setup:** -See the comprehensive [KV Router A/B Benchmarking Guide](../benchmarks/kv-router-ab-testing.md) for step-by-step instructions on deploying, configuring, and benchmarking the KV router in Kubernetes. 
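Dynamo maps frontend CLI flags to `DYN_`-prefixed environment variables (as with `DYN_ROUTER_MODE` above). As a quick illustration of that naming convention only, here is a small hypothetical helper — not part of Dynamo — that derives the variable name from a flag:

```python
def flag_to_env_var(flag: str) -> str:
    """Illustrate Dynamo's CLI-flag -> env-var naming convention.

    Hypothetical helper for documentation purposes; not part of Dynamo.
    """
    # Strip leading dashes, uppercase, and turn dashes into underscores.
    name = flag.lstrip("-").upper().replace("-", "_")
    return f"DYN_{name}"

print(flag_to_env_var("--router-mode"))             # DYN_ROUTER_MODE
print(flag_to_env_var("--kv-overlap-score-weight")) # DYN_KV_OVERLAP_SCORE_WEIGHT
```

Note that negated flags are handled specially: `--no-kv-events` is expressed as `DYN_KV_EVENTS=false` (the positive variable set to `false`), which this simple sketch does not model.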
- -## Configuration Options - -### CLI Arguments (Python Deployment) - -The KV Router supports several key configuration options: - -- **`--router-mode kv`**: Enable KV cache-aware routing (required) - -- **`--kv-cache-block-size `**: Sets the KV cache block size (default: backend-specific). Larger blocks reduce overlap detection granularity but improve memory efficiency. This should match your backend configuration. - -- **`--router-temperature `**: Controls routing randomness (default: 0.0) - - `0.0`: Deterministic selection of the best worker - - `> 0.0`: Probabilistic selection using softmax sampling - - Higher values increase randomness, helping prevent worker saturation - -- **`--kv-events` / `--no-kv-events`**: Controls how the router tracks cached blocks (default: `--kv-events`) - - `--kv-events`: Uses real-time events from workers for accurate cache tracking - - `--no-kv-events`: Uses approximation based on routing decisions (lower overhead, less accurate) - -- **`--kv-overlap-score-weight `**: Balance between prefill and decode optimization (default: 1.0) - - Higher values (> 1.0): Prioritize reducing prefill cost (better TTFT) - - Lower values (< 1.0): Prioritize decode performance (better ITL) - -For a complete list of available options: -```bash -python -m dynamo.frontend --help -``` - -### Kubernetes Environment Variables - -All CLI arguments can be configured via environment variables in Kubernetes deployments. 
Use the `DYN_` prefix with uppercase parameter names: - -| CLI Argument | K8s Environment Variable | Default | Description | -|--------------|-------------------------|---------|-------------| -| `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round_robin` | Enable KV router | -| `--router-temperature ` | `DYN_ROUTER_TEMPERATURE=` | `0.0` | Routing randomness | -| `--kv-cache-block-size ` | `DYN_KV_CACHE_BLOCK_SIZE=` | Backend-specific | KV cache block size | -| `--no-kv-events` | `DYN_KV_EVENTS=false` | `true` | Disable KV event tracking | -| `--kv-overlap-score-weight ` | `DYN_KV_OVERLAP_SCORE_WEIGHT=` | `1.0` | Prefill vs decode weight | -| `--http-port ` | `DYN_HTTP_PORT=` | `8000` | HTTP server port | - -### Example with Advanced Configuration - -```yaml -apiVersion: nvidia.com/v1alpha1 -kind: DynamoGraphDeployment -metadata: - name: my-deployment -spec: - services: - Frontend: - dynamoNamespace: my-namespace - componentType: frontend - replicas: 1 - envs: - - name: DYN_ROUTER_MODE - value: kv - - name: DYN_ROUTER_TEMPERATURE - value: "0.5" # Add some randomness to prevent worker saturation - - name: DYN_KV_OVERLAP_SCORE_WEIGHT - value: "1.5" # Prioritize TTFT over ITL - - name: DYN_KV_CACHE_BLOCK_SIZE - value: "16" - extraPodSpec: - mainContainer: - image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0 -``` - -### Alternative: Using Command Args in K8s - -You can also pass CLI arguments directly in the container command: - -```yaml -extraPodSpec: - mainContainer: - image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0 - command: - - /bin/sh - - -c - args: - - "python3 -m dynamo.frontend --router-mode kv --router-temperature 0.5 --http-port 8000" -``` - -**Recommendation:** Use environment variables for easier configuration management and consistency with Dynamo's K8s patterns. - -## KV Router Architecture - -The KV Router tracks two key metrics for each worker: - -1. 
**Potential Active Blocks**: The number of blocks that would be used for decoding if a request is routed to a worker. This includes both existing active blocks and new blocks from the incoming request. - -2. **Potential New Prefill Blocks**: The number of tokens that need to be computed from scratch on a worker, calculated as: - - New prefill tokens = Total input tokens - (Overlap blocks × Block size) - - Potential prefill blocks = New prefill tokens / Block size +#### Environment Variables -### Block Tracking Mechanisms +All CLI arguments can be configured via environment variables using the `DYN_` prefix: -The router maintains block information through two complementary systems: +| CLI Argument | Environment Variable | Default | +|--------------|---------------------|---------| +| `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round_robin` | +| `--router-temperature` | `DYN_ROUTER_TEMPERATURE` | `0.0` | +| `--kv-cache-block-size` | `DYN_KV_CACHE_BLOCK_SIZE` | Backend-specific | +| `--no-kv-events` | `DYN_KV_EVENTS=false` | `true` | +| `--kv-overlap-score-weight` | `DYN_KV_OVERLAP_SCORE_WEIGHT` | `1.0` | -- **Active Decoding Blocks**: Tracked locally by the router throughout the request lifecycle: - - Incremented when adding a new request - - Updated during token generation - - Decremented upon request completion +For complete K8s examples and advanced configuration, see [K8s Examples](router_examples.md#k8s-examples). -- **Cached Blocks**: Maintained globally by the KvIndexer using a prefix tree built from worker-reported KV events. This provides accurate overlap information for routing decisions. +For A/B testing and advanced K8s setup, see the [KV Router A/B Benchmarking Guide](../benchmarks/kv-router-ab-testing.md). 
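The two per-worker metrics above can be sketched with the stated formulas. This is an illustrative toy (the function name and signature are hypothetical, not Dynamo's API):

```python
import math

def potential_loads(input_tokens: int, overlap_blocks: int,
                    active_blocks: int, block_size: int = 16):
    """Sketch of the router's two per-worker metrics (illustrative only).

    Returns (potential_prefill_blocks, potential_active_blocks).
    """
    # Tokens that must still be prefilled from scratch on this worker:
    # total input tokens minus the tokens covered by overlapping cached blocks.
    new_prefill_tokens = input_tokens - overlap_blocks * block_size
    potential_prefill_blocks = new_prefill_tokens / block_size
    # Blocks in use during decode if the request lands here: the worker's
    # existing active blocks plus every block of the incoming request.
    request_blocks = math.ceil(input_tokens / block_size)
    potential_active_blocks = active_blocks + request_blocks
    return potential_prefill_blocks, potential_active_blocks

# A 512-token prompt on a worker with 10 overlapping cached blocks
# and 25 currently active blocks:
print(potential_loads(512, overlap_blocks=10, active_blocks=25))  # (22.0, 57)
```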
-## Cost Function - -The KV Router's routing decision is based on a simple cost function: - -``` -logit = kv_overlap_score_weight × potential_prefill_blocks + potential_active_blocks -``` - -Where: -- Lower logit values are better (less computational cost) -- The router uses softmax sampling with optional temperature to select workers - -### Key Parameter: kv-overlap-score-weight - -The `kv-overlap-score-weight` parameter (default: 1.0) controls the balance between prefill and decode optimization: - -- **Higher values (> 1.0)**: Emphasize reducing prefill cost - - Prioritizes routing to workers with better cache hits - - Optimizes for Time To First Token (TTFT) - - Best for workloads where initial response latency is critical - -- **Lower values (< 1.0)**: Emphasize decode performance - - Distributes active decoding blocks more evenly - - Optimizes for Inter-Token Latency (ITL) - - Best for workloads with long generation sequences - -## KV Events vs. Approximation Mode - -The router uses KV events from workers by default to maintain an accurate global view of cached blocks. You can disable this with the `--no-kv-events` flag: - -- **With KV Events (default)**: - - Calculates overlap accurately using actual cached blocks - - Provides higher accuracy with event processing overhead - - Recommended for production deployments - -- **Without KV Events (--no-kv-events)**: - - Router predicts cache state based on routing decisions with TTL-based expiration and pruning - - Tracks blocks from recent requests with configurable time-to-live - - Reduces overhead at the cost of routing accuracy - - **NATS is not needed** - suitable for simpler deployments without NATS infrastructure - - Suitable for testing or when event processing becomes a bottleneck - -## Event Transport Modes - -The router supports two event transport modes for KV cache state synchronization: - -- **JetStream (default)**: Persistent event stream with durable consumers. 
State persists across router restarts via snapshots in NATS object store. Best for production with multi-replica consistency. - -- **NATS Core with Local Indexer** (`--enable-local-indexer` on workers): Fire-and-forget pub/sub where workers maintain local radix trees. Router rebuilds state by querying workers on startup. Lower latency, simpler setup. - -See [KV Cache Routing](kv_cache_routing.md#global-kv-cache-state-synchronization) for architecture diagrams and details. - -## Disaggregated Serving - -Dynamo supports disaggregated serving where prefill and decode are handled by separate worker pools. Register prefill workers with `ModelType.Prefill` and the frontend automatically activates an internal prefill router. - -Key points: -- Prefill router auto-activates when both prefill and decode workers register with the same model name -- Supports vLLM and TensorRT-LLM backends (SGLang requires separate router setup) -- Use `--no-track-active-blocks` for prefill-only workers - -See [KV Cache Routing - Disaggregated Serving](kv_cache_routing.md#disaggregated-serving-prefill-and-decode) for setup examples. - -## Router Replicas and State Persistence - -For high availability, run multiple router replicas with `--router-replica-sync` to synchronize active block tracking via NATS. - -State persistence options: -- **JetStream mode**: Automatic persistence via event stream and object store snapshots -- **Local Indexer mode**: State rebuilds from workers on startup -- **Reset state**: Use `--router-reset-states` to start fresh (use with caution) - -See [KV Cache Routing - Serving Multiple Router Replicas](kv_cache_routing.md#serving-multiple-router-replicas) for details. 
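The cost function and temperature-based sampling described earlier can be sketched as follows. This is a toy illustration under the documented formula (`logit = kv_overlap_score_weight × potential_prefill_blocks + potential_active_blocks`, lower is better); the `select_worker` helper and its signature are hypothetical, not Dynamo's implementation:

```python
import math
import random

def select_worker(workers, overlap_weight=1.0, temperature=0.0, rng=random):
    """Pick a worker from {worker_id: (potential_prefill_blocks,
    potential_active_blocks)} using the documented cost function.
    Illustrative sketch only."""
    logits = {w: overlap_weight * prefill + active
              for w, (prefill, active) in workers.items()}
    if temperature == 0.0:
        # Deterministic: lowest predicted cost wins.
        return min(logits, key=logits.get)
    # Softmax over negated costs; higher temperature flattens the
    # distribution and so increases routing randomness.
    weights = [math.exp(-logits[w] / temperature) for w in logits]
    return rng.choices(list(logits), weights=weights, k=1)[0]

workers = {"w1": (100.5, 25.0), "w2": (80.0, 60.0)}
print(select_worker(workers))  # w1 (cost 125.5 vs 140.0)
```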
- -## Busy Thresholds - -Control worker saturation with busy thresholds: -- `--active-decode-blocks-threshold <0.0-1.0>`: Mark workers busy when KV cache utilization exceeds threshold -- `--active-prefill-tokens-threshold `: Mark workers busy when active prefill tokens exceed threshold - -Thresholds can be updated at runtime via the `/busy_threshold` HTTP endpoint. See [Dynamic Threshold Configuration](kv_cache_routing.md#dynamic-threshold-configuration). - -## Python API - -For programmatic routing control, use the `KvPushRouter` class directly: - -```python -from dynamo.llm import DistributedRuntime, KvPushRouter, KvRouterConfig - -router = KvPushRouter(endpoint=endpoint, block_size=16, kv_router_config=KvRouterConfig()) -stream = await router.generate(token_ids=tokens, model="model-name") -``` - -Key methods: `generate()`, `best_worker()`, `get_potential_loads()`, `mark_prefill_complete()`, `free()`. - -See [KV Cache Routing - Python API](kv_cache_routing.md#using-kvpushrouter-python-api) for complete examples. +For more configuration options and tuning guidelines, see the [Router Guide](router_guide.md). ## Prerequisites and Limitations -- **Dynamic endpoints only**: KV router requires `register_llm()` with `model_input=ModelInput.Tokens` -- **No multimodal support**: Currently tracks token-based blocks only -- **No static endpoints**: Use `--router-mode round-robin` for static endpoint deployments - -See [KV Cache Routing - Prerequisites](kv_cache_routing.md#prerequisites-and-limitations) for details. - -## Tuning Guidelines - -### 1. Understand Your Workload Characteristics - -- **Prefill-heavy workloads** (long prompts, short generations): Increase `kv-overlap-score-weight` -- **Decode-heavy workloads** (short prompts, long generations): Decrease `kv-overlap-score-weight` - -### 2. 
Monitor Key Metrics - -The router logs the cost calculation for each worker: -``` -Formula for worker_1: 125.5 = 1.0 * 100.5 + 25.0 (cached_blocks: 15) -``` +**Requirements:** +- **Dynamic endpoints only**: The KV router requires `register_llm()` with `model_input=ModelInput.Tokens` (see [Backend Guide](../development/backend-guide.md)); your backend handler receives pre-tokenized requests with `token_ids` instead of raw text. +- You cannot use `--static-endpoint` mode with KV routing (use dynamic discovery instead) -This shows: -- Total cost (125.5) -- Overlap weight × prefill blocks (1.0 × 100.5) -- Active blocks (25.0) -- Cached blocks that contribute to overlap (15) +**Multimodal Support:** +- **vLLM and TRT-LLM**: Multimodal routing supported for images via multimodal hashes +- **SGLang**: Image routing not yet supported +- **Other modalities** (audio, video, etc.): Not yet supported -### 3. Temperature-Based Routing +**Limitations:** +- Static endpoints not supported—KV router requires dynamic model discovery via etcd to track worker instances and their KV cache states -The `router_temperature` parameter controls routing randomness: -- **0.0 (default)**: Deterministic selection of the best worker -- **> 0.0**: Probabilistic selection, higher values increase randomness -- Useful for preventing worker saturation and improving load distribution +For basic model registration without KV routing, use `--router-mode round-robin` or `--router-mode random` with both static and dynamic endpoints. -### 4. Iterative Optimization +## Next Steps -1. Begin with default settings -2. Monitor TTFT and ITL metrics -3. Adjust `kv-overlap-score-weight` to meet your performance goals: - - To reduce TTFT: Increase the weight - - To reduce ITL: Decrease the weight -4.
If you observe severe load imbalance, increase the temperature setting +- **[Router Guide](router_guide.md)**: Deep dive into KV cache routing, configuration, disaggregated serving, and tuning +- **[Router Examples](router_examples.md)**: Python API usage, K8s examples, and custom routing patterns +- **[Router Design](../design_docs/router_design.md)**: Architecture details, algorithms, and event transport modes diff --git a/docs/router/kv_cache_routing.md b/docs/router/kv_cache_routing.md deleted file mode 100644 index 1a11da27f08..00000000000 --- a/docs/router/kv_cache_routing.md +++ /dev/null @@ -1,732 +0,0 @@ - - -# KV Cache Routing -This document explains how Dynamo's Key-Value (KV) cache routing optimizes large language model inference by intelligently directing requests to workers with the most relevant cached data, while maintaining load balance through worker utilization metrics. - -To enable KV cache aware routing start the frontend node like this: -``` -python -m dynamo.frontend --router-mode kv -``` - -When KV blocks are created or removed, the engine notifies the Dynamo router, which then identifies the worker with the best matching blocks and routes traffic accordingly. - -To evaluate the benefits of KV-aware routing, compare your workload's performance using `--router-mode random|round-robin` against KV-aware routing. - -The main KV-aware routing arguments: - -- `--kv-overlap-score-weight`: Controls the importance of prefix cache overlaps in prefill cost calculations. Higher values improve Time To First Token (TTFT) at the cost of Inter-Token Latency (ITL). When set to 0, the router ignores prefix caches and uses pure load balancing. Defaults to 1. - -- `--router-temperature`: Controls worker selection randomness through softmax sampling of router cost logits. A value of 0 (default) ensures deterministic selection of the lowest-cost worker, while higher values introduce more randomness. - -- `--no-kv-events`: Disables KV event tracking. 
By default (when this flag is not provided), the router uses KV events to monitor block creation and deletion from workers. When disabled with this flag, the router predicts cache state based on routing decisions with TTL-based expiration (default 120s) and pruning. Use this flag if your backend doesn't support KV events (or you are not confident in the accuracy or responsiveness of the events). - -- `--router-replica-sync`: Disabled by default. Enables NATS-based synchronization of local routing decisions between router replicas. When enabled, routers share their active sequence information and local predictions of block usage, improving routing consistency across instances. Note that this does not sync the radix tree or cached KV block states themselves - those are synchronized through JetStream events. - -- `--router-reset-states`: When specified, resets the router state on startup by clearing both the JetStream event stream and NATS object store, starting with a fresh state. By default (when this flag is not provided), the router persists state across restarts, downloading any available snapshot from NATS object store and continuing to consume events from where it left off. This enables routers to maintain KV cache awareness across restarts. **Warning**: Using `--router-reset-states` can bring existing router replicas into an inconsistent state. Only use this flag when launching the first router replica in a component, or consider using a different namespace/component for a clean slate. - -- `--router-snapshot-threshold`: Sets the number of messages in the JetStream before triggering a snapshot. When the message count exceeds this threshold, a router will attempt to purge acknowledged messages from the stream and create a snapshot of the current radix tree state in NATS object store. Defaults to 1000000. This helps manage stream size and provides faster initialization for routers that restart.
- -- `--no-track-active-blocks`: Disables tracking of active blocks (blocks being used for ongoing generation/decode phases). By default, the router tracks active blocks for load balancing. Disable this when routing to workers that only perform prefill (no decode phase), as tracking decode load is not relevant. This reduces router overhead and simplifies state management. - -- `--no-assume-kv-reuse`: When tracking active blocks, disables the assumption of KV cache reuse. By default (`router_assume_kv_reuse=true`), the router computes actual block hashes for sequence tracking to deduplicate blocks and optimize load balancing. When disabled via this flag, the router generates random hashes for sequence blocks, treating each request's blocks as unique. This is useful in disaggregated setups where prefill transfers blocks to decode workers that may already have those blocks cached, but the engine cannot coordinate transfers to avoid duplication. Without this flag, the router's load balancing heuristics would undercount decode blocks when duplicates exist. - -- `--active-decode-blocks-threshold`: Initial threshold (0.0-1.0) for determining when a worker is considered busy based on KV cache block utilization. When a worker's KV cache active blocks exceed this percentage of total blocks, it will be marked as busy and excluded from routing. If not set, blocks-based busy detection is disabled. This feature works with all routing modes (`--router-mode kv|round-robin|random`) as long as backend engines publish load metrics. The threshold can be dynamically updated at runtime via the `/busy_threshold` HTTP endpoint (see [Dynamic Threshold Configuration](#dynamic-threshold-configuration)). - -- `--active-prefill-tokens-threshold`: Literal token count threshold for determining when a worker is considered busy based on prefill token utilization. When active prefill tokens exceed this threshold, the worker is marked as busy. If not set, tokens-based busy detection is disabled. 
- -- `--router-ttl`: Time-to-live in seconds for blocks in the router's local cache predictions. Blocks older than this duration will be automatically expired and removed from the router's radix tree. Defaults to 120.0 seconds when `--no-kv-events` is used. This helps manage memory usage by removing stale cache predictions that are unlikely to be accurate. - -- `--router-max-tree-size`: Maximum tree size (number of blocks) before pruning is triggered. When the total number of blocks in the radix tree exceeds this threshold, the router will prune the least recently used blocks. Defaults to 1048576 (2^20 blocks) when `--no-kv-events` is used. This prevents unbounded memory growth in long-running deployments. - -- `--router-prune-target-ratio`: Target size ratio to prune down to when `--router-max-tree-size` is exceeded. For example, with a value of 0.8 (default) and max tree size of 1048576, the router will prune down to approximately 838860 blocks when the threshold is exceeded. Defaults to 0.8 when `--no-kv-events` is used. This creates headroom before the next pruning cycle. - ->[!Note] -> **State persistence** depends on the event transport mode: -> - **JetStream mode** (default): State persists across router restarts via JetStream and NATS object store snapshots. -> - **NATS Core with Local Indexer mode** (`--enable-local-indexer` on workers): State persists on workers—router rebuilds state by querying workers on startup. -> - **No KV events** (`--no-kv-events`): State persistence is not supported. -> -> **Request plane is independent of KV event transport.** -> `DYN_REQUEST_PLANE` controls how **requests** are sent (TCP/HTTP/NATS), but KV-aware routing still uses **NATS** for KV events in both JetStream and NATS Core + Local Indexer modes. -> When KV events are enabled (default), NATS is automatically initialized. You can optionally set `NATS_SERVER=nats://...` to specify a custom NATS server; otherwise, it defaults to `localhost:4222`. 
-> Use `--no-kv-events` to disable KV events and remove the NATS requirement entirely (with request plane being `tcp` or `http`). -> -> When `--kv-overlap-score-weight` is set to 0, no KvIndexer is created and prefix matching is disabled (pure load balancing). When `--no-kv-events` is set, a KvIndexer is still created but no event subscriber is launched to consume KV events from workers. Instead, the router predicts cache state based on its own routing decisions with TTL-based expiration and pruning. -> -> **Backend Configuration:** When using `--no-kv-events`, configure your backend workers to disable KV event publishing: -> - **vLLM**: Use `--kv-events-config '{"enable_kv_cache_events": false}'` -> - **SGLang**: Do not use `--kv-events-config` -> - **TRT-LLM**: Do not use `--publish-events-and-metrics` -> -> The cli args `--router-ttl`, `--router-max-tree-size`, and `--router-prune-target-ratio` control local cache management when the router operates without receiving events from workers. When KV events are enabled (default), the router relies on worker-side eviction events and these parameters are ignored. - -## Prerequisites and Limitations - ->[!Note] -> **KV Router Requirements**: The KV router currently works only with **dynamic endpoints** that are registered via [`register_llm()`](../development/backend-guide.md#writing-python-workers-in-dynamo) with `model_input=ModelInput.Tokens`. Your backend handler receives pre-tokenized requests with `token_ids` instead of raw text. - -**Current Limitations (WIP):** -- **Static endpoints**: Not yet supported. The KV router requires dynamic model discovery via etcd to track worker instances and their KV cache states. -- **Multimodal models**: Not yet supported. The KV router currently tracks token-based blocks only. - -**What this means for your setup:** -1. 
Backend workers must call `register_llm()` with `model_input=ModelInput.Tokens` (see [Backend Guide](../development/backend-guide.md) or [example implementations](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/examples/hello_world)) -2. Your handler receives requests with pre-tokenized `token_ids`, not raw text or multimodal inputs -3. You cannot use `--static-endpoint` mode with KV routing (use dynamic discovery instead) - -For basic model registration without KV routing, you can use `--router-mode round-robin` or `--router-mode random` with both static and dynamic endpoints. - -## Disaggregated Serving (Prefill and Decode) - -Dynamo supports disaggregated serving where prefill (prompt processing) and decode (token generation) are handled by separate worker pools. When you register workers with `ModelType.Prefill` (see [Backend Guide](../development/backend-guide.md)), the frontend automatically detects them and activates an internal prefill router. - -### Automatic Prefill Router Activation - -The prefill router is automatically created when: -1. A decode model is registered (e.g., via `register_llm()` with `ModelType.Chat | ModelType.Completions`) -2. A prefill worker is detected with the same model name and `ModelType.Prefill` - -**Key characteristics of the prefill router:** -- **Always disables active block tracking** (`track_active_blocks=false`) since prefill workers don't perform decode -- **Seamlessly integrated** into the request pipeline between preprocessing and decode routing -- **Falls back gracefully** to decode-only mode if prefill fails or no prefill workers are available - -### Setup Example - -When both workers are registered, requests are automatically routed. 
- -```python -# Decode worker registration (in your decode worker) -decode_endpoint = runtime.namespace("dynamo").component("decode").endpoint("generate") - -await register_llm( - model_input=ModelInput.Tokens, - model_type=ModelType.Chat | ModelType.Completions, - endpoint=decode_endpoint, - model_name="meta-llama/Llama-2-7b-hf", - # ... other parameters -) - -await decode_endpoint.serve_endpoint(decode_handler.generate) - -# Prefill worker registration (in your prefill worker) -prefill_endpoint = runtime.namespace("dynamo").component("prefill").endpoint("generate") - -await register_llm( - model_input=ModelInput.Tokens, - model_type=ModelType.Prefill, # <-- Mark as prefill worker - endpoint=prefill_endpoint, - model_name="meta-llama/Llama-2-7b-hf", # Must match decode model name - # ... other parameters -) - -await prefill_endpoint.serve_endpoint(prefill_handler.generate) -``` - -> [!Note] -> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh). - -### Request Flow - -The following diagram shows an overview of the major components in disaggregated serving: - -```mermaid -graph TD - HTTP[HTTP] - ROUTER[Router] - PREFILL[Prefill Worker] - DECODE[Decode Worker] - - classDef worker_style fill:#f3e5f5,stroke:#333,stroke-width:2px,color:#333; - classDef router_style fill:#2e8b57,stroke:#333,stroke-width:2px,color:#fff; - - class PREFILL,DECODE worker_style - class ROUTER router_style - - HTTP <--> |"request/response"| ROUTER - ROUTER --> |"1. send to prefill"| PREFILL - PREFILL --> |"2. return NIXL metadata"| ROUTER - ROUTER --> |"3. send with metadata"| DECODE - DECODE --> |"4. 
stream response"| ROUTER - - PREFILL -.-> |"publish kv events"| ROUTER - - linkStyle 0,1,2,3,4 stroke:#8b4513,stroke-width:2px - linkStyle 5 stroke:#2196f3,stroke-width:2px -``` - -## Overview - -The KV-aware router operates on two key principles to optimize request routing: - -### Global KV Cache State Synchronization - -KV events from engines are collected by the router to maintain a global view of cached blocks across all workers. The router supports two event transport modes: - -#### Mode 1: JetStream (Default) - -KV events are sent to a persistent NATS JetStream. Each KV router/indexer replica acts as a durable consumer, pulling messages from this shared stream. This architecture ensures consistency across router replicas and persistence across restarts. - -- **Best for**: Production deployments requiring durability and multi-replica router consistency -- **Tradeoffs**: Requires JetStream setup; slightly higher latency due to persistence guarantees - -```mermaid -graph TD - subgraph Engines - E1[Engine 1
KVPublisher] - E2[Engine 2
KVPublisher] - E3[Engine 3
KVPublisher] - end - - subgraph "NATS JetStream" - JS[(Persistent KV Events Stream
- Block created
- Block removed)] - end - - subgraph "NATS Object Store" - OS[(Radix Tree
State Snapshot)] - end - - subgraph "Router Replicas" - R1[Router 1
KVIndexer] - R2[Router 2
KVIndexer] - end - - E1 -->|Publish Events| JS - E2 -->|Publish Events| JS - E3 -->|Publish Events| JS - - JS -->|Consume as Durable Consumer| R1 - JS -->|Consume as Durable Consumer| R2 - JS -->|Periodic Snapshot| OS - - style JS fill:#e1f5fe,stroke:#333,color:#333 - style OS fill:#e1f5fe,stroke:#333,color:#333 - style E1 fill:#f3e5f5,stroke:#333,color:#333 - style E2 fill:#f3e5f5,stroke:#333,color:#333 - style E3 fill:#f3e5f5,stroke:#333,color:#333 - style R1 fill:#2e8b57,stroke:#333,color:#fff - style R2 fill:#2e8b57,stroke:#333,color:#fff - - linkStyle 0,1,2,3,4,5 stroke:#2196f3,stroke-width:2px -``` - -#### Mode 2: NATS Core with Local Indexer - -When workers are started with `--enable-local-indexer`, each worker maintains its own local radix tree (local indexer) and publishes events over NATS Core (fire-and-forget pub/sub) instead of JetStream. Each worker assigns monotonically increasing event IDs to its events. The router detects gaps in event sequences and recovers missed events by querying the worker's local indexer directly. - -- **Best for**: Lower-latency setups; simpler deployments without JetStream; single-router scenarios -- **Tradeoffs**: State persists on workers (not centralized); recovery depends on workers being available -- **Enable with**: `--enable-local-indexer` flag on workers (vLLM, mocker) - -```mermaid -graph TD - subgraph Engines - E1[Engine 1
LocalKvIndexer] - E2[Engine 2
LocalKvIndexer] - E3[Engine 3
LocalKvIndexer] - end - - subgraph "NATS Core" - NC[KV Events Pub/Sub
- Block created
- Block removed] - end - - subgraph "Router Replicas" - R1[Router 1
KVIndexer] - R2[Router 2
KVIndexer] - end - - E1 -->|Publish Events| NC - E2 -->|Publish Events| NC - E3 -->|Publish Events| NC - - NC -->|Subscribe| R1 - NC -->|Subscribe| R2 - - style NC fill:#e1f5fe,stroke:#333,color:#333 - style E1 fill:#f3e5f5,stroke:#333,color:#333 - style E2 fill:#f3e5f5,stroke:#333,color:#333 - style E3 fill:#f3e5f5,stroke:#333,color:#333 - style R1 fill:#2e8b57,stroke:#333,color:#fff - style R2 fill:#2e8b57,stroke:#333,color:#fff - - linkStyle 0,1,2,3,4 stroke:#2196f3,stroke-width:2px -``` - -**How gap detection works:** -1. Each worker assigns monotonically increasing event IDs starting from 0 -2. The router tracks the last received event ID per worker -3. If an event arrives with `event_id > last_id + 1`, the router detects a gap -4. The router queries the worker's local indexer for the missing event range `[last_id+1, event_id-1]` -5. On worker discovery (Added event), the router dumps the worker's entire local indexer state - -**Startup behavior:** -- When a worker is discovered, the router queries and ingests its full local indexer state -- When a worker is removed, the router removes all its blocks from the global radix tree - ->[!Note] -> The router automatically selects the transport mode based on worker configuration. If all connected workers have `enable_local_indexer=true`, the router uses NATS Core mode. Otherwise, it uses JetStream mode. - -### Local Active Block Management with Replica Sync - -Second, in addition to cached blocks, each router replica needs to track active blocks (blocks being used for ongoing generation) as load metrics. Since this information is highly time-sensitive, it should be predicted immediately when: -- The router receives and routes a request -- The first token is generated (prefill complete) -- The response ends (request freed) - -This is managed locally in each router via a "slot manager". 
To maintain consistency across the system, router replicas synchronize these local predictions with each other through NATS core messaging. - -```mermaid -sequenceDiagram - participant C1 as Client 1 - participant R1 as Router 1
(Slot Manager) - participant R2 as Router 2
(Slot Manager) - participant C2 as Client 2 - - Note over R1,R2: Router Replica Sync Enabled - - C1->>R1: Request A - activate R1 - R1->>R1: Predict blocks & route to worker - R1-->>R2: Sync: AddRequest(A) - - C2->>R2: Request B - activate R2 - R2->>R2: Predict blocks & route to worker - R2-->>R1: Sync: AddRequest(B) - - R1->>R1: First token received
(prefill complete) - R1-->>R2: Sync: MarkPrefillCompleted(A) - R1->>C1: Stream response - - R2->>R2: First token received
(prefill complete) - R2-->>R1: Sync: MarkPrefillCompleted(B) - R2->>C2: Stream response - - R1->>R1: Response complete
(free blocks) - R1-->>R2: Sync: Free(A) - deactivate R1 - - R2->>R2: Response complete
(free blocks) - R2-->>R1: Sync: Free(B) - deactivate R2 - - Note over R1,R2: Both routers have consistent
view of active blocks
```

This dual-layer approach—persistent global KV cache state via JetStream and ephemeral active block synchronization via router replicas—enables the system to make optimal routing decisions that balance cache reuse with load distribution.

## Basic Routing
Dynamo supports several routing strategies when sending requests from one component to another component's endpoint.

First, we must create a client tied to a component's endpoint, which we can do using the labels defined above. Here we get a client tied to the `generate` endpoint of the `VllmWorker` component.

```python
client = namespace('dynamo').component('VllmWorker').endpoint('generate').client()
```

We can then use the default routing methods exposed by the client class to send requests to the `VllmWorker` component.

- **Random routing**: Default strategy, available via `client.generate()` or `client.random()`
- **Round-robin routing**: Cycles through available workers via `client.round_robin()`
- **Direct routing**: Explicitly targets a specific worker via `client.direct(input, component_id)`

KV Cache routing uses direct routing with a special worker selection algorithm.

## Serving Multiple Router Replicas

For improved fault tolerance, you can launch multiple frontend + router replicas. Since the frontend and router are currently tied together, you'll need to use a different HTTP port for each instance. (Separating the frontend from the router is a work in progress.)

### Router State Management

The KV Router tracks two types of state (see [KV Router Architecture](../router/README.md) for details):

1. **Prefix blocks (cached KV blocks)**: Maintained in a radix tree, tracking which blocks are cached on each worker. This state is **persistent** - backed by NATS JetStream events and object store snapshots. New router replicas automatically sync this state on startup, ensuring consistent cache awareness across restarts.

2.
**Active blocks (decoding blocks)**: Tracks blocks currently being used for active generation requests. This state is **ephemeral** - when a new router replica starts, it begins with zero active block knowledge but becomes eventually consistent as it handles requests.

### Enabling Router Replica Synchronization

```bash
# Router replica 1
python -m dynamo.frontend --router-mode kv --port 8000 --router-replica-sync

# Router replica 2 (can be started later)
python -m dynamo.frontend --router-mode kv --port 8001 --router-replica-sync
```

The `--router-replica-sync` flag enables active block synchronization between replicas:
- Active blocks are shared via NATS core messaging (fire-and-forget)
- Replicas exchange routing decisions to maintain consistent load estimates
- A new replica starts with zero active blocks but quickly converges through its own request handling and active syncing with other replicas

Without this flag, each replica maintains its own isolated view of active blocks, potentially leading to suboptimal routing.
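The slot-manager bookkeeping described above can be sketched in a few lines of plain Python. This is an illustrative model only (class and field names are hypothetical, not the actual Dynamo implementation); in a real deployment, each state transition would also be broadcast to peer replicas as the corresponding `AddRequest`, `MarkPrefillCompleted`, and `Free` sync events:

```python
from collections import defaultdict


class SlotManager:
    """Hypothetical sketch of per-replica active block tracking."""

    def __init__(self, block_size=16):
        self.block_size = block_size
        self.prefill_blocks = defaultdict(int)  # worker_id -> blocks in prefill
        self.decode_blocks = defaultdict(int)   # worker_id -> blocks in decode
        self.requests = {}                      # request_id -> (worker_id, blocks)

    def add_request(self, request_id, worker_id, num_tokens):
        # On routing: predict block usage and charge it to the worker as prefill load.
        blocks = -(-num_tokens // self.block_size)  # ceiling division
        self.requests[request_id] = (worker_id, blocks)
        self.prefill_blocks[worker_id] += blocks

    def mark_prefill_completed(self, request_id):
        # On first token: move the request's blocks from prefill load to decode load.
        worker_id, blocks = self.requests[request_id]
        self.prefill_blocks[worker_id] -= blocks
        self.decode_blocks[worker_id] += blocks

    def free(self, request_id):
        # On response completion: release the request's blocks entirely.
        worker_id, blocks = self.requests.pop(request_id)
        self.decode_blocks[worker_id] -= blocks


sm = SlotManager(block_size=16)
sm.add_request("req-A", worker_id=1, num_tokens=40)  # 40 tokens -> 3 blocks
sm.mark_prefill_completed("req-A")
print(sm.prefill_blocks[1], sm.decode_blocks[1])  # 0 3
```

A peer replica receiving the sync events would apply the same three transitions, which is why both replicas converge to the same view of active blocks.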
- -### Persistence and Recovery - -Persistence behavior depends on which event transport mode is active: - -**JetStream Mode (default):** -- Prefix blocks are stored in NATS JetStream with 1-hour retention -- Snapshots saved to NATS object store at configurable thresholds -- New replicas automatically restore this state on startup -- You can launch a third Router replica even if the first two are down, and it will recover the full prefix state - -```bash -python -m dynamo.frontend --router-mode kv --port 8002 --router-replica-sync -``` - -**NATS Core with Local Indexer Mode:** -- State persists on workers—events are fire-and-forget but workers retain their local indexer state -- On startup, the router queries each worker's local indexer to rebuild state -- Recovery depends on workers being available; if a worker is down, its blocks cannot be recovered -- Simpler infrastructure (no JetStream required) but less resilient - ->[!Note] -> If you need to start with a fresh state in JetStream mode, you have two options: -> 1. **Recommended**: Use a different namespace/component (see [Distributed Runtime](/docs/design_docs/distributed_runtime.md)) which will start a new stream and NATS object store path -> 2. **Use with caution**: Launch a router with the `--router-reset-states` flag, which will purge the entire stream and radix snapshot. This should only be done when launching the first router replica in a component, as it can bring existing router replicas into an inconsistent state. - -## Understanding KV Cache -The leading Large Language Models (LLMs) today are auto-regressive and based off of the [transformer architecture](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf). One key inference optimization technique is to cache the already computed keys and values and to reuse them for the future tokens. 
This is called the [KV Cache](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/#key-value_caching). - -### KV Cache Optimizations -Every inference framework will have a KV Cache for each worker. A popular inference framework library is [vLLM](https://github.com/vllm-project/vllm) where a key contribution was [PagedAttention](https://arxiv.org/abs/2309.06180), which allowed them to manage KV Cache in an efficient way by chunking requests into blocks. - -Another popular inference framework, [SGLang](https://github.com/sgl-project/sglang), contributed [RadixAttention](https://arxiv.org/abs/2312.07104) which introduced a -prefix tree which allows for efficient matching, inserting and eviction of KV Cache blocks. The prefix tree structure popularized KV Cache reuse. - -In Dynamo, we introduce a KVPublisher which emits KV Cache events that occur at each worker and a KVIndexer which keeps track of these events globally. - -To get a feel for how KV Cache management works on a single worker with KV Cache reuse turned on and where the KVPublisher gets plugged in, we can walk through the KV Block management flow: -1. Request tokenization: The incoming prompt is converted into tokens -2. Block partitioning: The token sequence is divided into fixed-size blocks (e.g., 16 or 64 tokens per block) -3. Block hashing: Each block of tokens is hashed to create a unique identifier -4. Cache lookup: - - For each block, the system checks if a matching block already exists in the KV cache - - If a match is found, the existing KV cache block is reused - - If no match is found, the system proceeds to the next step -5. Resource allocation: - - For blocks without matches, the system attempts to allocate new memory space - - If sufficient memory is available, allocate memory space and proceed to step 7 - - If memory is constrained, proceed to step 6 -6. 
Cache eviction (when necessary): - - The system applies an eviction policy (e.g., LRU, LFU) to identify blocks for removal - - Selected blocks are evicted from the cache - - **KVPublisher emits a KV removed event notifying KVIndexer about the removed block.** - - Alternatively, some systems may offload less-frequently used blocks to CPU memory. -7. KV computation: - - For new blocks, the model computes key and value tensors - - These tensors are stored in the newly allocated cache blocks - - **KVPublisher emits a kv stored event notifying KVIndexer about newly stored blocks**. - -Further details can be found for: [TRT-LLM](https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/), [vLLM](https://docs.vllm.ai/en/latest/design/automatic_prefix_caching.html#design-automatic-prefix-caching) and [SGLang](https://lmsys.org/blog/2024-01-17-sglang/). - -## KV Cache Routing and Load Balancing -```mermaid -graph TD - T[Tokens] --> R[KV Aware Router] - - R -.-> W1["Worker 1
Cached: 2 blocks
Prefill: 8 blks
Decode: 10 blks"] - R ==>|Selected| W2["Worker 2
Cached: 5 blocks
Prefill: 5 blks
Decode: 5 blks"] - R -.-> W3["Worker 3
Cached: 8 blocks
Prefill: 2 blks
Decode: 9 blks"] - - style T fill:#fff3e0,stroke:#333,color:#333 - style R fill:#2e8b57,stroke:#333,color:#fff - style W1 fill:#f3e5f5,stroke:#333,color:#333 - style W2 fill:#c8e6c9,stroke:#333,color:#333 - style W3 fill:#f3e5f5,stroke:#333,color:#333 - - linkStyle 0,1,2,3 stroke:#8b4513,stroke-width:2px -``` - -KV Cache reuse introduces complexity to LLM serving load balancing. While it can significantly reduce computation costs, routing strategies that ignore worker-specific KV states can lead to: -- Missed cache reuse opportunities due to suboptimal worker selection -- System throughput degradation from uneven request distribution across workers - -The router uses a cost function that considers both the prefill cost (influenced by cached blocks) and the decode load to make optimal routing decisions: - -### Cost Calculation - -1. **Prefill blocks**: Calculated by dividing the number of tokens requiring prefill processing by the block size. The system predicts this based on input tokens and available cached blocks per worker, updating the count when the first output token signals prefill completion. - -2. **Decode blocks**: Estimated from the request's input tokens and each worker's active sequences. The count updates when requests complete and their blocks are freed. - -3. **Cost formula**: `cost = overlap_score_weight * prefill_blocks + decode_blocks` - - Lower costs indicate better routing choices - - `overlap_score_weight` balances cache hit optimization against load distribution - - Higher weights favor cache reuse (improving TTFT), while lower weights prioritize even load distribution (improving ITL) - -### Worker Selection - -The router selects the worker with the lowest cost. When `router_temperature` is set to a non-zero value, the router uses softmax sampling on the normalized cost logits to introduce randomness in the selection, which can help with load distribution. 
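As a sketch of this selection rule, using the worker states from the diagram above (illustrative only; the real router also applies temperature-based softmax sampling when configured):

```python
def route(workers, overlap_score_weight=1.0):
    """Pick the worker minimizing cost = weight * prefill_blocks + decode_blocks."""
    costs = {
        wid: overlap_score_weight * w["prefill_blocks"] + w["decode_blocks"]
        for wid, w in workers.items()
    }
    return min(costs, key=costs.get), costs


# Worker states from the diagram above
workers = {
    1: {"prefill_blocks": 8, "decode_blocks": 10},
    2: {"prefill_blocks": 5, "decode_blocks": 5},
    3: {"prefill_blocks": 2, "decode_blocks": 9},
}
best, costs = route(workers)
print(best, costs)  # 2 {1: 18.0, 2: 10.0, 3: 11.0}
```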
- -Example calculation with `overlap_score_weight = 1.0`: -- Worker 1: cost = 1.0 * 8 + 10 = 18 -- **Worker 2: cost = 1.0 * 5 + 5 = 10** (selected - lowest cost) -- Worker 3: cost = 1.0 * 2 + 9 = 11 - -## Events - -### KVPublisher -The KVPublisher can be initialized and then called in the inference framework where blocks are allocated and removed. - -The two types of events are: -- KV stored event -- KV removed event - -The publisher can be initialized and used through C bindings or Python bindings. - -### Deterministic Event IDs - -Engines do not need to emit deterministic block identifiers in KV events, as the router uses local block hashes (computed from token content) for tracking and matching blocks across workers. However, it is strongly preferred that engines do emit deterministic block identifiers, as this keeps the KvIndexer's internal lookup table smaller and more efficient. To ensure deterministic behavior, all workers should use identical engine versions/configuration. If your engine relies on Python's builtin `hash()` for any event IDs, set `PYTHONHASHSEED=0`; otherwise this setting has no effect. - -### KVIndexer -The KVIndexer builds and maintains a global view of cached blocks in a prefix tree. We modify the original prefix tree by also storing the worker id on each node. This is so we can return the number of matched blocks for each worker. - -The KVIndexer has a method `find_matches_for_request`, which takes in tokens and returns a dictionary with keys of worker id and values of the number of matched KV Blocks. - -### Inter-Router Communication - -In distributed deployments with multiple routers, each router maintains visibility over only a portion of the total requests. To ensure consistent routing decisions, routers synchronize their states through three event types: - -1. **AddRequest**: Notifies other routers when a request is assigned to a worker. 
Includes request ID, worker ID, token sequence blocks, and overlap score to track block usage across the system. - -2. **MarkPrefillCompleted**: Signals when a request moves from prefill to decode phase, allowing routers to update their worker load calculations by excluding completed prefill tokens. - -3. **Free**: Indicates request completion and resource release, enabling accurate block reference counting across all routers. - -Each event carries a unique router ID to prevent self-event processing. This asynchronous communication system ensures optimal routing decisions by maintaining consistent KV cache state across all routers, even as they handle different request streams. - -## Using KvPushRouter Python API - -Instead of launching the KV Router via command line, you can create a `KvPushRouter` object directly in Python. This allows per-request routing configuration overrides. - ->[!Warning] -> **Multiple Routers in Same Process**: If you need to run multiple `KvPushRouter` instances for fault tolerance or load distribution, you must launch them in **separate processes** (e.g., using `python -m dynamo.frontend` with different ports). Creating multiple `KvPushRouter` objects in the same Python process is not supported - they share the same cancellation token from the component's primary lease, so dropping one router will cancel all routers in that process. For in-process routing, use a single `KvPushRouter` instance. - -### Methods - -The `KvPushRouter` provides the following methods: - -- **`generate(token_ids, model, ...)`**: Route and execute a request, returning an async stream of responses. Automatically handles worker selection, state tracking, and lifecycle management. - -- **`best_worker(token_ids, router_config_override=None, request_id=None)`**: Query which worker would be selected for given tokens. Returns `(worker_id, dp_rank, overlap_blocks)`. 
- - Without `request_id`: Query-only, doesn't update router state - - With `request_id`: Updates router state to track the request. **Note**: If used with `request_id`, you must call `mark_prefill_complete()` and `free()` at the appropriate lifecycle points to maintain accurate load tracking - -- **`get_potential_loads(token_ids)`**: Get detailed load information for all workers, including potential prefill tokens and active decode blocks. Returns a list of load dictionaries. - -- **`mark_prefill_complete(request_id)`**: Signal that a request has completed its prefill phase. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker()` for manual routing instead of `generate()`. - -- **`free(request_id)`**: Signal that a request has completed and its resources should be released. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker()` for manual routing instead of `generate()`. - -- **`dump_events()`**: Dump all KV cache events from the router's indexer as a JSON string. Useful for debugging and analysis. 
### Setup

First, launch your backend engines:
```bash
python -m dynamo.vllm --model meta-llama/Llama-2-7b-hf
```

### Example Script

```python
import asyncio
from dynamo.llm import DistributedRuntime, KvPushRouter, KvRouterConfig

async def main():
    # Get runtime and create endpoint
    runtime = DistributedRuntime.detached()
    namespace = runtime.namespace("dynamo")
    component = namespace.component("backend")
    endpoint = component.endpoint("generate")

    # Create KV router
    kv_router_config = KvRouterConfig()
    router = KvPushRouter(
        endpoint=endpoint,
        block_size=16,
        kv_router_config=kv_router_config
    )

    # Your input tokens
    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    # Generate with per-request routing override
    stream = await router.generate(
        token_ids=token_ids,
        model="meta-llama/Llama-2-7b-hf",
        stop_conditions={
            "max_tokens": 20,    # Generate exactly 20 tokens
            "ignore_eos": True,  # Don't stop at EOS token
        },
        sampling_options={
            "temperature": 0.7,
            "top_p": 0.9,
        },
        router_config_override={
            "overlap_score_weight": 2.0,  # Prioritize cache hits for this request
            "router_temperature": 0.5,    # Add routing randomness
        }
    )

    # Collect generated tokens
    generated_tokens = []
    async for response in stream:
        if isinstance(response, dict) and "token_ids" in response:
            generated_tokens.extend(response["token_ids"])

    print(f"Generated {len(generated_tokens)} tokens: {generated_tokens}")

if __name__ == "__main__":
    asyncio.run(main())
```

### Routing Patterns

The `KvPushRouter` supports multiple usage patterns depending on your control requirements:

#### 1. Automatic Routing (Recommended)
Call `generate()` directly and let the router handle everything:
```python
stream = await router.generate(token_ids=tokens, model="model-name")
```
- **Best for**: Most use cases
- **Router automatically**: Selects best worker, updates state, routes request, tracks lifecycle

#### 2.
Manual State Management (Advanced)
Use `best_worker(request_id=...)` to select and track, then manage the request yourself:
```python
worker_id, _dp_rank, overlap = await router.best_worker(tokens, request_id="req-123")
response = await client.generate(tokens, request_id="req-123")
# await anext(response)  # Get first token
await router.mark_prefill_complete("req-123")  # After first token
# async for _ in response:  # Continue generating
#     ...
await router.free("req-123")  # After completion
```
- **Best for**: Custom request handling with router state tracking
- **Requires**: Calling `mark_prefill_complete()` and `free()` at correct lifecycle points
- **Caution**: Incorrect lifecycle management degrades load balancing accuracy

#### 3. Hierarchical Router Probing
Query without state updates, then route through a chosen router:
```python
# Probe multiple routers without updating state
worker_id_1, dp_rank, overlap_1 = await router_1.best_worker(tokens)  # No request_id
worker_id_2, dp_rank, overlap_2 = await router_2.best_worker(tokens)

# Pick the best router (and its chosen worker) based on results
if overlap_1 > overlap_2:
    chosen_router, chosen_worker = router_1, worker_id_1
else:
    chosen_router, chosen_worker = router_2, worker_id_2
stream = await chosen_router.generate(tokens, model="model-name", worker_id=chosen_worker)
```
- **Best for**: Multi-tier deployments (e.g., Envoy Gateway routing to multiple router groups)
- **Advantage**: Query multiple routers before committing to one

#### 4.
Custom Load-Based Routing -Use `get_potential_loads()` to implement custom routing logic: -```python -loads = await router.get_potential_loads(tokens) -# Apply custom logic (e.g., weighted scoring, constraints) -best_worker = min(loads, key=lambda x: custom_cost_fn(x)) -stream = await router.generate(tokens, model="model-name", worker_id=best_worker['worker_id']) -``` -- **Best for**: Custom optimization strategies beyond the built-in cost function -- **Advantage**: Full control over worker selection logic -- **See also**: Detailed example below in "Custom Routing Example: Minimizing TTFT" - -All patterns support `router_config_override` to adjust routing behavior per-request without recreating the router. - -### Custom Routing Example: Minimizing TTFT - -Here's an example of using `get_potential_loads()` to implement custom routing that minimizes Time To First Token (TTFT) by selecting the worker with the least prefill work: - -```python -import asyncio -from dynamo.llm import DistributedRuntime, KvPushRouter, KvRouterConfig - -async def minimize_ttft_routing(): - # Setup router - runtime = DistributedRuntime.detached() - namespace = runtime.namespace("dynamo") - component = namespace.component("backend") - endpoint = component.endpoint("generate") - - router = KvPushRouter( - endpoint=endpoint, - block_size=16, - kv_router_config=KvRouterConfig() - ) - - # Your input tokens - token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] - - # Get potential loads for all workers - potential_loads = await router.get_potential_loads(token_ids) - - # Find worker with minimum prefill tokens (best for TTFT) - best_worker = min(potential_loads, key=lambda x: x['potential_prefill_tokens']) - - print(f"Worker loads: {potential_loads}") - print(f"Selected worker {best_worker['worker_id']} with {best_worker['potential_prefill_tokens']} prefill tokens") - - # Route directly to the selected worker - stream = await router.generate( - token_ids=token_ids, - model="meta-llama/Llama-2-7b-hf", - 
worker_id=best_worker['worker_id'], # Force routing to optimal worker - stop_conditions={"max_tokens": 20} - ) - - # Process response - async for response in stream: - if isinstance(response, dict) and "token_ids" in response: - print(f"Generated tokens: {response['token_ids']}") - -if __name__ == "__main__": - asyncio.run(minimize_ttft_routing()) -``` - -This approach gives you complete control over routing decisions, allowing you to optimize for different metrics based on your specific requirements. As some examples: - -- **Minimize TTFT**: Select worker with lowest `potential_prefill_tokens` -- **Maximize cache reuse**: Use `best_worker()` which considers both prefill and decode loads -- **Balance load**: Consider both `potential_prefill_tokens` and `potential_decode_blocks` together - -See [KV Router Architecture](../router/README.md) for performance tuning details. - -## Dynamic Threshold Configuration - -The busy thresholds can be updated at runtime without restarting the frontend. 
The frontend exposes HTTP endpoints at `/busy_threshold`:

**Get or set a model's thresholds (POST):**
```bash
# Set both thresholds for a model
curl -X POST http://localhost:8000/busy_threshold \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}'
# Response: {"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}

# Set only the active decode blocks threshold
curl -X POST http://localhost:8000/busy_threshold \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85}'
# Response: {"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": null}

# Get current thresholds (omit threshold fields)
curl -X POST http://localhost:8000/busy_threshold \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-2-7b-hf"}'
# Response: {"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}
# Or if not configured: {"model": "...", "active_decode_blocks_threshold": null, "active_prefill_tokens_threshold": null}
```

**List all configured thresholds (GET):**
```bash
curl http://localhost:8000/busy_threshold
# Response: {"thresholds": [{"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}]}
```
diff --git a/docs/router/router_examples.md b/docs/router/router_examples.md
new file mode 100644
index 00000000000..38ae414c091
--- /dev/null
+++ b/docs/router/router_examples.md
@@ -0,0 +1,550 @@

# Router Examples

For quick start instructions, see the [Router README](README.md). This document provides further examples of using the Dynamo Router, including Python API usage, Kubernetes deployments, and custom routing patterns.
+ +## Table of Contents + +- [Using KvPushRouter Python API](#using-kvpushrouter-python-api) +- [K8s Examples](#k8s-examples) +- [Routing Patterns](#routing-patterns) +- [Custom Routing Example: Minimizing TTFT](#custom-routing-example-minimizing-ttft) +- [KV Event Publishing for Custom Engines](#kv-event-publishing-for-custom-engines) + +## Using KvPushRouter Python API + +Instead of launching the KV Router via command line, you can create a `KvPushRouter` object directly in Python. This allows per-request routing configuration overrides. + +>[!Warning] +> **Multiple Routers in Same Process**: If you need to run multiple `KvPushRouter` instances for fault tolerance or load distribution, you must launch them in **separate processes** (e.g., using `python -m dynamo.frontend` with different ports). Creating multiple `KvPushRouter` objects in the same Python process is not supported - they share the same cancellation token from the component's primary lease, so dropping one router will cancel all routers in that process. For in-process routing, use a single `KvPushRouter` instance. + +### Methods + +The `KvPushRouter` provides the following methods: + +- **`generate(token_ids, model, ...)`**: Route and execute a request, returning an async stream of responses. Automatically handles worker selection, state tracking, and lifecycle management. + +- **`best_worker(token_ids, router_config_override=None, request_id=None)`**: Query which worker would be selected for given tokens. Returns `(worker_id, dp_rank, overlap_blocks)`. + - Without `request_id`: Query-only, doesn't update router state + - With `request_id`: Updates router state to track the request. **Note**: If used with `request_id`, you must call `mark_prefill_complete()` and `free()` at the appropriate lifecycle points to maintain accurate load tracking + +- **`get_potential_loads(token_ids)`**: Get detailed load information for all workers, including potential prefill tokens and active decode blocks. 
Returns a list of load dictionaries.

- **`mark_prefill_complete(request_id)`**: Signal that a request has completed its prefill phase. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker()` for manual routing instead of `generate()`.

- **`free(request_id)`**: Signal that a request has completed and its resources should be released. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker()` for manual routing instead of `generate()`.

- **`dump_events()`**: Dump all KV cache events from the router's indexer as a JSON string. Useful for debugging and analysis.

### Setup

First, launch your backend engines:
```bash
python -m dynamo.vllm --model meta-llama/Llama-2-7b-hf
```

### Example Script

```python
import asyncio
from dynamo.llm import DistributedRuntime, KvPushRouter, KvRouterConfig

async def main():
    # Get runtime and create endpoint
    runtime = DistributedRuntime.detached()
    namespace = runtime.namespace("dynamo")
    component = namespace.component("backend")
    endpoint = component.endpoint("generate")

    # Create KV router
    kv_router_config = KvRouterConfig()
    router = KvPushRouter(
        endpoint=endpoint,
        block_size=16,
        kv_router_config=kv_router_config
    )

    # Your input tokens
    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    # Generate with per-request routing override
    stream = await router.generate(
        token_ids=token_ids,
        model="meta-llama/Llama-2-7b-hf",
        stop_conditions={
            "max_tokens": 20,    # Generate exactly 20 tokens
            "ignore_eos": True,  # Don't stop at EOS token
        },
        sampling_options={
            "temperature": 0.7,
            "top_p": 0.9,
        },
        router_config_override={
            "overlap_score_weight": 2.0,  # Prioritize cache hits for this request
            "router_temperature": 0.5,    # Add routing randomness
        }
    )

    # Collect generated tokens
    generated_tokens = []
    async for response in stream:
        if isinstance(response, dict) and
"token_ids" in response: + generated_tokens.extend(response["token_ids"]) + + print(f"Generated {len(generated_tokens)} tokens: {generated_tokens}") + +if __name__ == "__main__": + asyncio.run(main()) +``` + +## K8s Examples + +For basic Kubernetes deployment with the KV Router, see the [Kubernetes Deployment section](README.md#kubernetes-deployment) in the Quick Start guide. + +### Complete K8s Examples + +- [TRT-LLM aggregated router example](../../examples/backends/trtllm/deploy/agg_router.yaml) +- [vLLM aggregated router example](../../examples/backends/vllm/deploy/agg_router.yaml) +- [SGLang aggregated router example](../../examples/backends/sglang/deploy/agg_router.yaml) +- [Distributed inference tutorial](../../examples/basics/kubernetes/Distributed_Inference/agg_router.yaml) + +**For A/B Testing and Advanced K8s Setup:** +See the comprehensive [KV Router A/B Benchmarking Guide](../benchmarks/kv-router-ab-testing.md) for step-by-step instructions on deploying, configuring, and benchmarking the KV router in Kubernetes. 
+ +### Example with Advanced Configuration + +```yaml +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeployment +metadata: + name: my-deployment +spec: + services: + Frontend: + dynamoNamespace: my-namespace + componentType: frontend + replicas: 1 + envs: + - name: DYN_ROUTER_MODE + value: kv + - name: DYN_ROUTER_TEMPERATURE + value: "0.5" # Add some randomness to prevent worker saturation + - name: DYN_KV_OVERLAP_SCORE_WEIGHT + value: "1.5" # Prioritize TTFT over ITL + - name: DYN_KV_CACHE_BLOCK_SIZE + value: "16" + extraPodSpec: + mainContainer: + image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0 +``` + +### Alternative: Using Command Args in K8s + +You can also pass CLI arguments directly in the container command: + +```yaml +extraPodSpec: + mainContainer: + image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0 + command: + - /bin/sh + - -c + args: + - "python3 -m dynamo.frontend --router-mode kv --router-temperature 0.5 --http-port 8000" +``` + +**Recommendation:** Use environment variables for easier configuration management and consistency with Dynamo's K8s patterns. + +## Routing Patterns + +The `KvPushRouter` supports multiple usage patterns depending on your control requirements: + +### 1. Automatic Routing (Recommended) +Call `generate()` directly and let the router handle everything: +```python +stream = await router.generate(token_ids=tokens, model="model-name") +``` +- **Best for**: Most use cases +- **Router automatically**: Selects best worker, updates state, routes request, tracks lifecycle + +### 2. Manual State Management (Advanced) +Use `best_worker(request_id=...)` to select and track, then manage the request yourself: +```python +worker_id, _dp_rank, overlap = await router.best_worker(tokens, request_id="req-123") +response = await client.generate(tokens, request_id="req-123") +# await anext(response) # Get first token +await router.mark_prefill_complete("req-123") # After first token +# async for _ in response: # Continue generating +# ... 
+await router.free("req-123")  # After completion
+```
+- **Best for**: Custom request handling with router state tracking
+- **Requires**: Calling `mark_prefill_complete()` and `free()` at correct lifecycle points
+- **Caution**: Incorrect lifecycle management degrades load balancing accuracy
+
+### 3. Hierarchical Router Probing
+Query without state updates, then route through a chosen router:
+```python
+# Probe multiple routers without updating state
+worker_id_1, dp_rank, overlap_1 = await router_1.best_worker(tokens)  # No request_id
+worker_id_2, dp_rank, overlap_2 = await router_2.best_worker(tokens)
+
+# Pick the best router (and its worker) based on the probe results
+chosen_router = router_1 if overlap_1 > overlap_2 else router_2
+chosen_worker = worker_id_1 if overlap_1 > overlap_2 else worker_id_2
+stream = await chosen_router.generate(tokens, model="model-name", worker_id=chosen_worker)
+```
+- **Best for**: Multi-tier deployments (e.g., Envoy Gateway routing to multiple router groups)
+- **Advantage**: Query multiple routers before committing to one
+
+### 4. Custom Load-Based Routing
+Use `get_potential_loads()` to implement custom routing logic:
+```python
+loads = await router.get_potential_loads(tokens)
+# Apply custom logic (e.g., weighted scoring, constraints)
+best_worker = min(loads, key=lambda x: custom_cost_fn(x))
+stream = await router.generate(tokens, model="model-name", worker_id=best_worker['worker_id'])
+```
+- **Best for**: Custom optimization strategies beyond the built-in cost function
+- **Advantage**: Full control over worker selection logic
+- **See also**: Detailed example below in "Custom Routing Example: Minimizing TTFT"
+
+All patterns support `router_config_override` to adjust routing behavior per-request without recreating the router.
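The hierarchical probing pattern generalizes to any number of router groups. The sketch below is self-contained so it can run without a live deployment: `StubRouter` is a hypothetical stand-in for `KvPushRouter` that only mimics the `(worker_id, dp_rank, overlap)` tuple shape returned by `best_worker()`.

```python
import asyncio

class StubRouter:
    """Hypothetical stand-in for a KvPushRouter; illustration only."""

    def __init__(self, name: str, overlap: int):
        self.name = name
        self._overlap = overlap

    async def best_worker(self, tokens):
        # Probing without a request_id returns (worker_id, dp_rank, overlap)
        # and does not update router state.
        return (42, 0, self._overlap)

async def probe_and_choose(routers, tokens):
    # Probe every router concurrently, then commit to the highest overlap.
    results = await asyncio.gather(*(r.best_worker(tokens) for r in routers))
    best = max(range(len(routers)), key=lambda i: results[i][2])
    return routers[best], results[best][0]

routers = [StubRouter("tier-a", 3), StubRouter("tier-b", 7), StubRouter("tier-c", 5)]
chosen, worker_id = asyncio.run(probe_and_choose(routers, [1, 2, 3]))
print(chosen.name)  # prints "tier-b" (highest overlap)
```

In a real deployment you would then call `generate(..., worker_id=...)` on the winning router, as in pattern 3 above.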
+ +## Custom Routing Example: Minimizing TTFT + +Here's an example of using `get_potential_loads()` to implement custom routing that minimizes Time To First Token (TTFT) by selecting the worker with the least prefill work: + +```python +import asyncio +from dynamo.llm import DistributedRuntime, KvPushRouter, KvRouterConfig + +async def minimize_ttft_routing(): + # Setup router + runtime = DistributedRuntime.detached() + namespace = runtime.namespace("dynamo") + component = namespace.component("backend") + endpoint = component.endpoint("generate") + + router = KvPushRouter( + endpoint=endpoint, + block_size=16, + kv_router_config=KvRouterConfig() + ) + + # Your input tokens + token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] + + # Get potential loads for all workers + potential_loads = await router.get_potential_loads(token_ids) + + # Find worker with minimum prefill tokens (best for TTFT) + best_worker = min(potential_loads, key=lambda x: x['potential_prefill_tokens']) + + print(f"Worker loads: {potential_loads}") + print(f"Selected worker {best_worker['worker_id']} with {best_worker['potential_prefill_tokens']} prefill tokens") + + # Route directly to the selected worker + stream = await router.generate( + token_ids=token_ids, + model="meta-llama/Llama-2-7b-hf", + worker_id=best_worker['worker_id'], # Force routing to optimal worker + stop_conditions={"max_tokens": 20} + ) + + # Process response + async for response in stream: + if isinstance(response, dict) and "token_ids" in response: + print(f"Generated tokens: {response['token_ids']}") + +if __name__ == "__main__": + asyncio.run(minimize_ttft_routing()) +``` + +This approach gives you complete control over routing decisions, allowing you to optimize for different metrics based on your specific requirements. 
Some examples:
+
+- **Minimize TTFT**: Select worker with lowest `potential_prefill_tokens`
+- **Maximize cache reuse**: Use `best_worker()` which considers both prefill and decode loads
+- **Balance load**: Consider both `potential_prefill_tokens` and `potential_decode_blocks` together
+
+See [Router Design](../design_docs/router_design.md) for architecture details and the cost function algorithm.
+
+## KV Event Publishing for Custom Engines
+
+The KV Router relies on real-time events from backend workers to track which KV cache blocks are stored on each worker. When your custom engine allocates or evicts KV cache blocks, it should publish these events so the router can make optimal routing decisions. There are two main publishing pathways:
+
+- **Direct NATS publishing** (`KvEventPublisher`): publishes events directly to NATS; the simplest approach for custom engines.
+- **ZMQ-based publishing**: for engines with ZMQ event output (like vLLM); a ZMQ publisher in the engine emits events, and `ZmqKvEventPublisher` forwards them to NATS.
+
+### Event Types
+
+The KV cache supports three event types:
+
+| Event Type | Description | When to Publish |
+|------------|-------------|-----------------|
+| `BlockStored` | New blocks added to cache | After KV cache allocation succeeds |
+| `BlockRemoved` | Blocks evicted from cache | When blocks are evicted or freed |
+| `AllBlocksCleared` | All blocks removed | On cache reset or worker restart |
+
+### Event Structure
+
+Each event contains:
+- **`event_id`**: Monotonically increasing identifier per worker
+- **`dp_rank`**: Data parallel rank (0 if DP not enabled)
+- **`data`**: One of `Stored`, `Removed`, or `Cleared`
+
+For `BlockStored` events:
+- **`token_ids`**: List of token IDs for the stored blocks
+- **`block_hashes`**: List of **sequence block hashes** from the engine's block manager. 
These are cumulative hashes that incorporate all tokens from the start of the sequence up to and including the current block (not just the tokens within that block). This enables prefix matching across requests. +- **`num_block_tokens`**: Number of tokens per block (should all equal `kv_block_size`) +- **`parent_hash`**: Hash of the parent block. Required for all blocks except the first block in a sequence (which has no parent). +- **`lora_id`**: LoRA adapter ID (0 if not using LoRA) + +For `BlockRemoved` events: +- **`block_hashes`**: List of sequence block hashes being evicted + +### Option 1: Direct NATS Publishing (Recommended) + +The `KvEventPublisher` class publishes events directly to NATS. This is the simplest approach for custom engines. + +```mermaid +flowchart LR + subgraph Engine["Custom Engine"] + cache["KV Cache Manager"] + end + + subgraph Worker["Dynamo Worker Process"] + pub["KvEventPublisher"] + end + + subgraph NATS["NATS"] + subject["kv-events subject"] + end + + subgraph Router["KV Router"] + indexer["KvIndexer"] + end + + cache -->|"on_blocks_stored()
on_blocks_removed()"| pub + pub -->|"publish to NATS"| subject + subject --> indexer +``` + +**When to use:** +- Building a custom inference engine from scratch +- Your engine doesn't have a ZMQ-based event system +- You want the simplest integration path + +#### Basic Setup + +```python +from dynamo.llm import KvEventPublisher + +class CustomEnginePublisher: + def __init__(self, component, worker_id: int, block_size: int, dp_rank: int = 0): + self.block_size = block_size + self.event_id = 0 + self.kv_publisher = KvEventPublisher( + component=component, + worker_id=worker_id, + kv_block_size=block_size, + dp_rank=dp_rank, + enable_local_indexer=False, + ) + + def on_blocks_stored(self, token_ids: list[int], block_hashes: list[int], + lora_id: int = 0, parent_hash: int | None = None): + """Call after KV cache blocks are allocated.""" + self.event_id += 1 + num_block_tokens = [self.block_size] * len(block_hashes) + self.kv_publisher.publish_stored( + event_id=self.event_id, + token_ids=token_ids, + num_block_tokens=num_block_tokens, + block_hashes=block_hashes, + lora_id=lora_id, + parent_hash=parent_hash, + ) + + def on_blocks_removed(self, block_hashes: list[int]): + """Call when KV cache blocks are evicted.""" + self.event_id += 1 + self.kv_publisher.publish_removed(event_id=self.event_id, block_hashes=block_hashes) +``` + +#### Integration with Your Engine + +```python +from dynamo.llm import register_llm + +async def main(): + # Register your engine with Dynamo + component, endpoint = await register_llm( + model="my-model", + generator=my_generate_fn, + ) + + # Initialize publisher + publisher = CustomEnginePublisher( + component=component, + worker_id=endpoint.connection_id(), + block_size=16, # Match your engine's block size + ) + + # Hook into your engine's cache events + def on_prefill_complete(request_id, token_ids, blocks): + block_hashes = [block.hash for block in blocks] + publisher.on_blocks_stored(token_ids=token_ids, block_hashes=block_hashes) + + def 
on_cache_eviction(evicted_blocks): + block_hashes = [block.hash for block in evicted_blocks] + publisher.on_blocks_removed(block_hashes=block_hashes) +``` + +### Option 2: ZMQ-based Publishing + +For engines that publish events via ZMQ (like vLLM), this option uses two components that work together: + +1. **ZMQ Publisher** (in your engine) - Publishes events to a ZMQ socket +2. **ZmqKvEventPublisher** (Dynamo binding) - Subscribes to ZMQ and forwards to NATS + +```mermaid +flowchart LR + subgraph Engine["Custom Engine / vLLM"] + cache["KV Cache Manager"] + zmq_pub["ZMQ Publisher
(Pure Python)"] + end + + subgraph ZMQ["ZMQ Socket"] + socket["tcp://127.0.0.1:5557"] + end + + subgraph Worker["Dynamo Worker Process"] + zmq_sub["ZmqKvEventPublisher
(Rust bindings)"] + end + + subgraph NATS["NATS"] + subject["kv-events subject"] + end + + subgraph Router["KV Router"] + indexer["KvIndexer"] + end + + cache --> zmq_pub + zmq_pub -->|"PUB"| socket + socket -->|"SUB"| zmq_sub + zmq_sub --> subject + subject --> indexer +``` + +**When to use:** +- Your engine already has a ZMQ-based event system (like vLLM) +- You're integrating with a consolidator (like KVBM) +- You want to decouple event publishing from your engine's main loop + +#### Part 1: ZMQ Subscriber (Dynamo Bindings) + +If your engine already publishes to ZMQ, use `ZmqKvEventPublisher` to subscribe and forward to NATS: + +```python +from dynamo.llm import ZmqKvEventPublisher, ZmqKvEventPublisherConfig + +# Configure the ZMQ subscriber +config = ZmqKvEventPublisherConfig( + worker_id=endpoint.connection_id(), + kv_block_size=block_size, + zmq_endpoint="tcp://127.0.0.1:5557", # Where your engine publishes + zmq_topic="", # Subscribe to all topics + enable_local_indexer=False, +) + +# Create publisher - it automatically subscribes to ZMQ and forwards to NATS +kv_publisher = ZmqKvEventPublisher( + component=component, + config=config, +) +``` + +#### Part 2: ZMQ Publisher (Pure Python) + +If your engine needs to publish to ZMQ (e.g., for consolidator integration), implement the ZMQ protocol: + +```python +import zmq +import msgpack +import time + +class ZmqKvEventPublisher: + """Pure Python ZMQ publisher for KV events (vLLM-compatible format).""" + + def __init__(self, zmq_endpoint: str, kv_block_size: int, topic: str = ""): + self.kv_block_size = kv_block_size + self.topic = topic + self.ctx = zmq.Context() + self.socket = self.ctx.socket(zmq.PUB) + self.socket.bind(zmq_endpoint) + self.sequence = 0 + self.data_parallel_rank = 0 + + def _to_signed_i64(self, value: int | None) -> int | None: + if value is None: + return None + return value - 0x10000000000000000 if value > 0x7FFFFFFFFFFFFFFF else value + + def publish_stored(self, event_id: int, token_ids: 
list[int], num_block_tokens: list[int], + block_hashes: list[int], lora_id: int = 0, parent_hash: int | None = None): + event = { + "type": "BlockStored", + "block_hashes": [self._to_signed_i64(h) for h in block_hashes], + "parent_block_hash": self._to_signed_i64(parent_hash), + "token_ids": token_ids, + "block_size": self.kv_block_size, + "lora_id": lora_id if lora_id != 0 else None, + } + self._publish_event(event) + + def publish_removed(self, event_id: int, block_hashes: list[int]): + event = {"type": "BlockRemoved", "block_hashes": [self._to_signed_i64(h) for h in block_hashes]} + self._publish_event(event) + + def publish_all_cleared(self): + self._publish_event({"type": "AllBlocksCleared"}) + + def _publish_event(self, event: dict): + batch = [time.time(), [event], self.data_parallel_rank] + payload = msgpack.packb(batch, use_bin_type=True) + sequence_bytes = self.sequence.to_bytes(8, byteorder="big") + self.sequence += 1 + self.socket.send_multipart([self.topic.encode(), sequence_bytes, payload]) + + def shutdown(self): + self.socket.close() + self.ctx.term() +``` + +### ZMQ Wire Format + +The ZMQ message format (compatible with vLLM): + +| Frame | Description | +|-------|-------------| +| 1 | Topic (empty string for all topics) | +| 2 | Sequence number (8 bytes, big-endian) | +| 3 | Msgpack payload: `[timestamp, [events], dp_rank]` | + +Each event in the payload is a dictionary with `type` field (`BlockStored`, `BlockRemoved`, or `AllBlocksCleared`). + +### Best Practices + +1. **Event IDs must be monotonically increasing** per worker (use a thread-safe counter) + +2. **Block size must match** your engine's actual `kv_block_size` + +3. 
**`parent_hash` is required** for all blocks except the first in a sequence - it links blocks to enable prefix matching
+
+## See Also
+
+- **[Router README](README.md)**: Quick start guide for the KV Router
+- **[Router Guide](router_guide.md)**: Configuration, tuning, and production setup
+- **[Router Design](../design_docs/router_design.md)**: Architecture details and event transport modes
diff --git a/docs/router/router_guide.md b/docs/router/router_guide.md
new file mode 100644
index 00000000000..c8604ce1881
--- /dev/null
+++ b/docs/router/router_guide.md
@@ -0,0 +1,350 @@
+
+
+# Router Guide
+
+## Overview
+
+For quick start instructions, start with the [Router README](README.md). This guide covers advanced configuration, disaggregated serving setup, and parameter tuning.
+
+## KV Cache Routing
+
+KV cache routing optimizes large language model inference by intelligently directing requests to workers with the most relevant cached data. By maximizing cache reuse, it reduces redundant computation and improves both throughput and latency.
+
+```mermaid
+graph TD
+    T[Tokens] --> R[KV Aware Router]
+
+    R -.-> W1["Worker 1
Cached: 2 blocks
Prefill: 8 blks
Decode: 10 blks"] + R ==>|Selected| W2["Worker 2
Cached: 5 blocks
Prefill: 5 blks
Decode: 5 blks"] + R -.-> W3["Worker 3
Cached: 8 blocks
Prefill: 2 blks
Decode: 9 blks"] + + style T fill:#fff3e0,stroke:#333,color:#333 + style R fill:#2e8b57,stroke:#333,color:#fff + style W1 fill:#f3e5f5,stroke:#333,color:#333 + style W2 fill:#c8e6c9,stroke:#333,color:#333 + style W3 fill:#f3e5f5,stroke:#333,color:#333 + + linkStyle 0,1,2,3 stroke:#8b4513,stroke-width:2px +``` + +KV Cache reuse introduces complexity to LLM serving load balancing. While it can significantly reduce computation costs, routing strategies that ignore worker-specific KV states can lead to: +- Missed cache reuse opportunities due to suboptimal worker selection +- System throughput degradation from uneven request distribution across workers + +The router uses a cost function that considers both the prefill cost (influenced by cached blocks) and the decode load to make optimal routing decisions: + +### Cost Calculation + +1. **Prefill blocks**: Calculated by dividing the number of tokens requiring prefill processing by the block size. The system predicts this based on input tokens and available cached blocks per worker, updating the count when the first output token signals prefill completion. + +2. **Decode blocks**: Estimated from the request's input tokens and each worker's active sequences. The count updates when requests complete and their blocks are freed. + +3. **Cost formula**: `cost = overlap_score_weight * prefill_blocks + decode_blocks` + - Lower costs indicate better routing choices + - `overlap_score_weight` balances cache hit optimization against load distribution + - Higher weights favor cache reuse (improving TTFT), while lower weights prioritize even load distribution (improving ITL) + +### Worker Selection + +The router selects the worker with the lowest cost. When `router_temperature` is set to a non-zero value, the router uses softmax sampling on the normalized cost logits to introduce randomness in the selection, which can help with load distribution. 
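The cost formula and temperature-based sampling above can be modeled in a few lines of Python. This is an illustrative sketch only (the production router is implemented in Rust), using the per-worker prefill and decode block counts from the diagram:

```python
import math
import random

def select_worker(workers, overlap_score_weight=1.0, temperature=0.0):
    # cost = overlap_score_weight * prefill_blocks + decode_blocks
    costs = [overlap_score_weight * w["prefill_blocks"] + w["decode_blocks"]
             for w in workers]
    if temperature == 0.0:
        # Deterministic: the lowest-cost worker wins
        return min(range(len(workers)), key=lambda i: costs[i])
    # Softmax sampling over negated costs: lower cost -> higher probability
    logits = [-c / temperature for c in costs]
    max_logit = max(logits)
    weights = [math.exp(l - max_logit) for l in logits]
    return random.choices(range(len(workers)), weights=weights)[0]

# The three workers from the diagram above
workers = [
    {"prefill_blocks": 8, "decode_blocks": 10},  # Worker 1: cost 18
    {"prefill_blocks": 5, "decode_blocks": 5},   # Worker 2: cost 10
    {"prefill_blocks": 2, "decode_blocks": 9},   # Worker 3: cost 11
]
print(select_worker(workers))  # prints 1 (index of Worker 2, the lowest cost)
```

With a non-zero `temperature`, higher-cost workers are occasionally selected, trading some cache reuse for better load spread.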
+ +Example calculation with `overlap_score_weight = 1.0`: +- Worker 1: cost = 1.0 * 8 + 10 = 18 +- **Worker 2: cost = 1.0 * 5 + 5 = 10** (selected - lowest cost) +- Worker 3: cost = 1.0 * 2 + 9 = 11 + +### Using the KV Cache Router + +To enable KV cache-aware routing, start the frontend node like this: +```bash +python -m dynamo.frontend --router-mode kv +``` + +When KV blocks are created or removed, the engine notifies the Dynamo router, which then identifies the worker with the best matching blocks and routes traffic accordingly. + +To evaluate the benefits of KV-aware routing, compare your workload's performance using `--router-mode random|round-robin` against KV-aware routing. + +The main KV-aware routing arguments: + +- `--kv-overlap-score-weight`: Controls the importance of prefix cache overlaps in prefill cost calculations. Higher values improve Time To First Token (TTFT) at the cost of Inter-Token Latency (ITL). When set to 0, the router ignores prefix caches and uses pure load balancing. Defaults to 1. + +- `--router-temperature`: Controls worker selection randomness through softmax sampling of router cost logits. A value of 0 (default) ensures deterministic selection of the lowest-cost worker, while higher values introduce more randomness. + +- `--no-kv-events`: Disables KV event tracking. By default (when this flag is not provided), the router uses KV events to monitor block creation and deletion from workers. When disabled with this flag, the router predicts cache state based on routing decisions with TTL-based expiration (default 120s) and pruning. Use this flag if your backend doesn't support KV events (or you are not confident in the accuracy or responsiveness of the events). + +- `--router-replica-sync`: Disabled by default. Enables NATS-based synchronization of local routing decisions between router replicas. 
When enabled, routers share their active sequence information and local predictions of block usage, improving routing consistency across instances. Note that this does not sync the radix tree or cached KV block states themselves - those are synchronized through JetStream events.
+
+- `--router-reset-states`: When specified, resets the router state on startup by clearing both the JetStream event stream and NATS object store, starting with a fresh state. By default (when this flag is not provided), the router persists state across restarts, downloading any available snapshot from NATS object store and continuing to consume events from where it left off. This enables routers to maintain KV cache awareness across restarts. **Warning**: Using `--router-reset-states` can bring existing router replicas into an inconsistent state. Only use this flag when launching the first router replica in a component, or consider using a different namespace/component for a clean slate.
+
+- `--router-snapshot-threshold`: Sets the number of messages in the JetStream before triggering a snapshot. When the message count exceeds this threshold, a router will attempt to purge acknowledged messages from the stream and create a snapshot of the current radix tree state in NATS object store. Defaults to 1000000. This helps manage stream size and provides faster initialization for routers that restart.
+
+- `--no-track-active-blocks`: Disables tracking of active blocks (blocks being used for ongoing generation/decode phases). By default, the router tracks active blocks for load balancing. Disable this when routing to workers that only perform prefill (no decode phase), as tracking decode load is not relevant. This reduces router overhead and simplifies state management.
+
+- `--no-assume-kv-reuse`: When tracking active blocks, disables the assumption of KV cache reuse. 
By default (`router_assume_kv_reuse=true`), the router computes actual block hashes for sequence tracking to deduplicate blocks and optimize load balancing. When disabled via this flag, the router generates random hashes for sequence blocks, treating each request's blocks as unique. This is useful in disaggregated setups where prefill transfers blocks to decode workers that may already have those blocks cached, but the engine cannot coordinate transfers to avoid duplication. Without this flag, the router's load balancing heuristics would undercount decode blocks when duplicates exist. + +- `--active-decode-blocks-threshold`: Initial threshold (0.0-1.0) for determining when a worker is considered busy based on KV cache block utilization. When a worker's KV cache active blocks exceed this percentage of total blocks, it will be marked as busy and excluded from routing. If not set, blocks-based busy detection is disabled. This feature works with all routing modes (`--router-mode kv|round-robin|random`) as long as backend engines publish load metrics. The threshold can be dynamically updated at runtime via the `/busy_threshold` HTTP endpoint (see [Dynamic Threshold Configuration](#dynamic-threshold-configuration)). + +- `--active-prefill-tokens-threshold`: Literal token count threshold for determining when a worker is considered busy based on prefill token utilization. When active prefill tokens exceed this threshold, the worker is marked as busy. If not set, tokens-based busy detection is disabled. + +- `--router-ttl`: Time-to-live in seconds for blocks in the router's local cache predictions. Blocks older than this duration will be automatically expired and removed from the router's radix tree. Defaults to 120.0 seconds when `--no-kv-events` is used. This helps manage memory usage by removing stale cache predictions that are unlikely to be accurate. + +- `--router-max-tree-size`: Maximum tree size (number of blocks) before pruning is triggered. 
When the total number of blocks in the radix tree exceeds this threshold, the router will prune the least recently used blocks. Defaults to 1048576 (2^20 blocks) when `--no-kv-events` is used. This prevents unbounded memory growth in long-running deployments. + +- `--router-prune-target-ratio`: Target size ratio to prune down to when `--router-max-tree-size` is exceeded. For example, with a value of 0.8 (default) and max tree size of 1048576, the router will prune down to approximately 838860 blocks when the threshold is exceeded. Defaults to 0.8 when `--no-kv-events` is used. This creates headroom before the next pruning cycle. + +>[!Note] +> **State persistence** depends on the event transport mode: +> - **JetStream mode** (default): State persists across router restarts via JetStream and NATS object store snapshots. +> - **NATS Core with Local Indexer mode** (`--enable-local-indexer` on workers): State persists on workers—router rebuilds state by querying workers on startup. +> - **No KV events** (`--no-kv-events`): State persistence is not supported. +> +> **Request plane is independent of KV event transport.** +> `DYN_REQUEST_PLANE` controls how **requests** are sent (TCP/HTTP/NATS), but KV-aware routing still uses **NATS** for KV events in both JetStream and NATS Core + Local Indexer modes. +> When KV events are enabled (default), NATS is automatically initialized. You can optionally set `NATS_SERVER=nats://...` to specify a custom NATS server; otherwise, it defaults to `localhost:4222`. +> Use `--no-kv-events` to disable KV events and remove the NATS requirement entirely (with request plane being `tcp` or `http`). +> +> When `--kv-overlap-score-weight` is set to 0, no KvIndexer is created and prefix matching is disabled (pure load balancing). When `--no-kv-events` is set, a KvIndexer is still created but no event subscriber is launched to consume KV events from workers. 
Instead, the router predicts cache state based on its own routing decisions with TTL-based expiration and pruning.
+>
+> **Backend Configuration:** When using `--no-kv-events`, configure your backend workers to disable KV event publishing:
+> - **vLLM**: Use `--kv-events-config '{"enable_kv_cache_events": false}'`
+> - **SGLang**: Do not use `--kv-events-config`
+> - **TRT-LLM**: Do not use `--publish-events-and-metrics`
+>
+> The CLI args `--router-ttl`, `--router-max-tree-size`, and `--router-prune-target-ratio` control local cache management when the router operates without receiving events from workers. When KV events are enabled (default), the router relies on worker-side eviction events and these parameters are ignored.
+
+To implement KV event publishing for custom inference engines, enabling them to participate in Dynamo's KV cache-aware routing, see [KV Event Publishing for Custom Engines](../integrations/kv_events_custom_engines.md).
+
+## Basic Routing
+
+Dynamo supports several routing strategies when sending requests from one component to another component's endpoint.
+
+First, create a client tied to a component's endpoint using its namespace, component, and endpoint labels. Here we are getting a client tied to the `generate` endpoint of the `VllmWorker` component.
+
+```python
+client = namespace('dynamo').component('VllmWorker').endpoint('generate').client()
+```
+
+We can then use the default routing methods exposed by the client class to send requests to the `VllmWorker` component.
+
+- **Random routing**: Default strategy, available via `client.generate()` or `client.random()`
+- **Round-robin routing**: Cycles through available workers via `client.round_robin()`
+- **Direct routing**: Explicitly targets a specific worker via `client.direct(input, component_id)`
+
+KV Cache routing uses direct routing with a special worker selection algorithm.
+
+For benchmarking KV router performance, see the [KV Router A/B Benchmarking Guide](../benchmarks/kv-router-ab-testing.md).
+
+For custom routing logic and advanced patterns, see [Routing Patterns](router_examples.md#routing-patterns) in the examples documentation.
+
+## Tuning Guidelines
+
+### 1. Understand Your Workload Characteristics
+
+- **Prefill-heavy workloads** (long prompts, short generations): Increase `kv-overlap-score-weight`
+- **Decode-heavy workloads** (short prompts, long generations): Decrease `kv-overlap-score-weight`
+
+### 2. Monitor Key Metrics
+
+The router logs the cost calculation for each worker:
+```text
+Formula for worker_1: 125.5 = 1.0 * 100.5 + 25.0 (cached_blocks: 15)
+```
+
+This shows:
+- Total cost (125.5)
+- Overlap weight × prefill blocks (1.0 × 100.5)
+- Active blocks (25.0)
+- Cached blocks that contribute to overlap (15)
+
+### 3. Temperature-Based Routing
+
+The `router_temperature` parameter controls routing randomness:
+- **0.0 (default)**: Deterministic selection of the best worker
+- **> 0.0**: Probabilistic selection, higher values increase randomness
+- Useful for preventing worker saturation and improving load distribution
+
+### 4. Iterative Optimization
+
+1. Begin with default settings
+2. Monitor TTFT and ITL metrics
+3. Adjust `kv-overlap-score-weight` to meet your performance goals:
+   - To reduce TTFT: Increase the weight
+   - To reduce ITL: Decrease the weight
+4. If you observe severe load imbalance, increase the temperature setting
+
+## Disaggregated Serving
+
+Dynamo supports disaggregated serving where prefill (prompt processing) and decode (token generation) are handled by separate worker pools. When you register workers with `ModelType.Prefill` (see [Backend Guide](../development/backend-guide.md)), the frontend automatically detects them and activates an internal prefill router.
+
+### Automatic Prefill Router Activation
+
+The prefill router is automatically created when:
+1. A decode model is registered (e.g., via `register_llm()` with `ModelType.Chat | ModelType.Completions`)
+2. A prefill worker is detected with the same model name and `ModelType.Prefill`
+
+**Key characteristics of the prefill router:**
+- **Always disables active block tracking** (`track_active_blocks=false`) since prefill workers don't perform decode
+- **Seamlessly integrated** into the request pipeline between preprocessing and decode routing
+- **Falls back gracefully** to decode-only mode if prefill fails or no prefill workers are available
+
+### Setup Example
+
+When both workers are registered, requests are automatically routed.
+
+```python
+# Decode worker registration (in your decode worker)
+decode_endpoint = runtime.namespace("dynamo").component("decode").endpoint("generate")
+
+await register_llm(
+    model_input=ModelInput.Tokens,
+    model_type=ModelType.Chat | ModelType.Completions,
+    endpoint=decode_endpoint,
+    model_name="meta-llama/Llama-2-7b-hf",
+    # ... other parameters
+)
+
+await decode_endpoint.serve_endpoint(decode_handler.generate)
+
+# Prefill worker registration (in your prefill worker)
+prefill_endpoint = runtime.namespace("dynamo").component("prefill").endpoint("generate")
+
+await register_llm(
+    model_input=ModelInput.Tokens,
+    model_type=ModelType.Prefill,  # <-- Mark as prefill worker
+    endpoint=prefill_endpoint,
+    model_name="meta-llama/Llama-2-7b-hf",  # Must match decode model name
+    # ... other parameters
+)
+
+await prefill_endpoint.serve_endpoint(prefill_handler.generate)
+```
+
+> [!Note]
+> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh).
+ +### Request Flow + +The following diagram shows an overview of the major components in disaggregated serving: + +```mermaid +graph TD + HTTP[HTTP] + ROUTER[Router] + PREFILL[Prefill Worker] + DECODE[Decode Worker] + + classDef worker_style fill:#f3e5f5,stroke:#333,stroke-width:2px,color:#333; + classDef router_style fill:#2e8b57,stroke:#333,stroke-width:2px,color:#fff; + + class PREFILL,DECODE worker_style + class ROUTER router_style + + HTTP <--> |"request/response"| ROUTER + ROUTER --> |"1. send to prefill"| PREFILL + PREFILL --> |"2. return NIXL metadata"| ROUTER + ROUTER --> |"3. send with metadata"| DECODE + DECODE --> |"4. stream response"| ROUTER + + PREFILL -.-> |"publish kv events"| ROUTER + + linkStyle 0,1,2,3,4 stroke:#8b4513,stroke-width:2px + linkStyle 5 stroke:#2196f3,stroke-width:2px +``` + +## Serving Multiple Router Replicas + +For improved fault tolerance, you can launch multiple frontend + router replicas. Since the frontend and router are currently tied together, you'll need to use different HTTP ports for each instance. (The separation of the frontend and Router is WIP.) + +### Router State Management + +The KV Router tracks two types of state (see [Router Design](../design_docs/router_design.md) for details): + +1. **Prefix blocks (cached KV blocks)**: Maintained in a radix tree, tracking which blocks are cached on each worker. This state is **persistent** - backed by NATS JetStream events and object store snapshots. New router replicas automatically sync this state on startup, ensuring consistent cache awareness across restarts. + +2. **Active blocks (decoding blocks)**: Tracks blocks currently being used for active generation requests. This state is **ephemeral** - when a new router replica starts, it begins with zero active block knowledge but becomes eventually consistent as it handles requests. 
+ +### Enabling Router Replica Synchronization + +```bash +# Router replica 1 +python -m dynamo.frontend --router-mode kv --port 8000 --router-replica-sync + +# Router replica 2 (can be started later) +python -m dynamo.frontend --router-mode kv --port 8001 --router-replica-sync +``` + +The `--router-replica-sync` flag enables active block synchronization between replicas: +- Active blocks are shared via NATS core messaging (fire-and-forget) +- Replicas exchange routing decisions to maintain consistent load estimates +- A new replica start with zero active blocks but quickly converge through request handling, by itself and active syncing with other replicas + +Without this flag, each replica maintains its own isolated view of active blocks, potentially leading to suboptimal routing. + +### Persistence and Recovery + +Persistence behavior depends on which event transport mode is active: + +**JetStream Mode (default):** +- Prefix blocks are stored in NATS JetStream with 1-hour retention +- Snapshots saved to NATS object store at configurable thresholds +- New replicas automatically restore this state on startup +- You can launch a third Router replica even if the first two are down, and it will recover the full prefix state + +```bash +python -m dynamo.frontend --router-mode kv --port 8002 --router-replica-sync +``` + +**NATS Core with Local Indexer Mode:** +- State persists on workers—events are fire-and-forget but workers retain their local indexer state +- On startup, the router queries each worker's local indexer to rebuild state +- Recovery depends on workers being available; if a worker is down, its blocks cannot be recovered +- Simpler infrastructure (no JetStream required) but less resilient + +>[!Note] +> If you need to start with a fresh state in JetStream mode, you have two options: +> 1. 
**Recommended**: Use a different namespace/component (see [Distributed Runtime](/docs/design_docs/distributed_runtime.md)) which will start a new stream and NATS object store path
+> 2. **Use with caution**: Launch a router with the `--router-reset-states` flag, which will purge the entire stream and radix snapshot. This should only be done when launching the first router replica in a component, as it can bring existing router replicas into an inconsistent state.
+
+## Dynamic Threshold Configuration
+
+Worker busy thresholds can be adjusted at runtime without restarting the frontend, enabling real-time tuning of load balancing behavior based on observed system performance. The frontend exposes HTTP endpoints at `/busy_threshold`:
+
+**Get or set a model's thresholds (POST):**
+```bash
+# Set both thresholds for a model
+curl -X POST http://localhost:8000/busy_threshold \
+  -H "Content-Type: application/json" \
+  -d '{"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}'
+# Response: {"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}
+
+# Set only the active decode blocks threshold (omitted fields are left unchanged)
+curl -X POST http://localhost:8000/busy_threshold \
+  -H "Content-Type: application/json" \
+  -d '{"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85}'
+# Response: {"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}
+
+# Get current thresholds (omit threshold fields)
+curl -X POST http://localhost:8000/busy_threshold \
+  -H "Content-Type: application/json" \
+  -d '{"model": "meta-llama/Llama-2-7b-hf"}'
+# Response: {"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}
+# Or if not configured: 
{"model": "...", "active_decode_blocks_threshold": null, "active_prefill_tokens_threshold": null} +``` + +**List all configured thresholds (GET):** +```bash +curl http://localhost:8000/busy_threshold +# Response: {"thresholds": [{"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}]} +``` + +## See Also + +- **[Router README](README.md)**: Quick start guide for the KV Router +- **[Router Examples](router_examples.md)**: Python API usage, K8s examples, and custom routing patterns +- **[Router Design](../design_docs/router_design.md)**: Architecture details and event transport modes +- **[KV Event Publishing for Custom Engines](../integrations/kv_events_custom_engines.md)**: Integrate custom inference engines with KV-aware routing diff --git a/docs/templates/MIGRATION_GUIDE.md b/docs/templates/MIGRATION_GUIDE.md index ccb364f30a1..b1c61b3b407 100644 --- a/docs/templates/MIGRATION_GUIDE.md +++ b/docs/templates/MIGRATION_GUIDE.md @@ -130,6 +130,52 @@ Check `docs/_includes/` for includes: --- +## Pre-Migration Link Validation + +Before migrating, validate source docs to avoid carrying over broken links. + +### Pre-flight Broken Link Check + +```bash +# Install lychee (if not available) +cargo install lychee # or: brew install lychee + +# Check source files (example: migrating kvbm docs) +lychee docs/kvbm/ --offline --exclude-path docs/_build + +# Or use the full check with external URLs +lychee docs/kvbm/ --exclude-path docs/_build +``` + +If lychee is unavailable, use ripgrep to find potentially broken links: + +```bash +# Find all internal markdown links and spot-check targets +rg -n '\]\([^http][^)]*\.md' docs/kvbm/ +``` + +### Golden Rule + +**Only link to files that exist.** Before adding any link: + +1. Verify the target file exists at the expected path +2. Test the relative path calculation (count `../` correctly) +3. 
For cross-section links, consider using the cross-reference path table + +### Post-Migration Validation + +After moving files, run link check again to catch broken references: + +```bash +# Check all docs after migration +lychee docs/ --offline --exclude-path docs/_build + +# Check specific migrated directory (example: after moving to components/kvbm) +lychee docs/components/kvbm/ --offline +``` + +--- + ## Style Editing Guidelines After migrating content, review for FLOW, STYLE, and CONSISTENCY. diff --git a/dynamo.code-workspace b/dynamo.code-workspace index c3c84ca205e..2dae53708e6 100644 --- a/dynamo.code-workspace +++ b/dynamo.code-workspace @@ -2,6 +2,9 @@ "folders": [ { "path": "." + }, + { + "path": "../dynamo-tpm" } ], "settings": { diff --git a/examples/backends/trtllm/deploy/README.md b/examples/backends/trtllm/deploy/README.md index 7529783e5e5..834ea4544b1 100644 --- a/examples/backends/trtllm/deploy/README.md +++ b/examples/backends/trtllm/deploy/README.md @@ -266,7 +266,7 @@ Configure the `model` name and `host` based on your deployment. 
- **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md) - **Platform Setup**: [Dynamo Kubernetes Platform Installation](../../../../docs/kubernetes/installation_guide.md) - **Examples**: [Deployment Examples](../../../../docs/examples/README.md) -- **Architecture Docs**: [Disaggregated Serving](../../../../docs/design_docs/disagg_serving.md), [KV-Aware Routing](../../../../docs/router/kv_cache_routing.md) +- **Architecture Docs**: [Disaggregated Serving](../../../../docs/design_docs/disagg_serving.md), [KV-Aware Routing](../../../../docs/router/README.md) - **Multinode Deployment**: [Multinode Examples](../../../../docs/backends/trtllm/multinode/multinode-examples.md) - **Speculative Decoding**: [Llama 4 + Eagle Guide](../../../../docs/backends/trtllm/llama4_plus_eagle.md) - **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/) diff --git a/examples/backends/vllm/deploy/README.md b/examples/backends/vllm/deploy/README.md index 7f8c9520c40..4fc72dbb0fb 100644 --- a/examples/backends/vllm/deploy/README.md +++ b/examples/backends/vllm/deploy/README.md @@ -249,7 +249,7 @@ args: - **Platform Setup**: [Dynamo Kubernetes Platform Installation](../../../../docs/kubernetes/installation_guide.md) - **SLA Planner**: [SLA Planner Quickstart Guide](../../../../docs/planner/sla_planner_quickstart.md) - **Examples**: [Deployment Examples](../../../../docs/examples/README.md) -- **Architecture Docs**: [Disaggregated Serving](../../../../docs/design_docs/disagg_serving.md), [KV-Aware Routing](../../../../docs/router/kv_cache_routing.md) +- **Architecture Docs**: [Disaggregated Serving](../../../../docs/design_docs/disagg_serving.md), [KV-Aware Routing](../../../../docs/router/README.md) ## Troubleshooting diff --git a/examples/basics/multinode/README.md b/examples/basics/multinode/README.md index 5ce827842c9..0076cfe3f67 100644 --- a/examples/basics/multinode/README.md 
+++ b/examples/basics/multinode/README.md @@ -5,7 +5,7 @@ This example demonstrates running Dynamo across multiple nodes with **KV-aware r For more information about the core concepts, see: - [Dynamo Disaggregated Serving](../../../docs/design_docs/disagg_serving.md) -- [KV Cache Routing Architecture](../../../docs/router/kv_cache_routing.md) +- [KV Cache Routing](../../../docs/router/README.md) ## Architecture Overview @@ -65,7 +65,7 @@ This is particularly beneficial for: - **Similar queries**: Common prefixes are computed once and reused - **Batch processing**: Related requests can be routed to workers with shared context -For detailed technical information about how KV routing works, see the [KV Cache Routing Architecture documentation](../../../docs/router/kv_cache_routing.md). +For detailed technical information about how KV routing works, see the [Router Guide](../../../docs/router/router_guide.md). ## Prerequisites @@ -475,7 +475,7 @@ python -m dynamo.frontend \ --router-temperature 0.0 # Temperature for probabilistic routing (0 = deterministic) ``` -For more advanced configuration options including custom worker selection, block size tuning, and alternative indexing strategies, see the [KV Cache Routing documentation](../../../docs/router/kv_cache_routing.md). +For more advanced configuration options including custom worker selection, block size tuning, and alternative indexing strategies, see the [Router Guide](../../../docs/router/router_guide.md). ## Cleanup