Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 9 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -52,10 +52,10 @@ Built in Rust for performance and Python for extensibility, Dynamo is fully open
|---|:----:|:----------:|:--:|
| **Best For** | High-throughput serving | Maximum performance | Broadest feature coverage |
| [**Disaggregated Serving**](docs/design_docs/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**KV-Aware Routing**](docs/router/README.md) | ✅ | ✅ | ✅ |
| [**SLA-Based Planner**](docs/planner/sla_planner.md) | ✅ | ✅ | ✅ |
| [**KVBM**](docs/kvbm/README.md) | 🚧 | ✅ | ✅ |
| [**Multimodal**](docs/multimodal/index.md) | ✅ | ✅ | ✅ |
| [**KV-Aware Routing**](docs/components/router/README.md) | ✅ | ✅ | ✅ |
| [**SLA-Based Planner**](docs/components/planner/planner_guide.md) | ✅ | ✅ | ✅ |
| [**KVBM**](docs/components/kvbm/README.md) | 🚧 | ✅ | ✅ |
| [**Multimodal**](docs/features/multimodal/README.md) | ✅ | ✅ | ✅ |
| [**Tool Calling**](docs/agents/tool-calling.md) | ✅ | ✅ | ✅ |

> **[Full Feature Matrix →](docs/reference/feature-matrix.md)** — Detailed compatibility including LoRA, Request Migration, Speculative Decoding, and feature interactions.
Expand Down Expand Up @@ -347,7 +347,7 @@ python3 -m dynamo.frontend
Dynamo provides comprehensive benchmarking tools:

- **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies using AIPerf
- **[SLA-Driven Deployments](docs/planner/sla_planner_quickstart.md)** – Optimize deployments to meet SLA requirements
- **[SLA-Driven Deployments](docs/components/planner/planner_guide.md)** – Optimize deployments to meet SLA requirements

## Frontend OpenAPI Specification

Expand All @@ -357,7 +357,7 @@ The OpenAI-compatible frontend exposes an OpenAPI 3 spec at `/openapi.json`. To
cargo run -p dynamo-llm --bin generate-frontend-openapi
```

This writes to `docs/frontends/openapi.json`.
This writes to `docs/reference/api/openapi.json`.

## Service Discovery and Messaging

Expand Down Expand Up @@ -388,9 +388,9 @@ See [SGLang on Slurm](examples/backends/sglang/slurm_jobs/README.md) and [TRT-LL

<!-- Reference links for Feature Compatibility Matrix -->
[disagg]: docs/design_docs/disagg_serving.md
[kv-routing]: docs/router/README.md
[planner]: docs/planner/sla_planner.md
[kvbm]: docs/kvbm/README.md
[kv-routing]: docs/components/router/README.md
[planner]: docs/components/planner/planner_guide.md
[kvbm]: docs/components/kvbm/README.md
[mm]: examples/multimodal/
[migration]: docs/fault_tolerance/request_migration.md
[lora]: examples/backends/vllm/deploy/lora/README.md
Expand Down
1 change: 0 additions & 1 deletion benchmarks/profiler/README.md

This file was deleted.

13 changes: 13 additions & 0 deletions benchmarks/profiler/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Profiler

Documentation for the Dynamo Profiler has moved to [docs/components/profiler/](../../docs/components/profiler/README.md).

- [Profiler Overview](../../docs/components/profiler/README.md)
- [Profiler Guide](../../docs/components/profiler/profiler_guide.md)
- [Profiler Examples](../../docs/components/profiler/profiler_examples.md)
2 changes: 1 addition & 1 deletion benchmarks/profiler/webui/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -620,7 +620,7 @@ def create_gradio_interface(

> 📝 **Note:** The dotted red line in the prefill and decode charts are default TTFT and ITL SLAs if not specified.

> ⚠️ **Warning:** The TTFT values here represent the ideal case when requests arrive uniformly, minimizing queueing. Real-world TTFT may be higher than profiling results. To mitigate the issue, planner uses [correction factors](https://github.com/ai-dynamo/dynamo/blob/main/docs/planner/sla_planner.md#2-correction-factor-calculation) to adjust dynamically at runtime.
> ⚠️ **Warning:** The TTFT values here represent the ideal case when requests arrive uniformly, minimizing queueing. Real-world TTFT may be higher than profiling results. To mitigate the issue, planner uses [correction factors](https://github.com/ai-dynamo/dynamo/blob/main/docs/design_docs/planner_design.md#step-2-correction-factor-calculation) to adjust dynamically at runtime.

> 💡 **Tip:** Use the GPU cost checkbox and input in the charts section to convert GPU hours to cost.
"""
Expand Down
4 changes: 2 additions & 2 deletions benchmarks/router/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -127,7 +127,7 @@ To see all available router arguments, run:
python -m dynamo.frontend --help
```

For detailed explanations of router arguments (especially KV cache routing parameters), see the [Router Guide](../../docs/router/router_guide.md).
For detailed explanations of router arguments (especially KV cache routing parameters), see the [Router Guide](../../docs/components/router/router_guide.md).

> [!Note]
> If you're unsure whether your backend engines correctly emit KV events for certain models (e.g., hybrid models like gpt-oss or nemotron nano 2), use the `--no-kv-events` flag to disable KV event tracking and use approximate KV indexing instead:
Expand All @@ -146,7 +146,7 @@ When you launch prefill workers using `run_engines.sh --prefill`, the frontend a
- Uses the same routing mode as the frontend's `--router-mode` setting
- Seamlessly integrates with your decode workers for token generation

No additional configuration is needed - simply launch both decode and prefill workers, and the system handles the rest. See the [Router Guide](../../docs/router/router_guide.md#disaggregated-serving) for more details.
No additional configuration is needed - simply launch both decode and prefill workers, and the system handles the rest. See the [Router Guide](../../docs/components/router/router_guide.md#disaggregated-serving) for more details.

> [!Note]
> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh)
Expand Down
2 changes: 1 addition & 1 deletion components/src/dynamo/mocker/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -60,7 +60,7 @@ python -m dynamo.mocker \

The profile results directory should contain `selected_prefill_interpolation/` and `selected_decode_interpolation/` subdirectories with `raw_data.npz` files. This works seamlessly in Kubernetes where profile data is mounted via ConfigMap or PersistentVolume.

To generate profiling data for your own model/hardware configuration, run the profiler (see [SLA-driven profiling documentation](../../../../docs/benchmarks/sla_driven_profiling.md) for details):
To generate profiling data for your own model/hardware configuration, run the profiler (see [SLA-driven profiling documentation](../../../../docs/components/profiler/profiler_guide.md) for details):

```bash
python benchmarks/profiler/profile_sla.py \
Expand Down
2 changes: 1 addition & 1 deletion components/src/dynamo/planner/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,5 +19,5 @@ limitations under the License.

SLA-driven autoscaling controller for Dynamo inference graphs.

- **User docs**: [docs/planner/](/docs/planner/) (deployment, configuration, examples)
- **User docs**: [docs/planner/](/docs/components/planner/) (deployment, configuration, examples)
- **Design docs**: [docs/design_docs/planner_design.md](/docs/design_docs/planner_design.md) (architecture, algorithms)
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@

MISSING_PROFILING_DATA_ERROR_MESSAGE = (
"SLA-Planner requires pre-deployment profiling results to run.\n"
"Please follow /docs/benchmarks/sla_driven_profiling.md to run the profiling first,\n"
"Please follow /docs/components/profiler/profiler_guide.md to run the profiling first,\n"
"and make sure the profiling results are present in --profile-results-dir."
)

Expand Down
8 changes: 4 additions & 4 deletions components/src/dynamo/router/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@

# Standalone Router

A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see the [Router Guide](/docs/router/router_guide.md).
A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see the [Router Guide](/docs/components/router/router_guide.md).

## Overview

Expand All @@ -29,7 +29,7 @@ python -m dynamo.router \
- `--endpoint`: Full endpoint path for workers in the format `namespace.component.endpoint` (e.g., `dynamo.prefill.generate`)

**Router Configuration:**
For detailed descriptions of all KV router configuration options including `--block-size`, `--kv-overlap-score-weight`, `--router-temperature`, `--no-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, and `--no-track-active-blocks`, see the [Router Guide](/docs/router/router_guide.md).
For detailed descriptions of all KV router configuration options including `--block-size`, `--kv-overlap-score-weight`, `--router-temperature`, `--no-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, and `--no-track-active-blocks`, see the [Router Guide](/docs/components/router/router_guide.md).

## Architecture

Expand All @@ -43,7 +43,7 @@ Clients query the `find_best_worker` endpoint to determine which worker should p
## Example: Manual Disaggregated Serving (Alternative Setup)

> [!Note]
> **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See the [Router Guide](/docs/router/router_guide.md#disaggregated-serving) for the default setup.
> **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See the [Router Guide](/docs/components/router/router_guide.md#disaggregated-serving) for the default setup.
>
> Use this manual setup if you need explicit control over prefill routing configuration or want to manage prefill and decode routers separately.

Expand Down Expand Up @@ -103,7 +103,7 @@ See [`components/src/dynamo/vllm/handlers.py`](../vllm/handlers.py) for a refere

## See Also

- [Router Guide](/docs/router/router_guide.md) - Configuration and tuning for KV-aware routing
- [Router Guide](/docs/components/router/router_guide.md) - Configuration and tuning for KV-aware routing
- [Router Design](/docs/design_docs/router_design.md) - Architecture details and event transport modes
- [Frontend Router](../frontend/README.md) - Main HTTP frontend with integrated routing
- [Router Benchmarking](/benchmarks/router/README.md) - Performance testing and tuning
2 changes: 1 addition & 1 deletion deploy/inference-gateway/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -220,7 +220,7 @@ Common Vars for Routing Configuration:
- Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes.
- Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
- Set `DYNAMO_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing
- See the [Router Guide](../../docs/router/router_guide.md) for details.
- See the [Router Guide](../../docs/components/router/router_guide.md) for details.


Stand-Alone installation only:
Expand Down
2 changes: 1 addition & 1 deletion deploy/utils/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -145,7 +145,7 @@ kubectl delete pod pvc-access-pod -n $NAMESPACE

For complete benchmarking and profiling workflows:
- **Benchmarking Guide**: See [docs/benchmarks/benchmarking.md](../../docs/benchmarks/benchmarking.md) for comparing DynamoGraphDeployments and external endpoints
- **Pre-Deployment Profiling**: See [docs/benchmarks/sla_driven_profiling.md](../../docs/benchmarks/sla_driven_profiling.md) for optimizing configurations before deployment
- **Pre-Deployment Profiling**: See [docs/components/profiler/profiler_guide.md](../../docs/components/profiler/profiler_guide.md) for optimizing configurations before deployment

## Notes

Expand Down
9 changes: 0 additions & 9 deletions docs/_sections/frontends.rst

This file was deleted.

2 changes: 1 addition & 1 deletion docs/api/nixl_connect/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -103,7 +103,7 @@ flowchart LR

### Multimodal Example

In the case of the [Dynamo Multimodal Disaggregated Example](../../multimodal/vllm.md):
In the case of the [Dynamo Multimodal Disaggregated Example](../../features/multimodal/multimodal_vllm.md):

1. The HTTP frontend accepts a text prompt and a URL to an image.

Expand Down
8 changes: 4 additions & 4 deletions docs/backends/sglang/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,10 +36,10 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
|---------|--------|-------|
| [**Disaggregated Serving**](../../design_docs/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
| [**KV-Aware Routing**](../../router/README.md) | ✅ | |
| [**SLA-Based Planner**](../../planner/sla_planner.md) | ✅ | |
| [**Multimodal Support**](../../multimodal/sglang.md) | ✅ | |
| [**KVBM**](../../kvbm/README.md) | ❌ | Planned |
| [**KV-Aware Routing**](../../components/router/README.md) | ✅ | |
| [**SLA-Based Planner**](../../components/planner/planner_guide.md) | ✅ | |
| [**Multimodal Support**](../../features/multimodal/multimodal_sglang.md) | ✅ | |
| [**KVBM**](../../components/kvbm/README.md) | ❌ | Planned |


## Dynamo SGLang Integration
Expand Down
65 changes: 0 additions & 65 deletions docs/backends/sglang/sgl-hicache-example.md

This file was deleted.

14 changes: 7 additions & 7 deletions docs/backends/trtllm/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -55,10 +55,10 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
|---------|--------------|-------|
| [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
| [**KV-Aware Routing**](../../router/README.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | Planned |
| [**KVBM**](../../../docs/kvbm/README.md) | ✅ | |
| [**KV-Aware Routing**](../../components/router/README.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/components/planner/planner_guide.md) | ✅ | |
| [**Load Based Planner**](../../../docs/components/planner/README.md) | 🚧 | Planned |
| [**KVBM**](../../../docs/components/kvbm/README.md) | ✅ | |

### Large Scale P/D and WideEP Features

Expand Down Expand Up @@ -114,7 +114,7 @@ apt-get update && apt-get -y install git git-lfs
> [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals.

For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../router/router_guide.md).
For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../components/router/router_guide.md).

### Aggregated
```bash
Expand Down Expand Up @@ -231,7 +231,7 @@ To benchmark your deployment with AIPerf, see this utility script, configuring t

## Multimodal support

Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../multimodal/trtllm.md).
Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../features/multimodal/multimodal_trtllm.md).

## Logits Processing

Expand Down Expand Up @@ -327,7 +327,7 @@ For detailed instructions on running comprehensive performance sweeps across bot

Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.

Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-tensorrt-llm) .
Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/components/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-tensorrt-llm) .

## Known Issues and Mitigations

Expand Down
Loading
Loading