diff --git a/README.md b/README.md
index 5eedabd5f08..69006a1b98e 100644
--- a/README.md
+++ b/README.md
@@ -52,10 +52,10 @@ Built in Rust for performance and Python for extensibility, Dynamo is fully open
 |---|:----:|:----------:|:--:|
 | **Best For** | High-throughput serving | Maximum performance | Broadest feature coverage |
 | [**Disaggregated Serving**](docs/design_docs/disagg_serving.md) | ✅ | ✅ | ✅ |
-| [**KV-Aware Routing**](docs/router/README.md) | ✅ | ✅ | ✅ |
-| [**SLA-Based Planner**](docs/planner/sla_planner.md) | ✅ | ✅ | ✅ |
-| [**KVBM**](docs/kvbm/README.md) | 🚧 | ✅ | ✅ |
-| [**Multimodal**](docs/multimodal/index.md) | ✅ | ✅ | ✅ |
+| [**KV-Aware Routing**](docs/components/router/README.md) | ✅ | ✅ | ✅ |
+| [**SLA-Based Planner**](docs/components/planner/planner_guide.md) | ✅ | ✅ | ✅ |
+| [**KVBM**](docs/components/kvbm/README.md) | 🚧 | ✅ | ✅ |
+| [**Multimodal**](docs/features/multimodal/README.md) | ✅ | ✅ | ✅ |
 | [**Tool Calling**](docs/agents/tool-calling.md) | ✅ | ✅ | ✅ |

 > **[Full Feature Matrix →](docs/reference/feature-matrix.md)** — Detailed compatibility including LoRA, Request Migration, Speculative Decoding, and feature interactions.
@@ -347,7 +347,7 @@ python3 -m dynamo.frontend
 Dynamo provides comprehensive benchmarking tools:

 - **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies using AIPerf
-- **[SLA-Driven Deployments](docs/planner/sla_planner_quickstart.md)** – Optimize deployments to meet SLA requirements
+- **[SLA-Driven Deployments](docs/components/planner/planner_guide.md)** – Optimize deployments to meet SLA requirements

 ## Frontend OpenAPI Specification

@@ -357,7 +357,7 @@ The OpenAI-compatible frontend exposes an OpenAPI 3 spec at `/openapi.json`. To
 cargo run -p dynamo-llm --bin generate-frontend-openapi
 ```

-This writes to `docs/frontends/openapi.json`.
+This writes to `docs/reference/api/openapi.json`.
 ## Service Discovery and Messaging
@@ -388,9 +388,9 @@ See [SGLang on Slurm](examples/backends/sglang/slurm_jobs/README.md) and [TRT-LL
 [disagg]: docs/design_docs/disagg_serving.md
-[kv-routing]: docs/router/README.md
-[planner]: docs/planner/sla_planner.md
-[kvbm]: docs/kvbm/README.md
+[kv-routing]: docs/components/router/README.md
+[planner]: docs/components/planner/planner_guide.md
+[kvbm]: docs/components/kvbm/README.md
 [mm]: examples/multimodal/
 [migration]: docs/fault_tolerance/request_migration.md
 [lora]: examples/backends/vllm/deploy/lora/README.md
diff --git a/benchmarks/profiler/README.md b/benchmarks/profiler/README.md
deleted file mode 120000
index d0192ec6a3e..00000000000
--- a/benchmarks/profiler/README.md
+++ /dev/null
@@ -1 +0,0 @@
-../../docs/benchmarks/sla_driven_profiling.md
\ No newline at end of file
diff --git a/benchmarks/profiler/README.md b/benchmarks/profiler/README.md
new file mode 100644
index 00000000000..1c059261ce4
--- /dev/null
+++ b/benchmarks/profiler/README.md
@@ -0,0 +1,13 @@
+
+
+# Profiler
+
+Documentation for the Dynamo Profiler has moved to [docs/components/profiler/](../../docs/components/profiler/README.md).
+
+- [Profiler Overview](../../docs/components/profiler/README.md)
+- [Profiler Guide](../../docs/components/profiler/profiler_guide.md)
+- [Profiler Examples](../../docs/components/profiler/profiler_examples.md)
diff --git a/benchmarks/profiler/webui/utils.py b/benchmarks/profiler/webui/utils.py
index d049bffe64c..3451939c2b6 100644
--- a/benchmarks/profiler/webui/utils.py
+++ b/benchmarks/profiler/webui/utils.py
@@ -620,7 +620,7 @@ def create_gradio_interface(
     > 📝 **Note:** The dotted red line in the prefill and decode charts are default TTFT and ITL SLAs if not specified.

-    > ⚠️ **Warning:** The TTFT values here represent the ideal case when requests arrive uniformly, minimizing queueing. Real-world TTFT may be higher than profiling results. To mitigate the issue, planner uses [correction factors](https://github.com/ai-dynamo/dynamo/blob/main/docs/planner/sla_planner.md#2-correction-factor-calculation) to adjust dynamically at runtime.
+    > ⚠️ **Warning:** The TTFT values here represent the ideal case when requests arrive uniformly, minimizing queueing. Real-world TTFT may be higher than profiling results. To mitigate the issue, planner uses [correction factors](https://github.com/ai-dynamo/dynamo/blob/main/docs/design_docs/planner_design.md#step-2-correction-factor-calculation) to adjust dynamically at runtime.

     > 💡 **Tip:** Use the GPU cost checkbox and input in the charts section to convert GPU hours to cost.
     """
diff --git a/benchmarks/router/README.md b/benchmarks/router/README.md
index c009762caa7..3e379394bcb 100644
--- a/benchmarks/router/README.md
+++ b/benchmarks/router/README.md
@@ -127,7 +127,7 @@ To see all available router arguments, run:
 python -m dynamo.frontend --help
 ```

-For detailed explanations of router arguments (especially KV cache routing parameters), see the [Router Guide](../../docs/router/router_guide.md).
+For detailed explanations of router arguments (especially KV cache routing parameters), see the [Router Guide](../../docs/components/router/router_guide.md).

 > [!Note]
 > If you're unsure whether your backend engines correctly emit KV events for certain models (e.g., hybrid models like gpt-oss or nemotron nano 2), use the `--no-kv-events` flag to disable KV event tracking and use approximate KV indexing instead:
@@ -146,7 +146,7 @@ When you launch prefill workers using `run_engines.sh --prefill`, the frontend a
 - Uses the same routing mode as the frontend's `--router-mode` setting
 - Seamlessly integrates with your decode workers for token generation

-No additional configuration is needed - simply launch both decode and prefill workers, and the system handles the rest. See the [Router Guide](../../docs/router/router_guide.md#disaggregated-serving) for more details.
+No additional configuration is needed - simply launch both decode and prefill workers, and the system handles the rest. See the [Router Guide](../../docs/components/router/router_guide.md#disaggregated-serving) for more details.

 > [!Note]
 > The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh)
diff --git a/components/src/dynamo/mocker/README.md b/components/src/dynamo/mocker/README.md
index 030f39c3002..2fe8fb8b7eb 100644
--- a/components/src/dynamo/mocker/README.md
+++ b/components/src/dynamo/mocker/README.md
@@ -60,7 +60,7 @@ python -m dynamo.mocker \
 The profile results directory should contain `selected_prefill_interpolation/` and `selected_decode_interpolation/` subdirectories with `raw_data.npz` files. This works seamlessly in Kubernetes where profile data is mounted via ConfigMap or PersistentVolume.

-To generate profiling data for your own model/hardware configuration, run the profiler (see [SLA-driven profiling documentation](../../../../docs/benchmarks/sla_driven_profiling.md) for details):
+To generate profiling data for your own model/hardware configuration, run the profiler (see [SLA-driven profiling documentation](../../../../docs/components/profiler/profiler_guide.md) for details):

 ```bash
 python benchmarks/profiler/profile_sla.py \
diff --git a/components/src/dynamo/planner/README.md b/components/src/dynamo/planner/README.md
index 257a8253a52..da52391b59d 100644
--- a/components/src/dynamo/planner/README.md
+++ b/components/src/dynamo/planner/README.md
@@ -19,5 +19,5 @@ limitations under the License.
 SLA-driven autoscaling controller for Dynamo inference graphs.

-- **User docs**: [docs/planner/](/docs/planner/) (deployment, configuration, examples)
+- **User docs**: [docs/planner/](/docs/components/planner/) (deployment, configuration, examples)
 - **Design docs**: [docs/design_docs/planner_design.md](/docs/design_docs/planner_design.md) (architecture, algorithms)
diff --git a/components/src/dynamo/planner/utils/perf_interpolation.py b/components/src/dynamo/planner/utils/perf_interpolation.py
index a82a09eba38..de93104f2dc 100644
--- a/components/src/dynamo/planner/utils/perf_interpolation.py
+++ b/components/src/dynamo/planner/utils/perf_interpolation.py
@@ -29,7 +29,7 @@
 MISSING_PROFILING_DATA_ERROR_MESSAGE = (
     "SLA-Planner requires pre-deployment profiling results to run.\n"
-    "Please follow /docs/benchmarks/sla_driven_profiling.md to run the profiling first,\n"
+    "Please follow /docs/components/profiler/profiler_guide.md to run the profiling first,\n"
     "and make sure the profiling results are present in --profile-results-dir."
 )
diff --git a/components/src/dynamo/router/README.md b/components/src/dynamo/router/README.md
index 98183c49d46..da087065890 100644
--- a/components/src/dynamo/router/README.md
+++ b/components/src/dynamo/router/README.md
@@ -3,7 +3,7 @@

 # Standalone Router

-A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see the [Router Guide](/docs/router/router_guide.md).
+A backend-agnostic standalone KV-aware router service for Dynamo deployments. For details on how KV-aware routing works, see the [Router Guide](/docs/components/router/router_guide.md).
 ## Overview
@@ -29,7 +29,7 @@ python -m dynamo.router \
 - `--endpoint`: Full endpoint path for workers in the format `namespace.component.endpoint` (e.g., `dynamo.prefill.generate`)

 **Router Configuration:**
-For detailed descriptions of all KV router configuration options including `--block-size`, `--kv-overlap-score-weight`, `--router-temperature`, `--no-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, and `--no-track-active-blocks`, see the [Router Guide](/docs/router/router_guide.md).
+For detailed descriptions of all KV router configuration options including `--block-size`, `--kv-overlap-score-weight`, `--router-temperature`, `--no-kv-events`, `--router-replica-sync`, `--router-snapshot-threshold`, `--router-reset-states`, and `--no-track-active-blocks`, see the [Router Guide](/docs/components/router/router_guide.md).

 ## Architecture
@@ -43,7 +43,7 @@ Clients query the `find_best_worker` endpoint to determine which worker should p
 ## Example: Manual Disaggregated Serving (Alternative Setup)

 > [!Note]
-> **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See the [Router Guide](/docs/router/router_guide.md#disaggregated-serving) for the default setup.
+> **This is an alternative advanced setup.** The recommended approach for disaggregated serving is to use the frontend's automatic prefill routing, which activates when you register workers with `ModelType.Prefill`. See the [Router Guide](/docs/components/router/router_guide.md#disaggregated-serving) for the default setup.
 >
 > Use this manual setup if you need explicit control over prefill routing configuration or want to manage prefill and decode routers separately.
@@ -103,7 +103,7 @@ See [`components/src/dynamo/vllm/handlers.py`](../vllm/handlers.py) for a refere
 ## See Also

-- [Router Guide](/docs/router/router_guide.md) - Configuration and tuning for KV-aware routing
+- [Router Guide](/docs/components/router/router_guide.md) - Configuration and tuning for KV-aware routing
 - [Router Design](/docs/design_docs/router_design.md) - Architecture details and event transport modes
 - [Frontend Router](../frontend/README.md) - Main HTTP frontend with integrated routing
 - [Router Benchmarking](/benchmarks/router/README.md) - Performance testing and tuning
diff --git a/deploy/inference-gateway/README.md b/deploy/inference-gateway/README.md
index f4b08a0c5e1..72448107d40 100644
--- a/deploy/inference-gateway/README.md
+++ b/deploy/inference-gateway/README.md
@@ -220,7 +220,7 @@ Common Vars for Routing Configuration:
 - Set `DYNAMO_OVERLAP_SCORE_WEIGHT` to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes.
 - Set `DYNAMO_ROUTER_TEMPERATURE` to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
 - Set `DYNAMO_USE_KV_EVENTS=false` if you want to disable the workers sending KV events while using kv-routing
-  - See the [Router Guide](../../docs/router/router_guide.md) for details.
+  - See the [Router Guide](../../docs/components/router/router_guide.md) for details.
 Stand-Alone installation only:
diff --git a/deploy/utils/README.md b/deploy/utils/README.md
index 5b724d74362..cbbc8d80eeb 100644
--- a/deploy/utils/README.md
+++ b/deploy/utils/README.md
@@ -145,7 +145,7 @@ kubectl delete pod pvc-access-pod -n $NAMESPACE
 For complete benchmarking and profiling workflows:

 - **Benchmarking Guide**: See [docs/benchmarks/benchmarking.md](../../docs/benchmarks/benchmarking.md) for comparing DynamoGraphDeployments and external endpoints
-- **Pre-Deployment Profiling**: See [docs/benchmarks/sla_driven_profiling.md](../../docs/benchmarks/sla_driven_profiling.md) for optimizing configurations before deployment
+- **Pre-Deployment Profiling**: See [docs/components/profiler/profiler_guide.md](../../docs/components/profiler/profiler_guide.md) for optimizing configurations before deployment

 ## Notes
diff --git a/docs/_sections/frontends.rst b/docs/_sections/frontends.rst
deleted file mode 100644
index 89aa6dbfb42..00000000000
--- a/docs/_sections/frontends.rst
+++ /dev/null
@@ -1,9 +0,0 @@
-Frontends
-=========
-
-.. toctree::
-   :maxdepth: 1
-
-   Frontend Overview <../components/frontend/README.md>
-   Frontend Guide <../components/frontend/frontend_guide.md>
-   KServe (deprecated) <../frontends/kserve.md>
\ No newline at end of file
diff --git a/docs/api/nixl_connect/README.md b/docs/api/nixl_connect/README.md
index 2a65fa76951..b70ac3a5dbd 100644
--- a/docs/api/nixl_connect/README.md
+++ b/docs/api/nixl_connect/README.md
@@ -103,7 +103,7 @@ flowchart LR
 ### Multimodal Example

-In the case of the [Dynamo Multimodal Disaggregated Example](../../multimodal/vllm.md):
+In the case of the [Dynamo Multimodal Disaggregated Example](../../features/multimodal/multimodal_vllm.md):

 1. The HTTP frontend accepts a text prompt and a URL to an image.
diff --git a/docs/backends/sglang/README.md b/docs/backends/sglang/README.md
index 9f282391dcd..e6180b1ef61 100644
--- a/docs/backends/sglang/README.md
+++ b/docs/backends/sglang/README.md
@@ -36,10 +36,10 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 |---------|--------|-------|
 | [**Disaggregated Serving**](../../design_docs/disagg_serving.md) | ✅ | |
 | [**Conditional Disaggregation**](../../design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
-| [**KV-Aware Routing**](../../router/README.md) | ✅ | |
-| [**SLA-Based Planner**](../../planner/sla_planner.md) | ✅ | |
-| [**Multimodal Support**](../../multimodal/sglang.md) | ✅ | |
-| [**KVBM**](../../kvbm/README.md) | ❌ | Planned |
+| [**KV-Aware Routing**](../../components/router/README.md) | ✅ | |
+| [**SLA-Based Planner**](../../components/planner/planner_guide.md) | ✅ | |
+| [**Multimodal Support**](../../features/multimodal/multimodal_sglang.md) | ✅ | |
+| [**KVBM**](../../components/kvbm/README.md) | ❌ | Planned |

 ## Dynamo SGLang Integration
diff --git a/docs/backends/sglang/sgl-hicache-example.md b/docs/backends/sglang/sgl-hicache-example.md
deleted file mode 100644
index 4c71cbd4eb2..00000000000
--- a/docs/backends/sglang/sgl-hicache-example.md
+++ /dev/null
@@ -1,65 +0,0 @@
-
-
-# Enable SGLang Hierarchical Cache (HiCache)
-
-This guide shows how to enable SGLang's Hierarchical Cache (HiCache) inside Dynamo.
-
-## 1) Start the SGLang worker with HiCache enabled
-
-```bash
-python -m dynamo.sglang \
-  --model-path Qwen/Qwen3-0.6B \
-  --host 0.0.0.0 --port 8000 \
-  --page-size 64 \
-  --enable-hierarchical-cache \
-  --hicache-ratio 2 \
-  --hicache-write-policy write_through \
-  --hicache-storage-backend nixl \
-  --log-level debug \
-  --skip-tokenizer-init
-```
-
-- **--enable-hierarchical-cache**: Enables hierarchical KV cache/offload
-- **--hicache-ratio**: The ratio of the size of host KV cache memory pool to the size of device pool. Lower this number if your machine has less CPU memory.
-- **--hicache-write-policy**: Write policy (e.g., `write_through` for synchronous host writes)
-- **--hicache-storage-backend**: Host storage backend for HiCache (e.g., `nixl`). NIXL selects the concrete store automatically; see [PR #8488](https://github.com/sgl-project/sglang/pull/8488)
-
-
-Then, start the frontend:
-```bash
-python -m dynamo.frontend --http-port 8000
-```
-
-## 2) Send a single request
-
-```bash
-curl localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "Qwen/Qwen3-0.6B",
-    "messages": [
-      {
-        "role": "user",
-        "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
-      }
-    ],
-    "stream": false,
-    "max_tokens": 30
-  }'
-```
-
-## 3) (Optional) Benchmarking
-
-Run the perf script:
-```bash
-bash -x $DYNAMO_ROOT/benchmarks/llm/perf.sh \
-  --model Qwen/Qwen3-0.6B \
-  --tensor-parallelism 1 \
-  --data-parallelism 1 \
-  --concurrency "2,4,8" \
-  --input-sequence-length 2048 \
-  --output-sequence-length 256
-```
diff --git a/docs/backends/trtllm/README.md b/docs/backends/trtllm/README.md
index b47225f0967..6c0e241a885 100644
--- a/docs/backends/trtllm/README.md
+++ b/docs/backends/trtllm/README.md
@@ -55,10 +55,10 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 |---------|--------------|-------|
 | [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ | |
 | [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
-| [**KV-Aware Routing**](../../router/README.md) | ✅ | |
-| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | |
-| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | Planned |
-| [**KVBM**](../../../docs/kvbm/README.md) | ✅ | |
+| [**KV-Aware Routing**](../../components/router/README.md) | ✅ | |
+| [**SLA-Based Planner**](../../../docs/components/planner/planner_guide.md) | ✅ | |
+| [**Load Based Planner**](../../../docs/components/planner/README.md) | 🚧 | Planned |
+| [**KVBM**](../../../docs/components/kvbm/README.md) | ✅ | |

 ### Large Scale P/D and WideEP Features
@@ -114,7 +114,7 @@ apt-get update && apt-get -y install git git-lfs
 > [!IMPORTANT]
 > Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend ` to start up the ingress and using `python3 -m dynamo.trtllm ` to start up the workers. You can easily take each command and run them in separate terminals.

-For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../router/router_guide.md).
+For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../components/router/router_guide.md).

 ### Aggregated
 ```bash
@@ -231,7 +231,7 @@ To benchmark your deployment with AIPerf, see this utility script, configuring t
 ## Multimodal support

-Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../multimodal/trtllm.md).
+Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../features/multimodal/multimodal_trtllm.md).

 ## Logits Processing
@@ -327,7 +327,7 @@ For detailed instructions on running comprehensive performance sweeps across bot
 Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.

-Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-tensorrt-llm) .
+Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/components/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-tensorrt-llm) .

 ## Known Issues and Mitigations
diff --git a/docs/backends/vllm/README.md b/docs/backends/vllm/README.md
index 794e4183fc8..0ff990dd4f2 100644
--- a/docs/backends/vllm/README.md
+++ b/docs/backends/vllm/README.md
@@ -37,10 +37,10 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 |---------|------|-------|
 | [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ | |
 | [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
-| [**KV-Aware Routing**](../../router/README.md) | ✅ | |
-| [**SLA-Based Planner**](../../../docs/planner/sla_planner.md) | ✅ | |
-| [**Load Based Planner**](../../../docs/planner/load_planner.md) | 🚧 | WIP |
-| [**KVBM**](../../../docs/kvbm/README.md) | ✅ | |
+| [**KV-Aware Routing**](../../components/router/README.md) | ✅ | |
+| [**SLA-Based Planner**](../../../docs/components/planner/planner_guide.md) | ✅ | |
+| [**Load Based Planner**](../../../docs/components/planner/README.md) | 🚧 | WIP |
+| [**KVBM**](../../../docs/components/kvbm/README.md) | ✅ | |
 | [**LMCache**](../../integrations/lmcache_integration.md) | ✅ | |
 | [**Prompt Embeddings**](./prompt-embeddings.md) | ✅ | Requires `--enable-prompt-embeds` flag |
@@ -144,7 +144,7 @@ Below we provide a selected list of advanced deployments. Please open up an issu
 Run **Meta-Llama-3.1-8B-Instruct** with **Eagle3** as a draft model using **aggregated speculative decoding** on a single node. This setup demonstrates how to use Dynamo to create an instance using Eagle-based speculative decoding under the **VLLM aggregated serving framework** for faster inference while maintaining accuracy.

-**Guide:** [Speculative Decoding Quickstart](./speculative_decoding.md)
+**Guide:** [Speculative Decoding Quickstart](../../features/speculative_decoding/speculative_decoding_vllm.md)

 > **See also:** [Speculative Decoding Feature Overview](../../features/speculative_decoding/README.md) for cross-backend documentation.
diff --git a/docs/backends/vllm/speculative_decoding.md b/docs/backends/vllm/speculative_decoding.md
deleted file mode 100644
index 92ca08bc234..00000000000
--- a/docs/backends/vllm/speculative_decoding.md
+++ /dev/null
@@ -1,126 +0,0 @@
-
-
-> **Note**: This content has moved to [Speculative Decoding with vLLM](../../features/speculative_decoding/speculative_decoding_vllm.md).
-> See [Speculative Decoding Overview](../../features/speculative_decoding/README.md) for cross-backend documentation.
-> This file will be removed in a future release.
-
-# Running **Meta-Llama-3.1-8B-Instruct** with Speculative Decoding (Eagle3)
-
-This guide walks through how to deploy **Meta-Llama-3.1-8B-Instruct** using **aggregated speculative decoding** with **Eagle3** on a single node.
-Since the model is only **8B parameters**, you can run it on **any GPU with at least 16GB VRAM**.
-
-
-## Step 1: Set Up Your Docker Environment
-
-First, we’ll initialize a Docker container using the VLLM backend.
-You can refer to the [VLLM Quickstart Guide](./README.md#vllm-quick-start) — or follow the full steps below.
-
-### 1. Launch Docker Compose
-
-```bash
-docker compose -f deploy/docker-compose.yml up -d
-```
-
-### 2. Build the Container
-
-```bash
-./container/build.sh --framework VLLM
-```
-
-### 3. Run the Container
-
-```bash
-./container/run.sh -it --framework VLLM --mount-workspace
-```
-
-
-## Step 2: Get Access to the Llama-3 Model
-
-The **Meta-Llama-3.1-8B-Instruct** model is gated, so you’ll need to request access on Hugging Face.
-Go to the official [Meta-Llama-3.1-8B-Instruct repository](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and fill out the access form.
-Approval usually takes around **5 minutes**.
-
-Once you have access, generate a **Hugging Face access token** with permission for gated repositories, then set it inside your container:
-
-```bash
-export HUGGING_FACE_HUB_TOKEN="insert_your_token_here"
-export HF_TOKEN=$HUGGING_FACE_HUB_TOKEN
-```
-
-
-## Step 3: Run Aggregated Speculative Decoding
-
-Now that your environment is ready, start the aggregated server with **speculative decoding**.
-
-```bash
-# Requires only one GPU
-cd examples/backends/vllm
-bash launch/agg_spec_decoding.sh
-```
-
-Once the weights finish downloading and serving begins, you’ll be ready to send inference requests to your model.
-
-
-
-## Step 4: Example Request
-
-To verify your setup, try sending a simple prompt to your model:
-
-```bash
-curl http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
-    "messages": [
-      {"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
-    ],
-    "max_tokens": 250
-  }'
-```
-
-### Example Output
-
-```json
-{
-  "id": "cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8",
-  "choices": [
-    {
-      "text": "In cherry blossom’s gentle breeze ... A delicate balance of life and death, as petals fade, and new life breathes.",
-      "index": 0,
-      "finish_reason": "stop"
-    }
-  ],
-  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
-  "usage": {
-    "prompt_tokens": 16,
-    "completion_tokens": 250,
-    "total_tokens": 266
-  }
-}
-```
-
-
-## Additional Resources
-
-* [VLLM Quickstart](./README.md#vllm-quick-start)
-* [Meta-Llama-3.1-8B-Instruct on Hugging Face](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
\ No newline at end of file
diff --git a/docs/benchmarks/sla_driven_profiling.md b/docs/benchmarks/sla_driven_profiling.md
deleted file mode 100644
index 3f8a7485107..00000000000
--- a/docs/benchmarks/sla_driven_profiling.md
+++ /dev/null
@@ -1,639 +0,0 @@
-# SLA-Driven Profiling with DynamoGraphDeploymentRequest
-
-> [!TIP]
-> **New to DGDR and SLA-Driven Profiling?** Start with the [SLA-Driven Profiling and Planner Deployment Quick Start Guide](/docs/planner/sla_planner_quickstart.md) for step-by-step instructions. This document provides deeper technical details about the profiling process.
-
-> [!NOTE]
-> **See also**: [Profiler Component Overview](/docs/components/profiler/README.md) for a quick start guide and feature matrix.
-
-## Overview
-
-Dynamo provides automated SLA-driven profiling through **DynamoGraphDeploymentRequests (DGDR)**. Instead of manually running profiling scripts, you declare your performance requirements and let the Dynamo Operator handle profiling and deployment automatically.
-
-**Key Benefits:**
-- **Declarative**: Specify SLAs, not implementation details
-- **Automated**: No manual job setup or result processing
-- **Integrated**: Seamlessly works with Dynamo Operator
-- **Production-Ready**: Generates optimized configurations with SLA planner
-
-This document covers:
-- Technical details of online vs offline profiling
-- Profiling process internals (GPU usage, measurements, interpolation)
-- Direct script usage for advanced scenarios
-- Comprehensive troubleshooting
-
-## Support Matrix
-
-| Backend | Dense Models | MoE Models |
-|---------|-------------|------------|
-| vLLM | ✅ | 🚧 |
-| SGLang | ✅ | ✅ |
-| TensorRT-LLM | ✅ | 🚧 |
-
-Specifically, the profiler sweeps over the following parallelization mapping for prefill and decode:
-| Model Architecture | Prefill Parallelization Mapping | Decode Parallelization Mapping |
-|---------|-------------|------------|
-| MLA+MoE (DeepseekV3ForCausalLM, DeepseekV32ForCausalLM) | TEP, DEP | TEP, DEP |
-| GQA+MoE (Qwen3MoeForCausalLM) | TP, TEP, DEP | TP, TEP, DEP |
-| Other Models | TP | TP |
-
-> [!NOTE]
-> - Exact model x parallelization mapping support is dependent on the backend. The profiler does not guarantee that the recommended P/D engine configuration is supported and bug-free by the backend.
-
-## Using DGDR for Profiling (Recommended)
-
-The recommended way to profile models is through DGDRs. Sample configurations are provided in `deploy/`:
-
-**Available Samples:**
-- **`profile_sla_dgdr.yaml`**: Standard profiling with AIPerf on real engines
-- **`profile_sla_aic_dgdr.yaml`**: Fast profiling with AI Configurator simulation
-- **`profile_sla_moe_dgdr.yaml`**: MoE model profiling
-
-The Dynamo Operator automatically:
-1. Discovers GPU resources (cluster-scoped operators only)
-2. Runs profiling (AIPerf on real engines or AI Configurator simulation)
-3. Generates optimal DGD configuration with SLA planner
-4. Deploys the DGD to your cluster
-
-See the [Quick Start Guide](/docs/planner/sla_planner_quickstart.md) for prerequisites and detailed instructions.
-
-## Hardware Configuration
-
-Hardware parameters have sensible defaults and are **optional** - you can override them if needed:
-
-```yaml
-profilingConfig:
-  config:
-    # Override hardware defaults if needed
-    hardware:
-      minNumGpusPerEngine: 1
-      maxNumGpusPerEngine: 8
-      numGpusPerNode: 8
-
-    # Only needed when using AI Configurator (sweep.useAiConfigurator: true)
-    sweep:
-      aicSystem: h200_sxm # GPU type for AI Configurator (h100_sxm, h200_sxm, etc.)
-```
-
-### Automatic GPU Discovery (Optional Feature)
-
-Cluster-scoped operators can optionally enable automatic GPU discovery to detect hardware from cluster nodes. When enabled, hardware config is auto-detected and overrides any manually specified values.
-
-```yaml
-spec:
-  enableGpuDiscovery: true
-```
-
-This feature is only available with cluster-scoped operators (`namespaceRestriction.enabled=false`) as it requires cluster-wide node access permissions. It is not available for namespace-restricted operators.
-
-## Profiling Method
-
-1. **Hardware Setup**: Uses defaults or user-specified hardware configuration. Optionally, cluster-scoped operators can enable automatic GPU discovery to detect specifications from cluster nodes.
-2. **Identify Sweep Ranges**: Automatically determine minimum and maximum number of GPUs per engine. Minimum is determined by the model size and GPU VRAM. Maximum is set to one node for dense model and 4 nodes for MoE models.
-3. **Parallelization Mapping Sweep**: Use the input ISL and OSL, test the performance of the engines with different parallelization mappings.
-   - For dense models, we test different TP sizes for both prefill and decode.
- - For MoE models (SGLang), we evaluate both TEP and DEP as candidates for prefill and decode. - - **Prefill**: - - TP/TEP: We measure TTFT with batch size = 1 (assuming ISL is long enough to saturate compute) without KV reuse. - - DEP: Attention uses data parallelism. We send a single burst with total concurrency `attention_dp_size × attn_dp_num_req_ratio` (defaults to 4) and compute the reported TTFT as `time_to_first_token.max / attn_dp_num_req_ratio` from the AIPerf summary of that burst. This stabilizes measurements when the first batch may launch before all requests arrive. - ![Prefill Performance](../images/h100_prefill_performance.png) - - **Decode**: Since the ITL (or iteration time) is relevant with how many requests are in-flight, we measure the ITL under different number of in-flight requests. The range of the number of in-flight requests is from 1 to the maximum number of requests that the kv cache of the engine can hold. To measure the ITL without being affected by piggy-backed prefill requests, the script will enable kv-reuse and warm up the engine by issuing the same prompts before measuring the ITL. Since the kv cache is sufficient for all the requests, it can hold the kv cache of the pre-computed prompts and skip the prefill phase when measuring the ITL. However, for MoE models, this is not guaranteed because the kv cache in different attention DP ranks is different. We are working on framework-side change to fix this issue. For example, the below plot shows the decode parallelization mapping sweep results for H100 for deepseek-ai/DeepSeek-R1-Distill-Llama-8B. - ![Decode Performance](../images/h100_decode_performance.png) -4. **Recommendation**: Selects optimal parallelization mapping for prefill and decode that achieves the highest per GPU throughput while adhering the SLA on TTFT and ITL. 
Specifically, the profiler chooses the point (or, for decode, a point on the curve) that lies to the left of the vertical red dashed line representing the SLAs while having the highest y coordinate (throughput per GPU). -5. **In-Depth Profiling on the Recommended P/D Engine**: After finding the best TP size for prefill and decode, the script interpolates the TTFT against ISL, and the ITL against active KV cache usage and decode context length. This provides a more accurate estimate of performance as the ISL and OSL change, and is used by the SLA planner. -![ITL Interpolation](../images/pd_interpolation.png) - - **Prefill**: Measures TTFT and throughput per GPU across different input lengths with batch size=1. - - **Decode**: Measures ITL and throughput per GPU under various KV cache loads and decode context lengths. The active KV usage determines the complexity of the memory-bound attention kernel, while the active KV usage divided by the average context length determines the complexity of the compute-bound MLP kernel. For example, the figure below shows the ITL of the DS-Distilled Llama 8B model on H100 TP4. The ITL grows near-linearly with active KV usage under a fixed context length, and the slope increases as the context length decreases. - - -To run the parallelization mapping sweep and the in-depth profiling on the recommended P/D engine, the profiler needs to know the engine's forward pass time under different loads. There are two ways to obtain this: run AIPerf on real engines, or use AI Configurator to run simulations. - -### AIPerf on Real Engines - -Profiles your model by creating real test deployments in Kubernetes and measuring their performance.
- -**Characteristics:** -- **Duration**: 2-4 hours -- **Accuracy**: Highest (real measurements) -- **GPU Requirements**: Full access to test different parallelization mappings -- **Backends**: vLLM, SGLang, TensorRT-LLM - -**DGDR Configuration:** -```yaml -profilingConfig: - config: - sweep: - useAiConfigurator: false # Default -``` - -### AI Configurator Simulation - -Uses performance simulation to rapidly estimate optimal configurations without running real deployments. - -**Characteristics:** -- **Duration**: 20-30 seconds -- **Accuracy**: Estimated (may have errors for unusual configurations) -- **GPU Requirements**: None -- **Backends**: TensorRT-LLM only (vLLM/SGLang coming soon) - -**DGDR Configuration:** -```yaml -profilingConfig: - config: - sweep: - useAiConfigurator: true - aicSystem: h200_sxm # GPU system type - aicHfId: Qwen/Qwen3-32B # HuggingFace model ID - aicBackendVersion: "0.20.0" -``` - -**Supported Configurations:** - -For the current list of supported models, systems, and backend versions, see the [AI Configurator documentation](https://github.com/ai-dynamo/aiconfigurator#supported-features). - -To check from the command line: `aiconfigurator cli --help` - -**Currently supports:** -- **Backends**: TensorRT-LLM (versions 0.20.0, 1.0.0rc3, 1.0.0rc6) -- **Systems**: H100 SXM, H200 SXM, B200 SXM, GB200 SXM, A100 SXM -- **Models**: Wide range including GPT, Llama, Mixtral, DeepSeek, Qwen, and more - -### Output Format - -After profiling, the DGDR status contains: - -1. **Recommended Configuration**: Optimal TP for prefill and decode -2. **Performance Data**: Interpolation models for SLA planner -3. 
**Generated DGD**: Complete deployment manifest - -**Example Recommendations:** -``` -Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU) -Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU) -``` - -#### Interactive Configuration Selection WebUI - -When running the profiler with `--pick-with-webui`, an interactive web interface is launched that allows you to visually explore profiling results and manually select configurations. - -**Features:** -- **Interactive Charts**: Visualize prefill TTFT, decode ITL, and GPU hours analysis with hover-to-highlight synchronization between charts and tables -- **Pareto-Optimal Analysis**: The GPU Hours table shows pareto-optimal configurations balancing latency and throughput -- **DGD Config Preview**: Click "Show Config" on any row to view the corresponding DynamoGraphDeployment YAML -- **GPU Cost Estimation**: Toggle GPU cost display to convert GPU hours to cost ($/1000 requests) -- **SLA Visualization**: Red dashed lines indicate your TTFT and ITL targets - -**Selection Methods:** -1. **GPU Hours Table** (recommended): Click any row to select both prefill and decode configurations at once based on the pareto-optimal combination -2. 
**Individual Selection**: Click one row in the Prefill table AND one row in the Decode table to manually choose each - -**Example DGD Config Output:** - -When you click "Show Config", you'll see a DynamoGraphDeployment configuration like: - -```yaml -# DynamoGraphDeployment Configuration -# Prefill: 1 GPU(s), TP=1 -# Decode: 4 GPU(s), TP=4 -# Model: Qwen/Qwen3-32B-FP8 -# Backend: trtllm -apiVersion: nvidia.com/v1alpha1 -kind: DynamoGraphDeployment -spec: - services: - PrefillWorker: - subComponentType: prefill - replicas: 1 - extraPodSpec: - mainContainer: - args: - - --tensor-parallel-size=1 - DecodeWorker: - subComponentType: decode - replicas: 1 - extraPodSpec: - mainContainer: - args: - - --tensor-parallel-size=4 -``` - -**Usage:** -```bash -python -m benchmarks.profiler.profile_sla \ - --backend trtllm \ - --config path/to/disagg.yaml \ - --pick-with-webui \ - --use-ai-configurator \ - --model Qwen/Qwen3-32B-FP8 \ - --aic-system h200_sxm \ - --ttft 200 --itl 15 -``` - -Once you have selected a configuration, the full DynamoGraphDeployment CRD will be saved in your output folder as `config_with_planner.yaml`. - -The WebUI launches on port 8000 by default (configurable with `--webui-port`). - -#### Output Performance Plots - -The profiler will generate the following plots to better visualize the performance data: - -**Parallelization Mapping Sweep Plots:** -- `prefill_performance.png`: TTFT vs Parallelization Mapping size -- `decode_performance.png`: ITL vs Parallelization Mapping size and in-flight requests - -Note these two plots are based on the input ISL and OSL. 
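The selection rule visualized in these plots — among the SLA-feasible points, pick the one with the highest per-GPU throughput — can be sketched in a few lines. This is an illustration only: the candidate numbers below are made up, and the real profiler operates on the measured sweep data.

```python
# Illustrative sketch of the SLA-constrained selection rule.
# (Hypothetical candidate numbers; the profiler uses measured sweep results.)
candidates = [
    {"tp": 1, "ttft_ms": 310.0, "tokens_per_s_per_gpu": 21000.0},
    {"tp": 2, "ttft_ms": 160.0, "tokens_per_s_per_gpu": 18000.0},
    {"tp": 4, "ttft_ms": 48.4, "tokens_per_s_per_gpu": 15500.0},
]
ttft_sla_ms = 200.0  # the vertical red dashed line in the prefill plot

# Keep only points left of the SLA line, then maximize per-GPU throughput.
feasible = [c for c in candidates if c["ttft_ms"] <= ttft_sla_ms]
best = max(feasible, key=lambda c: c["tokens_per_s_per_gpu"])
print(best["tp"])  # TP=2: highest per-GPU throughput that still meets the SLA
```

TP=1 has the best throughput overall but misses the SLA, which is exactly why the rule filters before maximizing.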
- -**In-Depth Profiling for the Recommended P/D Engine Plots:** -- `selected_prefill_interpolation/prefill_ttft_interpolation.png`: TTFT vs ISL for the recommended prefill engine -- `selected_prefill_interpolation/prefill_throughput_interpolation.png`: Throughput vs ISL for the recommended prefill engine -- `selected_decode_interpolation/decode_itl_interplation.png`: ITL vs KV usage and context length for the recommended decode engine -- `selected_decode_interpolation/decode_throughput_interpolation.png`: Throughput vs KV usage and context length for the recommended decode engine - - -### Output Interpolation Data - -The profiler generates `.npz` files to store the performance data for the recommended P/D engine: - -**Prefill Interpolation** (`selected_prefill_interpolation/raw_data.npz`): -- `prefill_isl`: 1D array of input sequence lengths tested -- `prefill_ttft`: 1D array of TTFTs (ms) at each ISL -- `prefill_thpt_per_gpu`: 1D array of throughput (tokens/s/GPU) at each ISL - -**Decode Interpolation** (`selected_decode_interpolation/raw_data.npz`): -- `max_kv_tokens`: Total KV tokens capacity in decode engine -- `x_kv_usage`: 1D array of active KV usage percentages [0, 1] -- `y_context_length`: 1D array of average context lengths tested -- `z_itl`: 1D array of ITLs (ms) at each (KV usage, context length) point -- `z_thpt_per_gpu`: 1D array of throughput (tokens/s/GPU) at each point - -## DGDR Configuration Reference - -This section provides detailed explanations of all DGDR `profilingConfig` options. The DGDR controller passes this configuration to the profiler script, which is defined in `benchmarks/profiler/utils/profiler_argparse.py`. 
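The `.npz` interpolation files described earlier are plain NumPy archives and can be inspected directly. A small round-trip with synthetic values illustrates the prefill array layout (the real files are written by the profiler under the output directory; the numbers here are made up):

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for selected_prefill_interpolation/raw_data.npz
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "raw_data.npz")
np.savez(
    path,
    prefill_isl=np.array([1000, 2000, 4000]),               # ISLs tested
    prefill_ttft=np.array([25.0, 48.0, 95.0]),              # TTFT (ms) per ISL
    prefill_thpt_per_gpu=np.array([16000.0, 15500.0, 14800.0]),  # tokens/s/GPU
)

data = np.load(path)
print(sorted(data.files))
# ['prefill_isl', 'prefill_thpt_per_gpu', 'prefill_ttft']
```

The decode archive follows the same pattern with the `max_kv_tokens`, `x_kv_usage`, `y_context_length`, `z_itl`, and `z_thpt_per_gpu` keys.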
- -### Configuration Structure - -All profiler configuration goes under `spec.profilingConfig.config`: - -```yaml -apiVersion: nvidia.com/v1alpha1 -kind: DynamoGraphDeploymentRequest -metadata: - name: my-deployment -spec: - model: "Qwen/Qwen3-0.6B" # High-level: model to deploy - backend: vllm # High-level: inference backend - - profilingConfig: - profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" # Required - configMapRef: # Optional: base DGD config - name: my-config - key: disagg.yaml - - config: # Profiler configuration - sla: { ... } - hardware: { ... } - sweep: { ... } # AIC settings go here (aicSystem, aicHfId, etc.) - planner: { ... } - - deploymentOverrides: # Optional - workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" -``` - -### SLA Configuration (Required) - -Define your performance requirements and workload characteristics: - -```yaml -profilingConfig: - config: - sla: - isl: 3000 # Average input sequence length (tokens) - osl: 150 # Average output sequence length (tokens) - ttft: 200.0 # Target Time To First Token (milliseconds) - itl: 20.0 # Target Inter-Token Latency (milliseconds) -``` - -**What these control:** -- **ISL/OSL**: Based on your expected traffic patterns -- **TTFT**: First token latency target (lower = more GPUs needed, affects prefill engine) -- **ITL**: Token generation latency target (lower = more GPUs needed, affects decode engine) -- **Trade-offs**: Tighter SLAs require more GPU resources - -### Hardware Configuration (Optional) - -Control GPU search space and constraints: - -```yaml -profilingConfig: - config: - hardware: - minNumGpusPerEngine: 2 # if not provided, will automatically determine based on model and VRAM size - maxNumGpusPerEngine: 8 # Maximum GPUs to test - numGpusPerNode: 8 # GPUs per node (for multi-node MoE) - gpuType: h200_sxm # GPU type hint -``` - -**When to use:** -- **minNumGpusPerEngine**: Skip small TP sizes if your model is large -- **maxNumGpusPerEngine**: Limit search space or work 
around constraints (e.g., [AIC attention heads](#ai-configurator-attention-head-constraint-error)) -- **numGpusPerNode**: Sets the upper bound on GPUs per engine for dense models and configures Grove for multi-node MoE engines. -- **gpuType**: Informational; auto-detected by the controller - -> [!TIP] -> If you don't specify hardware constraints, the controller auto-detects based on your model size and available cluster resources. - -### Sweep Configuration (Optional) - -Control profiling behavior: - -```yaml -profilingConfig: - config: - sweep: - useAiConfigurator: false # Profile real engines with AIPerf (default: false) - prefillInterpolationGranularity: 16 # Samples for prefill TTFT curve - decodeInterpolationGranularity: 6 # Samples for decode ITL curve -``` - -**Use cases:** -- **useAiConfigurator**: Set to `true` for 20-30 second profiling (TensorRT-LLM only) -- **prefillInterpolationGranularity**: How many samples to benchmark for the prefill TTFT curve (lower = faster but may be less accurate) -- **decodeInterpolationGranularity**: How many samples to benchmark for the decode ITL curve (lower = faster but may be less accurate). Since the ITL interpolation is a 3D sweep and takes longer to run, we default to fewer samples. Increasing this value can grow the profiling time roughly quadratically.
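The roughly quadratic cost comes from the decode sweep sampling a 2-D grid over (KV usage, context length), so the number of ITL benchmark points scales with the granularity squared. A sketch with assumed ranges (the real profiler may prune infeasible combinations, so treat this as an upper-bound illustration):

```python
from itertools import product


def decode_sample_grid(granularity: int, max_context_length: int = 16384):
    """Sketch of the 2-D decode sweep: each (KV usage, context length)
    pair is one ITL benchmark point, so points grow as granularity**2."""
    kv_usages = [(i + 1) / granularity for i in range(granularity)]  # (0, 1]
    context_lengths = [
        int(max_context_length * (i + 1) / granularity) for i in range(granularity)
    ]
    return list(product(kv_usages, context_lengths))


print(len(decode_sample_grid(6)))   # 36 points at the default granularity
print(len(decode_sample_grid(12)))  # 144 points -- 4x the profiling work
```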
- -### AI Configurator Configuration (Required if `useAiConfigurator: true`) - -Configure AI Configurator profiling mode: - -```yaml -profilingConfig: - config: - sweep: - useAiConfigurator: true - aicSystem: h200_sxm # GPU system: h100_sxm, h200_sxm, b200_sxm, gb200_sxm, a100_sxm - aicHfId: Qwen/Qwen3-32B # Huggingface model id - aicBackendVersion: "0.20.0" # TensorRT-LLM version: 0.20.0, 1.0.0rc3 -``` - -**Supported configurations:** See [AI Configurator documentation](https://github.com/ai-dynamo/aiconfigurator#supported-features) - -### Planner Configuration (Optional) - -Pass arguments to the SLA planner: - -```yaml -profilingConfig: - config: - planner: - planner_min_endpoint: 2 # Minimum endpoints to maintain - planner_adjustment_interval: 60 # Adjustment interval (seconds) - planner_load_predictor: linear # Load prediction method -``` - -> [!NOTE] -> Planner arguments use `planner_` prefix. See planner documentation for full list. - -### Model Cache PVC (Advanced) - -For large models, you can use a pre-populated PVC containing model weights instead of downloading from HuggingFace. 
This is useful when: -- The model is not publicly available on HuggingFace -- You want to avoid repeated downloads during profiling -- You have a shared model cache across your cluster - -```yaml -profilingConfig: - config: - deployment: - modelCache: - pvcName: "model-cache" # Name of PVC containing model weights (required) - pvcPath: "hub/models--deepseek-ai--DeepSeek-R1" # Subpath within PVC (optional) - mountPath: "/opt/model-cache" # Mount path in container (optional, default: /opt/model-cache) -``` - -**Requirements:** -- The PVC must exist in the same namespace as the DGDR -- The model weights must be accessible at `{mountPath}/{pvcPath}` - -### Engine Configuration (Auto-configured) - -The controller automatically sets these from high-level fields: - -```yaml -# You specify: -spec: - model: "Qwen/Qwen3-0.6B" - backend: vllm - -# Controller auto-injects into config: -profilingConfig: - config: - deployment: - model: "Qwen/Qwen3-0.6B" # From spec.model - engine: - backend: vllm # From spec.backend - config: /path/to/configmap # From spec.profilingConfig.configMapRef (if provided) -``` - -**You should not manually set** `deployment.model` or `engine.backend` in `profilingConfig.config` - they are automatically injected from the high-level fields. 
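The auto-injection above can be pictured as a simple merge of the high-level spec fields into the profiler config. This is a hypothetical sketch of the controller's behavior, not its actual implementation; field names follow the DGDR layout shown in this guide:

```python
def inject_high_level_fields(spec: dict) -> dict:
    """Sketch: copy spec.model / spec.backend into profilingConfig.config,
    mirroring what the DGDR controller does automatically."""
    config = spec.get("profilingConfig", {}).get("config", {})
    deployment = config.setdefault("deployment", {})
    engine = config.setdefault("engine", {})
    deployment["model"] = spec["model"]    # injected from spec.model
    engine["backend"] = spec["backend"]    # injected from spec.backend
    return config


spec = {
    "model": "Qwen/Qwen3-0.6B",
    "backend": "vllm",
    "profilingConfig": {"config": {"sla": {"isl": 3000, "osl": 150}}},
}
merged = inject_high_level_fields(spec)
print(merged["deployment"]["model"], merged["engine"]["backend"])
# Qwen/Qwen3-0.6B vllm
```

Because the injection always wins, any `deployment.model` or `engine.backend` you set by hand would simply be overwritten — hence the guidance not to set them.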
- -### Complete Example: AIPerf on Real Engines - -```yaml -apiVersion: nvidia.com/v1alpha1 -kind: DynamoGraphDeploymentRequest -metadata: - name: vllm-dense-online -spec: - model: "Qwen/Qwen3-0.6B" - backend: vllm - - profilingConfig: - profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" - config: - sla: - isl: 3000 - osl: 150 - ttft: 200.0 - itl: 20.0 - - hardware: - minNumGpusPerEngine: 1 - maxNumGpusPerEngine: 8 - - sweep: - useAiConfigurator: false - - deploymentOverrides: - workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" - - autoApply: true -``` - -### Complete Example: AI Configurator Simulation - -```yaml -apiVersion: nvidia.com/v1alpha1 -kind: DynamoGraphDeploymentRequest -metadata: - name: trtllm-aic-offline -spec: - model: "Qwen/Qwen3-32B" - backend: trtllm - - profilingConfig: - profilerImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.1" - config: - sla: - isl: 4000 - osl: 500 - ttft: 300.0 - itl: 10.0 - - sweep: - useAiConfigurator: true - aicSystem: h200_sxm - aicHfId: Qwen/Qwen3-32B - aicBackendVersion: "0.20.0" - - deploymentOverrides: - workersImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.1" - - autoApply: true -``` - -### Complete Example: MoE Model - -```yaml -apiVersion: nvidia.com/v1alpha1 -kind: DynamoGraphDeploymentRequest -metadata: - name: sglang-moe -spec: - model: "deepseek-ai/DeepSeek-R1" - backend: sglang - - profilingConfig: - profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1" - config: - sla: - isl: 2048 - osl: 512 - ttft: 300.0 - itl: 25.0 - - hardware: - numGpusPerNode: 8 - maxNumGpusPerEngine: 32 - - engine: - isMoeModel: true # Enable MoE profiling mode - - deploymentOverrides: - workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1" - - autoApply: true -``` - -## Troubleshooting - -### Profiling Takes Too Long - -**Solution 1**: Use AI Configurator for rapid profiling (TensorRT-LLM only): -```yaml -sweep: - useAiConfigurator: true -``` - -**Solution 2**: Reduce search 
space: -```yaml -config: - hardware: - minNumGpusPerEngine: 4 # Skip TP1, TP2 - maxNumGpusPerEngine: 8 # Don't test beyond TP8 -``` - -### SLA Cannot Be Met - -**Symptoms**: Profiler reports no configuration meets targets - -**Solutions:** -1. Relax SLA targets (increase TTFT/ITL) -2. Add more GPU resources -3. Try a different backend -4. Use a smaller model - -### AI Configurator: Attention Head Constraint Error - -**Symptoms**: Profiling fails with error: -``` -AssertionError: num_heads should be divisible by tp_size and the division result should be >= 4 -``` - -**Cause**: AI Configurator requires **≥4 attention heads per GPU**. Small models with few heads cannot use high TP sizes. - -**Affected Models:** -- **Qwen3-0.6B** (16 heads): Max TP = 4 ❌ Fails at TP=8 -- **GPT-2** (12 heads): Max TP = 3 -- Most models **<1B parameters**: May hit this constraint - -**Solution**: Limit `maxNumGpusPerEngine` in your DGDR: - -```yaml -profilingConfig: - profilerImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.1" - config: - hardware: - maxNumGpusPerEngine: 4 # For Qwen3-0.6B (16 heads / 4 = max TP of 4) - sweep: - useAiConfigurator: true - aicSystem: h200_sxm - aicHfId: Qwen/Qwen3-0.6B -``` - -**Calculate Max TP**: `max_tp = num_attention_heads / 4` - -> **Note**: This is an AI Configurator limitation. Online profiling doesn't have this constraint. - -### Image Pull Errors - -**Symptoms**: `ErrImagePull` or `ImagePullBackOff` - -**Solution**: Ensure image pull secrets are configured: -```bash -kubectl create secret docker-registry nvcr-imagepullsecret \ - --docker-server=nvcr.io \ - --docker-username='$oauthtoken' \ - --docker-password= \ - --namespace -``` - -### Out of Memory During Profiling - -**Symptoms**: OOM errors in profiling jobs - -**Solutions:** -1. Reduce `gpu_memory_utilization` in engine config -2. Reduce `--max-context-length` -3. Skip larger TP configurations -4.
Use fewer GPUs per test - -### Unsupported Parallelization Mapping in Backend - -**Symptoms**: Startup or runtime error in the backend. For example, a prime number of attention heads restricts the TP size to 1 (e.g., Falcon-7B with 71 attention heads), or the backend may not support different TP sizes for prefill and decode. - -**Solutions:** -1. Ask the backend maintainers to add support for the use case, then bump the backend version in Dynamo. -2. Restrict the maximum and minimum number of GPUs per engine to the supported range. - -## Next Steps - -- **Deploy with DGDR**: See [Quick Start Guide](/docs/planner/sla_planner_quickstart.md) -- **Understand SLA Planner**: Read [SLA Planner Deep Dive](/docs/planner/sla_planner.md) -- **Monitor Deployments**: Set up [Observability](/docs/kubernetes/observability/metrics.md) -- **Optimize Performance**: See [Performance Tuning](/docs/performance/tuning.md) - -## Related Documentation - -- [DGDR API Reference](/docs/kubernetes/api_reference.md) -- [SLA Planner Quick Start](/docs/planner/sla_planner_quickstart.md) -- [SLA Planner Architecture](/docs/planner/sla_planner.md) -- [Profiler Arguments Reference](/benchmarks/profiler/utils/profiler_argparse.py) diff --git a/docs/components/frontend/README.md b/docs/components/frontend/README.md index 72213800e5f..2b5dd7861ad 100644 --- a/docs/components/frontend/README.md +++ b/docs/components/frontend/README.md @@ -78,4 +78,4 @@ See the [Frontend Guide](frontend_guide.md) for full configuration options.
| Document | Description | |----------|-------------| | [Frontend Guide](frontend_guide.md) | KServe gRPC configuration and integration | -| [Router Documentation](../../router/README.md) | KV-aware routing configuration | +| [Router Documentation](../router/README.md) | KV-aware routing configuration | diff --git a/docs/components/frontend/frontend_guide.md b/docs/components/frontend/frontend_guide.md index bdc79e730cb..51ecbf1d8d3 100644 --- a/docs/components/frontend/frontend_guide.md +++ b/docs/components/frontend/frontend_guide.md @@ -144,7 +144,7 @@ The frontend includes an integrated router for request distribution. Configure r python -m dynamo.frontend --router-mode kv --http-port 8000 ``` -See [Router Documentation](../../router/README.md) for routing configuration details. +See [Router Documentation](../router/README.md) for routing configuration details. ### With Backends @@ -159,4 +159,4 @@ Backends auto-register with the frontend when they call `register_llm()`. Suppor | Document | Description | |----------|-------------| | [Frontend Overview](README.md) | Quick start and feature matrix | -| [Router Documentation](../../router/README.md) | KV-aware routing configuration | +| [Router Documentation](../router/README.md) | KV-aware routing configuration | diff --git a/docs/kvbm/README.md b/docs/components/kvbm/README.md similarity index 88% rename from docs/kvbm/README.md rename to docs/components/kvbm/README.md index aafa186551e..ab6cb70a8f1 100644 --- a/docs/kvbm/README.md +++ b/docs/components/kvbm/README.md @@ -53,7 +53,7 @@ Offloading KV cache to CPU or storage is most effective when KV Cache exceeds GP ## Architecture -![KVBM Architecture](../images/kvbm-architecture.png) +![KVBM Architecture](../../images/kvbm-architecture.png) *High-level layered architecture view of Dynamo KV Block Manager and how it interfaces with different components of the LLM inference ecosystem* KVBM has three primary logical layers: @@ -64,13 +64,13 @@ KVBM has three 
primary logical layers: **NIXL Layer** — The bottom layer provides unified support for all data and storage transactions. NIXL enables P2P GPU transfers, RDMA and NVLink remote memory sharing, dynamic block registration and metadata exchange, and provides a plugin interface for storage backends including block memory (GPU HBM, Host DRAM, Remote DRAM, Local SSD), local/remote filesystems, object stores, and cloud storage. -> **Learn more:** See the [KVBM Design Document](kvbm_design.md) for detailed architecture, components, and data flows. +> **Learn more:** See the [KVBM Design Document](../../design_docs/kvbm_design.md) for detailed architecture, components, and data flows. ## Next Steps - **[KVBM Guide](kvbm_guide.md)** — Installation, configuration, and deployment instructions -- **[KVBM Design](kvbm_design.md)** — Architecture deep dive, components, and data flows -- **[LMCache Integration](../integrations/lmcache_integration.md)** — Use LMCache with Dynamo vLLM backend -- **[FlexKV Integration](../integrations/flexkv_integration.md)** — Use FlexKV for KV cache management -- **[SGLang HiCache](../integrations/sglang_hicache.md)** — Enable SGLang's hierarchical cache with NIXL +- **[KVBM Design](../../design_docs/kvbm_design.md)** — Architecture deep dive, components, and data flows +- **[LMCache Integration](../../integrations/lmcache_integration.md)** — Use LMCache with Dynamo vLLM backend +- **[FlexKV Integration](../../integrations/flexkv_integration.md)** — Use FlexKV for KV cache management +- **[SGLang HiCache](../../integrations/sglang_hicache.md)** — Enable SGLang's hierarchical cache with NIXL - **[NIXL Documentation](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)** — NIXL communication library details diff --git a/docs/kvbm/kvbm_guide.md b/docs/components/kvbm/kvbm_guide.md similarity index 95% rename from docs/kvbm/kvbm_guide.md rename to docs/components/kvbm/kvbm_guide.md index 21e8b5894bd..c923e94bca6 100644 --- 
a/docs/kvbm/kvbm_guide.md +++ b/docs/components/kvbm/kvbm_guide.md @@ -43,11 +43,11 @@ KVBM can be used independently without using the rest of the Dynamo stack: pip install kvbm ``` -See the [support matrix](../reference/support-matrix.md) for version compatibility. +See the [support matrix](../../reference/support-matrix.md) for version compatibility. ### Build from Source -To build KVBM from source, see the detailed instructions in the [KVBM bindings README](../../lib/bindings/kvbm/README.md#build-from-source). +To build KVBM from source, see the detailed instructions in the [KVBM bindings README](../../../lib/bindings/kvbm/README.md#build-from-source). ## Run KVBM in Dynamo with vLLM @@ -189,7 +189,7 @@ curl localhost:8000/v1/chat/completions \ }' ``` -> **Learn more:** See the [SGLang HiCache Integration Guide](../integrations/sglang_hicache.md) for detailed configuration, deployment examples, and troubleshooting. +> **Learn more:** See the [SGLang HiCache Integration Guide](../../integrations/sglang_hicache.md) for detailed configuration, deployment examples, and troubleshooting. ## Disaggregated Serving with KVBM @@ -369,7 +369,7 @@ trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --ex **Solution:** Enable KVBM metrics and check the Grafana dashboard for `Onboard Blocks - Host to Device` and `Onboard Blocks - Disk to Device`. 
Large numbers of onboarded KV blocks indicate good cache reuse: -![Grafana Example](../images/kvbm_metrics_grafana.png) +![Grafana Example](../../images/kvbm_metrics_grafana.png) ### KVBM Worker Initialization Timeout @@ -413,7 +413,7 @@ uv pip install --upgrade --force-reinstall --no-deps /workspace/dist/kvbm*.whl ## See Also - [KVBM Overview](README.md) -- [KVBM Design](kvbm_design.md) -- [LMCache Integration](../integrations/lmcache_integration.md) -- [FlexKV Integration](../integrations/flexkv_integration.md) -- [SGLang HiCache](../integrations/sglang_hicache.md) +- [KVBM Design](../../design_docs/kvbm_design.md) +- [LMCache Integration](../../integrations/lmcache_integration.md) +- [FlexKV Integration](../../integrations/flexkv_integration.md) +- [SGLang HiCache](../../integrations/sglang_hicache.md) diff --git a/docs/planner/README.md b/docs/components/planner/README.md similarity index 87% rename from docs/planner/README.md rename to docs/components/planner/README.md index dd51863d253..d4b27208d6d 100644 --- a/docs/planner/README.md +++ b/docs/components/planner/README.md @@ -19,7 +19,7 @@ limitations under the License. The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes. -> **New to the Planner?** Start with the [SLA Planner Quick Start Guide](sla_planner_quickstart.md) for a complete workflow including profiling and deployment. +> **New to the Planner?** Start with the [SLA Planner Quick Start Guide](planner_guide.md) for a complete workflow including profiling and deployment. 
## Feature Matrix @@ -47,7 +47,7 @@ The Planner monitors system performance and automatically scales prefill/decode - Dynamo platform installed on Kubernetes ([Installation Guide](/docs/kubernetes/installation_guide.md)) - kube-prometheus-stack installed ([Metrics Setup](/docs/kubernetes/observability/metrics.md)) -- Pre-deployment profiling completed ([Profiling Guide](/docs/benchmarks/sla_driven_profiling.md)) +- Pre-deployment profiling completed ([Profiling Guide](/docs/components/profiler/profiler_guide.md)) ### Deploy with DGDR (Recommended) @@ -57,7 +57,7 @@ The fastest path to a planner-enabled deployment is through a DynamoGraphDeploym kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE ``` -This automatically profiles your model and deploys with the SLA planner. See [SLA Planner Quick Start](sla_planner_quickstart.md) for the full workflow. +This automatically profiles your model and deploys with the SLA planner. See [SLA Planner Guide](planner_guide.md) for the full workflow. 
### Deploy with DGD (Manual) @@ -74,10 +74,10 @@ kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE |----------|-------------| | [Planner Guide](planner_guide.md) | Deployment, configuration, integration, troubleshooting | | [Planner Examples](planner_examples.md) | DGDR YAML examples, sample configurations, advanced patterns | -| [SLA Planner Quick Start](sla_planner_quickstart.md) | End-to-end DGDR workflow: define SLAs, profile, deploy, monitor | -| [SLA-based Planner](sla_planner.md) | Scaling algorithm, correction factors, load prediction details | -| [Load-based Planner](load_planner.md) | Legacy load-based scaling (deprecated) | -| [SLA-Driven Profiling](/docs/benchmarks/sla_driven_profiling.md) | Pre-deployment profiling process and configuration | +| [SLA Planner Guide](planner_guide.md) | End-to-end DGDR workflow: define SLAs, profile, deploy, monitor | +| [SLA-based Planner](planner_guide.md) | Scaling algorithm, correction factors, load prediction details | +| [Load-based Planner](README.md) | Legacy load-based scaling (deprecated) | +| [SLA-Driven Profiling](/docs/components/profiler/profiler_guide.md) | Pre-deployment profiling process and configuration | | [Planner Design](/docs/design_docs/planner_design.md) | Architecture deep-dive for contributors | ## Configuration Reference diff --git a/docs/planner/planner_examples.md b/docs/components/planner/planner_examples.md similarity index 95% rename from docs/planner/planner_examples.md rename to docs/components/planner/planner_examples.md index 60f72fd4e8b..1ce3d88876b 100644 --- a/docs/planner/planner_examples.md +++ b/docs/components/planner/planner_examples.md @@ -1,3 +1,9 @@ + + # Planner Examples Practical examples for deploying the SLA Planner with different configurations. For deployment concepts, see the [Planner Guide](planner_guide.md). For a quick overview, see the [Planner README](README.md). 
@@ -229,7 +235,7 @@ Profiling runs against the real backend (via GPUs or AIC). The mocker deployment For large models, use a pre-populated PVC instead of downloading from HuggingFace: -See [SLA-Driven Profiling](/docs/benchmarks/sla_driven_profiling.md) for configuration details. +See [SLA-Driven Profiling](/docs/components/profiler/profiler_guide.md) for configuration details. ## Advanced Examples @@ -374,5 +380,5 @@ kubectl delete pod pvc-access-pod -n $NAMESPACE - [Planner README](README.md) -- Overview and quick start - [Planner Guide](planner_guide.md) -- Deployment, configuration, integration - [Planner Design](/docs/design_docs/planner_design.md) -- Architecture deep-dive -- [DGDR Configuration Reference](/docs/benchmarks/sla_driven_profiling.md#dgdr-configuration-reference) -- [SLA-Driven Profiling](/docs/benchmarks/sla_driven_profiling.md) +- [DGDR Configuration Reference](/docs/components/profiler/profiler_guide.md#dgdr-configuration-reference) +- [SLA-Driven Profiling](/docs/components/profiler/profiler_guide.md) diff --git a/docs/planner/planner_guide.md b/docs/components/planner/planner_guide.md similarity index 96% rename from docs/planner/planner_guide.md rename to docs/components/planner/planner_guide.md index 5b9dc4082fd..eaee4294274 100644 --- a/docs/planner/planner_guide.md +++ b/docs/components/planner/planner_guide.md @@ -1,3 +1,9 @@ + + # Planner Guide Deployment, configuration, and integration guide for the Dynamo SLA Planner. For a quick overview, see the [Planner README](README.md). For architecture internals, see [Planner Design](/docs/design_docs/planner_design.md). @@ -162,7 +168,7 @@ sla: - **ITL**: Token generation latency target (lower = more GPUs needed) - **Trade-offs**: Tighter SLAs require more GPU resources -For comprehensive documentation of all configuration options, see the [DGDR Configuration Reference](/docs/benchmarks/sla_driven_profiling.md#dgdr-configuration-reference). 
+For comprehensive documentation of all configuration options, see the [DGDR Configuration Reference](/docs/components/profiler/profiler_guide.md#dgdr-configuration-reference). ### Profiling Methods @@ -181,7 +187,7 @@ sweep: aicBackendVersion: "0.20.0" ``` -For detailed comparison, supported configurations, and limitations, see [SLA-Driven Profiling Documentation](/docs/benchmarks/sla_driven_profiling.md#profiling-methods). +For detailed comparison, supported configurations, and limitations, see [SLA-Driven Profiling Documentation](/docs/components/profiler/profiler_guide.md#profiling-methods). ### Load Predictors @@ -440,7 +446,7 @@ kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE | **DGD not deployed** | Verify `autoApply: true` in DGDR spec | | **Prometheus errors** | Ensure `PROMETHEUS_ENDPOINT` env var points to your Prometheus service | -For comprehensive troubleshooting including AI Configurator constraints, performance debugging, and backend-specific issues, see [SLA-Driven Profiling Troubleshooting](/docs/benchmarks/sla_driven_profiling.md#troubleshooting). +For comprehensive troubleshooting including AI Configurator constraints, performance debugging, and backend-specific issues, see [SLA-Driven Profiling Troubleshooting](/docs/components/profiler/profiler_guide.md#troubleshooting). 
## Related Documentation
@@ -448,5 +454,5 @@ For comprehensive troubleshooting including AI Configurator constraints, perform
 - [Planner Examples](planner_examples.md) -- DGDR YAML examples and sample configurations
 - [Planner Design](/docs/design_docs/planner_design.md) -- Architecture deep-dive for contributors
 - [DGDR API Reference](/docs/kubernetes/api_reference.md)
-- [Pre-Deployment Profiling](/docs/benchmarks/sla_driven_profiling.md)
+- [Pre-Deployment Profiling](/docs/components/profiler/profiler_guide.md)
 - [Dynamo Operator Guide](/docs/kubernetes/dynamo_operator.md)
diff --git a/docs/components/profiler/README.md b/docs/components/profiler/README.md
index 604baeabdaa..644229e2287 100644
--- a/docs/components/profiler/README.md
+++ b/docs/components/profiler/README.md
@@ -124,8 +124,8 @@ Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
 |----------|-------------|
 | [Profiler Guide](profiler_guide.md) | Configuration, methods, and troubleshooting |
 | [Profiler Examples](profiler_examples.md) | Complete DGDR YAMLs, WebUI, script examples |
-| [SLA Planner Quick Start](/docs/planner/sla_planner_quickstart.md) | End-to-end deployment workflow |
-| [SLA Planner Architecture](/docs/planner/sla_planner.md) | How the Planner uses profiling data |
+| [SLA Planner Guide](/docs/components/planner/planner_guide.md) | End-to-end deployment workflow |
+| [SLA Planner Architecture](/docs/design_docs/planner_design.md) | How the Planner uses profiling data |
 ```{toctree}
 :hidden:
diff --git a/docs/components/profiler/profiler_guide.md b/docs/components/profiler/profiler_guide.md
index b3c1c2c66cf..d396ce71769 100644
--- a/docs/components/profiler/profiler_guide.md
+++ b/docs/components/profiler/profiler_guide.md
@@ -336,7 +336,7 @@ planner:
 ```
 > [!NOTE]
-> Planner arguments use `planner_` prefix. See [SLA Planner documentation](/docs/planner/sla_planner.md) for full list.
+> Planner arguments use the `planner_` prefix.
See [SLA Planner documentation](/docs/components/planner/planner_guide.md) for the full list.
 ### Model Cache PVC (Advanced)
@@ -641,7 +641,7 @@ kubectl create secret docker-registry nvcr-imagepullsecret \
 ## See Also
 - [Profiler Examples](profiler_examples.md) - Complete DGDR YAML examples
-- [SLA Planner Quick Start](/docs/planner/sla_planner_quickstart.md) - End-to-end deployment workflow
-- [SLA Planner Architecture](/docs/planner/sla_planner.md) - How the Planner uses profiling data
+- [SLA Planner Guide](/docs/components/planner/planner_guide.md) - End-to-end deployment workflow
+- [SLA Planner Architecture](/docs/design_docs/planner_design.md) - How the Planner uses profiling data
 - [DGDR API Reference](/docs/kubernetes/api_reference.md) - DGDR specification
 - [Profiler Arguments Reference](/benchmarks/profiler/utils/profiler_argparse.py) - Full CLI reference
diff --git a/docs/router/README.md b/docs/components/router/README.md
similarity index 94%
rename from docs/router/README.md
rename to docs/components/router/README.md
index d12b4db6746..504f4b5347a 100644
--- a/docs/router/README.md
+++ b/docs/components/router/README.md
@@ -75,7 +75,7 @@ All CLI arguments can be configured via environment variables using the `DYN_` p
 For complete K8s examples and advanced configuration, see [K8s Examples](router_examples.md#k8s-examples).
-For A/B testing and advanced K8s setup, see the [KV Router A/B Benchmarking Guide](../benchmarks/kv-router-ab-testing.md).
+For A/B testing and advanced K8s setup, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
 For more configuration options and tuning guidelines, see the [Router Guide](router_guide.md).
@@ -83,7 +83,7 @@ For more configuration options and tuning guidelines, see the [Router Guide](rou
 **Requirements:**
 - **Dynamic endpoints only**: KV router requires `register_llm()` with `model_input=ModelInput.Tokens`.
Your backend handler receives pre-tokenized requests with `token_ids` instead of raw text. -- Backend workers must call `register_llm()` with `model_input=ModelInput.Tokens` (see [Backend Guide](../development/backend-guide.md)) +- Backend workers must call `register_llm()` with `model_input=ModelInput.Tokens` (see [Backend Guide](../../development/backend-guide.md)) - You cannot use `--static-endpoint` mode with KV routing (use dynamic discovery instead) **Multimodal Support:** @@ -100,4 +100,4 @@ For basic model registration without KV routing, use `--router-mode round-robin` - **[Router Guide](router_guide.md)**: Deep dive into KV cache routing, configuration, disaggregated serving, and tuning - **[Router Examples](router_examples.md)**: Python API usage, K8s examples, and custom routing patterns -- **[Router Design](../design_docs/router_design.md)**: Architecture details, algorithms, and event transport modes +- **[Router Design](../../design_docs/router_design.md)**: Architecture details, algorithms, and event transport modes diff --git a/docs/router/router_examples.md b/docs/components/router/router_examples.md similarity index 98% rename from docs/router/router_examples.md rename to docs/components/router/router_examples.md index 38ae414c091..9439d45ba3b 100644 --- a/docs/router/router_examples.md +++ b/docs/components/router/router_examples.md @@ -113,7 +113,7 @@ For basic Kubernetes deployment with the KV Router, see the [Kubernetes Deployme - [Distributed inference tutorial](../../examples/basics/kubernetes/Distributed_Inference/agg_router.yaml) **For A/B Testing and Advanced K8s Setup:** -See the comprehensive [KV Router A/B Benchmarking Guide](../benchmarks/kv-router-ab-testing.md) for step-by-step instructions on deploying, configuring, and benchmarking the KV router in Kubernetes. 
+See the comprehensive [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md) for step-by-step instructions on deploying, configuring, and benchmarking the KV router in Kubernetes. ### Example with Advanced Configuration @@ -270,7 +270,7 @@ This approach gives you complete control over routing decisions, allowing you to - **Maximize cache reuse**: Use `best_worker()` which considers both prefill and decode loads - **Balance load**: Consider both `potential_prefill_tokens` and `potential_decode_blocks` together -See [Router Design](../design_docs/router_design.md) for architecture details and the cost function algorithm. +See [Router Design](../../design_docs/router_design.md) for architecture details and the cost function algorithm. ## KV Event Publishing for Custom Engines @@ -547,4 +547,4 @@ Each event in the payload is a dictionary with `type` field (`BlockStored`, `Blo - **[Router README](README.md)**: Quick start guide for the KV Router - **[Router Guide](router_guide.md)**: Configuration, tuning, and production setup -- **[Router Design](../design_docs/router_design.md)**: Architecture details and event transport modes +- **[Router Design](../../design_docs/router_design.md)**: Architecture details and event transport modes diff --git a/docs/router/router_guide.md b/docs/components/router/router_guide.md similarity index 97% rename from docs/router/router_guide.md rename to docs/components/router/router_guide.md index c8604ce1881..ed95901f53b 100644 --- a/docs/router/router_guide.md +++ b/docs/components/router/router_guide.md @@ -115,7 +115,7 @@ The main KV-aware routing arguments: > > The cli args `--router-ttl`, `--router-max-tree-size`, and `--router-prune-target-ratio` control local cache management when the router operates without receiving events from workers. When KV events are enabled (default), the router relies on worker-side eviction events and these parameters are ignored. 
-To implement KV event publishing for custom inference engines, enabling them to participate in Dynamo's KV cache-aware routing, see [KV Event Publishing for Custom Engines](../integrations/kv_events_custom_engines.md). +To implement KV event publishing for custom inference engines, enabling them to participate in Dynamo's KV cache-aware routing, see [KV Event Publishing for Custom Engines](../../integrations/kv_events_custom_engines.md). ## Basic Routing @@ -135,7 +135,7 @@ We can then use the default routing methods exposed by the client class to send KV Cache routing uses direct routing with a special worker selection algorithm. -For benchmarking KV router performance, see the [KV Router A/B Benchmarking Guide](../benchmarks/kv-router-ab-testing.md). +For benchmarking KV router performance, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md). For custom routing logic and advanced patterns, see [Routing Patterns](router_examples.md#routing-patterns) in the examples documentation. @@ -177,7 +177,7 @@ The `router_temperature` parameter controls routing randomness: ## Disaggregated Serving -Dynamo supports disaggregated serving where prefill (prompt processing) and decode (token generation) are handled by separate worker pools. When you register workers with `ModelType.Prefill` (see [Backend Guide](../development/backend-guide.md)), the frontend automatically detects them and activates an internal prefill router. +Dynamo supports disaggregated serving where prefill (prompt processing) and decode (token generation) are handled by separate worker pools. When you register workers with `ModelType.Prefill` (see [Backend Guide](../../development/backend-guide.md)), the frontend automatically detects them and activates an internal prefill router. 
### Automatic Prefill Router Activation @@ -260,7 +260,7 @@ For improved fault tolerance, you can launch multiple frontend + router replicas ### Router State Management -The KV Router tracks two types of state (see [Router Design](../design_docs/router_design.md) for details): +The KV Router tracks two types of state (see [Router Design](../../design_docs/router_design.md) for details): 1. **Prefix blocks (cached KV blocks)**: Maintained in a radix tree, tracking which blocks are cached on each worker. This state is **persistent** - backed by NATS JetStream events and object store snapshots. New router replicas automatically sync this state on startup, ensuring consistent cache awareness across restarts. @@ -346,5 +346,5 @@ curl http://localhost:8000/busy_threshold - **[Router README](README.md)**: Quick start guide for the KV Router - **[Router Examples](router_examples.md)**: Python API usage, K8s examples, and custom routing patterns -- **[Router Design](../design_docs/router_design.md)**: Architecture details and event transport modes -- **[KV Event Publishing for Custom Engines](../integrations/kv_events_custom_engines.md)**: Integrate custom inference engines with KV-aware routing +- **[Router Design](../../design_docs/router_design.md)**: Architecture details and event transport modes +- **[KV Event Publishing for Custom Engines](../../integrations/kv_events_custom_engines.md)**: Integrate custom inference engines with KV-aware routing diff --git a/docs/conf.py b/docs/conf.py index 7b2db2ad4c1..8b8f90c673b 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -53,7 +53,7 @@ "kubernetes/multinode-deployment": "../kubernetes/deployment/multinode-deployment.html", "kubernetes/logging": "../kubernetes/observability/logging.html", "kubernetes/metrics": "../kubernetes/observability/metrics.html", - "architecture/kv_cache_routing": "../router/kv_cache_routing.html", + "architecture/kv_cache_routing": "../components/router/router_guide.html", # PR #3658 
"API/nixl_connect/README": "../../api/nixl_connect/README.html", "API/nixl_connect/connector": "../../api/nixl_connect/connector.html", @@ -69,34 +69,33 @@ "guides/backend": "../development/backend-guide.html", "runtime/README": "../development/runtime-guide.html", "guides/tool_calling": "../agents/tool-calling.html", - "architecture/kvbm_architecture": "../kvbm/kvbm_architecture.html", - "architecture/kvbm_components": "../kvbm/kvbm_components.html", - "architecture/kvbm_intro": "../kvbm/kvbm_intro.html", - "architecture/kvbm_motivation": "../kvbm/kvbm_motivation.html", - "architecture/kvbm_reading": "../kvbm/kvbm_reading.html", - "guides/run_kvbm_in_trtllm": "../kvbm/trtllm-setup.html", - "guides/run_kvbm_in_vllm": "../kvbm/vllm-setup.html", + "architecture/kvbm_architecture": "../design_docs/kvbm_design.html", + "architecture/kvbm_components": "../design_docs/kvbm_design.html", + "architecture/kvbm_intro": "../components/kvbm/README.html", + "architecture/kvbm_motivation": "../design_docs/kvbm_design.html", + "architecture/kvbm_reading": "../design_docs/kvbm_design.html", + "guides/run_kvbm_in_trtllm": "../components/kvbm/kvbm_guide.html", + "guides/run_kvbm_in_vllm": "../components/kvbm/kvbm_guide.html", "guides/health_check": "../observability/health-checks.html", "guides/logging": "../observability/logging.html", "guides/metrics": "../observability/metrics.html", "guides/disagg_perf_tuning": "../performance/tuning.html", - "architecture/load_planner": "../planner/load_planner.html", - "architecture/planner_intro": "../planner/planner_intro.html", - "architecture/sla_planner": "../planner/sla_planner.html", - "kubernetes/sla_planner_quickstart": "../planner/sla_planner_quickstart.html", + "architecture/load_planner": "../components/planner/README.html", + "architecture/planner_intro": "../components/planner/README.html", + "architecture/sla_planner": "../components/planner/planner_guide.html", + "kubernetes/sla_planner_quickstart": 
"../components/planner/planner_guide.html", "guides/dynamo_run": "../reference/cli.html", "dynamo_glossary": "../reference/glossary.html", "support_matrix": "../reference/support-matrix.html", - "components/router/README": "../router/README.html", - # Multimodal documentation consolidation - "backends/vllm/multimodal": "../../multimodal/vllm.html", - "backends/vllm/multimodal_vllm_guide": "../../multimodal/vllm.html", - "backends/trtllm/multimodal_support": "../../multimodal/trtllm.html", - "backends/trtllm/multimodal_trtllm_guide": "../../multimodal/trtllm.html", - "backends/trtllm/multinode/multinode-multimodal-example": "../../../multimodal/trtllm.html", - "backends/sglang/multimodal_epd": "../../multimodal/sglang.html", - "backends/sglang/multimodal_sglang_guide": "../../multimodal/sglang.html", - "multimodal/multimodal_intro": "index.html", + # Multimodal documentation consolidation (all redirect to features/multimodal/) + "backends/vllm/multimodal": "../../features/multimodal/multimodal_vllm.html", + "backends/vllm/multimodal_vllm_guide": "../../features/multimodal/multimodal_vllm.html", + "backends/trtllm/multimodal_support": "../../features/multimodal/multimodal_trtllm.html", + "backends/trtllm/multimodal_trtllm_guide": "../../features/multimodal/multimodal_trtllm.html", + "backends/trtllm/multinode/multinode-multimodal-example": "../../../features/multimodal/multimodal_trtllm.html", + "backends/sglang/multimodal_epd": "../../features/multimodal/multimodal_sglang.html", + "backends/sglang/multimodal_sglang_guide": "../../features/multimodal/multimodal_sglang.html", + "multimodal/multimodal_intro": "../features/multimodal/README.html", # Speculative decoding consolidation "backends/vllm/speculative_decoding": "../../features/speculative_decoding/speculative_decoding_vllm.html", # Multimodal migration to features/multimodal/ @@ -104,6 +103,23 @@ "multimodal/vllm": "../features/multimodal/multimodal_vllm.html", "multimodal/sglang": 
"../features/multimodal/multimodal_sglang.html", "multimodal/trtllm": "../features/multimodal/multimodal_trtllm.html", + # Component consolidation into docs/components/ + "router/README": "../components/router/README.html", + "router/kv_cache_routing": "../components/router/router_guide.html", + "router/kv_events": "../integrations/kv_events_custom_engines.html", + "planner/planner_intro": "../components/planner/README.html", + "planner/README": "../components/planner/README.html", + "planner/planner_guide": "../components/planner/planner_guide.html", + "planner/planner_examples": "../components/planner/planner_examples.html", + "planner/sla_planner_quickstart": "../components/planner/planner_guide.html", + "planner/sla_planner": "../components/planner/planner_guide.html", + "planner/load_planner": "../components/planner/README.html", + "kvbm/kvbm_intro": "../components/kvbm/README.html", + "kvbm/README": "../components/kvbm/README.html", + "kvbm/kvbm_guide": "../components/kvbm/kvbm_guide.html", + "kvbm/kvbm_design": "../design_docs/kvbm_design.html", + # Profiler consolidation + "benchmarks/sla_driven_profiling": "../components/profiler/profiler_guide.html", } # Custom extensions diff --git a/docs/design_docs/architecture.md b/docs/design_docs/architecture.md index e4ec91bd4fb..17675dcdfab 100644 --- a/docs/design_docs/architecture.md +++ b/docs/design_docs/architecture.md @@ -53,7 +53,7 @@ To address the growing demands of distributed inference serving, NVIDIA introduc The following diagram outlines Dynamo's high-level architecture. 
To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features:
 - [Dynamo Disaggregated Serving](disagg_serving.md)
-- [Dynamo Smart Router](../router/README.md)
-- [Dynamo KV Cache Block Manager](../kvbm/kvbm_intro.rst)
-- [Planner](../planner/planner_intro.rst)
+- [Dynamo Smart Router](../components/router/README.md)
+- [Dynamo KV Cache Block Manager](../components/kvbm/README.md)
+- [Planner](../components/planner/README.md)
 - [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
diff --git a/docs/kvbm/kvbm_design.md b/docs/design_docs/kvbm_design.md
similarity index 99%
rename from docs/kvbm/kvbm_design.md
rename to docs/design_docs/kvbm_design.md
index 2af39c4b39e..e531f3379b6 100644
--- a/docs/kvbm/kvbm_design.md
+++ b/docs/design_docs/kvbm_design.md
@@ -361,6 +361,6 @@ There are two components of the interface:
 ## See Also
-- [KVBM Overview](README.md)
-- [KVBM Guide](kvbm_guide.md)
+- [KVBM Overview](../components/kvbm/README.md)
+- [KVBM Guide](../components/kvbm/kvbm_guide.md)
 - [NIXL Documentation](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
diff --git a/docs/design_docs/planner_design.md b/docs/design_docs/planner_design.md
index c851cf8d299..1e6205bd518 100644
--- a/docs/design_docs/planner_design.md
+++ b/docs/design_docs/planner_design.md
@@ -1,6 +1,12 @@
+
+
 # Planner Design
-> **Tier 3 design documentation** for contributors and architects. For user-facing docs, see [docs/planner/](/docs/planner/).
+> **Tier 3 design documentation** for contributors and architects. For user-facing docs, see [docs/components/planner/](/docs/components/planner/).
## Overview diff --git a/docs/design_docs/router_design.md b/docs/design_docs/router_design.md index a7fea649570..8b3eb3a63bc 100644 --- a/docs/design_docs/router_design.md +++ b/docs/design_docs/router_design.md @@ -304,7 +304,7 @@ This dual-layer approach—persistent global KV cache state via JetStream and ep ## See Also -- **[Router README](../router/README.md)**: Quick start guide for the KV Router -- **[Router Guide](../router/router_guide.md)**: Configuration, tuning, and production setup -- **[Router Examples](../router/router_examples.md)**: Python API usage and custom routing patterns +- **[Router README](../components/router/README.md)**: Quick start guide for the KV Router +- **[Router Guide](../components/router/router_guide.md)**: Configuration, tuning, and production setup +- **[Router Examples](../components/router/router_examples.md)**: Python API usage and custom routing patterns - **[KV Event Publishing for Custom Engines](../integrations/kv_events_custom_engines.md)**: Integrate custom inference engines with KV-aware routing diff --git a/docs/features/lora/README.md b/docs/features/lora/README.md index de22435c29a..ac3aad47a05 100644 --- a/docs/features/lora/README.md +++ b/docs/features/lora/README.md @@ -311,4 +311,4 @@ kubectl logs deployment/my-worker | grep -i lora - [Feature Matrix](../../reference/feature-matrix.md) - Backend compatibility overview - [vLLM Backend](../../backends/vllm/README.md) - vLLM-specific configuration - [Dynamo Operator](../../kubernetes/dynamo_operator.md) - Kubernetes operator overview -- [KV-Aware Routing](../../router/router_guide.md) - LoRA-aware request routing +- [KV-Aware Routing](../../components/router/router_guide.md) - LoRA-aware request routing diff --git a/docs/frontends/kserve.md b/docs/frontends/kserve.md deleted file mode 100644 index e62f821ce5c..00000000000 --- a/docs/frontends/kserve.md +++ /dev/null @@ -1,124 +0,0 @@ -# KServe gRPC frontend - -> **Note**: This content has moved to [Frontend 
Guide](../components/frontend/frontend_guide.md). -> This file will be removed in a future release. - -## Motivation - -[KServe v2 API](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2) is one of the industry standard protocol for machine learning model inference. Triton inference server is one of the inference solutions that comply with KServe v2 API and it has gained a lot of adoption. To quickly enable Triton users to explore with Dynamo benefits, Dynamo provides a KServe gRPC frontend. - -This documentation assumes readers are familiar with the usage of KServe v2 API and focuses on explaining the Dynamo parts that work together to support KServe API and how users may migrate existing KServe deployment to Dynamo. - -## Supported Endpoints - -* `ModelInfer` endpoint: KServe Standard endpoint as described [here](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/required_api.md#inference-1) -* `ModelStreamInfer` endpoint: Triton extension endpoint that provide bi-directional streaming version of the inference RPC to allow a sequence of inference requests/responses to be sent over a GRPC stream, as described [here](https://github.com/triton-inference-server/common/blob/main/protobuf/grpc_service.proto#L84-L92) -* `ModelMetadata` endpoint: KServe standard endpoint as described [here](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/required_api.md#model-metadata-1) -* `ModelConfig` endpoint: Triton extension endpoint as described [here](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_model_configuration.md) - -## Starting the Frontend - -To start the KServe frontend, run the below command -``` -python -m dynamo.frontend --kserve-grpc-server -``` - -## gRPC Performance Tuning - -The gRPC server supports optional HTTP/2 flow control tuning via environment variables. 
These can be set before starting the server to optimize for high-throughput streaming workloads. - -| Environment Variable | Description | Default | -|---------------------|-------------|---------| -| `DYN_GRPC_INITIAL_CONNECTION_WINDOW_SIZE` | HTTP/2 connection-level flow control window size in bytes | tonic default (64KB) | -| `DYN_GRPC_INITIAL_STREAM_WINDOW_SIZE` | HTTP/2 per-stream flow control window size in bytes | tonic default (64KB) | - -### Example: High-ISL/OSL configuration for streaming workloads - -```bash -# For 128 concurrent 15k-token requests -export DYN_GRPC_INITIAL_CONNECTION_WINDOW_SIZE=16777216 # 16MB -export DYN_GRPC_INITIAL_STREAM_WINDOW_SIZE=1048576 # 1MB -python -m dynamo.frontend --kserve-grpc-server -``` - -If these variables are not set, the server uses tonic's default values. - -> **Note**: Tune these values based on your workload. Connection window should accommodate `concurrent_requests × request_size`. Memory overhead equals the connection window size (shared across all streams). See [gRPC performance best practices](https://grpc.io/docs/guides/performance/) and [gRPC channel arguments](https://grpc.github.io/grpc/core/group__grpc__arg__keys.html) for more details. - -## Registering a Backend - -Similar to HTTP frontend, the registered backend will be auto-discovered and added to the frontend list of serving model. To register a backend, the same `register_llm()` API will be used. Currently the frontend support serving of the following model type and model input combination: -* `ModelType::Completions` and `ModelInput::Text`: Combination for LLM backend that uses custom preprocessor -* `ModelType::Completions` and `ModelInput::Token`: Combination for LLM backend that uses Dynamo preprocessor (i.e. 
Dynamo vLLM / SGLang / TRTLLM backend) -* `ModelType::TensorBased` and `ModelInput::Tensor`: Combination for backend that is used for generic tensor based inference - -The first two combinations are backed by OpenAI Completions API, see [OpenAI Completions section](#openai-completions) for more detail. Whereas the last combination is most aligned with KServe API and the users can replace existing deployment with Dynamo once their backends implements adaptor for `NvCreateTensorRequest/NvCreateTensorResponse`, see [Tensor section](#tensor) for more detail: - -### OpenAI Completions - -Most of the Dynamo features are tailored for LLM inference and the combinations that are backed by OpenAI API can enable those features and are best suited for exploring those Dynamo features. However, this implies specific conversion between generic tensor based messages and OpenAI message and imposes specific structure of the KServe request message. - -#### Model Metadata / Config - -The metadata and config endpoint will report the registered backend to have the below, note that this is not the exact response. 
-``` -{ - name: $MODEL_NAME, - version: 1, - platform: "dynamo", - backend: "dynamo", # model config specific - inputs: [ - { - name: "text_input", - datatype: "BYTES", - shape: [1] - }, - { - name: "streaming", - datatype: "BOOL", - shape: [1], - optional: true - } - ] - outputs: [ - { - name: "text_output", - datatype: "BYTES", - shape: [-1] - }, - { - name: "finish_reason", - datatype: "BYTES", - shape: [-1], - optional: true - } - ] -} -``` - -#### Inference - -On receiving inference request, the following conversion will be performed: -* `text_input`: the element is expected to contain the user prompt string and will be converted to `prompt` field in OpenAI Completion request -* `streaming`: the element will be converted to `stream` field in OpenAI Completion request -On receiving model response, the following conversion will be performed: -* `text_output`: each element corresponds to one choice in OpenAI Completion response, and the content will be set to `text` of the choice. -* `finish_reason`: each element corresponds to one choice in OpenAI Completion response, and the content will be set to `finish_reason` of the choice. - -### Tensor - -This combination is used when the user is migrating an existing KServe based backend into Dynamo ecosystem. - -#### Model Metadata / Config - -When registering the backend, the backend must provide the model's metadata as tensor based deployment is generic and the frontend can't make any assumptions like for OpenAI Completions model. There are two methods to provide model metadata: -* [TensorModelConfig](../../lib/llm/src/protocols/tensor.rs): This is Dynamo defined structure for model metadata, the backend can provide the model metadata as shown in this [example](../../lib/bindings/python/tests/test_tensor.py). For metadata provided in such way, the following field will be set to a fixed value: `version: 1`, `platform: "dynamo"`, `backend: "dynamo"`. 
Note that for model config endpoint, the rest of the fields will be set to their default values. -* [triton_model_config](../../lib/llm/src/protocols/tensor.rs): For users that already have Triton model config and require the full config to be returned for client side logic, they can set the config in `TensorModelConfig::triton_model_config` which will supersedes other fields in `TensorModelConfig` and be used for endpoint responses. `triton_model_config` is expected to be the serialized string of the `ModelConfig` protobuf message, see [echo_tensor_worker.py](../../tests/frontend/grpc/echo_tensor_worker.py) for example. - -#### Inference - -When receiving inference request, the backend will receive [NvCreateTensorRequest](../../lib/llm/src/protocols/tensor.rs) and be expected to return [NvCreateTensorResponse](../../lib/llm/src/protocols/tensor.rs), which are the mapping of ModelInferRequest / ModelInferResponse protobuf message in Dynamo. - -## Python Bindings - -The frontend may be started via Python binding, this is useful when integrating Dynamo in existing system that desire the frontend to be run in the same process with other components. See [server.py](../../lib/bindings/python/examples/kserve_grpc_service/server.py) for example. 
diff --git a/docs/frontends/openapi.json b/docs/frontends/openapi.json deleted file mode 100644 index 9600c11c3f9..00000000000 --- a/docs/frontends/openapi.json +++ /dev/null @@ -1,2893 +0,0 @@ -{ - "openapi": "3.1.0", - "info": { - "title": "NVIDIA Dynamo OpenAI Frontend", - "description": "OpenAI-compatible HTTP API for NVIDIA Dynamo.", - "contact": { - "name": "NVIDIA Dynamo", - "url": "https://github.com/ai-dynamo/dynamo" - }, - "license": { - "name": "Apache-2.0" - }, - "version": "0.7.0" - }, - "servers": [ - { - "url": "/", - "description": "Current server" - } - ], - "paths": { - "/busy_threshold": { - "get": { - "summary": "Endpoint: /busy_threshold", - "description": "Endpoint for path: /busy_threshold", - "operationId": "get_busy_threshold", - "responses": { - "200": { - "description": "Successful response" - }, - "400": { - "description": "Bad request - invalid input" - }, - "404": { - "description": "Model not found" - }, - "503": { - "description": "Service unavailable" - } - } - } - }, - "/docs": { - "get": { - "summary": "API documentation", - "description": "Interactive API documentation powered by Swagger UI.", - "operationId": "get_docs", - "responses": { - "200": { - "description": "Successful response" - }, - "400": { - "description": "Bad request - invalid input" - }, - "404": { - "description": "Model not found" - }, - "503": { - "description": "Service unavailable" - } - } - } - }, - "/health": { - "get": { - "summary": "Health check", - "description": "Returns the health status of the service. Used for readiness probes.", - "operationId": "get_health", - "responses": { - "200": { - "description": "Successful response" - }, - "400": { - "description": "Bad request - invalid input" - }, - "404": { - "description": "Model not found" - }, - "503": { - "description": "Service unavailable" - } - } - } - }, - "/live": { - "get": { - "summary": "Liveness check", - "description": "Returns the liveness status of the service. 
Used for liveness probes.", - "operationId": "get_live", - "responses": { - "200": { - "description": "Successful response" - }, - "400": { - "description": "Bad request - invalid input" - }, - "404": { - "description": "Model not found" - }, - "503": { - "description": "Service unavailable" - } - } - } - }, - "/metrics": { - "get": { - "summary": "Prometheus metrics", - "description": "Returns Prometheus metrics for monitoring the service.", - "operationId": "get_metrics", - "responses": { - "200": { - "description": "Successful response" - }, - "400": { - "description": "Bad request - invalid input" - }, - "404": { - "description": "Model not found" - }, - "503": { - "description": "Service unavailable" - } - } - } - }, - "/openapi.json": { - "get": { - "summary": "OpenAPI specification", - "description": "Returns the OpenAPI 3.0 specification for this API in JSON format.", - "operationId": "get_openapi.json", - "responses": { - "200": { - "description": "Successful response" - }, - "400": { - "description": "Bad request - invalid input" - }, - "404": { - "description": "Model not found" - }, - "503": { - "description": "Service unavailable" - } - } - } - }, - "/v1/chat/completions": { - "post": { - "summary": "Create chat completion", - "description": "Creates a completion for a chat conversation. Supports both streaming and non-streaming modes. 
Compatible with OpenAI's chat completions API.", - "operationId": "post_v1_chat_completions", - "requestBody": { - "description": "Chat completion request with model, messages, and optional parameters", - "content": { - "application/json": { - "schema": { - "allOf": [ - { - "$ref": "#/components/schemas/CreateChatCompletionRequest" - }, - { - "$ref": "#/components/schemas/CommonExt" - }, - { - "type": "object", - "properties": { - "chat_template_args": { - "type": [ - "object", - "null" - ], - "description": "Extra args to pass to the chat template rendering context", - "additionalProperties": {}, - "propertyNames": { - "type": "string" - } - }, - "nvext": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/NvExt" - } - ] - } - }, - "additionalProperties": { - "description": "Catch-all for unsupported fields - checked during validation" - } - } - ], - "description": "A request structure for creating a chat completion, extending OpenAI's\n`CreateChatCompletionRequest` with [`NvExt`] extensions and common fields.\n\n# Fields\n- `inner`: The base OpenAI chat completion request, embedded using `serde(flatten)`.\n- `common`: Common extension fields (ignore_eos, min_tokens) at root level, embedded using `serde(flatten)`.\n- `nvext`: The optional NVIDIA extension field. See [`NvExt`] for more details.\n Note: If ignore_eos is specified in both common and nvext, the common (root-level) value takes precedence." - }, - "example": { - "model": "Qwen/Qwen3-0.6B", - "messages": [ - { - "role": "system", - "content": "You are a helpful assistant." - }, - { - "role": "user", - "content": "Hello! Can you help me understand what this API does?" 
- } - ], - "temperature": 0.7, - "max_tokens": 50, - "stream": false - } - } - }, - "required": true - }, - "responses": { - "200": { - "description": "Successful response" - }, - "400": { - "description": "Bad request - invalid input" - }, - "404": { - "description": "Model not found" - }, - "503": { - "description": "Service unavailable" - } - } - } - }, - "/v1/completions": { - "post": { - "summary": "Create text completion", - "description": "Creates a completion for a given prompt. Supports both streaming and non-streaming modes. Compatible with OpenAI's completions API.", - "operationId": "post_v1_completions", - "requestBody": { - "description": "Text completion request with model, prompt, and optional parameters", - "content": { - "application/json": { - "schema": { - "allOf": [ - { - "$ref": "#/components/schemas/CreateCompletionRequest" - }, - { - "$ref": "#/components/schemas/CommonExt" - }, - { - "type": "object", - "properties": { - "metadata": {}, - "nvext": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/NvExt" - } - ] - } - }, - "additionalProperties": { - "description": "Catch-all for unsupported fields - checked during validation" - } - } - ] - }, - "example": { - "model": "Qwen/Qwen3-0.6B", - "prompt": "Once upon a time", - "temperature": 0.7, - "max_tokens": 50, - "stream": false - } - } - }, - "required": true - }, - "responses": { - "200": { - "description": "Successful response" - }, - "400": { - "description": "Bad request - invalid input" - }, - "404": { - "description": "Model not found" - }, - "503": { - "description": "Service unavailable" - } - } - } - }, - "/v1/embeddings": { - "post": { - "summary": "Create embeddings", - "description": "Creates an embedding vector representing the input text. 
Compatible with OpenAI's embeddings API.", - "operationId": "post_v1_embeddings", - "requestBody": { - "description": "Embedding request with model and input text", - "content": { - "application/json": { - "schema": { - "allOf": [ - { - "$ref": "#/components/schemas/CreateEmbeddingRequest" - }, - { - "type": "object", - "properties": { - "nvext": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/NvExt" - } - ] - } - } - } - ] - }, - "example": { - "model": "Qwen/Qwen3-Embedding-4B", - "input": "The quick brown fox jumps over the lazy dog" - } - } - }, - "required": true - }, - "responses": { - "200": { - "description": "Successful response" - }, - "400": { - "description": "Bad request - invalid input" - }, - "404": { - "description": "Model not found" - }, - "503": { - "description": "Service unavailable" - } - } - } - }, - "/v1/models": { - "get": { - "summary": "List available models", - "description": "Lists the currently available models and provides basic information about each.", - "operationId": "get_v1_models", - "responses": { - "200": { - "description": "Successful response" - }, - "400": { - "description": "Bad request - invalid input" - }, - "404": { - "description": "Model not found" - }, - "503": { - "description": "Service unavailable" - } - } - } - }, - "/v1/responses": { - "post": { - "summary": "Create response", - "description": "Creates a response for a given input. 
Compatible with OpenAI's responses API.", - "operationId": "post_v1_responses", - "requestBody": { - "description": "Response request with model and input", - "content": { - "application/json": { - "schema": { - "allOf": [ - { - "$ref": "#/components/schemas/CreateResponse", - "description": "Flattened CreateResponse fields (model, input, temperature, etc.)" - }, - { - "type": "object", - "properties": { - "nvext": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/NvExt" - } - ] - } - } - } - ] - }, - "example": { - "model": "Qwen/Qwen3-0.6B", - "input": "What is the capital of France?" - } - } - }, - "required": true - }, - "responses": { - "200": { - "description": "Successful response" - }, - "400": { - "description": "Bad request - invalid input" - }, - "404": { - "description": "Model not found" - }, - "503": { - "description": "Service unavailable" - } - } - } - } - }, - "components": { - "schemas": { - "AudioUrl": { - "type": "object", - "required": [ - "url" - ], - "properties": { - "url": { - "type": "string", - "format": "uri", - "description": "URL of the audio file" - }, - "uuid": { - "type": [ - "string", - "null" - ], - "format": "uuid", - "description": "Optional unique identifier for the audio." - } - } - }, - "ChatCompletionAudio": { - "type": "object", - "required": [ - "voice", - "format" - ], - "properties": { - "format": { - "$ref": "#/components/schemas/ChatCompletionAudioFormat", - "description": "Specifies the output audio format. Must be one of `wav`, `mp3`, `flac`, `opus`, or `pcm16`." - }, - "voice": { - "$ref": "#/components/schemas/ChatCompletionAudioVoice", - "description": "The voice the model uses to respond. Supported voices are `ash`, `ballad`, `coral`, `sage`, and `verse` (also supported but not recommended are `alloy`, `echo`, and `shimmer`; these voices are less expressive)." 
- } - } - }, - "ChatCompletionAudioFormat": { - "type": "string", - "enum": [ - "wav", - "mp3", - "flac", - "opus", - "pcm16" - ] - }, - "ChatCompletionAudioVoice": { - "type": "string", - "enum": [ - "alloy", - "ash", - "ballad", - "coral", - "echo", - "sage", - "shimmer", - "verse" - ] - }, - "ChatCompletionFunctionCall": { - "oneOf": [ - { - "type": "string", - "description": "The model does not call a function, and responds to the end-user.", - "enum": [ - "none" - ] - }, - { - "type": "string", - "description": "The model can pick between an end-user or calling a function.", - "enum": [ - "auto" - ] - }, - { - "type": "object", - "description": "Forces the model to call the specified function.", - "required": [ - "Function" - ], - "properties": { - "Function": { - "type": "object", - "description": "Forces the model to call the specified function.", - "required": [ - "name" - ], - "properties": { - "name": { - "type": "string" - } - } - } - } - } - ] - }, - "ChatCompletionFunctions": { - "type": "object", - "required": [ - "name", - "parameters" - ], - "properties": { - "description": { - "type": [ - "string", - "null" - ], - "description": "A description of what the function does, used by the model to choose when and how to call the function." - }, - "name": { - "type": "string", - "description": "The name of the function to be called. Must be a-z, A-Z, 0-9, or contain underscores and dashes, with a maximum length of 64." - }, - "parameters": { - "description": "The parameters the functions accepts, described as a JSON Schema object. See the [guide](https://platform.openai.com/docs/guides/text-generation/function-calling) for examples, and the [JSON Schema reference](https://json-schema.org/understanding-json-schema/) for documentation about the format.\n\nOmitting `parameters` defines a function with an empty parameter list." 
- } - }, - "deprecated": true - }, - "ChatCompletionMessageToolCall": { - "type": "object", - "required": [ - "id", - "type", - "function" - ], - "properties": { - "function": { - "$ref": "#/components/schemas/FunctionCall", - "description": "The function that the model called." - }, - "id": { - "type": "string", - "description": "The ID of the tool call." - }, - "type": { - "$ref": "#/components/schemas/ChatCompletionToolType", - "description": "The type of the tool. Currently, only `function` is supported." - } - } - }, - "ChatCompletionModalities": { - "type": "string", - "description": "Output types that you would like the model to generate for this request.\n\nMost models are capable of generating text, which is the default: `[\"text\"]`\n\nThe `gpt-4o-audio-preview` model can also be used to [generate\naudio](https://platform.openai.com/docs/guides/audio). To request that this model generate both text and audio responses, you can use: `[\"text\", \"audio\"]`", - "enum": [ - "text", - "audio" - ] - }, - "ChatCompletionNamedToolChoice": { - "type": "object", - "description": "Specifies a tool the model should use. Use to force the model to call a specific function.", - "required": [ - "type", - "function" - ], - "properties": { - "function": { - "$ref": "#/components/schemas/FunctionName" - }, - "type": { - "$ref": "#/components/schemas/ChatCompletionToolType", - "description": "The type of the tool. Currently, only `function` is supported." - } - } - }, - "ChatCompletionRequestAssistantMessage": { - "type": "object", - "properties": { - "audio": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/ChatCompletionRequestAssistantMessageAudio", - "description": "Data about a previous audio response from the model.\n[Learn more](https://platform.openai.com/docs/guides/audio)." 
- } - ] - }, - "content": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/ChatCompletionRequestAssistantMessageContent", - "description": "The contents of the assistant message. Required unless `tool_calls` or `function_call` is specified." - } - ] - }, - "function_call": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/FunctionCall", - "description": "Deprecated and replaced by `tool_calls`. The name and arguments of a function that should be called, as generated by the model." - } - ] - }, - "name": { - "type": [ - "string", - "null" - ], - "description": "An optional name for the participant. Provides the model information to differentiate between participants of the same role." - }, - "refusal": { - "type": [ - "string", - "null" - ], - "description": "The refusal message by the assistant." - }, - "tool_calls": { - "type": [ - "array", - "null" - ], - "items": { - "$ref": "#/components/schemas/ChatCompletionMessageToolCall" - } - } - } - }, - "ChatCompletionRequestAssistantMessageAudio": { - "type": "object", - "required": [ - "id" - ], - "properties": { - "id": { - "type": "string", - "description": "Unique identifier for a previous audio response from the model." - } - } - }, - "ChatCompletionRequestAssistantMessageContent": { - "oneOf": [ - { - "type": "string", - "description": "The text contents of the message." - }, - { - "type": "array", - "items": { - "$ref": "#/components/schemas/ChatCompletionRequestAssistantMessageContentPart" - }, - "description": "An array of content parts with a defined type. Can be one or more of type `text`, or exactly one of type `refusal`." 
- } - ] - }, - "ChatCompletionRequestAssistantMessageContentPart": { - "oneOf": [ - { - "allOf": [ - { - "$ref": "#/components/schemas/ChatCompletionRequestMessageContentPartText" - }, - { - "type": "object", - "required": [ - "type" - ], - "properties": { - "type": { - "type": "string", - "enum": [ - "text" - ] - } - } - } - ] - }, - { - "allOf": [ - { - "$ref": "#/components/schemas/ChatCompletionRequestMessageContentPartRefusal" - }, - { - "type": "object", - "required": [ - "type" - ], - "properties": { - "type": { - "type": "string", - "enum": [ - "refusal" - ] - } - } - } - ] - } - ] - }, - "ChatCompletionRequestDeveloperMessage": { - "type": "object", - "required": [ - "content" - ], - "properties": { - "content": { - "$ref": "#/components/schemas/ChatCompletionRequestDeveloperMessageContent", - "description": "The contents of the developer message." - }, - "name": { - "type": [ - "string", - "null" - ], - "description": "An optional name for the participant. Provides the model information to differentiate between participants of the same role." - } - } - }, - "ChatCompletionRequestDeveloperMessageContent": { - "oneOf": [ - { - "type": "string" - }, - { - "type": "array", - "items": { - "$ref": "#/components/schemas/ChatCompletionRequestMessageContentPartText" - } - } - ] - }, - "ChatCompletionRequestFunctionMessage": { - "type": "object", - "required": [ - "name" - ], - "properties": { - "content": { - "type": [ - "string", - "null" - ], - "description": "The return value from the function call, to return to the model." - }, - "name": { - "type": "string", - "description": "The name of the function to call." 
- } - } - }, - "ChatCompletionRequestMessage": { - "oneOf": [ - { - "allOf": [ - { - "$ref": "#/components/schemas/ChatCompletionRequestDeveloperMessage" - }, - { - "type": "object", - "required": [ - "role" - ], - "properties": { - "role": { - "type": "string", - "enum": [ - "developer" - ] - } - } - } - ] - }, - { - "allOf": [ - { - "$ref": "#/components/schemas/ChatCompletionRequestSystemMessage" - }, - { - "type": "object", - "required": [ - "role" - ], - "properties": { - "role": { - "type": "string", - "enum": [ - "system" - ] - } - } - } - ] - }, - { - "allOf": [ - { - "$ref": "#/components/schemas/ChatCompletionRequestUserMessage" - }, - { - "type": "object", - "required": [ - "role" - ], - "properties": { - "role": { - "type": "string", - "enum": [ - "user" - ] - } - } - } - ] - }, - { - "allOf": [ - { - "$ref": "#/components/schemas/ChatCompletionRequestAssistantMessage" - }, - { - "type": "object", - "required": [ - "role" - ], - "properties": { - "role": { - "type": "string", - "enum": [ - "assistant" - ] - } - } - } - ] - }, - { - "allOf": [ - { - "$ref": "#/components/schemas/ChatCompletionRequestToolMessage" - }, - { - "type": "object", - "required": [ - "role" - ], - "properties": { - "role": { - "type": "string", - "enum": [ - "tool" - ] - } - } - } - ] - }, - { - "allOf": [ - { - "$ref": "#/components/schemas/ChatCompletionRequestFunctionMessage" - }, - { - "type": "object", - "required": [ - "role" - ], - "properties": { - "role": { - "type": "string", - "enum": [ - "function" - ] - } - } - } - ] - } - ] - }, - "ChatCompletionRequestMessageContentPartAudio": { - "type": "object", - "description": "Learn about [audio inputs](https://platform.openai.com/docs/guides/audio).", - "required": [ - "input_audio" - ], - "properties": { - "input_audio": { - "$ref": "#/components/schemas/InputAudio" - } - } - }, - "ChatCompletionRequestMessageContentPartAudioUrl": { - "type": "object", - "required": [ - "audio_url" - ], - "properties": { - "audio_url": { - 
"$ref": "#/components/schemas/AudioUrl" - } - } - }, - "ChatCompletionRequestMessageContentPartImage": { - "type": "object", - "required": [ - "image_url" - ], - "properties": { - "image_url": { - "$ref": "#/components/schemas/ImageUrl" - } - } - }, - "ChatCompletionRequestMessageContentPartRefusal": { - "type": "object", - "required": [ - "refusal" - ], - "properties": { - "refusal": { - "type": "string", - "description": "The refusal message generated by the model." - } - } - }, - "ChatCompletionRequestMessageContentPartText": { - "type": "object", - "required": [ - "text" - ], - "properties": { - "text": { - "type": "string" - } - } - }, - "ChatCompletionRequestMessageContentPartVideo": { - "type": "object", - "required": [ - "video_url" - ], - "properties": { - "video_url": { - "$ref": "#/components/schemas/VideoUrl" - } - } - }, - "ChatCompletionRequestSystemMessage": { - "type": "object", - "required": [ - "content" - ], - "properties": { - "content": { - "$ref": "#/components/schemas/ChatCompletionRequestSystemMessageContent", - "description": "The contents of the system message." - }, - "name": { - "type": [ - "string", - "null" - ], - "description": "An optional name for the participant. Provides the model information to differentiate between participants of the same role." - } - } - }, - "ChatCompletionRequestSystemMessageContent": { - "oneOf": [ - { - "type": "string", - "description": "The text contents of the system message." - }, - { - "type": "array", - "items": { - "$ref": "#/components/schemas/ChatCompletionRequestSystemMessageContentPart" - }, - "description": "An array of content parts with a defined type. For system messages, only type `text` is supported." 
- } - ] - }, - "ChatCompletionRequestSystemMessageContentPart": { - "oneOf": [ - { - "allOf": [ - { - "$ref": "#/components/schemas/ChatCompletionRequestMessageContentPartText" - }, - { - "type": "object", - "required": [ - "type" - ], - "properties": { - "type": { - "type": "string", - "enum": [ - "text" - ] - } - } - } - ] - } - ] - }, - "ChatCompletionRequestToolMessage": { - "type": "object", - "description": "Tool message", - "required": [ - "content", - "tool_call_id" - ], - "properties": { - "content": { - "$ref": "#/components/schemas/ChatCompletionRequestToolMessageContent", - "description": "The contents of the tool message." - }, - "tool_call_id": { - "type": "string" - } - } - }, - "ChatCompletionRequestToolMessageContent": { - "oneOf": [ - { - "type": "string", - "description": "The text contents of the tool message." - }, - { - "type": "array", - "items": { - "$ref": "#/components/schemas/ChatCompletionRequestToolMessageContentPart" - }, - "description": "An array of content parts with a defined type. For tool messages, only type `text` is supported." - } - ] - }, - "ChatCompletionRequestToolMessageContentPart": { - "oneOf": [ - { - "allOf": [ - { - "$ref": "#/components/schemas/ChatCompletionRequestMessageContentPartText" - }, - { - "type": "object", - "required": [ - "type" - ], - "properties": { - "type": { - "type": "string", - "enum": [ - "text" - ] - } - } - } - ] - } - ] - }, - "ChatCompletionRequestUserMessage": { - "type": "object", - "required": [ - "content" - ], - "properties": { - "content": { - "$ref": "#/components/schemas/ChatCompletionRequestUserMessageContent", - "description": "The contents of the user message." - }, - "name": { - "type": [ - "string", - "null" - ], - "description": "An optional name for the participant. Provides the model information to differentiate between participants of the same role." 
- } - } - }, - "ChatCompletionRequestUserMessageContent": { - "oneOf": [ - { - "type": "string", - "description": "The text contents of the message." - }, - { - "type": "array", - "items": { - "$ref": "#/components/schemas/ChatCompletionRequestUserMessageContentPart" - }, - "description": "An array of content parts with a defined type. Supported options differ based on the [model](https://platform.openai.com/docs/models) being used to generate the response. Can contain text, image, or audio inputs." - } - ] - }, - "ChatCompletionRequestUserMessageContentPart": { - "oneOf": [ - { - "allOf": [ - { - "$ref": "#/components/schemas/ChatCompletionRequestMessageContentPartText" - }, - { - "type": "object", - "required": [ - "type" - ], - "properties": { - "type": { - "type": "string", - "enum": [ - "text" - ] - } - } - } - ] - }, - { - "allOf": [ - { - "$ref": "#/components/schemas/ChatCompletionRequestMessageContentPartImage" - }, - { - "type": "object", - "required": [ - "type" - ], - "properties": { - "type": { - "type": "string", - "enum": [ - "image_url" - ] - } - } - } - ] - }, - { - "allOf": [ - { - "$ref": "#/components/schemas/ChatCompletionRequestMessageContentPartVideo" - }, - { - "type": "object", - "required": [ - "type" - ], - "properties": { - "type": { - "type": "string", - "enum": [ - "video_url" - ] - } - } - } - ] - }, - { - "allOf": [ - { - "$ref": "#/components/schemas/ChatCompletionRequestMessageContentPartAudioUrl" - }, - { - "type": "object", - "required": [ - "type" - ], - "properties": { - "type": { - "type": "string", - "enum": [ - "audio_url" - ] - } - } - } - ] - }, - { - "allOf": [ - { - "$ref": "#/components/schemas/ChatCompletionRequestMessageContentPartAudio" - }, - { - "type": "object", - "required": [ - "type" - ], - "properties": { - "type": { - "type": "string", - "enum": [ - "input_audio" - ] - } - } - } - ] - } - ] - }, - "ChatCompletionStreamOptions": { - "type": "object", - "description": "Options for streaming response. 
Only set this when you set `stream: true`.", - "required": [ - "include_usage" - ], - "properties": { - "include_usage": { - "type": "boolean", - "description": "If set, an additional chunk will be streamed before the `data: [DONE]` message. The `usage` field on this chunk shows the token usage statistics for the entire request, and the `choices` field will always be an empty array. All other chunks will also include a `usage` field, but with a null value." - } - } - }, - "ChatCompletionTool": { - "type": "object", - "required": [ - "type", - "function" - ], - "properties": { - "function": { - "$ref": "#/components/schemas/FunctionObject" - }, - "type": { - "$ref": "#/components/schemas/ChatCompletionToolType" - } - } - }, - "ChatCompletionToolChoiceOption": { - "oneOf": [ - { - "type": "string", - "enum": [ - "none" - ] - }, - { - "type": "string", - "enum": [ - "auto" - ] - }, - { - "type": "string", - "enum": [ - "required" - ] - }, - { - "type": "object", - "required": [ - "named" - ], - "properties": { - "named": { - "$ref": "#/components/schemas/ChatCompletionNamedToolChoice" - } - } - } - ], - "description": "Controls which (if any) tool is called by the model.\n`none` means the model will not call any tool and instead generates a message.\n`auto` means the model can pick between generating a message or calling one or more tools.\n`required` means the model must call one or more tools.\nSpecifying a particular tool via `{\"type\": \"function\", \"function\": {\"name\": \"my_function\"}}` forces the model to call that tool.\n\n`none` is the default when no tools are present. `auto` is the default if tools are present." 
- }, - "ChatCompletionToolType": { - "type": "string", - "enum": [ - "function" - ] - }, - "CommonExt": { - "type": "object", - "description": "Common extensions for OpenAI API requests that are not part of the standard OpenAI spec\nbut are commonly needed across different request types.", - "properties": { - "guided_choice": { - "type": [ - "array", - "null" - ], - "items": { - "type": "string" - }, - "description": "If specified, the output will be exactly one of the choices." - }, - "guided_decoding_backend": { - "type": [ - "string", - "null" - ], - "description": "If specified, the backend to use for guided decoding, can be backends like xgrammar or custom guided decoding backend" - }, - "guided_grammar": { - "type": [ - "string", - "null" - ], - "description": "If specified, the output will follow the context-free grammar. Can be a string or null." - }, - "guided_json": { - "description": "Guided Decoding Options\nIf specified, the output will be a JSON object. Can be a string, an object, or null." - }, - "guided_regex": { - "type": [ - "string", - "null" - ], - "description": "If specified, the output will follow the regex pattern. Can be a string or null." - }, - "guided_whitespace_pattern": { - "type": [ - "string", - "null" - ], - "description": "If specified, the output will follow the whitespace pattern. Can be a string or null." - }, - "ignore_eos": { - "type": [ - "boolean", - "null" - ], - "description": "If true, the model will ignore the end of string token and generate to max_tokens.\nThis field can also be specified in nvext, but the root-level value takes precedence." 
- }, - "include_stop_str_in_output": { - "type": [ - "boolean", - "null" - ], - "description": "include_stop_str_in_output" - }, - "min_p": { - "type": [ - "number", - "null" - ], - "format": "float", - "description": "Relative probability floor" - }, - "min_tokens": { - "type": [ - "integer", - "null" - ], - "format": "int32", - "description": "The minimum number of tokens to generate.\nThis is a common parameter needed across different request types.", - "minimum": 0 - }, - "repetition_penalty": { - "type": [ - "number", - "null" - ], - "format": "float", - "description": "How much to penalize tokens based on how frequently they occur in the text.\nA value of 1 means no penalty, while values larger than 1 discourage and values smaller encourage." - }, - "skip_special_tokens": { - "type": [ - "boolean", - "null" - ], - "description": "Whether to skip special tokens in the decoded output.\nWhen true, special tokens (like EOS, BOS, PAD) are removed from the output text.\nWhen false, special tokens are included in the output text.\nDefaults to false if not specified." - }, - "top_k": { - "type": [ - "integer", - "null" - ], - "format": "int32", - "description": "Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens." - } - } - }, - "CreateChatCompletionRequest": { - "type": "object", - "required": [ - "messages", - "model" - ], - "properties": { - "audio": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/ChatCompletionAudio", - "description": "Parameters for audio output. Required when audio output is requested with `modalities: [\"audio\"]`. [Learn more](https://platform.openai.com/docs/guides/audio)." - } - ] - }, - "frequency_penalty": { - "type": [ - "number", - "null" - ], - "format": "float", - "description": "Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim." 
- }, - "function_call": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/ChatCompletionFunctionCall", - "description": "Deprecated in favor of `tool_choice`.\n\nControls which (if any) function is called by the model.\n`none` means the model will not call a function and instead generates a message.\n`auto` means the model can pick between generating a message or calling a function.\nSpecifying a particular function via `{\"name\": \"my_function\"}` forces the model to call that function.\n\n`none` is the default when no functions are present. `auto` is the default if functions are present." - } - ] - }, - "functions": { - "type": [ - "array", - "null" - ], - "items": { - "$ref": "#/components/schemas/ChatCompletionFunctions" - }, - "description": "Deprecated in favor of `tools`.\n\nA list of functions the model may generate JSON inputs for.", - "deprecated": true - }, - "logit_bias": { - "type": [ - "object", - "null" - ], - "description": "Modify the likelihood of specified tokens appearing in the completion.\n\nAccepts a json object that maps tokens (specified by their token ID in the tokenizer) to an associated bias value from -100 to 100.\nMathematically, the bias is added to the logits generated by the model prior to sampling.\nThe exact effect will vary per model, but values between -1 and 1 should decrease or increase likelihood of selection;\nvalues like -100 or 100 should result in a ban or exclusive selection of the relevant token.", - "additionalProperties": {}, - "propertyNames": { - "type": "string" - } - }, - "logprobs": { - "type": [ - "boolean", - "null" - ], - "description": "Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the `content` of `message`." 
- }, - "max_completion_tokens": { - "type": [ - "integer", - "null" - ], - "format": "int32", - "description": "An upper bound for the number of tokens that can be generated for a completion, including visible output tokens and [reasoning tokens](https://platform.openai.com/docs/guides/reasoning).", - "minimum": 0 - }, - "max_tokens": { - "type": [ - "integer", - "null" - ], - "format": "int32", - "description": "The maximum number of [tokens](https://platform.openai.com/tokenizer) that can be generated in the chat completion.\n\nThis value can be used to control [costs](https://openai.com/api/pricing/) for text generated via API.\nThis value is now deprecated in favor of `max_completion_tokens`, and is\nnot compatible with [o1 series models](https://platform.openai.com/docs/guides/reasoning).", - "deprecated": true, - "minimum": 0 - }, - "messages": { - "type": "array", - "items": { - "$ref": "#/components/schemas/ChatCompletionRequestMessage" - }, - "description": "A list of messages comprising the conversation so far. Depending on the [model](https://platform.openai.com/docs/models) you use, different message types (modalities) are supported, like [text](https://platform.openai.com/docs/guides/text-generation), [images](https://platform.openai.com/docs/guides/vision), and [audio](https://platform.openai.com/docs/guides/audio)." - }, - "metadata": { - "description": "Developer-defined tags and values used for filtering completions in the [dashboard](https://platform.openai.com/chat-completions)." 
- }, - "mm_processor_kwargs": { - "description": "Multimodal processor configuration parameters" - }, - "modalities": { - "type": [ - "array", - "null" - ], - "items": { - "$ref": "#/components/schemas/ChatCompletionModalities" - } - }, - "model": { - "type": "string", - "description": "ID of the model to use.\nSee the [model endpoint compatibility](https://platform.openai.com/docs/models#model-endpoint-compatibility) table for details on which models work with the Chat API." - }, - "n": { - "type": [ - "integer", - "null" - ], - "format": "int32", - "description": "How many chat completion choices to generate for each input message. Note that you will be charged based on the number of generated tokens across all of the choices. Keep `n` as `1` to minimize costs.", - "minimum": 0 - }, - "parallel_tool_calls": { - "type": [ - "boolean", - "null" - ], - "description": "Whether to enable [parallel function calling](https://platform.openai.com/docs/guides/function-calling/parallel-function-calling) during tool use." - }, - "prediction": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/PredictionContent", - "description": "Configuration for a [Predicted Output](https://platform.openai.com/docs/guides/predicted-outputs),which can greatly improve response times when large parts of the model response are known ahead of time. This is most common when you are regenerating a file with only minor changes to most of the content." - } - ] - }, - "presence_penalty": { - "type": [ - "number", - "null" - ], - "format": "float", - "description": "Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics." 
- }, - "reasoning_effort": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/ReasoningEffort", - "description": "**o1 models only**\n\nConstrains effort on reasoning for\n[reasoning models](https://platform.openai.com/docs/guides/reasoning).\n\nCurrently supported values are `low`, `medium`, and `high`. Reducing\n\nreasoning effort can result in faster responses and fewer tokens\nused on reasoning in a response." - } - ] - }, - "response_format": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/ResponseFormat", - "description": "An object specifying the format that the model must output. Compatible with [GPT-4o](https://platform.openai.com/docs/models/gpt-4o), [GPT-4o mini](https://platform.openai.com/docs/models/gpt-4o-mini), [GPT-4 Turbo](https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo) and all GPT-3.5 Turbo models newer than `gpt-3.5-turbo-1106`.\n\nSetting to `{ \"type\": \"json_schema\", \"json_schema\": {...} }` enables Structured Outputs which guarantees the model will match your supplied JSON schema. Learn more in the [Structured Outputs guide](https://platform.openai.com/docs/guides/structured-outputs).\n\nSetting to `{ \"type\": \"json_object\" }` enables JSON mode, which guarantees the message the model generates is valid JSON.\n\n**Important:** when using JSON mode, you **must** also instruct the model to produce JSON yourself via a system or user message. Without this, the model may generate an unending stream of whitespace until the generation reaches the token limit, resulting in a long-running and seemingly \"stuck\" request. Also note that the message content may be partially cut off if `finish_reason=\"length\"`, which indicates the generation exceeded `max_tokens` or the conversation exceeded the max context length." 
- } - ] - }, - "seed": { - "type": [ - "integer", - "null" - ], - "format": "int64", - "description": " This feature is in Beta.\nIf specified, our system will make a best effort to sample deterministically, such that repeated requests\nwith the same `seed` and parameters should return the same result.\nDeterminism is not guaranteed, and you should refer to the `system_fingerprint` response parameter to monitor changes in the backend." - }, - "service_tier": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/ServiceTier", - "description": "Specifies the latency tier to use for processing the request. This parameter is relevant for customers subscribed to the scale tier service:\n- If set to 'auto', the system will utilize scale tier credits until they are exhausted.\n- If set to 'default', the request will be processed using the default service tier with a lower uptime SLA and no latency guarentee.\n- When not set, the default behavior is 'auto'.\n\nWhen this parameter is set, the response body will include the `service_tier` utilized." - } - ] - }, - "stop": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/Stop", - "description": "Up to 4 sequences where the API will stop generating further tokens." - } - ] - }, - "store": { - "type": [ - "boolean", - "null" - ], - "description": "Whether or not to store the output of this chat completion request\n\nfor use in our [model distillation](https://platform.openai.com/docs/guides/distillation) or [evals](https://platform.openai.com/docs/guides/evals) products." - }, - "stream": { - "type": [ - "boolean", - "null" - ], - "description": "If set, partial message deltas will be sent, like in ChatGPT.\nTokens will be sent as data-only [server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events#Event_stream_format)\nas they become available, with the stream terminated by a `data: [DONE]` message. 
[Example Python code](https://cookbook.openai.com/examples/how_to_stream_completions)." - }, - "stream_options": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/ChatCompletionStreamOptions" - } - ] - }, - "temperature": { - "type": [ - "number", - "null" - ], - "format": "float", - "description": "What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random,\nwhile lower values like 0.2 will make it more focused and deterministic.\n\nWe generally recommend altering this or `top_p` but not both." - }, - "tool_choice": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/ChatCompletionToolChoiceOption" - } - ] - }, - "tools": { - "type": [ - "array", - "null" - ], - "items": { - "$ref": "#/components/schemas/ChatCompletionTool" - }, - "description": "A list of tools the model may call. Currently, only functions are supported as a tool.\nUse this to provide a list of functions the model may generate JSON inputs for. A max of 128 functions are supported." - }, - "top_logprobs": { - "type": [ - "integer", - "null" - ], - "format": "int32", - "description": "An integer between 0 and 20 specifying the number of most likely tokens to return at each token position, each with an associated log probability. `logprobs` must be set to `true` if this parameter is used.", - "minimum": 0 - }, - "top_p": { - "type": [ - "number", - "null" - ], - "format": "float", - "description": "An alternative to sampling with temperature, called nucleus sampling,\nwhere the model considers the results of the tokens with top_p probability mass.\nSo 0.1 means only the tokens comprising the top 10% probability mass are considered.\n\n We generally recommend altering this or `temperature` but not both." - }, - "user": { - "type": [ - "string", - "null" - ], - "description": "A unique identifier representing your end-user, which can help OpenAI to monitor and detect abuse. 
[Learn more](https://platform.openai.com/docs/guides/safety-best-practices#end-user-ids)." - }, - "web_search_options": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/WebSearchOptions", - "description": "This tool searches the web for relevant results to use in a response.\nLearn more about the [web search tool](https://platform.openai.com/docs/guides/tools-web-search?api-mode=chat)." - } - ] - } - } - }, - "CreateCompletionRequest": { - "type": "object", - "required": [ - "model", - "prompt" - ], - "properties": { - "best_of": { - "type": [ - "integer", - "null" - ], - "format": "int32", - "description": "Generates `best_of` completions server-side and returns the \"best\" (the one with the highest log probability per token). Results cannot be streamed.\n\nWhen used with `n`, `best_of` controls the number of candidate completions and `n` specifies how many to return – `best_of` must be greater than `n`.\n\n**Note:** Because this parameter generates many completions, it can quickly consume your token quota. Use carefully and ensure that you have reasonable settings for `max_tokens` and `stop`.", - "minimum": 0 - }, - "echo": { - "type": [ - "boolean", - "null" - ], - "description": "Echo back the prompt in addition to the completion" - }, - "frequency_penalty": { - "type": [ - "number", - "null" - ], - "format": "float", - "description": "Number between -2.0 and 2.0. 
Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.\n\n[See more information about frequency and presence penalties.](https://platform.openai.com/docs/guides/text-generation/parameter-details)" - }, - "logit_bias": { - "type": [ - "object", - "null" - ], - "description": "Modify the likelihood of specified tokens appearing in the completion.\n\nAccepts a JSON object that maps tokens (specified by their token ID in the GPT tokenizer) to an associated bias value from -100 to 100. You can use this [tokenizer tool](/tokenizer?view=bpe) (which works for both GPT-2 and GPT-3) to convert text to token IDs. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect will vary per model, but values between -1 and 1 should decrease or increase likelihood of selection; values like -100 or 100 should result in a ban or exclusive selection of the relevant token.\n\nAs an example, you can pass `{\"50256\": -100}` to prevent the <|endoftext|> token from being generated.", - "additionalProperties": {}, - "propertyNames": { - "type": "string" - } - }, - "logprobs": { - "type": [ - "integer", - "null" - ], - "format": "int32", - "description": "Include the log probabilities on the `logprobs` most likely output tokens, as well as the chosen tokens. For example, if `logprobs` is 5, the API will return a list of the 5 most likely tokens. The API will always return the `logprob` of the sampled token, so there may be up to `logprobs+1` elements in the response.\n\nThe maximum value for `logprobs` is 5.", - "minimum": 0 - }, - "max_tokens": { - "type": [ - "integer", - "null" - ], - "format": "int32", - "description": "The maximum number of [tokens](https://platform.openai.com/tokenizer) that can be generated in the completion.\n\nThe token count of your prompt plus `max_tokens` cannot exceed the model's context length.
[Example Python code](https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken) for counting tokens.", - "minimum": 0 - }, - "model": { - "type": "string", - "description": "ID of the model to use. You can use the [List models](https://platform.openai.com/docs/api-reference/models/list) API to see all of your available models, or see our [Model overview](https://platform.openai.com/docs/models/overview) for descriptions of them." - }, - "n": { - "type": [ - "integer", - "null" - ], - "format": "int32", - "description": "How many completions to generate for each prompt.\n**Note:** Because this parameter generates many completions, it can quickly consume your token quota. Use carefully and ensure that you have reasonable settings for `max_tokens` and `stop`.\n", - "minimum": 0 - }, - "presence_penalty": { - "type": [ - "number", - "null" - ], - "format": "float", - "description": "Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.\n\n[See more information about frequency and presence penalties.](https://platform.openai.com/docs/guides/text-generation/parameter-details)" - }, - "prompt": { - "$ref": "#/components/schemas/Prompt", - "description": "The prompt(s) to generate completions for, encoded as a string, array of strings, array of tokens, or array of token arrays.\n\nNote that <|endoftext|> is the document separator that the model sees during training, so if a prompt is not specified the model will generate as if from the beginning of a new document." 
- }, - "seed": { - "type": [ - "integer", - "null" - ], - "format": "int64", - "description": "If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same `seed` and parameters should return the same result.\n\nDeterminism is not guaranteed, and you should refer to the `system_fingerprint` response parameter to monitor changes in the backend." - }, - "stop": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/Stop", - "description": "Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence." - } - ] - }, - "stream": { - "type": [ - "boolean", - "null" - ], - "description": "Whether to stream back partial progress. If set, tokens will be sent as data-only [server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events#Event_stream_format)\nas they become available, with the stream terminated by a `data: [DONE]` message." - }, - "stream_options": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/ChatCompletionStreamOptions" - } - ] - }, - "suffix": { - "type": [ - "string", - "null" - ], - "description": "The suffix that comes after a completion of inserted text.\n\nThis parameter is only supported for `gpt-3.5-turbo-instruct`." - }, - "temperature": { - "type": [ - "number", - "null" - ], - "format": "float", - "description": "What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.\n\nWe generally recommend altering this or `top_p` but not both." - }, - "top_p": { - "type": [ - "number", - "null" - ], - "format": "float", - "description": "An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. 
So 0.1 means only the tokens comprising the top 10% probability mass are considered.\n\n We generally recommend altering this or `temperature` but not both." - }, - "user": { - "type": [ - "string", - "null" - ], - "description": "A unique identifier representing your end-user, which will help OpenAI to monitor and detect abuse. [Learn more](https://platform.openai.com/docs/usage-policies/end-user-ids)." - } - } - }, - "CreateEmbeddingRequest": { - "type": "object", - "required": [ - "model", - "input" - ], - "properties": { - "dimensions": { - "type": [ - "integer", - "null" - ], - "format": "int32", - "description": "The number of dimensions the resulting output embeddings should have. Only supported in `text-embedding-3` and later models.", - "minimum": 0 - }, - "encoding_format": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/EncodingFormat", - "description": "The format to return the embeddings in. Can be either `float` or [`base64`](https://pypi.org/project/pybase64/). Defaults to float" - } - ] - }, - "input": { - "$ref": "#/components/schemas/EmbeddingInput", - "description": "Input text to embed, encoded as a string or array of tokens. To embed multiple inputs in a single request, pass an array of strings or array of token arrays. The input must not exceed the max input tokens for the model (8192 tokens for `text-embedding-ada-002`), cannot be an empty string, and any array must be 2048 dimensions or less. [Example Python code](https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken) for counting tokens." - }, - "model": { - "type": "string", - "description": "ID of the model to use. You can use the\n[List models](https://platform.openai.com/docs/api-reference/models/list)\nAPI to see all of your available models, or see our\n[Model overview](https://platform.openai.com/docs/models/overview)\nfor descriptions of them." 
- }, - "user": { - "type": [ - "string", - "null" - ], - "description": "A unique identifier representing your end-user, which will help OpenAI\n to monitor and detect abuse. [Learn more](https://platform.openai.com/docs/usage-policies/end-user-ids)." - } - } - }, - "CreateResponse": { - "type": "object", - "description": "Builder for a Responses API request.", - "required": [ - "input", - "model" - ], - "properties": { - "background": { - "type": [ - "boolean", - "null" - ], - "description": "Whether to run the model response in the background.\nboolean or null." - }, - "include": { - "type": [ - "array", - "null" - ], - "items": { - "type": "string" - }, - "description": "Specify additional output data to include in the model response.\n\nSupported values:\n- `file_search_call.results`\n Include the search results of the file search tool call.\n- `message.input_image.image_url`\n Include image URLs from the input message.\n- `computer_call_output.output.image_url`\n Include image URLs from the computer call output.\n- `reasoning.encrypted_content`\n Include an encrypted version of reasoning tokens in reasoning item outputs.\n This enables reasoning items to be used in multi-turn conversations when\n using the Responses API statelessly (for example, when the `store` parameter\n is set to `false`, or when an organization is enrolled in the zero-data-\n retention program).\n\nIf `None`, no additional data is returned." - }, - "input": { - "type": "object", - "description": "Text, image, or file inputs to the model, used to generate a response.\nUsing value_type to prevent deep schema recursion from Input's nested content types." - }, - "instructions": { - "type": [ - "string", - "null" - ], - "description": "Inserts a system (or developer) message as the first item in the model's context.\n\nWhen using along with previous_response_id, the instructions from a previous response will\nnot be carried over to the next response. 
This makes it simple to swap out system\n(or developer) messages in new responses." - }, - "max_output_tokens": { - "type": [ - "integer", - "null" - ], - "format": "int32", - "description": "An upper bound for the number of tokens that can be generated for a\nresponse, including visible output tokens and reasoning tokens.", - "minimum": 0 - }, - "max_tool_calls": { - "type": [ - "integer", - "null" - ], - "format": "int32", - "description": "The maximum number of total calls to built-in tools that can be processed in a response.\nThis maximum number applies across all built-in tool calls, not per individual tool.\nAny further attempts to call a tool by the model will be ignored.", - "minimum": 0 - }, - "metadata": { - "description": "Arbitrary JSON metadata used as a passthrough parameter" - }, - "model": { - "type": "string", - "description": "Model ID used to generate the response, like `gpt-4o`.\nOpenAI offers a wide range of models with different capabilities,\nperformance characteristics, and price points." - }, - "parallel_tool_calls": { - "type": [ - "boolean", - "null" - ], - "description": "Whether to allow the model to run tool calls in parallel." - }, - "previous_response_id": { - "type": [ - "string", - "null" - ], - "description": "The unique ID of the previous response to the model. Use this to create\nmulti-turn conversations." - }, - "prompt": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/PromptConfig", - "description": "Reference to a prompt template and its variables." - } - ] - }, - "reasoning": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/ReasoningConfig", - "description": "**o-series models only**: Configuration options for reasoning models." 
- } - ] - }, - "service_tier": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/ServiceTier", - "description": "Specifies the latency tier to use for processing the request.\n\nThis parameter is relevant for customers subscribed to the Scale tier service.\n\nSupported values:\n- `auto`\n - If the Project is Scale tier enabled, the system will utilize Scale tier credits until\n they are exhausted.\n - If the Project is not Scale tier enabled, the request will be processed using the\n default service tier with a lower uptime SLA and no latency guarantee.\n- `default`\n The request will be processed using the default service tier with a lower uptime SLA and\n no latency guarantee.\n- `flex`\n The request will be processed with the Flex Processing service tier. Learn more.\n\nWhen not set, the default behavior is `auto`.\n\nWhen this parameter is set, the response body will include the `service_tier` utilized." - } - ] - }, - "store": { - "type": [ - "boolean", - "null" - ], - "description": "Whether to store the generated model response for later retrieval via API." - }, - "stream": { - "type": [ - "boolean", - "null" - ], - "description": "If set to true, the model response data will be streamed to the client as it is\ngenerated using server-sent events." - }, - "temperature": { - "type": [ - "number", - "null" - ], - "format": "float", - "description": "What sampling temperature to use, between 0 and 2. Higher values like 0.8\nwill make the output more random, while lower values like 0.2 will make it\nmore focused and deterministic. We generally recommend altering this or\n`top_p` but not both." - }, - "text": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/TextConfig", - "description": "Configuration options for a text response from the model. Can be plain text\nor structured JSON data." 
- } - ] - }, - "tool_choice": { - "type": "object", - "description": "How the model should select which tool (or tools) to use when generating\na response." - }, - "tools": { - "type": "array", - "items": { - "type": "object" - }, - "description": "An array of tools the model may call while generating a response.\nCan include built-in tools (file_search, web_search_preview,\ncomputer_use_preview) or custom function definitions." - }, - "top_logprobs": { - "type": [ - "integer", - "null" - ], - "format": "int32", - "description": "An integer between 0 and 20 specifying the number of most likely tokens to return\nat each token position, each with an associated log probability.", - "minimum": 0 - }, - "top_p": { - "type": [ - "number", - "null" - ], - "format": "float", - "description": "An alternative to sampling with temperature, called nucleus sampling,\nwhere the model considers the results of the tokens with top_p probability\nmass. So 0.1 means only the tokens comprising the top 10% probability mass\nare considered. We generally recommend altering this or `temperature` but\nnot both." - }, - "truncation": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/Truncation", - "description": "The truncation strategy to use for the model response:\n- `auto`: drop items in the middle to fit context window.\n- `disabled`: error if exceeding context window." - } - ] - }, - "user": { - "type": [ - "string", - "null" - ], - "description": "A unique identifier representing your end-user, which can help OpenAI to\nmonitor and detect abuse." 
- } - } - }, - "EmbeddingInput": { - "oneOf": [ - { - "type": "string" - }, - { - "type": "array", - "items": { - "type": "string" - } - }, - { - "type": "array", - "items": { - "type": "integer", - "format": "int32", - "minimum": 0 - } - }, - { - "type": "array", - "items": { - "type": "array", - "items": { - "type": "integer", - "format": "int32", - "minimum": 0 - } - } - } - ] - }, - "EncodingFormat": { - "type": "string", - "enum": [ - "float", - "base64" - ] - }, - "FunctionCall": { - "type": "object", - "description": "The name and arguments of a function that should be called, as generated by the model.", - "required": [ - "name", - "arguments" - ], - "properties": { - "arguments": { - "type": "string", - "description": "The arguments to call the function with, as generated by the model in JSON format. Note that the model does not always generate valid JSON, and may hallucinate parameters not defined by your function schema. Validate the arguments in your code before calling your function." - }, - "name": { - "type": "string", - "description": "The name of the function to call." - } - } - }, - "FunctionName": { - "type": "object", - "required": [ - "name" - ], - "properties": { - "name": { - "type": "string", - "description": "The name of the function to call." - } - } - }, - "FunctionObject": { - "type": "object", - "required": [ - "name" - ], - "properties": { - "description": { - "type": [ - "string", - "null" - ], - "description": "A description of what the function does, used by the model to choose when and how to call the function." - }, - "name": { - "type": "string", - "description": "The name of the function to be called. Must be a-z, A-Z, 0-9, or contain underscores and dashes, with a maximum length of 64." - }, - "parameters": { - "description": "The parameters the function accepts, described as a JSON Schema object.
See the [guide](https://platform.openai.com/docs/guides/text-generation/function-calling) for examples, and the [JSON Schema reference](https://json-schema.org/understanding-json-schema/) for documentation about the format.\n\nOmitting `parameters` defines a function with an empty parameter list." - }, - "strict": { - "type": [ - "boolean", - "null" - ], - "description": "Whether to enable strict schema adherence when generating the function call. If set to true, the model will follow the exact schema defined in the `parameters` field. Only a subset of JSON Schema is supported when `strict` is `true`. Learn more about Structured Outputs in the [function calling guide](https://platform.openai.com/docs/guides/function-calling)." - } - } - }, - "ImageDetail": { - "type": "string", - "enum": [ - "auto", - "low", - "high" - ] - }, - "ImageUrl": { - "type": "object", - "required": [ - "url" - ], - "properties": { - "detail": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/ImageDetail", - "description": "Specifies the detail level of the image. Learn more in the [Vision guide](https://platform.openai.com/docs/guides/vision/low-or-high-fidelity-image-understanding)." - } - ] - }, - "url": { - "type": "string", - "format": "uri", - "description": "Either a URL of the image or the base64 encoded image data." - }, - "uuid": { - "type": [ - "string", - "null" - ], - "format": "uuid", - "description": "Optional unique identifier for the image." - } - } - }, - "InputAudio": { - "type": "object", - "required": [ - "data", - "format" - ], - "properties": { - "data": { - "type": "string", - "description": "Base64 encoded audio data." - }, - "format": { - "$ref": "#/components/schemas/InputAudioFormat", - "description": "The format of the encoded audio data. Currently supports \"wav\" and \"mp3\"." 
- } - } - }, - "InputAudioFormat": { - "type": "string", - "enum": [ - "wav", - "mp3" - ] - }, - "NvCreateChatCompletionRequest": { - "allOf": [ - { - "$ref": "#/components/schemas/CreateChatCompletionRequest" - }, - { - "$ref": "#/components/schemas/CommonExt" - }, - { - "type": "object", - "properties": { - "chat_template_args": { - "type": [ - "object", - "null" - ], - "description": "Extra args to pass to the chat template rendering context", - "additionalProperties": {}, - "propertyNames": { - "type": "string" - } - }, - "nvext": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/NvExt" - } - ] - } - }, - "additionalProperties": { - "description": "Catch-all for unsupported fields - checked during validation" - } - } - ], - "description": "A request structure for creating a chat completion, extending OpenAI's\n`CreateChatCompletionRequest` with [`NvExt`] extensions and common fields.\n\n# Fields\n- `inner`: The base OpenAI chat completion request, embedded using `serde(flatten)`.\n- `common`: Common extension fields (ignore_eos, min_tokens) at root level, embedded using `serde(flatten)`.\n- `nvext`: The optional NVIDIA extension field. See [`NvExt`] for more details.\n Note: If ignore_eos is specified in both common and nvext, the common (root-level) value takes precedence." 
- }, - "NvCreateCompletionRequest": { - "allOf": [ - { - "$ref": "#/components/schemas/CreateCompletionRequest" - }, - { - "$ref": "#/components/schemas/CommonExt" - }, - { - "type": "object", - "properties": { - "metadata": {}, - "nvext": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/NvExt" - } - ] - } - }, - "additionalProperties": { - "description": "Catch-all for unsupported fields - checked during validation" - } - } - ] - }, - "NvCreateEmbeddingRequest": { - "allOf": [ - { - "$ref": "#/components/schemas/CreateEmbeddingRequest" - }, - { - "type": "object", - "properties": { - "nvext": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/NvExt" - } - ] - } - } - } - ] - }, - "NvCreateResponse": { - "allOf": [ - { - "$ref": "#/components/schemas/CreateResponse", - "description": "Flattened CreateResponse fields (model, input, temperature, etc.)" - }, - { - "type": "object", - "properties": { - "nvext": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/NvExt" - } - ] - } - } - } - ] - }, - "NvExt": { - "type": "object", - "description": "NVIDIA LLM extensions to the OpenAI API", - "properties": { - "annotations": { - "type": [ - "array", - "null" - ], - "items": { - "type": "string" - }, - "description": "Annotations\nUser-requested triggers which cause the request to emit out-of-band information in the SSE\nstream using the `event:` field."
- }, - "backend_instance_id": { - "type": [ - "integer", - "null" - ], - "format": "int64", - "description": "Targeted backend instance ID for the request\nIf set, the request will be routed to the backend instance with the given ID.\nIf not set, the request will be routed to the best matching instance.", - "minimum": 0 - }, - "extra_fields": { - "type": [ - "array", - "null" - ], - "items": { - "type": "string" - }, - "description": "Extra fields to be included in the response's nvext\nThis is a list of field names that should be populated in the response\nSupported fields: \"worker_id\"" - }, - "greed_sampling": { - "type": [ - "boolean", - "null" - ], - "description": "If true, sampling will be forced to be greedy.\nThe backend is responsible for selecting the correct backend-specific options to\nimplement this." - }, - "max_thinking_tokens": { - "type": [ - "integer", - "null" - ], - "format": "int32", - "description": "Maximum number of thinking tokens allowed\nNOTE: Currently passed through to backends as a no-op for future implementation", - "minimum": 0 - }, - "token_data": { - "type": [ - "array", - "null" - ], - "items": { - "type": "integer", - "format": "int32", - "minimum": 0 - }, - "description": "Pre-tokenized data to use instead of tokenizing the prompt\nIf provided along with backend_instance_id, these tokens will be used directly\nand tokenization will be skipped." - }, - "use_raw_prompt": { - "type": [ - "boolean", - "null" - ], - "description": "If true, the preprocessor will try to bypass the prompt template and pass the prompt directly\nto the tokenizer." - } - } - }, - "PredictionContent": { - "oneOf": [ - { - "type": "object", - "description": "The type of the predicted content you want to provide.
This type is\ncurrently always `content`.", - "required": [ - "content", - "type" - ], - "properties": { - "content": { - "$ref": "#/components/schemas/PredictionContentContent", - "description": "The content that should be matched when generating a model response." - }, - "type": { - "type": "string", - "enum": [ - "content" - ] - } - } - } - ], - "description": "Static predicted output content, such as the content of a text file that is being regenerated." - }, - "PredictionContentContent": { - "oneOf": [ - { - "type": "string", - "description": "The content used for a Predicted Output. This is often the text of a file you are regenerating with minor changes." - }, - { - "type": "array", - "items": { - "$ref": "#/components/schemas/ChatCompletionRequestMessageContentPartText" - }, - "description": "An array of content parts with a defined type. Supported options differ based on the [model](https://platform.openai.com/docs/models) being used to generate the response. Can contain text inputs." - } - ], - "description": "The content that should be matched when generating a model response. If generated tokens would match this content, the entire model response can be returned much more quickly." - }, - "Prompt": { - "oneOf": [ - { - "type": "string" - }, - { - "type": "array", - "items": { - "type": "string" - } - }, - { - "type": "array", - "items": { - "type": "integer", - "format": "int32", - "minimum": 0 - } - }, - { - "type": "array", - "items": { - "type": "array", - "items": { - "type": "integer", - "format": "int32", - "minimum": 0 - } - } - } - ] - }, - "PromptConfig": { - "type": "object", - "description": "Reference to a prompt template and its variables.", - "required": [ - "id" - ], - "properties": { - "id": { - "type": "string", - "description": "The unique identifier of the prompt template to use." - }, - "variables": { - "type": [ - "object", - "null" - ], - "description": "Optional map of values to substitute in for variables in your prompt.
The substitution\nvalues can either be strings, or other Response input types like images or files.\nFor now only supporting Strings.", - "additionalProperties": { - "type": "string" - }, - "propertyNames": { - "type": "string" - } - }, - "version": { - "type": [ - "string", - "null" - ], - "description": "Optional version of the prompt template." - } - } - }, - "ReasoningConfig": { - "type": "object", - "description": "o-series reasoning settings.", - "properties": { - "effort": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/ReasoningEffort", - "description": "Constrain effort on reasoning." - } - ] - }, - "summary": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/ReasoningSummary", - "description": "Summary mode for reasoning." - } - ] - } - } - }, - "ReasoningEffort": { - "type": "string", - "enum": [ - "minimal", - "low", - "medium", - "high" - ] - }, - "ReasoningSummary": { - "type": "string", - "enum": [ - "auto", - "concise", - "detailed" - ] - }, - "ResponseFormat": { - "oneOf": [ - { - "type": "object", - "description": "The type of response format being defined: `text`", - "required": [ - "type" - ], - "properties": { - "type": { - "type": "string", - "enum": [ - "text" - ] - } - } - }, - { - "type": "object", - "description": "The type of response format being defined: `json_object`", - "required": [ - "type" - ], - "properties": { - "type": { - "type": "string", - "enum": [ - "json_object" - ] - } - } - }, - { - "type": "object", - "description": "The type of response format being defined: `json_schema`", - "required": [ - "json_schema", - "type" - ], - "properties": { - "json_schema": { - "$ref": "#/components/schemas/ResponseFormatJsonSchema" - }, - "type": { - "type": "string", - "enum": [ - "json_schema" - ] - } - } - } - ] - }, - "ResponseFormatJsonSchema": { - "type": "object", - "required": [ - "name" - ], - "properties": { - "description": { - "type": [ - "string", - "null" - ], - 
"description": "A description of what the response format is for, used by the model to determine how to respond in the format." - }, - "name": { - "type": "string", - "description": "The name of the response format. Must be a-z, A-Z, 0-9, or contain underscores and dashes, with a maximum length of 64." - }, - "schema": { - "description": "The schema for the response format, described as a JSON Schema object." - }, - "strict": { - "type": [ - "boolean", - "null" - ], - "description": "Whether to enable strict schema adherence when generating the output. If set to true, the model will always follow the exact schema defined in the `schema` field. Only a subset of JSON Schema is supported when `strict` is `true`. To learn more, read the [Structured Outputs guide](https://platform.openai.com/docs/guides/structured-outputs)." - } - } - }, - "ServiceTier": { - "type": "string", - "description": "Service tier request options.", - "enum": [ - "auto", - "default", - "flex" - ] - }, - "Stop": { - "oneOf": [ - { - "type": "string" - }, - { - "type": "array", - "items": { - "type": "string" - } - } - ] - }, - "TextConfig": { - "type": "object", - "description": "Configuration for text response format.", - "required": [ - "format" - ], - "properties": { - "format": { - "$ref": "#/components/schemas/TextResponseFormat", - "description": "Defines the format: plain text, JSON object, or JSON schema." 
- } - } - }, - "TextResponseFormat": { - "oneOf": [ - { - "type": "object", - "description": "The type of response format being defined: `text`", - "required": [ - "type" - ], - "properties": { - "type": { - "type": "string", - "enum": [ - "text" - ] - } - } - }, - { - "type": "object", - "description": "The type of response format being defined: `json_object`", - "required": [ - "type" - ], - "properties": { - "type": { - "type": "string", - "enum": [ - "json_object" - ] - } - } - }, - { - "allOf": [ - { - "$ref": "#/components/schemas/ResponseFormatJsonSchema", - "description": "The type of response format being defined: `json_schema`" - }, - { - "type": "object", - "required": [ - "type" - ], - "properties": { - "type": { - "type": "string", - "enum": [ - "json_schema" - ] - } - } - } - ], - "description": "The type of response format being defined: `json_schema`" - } - ] - }, - "Truncation": { - "type": "string", - "description": "Truncation strategies.", - "enum": [ - "auto", - "disabled" - ] - }, - "VideoUrl": { - "type": "object", - "required": [ - "url" - ], - "properties": { - "detail": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/ImageDetail", - "description": "Specifies the detail level of the video processing." - } - ] - }, - "url": { - "type": "string", - "format": "uri", - "description": "Either a URL of the video or the base64 encoded video data." - }, - "uuid": { - "type": [ - "string", - "null" - ], - "format": "uuid", - "description": "Optional unique identifier for the video." - } - } - }, - "WebSearchContextSize": { - "type": "string", - "description": "The amount of context window space to use for the search.", - "enum": [ - "low", - "medium", - "high" - ] - }, - "WebSearchLocation": { - "type": "object", - "description": "Approximate location parameters for the search.", - "properties": { - "city": { - "type": [ - "string", - "null" - ], - "description": "Free text input for the city of the user, e.g. 
`San Francisco`." - }, - "country": { - "type": [ - "string", - "null" - ], - "description": "The two-letter [ISO country code](https://en.wikipedia.org/wiki/ISO_3166-1) of the user, e.g. `US`." - }, - "region": { - "type": [ - "string", - "null" - ], - "description": "Free text input for the region of the user, e.g. `California`." - }, - "timezone": { - "type": [ - "string", - "null" - ], - "description": "The [IANA timezone](https://timeapi.io/documentation/iana-timezones) of the user, e.g. `America/Los_Angeles`." - } - } - }, - "WebSearchOptions": { - "type": "object", - "description": "Options for the web search tool.", - "properties": { - "search_context_size": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/WebSearchContextSize", - "description": "High level guidance for the amount of context window space to use for the search. One of `low`, `medium`, or `high`. `medium` is the default." - } - ] - }, - "user_location": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/WebSearchUserLocation", - "description": "Approximate location parameters for the search." 
- } - ] - } - } - }, - "WebSearchUserLocation": { - "type": "object", - "required": [ - "type", - "approximate" - ], - "properties": { - "approximate": { - "$ref": "#/components/schemas/WebSearchLocation" - }, - "type": { - "$ref": "#/components/schemas/WebSearchUserLocationType" - } - } - }, - "WebSearchUserLocationType": { - "type": "string", - "enum": [ - "approximate" - ] - } - } - } -} \ No newline at end of file diff --git a/docs/hidden_toctree.rst b/docs/hidden_toctree.rst index 1408966ce8c..1bd956a6123 100644 --- a/docs/hidden_toctree.rst +++ b/docs/hidden_toctree.rst @@ -41,8 +41,18 @@ agents/tool-calling.md development/jail_stream.md - router/router_examples.md - planner/load_planner.md + components/planner/README.md + components/planner/planner_guide.md + components/planner/planner_examples.md + components/kvbm/README.md + components/kvbm/kvbm_guide.md + components/router/README.md + components/router/router_guide.md + components/router/router_examples.md + components/frontend/frontend_guide.md + design_docs/kvbm_design.md + integrations/flexkv_integration.md + integrations/sglang_hicache.md fault_tolerance/README.md fault_tolerance/request_migration.md fault_tolerance/request_cancellation.md @@ -63,7 +73,6 @@ backends/sglang/gpt-oss.md backends/sglang/diffusion-lm.md backends/sglang/profiling.md - backends/sglang/sgl-hicache-example.md backends/sglang/sglang-disaggregation.md backends/sglang/prometheus.md @@ -79,7 +88,6 @@ backends/vllm/multi-node.md backends/vllm/prometheus.md backends/vllm/prompt-embeddings.md - backends/vllm/speculative_decoding.md features/speculative_decoding/README.md features/speculative_decoding/speculative_decoding_vllm.md @@ -88,15 +96,5 @@ mocker/mocker.md - multimodal/index.md - multimodal/vllm.md - multimodal/sglang.md - multimodal/trtllm.md - - frontends/kserve.md - _sections/frontends.rst - .. TODO: architecture/distributed_runtime.md and architecture/dynamo_flow.md have some outdated names/references and need a refresh. 
-.. TODO: Add an OpenAI frontend doc to complement the KServe GRPC doc - in the Frontends section. diff --git a/docs/index.rst b/docs/index.rst index 5fe9ac14f9c..147e1f24926 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -58,8 +58,8 @@ Quickstart :hidden: :caption: User Guides - KV Cache Offloading - KV Aware Routing + KV Cache Offloading + KV Aware Routing Tool Calling Multimodality Support LoRA Adapters @@ -76,11 +76,11 @@ Quickstart :caption: Components Backends <_sections/backends> - Frontends <_sections/frontends> - Router - Planner + Frontend + Router + Planner Profiler - KVBM + KVBM .. toctree:: :hidden: diff --git a/docs/integrations/kv_events_custom_engines.md b/docs/integrations/kv_events_custom_engines.md index 3b854a15c72..c88c8d56850 100644 --- a/docs/integrations/kv_events_custom_engines.md +++ b/docs/integrations/kv_events_custom_engines.md @@ -285,6 +285,6 @@ Each event in the payload is a dictionary with `type` field (`BlockStored`, `Blo ## See Also -- **[Router README](../router/README.md)**: Quick start guide for the KV Router -- **[Router Guide](../router/router_guide.md)**: Configuration, tuning, and production setup +- **[Router README](../components/router/README.md)**: Quick start guide for the KV Router +- **[Router Guide](../components/router/router_guide.md)**: Configuration, tuning, and production setup - **[Router Design](../design_docs/router_design.md)**: Architecture details and event transport modes diff --git a/docs/kubernetes/README.md b/docs/kubernetes/README.md index 8f2c3913157..76c7348394c 100644 --- a/docs/kubernetes/README.md +++ b/docs/kubernetes/README.md @@ -117,7 +117,7 @@ kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE} curl http://localhost:8000/v1/models ``` -For SLA-based autoscaling, see [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md). +For SLA-based autoscaling, see [SLA Planner Guide](/docs/components/planner/planner_guide.md). 
## Understanding Dynamo's Custom Resources

diff --git a/docs/kubernetes/autoscaling.md b/docs/kubernetes/autoscaling.md
index ef9a257201f..b3e6c2a1e76 100644
--- a/docs/kubernetes/autoscaling.md
+++ b/docs/kubernetes/autoscaling.md
@@ -163,14 +163,14 @@ Planner is deployed as a service component within your DGD. It:

 **Deployment:**

-The recommended way to deploy Planner is via `DynamoGraphDeploymentRequest` (DGDR). See the [SLA Planner Quick Start](../planner/sla_planner_quickstart.md) for complete instructions.
+The recommended way to deploy Planner is via `DynamoGraphDeploymentRequest` (DGDR). See the [SLA Planner Guide](../components/planner/planner_guide.md) for complete instructions.

 Example configurations with Planner:

 - `examples/backends/vllm/deploy/disagg_planner.yaml`
 - `examples/backends/sglang/deploy/disagg_planner.yaml`
 - `examples/backends/trtllm/deploy/disagg_planner.yaml`

-For more details, see the [SLA Planner documentation](../planner/sla_planner.md).
+For more details, see the [SLA Planner documentation](../components/planner/planner_guide.md).

 ## Autoscaling with Kubernetes HPA

@@ -725,7 +725,7 @@ If you see unstable scaling:

 - [Kubernetes HPA Documentation](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)
 - [KEDA Documentation](https://keda.sh/)
 - [Prometheus Adapter](https://github.com/kubernetes-sigs/prometheus-adapter)
-- [Planner Documentation](../planner/sla_planner.md)
+- [Planner Documentation](../components/planner/planner_guide.md)
 - [Dynamo Metrics Reference](../observability/metrics.md)
 - [Prometheus and Grafana Setup](../observability/prometheus-grafana.md)

diff --git a/docs/kubernetes/installation_guide.md b/docs/kubernetes/installation_guide.md
index 73db2adc823..fd8078ad18a 100644
--- a/docs/kubernetes/installation_guide.md
+++ b/docs/kubernetes/installation_guide.md
@@ -292,7 +292,7 @@ kubectl get pods -n ${NAMESPACE}

 3.
**Optional:** - [Set up Prometheus & Grafana](./observability/metrics.md) - - [SLA Planner Quickstart Guide](../planner/sla_planner_quickstart.md) (for SLA-aware scheduling and autoscaling) + - [SLA Planner Guide](../components/planner/planner_guide.md) (for SLA-aware scheduling and autoscaling) ## Troubleshooting diff --git a/docs/kvbm/kvbm_intro.rst b/docs/kvbm/kvbm_intro.rst deleted file mode 100644 index 6dd7acd4774..00000000000 --- a/docs/kvbm/kvbm_intro.rst +++ /dev/null @@ -1,67 +0,0 @@ -.. - SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. - SPDX-License-Identifier: Apache-2.0 - - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. - -KV Block Manager -================ -The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer for frameworks like vLLM, SGLang, and TRT-LLM. - -It offers: - -* A **unified memory API** that spans GPU memory (future), pinned host memory, remote RDMA-accessible memory, local or distributed pool of SSDs and remote file/object/cloud storage systems. -* Support for evolving **block lifecycles** (allocate → register → match) with event-based state transitions that storage can subscribe to. 
-* Integration with **NIXL**, a dynamic memory exchange layer used for remote registration, sharing, and access of memory blocks over RDMA/NVLink. - -The Dynamo KV Block Manager serves as a reference implementation that emphasizes modularity and extensibility. Its pluggable design enables developers to customize components and optimize for specific performance, memory, and deployment needs. - -.. list-table:: - :widths: 20 5 75 - :header-rows: 1 - - * - - - - - Feature - * - **Backend** - - ✅ - - Local - * - - - ✅ - - Kubernetes - * - **LLM Framework** - - ✅ - - vLLM - * - - - ✅ - - TensorRT-LLM - * - - - ❌ - - SGLang - * - **Serving Type** - - ✅ - - Aggregated - * - - - ✅ - - Disaggregated - -.. toctree:: - :hidden: - - Overview - Quick Start - User Guide - Design - LMCache Integration <../integrations/lmcache_integration.md> - FlexKV Integration <../integrations/flexkv_integration.md> - SGLang HiCache <../integrations/sglang_hicache.md> \ No newline at end of file diff --git a/docs/multimodal/index.md b/docs/multimodal/index.md deleted file mode 100644 index 4bc9799745a..00000000000 --- a/docs/multimodal/index.md +++ /dev/null @@ -1,218 +0,0 @@ - - -> [!NOTE] -> **This content has moved.** The canonical location for this documentation is now -> [docs/features/multimodal/](../features/multimodal/README.md). -> This file will be removed in a future release. - -# Multimodal Inference in Dynamo - -Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text. This section provides comprehensive documentation for deploying multimodal models. - -> [!IMPORTANT] -> **Security Requirement**: Multimodal processing must be explicitly enabled at startup. -> See the relevant documentation for each backend for the necessary flags. -> -> This prevents unintended processing of multimodal data from untrusted sources. 
- -## Backend Documentation - -```{toctree} -:maxdepth: 1 - -vLLM Multimodal -TensorRT-LLM Multimodal -SGLang Multimodal -``` - -## Support Matrix - -### Backend Capabilities - -| Stack | E/PD | E/P/D | EP/D | EPD | Image | Video | Audio | -|-------|------|-------|------|-----|-------|-------|-------| -| **[vLLM](vllm.md)** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🧪 | -| **[TRT-LLM](trtllm.md)** | ❌ | 🚧* | ✅ | ✅ | ✅ | ❌ | ❌ | -| **[SGLang](sglang.md)** | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | - -\* E/P/D supported in TRT-LLM with pre-computed embeddings only; image URL support is WIP ([PR #4668](https://github.com/ai-dynamo/dynamo/pull/4668)) - -**Pattern Key:** - -- **EPD** - All-in-one worker (Simple Aggregated) -- **E/PD** - Separate encode, combined prefill+decode -- **E/P/D** - All stages separate -- **EP/D** - Combined encode+prefill, separate decode - -**Status:** ✅ Supported | 🚧 WIP | 🧪 Experimental | ❌ Not supported - -### Input Format Support - -| Format | vLLM | TRT-LLM | SGLang | -|--------|------|---------|--------| -| HTTP/HTTPS URL | ✅ | ✅ | ✅ | -| Data URL (Base64) | ✅ | ❌ | ❌ | -| Pre-computed Embeddings (.pt) | ❌ | ✅ | ❌ | - -## Architecture Patterns - -Dynamo supports several deployment patterns for multimodal inference based on two dimensions: - -1. **Encoding**: Is media encoding handled inline (within prefill) or by a separate **Encode Worker**? - - *Inline*: Simpler setup, encoding happens in the prefill worker - - *Separate (EPD)*: Dedicated encode worker transfers embeddings via **NIXL (RDMA)**, enabling independent scaling - -2. **Prefill/Decode**: Are prefill and decode in the same worker or separate? - - *Aggregated*: Single worker handles both prefill and decode - - *Disaggregated*: Separate workers for prefill and decode, with KV cache transfer between them - -These combine into four deployment patterns: - -### EPD - Simple Aggregated - -All processing happens within a single worker - the simplest setup. 
- -```text -HTTP Frontend (Rust) - ↓ -Worker (Python) - ↓ image load + encode + prefill + decode -Response -``` - -| Component | Purpose | -|-----------|---------| -| Frontend (Rust) | HTTP entry point, tokenization, image URL preprocessing | -| Worker | Complete inference pipeline (encode + prefill + decode) | - -**When to use:** Quick setup, smaller models, development/testing. - -### E/PD - Encode Separate - -Encoding happens in a separate worker; prefill and decode share the same engine. - -```text -HTTP Frontend (Rust) - ↓ -Processor (Python) - ↓ tokenizes, extracts media URL -Encode Worker (Python) - ↓ downloads media, generates embeddings, NIXL transfer -PD Worker (Python) - ↓ receives embeddings via NIXL, prefill + decode -Response -``` - -| Component | Purpose | -|-----------|---------| -| Frontend (Rust) | HTTP entry point | -| Processor (Python) | Tokenization, extracts media URLs | -| Encode Worker | Media encoding, embeddings generation | -| PD Worker | Prefill + Decode with embeddings | - -**When to use:** Offload vision encoding to separate GPU, scale encode workers independently. - -### E/P/D - Full Disaggregation - -Full disaggregation with separate workers for encoding, prefill, and decode. 
-There are two variants of this workflow: -- Prefill-first, used by vLLM -- Decode-first, used by SGlang - -Prefill-first: - -```text -HTTP Frontend (Rust) - ↓ -Processor (Python) - ↓ tokenizes, extracts media URL -Encode Worker (Python) - ↓ downloads media, generates embeddings, NIXL transfer -Prefill Worker (Python) - ↓ receives embeddings via NIXL, prefill only, KV cache transfer -Decode Worker (Python) - ↓ decode only, token generation -Response -``` - -OR - -Decode-first: - -```text -HTTP Frontend (Rust) - ↓ -Processor (Python) - ↓ tokenizes, extracts media URL -Encode Worker (Python) - ↓ downloads media, generates embeddings, NIXL transfer -Decode Worker (Python) - ↓ Bootstraps prefill worker -Prefill Worker (Python) - ↓ receives embeddings via NIXL, prefill only, KV cache transfer -Decode Worker (Python) - ↓ decode only, token generation -Response -``` - -| Component | Purpose | -|-----------|---------| -| Frontend (Rust) | HTTP entry point | -| Processor (Python) | Tokenization, extracts media URLs | -| Encode Worker | Media encoding, embeddings generation | -| Prefill Worker | Prefill only, transfers KV cache | -| Decode Worker | Decode only, token generation | - -**When to use:** Maximum optimization, multi-node deployment, independent scaling of each phase. - -### EP/D - Traditional Disaggregated - -Encoding is combined with prefill, with decode separate. 
- -```text -HTTP Frontend (Rust) - ↓ -Processor (Python) - ↓ tokenizes, extracts media URL -Encode+Prefill Worker (Python) - ↓ downloads media, encodes inline, prefill, KV cache transfer -Decode Worker (Python) - ↓ decode only, token generation -Response -``` - -| Component | Purpose | -|-----------|---------| -| Frontend (Rust) | HTTP entry point | -| Processor (Python) | Tokenization, extracts media URLs (vLLM only) | -| Encode+Prefill Worker | Combined encoding and prefill | -| Decode Worker | Decode only, token generation | - -> **Note:** TRT-LLM's EP/D mode skips the Python Processor - the Rust frontend handles tokenization and routes directly to the Prefill worker. -> For multimodal requests, the Python prefill worker still re-tokenizes/builds inputs; Rust token_ids are ignored. - -**When to use:** Models without pre-computed embedding support (Llama 4), or TRT-LLM disaggregated deployment. - -## Example Workflows - -You can find example workflows and reference implementations for deploying multimodal models in: - -- [vLLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/launch) -- [TRT-LLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/launch) -- [SGLang multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/launch) -- [Advanced multimodal examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/launch) (video, audio) diff --git a/docs/multimodal/sglang.md b/docs/multimodal/sglang.md deleted file mode 100644 index 93ad0741c55..00000000000 --- a/docs/multimodal/sglang.md +++ /dev/null @@ -1,433 +0,0 @@ - - -# SGLang Multimodal - -This document provides a comprehensive guide for multimodal inference using SGLang backend in Dynamo. 
SGLang multimodal supports **EPD**, **E/PD**, and **E/P/D** flows, with NIXL (RDMA) for zero-copy tensor transfer in disaggregated modes. - -## Support Matrix - -| Modality | Input Format | Aggregated | Disaggregated | Notes | -|----------|--------------|------------|---------------|-------| -| **Image** | HTTP/HTTPS URL | Yes | Yes | Vision encoder generates embeddings | -| **Image** | Data URL (Base64) | No | No | | -| **Video** | HTTP/HTTPS URL | No | No | | -| **Audio** | HTTP/HTTPS URL | No | No | | - -### Supported URL Formats - -| Format | Example | Description | -|--------|---------|-------------| -| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files | - -## Deployment Patterns - -SGLang supports EPD, E/PD, and E/P/D patterns. See [Multimodal Architecture Patterns](index.md#architecture-patterns) for detailed explanations. - -| Pattern | Supported | Launch Script | Notes | -|---------|-----------|---------------|-------| -| EPD (Simple Aggregated) | ✅ | `agg.sh` | Internal encoding | -| E/PD (Encode Separate) | ✅ | `multimodal_epd.sh` | Vision encoder separate | -| E/P/D (Full Disaggregation) | ✅ | `multimodal_disagg.sh` | KV cache via bootstrap | -| EP/D (Traditional Disaggregated) | ❌ | N/A | Not supported | - -### Component Flags - -| Component | Flag | Purpose | -|-----------|------|---------| -| Processor | `--multimodal-processor` | HTTP entry, OpenAI→SGLang conversion | -| Encode Worker | `--multimodal-encode-worker` | Vision encoder, embeddings generation | -| PD Worker | `--multimodal-worker` | Prefill + Decode with embeddings | -| Decode Worker | `--multimodal-worker --serving-mode=decode` | Entry point for disaggregation | -| Prefill Worker | `--multimodal-worker --serving-mode=prefill` | Called by Decode, bootstrap coordination | - -### SGLang-Specific Characteristics - -- **Vision Encoder in Python**: Encode worker loads vision model (AutoModel) and image processor (AutoImageProcessor) -- **Token Expansion**: Single 
`<|image_pad|>` token replaced with N tokens based on embedding shape -- **NIXL Transfer**: Embeddings transferred from Encoder → PD Worker using NIXL -- **No Rust Processing**: All tokenization and image handling happens in Python - -## Use the Latest Release - -We recommend using the latest stable release of dynamo to avoid breaking changes: - -[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest) - -You can find the [latest release](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with: - -```bash -git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) -``` - -## EPD Serving (Simple Aggregated) - -### Components - -- worker: [DecodeWorkerHandler](../../components/src/dynamo/sglang/request_handlers/llm/decode_handler.py) handles encoding, prefilling, and decoding in a single process. - -### Workflow - -The `DecodeWorkerHandler` receives multimodal requests with image URLs and passes them directly to SGLang's engine. SGLang's internal `mm_data_processor` handles image fetching, loading, encoding, and token expansion. - -```mermaid -flowchart LR - HTTP --> worker - worker --tokenized text + image_urls--> SGLang[SGLang Engine] -``` - -### Launch - -```bash -cd $DYNAMO_HOME/examples/backends/sglang -./launch/agg.sh --model Qwen/Qwen2.5-VL-7B-Instruct --chat-template qwen2-vl -``` - -**Client:** - -```bash -curl http://localhost:8000/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "Qwen/Qwen2.5-VL-7B-Instruct", - "messages": [ - { - "role": "user", - "content": [ - { - "type": "text", - "text": "Describe the image." 
- }, - { - "type": "image_url", - "image_url": { - "url": "http://images.cocodataset.org/test2017/000000155781.jpg" - } - } - ] - } - ], - "max_tokens": 50, - "stream": false - }' | jq -``` - -## E/PD Serving (Encode Separate) - -### Components - -- workers: - - [MultimodalEncodeWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py) for encoding - - [MultimodalWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for prefilling and decoding. -- processor: [MultimodalProcessorHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py) - - tokenizes the prompt using the chat template - - passes the text and image url to the MultimodalEncodeWorker. - -### Workflow - -The `MultimodalEncodeWorker` downloads and encodes the image and passes the embeddings to the MultimodalWorker. The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface. The `MultimodalWorker` then prefills and decodes the prompt in the same engine, as in the [LLM aggregated serving](../backends/sglang/README.md) example. Only the processor is registered to the Dynamo frontend as an available endpoint. Workers do NOT register - they are internal components and communicate via NATS. - -```mermaid -flowchart LR - HTTP --> processor - processor --tokenized request + image_url--> encode_worker - encode_worker --request + embeddings--> worker - - worker -.-> encode_worker - encode_worker -.-> processor - processor -.-> HTTP -``` - - -### Launch - -```bash -cd $DYNAMO_HOME/examples/backends/sglang -./launch/multimodal_epd.sh -``` - -**Client:** - -```bash -curl http://localhost:8000/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "Qwen/Qwen2.5-VL-7B-Instruct", - "messages": [ - { - "role": "user", - "content": [ - { - "type": "text", - "text": "Describe the image." 
- }, - { - "type": "image_url", - "image_url": { - "url": "http://images.cocodataset.org/test2017/000000155781.jpg" - } - } - ] - } - ], - "max_tokens": 50, - "stream": false - }' | jq -``` - -## E/P/D Serving (Full Disaggregation) - -### Components - -- workers: - - [MultimodalEncodeWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py) for encoding - - [MultimodalWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for decoding - - [MultimodalPrefillWorkerHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py) for prefilling -- processor: [MultimodalProcessorHandler](../../components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py) tokenizes the prompt and passes it to the MultimodalEncodeWorker. - -### Workflow - -In models like Qwen2.5-VL, embeddings are only required during the prefill stage. The image embeddings are transferred via NIXL from the Encode Worker to the Decode Worker (the entry point for disaggregation), which then coordinates with the Prefill Worker. The Prefill Worker processes the embeddings and forwards the KV cache back to the Decode Worker for token generation. - -```mermaid -flowchart LR - HTTP --> processor - processor --tokenized request + image_url--> encode_worker - encode_worker --request + embeddings--> worker - worker --request + embeddings--> prefill_worker - - prefill_worker --KV Cache--> worker - encode_worker -.-> processor - worker -.-> encode_worker - processor -.-> HTTP -``` - -### Launch - -```bash -cd $DYNAMO_HOME/examples/backends/sglang -./launch/multimodal_disagg.sh -``` - -**Client:** - -```bash -curl http://localhost:8000/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "Qwen/Qwen2.5-VL-7B-Instruct", - "messages": [ - { - "role": "user", - "content": [ - { - "type": "text", - "text": "Describe the image." 
- }, - { - "type": "image_url", - "image_url": { - "url": "http://images.cocodataset.org/test2017/000000155781.jpg" - } - } - ] - } - ], - "max_tokens": 50, - "stream": false - }' | jq -``` - -## Bootstrap Coordination - -SGLang disaggregation uses a bootstrap mechanism for P->D coordination: - -### Request Flow (Important) - -```text -Client → Frontend → Processor → Encode → DECODE Worker → Prefill Worker - ↑ - Entry point for disaggregation! -``` - -### Bootstrap Process - -1. **Decode Worker** receives request from Encode Worker -2. **Decode Worker** calls Prefill Worker via NATS to request bootstrap info -3. **Prefill Worker** generates `{host, port, room}` and returns immediately -4. **Both workers** connect to same "room" using bootstrap coordinates -5. **SGLang internally** transfers KV cache state via bootstrap connection (not NIXL) - -### Key Difference from vLLM - -- vLLM: Frontend → Prefill → Decode (Prefill is entry point) -- SGLang: Frontend → Processor → Encode → **Decode → Prefill** (Decode is entry point) - -## Inter-Component Communication - -### Control Flow (NATS) - -All component-to-component communication happens via NATS: - -#### E/PD Mode (Encode Separate) - -```text -Processor → Encode Worker → PD Worker - (NATS) (NATS + NIXL embeddings) -``` - -#### E/P/D Mode (Full Disaggregation) - -```text -Processor → Encode Worker → DECODE Worker → Prefill Worker - (NATS) (NATS) (NATS) - ↓ - Decode requests bootstrap - ↓ - Prefill returns {host, port, room} - ↓ - Both connect via bootstrap - ↓ - SGLang internal KV cache transfer -``` - -### Detailed Message Flow - -```text -Processor → Encode Worker: - - NATS round_robin with SglangMultimodalRequest - - Contains: tokenized input_ids, image URL, sampling params - -Encode Worker → Decode/PD Worker: - - NATS round_robin to "backend" component - - Contains: expanded token_ids, NIXL metadata, embeddings shape - - NIXL transfer: embeddings tensor - -Decode Worker → Prefill Worker (disagg only): - - NATS call 
to "prefill" component - - Decode requests bootstrap coordinates - - Prefill returns: {bootstrap_host, bootstrap_port, bootstrap_room} - -Prefill ↔ Decode (via bootstrap): - - SGLang internal connection (not NATS) - - KV cache state shared via bootstrap mechanism -``` - -### Data Transfer (NIXL) - -NIXL is used only for embedding transfer: - -```python -# Encode Worker -descriptor = connect.Descriptor(precomputed_embeddings) -with connector.create_readable(descriptor) as readable: - request.serialized_request = readable.metadata() - await pd_worker_client.round_robin(request) - await readable.wait_for_completion() - -# PD Worker -embeddings = torch.empty(request.embeddings_shape, dtype=torch.float16) -descriptor = connect.Descriptor(embeddings) -read_op = await connector.begin_read(request.serialized_request, descriptor) -await read_op.wait_for_completion() -``` - -## Vision Encoding Details - -### Encode Worker Components - -The encode worker loads and runs the vision model in Python: - -```python -self.image_processor = AutoImageProcessor.from_pretrained( - model_path, trust_remote_code=True -) -self.vision_model = AutoModel.from_pretrained( - model_path, - device_map="auto", - torch_dtype=torch.float16, - trust_remote_code=True -) -``` - -### Token Expansion Process - -1. Processor inserts single image token (e.g., `<|image_pad|>`) -2. Encode worker generates embeddings: `shape = (batch, num_patches, hidden_dim)` -3. Encode worker replaces single token with `num_patches` tokens -4. 
Downstream worker receives expanded token sequence - -Example: - -```python -# Before: ["Hello", "<|image_pad|>", "world"] -# After: ["Hello", "<|image_pad|>", "<|image_pad|>", ...(576 tokens), "world"] -``` - -## Chat Template Processing - -SGLang uses its own chat template system: - -```python -from sglang.srt.parser.conversation import chat_templates - -conv = chat_templates["qwen2-vl"].copy() -conv.append_message(conv.roles[0], f"{conv.image_token} Describe this image") -processed = tokenizer(text=conv.get_prompt(), return_tensors="pt") -``` - -Supported templates: `qwen2-vl`, `llama-3`, `vicuna`, etc. - -## NIXL Usage - -| Use Case | NIXL Used? | Data Transfer | Notes | -|----------|------------|---------------|-------| -| EPD (Simple Aggregated) | No | N/A | All processing internal to SGLang | -| E/PD (Encode Separate) | Yes | Encoder → PD (embeddings) | Vision encoder separate | -| E/P/D (Full Disaggregation) | Yes | Encoder → Prefill (embeddings) | KV cache via SGLang bootstrap | - -**Key Difference:** SGLang P/D uses bootstrap mechanism, not NIXL for KV cache like vLLM. 
- -## Known Limitations - -- **No Data URL support** - Only HTTP/HTTPS URLs supported; `data:image/...` base64 URLs not supported -- **No pre-computed embeddings** - Cannot use `.pt`, `.pth`, `.bin` embedding files; vision encoder runs for every request -- **No video support** - No video encoder implementation -- **No audio support** - No audio encoder implementation -- **Only Processor registers with Dynamo** - Workers are internal components, frontend routes to Processor only -- **Disaggregated routing** - Decode Worker is the entry point (calls Prefill), cannot route directly to Prefill workers -- **Limited model generalization** - Token expansion logic is model-specific; adding new models may require implementation updates - -## Supported Models - -SGLang multimodal **only supports image-based vision-language models**: - -- **Qwen2-VL** / **Qwen2.5-VL** (primary support) -- Models with `AutoImageProcessor` and vision tower -- Models compatible with SGLang's image embedding format - -## Key Files - -| File | Description | -|------|-------------| -| `components/src/dynamo/sglang/main.py` | Component initialization, only Processor registers | -| `components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py` | Processor implementation, OpenAI→SGLang | -| `components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py` | Vision encoder, embeddings generation | -| `components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py` | PD/Prefill/Decode workers, NIXL read | -| `components/src/dynamo/sglang/multimodal_utils/multimodal_chat_processor.py` | Chat template processing | -| `components/src/dynamo/sglang/protocol.py` | Request/response data structures | -| `components/src/dynamo/sglang/register.py` | Registration logic (only called for Processor) | diff --git a/docs/multimodal/trtllm.md b/docs/multimodal/trtllm.md deleted file mode 100644 index d5bbc0159dc..00000000000 --- a/docs/multimodal/trtllm.md +++ /dev/null @@ 
-1,476 +0,0 @@ - - -# TensorRT-LLM Multimodal - -This document provides a comprehensive guide for multimodal inference using TensorRT-LLM backend in Dynamo. - -You can provide multimodal inputs in the following ways: -- By sending image URLs -- By providing paths to pre-computed embedding files - -> **Note:** You should provide **either image URLs or embedding file paths** in a single request. - -## Support Matrix - -| Modality | Input Format | Aggregated | Disaggregated | Notes | -|----------|--------------|------------|---------------|-------| -| **Image** | HTTP/HTTPS URL | Yes | Yes | Full support for all image models | -| **Image** | Pre-computed Embeddings (.pt, .pth, .bin) | Yes | Yes | Direct embedding files | -| **Video** | HTTP/HTTPS URL | No | No | Not implemented | -| **Audio** | HTTP/HTTPS URL | No | No | Not implemented | - -### Supported URL Formats - -| Format | Example | Description | -|--------|---------|-------------| -| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files | -| **Pre-computed Embeddings** | `/path/to/embedding.pt` | Local embedding files (.pt, .pth, .bin) | - -## Deployment Patterns - -TRT-LLM supports aggregated and traditional disaggregated patterns. See [Architecture Patterns](index.md#architecture-patterns) for detailed explanations. 
- -| Pattern | Supported | Launch Script | Notes | -|---------|-----------|---------------|-------| -| Aggregated | ✅ | `agg.sh` | Easiest setup, single worker | -| EP/D (Traditional Disaggregated) | ✅ | `disagg_multimodal.sh` | Prefill handles encoding, 2 workers | -| E/P/D (Full - Image URLs) | ✅ | `epd_multimodal_image_and_embeddings.sh` | Standalone encoder with `MultimodalEncoder`, 3 workers | -| E/P/D (Full - Pre-computed Embeddings) | ✅ | `epd_multimodal_image_and_embeddings.sh` | Standalone encoder with NIXL transfer, 3 workers | -| E/P/D (Large Models) | ✅ | `epd_disagg.sh` | For Llama-4 Scout/Maverick, multi-node | - -### Component Flags - -| Component | Flag | Purpose | -|-----------|------|---------| -| Worker | `--modality multimodal` | Complete pipeline (aggregated) | -| Prefill Worker | `--disaggregation-mode prefill` | Image processing + Prefill (multimodal tokenization happens here) | -| Decode Worker | `--disaggregation-mode decode` | Decode only | -| Encode Worker | `--disaggregation-mode encode` | Image encoding (E/P/D flow) | - -## Aggregated Serving - -Quick steps to launch Llama-4 Maverick BF16 in aggregated mode: - -```bash -cd $DYNAMO_HOME - -export AGG_ENGINE_ARGS=./examples/backends/trtllm/engine_configs/llama4/multimodal/agg.yaml -export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct" -export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct" -./examples/backends/trtllm/launch/agg.sh -``` - -**Client:** -```bash -curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ - "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct", - "messages": [ - { - "role": "user", - "content": [ - { - "type": "text", - "text": "Describe the image" - }, - { - "type": "image_url", - "image_url": { - "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png" - } - } - ] - } - ], - "stream": false, - "max_tokens": 160 -}' -``` - -## Disaggregated Serving - 
-Example using `Qwen/Qwen2-VL-7B-Instruct`: - -```bash -cd $DYNAMO_HOME - -export MODEL_PATH="Qwen/Qwen2-VL-7B-Instruct" -export SERVED_MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct" -export PREFILL_ENGINE_ARGS="examples/backends/trtllm/engine_configs/qwen2-vl-7b-instruct/prefill.yaml" -export DECODE_ENGINE_ARGS="examples/backends/trtllm/engine_configs/qwen2-vl-7b-instruct/decode.yaml" -export MODALITY="multimodal" - -./examples/backends/trtllm/launch/disagg.sh -``` - -```bash -curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ - "model": "Qwen/Qwen2-VL-7B-Instruct", - "messages": [ - { - "role": "user", - "content": [ - { - "type": "text", - "text": "Describe the image" - }, - { - "type": "image_url", - "image_url": { - "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png" - } - } - ] - } - ], - "stream": false, - "max_tokens": 160 -}' -``` - -For a large model like `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, a multi-node setup is required for disaggregated serving (see [Multi-node Deployment](#multi-node-deployment-slurm) below), while aggregated serving can run on a single node. This is because the model with a disaggregated configuration is too large to fit on a single node's GPUs. For instance, running this model in disaggregated mode requires 2 nodes with 8xH200 GPUs or 4 nodes with 4xGB200 GPUs. - -## Full E/P/D Flow (Image URLs) - -For high-performance multimodal inference, Dynamo supports a standalone encoder with an **Encode-Prefill-Decode (E/P/D)** flow using TRT-LLM's `MultimodalEncoder`. This separates the vision encoding stage from prefill and decode, enabling better GPU utilization and scalability. 
- -### Supported Input Formats - -| Format | Example | Description | -|--------|---------|-------------| -| **HTTP/HTTPS URL** | `https://example.com/image.jpg` | Remote image files | -| **Base64 Data URL** | `data:image/jpeg;base64,...` | Inline base64-encoded images | - -### How It Works - -In the full E/P/D flow: - -1. **Encode Worker**: Runs TRT-LLM's `MultimodalEncoder.generate()` to process image URLs through the vision encoder and projector -2. **Prefill Worker**: Receives `disaggregated_params` containing multimodal embedding handles, processes context and generates KV cache -3. **Decode Worker**: Performs streaming token generation using the KV cache - -The encode worker uses TRT-LLM's `MultimodalEncoder` class (which inherits from `BaseLLM`) and only requires the model path and batch size - no KV cache configuration is needed since it only runs the vision encoder + projector. - -### How to Launch - -```bash -cd $DYNAMO_HOME - -# Launch 3-worker E/P/D flow with image URL support -./examples/backends/trtllm/launch/epd_multimodal_image_and_embeddings.sh -``` - -### Example Request - -```bash -curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ - "model": "llava-v1.6-mistral-7b-hf", - "messages": [ - { - "role": "user", - "content": [ - {"type": "text", "text": "Describe the image"}, - { - "type": "image_url", - "image_url": { - "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png" - } - } - ] - } - ], - "max_tokens": 160 -}' -``` - -### E/P/D Architecture (Image URLs) - -```mermaid -sequenceDiagram - participant Client - participant Frontend - participant PrefillWorker as "Prefill Worker" - participant EncodeWorker as "Encode Worker" - participant DecodeWorker as "Decode Worker" - - Client->>Frontend: POST /v1/chat/completions (image URL) - Frontend->>PrefillWorker: Route to prefill worker - PrefillWorker->>EncodeWorker: Send request (image URL) - Note over EncodeWorker: MultimodalEncoder.generate()<br/>runs vision encoder + projector - EncodeWorker->>PrefillWorker: Return disaggregated_params<br/>(multimodal_embedding_handles) - Note over PrefillWorker: Process context with embeddings<br/>Generate KV cache - PrefillWorker->>Frontend: Return prefill response - Frontend->>DecodeWorker: Route to decode worker - DecodeWorker->>Frontend: Stream response chunks - Frontend->>Client: Stream response -``` - -### Key Differences from EP/D (Traditional Disaggregated) - -| Aspect | EP/D (Traditional) | E/P/D (Full) | -|--------|-------------------|--------------| -| **Encoding** | Prefill worker handles image encoding | Dedicated encode worker | -| **Prefill Load** | Higher (encoding + prefill) | Lower (prefill only) | -| **Use Case** | Simpler setup | Better scalability for vision-heavy workloads | -| **Launch Script** | `disagg_multimodal.sh` | `epd_multimodal_image_and_embeddings.sh` | - -## Pre-computed Embeddings with E/P/D Flow - -For high-performance multimodal inference, Dynamo supports pre-computed embeddings with an **Encode-Prefill-Decode (E/P/D)** flow using **NIXL (RDMA)** for zero-copy tensor transfer. - -### Supported File Types - -- `.pt` - PyTorch tensor files -- `.pth` - PyTorch checkpoint files -- `.bin` - Binary tensor files - -### Embedding File Formats - -TRT-LLM supports two formats for embedding files: - -**1. Simple Tensor Format** - -Direct tensor saved as `.pt` file containing only the embedding tensor: - -```python -embedding_tensor = torch.rand(1, 576, 4096) # [batch, seq_len, hidden_dim] -torch.save(embedding_tensor, "embedding.pt") -``` - -**2. Dictionary Format with Auxiliary Data** - -Dictionary containing multiple keys, used by models like Llama-4 that require additional metadata: - -```python -embedding_dict = { - "mm_embeddings": torch.rand(1, 576, 4096), - "special_tokens": [128256, 128257], - "image_token_offsets": [[0, 576]], - # ...
other model-specific metadata -} -torch.save(embedding_dict, "llama4_embedding.pt") -``` - -- **Simple tensors**: Loaded directly and passed to `mm_embeddings` parameter -- **Dictionary format**: `mm_embeddings` key extracted as main tensor, other keys preserved as auxiliary data - -### How to Launch - -```bash -cd $DYNAMO_HOME/examples/backends/trtllm - -# Launch 3-worker E/P/D flow with NIXL -./launch/epd_disagg.sh -``` - -> **Note:** This script is designed for 8-node H200 with `Llama-4-Scout-17B-16E-Instruct` model and assumes you have a model-specific embedding file ready. - -### Configuration - -```bash -# Encode endpoint for Prefill → Encode communication -export ENCODE_ENDPOINT="dyn://dynamo.tensorrt_llm_encode.generate" - -# Security: Allowed directory for embedding files (default: /tmp) -export ALLOWED_LOCAL_MEDIA_PATH="/tmp" - -# Security: Max file size to prevent DoS attacks (default: 50MB) -export MAX_FILE_SIZE_MB=50 -``` - -### Example Request with Pre-computed Embeddings - -```bash -curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ - "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct", - "messages": [ - { - "role": "user", - "content": [ - {"type": "text", "text": "Describe the image"}, - {"type": "image_url", "image_url": {"url": "/path/to/embedding.pt"}} - ] - } - ], - "max_tokens": 160 -}' -``` - -### E/P/D Architecture - -The E/P/D flow implements a **3-worker architecture**: - -- **Encode Worker**: Loads pre-computed embeddings, transfers via NIXL -- **Prefill Worker**: Receives embeddings, handles context processing and KV-cache generation -- **Decode Worker**: Performs streaming token generation - -```mermaid -sequenceDiagram - participant Client - participant Frontend - participant PrefillWorker as "Prefill Worker" - participant EncodeWorker as "Encode Worker" - participant DecodeWorker as "Decode Worker" - participant NIXL as "NIXL (RDMA)" - - Client->>Frontend: POST /v1/chat/completions - 
Frontend->>PrefillWorker: Route to prefill worker - PrefillWorker->>EncodeWorker: Send request (embedding paths) - EncodeWorker->>NIXL: Create readable operation - EncodeWorker->>PrefillWorker: Send metadata + NIXL info - PrefillWorker->>NIXL: Begin read operation - NIXL-->>PrefillWorker: Zero-copy transfer complete - PrefillWorker->>Frontend: Return prefill response - Frontend->>DecodeWorker: Route to decode worker - DecodeWorker->>Frontend: Stream response chunks - Frontend->>Client: Stream response -``` - -## Multi-node Deployment (Slurm) - -This section demonstrates how to deploy large multimodal models that require a multi-node setup using Slurm. - -> **Note:** The scripts referenced in this section can be found in [`examples/basics/multinode/trtllm/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/trtllm/). - -### Environment Setup - -Assuming you have allocated your nodes via `salloc` and are inside an interactive shell: - -```bash -# Container image (build using docs/backends/trtllm/README.md#build-container) -export IMAGE="" - -# Host:container path pairs for mounting -export MOUNTS="${PWD}/../../../../:/mnt" - -# Model configuration -export MODEL_PATH="meta-llama/Llama-4-Maverick-17B-128E-Instruct" -export SERVED_MODEL_NAME="meta-llama/Llama-4-Maverick-17B-128E-Instruct" -export MODALITY=${MODALITY:-"multimodal"} -``` - -### Multi-node Disaggregated Launch - -For 4 4xGB200 nodes (2 for prefill, 2 for decode): - -```bash -# Customize parallelism to match your engine configs -# export PREFILL_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/multimodal/prefill.yaml" -# export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/multimodal/decode.yaml" -# export NUM_PREFILL_NODES=2 -# export NUM_DECODE_NODES=2 -# export NUM_GPUS_PER_NODE=4 - -# Launches frontend + etcd/nats on head node, plus prefill and decode workers -./srun_disaggregated.sh -``` - -### Understanding the Output 
- -1. `srun_disaggregated.sh` launches three srun jobs: frontend, prefill worker, and decode worker -2. The OpenAI frontend will dynamically discover workers as they register: - ``` - INFO dynamo_run::input::http: Watching for remote model at models - INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000 - ``` -3. TRT-LLM workers output progress from each MPI rank while loading -4. When ready, the frontend logs: - ``` - INFO dynamo_llm::discovery::watcher: added model model_name="meta-llama/Llama-4-Maverick-17B-128E-Instruct" - ``` - -### Cleanup - -```bash -pkill srun -``` - -## NIXL Usage - -| Use Case | Script | NIXL Used? | Data Transfer | -|----------|--------|------------|---------------| -| Aggregated | `agg.sh` | No | All in one worker | -| EP/D (Traditional Disaggregated) | `disagg_multimodal.sh` | Optional | Prefill → Decode (KV cache via UCX or NIXL) | -| E/P/D (Image URLs) | `epd_multimodal_image_and_embeddings.sh` | No | Encoder → Prefill (handles via params), Prefill → Decode (KV cache) | -| E/P/D (Pre-computed Embeddings) | `epd_multimodal_image_and_embeddings.sh` | Yes | Encoder → Prefill (embeddings via NIXL RDMA) | -| E/P/D (Large Models) | `epd_disagg.sh` | Yes | Encoder → Prefill (embeddings via NIXL), Prefill → Decode (KV cache) | - -> **Note:** NIXL for KV cache transfer is currently beta and only supported on AMD64 (x86_64) architecture. - -## ModelInput Types and Registration - -TRT-LLM workers register with Dynamo using: - -| ModelInput Type | Preprocessing | Use Case | -|-----------------|---------------|----------| -| `ModelInput.Tokens` | Rust frontend may tokenize, but multimodal flows re-tokenize and build inputs in the Python worker; Rust token_ids are ignored | All TRT-LLM workers | - -```python -# TRT-LLM Worker - Register with Tokens -await register_llm( - ModelInput.Tokens, # Rust does minimal preprocessing - model_type, # ModelType.Chat or ModelType.Prefill - generate_endpoint, - model_name, - ... 
-) - -## Inter-Component Communication - -| Transfer Stage | Message | NIXL Transfer | -|----------------|---------|---------------| -| **Frontend → Prefill** | Request with image URL or embedding path | No | -| **Prefill → Encode (Image URL)** | Request with image URL | No | -| **Encode → Prefill (Image URL)** | `ep_disaggregated_params` with `multimodal_embedding_handles`, processed prompt, and token IDs | No | -| **Prefill → Encode (Embedding Path)** | Request with embedding file path | No | -| **Encode → Prefill (Embedding Path)** | NIXL readable metadata + shape/dtype + auxiliary data | Yes (Embeddings tensor via RDMA) | -| **Prefill → Decode** | `disaggregated_params` with `_epd_metadata` (prompt, token IDs) | Configurable (KV cache: NIXL default, UCX optional) | - -## Known Limitations - -- **No video support** - No video encoder implementation -- **No audio support** - No audio encoder implementation -- **Multimodal preprocessing/tokenization happens in Python** - Rust may forward token_ids, but multimodal requests are parsed and re-tokenized in the Python worker -- **Multi-node H100 limitation** - Loading `meta-llama/Llama-4-Maverick-17B-128E-Instruct` with 8 nodes of H100 with TP=16 is not possible due to head count divisibility (`num_attention_heads: 40` not divisible by `tp_size: 16`) -- **llava-v1.6-mistral-7b-hf model crash** - Known issue with TRT-LLM backend compatibility with `TensorRT LLM version: 1.2.0rc6.post1`. To use the Llava model, download revision `52320fb52229` locally from Hugging Face. -- **Embeddings file crash** - Known issue with TRT-LLM backend compatibility with `TensorRT LLM version: 1.2.0rc6.post1`. Embedding file parsing crashes in `attach_multimodal_embeddings()`. To be fixed in the next TRT-LLM upgrade. - -## Supported Models - -Multimodal models listed in [TensorRT-LLM supported models](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/models/supported-models.md) are supported by Dynamo.
- -Common examples: -- **Llama 4 Vision models** (Maverick, Scout) - Recommended for large-scale deployments -- **LLaVA models** (e.g., `llava-hf/llava-v1.6-mistral-7b-hf`) - Default model for E/P/D examples -- **Qwen2-VL models** - Supported in traditional disaggregated mode -- Other vision-language models with TRT-LLM support - -## Key Files - -| File | Description | -|------|-------------| -| `components/src/dynamo/trtllm/main.py` | Worker initialization and setup | -| `components/src/dynamo/trtllm/engine.py` | TensorRTLLMEngine wrapper (LLM and MultimodalEncoder) | -| `components/src/dynamo/trtllm/constants.py` | DisaggregationMode enum (AGGREGATED, PREFILL, DECODE, ENCODE) | -| `components/src/dynamo/trtllm/encode_helper.py` | Encode worker request processing (embedding-path and full EPD flows) | -| `components/src/dynamo/trtllm/multimodal_processor.py` | Multimodal request processing | -| `components/src/dynamo/trtllm/request_handlers/handlers.py` | Request handlers (EncodeHandler, PrefillHandler, DecodeHandler) | -| `components/src/dynamo/trtllm/request_handlers/handler_base.py` | Base handler with disaggregated params encoding/decoding | -| `components/src/dynamo/trtllm/utils/disagg_utils.py` | DisaggregatedParamsCodec for network transfer | -| `components/src/dynamo/trtllm/utils/trtllm_utils.py` | Command-line argument parsing | - diff --git a/docs/multimodal/vllm.md b/docs/multimodal/vllm.md deleted file mode 100644 index 76ac72614e4..00000000000 --- a/docs/multimodal/vllm.md +++ /dev/null @@ -1,522 +0,0 @@ - - -# vLLM Multimodal - -This document provides a comprehensive guide for multimodal inference using vLLM backend in Dynamo. - -> [!IMPORTANT] -> **Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. 
Workers will fail at startup if multimodal flags (e.g., `--multimodal-worker`, `--multimodal-processor`) are used without `--enable-multimodal`. -> This flag is analogous to `--enable-mm-embeds` in `vllm serve` but extends to all multimodal content (URLs, embeddings, base64). - -## Support Matrix - -| Modality | Input Format | Aggregated | Disaggregated | Notes | -|----------|--------------|------------|---------------|-------| -| **Image** | HTTP/HTTPS URL | Yes | Yes | Full support for all image models | -| **Image** | Data URL (Base64) | Yes | Yes | Inline base64-encoded images | -| **Video** | HTTP/HTTPS URL | Yes | Yes | Frame extraction and processing | -| **Audio** | HTTP/HTTPS URL | Yes | Yes | Experimental - requires audio dependencies | - -### Supported URL Formats - -| Format | Example | Description | -|--------|---------|-------------| -| **HTTP/HTTPS** | `http://example.com/image.jpg` | Remote media files | -| **Data URL** | `data:image/jpeg;base64,/9j/4AAQ...` | Base64-encoded inline data | - -## Deployment Patterns - -vLLM supports all multimodal deployment patterns. See [Architecture Patterns](index.md#architecture-patterns) for detailed explanations.
- -| Pattern | Supported | Launch Script | Notes | -|---------|-----------|---------------|-------| -| EPD (Simple Aggregated) | ✅ | `agg_multimodal.sh` | Easiest setup | -| E/PD (Encode Separate) | ✅ | `agg_multimodal_epd.sh` | Separate encode worker | -| E/P/D (Full Disaggregation) | ✅ | `disagg_multimodal_epd.sh` | All stages separate | -| EP/D (Traditional Disaggregated) | ✅ | `disagg_multimodal_llama.sh` | For Llama 4 models | -| E/PD (EC Connector) | ✅ | `agg_multimodal_ec_connector.sh` | vLLM-native encoder with ECConnector | - -### Component Flags - -| Component | Flag | Purpose | -|-----------|------|---------| -| Processor | `--multimodal-processor` | HTTP entry, tokenization | -| Encode Worker | `--multimodal-encode-worker` | Media encoding | -| PD Worker | `--multimodal-worker` | Prefill + Decode | -| Prefill Worker | `--multimodal-worker --is-prefill-worker` | Prefill only | -| Decode Worker | `--multimodal-decode-worker` | Decode only | -| Encode+Prefill Worker | `--multimodal-encode-prefill-worker --is-prefill-worker` | Combined (Llama 4) | -| vLLM Native Encoder | `--vllm-native-encoder-worker` | vLLM-native encoding with ECConnector | - -## Use the Latest Release - -We recommend using the latest stable release of dynamo to avoid breaking changes: - -[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest) - -You can find the [latest release](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with: - -```bash -git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) -``` - -## Image Serving - -### E/PD Serving (Encode Separate) - -**Components:** - -- workers: [EncodeWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding and [MultimodalPDWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for 
prefilling and decoding. -- processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler. -- frontend: HTTP endpoint to handle incoming requests. - -**Workflow:** - -The EncodeWorkerHandler encodes the image and passes the embeddings to the MultimodalPDWorkerHandler via NATS and RDMA. The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface. - -```mermaid -flowchart LR - HTTP --> processor - processor --> HTTP - processor --image_url--> encode_worker - encode_worker --> processor - encode_worker --embeddings--> pd_worker - pd_worker --> encode_worker -``` - -> **Note:** Aggregated serving supports LLaVA 1.5 7B and Qwen2.5-VL-7B-Instruct. Disaggregated serving is currently only confirmed for LLaVA. - -**Launch:** - -```bash -cd $DYNAMO_HOME/examples/backends/vllm -# Serve a LLaVA 1.5 7B model: -bash launch/agg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf -# Serve a Qwen2.5-VL model: -bash launch/agg_multimodal_epd.sh --model Qwen/Qwen2.5-VL-7B-Instruct -``` - -**Client:** - -```bash -curl http://localhost:8000/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "llava-hf/llava-1.5-7b-hf", - "messages": [ - { - "role": "user", - "content": [ - { - "type": "text", - "text": "What is in this image?" - }, - { - "type": "image_url", - "image_url": { - "url": "http://images.cocodataset.org/test2017/000000155781.jpg" - } - } - ] - } - ], - "max_tokens": 300, - "temperature": 0.0, - "stream": false - }' -``` - -### E/P/D Serving (Full Disaggregation) - -**Components:** - -- workers: [EncodeWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) for encoding, [MultimodalDecodeWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for decoding, and [MultimodalPDWorkerHandler](../../components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) for prefilling. 
-- processor: Tokenizes the prompt and passes it to the EncodeWorkerHandler. -- frontend: HTTP endpoint to handle incoming requests. - -**Workflow:** - -For the LLaVA model, embeddings are only required during the prefill stage. The EncodeWorkerHandler is connected directly to the prefill worker, encoding the image and passing embeddings via NATS and RDMA. The prefill worker performs the prefilling step and forwards the KV cache to the decode worker. - -```mermaid -flowchart LR - HTTP --> processor - processor --> HTTP - processor --image_url--> encode_worker - encode_worker --> processor - encode_worker --embeddings--> prefill_worker - prefill_worker --> encode_worker - prefill_worker --> decode_worker - decode_worker --> prefill_worker -``` - -**Launch:** - -```bash -cd $DYNAMO_HOME/examples/backends/vllm -bash launch/disagg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf -``` - -> [!NOTE] -> Disaggregation is currently only confirmed to work with LLaVA. Qwen2.5-VL is not confirmed to be supported. - -## ECConnector Serving - -ECConnector is vLLM's native connector for transferring multimodal embeddings via an Embedding Cache. The encoder worker acts as a **producer** (writes embeddings), while the PD worker acts as a **consumer** (reads embeddings). - -**Workflow:** - -```mermaid -flowchart LR - HTTP --> processor[EC Processor] - processor --image_url--> encoder[vLLM Native Encoder<br/>Producer] - encoder --writes--> cache[(Embedding Cache)] - cache --reads--> pd[PD Worker<br/>Consumer] - pd --> processor - processor --> HTTP -``` - -**Launch:** - -```bash -cd $DYNAMO_HOME/examples/backends/vllm -bash launch/agg_multimodal_ec_connector.sh --model llava-hf/llava-1.5-7b-hf - -# Custom storage path for Embedding Cache -bash launch/agg_multimodal_ec_connector.sh --ec-storage-path /shared/encoder-cache -``` - -**Client:** Same as [E/PD Serving](#epd-serving-encode-separate) - -## Llama 4 Serving - -The Llama 4 model family is natively multimodal. Unlike LLaVA, these models do not directly consume image embeddings as input (see the [vLLM support matrix](https://docs.vllm.ai/en/latest/models/supported_models.html#text-generation_1)). Therefore, the encoder worker is not used and encoding is done alongside prefill. - -Example model: `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8` on H100x8. - -### Llama 4 Aggregated Serving - -**Workflow:** - -```mermaid -flowchart LR - HTTP --> processor - processor --> HTTP - processor --image_url--> pd_worker - pd_worker --> processor -``` - -**Launch:** - -```bash -cd $DYNAMO_HOME/examples/backends/vllm -bash launch/agg_multimodal_llama.sh -``` - -**Client:** - -```bash -curl http://localhost:8000/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8", - "messages": [ - { - "role": "user", - "content": [ - { - "type": "text", - "text": "What is in this image?"
- }, - { - "type": "image_url", - "image_url": { - "url": "http://images.cocodataset.org/test2017/000000155781.jpg" - } - } - ] - } - ], - "max_tokens": 300, - "temperature": 0.0, - "stream": false - }' -``` - -### Llama 4 Disaggregated Serving - -**Workflow:** - -```mermaid -flowchart LR - HTTP --> processor - processor --> HTTP - processor --image_url--> prefill_worker - prefill_worker --> processor - prefill_worker --> decode_worker - decode_worker --> prefill_worker -``` - -**Launch:** - -```bash -cd $DYNAMO_HOME/examples/backends/vllm -bash launch/disagg_multimodal_llama.sh --head-node - -# On a separate node with NATS_SERVER and ETCD_ENDPOINTS pointing to head node: -cd $DYNAMO_HOME/examples/backends/vllm -bash launch/disagg_multimodal_llama.sh -``` - -## Video Serving - -### Video Aggregated Serving - -**Components:** - -- workers: [VideoEncodeWorker](../../examples/multimodal/components/video_encode_worker.py) for decoding video into frames, and [VllmPDWorker](../../examples/multimodal/components/worker.py) for prefilling and decoding. -- processor: Tokenizes the prompt and passes it to the VideoEncodeWorker. -- frontend: HTTP endpoint to handle incoming requests. - -**Workflow:** - -The VideoEncodeWorker decodes the video into frames. Unlike the image pipeline which generates embeddings, this pipeline passes raw frames directly to the VllmPDWorker via NATS and RDMA. 
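The exact frame-extraction logic is model- and config-specific, but a common approach is to sample a fixed number of frames spread evenly across the video before handing them to the worker. A minimal sketch of that sampling arithmetic (a hypothetical helper for illustration only, not the actual `VideoEncodeWorker` code):

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick `num_samples` frame indices spread evenly across a video.

    Hypothetical sketch of the sampling a video encode worker might
    perform before passing raw frames to the PD worker; the real logic
    lives in examples/multimodal/components/video_encode_worker.py.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Short video: just take every frame.
        return list(range(total_frames))
    # Center each sample inside an equal-width bucket of frames.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]
```

For example, sampling 4 frames from a 100-frame clip picks one frame from the middle of each quarter of the video.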
- -```mermaid -flowchart LR - HTTP --> processor - processor --> HTTP - processor --video_url--> video_encode_worker - video_encode_worker --> processor - video_encode_worker --frames--> pd_worker - pd_worker --> video_encode_worker -``` - -**Launch:** - -```bash -cd $DYNAMO_HOME/examples/multimodal -bash launch/video_agg.sh -``` - -**Client:** - -```bash -curl http://localhost:8000/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "llava-hf/LLaVA-NeXT-Video-7B-hf", - "messages": [ - { - "role": "user", - "content": [ - { - "type": "text", - "text": "Describe the video in detail" - }, - { - "type": "video_url", - "video_url": { - "url": "https://storage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4" - } - } - ] - } - ], - "max_tokens": 300, - "stream": false - }' | jq -``` - -### Video Disaggregated Serving - -**Workflow:** - -For the LLaVA-NeXT-Video-7B model, frames are only required during the prefill stage. The VideoEncodeWorker is connected directly to the prefill worker, decoding the video into frames and passing them via RDMA. - -```mermaid -flowchart LR - HTTP --> processor - processor --> HTTP - processor --video_url--> video_encode_worker - video_encode_worker --> processor - video_encode_worker --frames--> prefill_worker - prefill_worker --> video_encode_worker - prefill_worker --> decode_worker - decode_worker --> prefill_worker -``` - -**Launch:** - -```bash -cd $DYNAMO_HOME/examples/multimodal -bash launch/video_disagg.sh -``` - -## Audio Serving - -### Audio Aggregated Serving - -**Components:** - -- workers: [AudioEncodeWorker](../../examples/multimodal/components/audio_encode_worker.py) for decoding audio into embeddings, and [VllmPDWorker](../../examples/multimodal/components/worker.py) for prefilling and decoding. -- processor: Tokenizes the prompt and passes it to the AudioEncodeWorker. -- frontend: HTTP endpoint to handle incoming requests. 
- -**Workflow:** - -```mermaid -flowchart LR - HTTP --> processor - processor --> HTTP - processor --audio_url--> audio_encode_worker - audio_encode_worker --> processor - audio_encode_worker --embeddings--> pd_worker - pd_worker --> audio_encode_worker -``` - -**Launch:** - -```bash -pip install vllm["audio"] accelerate # multimodal audio models dependency -cd $DYNAMO_HOME/examples/multimodal -bash launch/audio_agg.sh -``` - -**Client:** - -```bash -curl http://localhost:8000/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "Qwen/Qwen2-Audio-7B-Instruct", - "messages": [ - { - "role": "user", - "content": [ - { - "type": "text", - "text": "What is recited in the audio?" - }, - { - "type": "audio_url", - "audio_url": { - "url": "https://raw.githubusercontent.com/yuekaizhang/Triton-ASR-Client/main/datasets/mini_en/wav/1221-135766-0002.wav" - } - } - ] - } - ], - "max_tokens": 6000, - "temperature": 0.8, - "stream": false - }' | jq -``` - -### Audio Disaggregated Serving - -**Workflow:** - -For the Qwen2-Audio model, audio embeddings are only required during the prefill stage. The AudioEncodeWorker is connected directly to the prefill worker. - -```mermaid -flowchart LR - HTTP --> processor - processor --> HTTP - processor --audio_url--> audio_encode_worker - audio_encode_worker --> processor - audio_encode_worker --embeddings--> prefill_worker - prefill_worker --> audio_encode_worker - prefill_worker --> decode_worker - decode_worker --> prefill_worker -``` - -**Launch:** - -```bash -pip install vllm["audio"] accelerate # multimodal audio models dependency -cd $DYNAMO_HOME/examples/multimodal -bash launch/audio_disagg.sh -``` - -## NIXL Usage - -| Use Case | Script | NIXL Used? 
| Data Transfer | -|----------|--------|------------|---------------| -| EPD (Simple Aggregated) | `agg_multimodal.sh` | No | All in one worker | -| E/PD (Encode Separate) | `agg_multimodal_epd.sh` | Yes | Encoder → PD (embeddings) | -| E/P/D (Full Disaggregation) | `disagg_multimodal_epd.sh` | Yes | Encoder → Prefill (embeddings), Prefill → Decode (KV cache) | -| EP/D (Llama 4) | `disagg_multimodal_llama.sh` | Yes | Prefill → Decode (KV cache) | -| E/PD (EC Connector) | `agg_multimodal_ec_connector.sh` | No | ECConnector via Embedding Cache | - -## ModelInput Types and Registration - -Dynamo's Rust SDK supports two input types that determine how the HTTP frontend preprocesses requests: - -| ModelInput Type | Preprocessing | Use Case | -|-----------------|---------------|----------| -| `ModelInput.Text` | None (raw text passed through) | Components that tokenize themselves | -| `ModelInput.Tokens` | Rust SDK would tokenize (but bypassed in multimodal) | Components expecting pre-tokenized input | - -**Registration Pattern:** - -```python -# Processor - Entry point from HTTP frontend -await register_llm( - ModelInput.Text, # Frontend sends raw text - ModelType.Chat, - generate_endpoint, - model_name, - ... -) - -# Workers - Internal components -await register_llm( - ModelInput.Tokens, # Expect pre-tokenized input - ModelType.Chat, # or ModelType.Prefill for prefill workers - generate_endpoint, - model_name, - ... -) -``` - -## Known Limitations - -- **Disaggregated flows require Python Processor** - All multimodal disaggregation requires the Python Processor component (`ModelInput.Text`). 
- -## Supported Models - -The following models have been tested with Dynamo's vLLM multimodal backend: - -- **Qwen2.5-VL** - `Qwen/Qwen2.5-VL-7B-Instruct` -- **Qwen3-VL** - `Qwen/Qwen3-VL-30B-A3B-Instruct-FP8` -- **LLaVA 1.5** - `llava-hf/llava-1.5-7b-hf` -- **Llama 4 Maverick** - `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8` -- **LLaVA Next Video** - `llava-hf/LLaVA-NeXT-Video-7B-hf` -- **Qwen2-Audio** - `Qwen/Qwen2-Audio-7B-Instruct` - -For a complete list of multimodal models supported by vLLM, see [vLLM Supported Multimodal Models](https://docs.vllm.ai/en/latest/models/supported_models/#list-of-multimodal-language-models). Models listed there should work with Simple Aggregated Mode but may not be explicitly tested. - -## Key Files - -| File | Description | -|------|-------------| -| `components/src/dynamo/vllm/main.py` | Worker initialization and setup | -| `components/src/dynamo/vllm/args.py` | Command-line argument parsing | -| `components/src/dynamo/vllm/multimodal_handlers/processor_handler.py` | Processor implementation | -| `components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py` | Encode worker implementations (custom and vLLM-native) | -| `components/src/dynamo/vllm/multimodal_handlers/worker_handler.py` | PD/Prefill/Decode worker implementation | diff --git a/docs/performance/aiconfigurator.md b/docs/performance/aiconfigurator.md index 91528bf5e82..69e8f549877 100644 --- a/docs/performance/aiconfigurator.md +++ b/docs/performance/aiconfigurator.md @@ -151,5 +151,5 @@ docker run -it --rm nvcr.io/nvidia/aiconfigurator:latest \ ## Learn More - [Dynamo Installation Guide](/docs/kubernetes/installation_guide.md) -- [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md) +- [SLA Planner Guide](/docs/components/planner/planner_guide.md) - [Benchmarking Guide](/docs/benchmarks/benchmarking.md) \ No newline at end of file diff --git a/docs/planner/load_planner.md b/docs/planner/load_planner.md deleted file mode 100644 
index 9ae1bbdc0aa..00000000000 --- a/docs/planner/load_planner.md +++ /dev/null @@ -1,57 +0,0 @@ -# Load-based Planner - -This document covers the load-based planner in `examples/llm/components/planner.py`. - -> [!WARNING] -> The load-based planner is inoperable because the vllm, sglang, and trtllm examples do not use prefill queues. Please use the SLA planner for now. - -> [!WARNING] -> Bare metal deployment with the local connector is deprecated. The only way to deploy the load-based planner is via k8s. We will update the examples in this document soon. - -## Load-based Scaling Up/Down Prefill/Decode Workers - -To adjust the number of prefill/decode workers, the planner monitors the following metrics: -* Prefill worker: the planner monitors the number of requests pending in the prefill queue to estimate the prefill workload. -* Decode/aggregated worker: the planner monitors the average KV cache utilization rate to estimate the decode/aggregated workload. - -Every `metric-pulling-interval`, the planner gathers the aforementioned metrics. Every `adjustment-interval`, the planner compares the aggregated metrics in this interval with pre-set thresholds and decides whether to scale prefill/decode workers up or down. To avoid over-compensation, the planner changes the number of workers by at most 1 per adjustment interval. In addition, while the number of workers is being adjusted, the planner blocks metric pulling and further adjustments. - -To scale up a prefill/decode worker, the planner only needs to launch the worker in the correct namespace. The auto-discovery mechanism picks up the workers and adds them to the routers. To scale down a prefill worker, the planner sends a SIGTERM signal to the prefill worker. The prefill worker records the signal and exits when it finishes the current request pulled from the prefill queue. This ensures that no remote prefill request is dropped. To scale down a decode worker, the planner revokes the etcd lease of the decode worker.
When the etcd lease is revoked, the corresponding decode worker is immediately removed from the router and won't receive any new requests. The decode worker then finishes all of its current requests in their original streams and exits gracefully. - -There are two additional rules set by the planner to prevent over-compensation: -1. After a new decode worker is added, since it needs time to populate the kv cache, the planner doesn't scale down the number of decode workers in the next `NEW_DECODE_WORKER_GRACE_PERIOD=3` adjustment intervals. -1. The planner does not scale up prefill workers if the prefill queue size is estimated to drop below the `--prefill-queue-scale-up-threshold` within the next `NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD=3` adjustment intervals, following the trend observed in the current adjustment interval. - -## SLA-based Scaling Up/Down Prefill/Decode Workers - -See [SLA-Driven Profiling](../benchmarks/sla_driven_profiling.md) for more details. - -## Usage - -The planner integration with the new frontend + worker architecture is currently a work in progress. This documentation will be updated with the new deployment patterns and code examples once the planner component has been fully adapted to the new workflow.
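The threshold rules described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the planner's actual implementation; the default thresholds mirror the documented configuration options, and the function names are hypothetical:

```python
def decode_scaling_decision(avg_kv_utilization: float,
                            scale_up_threshold: float = 0.9,
                            scale_down_threshold: float = 0.5) -> int:
    """Return the change in decode workers (+1, -1, or 0).

    The planner changes worker counts by at most 1 per adjustment interval.
    """
    if avg_kv_utilization > scale_up_threshold:
        return 1
    if avg_kv_utilization < scale_down_threshold:
        return -1
    return 0

def prefill_scaling_decision(avg_queue_size: float,
                             num_prefill_workers: int,
                             scale_up_threshold: float = 0.5,
                             scale_down_threshold: float = 0.2) -> int:
    """Same idea for prefill, keyed on per-worker prefill queue depth."""
    per_worker = avg_queue_size / max(num_prefill_workers, 1)
    if per_worker > scale_up_threshold:
        return 1
    if per_worker < scale_down_threshold:
        return -1
    return 0

print(decode_scaling_decision(0.95))   # above 0.9 -> scale up
print(decode_scaling_decision(0.30))   # below 0.5 -> scale down
print(prefill_scaling_decision(1.2, 4))  # 0.3 per worker -> hold steady
```

The over-compensation guards (grace periods, trend-based buffering) would sit on top of decisions like these, suppressing a +1/-1 when a recent change has not yet taken effect.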
- -Configuration options: -* `namespace` (str, default: "dynamo"): Target namespace for planner operations -* `environment` (str, default: "local"): Target environment (local, kubernetes) -* `no-operation` (bool, default: false): Run in observation mode only -* `log-dir` (str, default: None): Tensorboard log directory -* `adjustment-interval` (int, default: 30): Seconds between adjustments -* `metric-pulling-interval` (int, default: 1): Seconds between metric pulls -* `max-gpu-budget` (int, default: 8): Maximum GPUs for all workers -* `min-gpu-budget` (int, default: 1): Minimum GPUs per worker type -* `decode-kv-scale-up-threshold` (float, default: 0.9): KV cache threshold for scale-up -* `decode-kv-scale-down-threshold` (float, default: 0.5): KV cache threshold for scale-down -* `prefill-queue-scale-up-threshold` (float, default: 0.5): Queue threshold for scale-up -* `prefill-queue-scale-down-threshold` (float, default: 0.2): Queue threshold for scale-down -* `decode-engine-num-gpu` (int, default: 1): GPUs per decode engine -* `prefill-engine-num-gpu` (int, default: 1): GPUs per prefill engine - -Run as standalone process: -```bash -PYTHONPATH=/workspace/examples/llm python components/planner.py --namespace=dynamo --served-model-name=vllm --no-operation --log-dir=log/planner -``` - -Monitor metrics with Tensorboard: -```bash -tensorboard --logdir= -``` diff --git a/docs/planner/planner_intro.rst b/docs/planner/planner_intro.rst deleted file mode 100644 index 478ce8feccf..00000000000 --- a/docs/planner/planner_intro.rst +++ /dev/null @@ -1,85 +0,0 @@ -.. - SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. - SPDX-License-Identifier: Apache-2.0 - - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. 
- You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. - -Planner -======= - -The planner monitors the state of the system and adjusts workers to ensure that the system runs efficiently. - -Currently, the planner can scale the number of vllm workers up and down based on the kv cache load and prefill queue size: - -Key features include: - -* **SLA-based scaling** that uses predictive modeling and performance interpolation to proactively meet TTFT and ITL targets -* **Graceful scaling** that ensures no requests are dropped during scale-down operations - -.. admonition:: 🚀 Quick Start - :class: seealso - - **New to SLA Planner?** Start with the [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md) for a complete, step-by-step workflow. - - **Prerequisites**: SLA-based planner requires pre-deployment profiling (2-4 hours on real silicon or a few minutes using simulator) before deployment. The Quick Start guide includes everything you need. - -.. list-table:: - :widths: 20 5 75 - :header-rows: 1 - - * - - - - - Feature - * - **Backend** - - ❌ - - Local - * - - - ✅ - - Kubernetes - * - **LLM Framework** - - ✅ - - vLLM - * - - - ✅ - - TensorRT-LLM - * - - - ✅ - - SGLang - * - **Serving Type** - - ✅ - - Aggregated - * - - - ✅ - - Disaggregated - * - **Planner Actions** - - ❌ - - Load-based scaling up/down prefill/decode workers - * - - - ✅ - - SLA-based scaling up/down prefill/decode workers [1]_ - * - - - ❌ - - Adjusting engine knobs - -.. [1] Supported with some limitations. - -.. 
toctree:: - :hidden: - - Overview - Planner README - Planner Guide - Planner Examples - SLA Planner Quick Start - SLA-Driven Profiling <../benchmarks/sla_driven_profiling.md> - SLA-based Planner diff --git a/docs/planner/sla_planner.md deleted file mode 100644 index bb7d7c82a3e..00000000000 --- a/docs/planner/sla_planner.md +++ /dev/null @@ -1,203 +0,0 @@ -# SLA-based Planner - -> [!TIP] -> **New to SLA Planner?** For a complete workflow including profiling and deployment, see the [SLA Profiling + Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md). - -This document describes the SLA-based planner in `examples/common/utils/planner_core.py`. - -The SLA (Service Level Agreement)-based planner is an intelligent autoscaling system that monitors system performance and adjusts the number of prefill and decode workers to meet specified TTFT and ITL targets. Unlike the load-based planner, which scales based on resource utilization thresholds, the SLA planner uses predictive modeling and performance interpolation to proactively scale the workers. - -> [!NOTE] -> Currently, the SLA-based planner only supports disaggregated setups. - -> [!WARNING] -> Bare metal deployment with the local connector is deprecated. Please deploy the SLA planner in k8s. - -## Architecture Overview - -**Components:** -- **Frontend**: Serves requests and exposes `/metrics` -- **Prometheus**: Scrapes frontend metrics every 5s (by default; can be updated in the podmonitor manifest) -- **Planner**: Queries Prometheus and adjusts worker scaling every adjustment interval -- **Workers**: Prefill and decode workers handle inference - -The adjustment interval can be set as an argument in the planner manifest. The default interval value can be found in this [file](/components/src/dynamo/planner/defaults.py).
- -```mermaid -flowchart LR - Frontend --"/metrics"--> Prometheus - Planner --"query API"--> Prometheus - Planner --"scaling decisions"--> Workers - Frontend -.->|"requests"| Workers -``` - -## Features - -* **SLA-driven scaling**: Automatically scales prefill/decode workers to meet TTFT and ITL targets -* **Predictive load forecasting**: Uses ARIMA, Prophet, Kalman, or constant predictors to forecast future load -* **Performance interpolation**: Leverages data from pre-deployment profiling for accurate scaling decisions -* **Correction factors**: Adapt to real-world performance deviations from profiled data - -## Design - -The SLA planner consists of several key components: - -1. **Load Predictors**: Forecast future request patterns (number of requests, input/output sequence lengths) -2. **Performance Interpolators**: Estimate TTFT and ITL based on profiled performance data -3. **Correction Factors**: Adjust predictions based on observed vs. expected performance -4. **Scaling Logic**: Calculate the optimal number of prefill/decode replicas to meet SLA targets - -## SLA-Driven Pre-Deployment Profiling - -**Prerequisite**: The SLA-based planner requires pre-deployment profiling to be completed before deployment. The profiling process analyzes your model's performance characteristics to determine optimal tensor parallelism configurations and scaling parameters that the planner will use during operation. - -See [Pre-Deployment Profiling](../benchmarks/sla_driven_profiling.md) for detailed instructions on running the profiling process. - -## Load Prediction - -The SLA planner uses a load predictor to forecast the number of requests, ISL, and OSL in the next adjustment interval.
Currently, four load prediction models are supported: - -### Constant Predictor -- **Use case**: Stable and long prediction interval -- **Behavior**: Assumes next load equals current load -- **Configuration**: `load-predictor: "constant"` - -### ARIMA Predictor -- **Use case**: Time-series data with trends and seasonality -- **Behavior**: Uses auto-ARIMA to fit optimal model parameters -- **Configuration**: `load-predictor: "arima"` -- **Tunable parameters**: - - `--load-predictor-log1p`: model `log1p(y)` instead of `y`. If not set, ARIMA starts in raw space, and if it collapses to `(0,d,0)`, it falls back to `log1p` automatically. - -### Kalman Predictor -- **Use case**: Low-latency online forecasting (observe 1 → predict 1) with smooth adaptation -- **Behavior**: Local linear trend Kalman filter (fast online updates; good default when ARIMA collapses to mean-only) -- **Configuration**: `load-predictor: "kalman"` -- **Tunable parameters**: - - `--kalman-q-level`: process noise for level (higher = more responsive) - - `--kalman-q-trend`: process noise for trend (higher = trend changes faster) - - `--kalman-r`: measurement noise (lower = trusts new measurements more) - - `--kalman-min-points`: minimum points before forecasting - - `--load-predictor-log1p`: model `log1p(y)` instead of `y` (often helps request-rate/count series) - -### Prophet Predictor -- **Use case**: Complex seasonal patterns and trend changes -- **Behavior**: Facebook's [Prophet](https://facebook.github.io/prophet/) model for time-series forecasting -- **Configuration**: `load-predictor: "prophet"` -- **Tunable parameters**: - - `--prophet-window-size`: bounds internal history to control refit cost - - `--load-predictor-log1p`: model `log1p(y)` instead of `y` - -### Warm-starting Load Predictors (Optional) -You can warm-start the load predictors with a mooncake-style JSONL trace file to provide historical context before live traffic is observed: - -- **CLI argument**: 
`--load-predictor-warmup-trace ` -- **Effect**: preloads the predictors with historical request-count / ISL / OSL samples extracted from the trace. - -## Scaling Algorithm - -At each adjustment interval, the SLA planner performs the following operations: - -### 1. Metric Collection -Every adjustment interval, collect: -- Average Time to First Token (TTFT) -- Average Inter-Token Latency (ITL) -- Request count and duration -- Input/Output sequence lengths - -### 2. Correction Factor Calculation -Using the collected metrics, the SLA planner applies the interpolator to compute the expected TTFT/ITL and calibrate the interpolation model. This step is important because the actual TTFT/ITL often differ from the ideal case: -- **TTFT**: actual TTFT heavily depends on request queueing and the prefix cache hit rate (if KV reuse is enabled). For example, if all requests arrive at the beginning of the adjustment interval, they queue heavily and TTFT will be significantly higher. If the prefix cache hit rate is very high, the actual number of tokens in the prefill will be very low and TTFT will be significantly lower. -- **ITL**: actual ITL may be affected by small chunked prefill requests in the decode engine. -- **Metric variances**: large variances in request rate, ISL, and OSL may lead to inaccurate estimation of the TTFT/ITL since the SLA planner only considers averages when interpolating. - -The SLA planner calculates the correction factors as -- **Prefill correction**: `actual_ttft / expected_ttft` -- **Decode correction**: `actual_itl / expected_itl` - -### 3. Load Prediction -The SLA planner forecasts these metrics for the next interval using the load predictor: -- Number of requests -- Input sequence length -- Output sequence length - -### 4. Calculating Number of Replicas - -**Prefill replicas**: the SLA planner assumes the prefill correction factor has a linear effect on prefill throughput per GPU, as prefill is single-batched.
-``` -predicted_load = next_requests * next_isl / interval * min(1, prefill_correction) -prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engine) -``` - -**Decode replicas**: -``` -# 1. apply d_correction_factor to the ITL SLA -corrected_itl = self.args.itl / self.d_correction_factor -# 2. reversely find out what is best throughput/gpu that can achieve corrected_itl under the predicted context length -pred_decode_thpt_per_gpu = self.decode_interpolator.find_best_throughput_per_gpu( - itl=corrected_itl, - context_length=next_isl + next_osl / 2 -) -# 3. compute number of decode replicas needed -next_num_d = math.ceil(next_num_req * next_osl / self.args.adjustment_interval / pred_decode_thpt_per_gpu / self.args.decode_engine_num_gpu) -``` - -### 5. Scaling - -Finally, SLA planner applies the change by scaling up/down the number of prefill and decode workers to the calculated number of replica in the next interval. - -> [!NOTE] -> SLA-planner scales up/down the P/D engines non-blockingly. If `adjustment-interval` is too short, the previous scaling operations may not finish before the new scaling operations are issued. Make sure to set a large enough `adjustment-interval`. - -## Deploying - -For complete deployment instructions, see the [SLA Planner Quick Start Guide](/docs/planner/sla_planner_quickstart.md). - -> [!NOTE] -> The SLA planner requires a frontend that reports metrics at the `/metrics` HTTP endpoint with the number of requests, ISL, OSL, TTFT, and ITL in the correct format. The dynamo frontend provides these metrics automatically. - -### Virtual Deployment - -The SLA planner supports virtual deployment mode for customized environments (e.g., customized cluster) through the `VirtualConnector`. This connector enables the planner to communicate scaling decisions without directly managing the deployment infrastructure. - -The `VirtualConnector` acts as a bridge between the SLA planner and external deployment environments. 
Instead of directly scaling Kubernetes resources, it writes scaling decisions and waits for the deployment environment to acknowledge completion. - -#### Scaling Decision Flow - -1. **Decision Generation**: The planner calculates optimal worker counts -2. **Change Detection**: The planner skips scaling if the target counts match current counts, logging: `"No scaling needed (prefill=X, decode=Y)"` -3. **Readiness Check**: Before making new decisions, the planner verifies that previous scaling operations have completed by checking if `scaled_decision_id >= decision_id` -4. **Timeout Handling**: If a scaling decision isn't acknowledged within 30 minutes (1800 seconds), the planner proceeds with new decisions anyway -5. **Completion Tracking**: The planner can optionally wait for scaling completion confirmation (blocking mode) - -#### Configuration - -To use virtual deployment mode: - -```yaml -environment: "virtual" -backend: "vllm" # or "sglang" -``` - -#### Deployment Environment Requirements - -The external deployment environment must use `VirtualConnectorClient`: - -``` -from dynamo._core import DistributedRuntime, VirtualConnectorClient - -client = VirtualConnectorClient(distributed_runtime, namespace) -``` - -1. **Monitor Planner**: Continuously watch for scaling decisions: `await client.wait()`. This blocks until there is a change. -2. **Parse Decisions**: Read `num_prefill_workers` and `num_decode_workers` values: `decision = await client.get()` -3. **Execute Scaling**: Apply the scaling decisions to the actual deployment infrastructure -4. 
**Acknowledge Completion**: Mark the decision completed when scaling is finished: `await client.complete(decision)` - -A scaling decision (returned by `client.get()`) contains the following fields, which are -1 if not set yet: -- `num_prefill_workers`: Integer specifying the target number of prefill workers -- `num_decode_workers`: Integer specifying the target number of decode workers -- `decision_id`: Integer with incremental ID for each scaling decision - -See `components/planner/test/test_virtual_connector.py` for a full example. - diff --git a/docs/planner/sla_planner_quickstart.md b/docs/planner/sla_planner_quickstart.md deleted file mode 100644 index f3932a46de7..00000000000 --- a/docs/planner/sla_planner_quickstart.md +++ /dev/null @@ -1,521 +0,0 @@ -# SLA-Driven Profiling and Planner Deployment Quick Start Guide - -Complete workflow to deploy SLA-optimized Dynamo models using DynamoGraphDeploymentRequests (DGDR). This guide shows how to automatically profile models and deploy them with optimal configurations that meet your Service Level Agreements (SLAs). - -> [!IMPORTANT] -> **Prerequisites**: This guide assumes you have a Kubernetes cluster with GPU nodes and have completed the [Dynamo Platform installation](/docs/kubernetes/installation_guide.md). - -## Overview - -The DGDR workflow automates the entire process from SLA specification to deployment: - -1. **Define SLAs**: Specify performance requirements (TTFT, ITL) and model information in a DGDR Custom Resource -2. **Automatic Profiling**: The Dynamo Operator automatically profiles your model to find optimal configurations -3. **Auto-Deploy**: The system automatically deploys the optimal configuration that meets your SLAs - -```mermaid -flowchart TD - A[Create DGDR] --> B[DGDR Controller] - B --> C{Profiling Method} - C -->|Online| D[Run Profiling Job
2-4 hours] - C -->|Offline/AIC| E[AI Configurator
20-30 seconds] - D --> F[Generate DGD Config] - E --> F - F --> G[Auto-Deploy DGD] - G --> H[Monitor & Scale] - - style A fill:#e1f5fe - style D fill:#fff3e0 - style E fill:#e8f5e8 - style G fill:#f3e5f5 - style H fill:#fff8e1 -``` - -## What is a DynamoGraphDeploymentRequest (DGDR)? - -A **DynamoGraphDeploymentRequest (DGDR)** is a Kubernetes Custom Resource that serves as the primary interface for users to request model deployments with specific performance and resource constraints. Think of it as a "deployment order" where you specify: - -- **What** model you want to deploy (`model`) -- **How** it should perform (SLA targets: `ttft`, `itl`) -- **Where** it should run (optional GPU preferences) -- **Which** backend to use (`backend`: vllm, sglang, or trtllm) -- **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`) - -The Dynamo Operator watches for DGDRs and automatically: -1. Discovers available GPU resources in your cluster -2. Runs profiling (online or offline) to find optimal configurations -3. Generates an optimized DynamoGraphDeployment (DGD) configuration -4. 
Deploys the DGD to your cluster - -**Key Benefits:** -- **Declarative**: Specify what you want, not how to achieve it -- **Automated**: No manual profiling job setup or result processing -- **SLA-Driven**: Ensures deployments meet your performance requirements -- **Integrated**: Works seamlessly with the Dynamo Operator - -## Prerequisites - -Before creating a DGDR, ensure: -- **Dynamo platform installed** with the operator running (see [Installation Guide](/docs/kubernetes/installation_guide.md)) -- **[kube-prometheus-stack](/docs/kubernetes/observability/metrics.md) installed and running** (required for SLA planner) -- **Image pull secrets configured** if using private registries (typically `nvcr-imagepullsecret` for NVIDIA images) -- **Sufficient GPU resources** available in your cluster for profiling -- **Runtime images available** that contain both profiler and runtime components - -### Container Images - -Each DGDR requires you to specify container images for the profiling and deployment process: - -**profilingConfig.profilerImage** (Required): -Specifies the container image used for the profiling job itself. This image must contain the profiler code and dependencies needed for SLA-based profiling. - -**deploymentOverrides.workersImage** (Optional): -Specifies the container image used for DynamoGraphDeployment worker components (frontend, workers, planner). This image is used for: -- Temporary DGDs created during online profiling (for performance measurements) -- The final DGD deployed after profiling completes - -If `workersImage` is omitted, the image from the base config file (e.g., `disagg.yaml`) is used. You may use our public images (0.6.1 and later) or build and push your own. 
- -```yaml -spec: - profilingConfig: - profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" - deploymentOverrides: - workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" # Optional -``` - -## Quick Start: Deploy with DGDR - -### Step 1: Create Your DGDR - -Dynamo provides sample DGDR configurations in `benchmarks/profiler/deploy/`. You can use these as starting points: - -**Available Sample DGDRs:** -- **`profile_sla_dgdr.yaml`**: Standard online profiling for dense models -- **`profile_sla_aic_dgdr.yaml`**: Fast offline profiling using AI Configurator -- **`profile_sla_moe_dgdr.yaml`**: Online profiling for MoE models (SGLang) - -Or, you can create your own DGDR for your own needs. - -> **Important - Profiling Config Cases**: Prior to 0.8.1, any fields under `profilingConfig.config` are represented in snake_case. Starting 0.8.1, fields under `profilingConfig.config` are represented in camelCase for uniformity. There is backwards compatibility to snake_case, but as all example DGDRs are using camelCase, anyone using a release prior to 0.8.1 must manually update the configs under the examples to have snake_case config fields. - -> [!TIP] -> For detailed explanations of all configuration options (SLA, hardware, sweep, AIC, planner), see the [DGDR Configuration Reference](/docs/benchmarks/sla_driven_profiling.md#dgdr-configuration-reference). - -### Step 2: Apply the DGDR - -The rest of this quickstart will use the DGDR sample that uses AIC profiling. If you use a different DGDR file and/or name, be sure to adjust the commands accordingly. - -```bash -export NAMESPACE=your-namespace -kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE -``` - -The Dynamo Operator will immediately begin processing your request. 
- -### Step 3: Monitor Progress - -Watch the DGDR status: - -```bash -# View status -kubectl get dgdr -n $NAMESPACE - -# Detailed status -kubectl describe dgdr sla-aic -n $NAMESPACE - -# Watch profiling job logs -kubectl logs -f job/profile-sla-aic -n $NAMESPACE -``` - -**DGDR Status States:** -- `Pending`: Initial state, preparing to profile -- `Profiling`: Running profiling job (20-30 seconds for AIC, 2-4 hours for online) -- `Deploying`: Generating and applying DGD configuration -- `Ready`: DGD successfully deployed and running -- `Failed`: Error occurred (check events for details) - -> [!NOTE] -> With AI Configurator, profiling completes in **20-30 seconds**! This is much faster than online profiling which takes 2-4 hours. - -### Step 4: Access Your Deployment - -Once the DGDR reaches `Ready` state, your model is deployed and ready to serve: - -```bash -# Find the frontend service -kubectl get svc -n $NAMESPACE | grep trtllm-disagg - -# Port-forward to access locally -kubectl port-forward svc/trtllm-disagg-frontend 8000:8000 -n $NAMESPACE - -# Test the endpoint -curl http://localhost:8000/v1/models -``` - -### Step 5 (Optional): Access the Planner Grafana Dashboard - -If you want to monitor the SLA Planner's decision-making in real-time, you can deploy the Planner Grafana dashboard. - -```bash -kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml -``` - -Follow the instructions in [Dynamo Metrics Collection on Kubernetes](../kubernetes/observability/metrics.md) to access the Grafana UI and select the **Dynamo Planner Dashboard**. 
- -The dashboard displays: -- **Worker Counts & GPU Usage**: Current prefill/decode worker counts and cumulative GPU hours -- **Observed Metrics**: Real-time TTFT, ITL, request rate, and sequence lengths from Prometheus -- **Predicted Metrics**: Planner's load predictions and recommended replica counts -- **Correction Factors**: How the planner adjusts predictions based on observed vs expected performance - -> [!TIP] -> Use the **Namespace** dropdown at the top of the dashboard to filter metrics for your specific deployment namespace. - -## DGDR Configuration Details - -### Required Fields - -| Field | Type | Description | -|-------|------|-------------| -| `spec.model` | string | Model identifier (e.g., "meta-llama/Llama-3-70b") | -| `spec.backend` | enum | Inference backend: `vllm`, `sglang`, or `trtllm` | -| `spec.profilingConfig.profilerImage` | string | Container image for profiling job | -| `spec.profilingConfig.config.sla` | object | SLA targets (isl, osl, ttft, itl) | - -### Optional Fields - -| Field | Type | Description | -|-------|------|-------------| -| `spec.deploymentOverrides.workersImage` | string | Container image for DGD worker components. If omitted, uses image from base config file. 
| -| `spec.autoApply` | boolean | Automatically deploy DGD after profiling (default: false) | -| `spec.deploymentOverrides` | object | Customize metadata (name, namespace, labels, annotations) and image for auto-created DGD | - -### SLA Configuration - -The `sla` section defines performance requirements and workload characteristics: - -```yaml -sla: - isl: 3000 # Average input sequence length (tokens) - osl: 150 # Average output sequence length (tokens) - ttft: 200 # Target Time To First Token (milliseconds, float) - itl: 20 # Target Inter-Token Latency (milliseconds, float) -``` - -**Choosing SLA Values:** -- **ISL/OSL**: Based on your expected traffic patterns -- **TTFT**: First token latency target (lower = more GPUs needed) -- **ITL**: Token generation latency target (lower = more GPUs needed) -- **Trade-offs**: Tighter SLAs require more GPU resources - -### Profiling Methods - -Choose between **online profiling** (real measurements, 2-4 hours) or **offline profiling** with AI Configurator (estimated, 20-30 seconds): - -```yaml -# Online Profiling (Default) -sweep: - useAiConfigurator: false - -# Offline Profiling (AI Configurator) -sweep: - useAiConfigurator: true - aicSystem: h200_sxm - aicHfId: Qwen/Qwen3-32B - aicBackendVersion: "0.20.0" -``` - -> [!NOTE] -> For detailed comparison, supported configurations, and limitations, see [SLA-Driven Profiling Documentation](/docs/benchmarks/sla_driven_profiling.md#profiling-methods). - -### Hardware Configuration - -For details on hardware configuration and GPU discovery options, see [Hardware Configuration in SLA-Driven Profiling](/docs/benchmarks/sla_driven_profiling.md#hardware-configuration). 
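To build intuition for the SLA trade-offs above, here is a back-of-envelope sketch following the replica-sizing formulas described in the SLA planner documentation. All throughput and correction numbers below are made up for illustration; real values come from profiling:

```python
import math

# Predicted load for the next adjustment interval (hypothetical numbers)
adjustment_interval = 60        # seconds
next_requests = 600             # predicted requests in the interval
next_isl = 3000                 # predicted input tokens per request
prefill_correction = 1.2        # actual_ttft / expected_ttft (>1 means slower than profiled)
prefill_thpt_per_gpu = 40_000   # profiled prefill tokens/s per GPU (made up)
gpus_per_prefill_engine = 1

# Prefill sizing: tokens/s of prefill work, capped correction, ceil to replicas
predicted_load = next_requests * next_isl / adjustment_interval * min(1, prefill_correction)
prefill_replicas = math.ceil(predicted_load / prefill_thpt_per_gpu / gpus_per_prefill_engine)
print(prefill_replicas)  # 30_000 tokens/s of load vs 40_000 per GPU -> 1 replica

# Decode sizing starts by tightening the ITL target with the correction factor;
# the planner then looks up the best throughput/GPU achieving this corrected ITL.
itl_sla = 20.0                  # ms target
d_correction_factor = 1.25
corrected_itl = itl_sla / d_correction_factor
print(corrected_itl)            # plan against a tighter 16.0 ms target
```

Tightening TTFT/ITL targets shrinks the per-GPU throughput the planner can assume, which is why stricter SLAs translate directly into more GPUs.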
- -### Advanced Configuration - -#### Using Existing DGD Configs (Recommended for Custom Setups) - -If you have an existing DynamoGraphDeployment config (e.g., from `examples/backends/*/deploy/disagg.yaml` or custom recipes), you can reference it via ConfigMap: - -**Step 1: Create ConfigMap from your DGD config file:** - -```bash -kubectl create configmap deepseek-r1-config \ - --from-file=disagg.yaml=/path/to/your/disagg.yaml \ - --namespace $NAMESPACE \ - --dry-run=client -o yaml | kubectl apply -f - -``` - -**Step 2: Reference the ConfigMap in your DGDR:** - -```yaml -apiVersion: nvidia.com/v1alpha1 -kind: DynamoGraphDeploymentRequest -metadata: - name: deepseek-r1 -spec: - model: deepseek-ai/DeepSeek-R1 - backend: sglang - - profilingConfig: - profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1" - configMapRef: - name: deepseek-r1-config - key: disagg.yaml # Must match the key used in --from-file - config: - sla: - isl: 4000 - osl: 500 - ttft: 300 - itl: 10 - sweep: - useAiConfigurator: true - aicSystem: h200_sxm - aicHfId: deepseek-ai/DeepSeek-V3 - aicBackendVersion: "0.20.0" - - deploymentOverrides: - workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1" - - autoApply: true -``` - -> **What's happening**: The profiler uses the DGD config from the ConfigMap as a **base template**, then optimizes it based on your SLA targets. The controller automatically injects `spec.model` into `deployment.model` and `spec.backend` into `engine.backend` in the final configuration. - -#### Inline Configuration (Simple Use Cases) - -For simple use cases without a custom DGD config, provide profiler configuration directly. 
-The profiler will auto-generate a basic DGD configuration from your `model` and `backend`:
-
-```yaml
-profilingConfig:
-  config:
-    # SLA targets (required for profiling)
-    sla:
-      isl: 8000    # Input sequence length
-      osl: 200     # Output sequence length
-      ttft: 200.0  # Time To First Token (ms)
-      itl: 10.0    # Inter-Token Latency (ms)
-
-    # Hardware constraints (optional)
-    hardware:
-      minNumGpusPerEngine: 2
-      maxNumGpusPerEngine: 8
-      gpuType: h200_sxm
-
-    # Profiling sweep settings (optional)
-    sweep:
-      prefillInterpolationGranularity: 16  # Number of samples for prefill ISL sweep
-      decodeInterpolationGranularity: 6    # Number of samples for decode sweep
-```
-
-> **Note**: `engine.config` is a **file path** to a DGD YAML file, not inline configuration. Use `configMapRef` (recommended) or leave it unset to auto-generate.
-
-#### Planner Configuration Passthrough
-
-Add planner-specific settings:
-
-```yaml
-profilingConfig:
-  config:
-    planner:
-      plannerMinEndpoint: 2
-```
-
-## Understanding Profiling Results
-
-For details about the profiling process, performance plots, and interpolation data, see [SLA-Driven Profiling Documentation](/docs/benchmarks/sla_driven_profiling.md#profiling-process-details).
-
-## Advanced Topics
-
-### Mocker Deployment
-
-Instead of a real DGD that consumes GPU resources, you can deploy a mocker deployment that uses simulated engines. Mocker is available in all backend images and uses profiling data to simulate realistic GPU timing behavior. It is useful for:
-- Large-scale experiments without GPU resources
-- Testing Planner behavior and infrastructure
-- Validating deployment configurations
-
-To deploy mocker instead of the real backend, set `useMocker: true`:
-
-```yaml
-spec:
-  model: <model>
-  backend: trtllm  # Real backend for profiling (vllm, sglang, or trtllm)
-  useMocker: true  # Deploy mocker instead of real backend
-
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/dynamo/trtllm-runtime:<tag>"
-    ...
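One way to picture what the granularity knobs mean: if the profiler samples N evenly spaced points across a sweep range and interpolates between them, a higher granularity trades profiling time for interpolation accuracy. A sketch of that idea (illustrative only; the real sweep logic lives in the Dynamo profiler):

```python
# Illustrative only: evenly spaced sample points for an interpolation sweep.
# This is not the profiler's implementation, just the idea behind granularity.
def sweep_points(lo: int, hi: int, granularity: int) -> list[int]:
    if granularity < 2:
        return [lo]
    step = (hi - lo) / (granularity - 1)
    return [round(lo + i * step) for i in range(granularity)]

points = sweep_points(100, 8000, 16)  # e.g. a prefill ISL sweep, granularity 16
print(len(points), points[0], points[-1])  # 16 100 8000
```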
-  autoApply: true
-```
-
-Profiling still runs against the real backend (via GPUs or AIC) to collect performance data. The mocker deployment then uses this data to simulate realistic timing behavior.
-
-### Using a Model Cache PVC (0.8.1 or later)
-
-Starting in Dynamo 0.8.1, for large models, you can use a pre-populated PVC containing model weights instead of downloading from Hugging Face. See [Model Cache PVC](/docs/benchmarks/sla_driven_profiling.md#model-cache-pvc-advanced) for configuration details.
-
-### DGDR Immutability
-
-DGDRs are **immutable**. If you need to update SLAs or configuration:
-
-1. Delete the existing DGDR: `kubectl delete dgdr sla-aic`
-2. Create a new DGDR with updated specifications
-
-### Manual Deployment Control
-
-There are two ways to manually control deployment after profiling:
-
-#### Option 1: Use DGDR-Generated Configuration (Recommended)
-
-Disable auto-deployment to review the generated DGD before applying:
-
-```yaml
-spec:
-  autoApply: false
-```
-
-Then manually extract and apply the generated DGD:
-
-```bash
-# Extract generated DGD from DGDR status
-kubectl get dgdr sla-aic -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' | kubectl apply -f -
-
-# Or save to file first for review/modification
-kubectl get dgdr sla-aic -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' > my-dgd.yaml
-
-vi my-dgd.yaml
-kubectl apply -f my-dgd.yaml -n $NAMESPACE
-```
-
-The generated DGD includes optimized configurations and the SLA planner component. The required `planner-profile-data` ConfigMap is automatically created when profiling completes, so the DGD will deploy successfully.
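The `jsonpath` extraction above can also be scripted. A hedged Python equivalent, assuming the DGDR object has been fetched as JSON (e.g. via `kubectl get dgdr sla-aic -o json`); the helper is illustrative, not part of Dynamo:

```python
import json

# Python equivalent of `-o jsonpath='{.status.generatedDeployment}'` for
# scripting around DGDR status (illustrative helper, not part of Dynamo).
def generated_deployment(dgdr_json: str):
    """Return the generated DGD manifest string, or None if profiling is not done."""
    obj = json.loads(dgdr_json)
    return obj.get("status", {}).get("generatedDeployment")

done = json.dumps({"status": {"generatedDeployment": "apiVersion: nvidia.com/v1alpha1"}})
pending = json.dumps({"status": {}})
print(generated_deployment(done))     # apiVersion: nvidia.com/v1alpha1
print(generated_deployment(pending))  # None
```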
-
-#### Option 2: Use Standalone Planner Templates (Advanced)
-
-For advanced use cases, you can manually deploy using the standalone planner templates in `examples/backends/*/deploy/disagg_planner.yaml`:
-
-```bash
-# After profiling completes, profiling data is automatically stored in ConfigMaps
-
-# OPTIONAL: Inspect profiling results stored in ConfigMaps
-# View the generated DGD configuration
-kubectl get configmap dgdr-output-<dgdr-name> -n $NAMESPACE -o yaml
-
-# View the planner profiling data (JSON format)
-kubectl get configmap planner-profile-data -n $NAMESPACE -o yaml
-
-# Update the PROMETHEUS_ENDPOINT environment variable in the planner template
-# to match your cluster's Prometheus service location (see comments in the template)
-
-# Update the backend planner manifest as needed, then deploy
-kubectl apply -f examples/backends/<backend>/deploy/disagg_planner.yaml -n $NAMESPACE
-```
-
-> **Note**: The standalone templates are provided as examples and may need customization for your model and requirements. The DGDR-generated configuration (Option 1) is recommended because it is automatically tuned to your profiling results and SLA targets.
->
-> **Important - Prometheus Configuration**: The planner queries Prometheus for frontend request metrics to make scaling decisions. If you see errors like "Failed to resolve prometheus service", ensure the `PROMETHEUS_ENDPOINT` environment variable in your planner configuration correctly points to your Prometheus service. See the comments in the example templates for details.
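Before deploying the planner, it can help to sanity-check the `PROMETHEUS_ENDPOINT` value. A minimal shape check (the example in-cluster service URL is an assumption about a typical kube-prometheus-stack install, not a Dynamo default):

```python
from urllib.parse import urlparse

# Minimal shape check for a PROMETHEUS_ENDPOINT value (illustrative; it does
# not prove the service is reachable from inside the cluster).
def looks_like_prometheus_endpoint(value: str) -> bool:
    u = urlparse(value)
    return u.scheme in ("http", "https") and bool(u.netloc)

# Example in-cluster URL for a typical kube-prometheus-stack install (assumption):
print(looks_like_prometheus_endpoint(
    "http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090"
))  # True
print(looks_like_prometheus_endpoint("prometheus:9090"))  # False (no scheme)
```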
-
-### Relationship to DynamoGraphDeployment (DGD)
-
-- **DGDR**: High-level "intent" - what you want deployed
-- **DGD**: Low-level "implementation" - how it's deployed
-
-The DGDR controller generates a DGD that:
-- Uses optimal TP configurations from profiling
-- Includes the SLA planner for autoscaling
-- Has deployment and engine settings tuned for your SLAs
-
-The generated DGD is tracked via labels:
-
-```yaml
-metadata:
-  labels:
-    dgdr.nvidia.com/name: sla-aic
-    dgdr.nvidia.com/namespace: your-namespace
-```
-
-### Accessing Detailed Profiling Artifacts
-
-By default, profiling jobs save essential data to ConfigMaps for planner integration. Advanced users who need access to detailed artifacts (logs, performance plots, AIPerf results, etc.) can configure the DGDR to use `dynamo-pvc`. This is optional and does not affect the functionality of the profiler or the planner.
-
-**What's available in ConfigMaps (always created):**
-- Generated DGD configuration
-- Profiling data for the planner (`.json` files)
-
-**What's available in the PVC if attached to the DGDR (optional):**
-- Performance plots (PNGs)
-- DGD configuration and logs of all services for each profiled deployment
-- AIPerf profiling artifacts for each AIPerf run
-- Raw profiling data (`.npz` files)
-- Profiler log
-
-**Setup:**
-
-1. Set up the benchmarking PVC:
-```bash
-export NAMESPACE=your-namespace
-deploy/utils/setup_benchmarking_resources.sh
-```
-
-2. Add `outputPVC` to your DGDR's `profilingConfig`:
-```yaml
-spec:
-  profilingConfig:
-    outputPVC: "dynamo-pvc"
-    config:
-      # ... rest of config
-```
-
-3. After profiling completes, access the results:
-```bash
-kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
-kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
-kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
-kubectl delete pod pvc-access-pod -n $NAMESPACE
-```
-
-## Troubleshooting
-
-### Quick Diagnostics
-
-```bash
-# Check DGDR status and events
-kubectl describe dgdr sla-aic -n $NAMESPACE
-
-# Check operator logs
-kubectl logs -n $NAMESPACE -l app.kubernetes.io/name=dynamo-operator --tail=100
-
-# Check profiling job logs
-kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE
-```
-
-### Common Issues
-
-| Issue | Quick Fix |
-|-------|-----------|
-| **DGDR stuck in Pending** | Check GPU availability: `kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'` |
-| **Image pull errors** | Verify the secret exists: `kubectl get secret nvcr-imagepullsecret -n $NAMESPACE` |
-| **Profiling fails** | Check job logs: `kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE` |
-| **SLA cannot be met** | Relax TTFT/ITL targets or add more GPUs |
-| **DGD not deployed** | Verify `autoApply: true` in the DGDR spec |
-
-> [!NOTE]
-> For comprehensive troubleshooting including AI Configurator constraints, performance debugging, and backend-specific issues, see [SLA-Driven Profiling Troubleshooting](/docs/benchmarks/sla_driven_profiling.md#troubleshooting).
-
-## Configuration Reference
-
-For comprehensive documentation of all DGDR configuration options, see the [DGDR Configuration Reference](/docs/benchmarks/sla_driven_profiling.md#dgdr-configuration-reference).
-
-This includes detailed explanations of:
-- **SLA Configuration**: ISL, OSL, TTFT, ITL with use cases and trade-offs
-- **Hardware Configuration**: GPU constraints and search space control
-- **Sweep Configuration**: Profiling behavior and interpolation settings
-- **AI Configurator Configuration**: System types, model mappings, backend versions
-- **Planner Configuration**: Autoscaling and adjustment parameters
-- **Complete Examples**: Full DGDRs for online, offline (AIC), and MoE profiling
-
-## Related Documentation
-
-- [DGDR API Reference](/docs/kubernetes/api_reference.md)
-- [Pre-Deployment Profiling Details](/docs/benchmarks/sla_driven_profiling.md)
-- [SLA Planner Architecture](/docs/planner/sla_planner.md)
-- [Dynamo Operator Guide](/docs/kubernetes/dynamo_operator.md)
diff --git a/docs/reference/feature-matrix.md b/docs/reference/feature-matrix.md
index bdc22c150b9..84b6ba978a0 100644
--- a/docs/reference/feature-matrix.md
+++ b/docs/reference/feature-matrix.md
@@ -119,19 +119,19 @@ TensorRT-LLM delivers maximum inference performance and optimization, with full
 [disagg]: docs/design_docs/disagg_serving.md
-[kv-routing]: docs/router/README.md
-[planner]: docs/planner/planner_intro.rst
-[kvbm]: docs/kvbm/kvbm_intro.rst
+[kv-routing]: docs/components/router/router_guide.md
+[planner]: docs/components/planner/README.md
+[kvbm]: docs/components/kvbm/README.md
 [migration]: docs/fault_tolerance/request_migration.md
 [tools]: docs/agents/tool-calling.md
-[mm]: docs/multimodal/index.md
-[mm-vllm]: docs/multimodal/vllm.md
-[mm-trtllm]: docs/multimodal/trtllm.md
-[mm-sglang]: docs/multimodal/sglang.md
+[mm]: docs/features/multimodal/README.md
+[mm-vllm]: docs/features/multimodal/multimodal_vllm.md
+[mm-trtllm]: docs/features/multimodal/multimodal_trtllm.md
+[mm-sglang]: docs/features/multimodal/multimodal_sglang.md
 [lora]: docs/kubernetes/deployment/dynamomodel-guide.md
-[vllm-spec]: docs/backends/vllm/speculative_decoding.md
+[vllm-spec]: docs/features/speculative_decoding/speculative_decoding_vllm.md
 [trtllm-eagle]: docs/backends/trtllm/llama4_plus_eagle.md
diff --git a/examples/backends/trtllm/deploy/README.md b/examples/backends/trtllm/deploy/README.md
index 834ea4544b1..de76a512109 100644
--- a/examples/backends/trtllm/deploy/README.md
+++ b/examples/backends/trtllm/deploy/README.md
@@ -53,7 +53,7 @@ Advanced disaggregated deployment with SLA-based automatic scaling.
   - `TRTLLMPrefillWorker`: Specialized prefill-only worker
 
 > [!NOTE]
-> This deployment requires pre-deployment profiling to be completed first. See [Pre-Deployment Profiling](../../../../docs/benchmarks/sla_driven_profiling.md) for detailed instructions.
+> This deployment requires pre-deployment profiling to be completed first. See [Pre-Deployment Profiling](../../../../docs/components/profiler/profiler_guide.md) for detailed instructions.
 
 ## CRD Structure
@@ -266,7 +266,7 @@ Configure the `model` name and `host` based on your deployment.
 - **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md)
 - **Platform Setup**: [Dynamo Kubernetes Platform Installation](../../../../docs/kubernetes/installation_guide.md)
 - **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
-- **Architecture Docs**: [Disaggregated Serving](../../../../docs/design_docs/disagg_serving.md), [KV-Aware Routing](../../../../docs/router/README.md)
+- **Architecture Docs**: [Disaggregated Serving](../../../../docs/design_docs/disagg_serving.md), [KV-Aware Routing](../../../../docs/components/router/README.md)
 - **Multinode Deployment**: [Multinode Examples](../../../../docs/backends/trtllm/multinode/multinode-examples.md)
 - **Speculative Decoding**: [Llama 4 + Eagle Guide](../../../../docs/backends/trtllm/llama4_plus_eagle.md)
 - **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
diff --git a/examples/backends/vllm/deploy/README.md b/examples/backends/vllm/deploy/README.md
index 4fc72dbb0fb..a3939107411 100644
--- a/examples/backends/vllm/deploy/README.md
+++ b/examples/backends/vllm/deploy/README.md
@@ -109,7 +109,7 @@ We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/
 
 ### Pre-Deployment Profiling (SLA Planner Only)
 
-If using the SLA Planner deployment (`disagg_planner.yaml`), follow the [pre-deployment profiling guide](../../../../docs/benchmarks/sla_driven_profiling.md) to run pre-deployment profiling.
+If using the SLA Planner deployment (`disagg_planner.yaml`), follow the [pre-deployment profiling guide](../../../../docs/components/profiler/profiler_guide.md) to run pre-deployment profiling.
 
 ## Usage
@@ -247,9 +247,9 @@ args:
 - **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/kubernetes/deployment/create_deployment.md)
 - **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md)
 - **Platform Setup**: [Dynamo Kubernetes Platform Installation](../../../../docs/kubernetes/installation_guide.md)
-- **SLA Planner**: [SLA Planner Quickstart Guide](../../../../docs/planner/sla_planner_quickstart.md)
+- **SLA Planner**: [SLA Planner Quickstart Guide](../../../../docs/components/planner/planner_guide.md)
 - **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
-- **Architecture Docs**: [Disaggregated Serving](../../../../docs/design_docs/disagg_serving.md), [KV-Aware Routing](../../../../docs/router/README.md)
+- **Architecture Docs**: [Disaggregated Serving](../../../../docs/design_docs/disagg_serving.md), [KV-Aware Routing](../../../../docs/components/router/README.md)
 
 ## Troubleshooting
diff --git a/examples/basics/multinode/README.md b/examples/basics/multinode/README.md
index 0076cfe3f67..a93b8db3a00 100644
--- a/examples/basics/multinode/README.md
+++ b/examples/basics/multinode/README.md
@@ -5,7 +5,7 @@ This example demonstrates running Dynamo across multiple nodes with **KV-aware r
 For more information about the core concepts, see:
 - [Dynamo Disaggregated Serving](../../../docs/design_docs/disagg_serving.md)
-- [KV Cache Routing](../../../docs/router/README.md)
+- [KV Cache Routing](../../../docs/components/router/README.md)
 
 ## Architecture Overview
@@ -65,7 +65,7 @@ This is particularly beneficial for:
 - **Similar queries**: Common prefixes are computed once and reused
 - **Batch processing**: Related requests can be routed to workers with shared context
 
-For detailed technical information about how KV routing works, see the [Router Guide](../../../docs/router/router_guide.md).
+For detailed technical information about how KV routing works, see the [Router Guide](../../../docs/components/router/router_guide.md).
 
 ## Prerequisites
@@ -475,7 +475,7 @@ python -m dynamo.frontend \
     --router-temperature 0.0  # Temperature for probabilistic routing (0 = deterministic)
 ```
 
-For more advanced configuration options including custom worker selection, block size tuning, and alternative indexing strategies, see the [Router Guide](../../../docs/router/router_guide.md).
+For more advanced configuration options including custom worker selection, block size tuning, and alternative indexing strategies, see the [Router Guide](../../../docs/components/router/router_guide.md).
 
 ## Cleanup
diff --git a/lib/bindings/kvbm/README.md b/lib/bindings/kvbm/README.md
index 40f796ee3d8..e04e17c5933 100644
--- a/lib/bindings/kvbm/README.md
+++ b/lib/bindings/kvbm/README.md
@@ -114,7 +114,7 @@ DYN_KVBM_CPU_CACHE_GB=100 vllm serve \
   Qwen/Qwen3-8B
 ```
 
-For more detailed integration with dynamo, disaggregated serving support and benchmarking, please check [vllm-setup](../../../docs/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-vllm)
+For more detailed integration with dynamo, disaggregated serving support and benchmarking, please check [vllm-setup](../../../docs/components/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-vllm)
 
 ### TensorRT-LLM
@@ -136,11 +136,11 @@ DYN_KVBM_CPU_CACHE_GB=100 trtllm-serve Qwen/Qwen3-8B \
   --extra_llm_api_options /tmp/kvbm_llm_api_config.yaml
 ```
 
-For more detailed integration with dynamo and benchmarking, please check [trtllm-setup](../../../docs/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-tensorrt-llm)
+For more detailed integration with dynamo and benchmarking, please check [trtllm-setup](../../../docs/components/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-tensorrt-llm)
 
 ## 📚 Docs
 
-- [Architecture](../../../docs/kvbm/README.md#architecture)
-- [Design Deepdive](../../../docs/kvbm/kvbm_design.md)
+- [Architecture](../../../docs/components/kvbm/README.md#architecture)
+- [Design Deepdive](../../../docs/design_docs/kvbm_design.md)
 - [NIXL Overview](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
diff --git a/tests/planner/README.md b/tests/planner/README.md
index d81a80bcda7..af02d96b5f8 100644
--- a/tests/planner/README.md
+++ b/tests/planner/README.md
@@ -23,7 +23,7 @@ Use the pre-configured test deployment with sample profiling data, we provide th
 
 ### Option B: Use Your Own Profiling Results
 
-1. Run pre-deployment profiling for your specific setup. See the [pre-deployment profiling documentation](../../docs/benchmarks/sla_driven_profiling.md) for detailed instructions.
+1. Run pre-deployment profiling for your specific setup. See the [pre-deployment profiling documentation](../../docs/components/profiler/profiler_guide.md) for detailed instructions.
 
 ## Interpolator Testing
@@ -166,7 +166,7 @@ Test complete scaling behavior including Kubernetes deployment and load generati
 
 **Prerequisites:**
 - **[kube-prometheus-stack](../../docs/kubernetes/observability/metrics.md) installed and running.** The SLA planner requires Prometheus to observe metrics and make scaling decisions.
-- Ensure the Dynamo operator was installed with the Prometheus endpoint configured (see [SLA Planner Quickstart Guide](../../docs/planner/sla_planner_quickstart.md#prerequisites) for details).
+- Ensure the Dynamo operator was installed with the Prometheus endpoint configured (see [SLA Planner Quickstart Guide](../../docs/components/planner/planner_guide.md#prerequisites) for details).
 
 **Prepare the test deployment manifest:**
@@ -209,7 +209,7 @@ Remove `volumes` and `volumeMounts`:
   - name: planner-profile-data
     configMap:
       # Must be pre-created before deployment by the profiler
-      # See docs/planner/sla_planner_quickstart.md for more details
+      # See docs/components/planner/planner_guide.md for more details
      name: planner-profile-data
 ```