Commit 7752ce2

athreesh, claude, dagil-nvidia, tedzhouhk, and cursoragent authored
docs: planner 3-tier documentation restructure (#5876)
Signed-off-by: athreesh <anish.maddipoti@utexas.edu>
Signed-off-by: dagil-nvidia <dagil@nvidia.com>
Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: dagil-nvidia <dagil@nvidia.com>
Co-authored-by: Hongkuan Zhou <tedzhouhk@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 8aa7335 commit 7752ce2

7 files changed

Lines changed: 1194 additions & 1 deletion

components/src/dynamo/planner/README.md

Lines changed: 6 additions & 1 deletion
@@ -15,4 +15,9 @@ See the License for the specific language governing permissions and
1515
limitations under the License.
1616
-->
1717

18-
Please refer to [planner docs](../../../../docs/planner/planner_intro.rst) for planner documentation.
18+
# Planner
19+
20+
SLA-driven autoscaling controller for Dynamo inference graphs.
21+
22+
- **User docs**: [docs/planner/](/docs/planner/) (deployment, configuration, examples)
23+
- **Design docs**: [docs/design_docs/planner_design.md](/docs/design_docs/planner_design.md) (architecture, algorithms)

docs/design_docs/planner_design.md

Lines changed: 218 additions & 0 deletions
@@ -0,0 +1,218 @@
1+
# Planner Design
2+
3+
> **Tier 3 design documentation** for contributors and architects. For user-facing docs, see [docs/planner/](/docs/planner/).
4+
5+
## Overview
6+
7+
The Planner is Dynamo's autoscaling controller. It observes system metrics, predicts future load, and adjusts prefill/decode worker replica counts to proactively meet SLA targets. This document covers the internal architecture, algorithms, and design trade-offs.
8+
9+
## Architecture
10+
11+
```text
12+
┌──────────────────────────────────────────────────────────┐
13+
│ Planner Component │
14+
│ │
15+
│ ┌───────────────┐ ┌───────────────┐ ┌────────────────┐ │
16+
│ │ Metric │ │ Load │ │ Performance │ │
17+
│ │ Collector │ │ Predictor │ │ Interpolator │ │
18+
│ │ (Prometheus) │ │ (ARIMA/etc.) │ │ (JSON data) │ │
19+
│ └───────┬───────┘ └───────┬───────┘ └───────┬────────┘ │
20+
│ │ │ │ │
21+
│ ▼ ▼ ▼ │
22+
│ ┌───────────────────────────────────────────────────┐ │
23+
│ │ Scaling Algorithm │ │
24+
│ └───────────────────────┬───────────────────────────┘ │
25+
│ │ │
26+
│ ┌───────────────────────▼───────────────────────────┐ │
27+
│ │ Connector Layer │ │
28+
│ │ ┌───────────────────┐ ┌───────────────────────┐ │ │
29+
│ │ │ KubernetesConn. │ │ VirtualConn. │ │ │
30+
│ │ │ (PATCH DGD) │ │ (Runtime bridge) │ │ │
31+
│ │ └───────────────────┘ └───────────────────────┘ │ │
32+
│ └───────────────────────────────────────────────────┘ │
33+
└──────────────────────────────────────────────────────────┘
34+
```
35+
36+
## Scaling Algorithm
37+
38+
### Step 1: Metric Collection
39+
40+
Every `adjustment_interval` seconds, the planner queries Prometheus for:
41+
42+
- Average TTFT and ITL over the interval
43+
- Total request count
44+
- Average input sequence length (ISL) and output sequence length (OSL)
45+
46+
The Prometheus query targets the Frontend's `/metrics` endpoint, which exposes histograms and counters.
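
As a rough illustration of how an interval average can be derived from cumulative histogram data, the sketch below divides the change in a histogram's sum by the change in its count between two scrapes. The `HistogramSnapshot` type and the `interval_average` helper are hypothetical names introduced here; they are not the planner's actual Prometheus client.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HistogramSnapshot:
    """Cumulative sum and count of a histogram at one instant (hypothetical type)."""

    total: float  # cumulative sum of observed values, e.g. seconds of TTFT
    count: float  # cumulative number of observations


def interval_average(start: HistogramSnapshot, end: HistogramSnapshot) -> Optional[float]:
    """Average observed value between two snapshots, or None if there were no new samples."""
    delta_count = end.count - start.count
    if delta_count <= 0:
        # No traffic during the interval, or the counter was reset.
        return None
    return (end.total - start.total) / delta_count


# Illustrative numbers: 50 new requests added 15 s of cumulative TTFT.
avg_ttft = interval_average(HistogramSnapshot(10.0, 100), HistogramSnapshot(25.0, 150))
```

Returning `None` for an empty interval lets the caller skip the adjustment rather than divide by zero.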
47+
48+
### Step 2: Correction Factor Calculation
49+
50+
The planner maintains correction factors that adapt profiling-based predictions to real-world behavior:
51+
52+
```text
53+
prefill_correction = actual_ttft / expected_ttft
54+
decode_correction = actual_itl / expected_itl
55+
```
56+
57+
These factors absorb effects that are hard to model, such as:
58+
59+
- **Request queueing**: Bursty traffic causes higher TTFT than profiled steady-state
60+
- **Prefix cache hits**: KV reuse reduces effective prefill tokens, lowering actual TTFT
61+
- **Chunked prefill in decode**: Small prefills processed in decode engine affect ITL
62+
- **Metric variance**: Average ISL/OSL may not represent the actual distribution
63+
64+
The correction factors are applied as multipliers to the next scaling decision. Passing `--no-correction` disables them, which is useful for debugging or when cold-start artifacts dominate.
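
A minimal sketch of the ratio described above, assuming latencies are given in the same unit for both arguments. The `correction_factor` helper and its `no_correction` parameter are illustrative names that mirror the `--no-correction` flag; the planner's real code may be structured differently.

```python
def correction_factor(actual: float, expected: float, no_correction: bool = False) -> float:
    """Ratio of observed to profiled latency; 1.0 means the profile matched reality."""
    if no_correction:
        # Mirrors the effect of --no-correction: leave predictions untouched.
        return 1.0
    if expected <= 0:
        raise ValueError("expected latency must be positive")
    return actual / expected


# Illustrative numbers: observed TTFT of 600 ms against a profiled 500 ms.
prefill_correction = correction_factor(actual=600.0, expected=500.0)
# Illustrative numbers: observed ITL of 40 ms against a profiled 50 ms.
decode_correction = correction_factor(actual=40.0, expected=50.0)
```

A value above 1 indicates the system is slower than profiling predicted; a value below 1 indicates it is faster.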
65+
66+
### Step 3: Load Prediction
67+
68+
The planner forecasts three values for the next interval:
69+
70+
- `next_num_req`: Number of requests
71+
- `next_isl`: Average input sequence length
72+
- `next_osl`: Average output sequence length
73+
74+
Four predictor implementations are available:
75+
76+
77+
| Predictor | Algorithm | Best For |
78+
| ------------ | ---------------------------------------- | -------------------------------- |
79+
| **Constant** | `next = current` | Stable workloads, long intervals |
80+
| **ARIMA** | Auto-ARIMA with optional log1p transform | Trending/seasonal patterns |
81+
| **Kalman** | Local linear trend Kalman filter | Bursty traffic |
82+
| **Prophet** | Facebook Prophet time-series model | Complex seasonality |
83+
84+
85+
All predictors support warm-starting from trace files (`--load-predictor-warmup-trace`).
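
To make the forecasting step concrete, here is a sketch of the simplest strategy in the table, `next = current`, applied to the three forecast values. The class and method names (`ConstantPredictor`, `add_data_point`, `predict_next`) are assumptions chosen for illustration and may not match `load_predictor.py`.

```python
from typing import Dict, Optional


class ConstantPredictor:
    """Predicts that the next interval will look like the most recent one."""

    def __init__(self) -> None:
        self._last: Optional[float] = None

    def add_data_point(self, value: float) -> None:
        # Only the latest observation matters for a constant forecast.
        self._last = value

    def predict_next(self) -> Optional[float]:
        # None until at least one observation has been recorded.
        return self._last


# One predictor per forecast quantity.
predictors: Dict[str, ConstantPredictor] = {
    name: ConstantPredictor() for name in ("num_req", "isl", "osl")
}

# Illustrative observations for the interval that just ended.
observed = {"num_req": 1200, "isl": 3000, "osl": 150}
for name, value in observed.items():
    predictors[name].add_data_point(value)

forecast = {f"next_{name}": p.predict_next() for name, p in predictors.items()}
```

The other predictors would expose the same two operations but keep a longer history.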
86+
87+
### Step 4: Replica Calculation
88+
89+
**Prefill replicas:**
90+
91+
```python
92+
predicted_load = next_num_req * next_isl / interval * min(1, prefill_correction)
93+
prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engine)
94+
```
95+
96+
The prefill correction factor has a linear effect on throughput because prefill is single-batched.
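
The prefill formula can be checked with a small worked example. This sketch assumes `interpolated_throughput` is expressed in prefill tokens per second per GPU; the function name and all input numbers are illustrative.

```python
from math import ceil


def prefill_replicas(
    next_num_req: float,
    next_isl: float,
    interval: float,
    prefill_correction: float,
    interpolated_throughput: float,  # assumed unit: tokens/s per GPU
    gpus_per_engine: int,
) -> int:
    """Apply the prefill replica formula from this section."""
    # Prefill tokens per second expected in the next interval. The correction
    # is capped at 1, so it can only reduce the predicted load.
    predicted_load = next_num_req * next_isl / interval * min(1, prefill_correction)
    return ceil(predicted_load / interpolated_throughput / gpus_per_engine)


# 1200 requests x 3000 tokens over 180 s = 20000 tokens/s of prefill load.
replicas = prefill_replicas(1200, 3000, 180, 1.2, 8000.0, 1)

# A correction of 0.5 halves the load to 10000 tokens/s.
replicas_with_cache_hits = prefill_replicas(1200, 3000, 180, 0.5, 8000.0, 1)
```

With these inputs the first call needs 2.5 engines' worth of throughput and rounds up; the second needs 1.25 and rounds up.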
97+
98+
**Decode replicas:**
99+
100+
```python
101+
# Apply correction to the ITL SLA target
102+
corrected_itl = target_itl / decode_correction
103+
104+
# Find best throughput/GPU that achieves corrected ITL at predicted context length
105+
throughput_per_gpu = decode_interpolator.find_best_throughput_per_gpu(
106+
itl=corrected_itl,
107+
context_length=next_isl + next_osl / 2
108+
)
109+
110+
# Calculate required replicas
111+
decode_replicas = ceil(next_num_req * next_osl / interval / throughput_per_gpu / gpus_per_engine)
112+
```
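
The decode calculation can be exercised end to end with a stand-in interpolator. `StubDecodeInterpolator` below is hypothetical: it uses made-up profiling points and ignores `context_length`, whereas the real interpolator works from profiling data. Only the method name `find_best_throughput_per_gpu` is taken from the snippet above.

```python
from math import ceil


class StubDecodeInterpolator:
    """Hypothetical stand-in with made-up (throughput per GPU, ITL in ms) points."""

    POINTS = [(500.0, 20.0), (1000.0, 35.0), (1500.0, 50.0), (2000.0, 80.0)]

    def find_best_throughput_per_gpu(self, itl: float, context_length: float) -> float:
        # Highest throughput whose ITL still meets the target; context_length
        # is ignored in this simplified stub.
        feasible = [thpt for thpt, point_itl in self.POINTS if point_itl <= itl]
        return max(feasible) if feasible else self.POINTS[0][0]


def decode_replicas(next_num_req, next_isl, next_osl, interval,
                    target_itl, decode_correction, gpus_per_engine, interpolator):
    """Apply the decode replica formula from this section."""
    # A correction above 1 tightens the ITL target passed to the interpolator.
    corrected_itl = target_itl / decode_correction
    throughput_per_gpu = interpolator.find_best_throughput_per_gpu(
        itl=corrected_itl,
        context_length=next_isl + next_osl / 2,
    )
    return ceil(next_num_req * next_osl / interval / throughput_per_gpu / gpus_per_engine)


stub = StubDecodeInterpolator()
# 3600 requests x 150 output tokens over 180 s = 3000 decode tokens/s.
slow_system = decode_replicas(3600, 3000, 150, 180, 50.0, 1.25, 1, stub)
on_profile = decode_replicas(3600, 3000, 150, 180, 50.0, 1.0, 1, stub)
```

With a correction of 1.25 the effective target drops to 40 ms, so the stub selects the 1000 tokens/s point; with no correction it selects the 1500 tokens/s point and fewer replicas are needed.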
113+
114+
### Step 5: Scaling Execution
115+
116+
The planner calls `connector.set_component_replicas()` with the calculated targets. Scaling is non-blocking by default: the planner continues monitoring while replicas are adjusting.
117+
118+
## Connector Design
119+
120+
### Interface
121+
122+
```python
123+
class PlannerConnector(ABC):
124+
async def add_component(self, component_name)
125+
async def remove_component(self, component_name)
126+
# Extended interface (not on ABC, but implemented by both connectors):
127+
async def set_component_replicas(self, targets, blocking)
128+
async def validate_deployment(self, ...)
129+
async def wait_for_deployment_ready(self)
130+
```
131+
132+
### KubernetesConnector
133+
134+
Directly PATCHes the DGD resource to update replica counts. The operator watches for DGD changes and reconciles component deployments.
135+
136+
**Design decisions:**
137+
138+
- Uses `DYN_PARENT_DGD_K8S_NAME` to find its parent DGD (injected by operator)
139+
- Resolves services by `subComponentType` field (prefill/decode), with fallback to legacy component names
140+
- Validates deployment structure on startup: checks that prefill and decode services exist and model names match
141+
142+
### VirtualConnector
143+
144+
For non-native environments (e.g., custom orchestrators). Writes scaling decisions to the distributed runtime via `VirtualConnectorCoordinator` (Rust binding). External systems use `VirtualConnectorClient` to poll decisions and report completion.
145+
146+
**Scaling decision flow:**
147+
148+
1. Planner writes `(num_prefill, num_decode, decision_id)` to runtime
149+
2. External system reads decision via `client.wait()`
150+
3. External system executes scaling
151+
4. External system reports completion via `client.complete(decision)`
152+
5. Planner sees `scaled_decision_id >= decision_id` and proceeds
153+
154+
**Timeout**: If scaling isn't acknowledged within 1800s (configurable), the planner proceeds with new decisions anyway.
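
The handshake and timeout can be sketched with an in-memory stand-in for the distributed runtime. Everything here is hypothetical (`InMemoryRuntime`, `Decision`, `planner_wait_for_ack`); it does not use the real `VirtualConnectorCoordinator` or `VirtualConnectorClient` bindings and only mirrors the five steps listed above.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    num_prefill: int
    num_decode: int
    decision_id: int


class InMemoryRuntime:
    """Hypothetical stand-in for the shared state kept in the distributed runtime."""

    def __init__(self) -> None:
        self.decision: Optional[Decision] = None
        self.scaled_decision_id = -1
        self._new_decision = asyncio.Event()

    def write_decision(self, decision: Decision) -> None:  # step 1
        self.decision = decision
        self._new_decision.set()

    async def wait(self) -> Decision:  # step 2
        await self._new_decision.wait()
        self._new_decision.clear()
        return self.decision

    def complete(self, decision: Decision) -> None:  # step 4
        self.scaled_decision_id = decision.decision_id


async def planner_wait_for_ack(runtime, decision_id, timeout_s, poll_s=0.01) -> bool:
    """Step 5: True once the decision is acknowledged, False if the timeout expires."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    while loop.time() < deadline:
        if runtime.scaled_decision_id >= decision_id:
            return True
        await asyncio.sleep(poll_s)
    return runtime.scaled_decision_id >= decision_id


async def external_orchestrator(runtime: InMemoryRuntime) -> None:
    decision = await runtime.wait()
    await asyncio.sleep(0.02)  # step 3: pretend to scale the workers
    runtime.complete(decision)


async def demo() -> bool:
    runtime = InMemoryRuntime()
    worker = asyncio.create_task(external_orchestrator(runtime))
    runtime.write_decision(Decision(num_prefill=2, num_decode=3, decision_id=0))
    acked = await planner_wait_for_ack(runtime, decision_id=0, timeout_s=1.0)
    await worker
    return acked


acked = asyncio.run(demo())
```

Comparing with `>=` rather than `==` means a late acknowledgement of a newer decision also unblocks the planner.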
155+
156+
## Performance Interpolation
157+
158+
The planner uses pre-deployment profiling data (NPZ files) to map (throughput, ISL/OSL, context_length) -> (TTFT, ITL). This data comes from the SLA-driven profiling process (either online GPU profiling or AI Configurator estimation).
159+
160+
Two interpolators are maintained:
161+
162+
- **Prefill interpolator**: Maps (throughput_per_gpu, ISL) -> TTFT
163+
- **Decode interpolator**: Maps (throughput_per_gpu, context_length) -> ITL
164+
165+
The granularity of the profiling sweep determines interpolation precision. A finer sweep requires more profiling samples but yields more accurate interpolation.
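
As a simplified illustration of interpolating between profiled points, the sketch below performs one-dimensional piecewise-linear interpolation of TTFT over ISL at a fixed throughput. The real interpolators work over two inputs, and the `interpolate` helper and sample values here are made up for the example.

```python
from bisect import bisect_left
from typing import Sequence


def interpolate(xs: Sequence[float], ys: Sequence[float], x: float) -> float:
    """Piecewise-linear interpolation over sorted xs, clamped to the profiled range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)  # xs[i - 1] < x <= xs[i]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Made-up profiling sweep: TTFT (ms) measured at four input sequence lengths.
profiled_isl = [1000, 2000, 4000, 8000]
profiled_ttft_ms = [80.0, 150.0, 320.0, 700.0]

# An ISL of 3000 falls halfway between the 2000 and 4000 samples.
ttft_ms = interpolate(profiled_isl, profiled_ttft_ms, 3000)
```

A coarser sweep would leave wider gaps between samples, so the straight-line estimate between them could drift further from the true latency curve.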
166+
167+
## Initialization
168+
169+
The planner starts with a 30-second delay (`INIT_PLANNER_START_DELAY`) to allow other components (frontend, workers) to register and stabilize. This is a known workaround (marked TODO in code) that should be replaced with a proper readiness check.
170+
171+
After the delay:
172+
173+
1. Initialize the connector (K8s or Virtual based on `--environment`)
174+
2. Validate deployment structure
175+
3. Load profiling results
176+
4. Build interpolators
177+
5. Initialize load predictor
178+
6. Enter main scaling loop
179+
180+
## Performance Considerations
181+
182+
- **Adjustment interval sizing**: The interval must be long enough for scaling operations to complete. If `adjustment_interval` is shorter than the time to add/remove a worker (which includes pod scheduling, model loading, and registration), scaling decisions will overlap. Default of 180s is conservative; workloads with fast model loading can use shorter intervals.
183+
- **Correction factor stability**: Correction factors are recalculated each interval. During traffic transitions (e.g., ramp-up), they can oscillate. The `--no-correction` flag disables correction for scenarios where cold-start artifacts dominate and distort the factor.
184+
- **Interpolation accuracy vs profiling cost**: Higher `prefillInterpolationGranularity` and `decodeInterpolationGranularity` in the profiling sweep produce more accurate interpolation but increase profiling time linearly. Default granularity (16 prefill, 6 decode) balances accuracy with profiling duration.
185+
- **Predictor warm-up period**: All predictors need observation history before making reliable forecasts. ARIMA and Prophet need multiple adjustment intervals of data. Kalman starts forecasting after `--kalman-min-points` observations. During warm-up, the planner uses the constant predictor as fallback.
186+
187+
## Known Limitations
188+
189+
1. **30-second startup delay**: Hardcoded wait for component registration. It should be replaced with runtime readiness probing.
190+
2. **Adjustment interval vs scaling latency**: If `adjustment_interval` < time to scale, scaling decisions can pile up. The planner logs warnings but doesn't queue.
191+
3. **Average-based interpolation**: The planner uses average ISL/OSL, which may not represent bimodal or heavy-tailed distributions well.
192+
4. **Single DGD scope**: Each planner instance manages exactly one DGD. Multi-model/multi-DGD coordination is not supported.
193+
5. **Load-based planner deprecated**: The load-based code path exists but is non-functional with current backends (no prefill queue metrics).
194+
195+
## Future Work
196+
197+
- Support aggregated (non-disaggregated) scaling mode for single-worker deployments
198+
- Multi-DGD coordination for shared-cluster scenarios
199+
- Distribution-aware interpolation (beyond mean ISL/OSL)
200+
- Adaptive adjustment interval based on observed scaling latency
201+
202+
## File Map
203+
204+
205+
| File | Size | Purpose |
206+
| ---------------------------- | ---- | ----------------------------------------------------- |
207+
| `planner_core.py` | 36k | Main scaling loop, algorithm implementation |
208+
| `perf_interpolation.py` | 13k | NPZ data loading and throughput/latency interpolation |
209+
| `load_predictor.py` | 16k | ARIMA, Prophet, Kalman, Constant predictors |
210+
| `pre_swept_results_utils.py` | 12k | Pre-computed H100/H200 profiling data loader |
211+
| `kubernetes_connector.py` | 11k | K8s API integration for DGD scaling |
212+
| `kube.py` | 7.4k | Low-level K8s client wrapper |
213+
| `exceptions.py` | 7.2k | Custom exception hierarchy |
214+
| `prometheus.py` | 7.3k | Prometheus query builder and client |
215+
| `defaults.py` | 8.1k | Default configs, backend name mappings |
216+
| `planner_argparse.py` | 6.2k | CLI argument definitions |
217+
218+

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -88,3 +88,4 @@ Quickstart
8888
Distributed Runtime <design_docs/distributed_runtime.md>
8989
Request Plane <design_docs/request_plane.md>
9090
Event Plane <design_docs/event_plane.md>
91+
Planner Design <design_docs/planner_design.md>

docs/planner/README.md

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
1+
<!--
2+
SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
3+
SPDX-License-Identifier: Apache-2.0
4+
5+
Licensed under the Apache License, Version 2.0 (the "License");
6+
you may not use this file except in compliance with the License.
7+
You may obtain a copy of the License at
8+
9+
http://www.apache.org/licenses/LICENSE-2.0
10+
11+
Unless required by applicable law or agreed to in writing, software
12+
distributed under the License is distributed on an "AS IS" BASIS,
13+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
See the License for the specific language governing permissions and
15+
limitations under the License.
16+
-->
17+
18+
# Planner
19+
20+
The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.
21+
22+
> **New to the Planner?** Start with the [SLA Planner Quick Start Guide](sla_planner_quickstart.md) for a complete workflow including profiling and deployment.
23+
24+
## Feature Matrix
25+
26+
| Category | Feature | Status |
27+
|----------|---------|--------|
28+
| **Backend** | Local (bare metal) | Deprecated |
29+
| | Kubernetes | Supported |
30+
| **LLM Framework** | vLLM | Supported |
31+
| | TensorRT-LLM | Supported |
32+
| | SGLang | Supported |
33+
| **Serving Type** | Aggregated | Unsupported |
34+
| | Disaggregated | Supported |
35+
| **Scaling Mode** | SLA-based (TTFT/ITL targets) | Supported (primary) |
36+
| | Load-based (KV cache/queue thresholds) | Deprecated |
37+
| **Load Predictors** | ARIMA | Supported |
38+
| | Prophet | Supported |
39+
| | Kalman filter | Supported |
40+
| | Constant (current = next) | Supported |
41+
| **Connectors** | KubernetesConnector (native DGD scaling) | Supported |
42+
| | VirtualConnector (external environments) | Supported |
43+
44+
## Quick Start
45+
46+
### Prerequisites
47+
48+
- Dynamo platform installed on Kubernetes ([Installation Guide](/docs/kubernetes/installation_guide.md))
49+
- kube-prometheus-stack installed ([Metrics Setup](/docs/kubernetes/observability/metrics.md))
50+
- Pre-deployment profiling completed ([Profiling Guide](/docs/benchmarks/sla_driven_profiling.md))
51+
52+
### Deploy with DGDR (Recommended)
53+
54+
The fastest path to a planner-enabled deployment is through a DynamoGraphDeploymentRequest:
55+
56+
```bash
57+
kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
58+
```
59+
60+
This automatically profiles your model and deploys with the SLA planner. See [SLA Planner Quick Start](sla_planner_quickstart.md) for the full workflow.
61+
62+
### Deploy with DGD (Manual)
63+
64+
For manual control, use the disaggregated planner templates:
65+
66+
```bash
67+
# After profiling is complete
68+
kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
69+
```
70+
71+
## Documentation
72+
73+
| Document | Description |
74+
|----------|-------------|
75+
| [Planner Guide](planner_guide.md) | Deployment, configuration, integration, troubleshooting |
76+
| [Planner Examples](planner_examples.md) | DGDR YAML examples, sample configurations, advanced patterns |
77+
| [SLA Planner Quick Start](sla_planner_quickstart.md) | End-to-end DGDR workflow: define SLAs, profile, deploy, monitor |
78+
| [SLA-based Planner](sla_planner.md) | Scaling algorithm, correction factors, load prediction details |
79+
| [Load-based Planner](load_planner.md) | Legacy load-based scaling (deprecated) |
80+
| [SLA-Driven Profiling](/docs/benchmarks/sla_driven_profiling.md) | Pre-deployment profiling process and configuration |
81+
| [Planner Design](/docs/design_docs/planner_design.md) | Architecture deep-dive for contributors |
82+
83+
## Configuration Reference
84+
85+
### Key Arguments
86+
87+
| Argument | Default | Description |
88+
|----------|---------|-------------|
89+
| `--namespace` | `$DYN_NAMESPACE` or `dynamo` | Dynamo logical namespace |
90+
| `--backend` | `vllm` | Backend framework (`vllm`, `sglang`, `trtllm`) |
91+
| `--environment` | `kubernetes` | Deployment environment |
92+
| `--adjustment-interval` | `180` | Seconds between scaling decisions |
93+
| `--ttft` | `500.0` | Target Time To First Token (ms) |
94+
| `--itl` | `50.0` | Target Inter-Token Latency (ms) |
95+
| `--isl` | `3000` | Expected average input sequence length |
96+
| `--osl` | `150` | Expected average output sequence length |
97+
| `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) |
98+
| `--max-gpu-budget` | `8` | Maximum GPUs across all workers |
99+
| `--min-endpoint` | `1` | Minimum replicas per worker type |
100+
| `--decode-engine-num-gpu` | `1` | GPUs per decode engine |
101+
| `--prefill-engine-num-gpu` | `1` | GPUs per prefill engine |
102+
| `--no-operation` | `false` | Observation mode (no actual scaling) |
103+
| `--no-correction` | `false` | Disable correction factors |
104+
| `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |
105+
106+
### Environment Variables
107+
108+
| Variable | Default | Description |
109+
|----------|---------|-------------|
110+
| `DYN_NAMESPACE` | `dynamo` | Dynamo logical namespace |
111+
| `DYN_PARENT_DGD_K8S_NAME` | (required) | Parent DGD K8s resource name |
112+
| `PROMETHEUS_ENDPOINT` | `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090` | Prometheus URL |
113+
| `PLANNER_PROMETHEUS_PORT` | `0` (disabled) | Port for planner's own Prometheus metrics |
114+
115+
## Monitoring
116+
117+
### Grafana Dashboard
118+
119+
Deploy the planner dashboard:
120+
121+
```bash
122+
kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
123+
```
124+
125+
The dashboard shows:
126+
- Worker counts and GPU usage over time
127+
- Observed TTFT, ITL, request rate, sequence lengths
128+
- Predicted load and recommended replica counts
129+
- Correction factors (actual vs. expected performance)
130+
131+
### Prometheus Metrics
132+
133+
The planner queries the frontend's `/metrics` endpoint via Prometheus. Required metrics:
134+
- Request count and duration
135+
- TTFT and ITL distributions
136+
- Input/output sequence lengths
