ai-dynamo
diff --git a/‎components/src/dynamo/planner/README.md‎
Lines changed: 6 additions & 1 deletion b/‎components/src/dynamo/planner/README.md‎
diff --git a/‎docs/design_docs/planner_design.md‎
Lines changed: 218 additions & 0 deletions b/‎docs/design_docs/planner_design.md‎
diff --git a/‎docs/index.rst‎
Lines changed: 1 addition & 0 deletions b/‎docs/index.rst‎
diff --git a/‎docs/planner/README.md‎
Lines changed: 136 additions & 0 deletions b/‎docs/planner/README.md‎
@@ -15,4 +15,9 @@ See the License for the specific language governing permissions and
 limitations under the License.
 -->
 
-Please refer to [planner docs](../../../../docs/planner/planner_intro.rst) for planner documentation.
+# Planner
+
+SLA-driven autoscaling controller for Dynamo inference graphs.
+
+- **User docs**: [docs/planner/](/docs/planner/) (deployment, configuration, examples)
+- **Design docs**: [docs/design_docs/planner_design.md](/docs/design_docs/planner_design.md) (architecture, algorithms)
@@ -0,0 +1,218 @@
+# Planner Design
+
+> **Tier 3 design documentation** for contributors and architects. For user-facing docs, see [docs/planner/](/docs/planner/).
+
+## Overview
+
+The Planner is Dynamo's autoscaling controller. It observes system metrics, predicts future load, and adjusts prefill/decode worker replica counts to proactively meet SLA targets. This document covers the internal architecture, algorithms, and design trade-offs.
+
+## Architecture
+
+```text
+┌──────────────────────────────────────────────────────────┐
+│                    Planner Component                     │
+│                                                          │
+│  ┌───────────────┐ ┌───────────────┐ ┌────────────────┐  │
+│  │    Metric     │ │     Load      │ │  Performance   │  │
+│  │   Collector   │ │   Predictor   │ │  Interpolator  │  │
+│  │  (Prometheus) │ │ (ARIMA/etc.)  │ │  (JSON data)   │  │
+│  └───────┬───────┘ └───────┬───────┘ └───────┬────────┘  │
+│          │                 │                  │          │
+│          ▼                 ▼                  ▼          │
+│  ┌───────────────────────────────────────────────────┐   │
+│  │              Scaling Algorithm                    │   │
+│  └───────────────────────┬───────────────────────────┘   │
+│                          │                               │
+│  ┌───────────────────────▼───────────────────────────┐   │
+│  │               Connector Layer                     │   │
+│  │  ┌───────────────────┐  ┌───────────────────────┐ │   │
+│  │  │ KubernetesConn.   │  │   VirtualConn.        │ │   │
+│  │  │ (PATCH DGD)       │  │   (Runtime bridge)    │ │   │
+│  │  └───────────────────┘  └───────────────────────┘ │   │
+│  └───────────────────────────────────────────────────┘   │
+└──────────────────────────────────────────────────────────┘
+```
+
+## Scaling Algorithm
+
+### Step 1: Metric Collection
+
+Every `adjustment_interval` seconds, the planner queries Prometheus for:
+
+- Average TTFT and ITL over the interval
+- Total request count
+- Average input sequence length (ISL) and output sequence length (OSL)
+
+The Prometheus query targets the Frontend's `/metrics` endpoint, which exposes histograms and counters.
+
+### Step 2: Correction Factor Calculation
+
+The planner maintains correction factors that adapt profiling-based predictions to real-world behavior:
+
+```text
+prefill_correction = actual_ttft / expected_ttft
+decode_correction  = actual_itl  / expected_itl
+```
+
+These factors account for hard-to-model effects such as:
+
+- **Request queueing**: Bursty traffic causes higher TTFT than profiled steady-state
+- **Prefix cache hits**: KV reuse reduces effective prefill tokens, lowering actual TTFT
+- **Chunked prefill in decode**: Small prefills processed in decode engine affect ITL
+- **Metric variance**: Average ISL/OSL may not represent the actual distribution
+
+The correction factors are applied as multipliers to the next scaling decision. Setting `--no-correction` disables this for debugging or when cold-start artifacts dominate.
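
A minimal sketch of the correction-factor ratios described above, assuming all latencies are in milliseconds. The helper name `correction_factors` and the fallback to `1.0` when no expected value is available are assumptions of this sketch, not the planner's actual code.

```python
def correction_factors(actual_ttft, expected_ttft, actual_itl, expected_itl):
    """Return (prefill_correction, decode_correction) as actual/expected ratios.

    Hypothetical helper: falls back to 1.0 (no correction) when the expected
    value is missing or non-positive. That fallback is an assumption here.
    """
    def ratio(actual, expected):
        if expected is None or expected <= 0:
            return 1.0
        return actual / expected

    return ratio(actual_ttft, expected_ttft), ratio(actual_itl, expected_itl)


# Observed TTFT is above the profiled value; observed ITL is below it.
prefill_correction, decode_correction = correction_factors(
    actual_ttft=600.0, expected_ttft=500.0, actual_itl=45.0, expected_itl=50.0
)
```

A factor above 1 means the system is slower than profiling predicted; a factor below 1 means it is faster.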
+
+### Step 3: Load Prediction
+
+The planner forecasts three values for the next interval:
+
+- `next_num_req`: Number of requests
+- `next_isl`: Average input sequence length
+- `next_osl`: Average output sequence length
+
+Four predictor implementations are available:
+
+
+| Predictor    | Algorithm                                | Best For                         |
+| ------------ | ---------------------------------------- | -------------------------------- |
+| **Constant** | `next = current`                         | Stable workloads, long intervals |
+| **ARIMA**    | Auto-ARIMA with optional log1p transform | Trending/seasonal patterns       |
+| **Kalman**   | Local linear trend Kalman filter         | Bursty traffic                   |
+| **Prophet**  | Facebook Prophet time-series model       | Complex seasonality              |
+
+
+All predictors support warm-starting from trace files (`--load-predictor-warmup-trace`).
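
As an illustration of the simplest row in the table, the constant predictor (`next = current`) can be sketched as below. The class and method names (`ConstantPredictor`, `add_data_point`, `predict_next`) are hypothetical and chosen for this example; the real classes in `load_predictor.py` may differ.

```python
class ConstantPredictor:
    """Predicts that the next interval looks like the most recent one."""

    def __init__(self):
        self.history = []

    def add_data_point(self, value):
        self.history.append(value)

    def predict_next(self):
        # With no observations yet there is nothing to forecast.
        if not self.history:
            return None
        return self.history[-1]


# One predictor per forecast quantity: request count, ISL, and OSL.
predictors = {name: ConstantPredictor() for name in ("num_req", "isl", "osl")}

# Two made-up observation intervals: (num_req, isl, osl).
for num_req, isl, osl in [(120, 3000, 150), (150, 2800, 160)]:
    predictors["num_req"].add_data_point(num_req)
    predictors["isl"].add_data_point(isl)
    predictors["osl"].add_data_point(osl)

next_num_req = predictors["num_req"].predict_next()
next_isl = predictors["isl"].predict_next()
next_osl = predictors["osl"].predict_next()
```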
+
+### Step 4: Replica Calculation
+
+**Prefill replicas:**
+
+```python
+predicted_load = next_num_req * next_isl / interval * min(1, prefill_correction)
+prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engine)
+```
+
+The prefill correction factor has a linear effect on throughput because prefill is single-batched.
+
+**Decode replicas:**
+
+```python
+# Apply correction to the ITL SLA target
+corrected_itl = target_itl / decode_correction
+
+# Find best throughput/GPU that achieves corrected ITL at predicted context length
+throughput_per_gpu = decode_interpolator.find_best_throughput_per_gpu(
+    itl=corrected_itl,
+    context_length=next_isl + next_osl / 2
+)
+
+# Calculate required replicas
+decode_replicas = ceil(next_num_req * next_osl / interval / throughput_per_gpu / gpus_per_engine)
+```
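
A worked example of the two replica formulas above, written as a runnable sketch. The throughput values stand in for interpolator output and, like the traffic numbers, are made up for illustration; the function names are hypothetical.

```python
from math import ceil


def prefill_replicas(next_num_req, next_isl, interval, prefill_correction,
                     throughput_per_gpu, gpus_per_engine):
    # Predicted prefill load in tokens/s, with the correction capped at 1.
    predicted_load = next_num_req * next_isl / interval * min(1, prefill_correction)
    return ceil(predicted_load / throughput_per_gpu / gpus_per_engine)


def decode_replicas(next_num_req, next_osl, interval,
                    throughput_per_gpu, gpus_per_engine):
    # Predicted decode load in generated tokens/s.
    return ceil(next_num_req * next_osl / interval / throughput_per_gpu / gpus_per_engine)


# 360 requests forecast for a 180 s interval, ISL 3000, OSL 150.
num_prefill = prefill_replicas(
    next_num_req=360, next_isl=3000, interval=180, prefill_correction=1.2,
    throughput_per_gpu=2500.0,  # assumed interpolated prefill tokens/s/GPU
    gpus_per_engine=1,
)
num_decode = decode_replicas(
    next_num_req=360, next_osl=150, interval=180,
    throughput_per_gpu=200.0,  # assumed result of the decode interpolator lookup
    gpus_per_engine=1,
)
```

Here the prefill load is 6000 tokens/s, which needs 3 replicas at 2500 tokens/s/GPU, and the decode load is 300 tokens/s, which needs 2 replicas at 200 tokens/s/GPU.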
+
+### Step 5: Scaling Execution
+
+The planner calls `connector.set_component_replicas()` with the calculated targets. Scaling is non-blocking by default: the planner continues monitoring while replicas are adjusting.
+
+## Connector Design
+
+### Interface
+
+```python
+class PlannerConnector(ABC):
+    async def add_component(self, component_name)
+    async def remove_component(self, component_name)
+    # Extended interface (not on ABC, but implemented by both connectors):
+    async def set_component_replicas(self, targets, blocking)
+    async def validate_deployment(self, ...)
+    async def wait_for_deployment_ready(self)
+```
+
+### KubernetesConnector
+
+Directly PATCHes the DGD resource to update replica counts. The operator watches for DGD changes and reconciles component deployments.
+
+**Design decisions:**
+
+- Uses `DYN_PARENT_DGD_K8S_NAME` to find its parent DGD (injected by operator)
+- Resolves services by `subComponentType` field (prefill/decode), with fallback to legacy component names
+- Validates deployment structure on startup: checks that prefill and decode services exist and model names match
+
+### VirtualConnector
+
+For non-native environments (e.g., custom orchestrators). Writes scaling decisions to the distributed runtime via `VirtualConnectorCoordinator` (Rust binding). External systems use `VirtualConnectorClient` to poll decisions and report completion.
+
+**Scaling decision flow:**
+
+1. Planner writes `(num_prefill, num_decode, decision_id)` to runtime
+2. External system reads decision via `client.wait()`
+3. External system executes scaling
+4. External system reports completion via `client.complete(decision)`
+5. Planner sees `scaled_decision_id >= decision_id` and proceeds
+
+**Timeout**: If scaling isn't acknowledged within 1800s (configurable), the planner proceeds with new decisions anyway.
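
The decision flow and timeout can be sketched with an in-memory stand-in for the runtime. `InMemoryRuntime` and all of its methods are hypothetical and exist only for this example; they are not the `VirtualConnectorCoordinator` or `VirtualConnectorClient` API.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    num_prefill: int
    num_decode: int
    decision_id: int


class InMemoryRuntime:
    """Hypothetical stand-in for the distributed runtime's decision store."""

    def __init__(self):
        self.latest = None
        self.scaled_decision_id = -1

    # Planner side (step 1): publish a new decision with an increasing id.
    def write_decision(self, num_prefill, num_decode):
        next_id = 0 if self.latest is None else self.latest.decision_id + 1
        self.latest = Decision(num_prefill, num_decode, next_id)
        return self.latest

    # External side (steps 2 and 4): read the decision, then report completion.
    def read_decision(self):
        return self.latest

    def complete(self, decision):
        self.scaled_decision_id = decision.decision_id

    # Planner side (step 5): proceed once acknowledged, or after the timeout.
    def planner_may_proceed(self, decision, issued_at, now, timeout_s=1800.0):
        acknowledged = self.scaled_decision_id >= decision.decision_id
        timed_out = (now - issued_at) >= timeout_s
        return acknowledged or timed_out


runtime = InMemoryRuntime()
decision = runtime.write_decision(num_prefill=2, num_decode=3)
before_ack = runtime.planner_may_proceed(decision, issued_at=0.0, now=10.0)

runtime.complete(runtime.read_decision())  # external system finished scaling
after_ack = runtime.planner_may_proceed(decision, issued_at=0.0, now=20.0)

unacked = runtime.write_decision(num_prefill=4, num_decode=3)
after_timeout = runtime.planner_may_proceed(unacked, issued_at=0.0, now=1800.0)
```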
+
+## Performance Interpolation
+
+The planner uses pre-deployment profiling data (NPZ files) to map (throughput, ISL/OSL, context_length) -> (TTFT, ITL). This data comes from the SLA-driven profiling process (either online GPU profiling or AI Configurator estimation).
+
+Two interpolators are maintained:
+
+- **Prefill interpolator**: Maps (throughput_per_gpu, ISL) -> TTFT
+- **Decode interpolator**: Maps (throughput_per_gpu, context_length) -> ITL
+
+The precision of the interpolators is determined by the granularity of the profiling sweep. A finer sweep requires more profiling samples but yields more accurate interpolation.
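
A simplified sketch of the decode lookup, reduced to one dimension: at a fixed context length, find the largest throughput per GPU whose interpolated ITL still meets the target. The function names and sample points are hypothetical, and plain piecewise-linear interpolation is an assumption of this sketch rather than the method used in `perf_interpolation.py`.

```python
def interpolate(xs, ys, x):
    """Piecewise-linear interpolation of y at x; xs must be sorted ascending."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        if x0 <= x <= x1:
            y0, y1 = ys[i], ys[i + 1]
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def best_throughput_for_itl(throughputs, itls, target_itl, steps=100):
    """Largest throughput on a dense grid whose interpolated ITL meets the target."""
    lo, hi = throughputs[0], throughputs[-1]
    best = None
    for i in range(steps + 1):
        t = lo + (hi - lo) * i / steps
        if interpolate(throughputs, itls, t) <= target_itl:
            best = t
    return best


# Made-up profiling samples at one context length: tokens/s/GPU -> ITL (ms).
throughputs = [50.0, 100.0, 200.0, 400.0]
itls = [20.0, 30.0, 50.0, 90.0]

best = best_throughput_for_itl(throughputs, itls, target_itl=40.0)
unreachable = best_throughput_for_itl(throughputs, itls, target_itl=10.0)
```

With these samples the ITL curve crosses 40 ms at 150 tokens/s/GPU, so `best` lands on the nearest grid point at or below that value, and `unreachable` is `None` because even the lowest profiled throughput exceeds a 10 ms target. A denser grid (larger `steps`) plays the same role as a finer profiling sweep: more points, tighter answer.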
+
+## Initialization
+
+The planner starts with a 30-second delay (`INIT_PLANNER_START_DELAY`) to allow other components (frontend, workers) to register and stabilize. This is a known workaround (marked TODO in code) that should be replaced with a proper readiness check.
+
+After the delay:
+
+1. Initialize the connector (K8s or Virtual based on `--environment`)
+2. Validate deployment structure
+3. Load profiling results
+4. Build interpolators
+5. Initialize load predictor
+6. Enter main scaling loop
+
+## Performance Considerations
+
+- **Adjustment interval sizing**: The interval must be long enough for scaling operations to complete. If `adjustment_interval` is shorter than the time to add/remove a worker (which includes pod scheduling, model loading, and registration), scaling decisions will overlap. The default of 180s is conservative; workloads with fast model loading can use shorter intervals.
+- **Correction factor stability**: Correction factors are recalculated each interval. During traffic transitions (e.g., ramp-up), they can oscillate. The `--no-correction` flag disables correction for scenarios where cold-start artifacts dominate and distort the factor.
+- **Interpolation accuracy vs profiling cost**: Higher `prefillInterpolationGranularity` and `decodeInterpolationGranularity` in the profiling sweep produce more accurate interpolation but increase profiling time linearly. Default granularity (16 prefill, 6 decode) balances accuracy with profiling duration.
+- **Predictor warm-up period**: All predictors need observation history before making reliable forecasts. ARIMA and Prophet need multiple adjustment intervals of data. Kalman starts forecasting after `--kalman-min-points` observations. During warm-up, the planner uses the constant predictor as fallback.
+
+## Known Limitations
+
+1. **30-second startup delay**: Hardcoded wait for component registration. It should be replaced with runtime readiness probing.
+2. **Adjustment interval vs scaling latency**: If `adjustment_interval` < time to scale, scaling decisions can pile up. The planner logs warnings but doesn't queue.
+3. **Average-based interpolation**: The planner uses average ISL/OSL, which may not represent bimodal or heavy-tailed distributions well.
+4. **Single DGD scope**: Each planner instance manages exactly one DGD. Multi-model/multi-DGD coordination is not supported.
+5. **Load-based planner deprecated**: The load-based code path exists but is non-functional with current backends (no prefill queue metrics).
+
+## Future Work
+
+- Support aggregated (non-disaggregated) scaling mode for single-worker deployments
+- Multi-DGD coordination for shared-cluster scenarios
+- Distribution-aware interpolation (beyond mean ISL/OSL)
+- Adaptive adjustment interval based on observed scaling latency
+
+## File Map
+
+
+| File                         | Size | Purpose                                               |
+| ---------------------------- | ---- | ----------------------------------------------------- |
+| `planner_core.py`            | 36k  | Main scaling loop, algorithm implementation           |
+| `perf_interpolation.py`      | 13k  | NPZ data loading and throughput/latency interpolation |
+| `load_predictor.py`          | 16k  | ARIMA, Prophet, Kalman, Constant predictors           |
+| `pre_swept_results_utils.py` | 12k  | Pre-computed H100/H200 profiling data loader          |
+| `kubernetes_connector.py`    | 11k  | K8s API integration for DGD scaling                   |
+| `kube.py`                    | 7.4k | Low-level K8s client wrapper                          |
+| `exceptions.py`              | 7.2k | Custom exception hierarchy                            |
+| `prometheus.py`              | 7.3k | Prometheus query builder and client                   |
+| `defaults.py`                | 8.1k | Default configs, backend name mappings                |
+| `planner_argparse.py`        | 6.2k | CLI argument definitions                              |
+
+
@@ -88,3 +88,4 @@ Quickstart
    Distributed Runtime <design_docs/distributed_runtime.md>
    Request Plane <design_docs/request_plane.md>
    Event Plane <design_docs/event_plane.md>
+   Planner Design <design_docs/planner_design.md>
@@ -0,0 +1,136 @@
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Planner
+
+The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.
+
+> **New to the Planner?** Start with the [SLA Planner Quick Start Guide](sla_planner_quickstart.md) for a complete workflow including profiling and deployment.
+
+## Feature Matrix
+
+| Category | Feature | Status |
+|----------|---------|--------|
+| **Backend** | Local (bare metal) | Deprecated |
+| | Kubernetes | Supported |
+| **LLM Framework** | vLLM | Supported |
+| | TensorRT-LLM | Supported |
+| | SGLang | Supported |
+| **Serving Type** | Aggregated | Unsupported |
+| | Disaggregated | Supported |
+| **Scaling Mode** | SLA-based (TTFT/ITL targets) | Supported (primary) |
+| | Load-based (KV cache/queue thresholds) | Deprecated |
+| **Load Predictors** | ARIMA | Supported |
+| | Prophet | Supported |
+| | Kalman filter | Supported |
+| | Constant (current = next) | Supported |
+| **Connectors** | KubernetesConnector (native DGD scaling) | Supported |
+| | VirtualConnector (external environments) | Supported |
+
+## Quick Start
+
+### Prerequisites
+
+- Dynamo platform installed on Kubernetes ([Installation Guide](/docs/kubernetes/installation_guide.md))
+- kube-prometheus-stack installed ([Metrics Setup](/docs/kubernetes/observability/metrics.md))
+- Pre-deployment profiling completed ([Profiling Guide](/docs/benchmarks/sla_driven_profiling.md))
+
+### Deploy with DGDR (Recommended)
+
+The fastest path to a planner-enabled deployment is through a DynamoGraphDeploymentRequest:
+
+```bash
+kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
+```
+
+This automatically profiles your model and deploys with the SLA planner. See [SLA Planner Quick Start](sla_planner_quickstart.md) for the full workflow.
+
+### Deploy with DGD (Manual)
+
+For manual control, use the disaggregated planner templates:
+
+```bash
+# After profiling is complete
+kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
+```
+
+## Documentation
+
+| Document | Description |
+|----------|-------------|
+| [Planner Guide](planner_guide.md) | Deployment, configuration, integration, troubleshooting |
+| [Planner Examples](planner_examples.md) | DGDR YAML examples, sample configurations, advanced patterns |
+| [SLA Planner Quick Start](sla_planner_quickstart.md) | End-to-end DGDR workflow: define SLAs, profile, deploy, monitor |
+| [SLA-based Planner](sla_planner.md) | Scaling algorithm, correction factors, load prediction details |
+| [Load-based Planner](load_planner.md) | Legacy load-based scaling (deprecated) |
+| [SLA-Driven Profiling](/docs/benchmarks/sla_driven_profiling.md) | Pre-deployment profiling process and configuration |
+| [Planner Design](/docs/design_docs/planner_design.md) | Architecture deep-dive for contributors |
+
+## Configuration Reference
+
+### Key Arguments
+
+| Argument | Default | Description |
+|----------|---------|-------------|
+| `--namespace` | `$DYN_NAMESPACE` or `dynamo` | Dynamo logical namespace |
+| `--backend` | `vllm` | Backend framework (`vllm`, `sglang`, `trtllm`) |
+| `--environment` | `kubernetes` | Deployment environment |
+| `--adjustment-interval` | `180` | Seconds between scaling decisions |
+| `--ttft` | `500.0` | Target Time To First Token (ms) |
+| `--itl` | `50.0` | Target Inter-Token Latency (ms) |
+| `--isl` | `3000` | Expected average input sequence length |
+| `--osl` | `150` | Expected average output sequence length |
+| `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) |
+| `--max-gpu-budget` | `8` | Maximum GPUs across all workers |
+| `--min-endpoint` | `1` | Minimum replicas per worker type |
+| `--decode-engine-num-gpu` | `1` | GPUs per decode engine |
+| `--prefill-engine-num-gpu` | `1` | GPUs per prefill engine |
+| `--no-operation` | `false` | Observation mode (no actual scaling) |
+| `--no-correction` | `false` | Disable correction factors |
+| `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |
+
+### Environment Variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `DYN_NAMESPACE` | `dynamo` | Dynamo logical namespace |
+| `DYN_PARENT_DGD_K8S_NAME` | (required) | Parent DGD K8s resource name |
+| `PROMETHEUS_ENDPOINT` | `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090` | Prometheus URL |
+| `PLANNER_PROMETHEUS_PORT` | `0` (disabled) | Port for planner's own Prometheus metrics |
+
+## Monitoring
+
+### Grafana Dashboard
+
+Deploy the planner dashboard:
+
+```bash
+kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
+```
+
+The dashboard shows:
+- Worker counts and GPU usage over time
+- Observed TTFT, ITL, request rate, sequence lengths
+- Predicted load and recommended replica counts
+- Correction factors (actual vs. expected performance)
+
+### Prometheus Metrics
+
+The planner queries the frontend's `/metrics` endpoint via Prometheus. Required metrics:
+- Request count and duration
+- TTFT and ITL distributions
+- Input/output sequence lengths