
Commit d399361

fix: address review comments and remove fern files
- Replace ASCII diagram with aligned version and add text lang tag
- Add text lang tag to correction formula code fence
- Fix sentence fragment in Known Limitations
- Fix broken anchor link in planner_examples.md
- Remove fern/ planner files (handled separately)
- Restore fern/versions/next.yml to main state

Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent a7bbf86 commit d399361

File tree

6 files changed

+65
-1115
lines changed


docs/design_docs/planner_design.md

Lines changed: 56 additions & 50 deletions
@@ -8,41 +8,37 @@ The Planner is Dynamo's autoscaling controller. It observes system metrics, pred
 
 ## Architecture
 
-```
-┌─────────────────────────────────────────────────────────┐
+```text
+┌──────────────────────────────────────────────────────────┐
 │ Planner Component │
-│ │
-│ ┌──────────────┐ ┌───────────────┐ ┌──────────────┐ │
-│ │ Metric │ │ Load │ │ Performance │ │
-│ │ Collector │ │ Predictor │ │ Interpolator │ │
-│ │ (Prometheus) │ │ (ARIMA/etc.) │ │ (JSON data) │ │
-│ └──────┬───────┘ └───────┬───────┘ └──────┬───────┘ │
-│ │ │ │ │
-│ ▼ ▼ ▼ │
-│ ┌─────────────────────────────────────────────────┐ │
-│ │ Scaling Algorithm │ │
-│ │ 1. Collect metrics (TTFT, ITL, req count, ISL) │ │
-│ │ 2. Compute correction factors │ │
-│ │ 3. Predict next-interval load │ │
-│ │ 4. Calculate optimal replica counts │ │
-│ │ 5. Issue scaling decision │ │
-│ └──────────────────────┬──────────────────────────┘ │
-│ │ │
-│ ┌──────────────────────▼──────────────────────────┐ │
-│ │ Connector Layer │ │
-│ │ ┌──────────────────┐ ┌──────────────────────┐ │ │
-│ │ │ KubernetesConn. │ │ VirtualConn. │ │ │
-│ │ │ (PATCH DGD) │ │ (Runtime bridge) │ │ │
-│ │ └──────────────────┘ └──────────────────────┘ │ │
-│ └─────────────────────────────────────────────────┘ │
-└─────────────────────────────────────────────────────────┘
+│ │
+│ ┌───────────────┐ ┌───────────────┐ ┌────────────────┐ │
+│ │ Metric │ │ Load │ │ Performance │ │
+│ │ Collector │ │ Predictor │ │ Interpolator │ │
+│ │ (Prometheus) │ │ (ARIMA/etc.) │ │ (JSON data) │ │
+│ └───────┬───────┘ └───────┬───────┘ └───────┬────────┘ │
+│ │ │ │ │
+│ ▼ ▼ ▼ │
+│ ┌───────────────────────────────────────────────────┐ │
+│ │ Scaling Algorithm │ │
+│ └───────────────────────┬───────────────────────────┘ │
+│ │ │
+│ ┌───────────────────────▼───────────────────────────┐ │
+│ │ Connector Layer │ │
+│ │ ┌───────────────────┐ ┌───────────────────────┐ │ │
+│ │ │ KubernetesConn. │ │ VirtualConn. │ │ │
+│ │ │ (PATCH DGD) │ │ (Runtime bridge) │ │ │
+│ │ └───────────────────┘ └───────────────────────┘ │ │
+│ └───────────────────────────────────────────────────┘ │
+└──────────────────────────────────────────────────────────┘
 ```
 
 ## Scaling Algorithm
 
 ### Step 1: Metric Collection
 
 Every `adjustment_interval` seconds, the planner queries Prometheus for:
+
 - Average TTFT and ITL over the interval
 - Total request count
 - Average input sequence length (ISL) and output sequence length (OSL)
@@ -53,12 +49,13 @@ The Prometheus query targets the Frontend's `/metrics` endpoint, which exposes h
 
 The planner maintains correction factors that adapt profiling-based predictions to real-world behavior:
 
-```
+```text
 prefill_correction = actual_ttft / expected_ttft
 decode_correction = actual_itl / expected_itl
 ```
 
 These factors account for hard-to-model effects such as:
+
 - **Request queueing**: Bursty traffic causes higher TTFT than profiled steady-state
 - **Prefix cache hits**: KV reuse reduces effective prefill tokens, lowering actual TTFT
 - **Chunked prefill in decode**: Small prefills processed in decode engine affect ITL
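As a worked illustration of the correction formula in this hunk, a minimal sketch; the metric values are invented, since in reality the planner reads actuals from Prometheus and expectations from the profiling interpolator:

```python
# Hypothetical interval metrics; the planner compares observed latencies
# against profiling-based expectations each adjustment interval.
actual_ttft, expected_ttft = 240.0, 200.0  # ms
actual_itl, expected_itl = 11.0, 10.0      # ms

prefill_correction = actual_ttft / expected_ttft  # > 1: prefill slower than profiled
decode_correction = actual_itl / expected_itl     # > 1: decode slower than profiled
```

A factor above 1 (e.g. from request queueing) means real latency exceeds the profiled prediction, so the planner discounts its capacity estimates accordingly.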
@@ -69,24 +66,28 @@ The correction factors are applied as multipliers to the next scaling decision.
 
 ### Step 3: Load Prediction
 
 The planner forecasts three values for the next interval:
+
 - `next_num_req`: Number of requests
 - `next_isl`: Average input sequence length
 - `next_osl`: Average output sequence length
 
 Four predictor implementations are available:
 
-| Predictor | Algorithm | Best For |
-|-----------|-----------|----------|
-| **Constant** | `next = current` | Stable workloads, long intervals |
-| **ARIMA** | Auto-ARIMA with optional log1p transform | Trending/seasonal patterns |
-| **Kalman** | Local linear trend Kalman filter | Bursty traffics |
-| **Prophet** | Facebook Prophet time-series model | Complex seasonality |
+
+| Predictor    | Algorithm                                | Best For                         |
+| ------------ | ---------------------------------------- | -------------------------------- |
+| **Constant** | `next = current`                         | Stable workloads, long intervals |
+| **ARIMA**    | Auto-ARIMA with optional log1p transform | Trending/seasonal patterns       |
+| **Kalman**   | Local linear trend Kalman filter         | Bursty traffic                   |
+| **Prophet**  | Facebook Prophet time-series model       | Complex seasonality              |
+
 
 All predictors support warm-starting from trace files (`--load-predictor-warmup-trace`).
 
 ### Step 4: Replica Calculation
 
 **Prefill replicas:**
+
 ```python
 predicted_load = next_requests * next_isl / interval * min(1, prefill_correction)
 prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engine)
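To make the predictor interface in the table above concrete, a toy sketch of the simplest strategy, the Constant predictor (`next = current`); the class shape is illustrative, not the actual `load_predictor.py` API:

```python
# Toy version of the "Constant" strategy: the forecast for the next
# interval is simply the most recent observation.
class ConstantPredictor:
    def __init__(self):
        self.history = []

    def observe(self, value):
        self.history.append(value)

    def predict(self):
        # No history yet -> nothing to forecast from.
        if not self.history:
            raise ValueError("no observations yet")
        return self.history[-1]  # next = current

p = ConstantPredictor()
for reqs in (100, 140, 130):  # request counts from past intervals
    p.observe(reqs)
next_num_req = p.predict()
```

The other predictors (ARIMA, Kalman, Prophet) share the same observe/predict shape but fit a model to the history instead of echoing the last point.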
@@ -95,6 +96,7 @@ prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engi
 The prefill correction factor has a linear effect on throughput because prefill is single-batched.
 
 **Decode replicas:**
+
 ```python
 # Apply correction to the ITL SLA target
 corrected_itl = target_itl / decode_correction_factor
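The prefill sizing formula shown in these hunks can be exercised end to end with invented numbers (the forecast, throughput, and GPU counts below are placeholders, not profiled values):

```python
from math import ceil

# Illustrative inputs; real values come from the load predictor and the
# performance interpolator, respectively.
next_requests, next_isl = 600, 3000      # forecast for the next interval
interval = 60.0                          # adjustment interval, seconds
prefill_correction = 1.2                 # capped at 1 for prefill sizing
interpolated_throughput = 8000.0         # prefill tokens/s per GPU at the TTFT target
gpus_per_engine = 1

# Same formula as in the document's prefill-replica snippet.
predicted_load = next_requests * next_isl / interval * min(1, prefill_correction)
prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engine)
```

Here the predicted load is 30,000 prefill tokens/s, which at 8,000 tokens/s per GPU rounds up to 4 prefill replicas.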
@@ -132,6 +134,7 @@ class PlannerConnector(ABC):
 Directly PATCHes the DGD resource to update replica counts. The operator watches for DGD changes and reconciles component deployments.
 
 **Design decisions:**
+
 - Uses `DYN_PARENT_DGD_K8S_NAME` to find its parent DGD (injected by operator)
 - Resolves services by `subComponentType` field (prefill/decode), with fallback to legacy component names
 - Validates deployment structure on startup: checks that prefill and decode services exist and model names match
@@ -141,6 +144,7 @@ Directly PATCHes the DGD resource to update replica counts. The operator watches
 For non-native environments (e.g., custom orchestrators). Writes scaling decisions to the distributed runtime via `VirtualConnectorCoordinator` (Rust binding). External systems use `VirtualConnectorClient` to poll decisions and report completion.
 
 **Scaling decision flow:**
+
 1. Planner writes `(num_prefill, num_decode, decision_id)` to runtime
 2. External system reads decision via `client.wait()`
 3. External system executes scaling
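The decision flow above can be mimicked with an in-memory stub; `FakeDecisionStream` and the decision dict shape are invented stand-ins for `VirtualConnectorClient`, whose real signatures are not shown in this diff:

```python
# In-memory stand-in for the poll/execute/report loop described above.
class FakeDecisionStream:
    def __init__(self, decisions):
        self._pending = list(decisions)

    def wait(self):
        # The real client blocks until the planner publishes a decision.
        return self._pending.pop(0)

def run_external_scaler(client, scale_fn):
    decision = client.wait()                                   # read decision
    scale_fn(decision["num_prefill"], decision["num_decode"])  # execute scaling
    return decision["decision_id"]                             # id to report back

applied = []
stream = FakeDecisionStream([{"num_prefill": 2, "num_decode": 4, "decision_id": 7}])
done_id = run_external_scaler(stream, lambda p, d: applied.append((p, d)))
```

The decision id is what the external system would report back so the planner knows the scaling operation completed.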
@@ -154,6 +158,7 @@ For non-native environments (e.g., custom orchestrators). Writes scaling decisio
 The planner uses pre-deployment profiling data (NPZ files) to map (throughput, ISL/OSL, context_length) -> (TTFT, ITL). This data comes from the SLA-driven profiling process (either online GPU profiling or AI Configurator estimation).
 
 Two interpolators are maintained:
+
 - **Prefill interpolator**: Maps (throughput_per_gpu, ISL) -> TTFT
 - **Decode interpolator**: Maps (throughput_per_gpu, context_length) -> ITL
@@ -164,6 +169,7 @@ The interpolators use the profiling sweep granularity to determine precision. Fi
 The planner starts with a 30-second delay (`INIT_PLANNER_START_DELAY`) to allow other components (frontend, workers) to register and stabilize. This is a known workaround (marked TODO in code) that should be replaced with a proper readiness check.
 
 After the delay:
+
 1. Initialize the connector (K8s or Virtual based on `--environment`)
 2. Validate deployment structure
 3. Load profiling results
@@ -174,16 +180,13 @@ After the delay:
 ## Performance Considerations
 
 - **Adjustment interval sizing**: The interval must be long enough for scaling operations to complete. If `adjustment_interval` is shorter than the time to add/remove a worker (which includes pod scheduling, model loading, and registration), scaling decisions will overlap. Default of 180s is conservative; workloads with fast model loading can use shorter intervals.
-
 - **Correction factor stability**: Correction factors are recalculated each interval. During traffic transitions (e.g., ramp-up), they can oscillate. The `--no-correction` flag disables correction for scenarios where cold-start artifacts dominate and distort the factor.
-
 - **Interpolation accuracy vs profiling cost**: Higher `prefillInterpolationGranularity` and `decodeInterpolationGranularity` in the profiling sweep produce more accurate interpolation but increase profiling time linearly. Default granularity (16 prefill, 6 decode) balances accuracy with profiling duration.
-
 - **Predictor warm-up period**: All predictors need observation history before making reliable forecasts. ARIMA and Prophet need multiple adjustment intervals of data. Kalman starts forecasting after `--kalman-min-points` observations. During warm-up, the planner uses the constant predictor as fallback.
 
 ## Known Limitations
 
-1. **30-second startup delay**: Hardcoded wait for component registration. Should be replaced with runtime readiness probing.
+1. **30-second startup delay**: Hardcoded wait for component registration. It should be replaced with runtime readiness probing.
 2. **Adjustment interval vs scaling latency**: If `adjustment_interval` < time to scale, scaling decisions can pile up. The planner logs warnings but doesn't queue.
 3. **Average-based interpolation**: The planner uses average ISL/OSL, which may not represent bimodal or heavy-tailed distributions well.
 4. **Single DGD scope**: Each planner instance manages exactly one DGD. Multi-model/multi-DGD coordination is not supported.
@@ -198,15 +201,18 @@ After the delay:
 
 ## File Map
 
-| File | Size | Purpose |
-|------|------|---------|
-| `planner_core.py` | 36k | Main scaling loop, algorithm implementation |
-| `perf_interpolation.py` | 13k | NPZ data loading and throughput/latency interpolation |
-| `load_predictor.py` | 16k | ARIMA, Prophet, Kalman, Constant predictors |
-| `pre_swept_results_utils.py` | 12k | Pre-computed H100/H200 profiling data loader |
-| `kubernetes_connector.py` | 11k | K8s API integration for DGD scaling |
-| `kube.py` | 7.4k | Low-level K8s client wrapper |
-| `exceptions.py` | 7.2k | Custom exception hierarchy |
-| `prometheus.py` | 7.3k | Prometheus query builder and client |
-| `defaults.py` | 8.1k | Default configs, backend name mappings |
-| `planner_argparse.py` | 6.2k | CLI argument definitions |
+
+| File                         | Size | Purpose                                               |
+| ---------------------------- | ---- | ----------------------------------------------------- |
+| `planner_core.py`            | 36k  | Main scaling loop, algorithm implementation           |
+| `perf_interpolation.py`      | 13k  | NPZ data loading and throughput/latency interpolation |
+| `load_predictor.py`          | 16k  | ARIMA, Prophet, Kalman, Constant predictors           |
+| `pre_swept_results_utils.py` | 12k  | Pre-computed H100/H200 profiling data loader          |
+| `kubernetes_connector.py`    | 11k  | K8s API integration for DGD scaling                   |
+| `kube.py`                    | 7.4k | Low-level K8s client wrapper                          |
+| `exceptions.py`              | 7.2k | Custom exception hierarchy                            |
+| `prometheus.py`              | 7.3k | Prometheus query builder and client                   |
+| `defaults.py`                | 8.1k | Default configs, backend name mappings                |
+| `planner_argparse.py`        | 6.2k | CLI argument definitions                              |
+

docs/planner/planner_examples.md

Lines changed: 1 addition & 1 deletion
@@ -229,7 +229,7 @@ Profiling runs against the real backend (via GPUs or AIC). The mocker deployment
 
 For large models, use a pre-populated PVC instead of downloading from HuggingFace:
 
-See [Model Cache PVC](/docs/benchmarks/sla_driven_profiling.md#model-cache-pvc-advanced) for configuration details.
+See [SLA-Driven Profiling](/docs/benchmarks/sla_driven_profiling.md) for configuration details.
 
 ## Advanced Examples
 
0 commit comments
