| 1 | +# Planner Design |
| 2 | + |
| 3 | +> **Tier 3 design documentation** for contributors and architects. For user-facing docs, see [docs/planner/](/docs/planner/). |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +The Planner is Dynamo's autoscaling controller. It observes system metrics, predicts future load, and adjusts prefill/decode worker replica counts to proactively meet SLA targets. This document covers the internal architecture, algorithms, and design trade-offs. |
| 8 | + |
| 9 | +## Architecture |
| 10 | + |
| 11 | +```text |
| 12 | +┌──────────────────────────────────────────────────────────┐ |
| 13 | +│ Planner Component │ |
| 14 | +│ │ |
| 15 | +│ ┌───────────────┐ ┌───────────────┐ ┌────────────────┐ │ |
| 16 | +│ │ Metric │ │ Load │ │ Performance │ │ |
| 17 | +│ │ Collector │ │ Predictor │ │ Interpolator │ │ |
| 18 | +│ │ (Prometheus) │ │ (ARIMA/etc.) │ │ (JSON data) │ │ |
| 19 | +│ └───────┬───────┘ └───────┬───────┘ └───────┬────────┘ │ |
| 20 | +│ │ │ │ │ |
| 21 | +│ ▼ ▼ ▼ │ |
| 22 | +│ ┌───────────────────────────────────────────────────┐ │ |
| 23 | +│ │ Scaling Algorithm │ │ |
| 24 | +│ └───────────────────────┬───────────────────────────┘ │ |
| 25 | +│ │ │ |
| 26 | +│ ┌───────────────────────▼───────────────────────────┐ │ |
| 27 | +│ │ Connector Layer │ │ |
| 28 | +│ │ ┌───────────────────┐ ┌───────────────────────┐ │ │ |
| 29 | +│ │ │ KubernetesConn. │ │ VirtualConn. │ │ │ |
| 30 | +│ │ │ (PATCH DGD) │ │ (Runtime bridge) │ │ │ |
| 31 | +│ │ └───────────────────┘ └───────────────────────┘ │ │ |
| 32 | +│ └───────────────────────────────────────────────────┘ │ |
| 33 | +└──────────────────────────────────────────────────────────┘ |
| 34 | +``` |
| 35 | + |
| 36 | +## Scaling Algorithm |
| 37 | + |
| 38 | +### Step 1: Metric Collection |
| 39 | + |
| 40 | +Every `adjustment_interval` seconds, the planner queries Prometheus for: |
| 41 | + |
| 42 | +- Average time to first token (TTFT) and inter-token latency (ITL) over the interval |
| 43 | +- Total request count |
| 44 | +- Average input sequence length (ISL) and output sequence length (OSL) |
| 45 | + |
| 46 | +The Prometheus query targets the Frontend's `/metrics` endpoint, which exposes histograms and counters. |
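
As a rough illustration of how interval averages can be derived from histograms and counters, the sketch below builds PromQL strings using the standard `rate(..._sum) / rate(..._count)` pattern. The metric names (`ttft_seconds`, `requests_total`) and helper functions are hypothetical placeholders, not the names the Frontend actually exports.

```python
def avg_from_histogram(metric: str, interval_s: int) -> str:
    """PromQL for the mean of a histogram metric over the last interval."""
    window = f"{interval_s}s"
    return (
        f"sum(rate({metric}_sum[{window}])) / "
        f"sum(rate({metric}_count[{window}]))"
    )


def total_from_counter(metric: str, interval_s: int) -> str:
    """PromQL for the increase of a counter over the last interval."""
    return f"sum(increase({metric}[{interval_s}s]))"


# Hypothetical metric names, for illustration only.
ttft_query = avg_from_histogram("ttft_seconds", 180)
req_query = total_from_counter("requests_total", 180)
print(ttft_query)
print(req_query)
```
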
| 47 | + |
| 48 | +### Step 2: Correction Factor Calculation |
| 49 | + |
| 50 | +The planner maintains correction factors that adapt profiling-based predictions to real-world behavior: |
| 51 | + |
| 52 | +```text |
| 53 | +prefill_correction = actual_ttft / expected_ttft |
| 54 | +decode_correction = actual_itl / expected_itl |
| 55 | +``` |
| 56 | + |
| 57 | +These factors account for effects that are hard to model, such as: |
| 58 | + |
| 59 | +- **Request queueing**: Bursty traffic causes higher TTFT than profiled steady-state |
| 60 | +- **Prefix cache hits**: KV reuse reduces effective prefill tokens, lowering actual TTFT |
| 61 | +- **Chunked prefill in decode**: Small prefills processed in decode engine affect ITL |
| 62 | +- **Metric variance**: Average ISL/OSL may not represent the actual distribution |
| 63 | + |
| 64 | +The correction factors are applied as multipliers to the next scaling decision. Setting `--no-correction` disables this for debugging or when cold-start artifacts dominate. |
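
A minimal sketch of the ratio computation follows. The helper name and the sample latencies are made up for illustration; a factor above 1 means the system is slower than the profiling data predicted, and a factor below 1 means it is faster.

```python
def correction_factor(actual: float, expected: float) -> float:
    """Ratio of observed latency to the latency predicted from profiling data."""
    if expected <= 0:
        raise ValueError("expected latency must be positive")
    return actual / expected


# Hypothetical measurements for one adjustment interval, in seconds.
prefill_correction = correction_factor(actual=0.30, expected=0.20)  # slower than profiled
decode_correction = correction_factor(actual=0.018, expected=0.020)  # faster than profiled

print(f"prefill_correction={prefill_correction:.2f}")
print(f"decode_correction={decode_correction:.2f}")
```
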
| 65 | + |
| 66 | +### Step 3: Load Prediction |
| 67 | + |
| 68 | +The planner forecasts three values for the next interval: |
| 69 | + |
| 70 | +- `next_num_req`: Number of requests |
| 71 | +- `next_isl`: Average input sequence length |
| 72 | +- `next_osl`: Average output sequence length |
| 73 | + |
| 74 | +Four predictor implementations are available: |
| 75 | + |
| 76 | + |
| 77 | +| Predictor | Algorithm | Best For | |
| 78 | +| ------------ | ---------------------------------------- | -------------------------------- | |
| 79 | +| **Constant** | `next = current` | Stable workloads, long intervals | |
| 80 | +| **ARIMA** | Auto-ARIMA with optional log1p transform | Trending/seasonal patterns | |
| 81 | +| **Kalman** | Local linear trend Kalman filter | Bursty traffic | |
| 82 | +| **Prophet** | Facebook Prophet time-series model | Complex seasonality | |
| 83 | + |
| 84 | + |
| 85 | +All predictors support warm-starting from trace files (`--load-predictor-warmup-trace`). |
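
To make the predictor contract concrete, here is a sketch of the simplest case (`next = current`). The class and method names are assumptions chosen for illustration and may not match `load_predictor.py`.

```python
from collections import deque


class ConstantPredictor:
    """Predicts that the next value equals the most recent observation."""

    def __init__(self, window: int = 50):
        # Bounded history; only the last point is used by this predictor.
        self.history = deque(maxlen=window)

    def add_data_point(self, value: float) -> None:
        self.history.append(value)

    def predict_next(self) -> float:
        if not self.history:
            raise ValueError("no observations yet")
        return self.history[-1]


# One predictor per forecast quantity (request count, ISL, OSL).
num_req_predictor = ConstantPredictor()
for observed in (100, 120, 90):
    num_req_predictor.add_data_point(observed)
next_num_req = num_req_predictor.predict_next()
```

The other predictors would expose the same two operations but fit a model to the history instead of returning the last point.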
| 86 | + |
| 87 | +### Step 4: Replica Calculation |
| 88 | + |
| 89 | +**Prefill replicas:** |
| 90 | + |
| 91 | +```python |
| 92 | +predicted_load = next_num_req * next_isl / interval * min(1, prefill_correction) |
| 93 | +prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engine) |
| 94 | +``` |
| 95 | + |
| 96 | +The prefill correction factor has a linear effect on throughput because prefill is single-batched. |
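
A worked example of the prefill formula, wrapped in a hypothetical helper. It assumes `interpolated_throughput` is expressed in prefill tokens per second per GPU; all input numbers are invented.

```python
from math import ceil


def prefill_replicas(next_num_req, next_isl, interval, prefill_correction,
                     interpolated_throughput, gpus_per_engine):
    # Prefill tokens per second expected in the next interval,
    # scaled by the correction factor capped at 1.
    predicted_load = next_num_req * next_isl / interval * min(1, prefill_correction)
    return ceil(predicted_load / interpolated_throughput / gpus_per_engine)


# 900 requests of 3000 input tokens over 180 s is 15000 tokens/s.
replicas = prefill_replicas(
    next_num_req=900,
    next_isl=3000,
    interval=180,
    prefill_correction=1.2,        # capped to 1 by min()
    interpolated_throughput=4000,  # tokens/s/GPU (assumed unit)
    gpus_per_engine=1,
)
print(replicas)  # 15000 / 4000 = 3.75, rounded up to 4
```
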
| 97 | + |
| 98 | +**Decode replicas:** |
| 99 | + |
| 100 | +```python |
| 101 | +# Apply correction to the ITL SLA target |
| 102 | +corrected_itl = target_itl / decode_correction |
| 103 | + |
| 104 | +# Find best throughput/GPU that achieves corrected ITL at predicted context length |
| 105 | +throughput_per_gpu = decode_interpolator.find_best_throughput_per_gpu( |
| 106 | +    itl=corrected_itl, |
| 107 | +    context_length=next_isl + next_osl / 2,  # average context length over the decode phase |
| 108 | +) |
| 109 | + |
| 110 | +# Calculate required replicas |
| 111 | +decode_replicas = ceil(next_num_req * next_osl / interval / throughput_per_gpu / gpus_per_engine) |
| 112 | +``` |
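
The decode calculation can be exercised end to end with a stand-in interpolator. `StubDecodeInterpolator` and its toy throughput model are purely hypothetical; only the method name `find_best_throughput_per_gpu` comes from the snippet above.

```python
from math import ceil


class StubDecodeInterpolator:
    """Hypothetical stand-in for the real decode interpolator."""

    def find_best_throughput_per_gpu(self, itl: float, context_length: float) -> float:
        # Toy model: tighter ITL targets and longer contexts reduce throughput.
        base = 2000.0  # tokens/s/GPU, invented number
        return base * min(1.0, itl / 0.05) * min(1.0, 4000 / context_length)


def decode_replicas(interp, next_num_req, next_isl, next_osl, interval,
                    target_itl, decode_correction, gpus_per_engine):
    # A correction above 1 (slower than profiled) tightens the ITL target.
    corrected_itl = target_itl / decode_correction
    throughput_per_gpu = interp.find_best_throughput_per_gpu(
        itl=corrected_itl,
        context_length=next_isl + next_osl / 2,
    )
    # Output tokens per second divided by per-engine capacity.
    return ceil(next_num_req * next_osl / interval / throughput_per_gpu / gpus_per_engine)


replicas = decode_replicas(
    StubDecodeInterpolator(),
    next_num_req=900, next_isl=3000, next_osl=500, interval=180,
    target_itl=0.05, decode_correction=1.25, gpus_per_engine=1,
)
print(replicas)
```

With these invented inputs the decode load is 2500 tokens/s and the stub returns roughly 1600 tokens/s/GPU, so two replicas are requested.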
| 113 | + |
| 114 | +### Step 5: Scaling Execution |
| 115 | + |
| 116 | +The planner calls `connector.set_component_replicas()` with the calculated targets. Scaling is non-blocking by default: the planner continues monitoring while replicas are adjusting. |
| 117 | + |
| 118 | +## Connector Design |
| 119 | + |
| 120 | +### Interface |
| 121 | + |
| 122 | +```python |
| 123 | +class PlannerConnector(ABC): |
| 124 | +    async def add_component(self, component_name): ... |
| 125 | +    async def remove_component(self, component_name): ... |
| 126 | +    # Extended interface (not on ABC, but implemented by both connectors): |
| 127 | +    async def set_component_replicas(self, targets, blocking): ... |
| 128 | +    async def validate_deployment(self, *args, **kwargs): ...  # parameters elided here |
| 129 | +    async def wait_for_deployment_ready(self): ... |
| 130 | +``` |
| 131 | + |
| 132 | +### KubernetesConnector |
| 133 | + |
| 134 | +Directly PATCHes the DynamoGraphDeployment (DGD) resource to update replica counts. The operator watches for DGD changes and reconciles component deployments. |
| 135 | + |
| 136 | +**Design decisions:** |
| 137 | + |
| 138 | +- Uses `DYN_PARENT_DGD_K8S_NAME` to find its parent DGD (injected by operator) |
| 139 | +- Resolves services by `subComponentType` field (prefill/decode), with fallback to legacy component names |
| 140 | +- Validates deployment structure on startup: checks that prefill and decode services exist and model names match |
| 141 | + |
| 142 | +### VirtualConnector |
| 143 | + |
| 144 | +For non-native environments (e.g., custom orchestrators). Writes scaling decisions to the distributed runtime via `VirtualConnectorCoordinator` (Rust binding). External systems use `VirtualConnectorClient` to poll decisions and report completion. |
| 145 | + |
| 146 | +**Scaling decision flow:** |
| 147 | + |
| 148 | +1. Planner writes `(num_prefill, num_decode, decision_id)` to runtime |
| 149 | +2. External system reads decision via `client.wait()` |
| 150 | +3. External system executes scaling |
| 151 | +4. External system reports completion via `client.complete(decision)` |
| 152 | +5. Planner sees `scaled_decision_id >= decision_id` and proceeds |
| 153 | + |
| 154 | +**Timeout**: If scaling isn't acknowledged within 1800s (configurable), the planner proceeds with new decisions anyway. |
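
The decision-ID handshake and timeout can be sketched with an in-memory stand-in for the runtime state. Everything here (`SharedState`, `write_decision`, `complete`, `wait_for_ack`) is a hypothetical simplification, not the `VirtualConnectorCoordinator` or `VirtualConnectorClient` API.

```python
import time
from dataclasses import dataclass


@dataclass
class SharedState:
    """Hypothetical stand-in for state kept in the distributed runtime."""
    num_prefill: int = 0
    num_decode: int = 0
    decision_id: int = -1
    scaled_decision_id: int = -1


def write_decision(state: SharedState, num_prefill: int, num_decode: int) -> int:
    """Planner side: publish new targets under a fresh decision ID."""
    state.num_prefill, state.num_decode = num_prefill, num_decode
    state.decision_id += 1
    return state.decision_id


def complete(state: SharedState, decision_id: int) -> None:
    """External system side: acknowledge that a decision has been applied."""
    state.scaled_decision_id = max(state.scaled_decision_id, decision_id)


def wait_for_ack(state: SharedState, decision_id: int,
                 timeout_s: float = 1800.0, poll_s: float = 0.01) -> bool:
    """Planner side: poll until the decision is acknowledged or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if state.scaled_decision_id >= decision_id:
            return True
        time.sleep(poll_s)
    return False


state = SharedState()
decision = write_decision(state, num_prefill=2, num_decode=4)
timed_out = not wait_for_ack(state, decision, timeout_s=0.05)  # nobody acknowledged yet
complete(state, decision)
acknowledged = wait_for_ack(state, decision, timeout_s=0.05)
```

Comparing with `>=` rather than `==` lets a late acknowledgement of a newer decision also release waits on older ones.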
| 155 | + |
| 156 | +## Performance Interpolation |
| 157 | + |
| 158 | +The planner uses pre-deployment profiling data (NPZ files) to map (throughput, ISL/OSL, context_length) -> (TTFT, ITL). This data comes from the SLA-driven profiling process (either online GPU profiling or AI Configurator estimation). |
| 159 | + |
| 160 | +Two interpolators are maintained: |
| 161 | + |
| 162 | +- **Prefill interpolator**: Maps (throughput_per_gpu, ISL) -> TTFT |
| 163 | +- **Decode interpolator**: Maps (throughput_per_gpu, context_length) -> ITL |
| 164 | + |
| 165 | +The precision of the interpolators is determined by the granularity of the profiling sweep. Finer granularity requires more profiling samples but yields more accurate interpolation. |
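
As a simplified illustration of the inverse lookup used for decode, the sketch below finds the highest throughput that still meets an ITL target along a single slice of profiled points at a fixed context length. It uses plain linear interpolation on invented data; the real interpolators work over two dimensions and may use a different scheme.

```python
def best_throughput_for_itl(samples, target_itl):
    """Highest throughput/GPU whose interpolated ITL stays within target_itl.

    samples: list of (throughput_per_gpu, itl) pairs sorted by throughput,
             with ITL non-decreasing as throughput grows.
    """
    if target_itl < samples[0][1]:
        raise ValueError("target ITL is tighter than any profiled point")
    best = samples[0][0]
    for (t0, i0), (t1, i1) in zip(samples, samples[1:]):
        if target_itl >= i1:
            best = t1  # the whole segment satisfies the target
        else:
            # Target falls inside this segment: interpolate linearly.
            frac = (target_itl - i0) / (i1 - i0)
            return t0 + frac * (t1 - t0)
    return best


# Invented profiling slice: (tokens/s/GPU, ITL in seconds).
profile = [(500.0, 0.010), (1000.0, 0.020), (2000.0, 0.050)]
throughput = best_throughput_for_itl(profile, target_itl=0.035)
print(round(throughput))
```

A denser sweep shortens each segment, which is why finer granularity reduces interpolation error.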
| 166 | + |
| 167 | +## Initialization |
| 168 | + |
| 169 | +The planner starts with a 30-second delay (`INIT_PLANNER_START_DELAY`) to allow other components (frontend, workers) to register and stabilize. This is a known workaround (marked TODO in code) that should be replaced with a proper readiness check. |
| 170 | + |
| 171 | +After the delay: |
| 172 | + |
| 173 | +1. Initialize the connector (K8s or Virtual based on `--environment`) |
| 174 | +2. Validate deployment structure |
| 175 | +3. Load profiling results |
| 176 | +4. Build interpolators |
| 177 | +5. Initialize load predictor |
| 178 | +6. Enter main scaling loop |
| 179 | + |
| 180 | +## Performance Considerations |
| 181 | + |
| 182 | +- **Adjustment interval sizing**: The interval must be long enough for scaling operations to complete. If `adjustment_interval` is shorter than the time to add/remove a worker (which includes pod scheduling, model loading, and registration), scaling decisions will overlap. The default of 180s is conservative; workloads with fast model loading can use shorter intervals. |
| 183 | +- **Correction factor stability**: Correction factors are recalculated each interval. During traffic transitions (e.g., ramp-up), they can oscillate. The `--no-correction` flag disables correction for scenarios where cold-start artifacts dominate and distort the factor. |
| 184 | +- **Interpolation accuracy vs profiling cost**: Higher `prefillInterpolationGranularity` and `decodeInterpolationGranularity` in the profiling sweep produce more accurate interpolation but increase profiling time linearly. Default granularity (16 prefill, 6 decode) balances accuracy with profiling duration. |
| 185 | +- **Predictor warm-up period**: All predictors need observation history before making reliable forecasts. ARIMA and Prophet need multiple adjustment intervals of data. Kalman starts forecasting after `--kalman-min-points` observations. During warm-up, the planner uses the constant predictor as fallback. |
| 186 | + |
| 187 | +## Known Limitations |
| 188 | + |
| 189 | +1. **30-second startup delay**: Hardcoded wait for component registration. It should be replaced with runtime readiness probing. |
| 190 | +2. **Adjustment interval vs scaling latency**: If `adjustment_interval` < time to scale, scaling decisions can pile up. The planner logs warnings but doesn't queue. |
| 191 | +3. **Average-based interpolation**: The planner uses average ISL/OSL, which may not represent bimodal or heavy-tailed distributions well. |
| 192 | +4. **Single DGD scope**: Each planner instance manages exactly one DGD. Multi-model/multi-DGD coordination is not supported. |
| 193 | +5. **Load-based planner deprecated**: The load-based code path exists but is non-functional with current backends (no prefill queue metrics). |
| 194 | + |
| 195 | +## Future Work |
| 196 | + |
| 197 | +- Support aggregated (non-disaggregated) scaling mode for single-worker deployments |
| 198 | +- Multi-DGD coordination for shared-cluster scenarios |
| 199 | +- Distribution-aware interpolation (beyond mean ISL/OSL) |
| 200 | +- Adaptive adjustment interval based on observed scaling latency |
| 201 | + |
| 202 | +## File Map |
| 203 | + |
| 204 | + |
| 205 | +| File | Size | Purpose | |
| 206 | +| ---------------------------- | ---- | ----------------------------------------------------- | |
| 207 | +| `planner_core.py` | 36k | Main scaling loop, algorithm implementation | |
| 208 | +| `perf_interpolation.py` | 13k | NPZ data loading and throughput/latency interpolation | |
| 209 | +| `load_predictor.py` | 16k | ARIMA, Prophet, Kalman, Constant predictors | |
| 210 | +| `pre_swept_results_utils.py` | 12k | Pre-computed H100/H200 profiling data loader | |
| 211 | +| `kubernetes_connector.py` | 11k | K8s API integration for DGD scaling | |
| 212 | +| `kube.py` | 7.4k | Low-level K8s client wrapper | |
| 213 | +| `exceptions.py` | 7.2k | Custom exception hierarchy | |
| 214 | +| `prometheus.py` | 7.3k | Prometheus query builder and client | |
| 215 | +| `defaults.py` | 8.1k | Default configs, backend name mappings | |
| 216 | +| `planner_argparse.py` | 6.2k | CLI argument definitions | |
| 217 | + |
| 218 | + |