
Commit d399361

fix: address review comments and remove fern files
- Replace ASCII diagram with aligned version and add text lang tag
- Add text lang tag to correction formula code fence
- Fix sentence fragment in Known Limitations
- Fix broken anchor link in planner_examples.md
- Remove fern/ planner files (handled separately)
- Restore fern/versions/next.yml to main state

Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent a7bbf86 commit d399361

File tree

6 files changed

+65
-1115
lines changed


docs/design_docs/planner_design.md

Lines changed: 56 additions & 50 deletions
@@ -8,41 +8,37 @@ The Planner is Dynamo's autoscaling controller. It observes system metrics, pred
 
 ## Architecture
 
-```
-┌─────────────────────────────────────────────────────────┐
+```text
+┌──────────────────────────────────────────────────────────┐
 │ Planner Component │
-│ │
-│ ┌──────────────┐ ┌───────────────┐ ┌──────────────┐ │
-│ │ Metric │ │ Load │ │ Performance │ │
-│ │ Collector │ │ Predictor │ │ Interpolator │ │
-│ │ (Prometheus) │ │ (ARIMA/etc.) │ │ (JSON data) │ │
-│ └──────┬───────┘ └───────┬───────┘ └──────┬───────┘ │
-│ │ │ │ │
-│ ▼ ▼ ▼ │
-│ ┌─────────────────────────────────────────────────┐ │
-│ │ Scaling Algorithm │ │
-│ │ 1. Collect metrics (TTFT, ITL, req count, ISL) │ │
-│ │ 2. Compute correction factors │ │
-│ │ 3. Predict next-interval load │ │
-│ │ 4. Calculate optimal replica counts │ │
-│ │ 5. Issue scaling decision │ │
-│ └──────────────────────┬──────────────────────────┘ │
-│ │ │
-│ ┌──────────────────────▼──────────────────────────┐ │
-│ │ Connector Layer │ │
-│ │ ┌──────────────────┐ ┌──────────────────────┐ │ │
-│ │ │ KubernetesConn. │ │ VirtualConn. │ │ │
-│ │ │ (PATCH DGD) │ │ (Runtime bridge) │ │ │
-│ │ └──────────────────┘ └──────────────────────┘ │ │
-│ └─────────────────────────────────────────────────┘ │
-└─────────────────────────────────────────────────────────┘
+│ │
+│ ┌───────────────┐ ┌───────────────┐ ┌────────────────┐ │
+│ │ Metric │ │ Load │ │ Performance │ │
+│ │ Collector │ │ Predictor │ │ Interpolator │ │
+│ │ (Prometheus) │ │ (ARIMA/etc.) │ │ (JSON data) │ │
+│ └───────┬───────┘ └───────┬───────┘ └───────┬────────┘ │
+│ │ │ │ │
+│ ▼ ▼ ▼ │
+│ ┌───────────────────────────────────────────────────┐ │
+│ │ Scaling Algorithm │ │
+│ └───────────────────────┬───────────────────────────┘ │
+│ │ │
+│ ┌───────────────────────▼───────────────────────────┐ │
+│ │ Connector Layer │ │
+│ │ ┌───────────────────┐ ┌───────────────────────┐ │ │
+│ │ │ KubernetesConn. │ │ VirtualConn. │ │ │
+│ │ │ (PATCH DGD) │ │ (Runtime bridge) │ │ │
+│ │ └───────────────────┘ └───────────────────────┘ │ │
+│ └───────────────────────────────────────────────────┘ │
+└──────────────────────────────────────────────────────────┘
 ```
 
 ## Scaling Algorithm
 
 ### Step 1: Metric Collection
 
 Every `adjustment_interval` seconds, the planner queries Prometheus for:
+
 - Average TTFT and ITL over the interval
 - Total request count
 - Average input sequence length (ISL) and output sequence length (OSL)
@@ -53,12 +49,13 @@ The Prometheus query targets the Frontend's `/metrics` endpoint, which exposes h
 
 The planner maintains correction factors that adapt profiling-based predictions to real-world behavior:
 
-```
+```text
 prefill_correction = actual_ttft / expected_ttft
 decode_correction = actual_itl / expected_itl
 ```
 
 These factors account for hard-to-model effects such as:
+
 - **Request queueing**: Bursty traffic causes higher TTFT than profiled steady-state
 - **Prefix cache hits**: KV reuse reduces effective prefill tokens, lowering actual TTFT
 - **Chunked prefill in decode**: Small prefills processed in decode engine affect ITL
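As a worked illustration of the correction formula in this hunk, a minimal sketch; the metric values are invented, since in reality the planner reads actuals from Prometheus and expectations from the profiling interpolator:

```python
# Hypothetical interval metrics; the planner compares observed latencies
# against profiling-based expectations each adjustment interval.
actual_ttft, expected_ttft = 240.0, 200.0  # ms
actual_itl, expected_itl = 11.0, 10.0      # ms

prefill_correction = actual_ttft / expected_ttft  # > 1: prefill slower than profiled
decode_correction = actual_itl / expected_itl     # > 1: decode slower than profiled
```

A factor above 1 (e.g. from request queueing) means real latency exceeds the profiled prediction, so the planner discounts its capacity estimates accordingly.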
@@ -69,24 +66,28 @@ The correction factors are applied as multipliers to the next scaling decision.
 
 ### Step 3: Load Prediction
 
 The planner forecasts three values for the next interval:
+
 - `next_num_req`: Number of requests
 - `next_isl`: Average input sequence length
 - `next_osl`: Average output sequence length
 
 Four predictor implementations are available:
 
-| Predictor | Algorithm | Best For |
-|-----------|-----------|----------|
-| **Constant** | `next = current` | Stable workloads, long intervals |
-| **ARIMA** | Auto-ARIMA with optional log1p transform | Trending/seasonal patterns |
-| **Kalman** | Local linear trend Kalman filter | Bursty traffics |
-| **Prophet** | Facebook Prophet time-series model | Complex seasonality |
+
+| Predictor    | Algorithm                                | Best For                         |
+| ------------ | ---------------------------------------- | -------------------------------- |
+| **Constant** | `next = current`                         | Stable workloads, long intervals |
+| **ARIMA**    | Auto-ARIMA with optional log1p transform | Trending/seasonal patterns       |
+| **Kalman**   | Local linear trend Kalman filter         | Bursty traffic                   |
+| **Prophet**  | Facebook Prophet time-series model       | Complex seasonality              |
+
 
 All predictors support warm-starting from trace files (`--load-predictor-warmup-trace`).
 
 ### Step 4: Replica Calculation
 
 **Prefill replicas:**
+
 ```python
 predicted_load = next_requests * next_isl / interval * min(1, prefill_correction)
 prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engine)
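To make the predictor interface in the table above concrete, a toy sketch of the simplest strategy, the Constant predictor (`next = current`); the class shape is illustrative, not the actual `load_predictor.py` API:

```python
# Toy version of the "Constant" strategy: the forecast for the next
# interval is simply the most recent observation.
class ConstantPredictor:
    def __init__(self):
        self.history = []

    def observe(self, value):
        self.history.append(value)

    def predict(self):
        # No history yet -> nothing to forecast from.
        if not self.history:
            raise ValueError("no observations yet")
        return self.history[-1]  # next = current

p = ConstantPredictor()
for reqs in (100, 140, 130):  # request counts from past intervals
    p.observe(reqs)
next_num_req = p.predict()
```

The other predictors (ARIMA, Kalman, Prophet) share the same observe/predict shape but fit a model to the history instead of echoing the last point.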
@@ -95,6 +96,7 @@ prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engi
 The prefill correction factor has a linear effect on throughput because prefill is single-batched.
 
 **Decode replicas:**
+
 ```python
 # Apply correction to the ITL SLA target
 corrected_itl = target_itl / decode_correction_factor
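The prefill sizing formula shown in these hunks can be exercised end to end with invented numbers (the forecast, throughput, and GPU counts below are placeholders, not profiled values):

```python
from math import ceil

# Illustrative inputs; real values come from the load predictor and the
# performance interpolator, respectively.
next_requests, next_isl = 600, 3000      # forecast for the next interval
interval = 60.0                          # adjustment interval, seconds
prefill_correction = 1.2                 # capped at 1 for prefill sizing
interpolated_throughput = 8000.0         # prefill tokens/s per GPU at the TTFT target
gpus_per_engine = 1

# Same formula as in the document's prefill-replica snippet.
predicted_load = next_requests * next_isl / interval * min(1, prefill_correction)
prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engine)
```

Here the predicted load is 30,000 prefill tokens/s, which at 8,000 tokens/s per GPU rounds up to 4 prefill replicas.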
@@ -132,6 +134,7 @@ class PlannerConnector(ABC):
 Directly PATCHes the DGD resource to update replica counts. The operator watches for DGD changes and reconciles component deployments.
 
 **Design decisions:**
+
 - Uses `DYN_PARENT_DGD_K8S_NAME` to find its parent DGD (injected by operator)
 - Resolves services by `subComponentType` field (prefill/decode), with fallback to legacy component names
 - Validates deployment structure on startup: checks that prefill and decode services exist and model names match
@@ -141,6 +144,7 @@ Directly PATCHes the DGD resource to update replica counts. The operator watches
 For non-native environments (e.g., custom orchestrators). Writes scaling decisions to the distributed runtime via `VirtualConnectorCoordinator` (Rust binding). External systems use `VirtualConnectorClient` to poll decisions and report completion.
 
 **Scaling decision flow:**
+
 1. Planner writes `(num_prefill, num_decode, decision_id)` to runtime
 2. External system reads decision via `client.wait()`
 3. External system executes scaling
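The decision flow above can be mimicked with an in-memory stub; `FakeDecisionStream` and the decision dict shape are invented stand-ins for `VirtualConnectorClient`, whose real signatures are not shown in this diff:

```python
# In-memory stand-in for the poll/execute/report loop described above.
class FakeDecisionStream:
    def __init__(self, decisions):
        self._pending = list(decisions)

    def wait(self):
        # The real client blocks until the planner publishes a decision.
        return self._pending.pop(0)

def run_external_scaler(client, scale_fn):
    decision = client.wait()                                   # read decision
    scale_fn(decision["num_prefill"], decision["num_decode"])  # execute scaling
    return decision["decision_id"]                             # id to report back

applied = []
stream = FakeDecisionStream([{"num_prefill": 2, "num_decode": 4, "decision_id": 7}])
done_id = run_external_scaler(stream, lambda p, d: applied.append((p, d)))
```

The decision id is what the external system would report back so the planner knows the scaling operation completed.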
@@ -154,6 +158,7 @@ For non-native environments (e.g., custom orchestrators). Writes scaling decisio
 The planner uses pre-deployment profiling data (NPZ files) to map (throughput, ISL/OSL, context_length) -> (TTFT, ITL). This data comes from the SLA-driven profiling process (either online GPU profiling or AI Configurator estimation).
 
 Two interpolators are maintained:
+
 - **Prefill interpolator**: Maps (throughput_per_gpu, ISL) -> TTFT
 - **Decode interpolator**: Maps (throughput_per_gpu, context_length) -> ITL
@@ -164,6 +169,7 @@ The interpolators use the profiling sweep granularity to determine precision. Fi
 The planner starts with a 30-second delay (`INIT_PLANNER_START_DELAY`) to allow other components (frontend, workers) to register and stabilize. This is a known workaround (marked TODO in code) that should be replaced with a proper readiness check.
 
 After the delay:
+
 1. Initialize the connector (K8s or Virtual based on `--environment`)
 2. Validate deployment structure
 3. Load profiling results
@@ -174,16 +180,13 @@ After the delay:
 ## Performance Considerations
 
 - **Adjustment interval sizing**: The interval must be long enough for scaling operations to complete. If `adjustment_interval` is shorter than the time to add/remove a worker (which includes pod scheduling, model loading, and registration), scaling decisions will overlap. Default of 180s is conservative; workloads with fast model loading can use shorter intervals.
-
 - **Correction factor stability**: Correction factors are recalculated each interval. During traffic transitions (e.g., ramp-up), they can oscillate. The `--no-correction` flag disables correction for scenarios where cold-start artifacts dominate and distort the factor.
-
 - **Interpolation accuracy vs profiling cost**: Higher `prefillInterpolationGranularity` and `decodeInterpolationGranularity` in the profiling sweep produce more accurate interpolation but increase profiling time linearly. Default granularity (16 prefill, 6 decode) balances accuracy with profiling duration.
-
 - **Predictor warm-up period**: All predictors need observation history before making reliable forecasts. ARIMA and Prophet need multiple adjustment intervals of data. Kalman starts forecasting after `--kalman-min-points` observations. During warm-up, the planner uses the constant predictor as fallback.
 
 ## Known Limitations
 
-1. **30-second startup delay**: Hardcoded wait for component registration. Should be replaced with runtime readiness probing.
+1. **30-second startup delay**: Hardcoded wait for component registration. It should be replaced with runtime readiness probing.
 2. **Adjustment interval vs scaling latency**: If `adjustment_interval` < time to scale, scaling decisions can pile up. The planner logs warnings but doesn't queue.
 3. **Average-based interpolation**: The planner uses average ISL/OSL, which may not represent bimodal or heavy-tailed distributions well.
 4. **Single DGD scope**: Each planner instance manages exactly one DGD. Multi-model/multi-DGD coordination is not supported.
@@ -198,15 +201,18 @@ After the delay:
 
 ## File Map
 
-| File | Size | Purpose |
-|------|------|---------|
-| `planner_core.py` | 36k | Main scaling loop, algorithm implementation |
-| `perf_interpolation.py` | 13k | NPZ data loading and throughput/latency interpolation |
-| `load_predictor.py` | 16k | ARIMA, Prophet, Kalman, Constant predictors |
-| `pre_swept_results_utils.py` | 12k | Pre-computed H100/H200 profiling data loader |
-| `kubernetes_connector.py` | 11k | K8s API integration for DGD scaling |
-| `kube.py` | 7.4k | Low-level K8s client wrapper |
-| `exceptions.py` | 7.2k | Custom exception hierarchy |
-| `prometheus.py` | 7.3k | Prometheus query builder and client |
-| `defaults.py` | 8.1k | Default configs, backend name mappings |
-| `planner_argparse.py` | 6.2k | CLI argument definitions |
+
+| File                         | Size | Purpose                                               |
+| ---------------------------- | ---- | ----------------------------------------------------- |
+| `planner_core.py`            | 36k  | Main scaling loop, algorithm implementation           |
+| `perf_interpolation.py`      | 13k  | NPZ data loading and throughput/latency interpolation |
+| `load_predictor.py`          | 16k  | ARIMA, Prophet, Kalman, Constant predictors           |
+| `pre_swept_results_utils.py` | 12k  | Pre-computed H100/H200 profiling data loader          |
+| `kubernetes_connector.py`    | 11k  | K8s API integration for DGD scaling                   |
+| `kube.py`                    | 7.4k | Low-level K8s client wrapper                          |
+| `exceptions.py`              | 7.2k | Custom exception hierarchy                            |
+| `prometheus.py`              | 7.3k | Prometheus query builder and client                   |
+| `defaults.py`                | 8.1k | Default configs, backend name mappings                |
+| `planner_argparse.py`        | 6.2k | CLI argument definitions                              |
+

docs/planner/planner_examples.md

Lines changed: 1 addition & 1 deletion
@@ -229,7 +229,7 @@ Profiling runs against the real backend (via GPUs or AIC). The mocker deployment
 
 For large models, use a pre-populated PVC instead of downloading from HuggingFace:
 
-See [Model Cache PVC](/docs/benchmarks/sla_driven_profiling.md#model-cache-pvc-advanced) for configuration details.
+See [SLA-Driven Profiling](/docs/benchmarks/sla_driven_profiling.md) for configuration details.
 
 ## Advanced Examples
 
0 commit comments
