fix: address review comments and remove fern files
- Replace ASCII diagram with aligned version and add text lang tag
- Add text lang tag to correction formula code fence
- Fix sentence fragment in Known Limitations
- Fix broken anchor link in planner_examples.md
- Remove fern/ planner files (handled separately)
- Restore fern/versions/next.yml to main state
Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Directly PATCHes the DGD resource to update replica counts. The operator watches for DGD changes and reconciles component deployments.
**Design decisions:**
+
- Uses `DYN_PARENT_DGD_K8S_NAME` to find its parent DGD (injected by operator)
- Resolves services by `subComponentType` field (prefill/decode), with fallback to legacy component names
- Validates deployment structure on startup: checks that prefill and decode services exist and model names match
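The resolution and validation rules in the list above can be sketched as follows. This is a minimal illustration, not the planner's actual code: the shape of the `services` mapping, the `model` key, and the legacy names in `LEGACY_NAMES` are all assumptions made for the example.

```python
from typing import Optional, Tuple

# Hypothetical legacy component names, used only for this illustration.
LEGACY_NAMES = {
    "prefill": ("PrefillWorker",),
    "decode": ("DecodeWorker",),
}


def resolve_service(services: dict, sub_type: str) -> Optional[str]:
    """Return the name of the service for a sub-component type, or None."""
    # Preferred path: match on the explicit subComponentType field.
    for name, spec in services.items():
        if spec.get("subComponentType") == sub_type:
            return name
    # Fallback: match on a legacy component name.
    for legacy in LEGACY_NAMES.get(sub_type, ()):
        if legacy in services:
            return legacy
    return None


def validate_deployment(services: dict) -> Tuple[str, str]:
    """Check that prefill and decode services exist and serve the same model."""
    prefill = resolve_service(services, "prefill")
    decode = resolve_service(services, "decode")
    if prefill is None or decode is None:
        raise ValueError("deployment must define both prefill and decode services")
    if services[prefill].get("model") != services[decode].get("model"):
        raise ValueError("prefill and decode model names do not match")
    return prefill, decode
```

Checking the explicit field first means a legacy name is only consulted when no service declares the requested type.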
@@ -141,6 +144,7 @@ Directly PATCHes the DGD resource to update replica counts. The operator watches
For non-native environments (e.g., custom orchestrators). Writes scaling decisions to the distributed runtime via `VirtualConnectorCoordinator` (Rust binding). External systems use `VirtualConnectorClient` to poll decisions and report completion.
**Scaling decision flow:**
+
1. Planner writes `(num_prefill, num_decode, decision_id)` to runtime
2. External system reads decision via `client.wait()`
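The write/wait handshake in the steps shown here can be sketched with an in-memory stand-in. `InMemoryDecisionStore` and its methods are hypothetical and do not reproduce the `VirtualConnectorCoordinator` or `VirtualConnectorClient` APIs; the sketch only assumes that decisions carry a monotonically increasing `decision_id` and that the external system reports completion.

```python
import threading
from typing import NamedTuple, Optional


class Decision(NamedTuple):
    num_prefill: int
    num_decode: int
    decision_id: int


class InMemoryDecisionStore:
    """Hypothetical stand-in for the distributed runtime; not the real API."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._latest: Optional[Decision] = None
        self._completed_id = -1

    def write(self, num_prefill: int, num_decode: int) -> Decision:
        """Planner side: publish a new decision with the next decision_id."""
        with self._cond:
            next_id = 0 if self._latest is None else self._latest.decision_id + 1
            self._latest = Decision(num_prefill, num_decode, next_id)
            self._cond.notify_all()
            return self._latest

    def wait(self, timeout: Optional[float] = None) -> Optional[Decision]:
        """External side: block until an uncompleted decision exists."""
        with self._cond:
            ready = self._cond.wait_for(
                lambda: self._latest is not None
                and self._latest.decision_id > self._completed_id,
                timeout,
            )
            return self._latest if ready else None

    def report_complete(self, decision: Decision) -> None:
        """External side: mark a decision as applied."""
        with self._cond:
            self._completed_id = max(self._completed_id, decision.decision_id)
```

Tracking the last completed `decision_id` lets `wait()` ignore decisions that have already been applied.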
The planner uses pre-deployment profiling data (NPZ files) to map (throughput, ISL/OSL, context_length) -> (TTFT, ITL). This data comes from the SLA-driven profiling process (either online GPU profiling or AI Configurator estimation).
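As a rough illustration of that mapping, the sketch below interpolates a single dimension (ISL to TTFT) over a profiled grid. It is a simplification under stated assumptions: the real interpolators work on NPZ data across several inputs, and the function name, grid points, and latency values here are invented for the example.

```python
from bisect import bisect_right
from typing import Sequence


def interpolate(xs: Sequence[float], ys: Sequence[float], x: float) -> float:
    """Piecewise-linear interpolation, clamped to the profiled range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    # Index of the first grid point strictly greater than x.
    i = bisect_right(xs, x)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical profiling sweep: ISL (tokens) -> TTFT (ms).
profiled_isl = [512, 1024, 2048]
profiled_ttft_ms = [40.0, 80.0, 180.0]

estimate = interpolate(profiled_isl, profiled_ttft_ms, 768)
```

Queries outside the profiled range are clamped to the nearest edge rather than extrapolated, which is one possible policy among several.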
@@ -164,6 +169,7 @@ The interpolators use the profiling sweep granularity to determine precision. Fi
The planner starts with a 30-second delay (`INIT_PLANNER_START_DELAY`) to allow other components (frontend, workers) to register and stabilize. This is a known workaround (marked TODO in code) that should be replaced with a proper readiness check.
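One possible shape for such a readiness check is a bounded polling loop. This is a sketch only: `wait_until_ready` and the `is_ready` callback are hypothetical, and how readiness would actually be determined (for example, by querying registered workers) is left to the caller.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float = 30.0,
    poll_interval_s: float = 1.0,
) -> bool:
    """Poll is_ready() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_interval_s, remaining))
```

Compared with a fixed delay, this returns as soon as components are ready and still gives up after a bounded wait.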
After the delay:
+
1. Initialize the connector (K8s or Virtual based on `--environment`)
2. Validate deployment structure
3. Load profiling results
@@ -174,16 +180,13 @@ After the delay:
## Performance Considerations
- **Adjustment interval sizing**: The interval must be long enough for scaling operations to complete. If `adjustment_interval` is shorter than the time to add/remove a worker (which includes pod scheduling, model loading, and registration), scaling decisions will overlap. Default of 180s is conservative; workloads with fast model loading can use shorter intervals.
-
- **Correction factor stability**: Correction factors are recalculated each interval. During traffic transitions (e.g., ramp-up), they can oscillate. The `--no-correction` flag disables correction for scenarios where cold-start artifacts dominate and distort the factor.
-
- **Interpolation accuracy vs profiling cost**: Higher `prefillInterpolationGranularity` and `decodeInterpolationGranularity` in the profiling sweep produce more accurate interpolation but increase profiling time linearly. Default granularity (16 prefill, 6 decode) balances accuracy with profiling duration.
-
- **Predictor warm-up period**: All predictors need observation history before making reliable forecasts. ARIMA and Prophet need multiple adjustment intervals of data. Kalman starts forecasting after `--kalman-min-points` observations. During warm-up, the planner uses the constant predictor as fallback.
## Known Limitations
-1. **30-second startup delay**: Hardcoded wait for component registration. Should be replaced with runtime readiness probing.
+1. **30-second startup delay**: Hardcoded wait for component registration. It should be replaced with runtime readiness probing.
2. **Adjustment interval vs scaling latency**: If `adjustment_interval` < time to scale, scaling decisions can pile up. The planner logs warnings but doesn't queue.
3. **Average-based interpolation**: The planner uses average ISL/OSL, which may not represent bimodal or heavy-tailed distributions well.
4. **Single DGD scope**: Each planner instance manages exactly one DGD. Multi-model/multi-DGD coordination is not supported.
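The average-based interpolation limitation can be made concrete with a small arithmetic example. The traffic mix and the convex `ttft_ms` curve below are invented for illustration and are not taken from any profiling data.

```python
from statistics import mean

# Hypothetical bimodal traffic: 90% short prompts, 10% very long prompts.
isl_samples = [200] * 9 + [8000]
avg_isl = mean(isl_samples)


def ttft_ms(isl_tokens: float) -> float:
    """Hypothetical convex latency curve, for illustration only."""
    return 10 + 1e-5 * isl_tokens**2


# Latency estimated at the average ISL vs. the average of per-request latencies.
ttft_at_avg_isl = ttft_ms(avg_isl)
avg_of_ttft = mean(ttft_ms(x) for x in isl_samples)
```

With these numbers the average ISL is 980 tokens, a length no request actually has. Under the assumed curve, the latency estimated at that average (about 19.6 ms) is well below the true average latency (about 74.4 ms), so an average-based estimate would understate TTFT for this workload.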
@@ -198,15 +201,18 @@ After the delay:
## File Map
-| File | Size | Purpose |
-|------|------|---------|
-| `planner_core.py` | 36k | Main scaling loop, algorithm implementation |
-| `perf_interpolation.py` | 13k | NPZ data loading and throughput/latency interpolation |