# WASM Shared Runners Architecture

> **Status:** Proposal
> **Authors:** knative-serving-wasm maintainers
> **Target:** v1alpha1

## Executive Summary

Today each `WasmModule` creates a dedicated Knative Service with its own runner
pod. The cold start path is: schedule pod → pull runner image → download WASM
module from OCI → compile WASM → serve. This puts WASM startup on par with
(or worse than) regular containers, negating WASM's key advantage: tiny modules.

**Insight:** Runner images are ~50-100 MB; WASM modules are ~100 KB-2 MB.
A pool of long-lived runners can host many modules, with intelligent placement
and on-demand loading in milliseconds — no pod scheduling needed.

### Startup Comparison

| Approach | State | What happens | Latency |
|---|---|---|---|
| Knative container | cold | Schedule pod → pull image → start process | ~3-10 s |
| Knative container | warm | Reuse running pod | <10 ms |
| WASM PoC | cold | Schedule pod → pull runner image → download .wasm → compile | ~2-5 s |
| WASM PoC | warm | Reuse running pod with compiled module | <10 ms |
| WASM shared runner | cold | Download .wasm into running pod → compile | <100 ms |
| WASM shared runner | warm | Route to in-memory module | <10 ms |

The shared runner pool **bypasses the Kubernetes scheduler entirely** for WASM
module scaling. Runner pods are already running — modules are loaded/unloaded at
runtime via the runner's control API, not by creating new pods. No scheduling, no
image pulling, no container startup. Only a lightweight OCI fetch of a
sub-megabyte module. This is the architectural advantage WASM was designed for.

## Architecture Overview

```
┌──────────────────────────────────────────────────────────────────┐
│                        Kubernetes Cluster                         │
│                                                                    │
│  ┌──────────────┐      ┌─────────────────────────────────────┐    │
│  │  Controller  │      │        Default Runner Pool          │    │
│  │              │      │  ┌─────────┐  ┌─────────┐           │    │
│  │ - watches    │─────▶│  │Runner 1 │  │Runner 2 │  ...      │    │
│  │   WasmModule │      │  │ A, B, C │  │ D, E    │           │    │
│  │ - schedules  │      │  └─────────┘  └─────────┘           │    │
│  │   placement  │      └─────────────────────────────────────┘    │
│  └──────────────┘                                                  │
│         │              ┌─────────────────────────────────────┐    │
│         └─────────────▶│        Named Runner: team-x         │    │
│                        │  ┌─────────┐                        │    │
│                        │  │ F, G    │  (isolated)            │    │
│                        │  └─────────┘                        │    │
│                        └─────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────────┘
```

### Runner Selection Model

Users specify `spec.runner` in WasmModule:

| Value | Behavior |
|---|---|
| `""` or `default` | Intelligent placement across the default runner pool |
| `<name>` | Dedicated runner for isolation/compliance/custom config |

**Default runner pool** — Multiple runner pods managed by the controller. Module
placement is determined by a scheduler that considers:
- Module size and declared memory limits
- Current runner capacity and load
- Historical telemetry (request patterns, memory usage)
- Co-location affinity (modules that call each other)
- Future: AI-based optimization for composition

The scheduler can rebalance modules across runners without user intervention
(a simplified placement sketch follows).
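
To make the placement factors listed above concrete, here is a minimal scoring
sketch in Go. It is illustrative only: the factor names, weights, and types are
assumptions made for this document, not the controller's actual API.

```go
package placement

// Candidate describes one runner pod the scheduler could place a module on.
// Field names are hypothetical; the real controller types may differ.
type Candidate struct {
	Name            string
	FreeMemoryBytes int64
	LoadedModules   int
	MaxModules      int
	HasCallAffinity bool // modules already on this runner call (or are called by) the new module
}

// Module captures the placement-relevant parts of a WasmModule spec.
// Size and historical telemetry factors are omitted for brevity.
type Module struct {
	MemoryLimitBytes int64
}

// score ranks a candidate runner for a module: higher is better, negative means
// the runner cannot host the module at all. Weights are arbitrary placeholders.
func score(m Module, c Candidate) float64 {
	if c.FreeMemoryBytes < m.MemoryLimitBytes || c.LoadedModules >= c.MaxModules {
		return -1
	}
	s := float64(c.FreeMemoryBytes-m.MemoryLimitBytes) / float64(c.FreeMemoryBytes) // memory headroom
	s += 0.5 * (1 - float64(c.LoadedModules)/float64(c.MaxModules))                 // current load
	if c.HasCallAffinity {
		s += 1.0 // co-location bonus for modules that call each other
	}
	return s
}

// Place returns the best-scoring runner, or "" when nothing fits
// (in which case the controller would grow the default pool instead).
func Place(m Module, candidates []Candidate) string {
	best, bestScore := "", -1.0
	for _, c := range candidates {
		if s := score(m, c); s > bestScore {
			best, bestScore = c.Name, s
		}
	}
	return best
}
```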

**Named runners** — User-controlled, isolated runner instances for compliance,
custom configuration, or guaranteed performance isolation.

## Module Lifecycle

Modules progress through distinct states, with the module's bytes stored in two
tiers (disk and memory):

```
┌──────────┐   ┌──────────┐   ┌────────┐   ┌────────┐   ┌──────────┐   ┌─────────┐
│ Unloaded │──▶│ Fetching │──▶│ Stored │──▶│ Loaded │──▶│ Compiled │──▶│ Running │
└──────────┘   └──────────┘   └────────┘   └────────┘   └──────────┘   └─────────┘
     ▲              │             │            │             │              │
     │              └─────────────┴────────────┴─────────────┴──────────────┘
     │                            │                               (eviction)
     │                            ▼
     │  (CR update)          ┌─────────┐
     └───────────────────────│  Error  │
                             └─────────┘
```

| State | Storage | Latency to serve | Survives restart |
|---|---|---|---|
| **Unloaded** | None | ~100+ ms (fetch + compile) | yes (metadata only) |
| **Fetching** | Downloading | N/A | no |
| **Stored** | Disk only | ~60-100 ms (read + compile) | yes |
| **Loaded** | Memory + disk | ~50-80 ms (compile) | no |
| **Compiled** | Machine code | ~1-5 ms (instantiate) | no |
| **Running** | Active instance | <1 ms | no |
| **Error** | Error details | N/A | yes |

### Eviction Strategy

Multi-tier eviction allows fine-grained memory/disk management:

1. **Running → Compiled**: Drop idle instances, keep compiled code
2. **Compiled → Loaded**: Drop machine code, keep bytes in memory
3. **Loaded → Stored**: Free RAM, bytes still on disk
4. **Stored → Unloaded**: Clear disk cache, refetch on next request

Each tier has independent LRU tracking and configurable limits, as sketched below.
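
A minimal sketch of the per-tier LRU bookkeeping in Go. The tier names mirror the
state machine above; the types and limits are assumptions made for illustration,
not the runner's real implementation (the runner itself is Rust).

```go
package eviction

import (
	"container/list"
	"time"
)

// Tier mirrors the storage tiers of the module state machine.
type Tier int

const (
	Stored   Tier = iota // bytes on disk
	Loaded               // bytes in memory
	Compiled             // machine code in memory
	Running              // live instance
)

type entry struct {
	module   string
	lastUsed time.Time
}

// tierCache keeps an independent LRU list for one tier with its own limit.
type tierCache struct {
	limit int
	lru   *list.List // front = most recently used
	index map[string]*list.Element
}

func newTierCache(limit int) *tierCache {
	return &tierCache{limit: limit, lru: list.New(), index: map[string]*list.Element{}}
}

// Touch records that a module in this tier was just used.
func (t *tierCache) Touch(module string) {
	if el, ok := t.index[module]; ok {
		el.Value.(*entry).lastUsed = time.Now()
		t.lru.MoveToFront(el)
		return
	}
	t.index[module] = t.lru.PushFront(&entry{module: module, lastUsed: time.Now()})
}

// EvictOverLimit pops least-recently-used modules until the tier fits its limit
// and returns their names so the caller can demote them one tier down
// (Running → Compiled, Compiled → Loaded, Loaded → Stored, Stored → Unloaded).
func (t *tierCache) EvictOverLimit() []string {
	var demoted []string
	for t.lru.Len() > t.limit {
		el := t.lru.Back()
		e := el.Value.(*entry)
		t.lru.Remove(el)
		delete(t.index, e.module)
		demoted = append(demoted, e.module)
	}
	return demoted
}
```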

### Error Handling

Errors can occur at any stage:
- **Fetching**: Invalid image reference, auth failure, network error
- **Loaded → Compiled**: Invalid WASM, missing exports, compile failure
- **Running**: Runtime trap, fuel exhaustion, memory limit exceeded

**Error is a terminal state** — recovery requires the user to update the WasmModule
CR (e.g., fix the image reference). On CR update, the controller resets the state
to Unloaded and begins fresh reconciliation.

## Module Isolation

When multiple modules share a runner, each must be isolated from others:

### Volume Mounts

Volume handling spans three layers:

| Layer | Scope | Mutable at Runtime |
|---|---|---|
| K8s Volumes | Pod spec - storage sources | **NO** - requires pod recreation |
| K8s VolumeMounts | Runner filesystem paths | **NO** - requires pod recreation |
| WASI Preopens | Guest paths per module | **YES** - per-module config |

**Key insight**: [`builder.preopened_dir(host_path, guest_path, ...)`](runner/src/server.rs:203)
supports aliasing — the host path and guest path can differ.

#### Runtime-Stable Volume Strategy

To avoid pod recreation when deploying new modules:

```
┌────────────────────────────────────────────────────────────────────┐
│                      Three-Layer Volume Model                       │
├────────────────────────────────────────────────────────────────────┤
│ K8s Volume (pod spec)      │ PVC: shared-data                       │
│ K8s VolumeMount (runner)   │ /wasm-volumes/shared-data              │
│ WASI Preopen (module-a)    │ host: /wasm-volumes/shared-data        │
│                            │ guest: /data                           │
│ WASI Preopen (module-b)    │ host: /wasm-volumes/shared-data        │
│                            │ guest: /storage                        │
└────────────────────────────────────────────────────────────────────┘
```

Runners mount volumes to prefixed paths (`/wasm-volumes/{volume-name}`).
Each module's preopen remaps to its expected guest path at runtime.
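
A sketch of this remapping rule, written in Go for illustration (the runner's
actual preopen call is the Rust `builder.preopened_dir` referenced above). The
`/wasm-volumes` prefix comes from the strategy described here; the type and
function names are assumptions.

```go
package volumes

import "path"

// Preopen is one host→guest directory mapping handed to a module's WASI context.
type Preopen struct {
	HostPath  string // where the runner pod actually mounted the volume
	GuestPath string // where the module expects to see it
}

// VolumeRequest is a module's declared volume use: which K8s volume it needs
// and which guest path it wants it visible at.
type VolumeRequest struct {
	VolumeName string // e.g. "shared-data"
	GuestPath  string // e.g. "/data" or "/storage"
}

// PreopensFor maps a module's volume requests onto the runner's stable
// /wasm-volumes/{volume-name} mount prefix, so deploying a new module never
// requires new pod-level mounts.
func PreopensFor(requests []VolumeRequest) []Preopen {
	preopens := make([]Preopen, 0, len(requests))
	for _, r := range requests {
		preopens = append(preopens, Preopen{
			HostPath:  path.Join("/wasm-volumes", r.VolumeName), // fixed pod-level mount
			GuestPath: r.GuestPath,                              // per-module alias
		})
	}
	return preopens
}
```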

#### Volume Profile Matching

Runners are tagged with their "volume profile" — the set of mounted volumes.
The controller places modules on runners with compatible profiles:

| Module Needs | Runner Has | Result |
|---|---|---|
| (none) | (any) | **ALLOWED** - volumeless, fast placement |
| `pvc-A` | `pvc-A, pvc-B` | **ALLOWED** - required volume present |
| `pvc-A, pvc-B` | `pvc-A` | **NEW RUNNER** - missing volume |

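The matching rule in the table reduces to a subset check. A hedged Go sketch
(function and parameter names are assumed):

```go
package placement

// ProfileCompatible reports whether a runner whose pod already mounts
// runnerVolumes can host a module that needs moduleVolumes. Every required
// volume must already be mounted; extra runner volumes are harmless.
func ProfileCompatible(moduleVolumes, runnerVolumes []string) bool {
	mounted := make(map[string]bool, len(runnerVolumes))
	for _, v := range runnerVolumes {
		mounted[v] = true
	}
	for _, v := range moduleVolumes {
		if !mounted[v] {
			return false // missing volume: controller must pick or create another runner
		}
	}
	return true // also covers the volumeless case: an empty need always fits
}
```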
#### Volume Access Isolation

Guest paths can be identical across modules (each has an isolated WASI context).
The protection is against **unintentional shared volume access**.

**Per-volume opt-in**: We extend `corev1.Volume` with a wrapper type:

```go
type WasmVolume struct {
    // Standard Kubernetes volume source, embedded inline.
    corev1.Volume `json:",inline"`
    // Shared opts this volume in to being referenced by more than one
    // module on the same runner.
    Shared bool `json:"shared,omitempty"`
}
```

**Conflict detection** - when two modules on the same runner reference the same volume:

| Module A | Module B | Result |
|---|---|---|
| `pvc-A` at `/data` | `pvc-B` at `/data` | **ALLOWED** - different volumes |
| `pvc-A` at `/mysql-data` | `pvc-A` at `/pgdata` | **REJECTED** - same volume, no opt-in |
| `pvc-A` + `shared: true` | `pvc-A` at `/pgdata` | **REJECTED** - both must opt-in |
| `pvc-A` + `shared: true` | `pvc-A` + `shared: true` | **ALLOWED** - mutual consent |

**Rule**: Two modules accessing the same volume must BOTH declare `shared: true`.
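
A sketch of that mutual-consent rule as a controller-side admission check, using
the `Shared` flag from the `WasmVolume` wrapper above; the surrounding types and
the function name are assumptions.

```go
package admission

import "fmt"

// VolumeUse is one module's claim on a K8s volume.
type VolumeUse struct {
	Module     string
	VolumeName string
	Shared     bool // the WasmVolume "shared" opt-in flag
}

// CheckVolumeConflicts rejects placements where two modules on the same runner
// reference the same volume unless every claimant declared shared: true.
func CheckVolumeConflicts(uses []VolumeUse) error {
	byVolume := map[string][]VolumeUse{}
	for _, u := range uses {
		byVolume[u.VolumeName] = append(byVolume[u.VolumeName], u)
	}
	for name, claims := range byVolume {
		if len(claims) < 2 {
			continue // a single user can never conflict
		}
		for _, c := range claims {
			if !c.Shared {
				return fmt.Errorf("volume %q is referenced by %d modules but %q did not set shared: true",
					name, len(claims), c.Module)
			}
		}
	}
	return nil
}
```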

### Environment Variables

Each module has isolated environment variables. Variables are scoped to module
instances — no cross-module visibility.

### Network Permissions

Per-module network configuration (tcp.connect, udp.bind, etc.) is enforced via
the runner's socket permission checks. Modules cannot escalate permissions of
other modules on the same runner.

**Port binding validation**: Two modules on the same runner cannot bind to the
same port. The controller rejects CRs that would cause port conflicts.
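
A sketch of that port-conflict rejection. It assumes modules declare their listen
ports somewhere in the WasmModule spec; the shape of that field is hypothetical.

```go
package admission

import "fmt"

// CheckPortConflicts rejects a new module whose declared listen ports collide
// with ports already bound by modules placed on the same runner.
func CheckPortConflicts(existing map[string][]int32, newModule string, newPorts []int32) error {
	taken := map[int32]string{} // port → module that already owns it
	for module, ports := range existing {
		for _, p := range ports {
			taken[p] = module
		}
	}
	for _, p := range newPorts {
		if owner, ok := taken[p]; ok {
			return fmt.Errorf("port %d requested by %q is already bound by %q on this runner",
				p, newModule, owner)
		}
	}
	return nil
}
```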

### Resource Limits

Memory and CPU limits (fuel) are enforced per-module instance:
- Each WASM instance has its own `StoreLimits`
- Fuel consumption is tracked per-request
- One module exhausting limits does not affect others

**Capacity planning**: Module resource requests are summed and must not exceed
runner capacity (a capacity-check sketch follows this list). Runner pool sizing is
configured via ConfigMap (not per-module CRs), allowing cluster admins to control:
- Default runner pool size and resource allocation
- Named runner configurations
- Memory/CPU limits per runner pod
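
As referenced above, the capacity rule itself is a simple sum check. A hedged Go
sketch; the limit would come from the runner ConfigMap, and all names here are
illustrative:

```go
package capacity

import "fmt"

// Fits checks whether adding a module's memory request to the modules already
// placed on a runner stays within that runner pod's memory limit.
func Fits(runnerLimitBytes int64, placedRequestsBytes []int64, newRequestBytes int64) error {
	var used int64
	for _, r := range placedRequestsBytes {
		used += r
	}
	if used+newRequestBytes > runnerLimitBytes {
		return fmt.Errorf("module requests %d bytes but only %d of %d bytes remain on this runner",
			newRequestBytes, runnerLimitBytes-used, runnerLimitBytes)
	}
	return nil
}
```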

## Request Routing

Routing uses **Host header dispatch** — cleaner than path prefixing, no URL rewriting.

### K8s Service Model

Each WasmModule gets a dedicated K8s Service with a unique DNS name:
- `module-a.default.svc.cluster.local` → shared runner pod
- `module-b.default.svc.cluster.local` → shared runner pod

All Services share the same `selector` pointing to runner pods:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: module-a
  namespace: default
spec:
  selector:
    wasm.knative.dev/runner: default  # Shared runner pool
  ports:
  - port: 80
    targetPort: 8080
```

### Runner Dispatch

The runner extracts the Host header and routes to the matching module:

```
Client Request                            Runner Pod
   │                                        │
   │  Host: module-a.default.svc            │
   ├───────────────────────────────────────▶│
   │                                        │
   │                         ┌──────────────┴──────────────┐
   │                         │        Routing Table        │
   │                         │  module-a.*  →  module-a ctx│
   │                         │  module-b.*  →  module-b ctx│
   │                         └──────────────┬──────────────┘
   │                                        │
   │                                        ▼
   │                                 Execute module-a
   │                                 WASI context
```
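
The dispatch boils down to a lookup keyed by the first two labels of the Host
header. A Go sketch for illustration; the real runner is Rust, and the handler
and table shapes here are assumptions.

```go
package dispatch

import (
	"net/http"
	"strings"
	"sync"
)

// Router maps Host headers such as "module-a.default.svc.cluster.local"
// to per-module handlers that execute the module in its own WASI context.
type Router struct {
	mu      sync.RWMutex
	modules map[string]http.Handler // key: "<module>.<namespace>"
}

func NewRouter() *Router {
	return &Router{modules: map[string]http.Handler{}}
}

// Register installs (or replaces) the handler for one module.
func (r *Router) Register(module, namespace string, h http.Handler) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.modules[module+"."+namespace] = h
}

// ServeHTTP picks the module from the Host header: the Service DNS name always
// starts with "<module>.<namespace>.", so the first two labels are enough.
func (r *Router) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	host := req.Host
	if i := strings.Index(host, ":"); i >= 0 {
		host = host[:i] // drop any port suffix
	}
	labels := strings.SplitN(host, ".", 3)
	if len(labels) < 2 {
		http.Error(w, "unroutable host", http.StatusNotFound)
		return
	}
	r.mu.RLock()
	h, ok := r.modules[labels[0]+"."+labels[1]]
	r.mu.RUnlock()
	if !ok {
		http.Error(w, "unknown module", http.StatusNotFound)
		return
	}
	h.ServeHTTP(w, req)
}
```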

### Lazy Loading on Request

Requests to modules in non-Running states trigger just-in-time loading:

| Current State | Action | Latency |
|---|---|---|
| Running | Direct dispatch | <1ms |
| Compiled | Instantiate | ~1ms |
| Stored (disk) | Load → Compile → Instantiate | ~10-50ms |
| Unloaded | Fetch → Store → Load → Compile → Instantiate | ~100-500ms |

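The table above is just the promotion path of the state machine. A compact sketch
of the "promote until Running" logic (the transient Fetching and Error states are
elided, and all names are assumptions):

```go
package lifecycle

// State mirrors the serving-relevant lifecycle states.
type State int

const (
	Unloaded State = iota
	Stored
	Loaded
	Compiled
	Running
)

// module is a minimal stand-in for the runner's per-module bookkeeping; the
// four step functions wrap the real fetch/load/compile/instantiate work.
type module struct {
	state   State
	fetch   func() error // pull .wasm from the OCI registry onto disk
	load    func() error // read bytes from disk into memory
	compile func() error // compile bytes to machine code
	start   func() error // instantiate the compiled module
}

// ensureRunning promotes the module through whatever states remain before it
// can serve a request: Unloaded→Stored→Loaded→Compiled→Running. A module that
// is already Running falls straight through and does no work.
func (m *module) ensureRunning() error {
	if m.state == Unloaded {
		if err := m.fetch(); err != nil {
			return err
		}
		m.state = Stored
	}
	if m.state == Stored {
		if err := m.load(); err != nil {
			return err
		}
		m.state = Loaded
	}
	if m.state == Loaded {
		if err := m.compile(); err != nil {
			return err
		}
		m.state = Compiled
	}
	if m.state == Compiled {
		if err := m.start(); err != nil {
			return err
		}
		m.state = Running
	}
	return nil
}
```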
## Trade-offs

### Benefits

| Aspect | 1:1 Model | Shared Runners |
|---|---|---|
| Cold start | 2-5 seconds (pod creation) | <100ms (module load) |
| Warm start | <10ms (brief window before scale-to-zero) | <10ms (compiled module cached) |
| Memory overhead | ~50-100MB per runner pod | Amortized across modules |
| K8s scheduler bypass | No | Yes - module placement at runtime |
| Module density | 1 per pod | 10-100+ per pod |

### Costs

| Aspect | Impact | Mitigation |
|---|---|---|
| Blast radius | Runner crash affects all modules | Health checks, graceful degradation |
| Volume changes | Pod recreation disrupts co-located modules | Volume profile matching |
| Isolation boundary | Process-level, not pod-level | WASI sandboxing, resource limits |
| Complexity | Multi-module state management | Well-defined state machine |
| Debugging | Shared logs across modules | Per-module log files (ConfigMap option) |
| Readiness model | K8s readiness is pod-level, not module-level | Module-level readiness in WasmModule status |
| Telemetry | Must aggregate pod + module metrics | Multi-layer telemetry collection |

**Telemetry layers**: The runner must expose both pod-level metrics (memory, CPU,
network) and per-module metrics (request count, latency, fuel consumption, errors).
Module telemetry must be isolated via labels/prefixes to prevent metric collisions:

```
wasm_module_requests_total{module="module-a", namespace="default"} 1234
wasm_module_requests_total{module="module-b", namespace="default"} 567
wasm_runner_memory_bytes{runner="default-pool-1"} 104857600
```

**Logging**: Per-module log files can be enabled via runner ConfigMap:

```yaml
data:
  logging.perModuleFiles: "true"  # Creates /var/log/wasm/{module-name}.log
```

**Readiness probe limitation**: K8s marks the runner pod as Ready once it starts.
New modules deployed to a running pod bypass K8s readiness probes entirely.
The controller must track per-module readiness via WasmModule status conditions:

```yaml
status:
  conditions:
  - type: Ready
    status: "True"
    reason: ModuleRunning
  - type: ModuleLoaded
    status: "True"
    reason: CompiledAndCached
```

Clients should check WasmModule status, not pod readiness.

### When to Use Named Runners

Named runners provide stronger isolation at the cost of density:

| Use Case | Runner Type |
|---|---|
| General workloads, microservices | Default pool |
| Compliance requirements (PCI, HIPAA) | Named, dedicated |
| Modules with specific volume needs | Named with volume profile |
| Resource-intensive modules | Named with higher limits |

## Scale-to-Zero

Shared runners change the scale-to-zero model:

| Model | Trigger | Wake Time |
|---|---|---|
| 1:1 (current) | Pod termination after idle | 2-5s (pod creation) |
| Shared | Module eviction after idle | <100ms (module reload) |

With shared runners, scale-to-zero becomes optional. Keeping `minScale: 1` for the
runner pool is often beneficial — the cost of one warm pod is amortized across
potentially hundreds of modules. Individual modules can still be evicted while the
runner remains warm, ready for instant reloads.

## Failure Recovery

| Failure Type | Detection | Recovery |
|---|---|---|
| Module panic | Caught by WASI runtime | Mark Error state, log, continue serving other modules |
| Runner crash | K8s liveness probe | Pod restart, reload all assigned modules |
| OOM | K8s OOMKilled | Pod restart, reload modules with LRU priority |
| Compile error | Caught during load | Mark Error state, reject requests to that module |

**Module restart**: Error state is terminal. To recover, the user must update the
WasmModule CR (fix image, config), triggering a new reconciliation cycle.

## Closing Thoughts

This architecture shifts WASM workload management from K8s pod orchestration
to in-process module orchestration. The key enabler is WASM's tiny footprint —
a 100-200KB module (typical for our examples) doesn't justify a 100MB pod.

By treating the runner as a multi-tenant runtime and modules as lightweight
tenants, we achieve the density of serverless with the control of containers.

This approach lets the project compete with cloud Lambda-like offerings on both
performance and footprint, while remaining fully open source and tunable.
No vendor lock-in, no opaque runtime — just WASI modules on Kubernetes.