Commit e7d29e8 (parent 3b45372)

Add shared runner architecture design (#12)

Comprehensive design document covering:
- Multi-module runner pools with intelligent placement
- Module lifecycle state machine (6 states + eviction)
- Three-layer volume model with WASI preopen aliasing
- Host header routing with lazy loading
- Trade-offs, scale-to-zero, and failure recovery

Assisted-by: 🤖 Claude Opus/Sonnet 4.5

# WASM Shared Runners Architecture

> **Status:** Proposal
> **Authors:** knative-serving-wasm maintainers
> **Target:** v1alpha1

## Executive Summary

Today each `WasmModule` creates a dedicated Knative Service with its own runner
pod. The cold start path is: schedule pod → pull runner image → download WASM
module from OCI → compile WASM → serve. This puts WASM startup on par with
(or worse than) regular containers, negating WASM's key advantage: tiny modules.

**Insight:** Runner images are ~50-100 MB; WASM modules are ~100 KB-2 MB.
A pool of long-lived runners can host many modules, with intelligent placement
and on-demand loading in milliseconds — no pod scheduling needed.

### Startup Comparison

| Approach | State | What happens | Latency |
|---|---|---|---|
| Knative container | cold | Schedule pod → pull image → start process | ~3-10 s |
| Knative container | warm | Reuse running pod | <10 ms |
| WASM PoC | cold | Schedule pod → pull runner image → download .wasm → compile | ~2-5 s |
| WASM PoC | warm | Reuse running pod with compiled module | <10 ms |
| WASM shared runner | cold | Download .wasm into running pod → compile | <100 ms |
| WASM shared runner | warm | Route to in-memory module | <10 ms |

The shared runner pool **bypasses the Kubernetes scheduler entirely** for WASM
module scaling. Runner pods are already running — modules are loaded/unloaded at
runtime via the runner's control API, not by creating new pods. No scheduling, no
image pulling, no container startup. Only a lightweight OCI fetch of a
sub-megabyte module. This is the architectural advantage WASM was designed for.

## Architecture Overview

```
┌────────────────────────────────────────────────────────────────────┐
│                         Kubernetes Cluster                         │
│                                                                    │
│  ┌──────────────┐        ┌─────────────────────────────────────┐   │
│  │  Controller  │        │         Default Runner Pool         │   │
│  │              │        │  ┌─────────┐  ┌─────────┐           │   │
│  │ - watches    │───────▶│  │Runner 1 │  │Runner 2 │  ...      │   │
│  │   WasmModule │        │  │ A, B, C │  │ D, E    │           │   │
│  │ - schedules  │        │  └─────────┘  └─────────┘           │   │
│  │   placement  │        └─────────────────────────────────────┘   │
│  └──────────────┘                                                  │
│         │                ┌─────────────────────────────────────┐   │
│         └───────────────▶│        Named Runner: team-x         │   │
│                          │  ┌─────────┐                        │   │
│                          │  │  F, G   │  (isolated)            │   │
│                          │  └─────────┘                        │   │
│                          └─────────────────────────────────────┘   │
└────────────────────────────────────────────────────────────────────┘
```

### Runner Selection Model

Users specify `spec.runner` in WasmModule:

| Value | Behavior |
|---|---|
| `""` or `default` | Intelligent placement across the default runner pool |
| `<name>` | Dedicated runner for isolation/compliance/custom config |

**Default runner pool** — Multiple runner pods managed by the controller. Module
placement is determined by a scheduler that considers:
- Module size and declared memory limits
- Current runner capacity and load
- Historical telemetry (request patterns, memory usage)
- Co-location affinity (modules that call each other)
- Future: AI-based optimization for composition

The scheduler can rebalance modules across runners without user intervention.

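How placement is scored is an open implementation detail. As a rough illustration only, the following Go sketch shows the kind of heuristic the controller could apply; every type, field, and constant here is hypothetical and not part of the current API.

```go
// Hypothetical placement heuristic: pick the runner with the most spare
// memory that already hosts modules this module frequently calls.
type runnerInfo struct {
    Name          string
    MemoryFreeMB  int64
    ModuleCount   int
    HostedModules map[string]bool
}

type placementRequest struct {
    ModuleMemoryMB int64    // declared memory limit of the module
    CallsModules   []string // co-location affinity hints
}

func pickRunner(runners []runnerInfo, req placementRequest) (string, bool) {
    bestScore := int64(-1)
    best := ""
    for _, r := range runners {
        if r.MemoryFreeMB < req.ModuleMemoryMB {
            continue // runner cannot fit the module's declared limit
        }
        score := r.MemoryFreeMB - int64(r.ModuleCount)*10 // prefer spare memory, light load
        for _, dep := range req.CallsModules {
            if r.HostedModules[dep] {
                score += 100 // co-location bonus for modules that call each other
            }
        }
        if score > bestScore {
            bestScore, best = score, r.Name
        }
    }
    return best, best != ""
}
```
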
**Named runners** — User-controlled, isolated runner instances for compliance,
custom configuration, or guaranteed performance isolation.

## Module Lifecycle

Modules progress through distinct states, with the raw module bytes cached in
two tiers (on disk and in memory):

```
┌──────────┐   ┌──────────┐   ┌────────┐   ┌────────┐   ┌──────────┐   ┌─────────┐
│ Unloaded │──▶│ Fetching │──▶│ Stored │──▶│ Loaded │──▶│ Compiled │──▶│ Running │
└──────────┘   └──────────┘   └────────┘   └────────┘   └──────────┘   └─────────┘
     ▲              │             │            │             │              │
     │              └─────────────┴────────────┴─────────────┴──────────────┘
     │                                         │  (eviction)
     │                                         ▼
     │  (CR update)                       ┌─────────┐
     └────────────────────────────────────│  Error  │
                                          └─────────┘
```

| State | Storage | Latency to serve | Survives restart |
|---|---|---|---|
| **Unloaded** | None | ~100+ ms (fetch + compile) | yes (metadata only) |
| **Fetching** | Downloading | N/A | no |
| **Stored** | Disk only | ~60-100 ms (read + compile) | yes |
| **Loaded** | Memory + disk | ~50-80 ms (compile) | no |
| **Compiled** | Machine code | ~1-5 ms (instantiate) | no |
| **Running** | Active instance | <1 ms | no |
| **Error** | Error details | N/A | yes |

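For controller bookkeeping, the states and their legal forward transitions can be written down directly. A minimal Go sketch (state names follow the table above; the type itself is illustrative rather than a committed API):

```go
// ModuleState mirrors the lifecycle states in the table above.
type ModuleState string

const (
    StateUnloaded ModuleState = "Unloaded"
    StateFetching ModuleState = "Fetching"
    StateStored   ModuleState = "Stored"
    StateLoaded   ModuleState = "Loaded"
    StateCompiled ModuleState = "Compiled"
    StateRunning  ModuleState = "Running"
    StateError    ModuleState = "Error"
)

// validNext lists forward transitions; eviction moves modules backward
// (see Eviction Strategy) and any stage may also drop to Error.
var validNext = map[ModuleState][]ModuleState{
    StateUnloaded: {StateFetching},
    StateFetching: {StateStored, StateError},
    StateStored:   {StateLoaded, StateError},
    StateLoaded:   {StateCompiled, StateError},
    StateCompiled: {StateRunning, StateError},
    StateRunning:  {StateError},
    StateError:    {StateUnloaded}, // only via a WasmModule CR update
}
```
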
### Eviction Strategy

Multi-tier eviction allows fine-grained memory/disk management:

1. **Running → Compiled**: Drop idle instances, keep compiled code
2. **Compiled → Loaded**: Drop machine code, keep bytes in memory
3. **Loaded → Stored**: Free RAM, bytes still on disk
4. **Stored → Unloaded**: Clear disk cache, refetch on next request

Each tier has independent LRU tracking and configurable limits.

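A small sketch of how per-tier caps and idle cutoffs might be expressed, building on the `ModuleState` type sketched earlier; the field names and default values are illustrative, not taken from the runner ConfigMap:

```go
// tierLimit pairs one eviction tier with its own LRU cap and idle cutoff.
type tierLimit struct {
    From, To    ModuleState // e.g. Running -> Compiled
    MaxItems    int         // configurable per-tier limit
    IdleSeconds int64       // evict entries idle longer than this
}

// Hypothetical defaults. Each tier is tracked independently, so dropping an
// idle instance never forces the compiled code or cached bytes out as well.
var evictionTiers = []tierLimit{
    {From: StateRunning, To: StateCompiled, MaxItems: 200, IdleSeconds: 30},
    {From: StateCompiled, To: StateLoaded, MaxItems: 100, IdleSeconds: 300},
    {From: StateLoaded, To: StateStored, MaxItems: 50, IdleSeconds: 1800},
    {From: StateStored, To: StateUnloaded, MaxItems: 500, IdleSeconds: 86400},
}
```
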
### Error Handling

Errors can occur at any stage:
- **Fetching**: Invalid image reference, auth failure, network error
- **Loaded → Compiled**: Invalid WASM, missing exports, compile failure
- **Running**: Runtime trap, fuel exhaustion, memory limit exceeded

**Error is a terminal state** — recovery requires the user to update the WasmModule
CR (e.g., fix the image reference). On CR update, the controller resets state
to Unloaded and begins fresh reconciliation.

## Module Isolation

When multiple modules share a runner, each must be isolated from others:

### Volume Mounts

Volume handling spans three layers:

| Layer | Scope | Mutable at Runtime |
|---|---|---|
| K8s Volumes | Pod spec - storage sources | **NO** - requires pod recreation |
| K8s VolumeMounts | Runner filesystem paths | **NO** - requires pod recreation |
| WASI Preopens | Guest paths per module | **YES** - per-module config |

**Key insight**: [`builder.preopened_dir(host_path, guest_path, ...)`](runner/src/server.rs:203)
supports aliasing — the host path and guest path can differ.

#### Runtime-Stable Volume Strategy

To avoid pod recreation when deploying new modules:

```
┌──────────────────────────────────────────────────────────────────┐
│                     Three-Layer Volume Model                     │
├──────────────────────────────────────────────────────────────────┤
│ K8s Volume (pod spec)      │ PVC: shared-data                    │
│ K8s VolumeMount (runner)   │ /wasm-volumes/shared-data           │
│ WASI Preopen (module-a)    │ host: /wasm-volumes/shared-data     │
│                            │ guest: /data                        │
│ WASI Preopen (module-b)    │ host: /wasm-volumes/shared-data     │
│                            │ guest: /storage                     │
└──────────────────────────────────────────────────────────────────┘
```

Runners mount volumes to prefixed paths (`/wasm-volumes/{volume-name}`).
Each module's preopen remaps to its expected guest path at runtime.

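The remapping is a pure function of the volume name and the module's requested guest path. A minimal Go sketch of the mapping the controller could hand to the runner (the `PreopenSpec` type and helper are hypothetical):

```go
// PreopenSpec is the pair the runner needs for one preopened_dir(host, guest, ...)
// call on behalf of one module.
type PreopenSpec struct {
    HostPath  string // path inside the runner pod
    GuestPath string // path the module sees via WASI
}

// preopenFor maps a K8s volume name plus the module's desired guest path to the
// runner-side prefixed mount, e.g. ("shared-data", "/data") ->
// {HostPath: "/wasm-volumes/shared-data", GuestPath: "/data"}.
func preopenFor(volumeName, guestPath string) PreopenSpec {
    return PreopenSpec{
        HostPath:  "/wasm-volumes/" + volumeName,
        GuestPath: guestPath,
    }
}
```
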
#### Volume Profile Matching

Runners are tagged with their "volume profile" — the set of mounted volumes.
The controller places modules on runners with compatible profiles:

| Module Needs | Runner Has | Result |
|---|---|---|
| (none) | (any) | **ALLOWED** - volumeless, fast placement |
| `pvc-A` | `pvc-A, pvc-B` | **ALLOWED** - required volume present |
| `pvc-A, pvc-B` | `pvc-A` | **NEW RUNNER** - missing volume |

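Profile matching reduces to a subset check over volume names, roughly as in this illustrative helper (not existing controller code):

```go
// runnerCanHost reports whether a runner's volume profile covers every volume
// the module declares. Volumeless modules match any runner.
func runnerCanHost(runnerVolumes, moduleVolumes []string) bool {
    have := make(map[string]bool, len(runnerVolumes))
    for _, v := range runnerVolumes {
        have[v] = true
    }
    for _, v := range moduleVolumes {
        if !have[v] {
            return false // missing volume: place on (or create) another runner
        }
    }
    return true
}
```
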
#### Volume Access Isolation

Guest paths can be identical across modules (each has an isolated WASI context).
The protection is against **unintentional shared volume access**.

**Per-volume opt-in**: We extend `corev1.Volume` with a wrapper type:

```go
type WasmVolume struct {
    corev1.Volume `json:",inline"`
    Shared        bool `json:"shared,omitempty"`
}
```

**Conflict detection** - when two modules on the same runner reference the same volume:

| Module A | Module B | Result |
|---|---|---|
| `pvc-A` at `/data` | `pvc-B` at `/data` | **ALLOWED** - different volumes |
| `pvc-A` at `/mysql-data` | `pvc-A` at `/pgdata` | **REJECTED** - same volume, no opt-in |
| `pvc-A` + `shared: true` | `pvc-A` at `/pgdata` | **REJECTED** - both must opt-in |
| `pvc-A` + `shared: true` | `pvc-A` + `shared: true` | **ALLOWED** - mutual consent |

**Rule**: Two modules accessing the same volume must BOTH declare `shared: true`.

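Using the `WasmVolume` wrapper above, the check is symmetric: a conflict exists only when two co-located modules reference the same volume and at least one side has not opted in. A minimal sketch with a hypothetical helper name:

```go
// sharedVolumeConflict returns the names of volumes that two modules on the
// same runner both reference without mutual `shared: true` consent.
func sharedVolumeConflict(a, b []WasmVolume) []string {
    sharedByA := make(map[string]bool, len(a))
    for _, v := range a {
        sharedByA[v.Name] = v.Shared // Name comes from the embedded corev1.Volume
    }
    var conflicts []string
    for _, v := range b {
        optInA, referenced := sharedByA[v.Name]
        if referenced && !(optInA && v.Shared) {
            conflicts = append(conflicts, v.Name) // same volume, not both opted in
        }
    }
    return conflicts
}
```
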
### Environment Variables

Each module has isolated environment variables. Variables are scoped to module
instances — no cross-module visibility.

### Network Permissions

Per-module network configuration (tcp.connect, udp.bind, etc.) is enforced via
the runner's socket permission checks. Modules cannot escalate permissions of
other modules on the same runner.

**Port binding validation**: Two modules on the same runner cannot bind to the
same port. The controller rejects CRs that would cause port conflicts.

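A sketch of that admission-time check; the helper is hypothetical, and the real validation would live in the controller's webhook or reconciler:

```go
// firstPortConflict returns the first port already claimed by another module
// on the same runner, or 0 if all of the new module's ports are free.
func firstPortConflict(existing map[int32]string, newModule string, ports []int32) int32 {
    for _, p := range ports {
        if owner, taken := existing[p]; taken && owner != newModule {
            return p // reject the CR: two modules cannot bind the same port
        }
    }
    return 0
}
```
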
### Resource Limits

Memory and CPU limits (fuel) are enforced per-module instance:
- Each WASM instance has its own `StoreLimits`
- Fuel consumption is tracked per-request
- One module exhausting limits does not affect others

**Capacity planning**: Module resource requests are summed and must not exceed
runner capacity. Runner pool sizing is configured via ConfigMap (not per-module
CRs), allowing cluster admins to control:
- Default runner pool size and resource allocation
- Named runner configurations
- Memory/CPU limits per runner pod

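The capacity check itself is a plain sum against the runner pod's configured limit, roughly as in this illustrative helper:

```go
// fitsRunner reports whether adding a module keeps the summed memory requests
// of all modules on a runner within the pod-level limit from the ConfigMap.
func fitsRunner(existingRequestsMB []int64, newRequestMB, runnerLimitMB int64) bool {
    var total int64
    for _, r := range existingRequestsMB {
        total += r
    }
    return total+newRequestMB <= runnerLimitMB
}
```
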
## Request Routing

Routing uses **Host header dispatch** — cleaner than path prefixing, no URL rewriting.

### K8s Service Model

Each WasmModule gets a dedicated K8s Service with a unique DNS name:
- `module-a.default.svc.cluster.local` → shared runner pod
- `module-b.default.svc.cluster.local` → shared runner pod

All Services share the same `selector` pointing to runner pods:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: module-a
  namespace: default
spec:
  selector:
    wasm.knative.dev/runner: default  # Shared runner pool
  ports:
    - port: 80
      targetPort: 8080
```

### Runner Dispatch

The runner extracts the Host header and routes to the matching module:

```
Client Request                       Runner Pod
      │                                   │
      │  Host: module-a.default.svc      │
      ├───────────────────────────────────►│
      │                                   │
      │                   ┌──────────────┴──────────────┐
      │                   │        Routing Table        │
      │                   │  module-a.*  →  module-a ctx│
      │                   │  module-b.*  →  module-b ctx│
      │                   └──────────────┬──────────────┘
      │                                   │
      │                                   ▼
      │                           Execute module-a
      │                           WASI context
```

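The runner itself is written in Rust; purely to illustrate the dispatch logic, here is the same lookup expressed in Go (the routing-table shape and helper name are hypothetical):

```go
import "strings"

// routeByHost resolves the Host header against the runner's routing table,
// whose keys are "<module>.<namespace>" (e.g. "module-a.default").
func routeByHost(routes map[string]string, hostHeader string) (string, bool) {
    host := hostHeader
    if i := strings.IndexByte(host, ':'); i >= 0 {
        host = host[:i] // drop an explicit ":port", if any
    }
    labels := strings.SplitN(host, ".", 3)
    if len(labels) < 2 {
        return "", false // not a Service-style hostname
    }
    moduleCtx, ok := routes[labels[0]+"."+labels[1]]
    return moduleCtx, ok
}
```
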
### Lazy Loading on Request

Requests to modules in non-Running states trigger just-in-time loading:

| Current State | Action | Latency |
|---|---|---|
| Running | Direct dispatch | <1ms |
| Compiled | Instantiate | ~1ms |
| Stored (disk) | Load → Compile → Instantiate | ~10-50ms |
| Unloaded | Fetch → Store → Load → Compile → Instantiate | ~100-500ms |

## Trade-offs

### Benefits

| Aspect | 1:1 Model | Shared Runners |
|---|---|---|
| Cold start | 2-5 seconds (pod creation) | <100ms (module load) |
| Warm start | <10ms (brief window before scale-to-zero) | <10ms (compiled module cached) |
| Memory overhead | ~50-100MB per runner pod | Amortized across modules |
| K8s scheduler bypass | No | Yes - module placement at runtime |
| Module density | 1 per pod | 10-100+ per pod |

### Costs

| Aspect | Impact | Mitigation |
|---|---|---|
| Blast radius | Runner crash affects all modules | Health checks, graceful degradation |
| Volume changes | Pod recreation disrupts co-located modules | Volume profile matching |
| Isolation boundary | Process-level, not pod-level | WASI sandboxing, resource limits |
| Complexity | Multi-module state management | Well-defined state machine |
| Debugging | Shared logs across modules | Per-module log files (ConfigMap option) |
| Readiness model | K8s readiness is pod-level, not module-level | Module-level readiness in WasmModule status |
| Telemetry | Must aggregate pod + module metrics | Multi-layer telemetry collection |

**Telemetry layers**: The runner must expose both pod-level metrics (memory, CPU,
network) and per-module metrics (request count, latency, fuel consumption, errors).
Module telemetry must be isolated via labels/prefixes to prevent metric collisions:

```
wasm_module_requests_total{module="module-a", namespace="default"} 1234
wasm_module_requests_total{module="module-b", namespace="default"} 567
wasm_runner_memory_bytes{runner="default-pool-1"} 104857600
```

**Logging**: Per-module log files can be enabled via runner ConfigMap:

```yaml
data:
  logging.perModuleFiles: "true"  # Creates /var/log/wasm/{module-name}.log
```

**Readiness probe limitation**: K8s marks the runner pod as Ready once it starts.
New modules deployed to a running pod bypass K8s readiness probes entirely.
The controller must track per-module readiness via WasmModule status conditions:

```yaml
status:
  conditions:
    - type: Ready
      status: "True"
      reason: ModuleRunning
    - type: ModuleLoaded
      status: "True"
      reason: CompiledAndCached
```

Clients should check WasmModule status, not pod readiness.

### When to Use Named Runners

Named runners provide stronger isolation at the cost of density:

| Use Case | Runner Type |
|---|---|
| General workloads, microservices | Default pool |
| Compliance requirements (PCI, HIPAA) | Named, dedicated |
| Modules with specific volume needs | Named with volume profile |
| Resource-intensive modules | Named with higher limits |

## Scale-to-Zero

Shared runners change the scale-to-zero model:

| Model | Trigger | Wake Time |
|---|---|---|
| 1:1 (current) | Pod termination after idle | 2-5s (pod creation) |
| Shared | Module eviction after idle | <100ms (module reload) |

With shared runners, scale-to-zero becomes optional. Keeping `minScale: 1` for the
runner pool is often beneficial — the cost of one warm pod is amortized across
potentially hundreds of modules. Individual modules can still be evicted while the
runner remains warm, ready for instant reloads.

## Failure Recovery

| Failure Type | Detection | Recovery |
|---|---|---|
| Module panic | Caught by WASI runtime | Mark Error state, log, continue serving other modules |
| Runner crash | K8s liveness probe | Pod restart, reload all assigned modules |
| OOM | K8s OOMKilled | Pod restart, reload modules with LRU priority |
| Compile error | Caught during load | Mark Error state, reject requests to that module |

**Module restart**: Error state is terminal. To recover, the user must update the
WasmModule CR (fix image, config), triggering a new reconciliation cycle.

## Closing Thoughts

This architecture shifts WASM workload management from K8s pod orchestration
to in-process module orchestration. The key enabler is WASM's tiny footprint —
a 100-200KB module (typical for our examples) doesn't justify a 100MB pod.

By treating the runner as a multi-tenant runtime and modules as lightweight
tenants, we achieve the density of serverless with the control of containers.

This approach enables competing with cloud Lambda-like solutions in both
performance and footprint, while remaining fully open-source and tunable.
No vendor lock-in, no opaque runtime — just WASI modules on Kubernetes.
