When using fake-gpu-operator with migStrategy: mixed to simulate MIG resources on CPU-only nodes, the mig-faker component successfully fakes MIG hardware metadata (writes MIG UUIDs, sets nvidia.com/mig.config.state: success) but never registers a Kubernetes device plugin with kubelet. As a result, nvidia.com/mig-1g.5gb (or any nvidia.com/mig-* resource) never appears in node.status.allocatable, making MIG-based scheduling impossible.
Environment
- fake-gpu-operator version: 0.0.77
- Kubernetes: v1.35.1-gke.1396002 (GKE)
- Node type: CPU-only (e2-standard-8), no real GPUs
- GPU product simulated: NVIDIA-A100-SXM4-40GB
- MIG strategy: mixed
- Goal: Scheduler validation with nvidia.com/mig-1g.5gb workloads
Steps to reproduce
1. Install fake-gpu-operator with MIG config:
helm upgrade -i gpu-operator \
oci://ghcr.io/run-ai/fake-gpu-operator/fake-gpu-operator \
--namespace gpu-operator --create-namespace \
--version 0.0.77 \
--set 'topology.nodePools.mig-pool.gpuCount=1' \
--set 'topology.nodePools.mig-pool.gpuProduct=NVIDIA-A100-SXM4-40GB' \
--set 'topology.migStrategy=mixed'
2. Label nodes for MIG pool and mig-faker:
kubectl label node <node> run.ai/simulated-gpu-node-pool=mig-pool --overwrite
kubectl label node <node> node-role.kubernetes.io/runai-dynamic-mig=true --overwrite
3. Annotate nodes with MIG config:
run.ai/mig.config: |
  version: v1
  mig-configs:
    selected:
      - devices: [0]
        mig-enabled: true
        mig-devices:
          - name: 1g.5gb
            position: 0
            size: 1
4. Observe mig-faker logs — success:
Labelling node <node> with map[nvidia.com/mig.config.state:success]
Labelling node <node> with map[run.ai/mig-mapping:eyIwIjpb...]
Successfully updated MIG config
5. Check allocatable — MIG resources missing:
kubectl get node -o json | jq '.items[].status.allocatable'
{
"nvidia.com/gpu": "1"
}
Expected:
{
"nvidia.com/mig-1g.5gb": "7"
}
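The check in step 5 can be automated. The following is a minimal sketch, not part of fake-gpu-operator: it reads the JSON produced by `kubectl get node -o json` and lists, per node, every allocatable resource whose name starts with `nvidia.com/mig-`. The `nodeList` type, the `migAllocatable` function, and the sample node name `node-a` are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// nodeList mirrors only the fields of `kubectl get node -o json` used here.
type nodeList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Status struct {
			Allocatable map[string]string `json:"allocatable"`
		} `json:"status"`
	} `json:"items"`
}

// migAllocatable returns, per node name, the sorted allocatable resource
// names that start with "nvidia.com/mig-". Nodes without any MIG resource
// map to an empty slice.
func migAllocatable(raw []byte) (map[string][]string, error) {
	var nodes nodeList
	if err := json.Unmarshal(raw, &nodes); err != nil {
		return nil, fmt.Errorf("decode node list: %w", err)
	}
	out := make(map[string][]string)
	for _, n := range nodes.Items {
		names := []string{}
		for res := range n.Status.Allocatable {
			if strings.HasPrefix(res, "nvidia.com/mig-") {
				names = append(names, res)
			}
		}
		sort.Strings(names)
		out[n.Metadata.Name] = names
	}
	return out, nil
}

func main() {
	// Sample input shaped like the observed output; the node name is hypothetical.
	raw := []byte(`{"items":[{"metadata":{"name":"node-a"},"status":{"allocatable":{"nvidia.com/gpu":"1"}}}]}`)
	res, err := migAllocatable(raw)
	if err != nil {
		panic(err)
	}
	for node, names := range res {
		fmt.Printf("%s: %d MIG resource(s) %v\n", node, len(names), names)
	}
}
```

In a real cluster the sample bytes would be replaced by the command's output read from standard input. With the behavior reported here, every node is expected to show zero MIG resources.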
Root cause analysis
After tracing the full component chain:
| Component | Behavior |
| --- | --- |
| topology-server | ✅ Correctly serves migStrategy: mixed via /topology HTTP endpoint |
| status-updater | ✅ Creates per-node topology ConfigMaps with migStrategy: mixed |
| mig-faker | ✅ Parses annotation, writes MIG UUIDs to node labels, sets mig.config.state: success, but never opens a device plugin socket |
| device-plugin | ❌ Hardcoded to only register nvidia.com/gpu; ignores MIG config entirely |
The device-plugin log always shows:
Starting device plugin for RealNodeDevicePlugin-nvidia.com/gpu
regardless of migStrategy. Only one socket ever exists on the node:
/var/lib/kubelet/device-plugins/fake-nvidia-gpu.sock
No nvidia.com/mig-* socket is ever created.
Additional findings
migProfile is silently dropped: adding migProfile: 1g.5gb to the topology ConfigMap has no effect. The field is not part of the NodePoolTopology struct, so the topology-server discards it without any warning or error. The /topology HTTP response confirms:
{"NodePools":{"mig-pool":{"GpuCount":1,"GpuMemory":40960,"GpuProduct":"NVIDIA-A100-SXM4-40GB","OtherDevices":null}}}
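The silent drop is typical of decoding into a struct that does not declare the field. The sketch below illustrates the mechanism with Go's standard `encoding/json`. This is an assumption made for self-containment: the issue does not say which decoder the topology-server uses, and the ConfigMap is likely YAML. `nodePoolTopology` is a stand-in whose field names come from the response above, and the field types are guesses.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// nodePoolTopology is a stand-in for the operator's NodePoolTopology struct.
// Field names come from the /topology response; the types are assumed.
type nodePoolTopology struct {
	GpuCount     int
	GpuMemory    int
	GpuProduct   string
	OtherDevices any
}

// decodeLenient ignores fields the struct does not declare (the default).
func decodeLenient(raw []byte) (nodePoolTopology, error) {
	var topo nodePoolTopology
	err := json.Unmarshal(raw, &topo)
	return topo, err
}

// decodeStrict returns an error when the input has an undeclared field.
func decodeStrict(raw []byte) (nodePoolTopology, error) {
	var topo nodePoolTopology
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()
	err := dec.Decode(&topo)
	return topo, err
}

func main() {
	raw := []byte(`{"GpuCount":1,"GpuMemory":40960,"GpuProduct":"NVIDIA-A100-SXM4-40GB","migProfile":"1g.5gb"}`)

	topo, err := decodeLenient(raw)
	fmt.Printf("lenient: %+v, err=%v\n", topo, err) // migProfile is gone, no error

	_, err = decodeStrict(raw)
	fmt.Printf("strict: err=%v\n", err) // reports the unknown field
}
```

A strict decode, or at least a logged warning for unrecognized keys, would have surfaced the unsupported `migProfile` field immediately instead of leaving users to find it in the HTTP response.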
mig.config annotation format is undocumented — the annotation requires a full YAML struct matching AnnotationMigConfig (with version, mig-configs.selected, devices as integer indices). Plain string values like 1g.5gb fail with:
failed to unmarshal mig config: cannot unmarshal !!str `1g.5gb` into migfaker.AnnotationMigConfig
And devices: [all] fails with:
failed to parse gpu index all: strconv.Atoi: parsing "all": invalid syntax
This format and the required node label node-role.kubernetes.io/runai-dynamic-mig=true are not documented anywhere.
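The `devices: [all]` failure follows from converting each entry with `strconv.Atoi`. The sketch below reproduces that error message and shows one hypothetical way a parser could accept `all`. The function names `parseGPUIndex` and `expandDevices` are made up for illustration and are not taken from the mig-faker source.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseGPUIndex converts one devices entry to an integer index.
// A non-numeric entry such as "all" produces the error seen in the logs.
func parseGPUIndex(s string) (int, error) {
	idx, err := strconv.Atoi(s)
	if err != nil {
		return 0, fmt.Errorf("failed to parse gpu index %s: %w", s, err)
	}
	return idx, nil
}

// expandDevices is a hypothetical tolerant variant: the single entry "all"
// expands to every index in [0, gpuCount); other entries must be integers.
func expandDevices(devices []string, gpuCount int) ([]int, error) {
	if len(devices) == 1 && devices[0] == "all" {
		out := make([]int, gpuCount)
		for i := range out {
			out[i] = i
		}
		return out, nil
	}
	out := make([]int, 0, len(devices))
	for _, d := range devices {
		idx, err := parseGPUIndex(d)
		if err != nil {
			return nil, err
		}
		out = append(out, idx)
	}
	return out, nil
}

func main() {
	_, err := parseGPUIndex("all")
	fmt.Println(err)

	indices, err := expandDevices([]string{"all"}, 2)
	fmt.Println(indices, err)
}
```

Whether `all` should be accepted is a design decision for the maintainers. Documenting the integer-only requirement would address the same confusion.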
Expected behavior
After mig-faker successfully applies a MIG config, either:
- mig-faker itself should register a kubelet device plugin socket advertising nvidia.com/mig-<profile> resources, or
- device-plugin should detect migStrategy: mixed from its topology source and register MIG resources instead of, or in addition to, nvidia.com/gpu
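As a rough sketch of the first step either option would need, the code below derives the resource names and counts to advertise from the mig-devices entries of an applied config. The `migDevice` type and the `migResourceCounts` function are hypothetical. The sketch covers only the name and count derivation and does not perform the kubelet registration itself.

```go
package main

import (
	"fmt"
	"sort"
)

// migDevice mirrors one entry under mig-devices in the annotation
// (hypothetical type; field names follow the annotation keys).
type migDevice struct {
	Name     string
	Position int
	Size     int
}

// migResourcePrefix is the resource name prefix used throughout this issue.
const migResourcePrefix = "nvidia.com/mig-"

// migResourceCounts counts configured MIG devices per profile and returns a
// map from resource name (for example nvidia.com/mig-1g.5gb) to device count.
func migResourceCounts(devices []migDevice) map[string]int {
	counts := make(map[string]int)
	for _, d := range devices {
		counts[migResourcePrefix+d.Name]++
	}
	return counts
}

func main() {
	// The single device from the annotation in step 3.
	devices := []migDevice{{Name: "1g.5gb", Position: 0, Size: 1}}
	counts := migResourceCounts(devices)

	names := make([]string, 0, len(counts))
	for name := range counts {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s: %d\n", name, counts[name])
	}
}
```

Each resulting name would then need its own registration with kubelet through the device plugin gRPC API, where the plugin serves a socket and announces its resource name. That part is omitted here.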