mig-faker does not register a kubelet device plugin — nvidia.com/mig-* resources never appear in node allocatable #177

Description

@nirav-rafay

When using fake-gpu-operator with migStrategy: mixed to simulate MIG resources on CPU-only nodes, the mig-faker component successfully fakes MIG hardware metadata (writes MIG UUIDs, sets nvidia.com/mig.config.state: success) but never registers a Kubernetes device plugin with kubelet. As a result, nvidia.com/mig-1g.5gb (or any nvidia.com/mig-* resource) never appears in node.status.allocatable, making MIG-based scheduling impossible.


Environment

  • fake-gpu-operator version: 0.0.77
  • Kubernetes: v1.35.1-gke.1396002 (GKE)
  • Node type: CPU-only (e2-standard-8), no real GPUs
  • GPU product simulated: NVIDIA-A100-SXM4-40GB
  • MIG strategy: mixed
  • Goal: Scheduler validation with nvidia.com/mig-1g.5gb workloads

Steps to reproduce

1. Install fake-gpu-operator with MIG config:

helm upgrade -i gpu-operator \
  oci://ghcr.io/run-ai/fake-gpu-operator/fake-gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version 0.0.77 \
  --set 'topology.nodePools.mig-pool.gpuCount=1' \
  --set 'topology.nodePools.mig-pool.gpuProduct=NVIDIA-A100-SXM4-40GB' \
  --set 'topology.migStrategy=mixed'

2. Label nodes for MIG pool and mig-faker:

kubectl label node <node> run.ai/simulated-gpu-node-pool=mig-pool --overwrite
kubectl label node <node> node-role.kubernetes.io/runai-dynamic-mig=true --overwrite

3. Annotate nodes with MIG config:

run.ai/mig.config: |
  version: v1
  mig-configs:
    selected:
    - devices: [0]
      mig-enabled: true
      mig-devices:
      - name: 1g.5gb
        position: 0
        size: 1
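The step above shows the annotation value but not a command for applying it. The following is a minimal sketch, not part of fake-gpu-operator: `gen_mig_config` and `annotate_mig_config` are hypothetical helper names. The generator extrapolates from the single-device example by assuming consecutive positions and size 1 for each 1g.5gb slice. Whether the advertised resource count should follow the number of mig-devices entries is also an assumption, since the annotation above lists one device while the expected output in step 5 shows seven.

```shell
# Hypothetical generator: prints a run.ai/mig.config value with N 1g.5gb
# devices on GPU 0. Consecutive positions and size 1 are assumptions
# extrapolated from the single-device example in the issue.
gen_mig_config() {
  count="$1"
  printf 'version: v1\nmig-configs:\n  selected:\n  - devices: [0]\n    mig-enabled: true\n    mig-devices:\n'
  i=0
  while [ "$i" -lt "$count" ]; do
    printf '    - name: 1g.5gb\n      position: %d\n      size: 1\n' "$i"
    i=$((i + 1))
  done
}

# Hypothetical helper: sets the run.ai/mig.config annotation on a node
# from a YAML file. $1 = node name, $2 = path to the YAML file.
annotate_mig_config() {
  kubectl annotate node "$1" "run.ai/mig.config=$(cat "$2")" --overwrite
}
```

Possible usage: `gen_mig_config 1 > mig-config.yaml` reproduces the annotation value shown above, and `annotate_mig_config <node> mig-config.yaml` applies it.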

4. Observe mig-faker logs — success:

Labelling node <node> with map[nvidia.com/mig.config.state:success]
Labelling node <node> with map[run.ai/mig-mapping:eyIwIjpb...]
Successfully updated MIG config
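The run.ai/mig-mapping label value is truncated in the log above and is left as-is here. Its visible prefix, eyIwIjpb, is the base64 encoding of `{"0":[`, which suggests a base64-encoded JSON object keyed by GPU index. To inspect the full value on a node, a sketch like the following may help. The helper names are hypothetical, and the padding and URL-safe alphabet handling are assumptions, made because Kubernetes label values cannot contain `=`, `+`, or `/`.

```shell
# Hypothetical helper: base64-decodes a label value. Label values cannot
# contain '=', '+' or '/', so this assumes padding was stripped and that
# a URL-safe alphabet may have been used.
decode_label_b64() {
  value="$1"
  # Restore '=' padding to a multiple of four characters.
  while [ $(( ${#value} % 4 )) -ne 0 ]; do
    value="${value}="
  done
  # Map URL-safe characters back to the standard alphabet, then decode.
  printf '%s' "$value" | tr '_-' '/+' | base64 -d
}

# Reads the run.ai/mig-mapping label from a node ($1) and decodes it.
show_mig_mapping() {
  decode_label_b64 "$(kubectl get node "$1" -o jsonpath='{.metadata.labels.run\.ai/mig-mapping}')"
}
```

Possible usage: `show_mig_mapping <node>`.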

5. Check allocatable — MIG resources missing:

kubectl get node -o json | jq '.items[].status.allocatable'
{
  "nvidia.com/gpu": "1"
}

Expected:

{
  "nvidia.com/mig-1g.5gb": "7"
}
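For scripted validation, the comparison between actual and expected allocatable can be reduced to a single pass/fail check. This is a minimal sketch with a hypothetical helper name; it assumes a kubectl version whose jsonpath output prints the allocatable map as JSON.

```shell
# Hypothetical helper: succeeds only if the node's allocatable map
# contains at least one nvidia.com/mig-* resource. $1 = node name.
has_mig_allocatable() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable}' | grep -qF 'nvidia.com/mig-'
}
```

Possible usage: `has_mig_allocatable <node> && echo present || echo missing`. With the behavior described in this issue, the check would be expected to report missing.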

Root cause analysis

After tracing the full component chain:

| Component | Behavior |
| --- | --- |
| topology-server | ✅ Correctly serves migStrategy: mixed via /topology HTTP endpoint |
| status-updater | ✅ Creates per-node topology ConfigMaps with migStrategy: mixed |
| mig-faker | ✅ Parses annotation, writes MIG UUIDs to node labels, sets mig.config.state: success — but never opens a device plugin socket |
| device-plugin | ❌ Hardcoded to only register nvidia.com/gpu — ignores MIG config entirely |

The device-plugin log always shows:

Starting device plugin for RealNodeDevicePlugin-nvidia.com/gpu

regardless of migStrategy. Only one socket ever exists on the node:

/var/lib/kubelet/device-plugins/fake-nvidia-gpu.sock

No nvidia.com/mig-* socket is ever created.
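One way to check which sockets exist without SSH access is a node debug pod, which mounts the host filesystem at /host. This is a sketch only: the helper name is hypothetical, the busybox image is an arbitrary choice, and it assumes the cluster permits `kubectl debug` on nodes.

```shell
# Hypothetical helper: lists kubelet device plugin sockets on a node
# through a node debug pod. $1 = node name.
list_device_plugin_sockets() {
  kubectl debug "node/$1" -it --image=busybox -- ls /host/var/lib/kubelet/device-plugins/
}
```

Possible usage: `list_device_plugin_sockets <node>`.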


Additional findings

  1. migProfile is silently dropped: adding migProfile: 1g.5gb to the topology ConfigMap has no effect, because the field is not part of the NodePoolTopology struct and topology-server discards it without any warning. The /topology HTTP response confirms this:
{"NodePools":{"mig-pool":{"GpuCount":1,"GpuMemory":40960,"GpuProduct":"NVIDIA-A100-SXM4-40GB","OtherDevices":null}}}
  2. mig.config annotation format is undocumented: the annotation requires a full YAML struct matching AnnotationMigConfig (with version, mig-configs.selected, and devices given as integer indices). Plain string values like 1g.5gb fail with:
failed to unmarshal mig config: cannot unmarshal !!str `1g.5gb` into migfaker.AnnotationMigConfig

And devices: [all] fails with:

failed to parse gpu index all: strconv.Atoi: parsing "all": invalid syntax

This format and the required node label node-role.kubernetes.io/runai-dynamic-mig=true are not documented anywhere.


Expected behavior

After mig-faker successfully applies a MIG config, either:

  • mig-faker itself should register a kubelet device plugin socket advertising nvidia.com/mig-<profile> resources, or
  • device-plugin should detect migStrategy: mixed from its topology source and register MIG resources instead of, or in addition to, nvidia.com/gpu
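Whichever component ends up advertising the resources, it would need to turn the applied MIG devices into per-profile resource names and counts. The sketch below illustrates only that naming and counting step, following the nvidia.com/mig-<profile> convention used in this issue. It is not an implementation of kubelet device plugin registration, and `mig_resources` is a hypothetical name.

```shell
# Hypothetical helper: reads MIG profile names (one per line) on stdin and
# prints "nvidia.com/mig-<profile>=<count>" for each distinct profile.
mig_resources() {
  sort | uniq -c | awk '{ printf "nvidia.com/mig-%s=%d\n", $2, $1 }'
}
```

Possible usage: `printf '1g.5gb\n1g.5gb\n2g.10gb\n' | mig_resources` prints one line per profile with its instance count.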
