bug(runc): AmazonLinux2 runc 1.3.2 amazon-eks-node-1.32-v20251103 containers fail to start with "failed to write cpu.cfs_quota_us: invalid argument" #2498

@matzegebbe

What happened:

After upgrading nodes to amazon-eks-node-1.32-v20251103, containers whose resources are set to, for example,

        resources:
          limits:
            cpu: 125m
            memory: 32Mi
          requests:
            cpu: 125m
            memory: 32Mi

fail to start with the following error:

Error: failed to create containerd task: failed to create shim task:
OCI runtime create failed: runc create failed: unable to start container process:
error during container init: error setting cgroup config for procHooks process:
failed to write "13000": write /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-podaf0df499_2151_4b3d_b80d_161533ca5b8e.slice/cri-containerd-xxx-xxx-frontend.scope/cpu.cfs_quota_us:
invalid argument: unknown

The same workload runs correctly on the previous AMI amazon-eks-node-1.32-v20251023.
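
For reference, the value in the error ("13000") can be compared with the quota that a 125m limit normally maps to. The sketch below assumes the default CFS period of 100000 µs and the usual conversion quota = millicores × period / 1000. The rounding step (up to the next 10 ms of CPU time per second) is an assumption about what the systemd cgroup driver may be doing, not something confirmed in this report, and the function names are made up for illustration.

```python
CFS_PERIOD_US = 100_000  # assumed default CFS period (100 ms)


def millicores_to_quota_us(millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """Convert a CPU limit in millicores to a cpu.cfs_quota_us value."""
    return millicores * period_us // 1000


def round_up_quota_us(quota_us: int, period_us: int = CFS_PERIOD_US) -> int:
    """ASSUMPTION: round the quota up to the next 10 ms of CPU time per second."""
    per_sec_us = quota_us * 1_000_000 // period_us
    if per_sec_us % 10_000 != 0:
        per_sec_us = (per_sec_us // 10_000 + 1) * 10_000
    return per_sec_us * period_us // 1_000_000


exact = millicores_to_quota_us(125)   # quota for cpu: 125m
rounded = round_up_quota_us(exact)    # value after the assumed rounding
print(exact, rounded)
```

Under these assumptions the exact quota for 125m is 12500 and the rounded one is 13000, which matches the number in the error. If the pod-level cgroup still holds 12500, that would be consistent with the kernel rejecting the larger container-level value with "invalid argument".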

What you expected to happen:

Pods with fractional CPU limits (e.g. cpu: 120m or cpu: 125m) should start normally as they did on earlier AMIs.

How to reproduce it (as minimally and precisely as possible):

  1. Launch a node with AMI amazon-eks-node-1.32-v20251103.
  2. Deploy any Pod with fractional CPU limits:
    resources:
      limits:
        cpu: 125m
        memory: 32Mi
      requests:
        cpu: 125m
        memory: 32Mi
  3. Observe the Pod fail to start with the cpu.cfs_quota_us: invalid argument error.
  4. Run the same manifest on a node using AMI amazon-eks-node-1.32-v20251023 - Pod starts successfully.
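
One way to narrow this down on an affected node is to compare the value runc tried to write with the quota already set on the pod-level cgroup. The helper below is only a sketch: it assumes cgroup v1 semantics, where -1 in cpu.cfs_quota_us means unlimited and a child quota larger than its parent's is rejected. The function names are hypothetical, and the directory has to be supplied by you (for example, the pod slice directory that appears in the error path).

```python
from pathlib import Path


def read_quota_us(cgroup_dir: Path) -> int:
    """Read cpu.cfs_quota_us from a cgroup v1 cpu controller directory."""
    return int((cgroup_dir / "cpu.cfs_quota_us").read_text().strip())


def child_exceeds_parent(parent_dir: Path, child_quota_us: int) -> bool:
    """Return True if the child quota is larger than a limited parent quota."""
    parent_quota_us = read_quota_us(parent_dir)
    # -1 means "no limit" on the parent, so any child value fits.
    return parent_quota_us != -1 and child_quota_us > parent_quota_us
```

Calling child_exceeds_parent with the pod slice directory and 13000 (the value from the error) shows whether the write would exceed the parent's quota.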

Environment:

  • AWS Region: eu-central-1
  • Instance Type(s): t3a.xlarge
  • Cluster Kubernetes version: 1.32
  • Node Kubernetes version: 1.32
  • AMI Version:
    • Broken: amazon-eks-node-1.32-v20251103
      • Kernel: 5.10.245-241.976.amzn2
      • containerd: 1.7.27
      • runc: 1.3.2
      • cgroup fs: tmpfs (v1)
    • Working: amazon-eks-node-1.32-v20251023
      • Kernel: 5.10.244-240.970.amzn2
      • containerd: 1.7.27
      • runc: 1.3.1
      • cgroup fs: tmpfs (v1)

Additional context:

This appears to be caused by a regression in runc 1.3.2, which introduces stricter validation of cgroup v1 CPU quotas (AL2 uses cgroup v1): https://github.com/opencontainers/runc/releases/tag/v1.3.2

Workarounds:

  • Downgrade runc to 1.3.1
  • Round CPU limits to cleaner values (100m, 250m, etc.)
  • Use the previous AMI, amazon-eks-node-1.32-v20251023
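
For the rounding workaround, a small helper can pick the next "clean" value for a given limit. This is a sketch under the assumption that only limits which are a multiple of 10m (a whole percent of one CPU) are safe; that threshold is not confirmed in this report, and round_up_millicores is a hypothetical name.

```python
def round_up_millicores(millicores: int, step: int = 10) -> int:
    """Round a CPU limit up to the next multiple of `step` millicores.

    ASSUMPTION: multiples of 10m are unaffected; adjust `step` if not.
    """
    return -(-millicores // step) * step  # ceiling division, then scale back


for limit in (100, 125, 250):
    print(f"{limit}m -> {round_up_millicores(limit)}m")
```

With the default step, 100m and 250m are left unchanged and 125m becomes 130m. Rounding up rather than down keeps the container from getting less CPU than originally requested.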
