What happened:
After upgrading nodes to amazon-eks-node-1.32-v20251103, containers with resource limits such as
resources:
  limits:
    cpu: 125m
    memory: 32Mi
  requests:
    cpu: 125m
    memory: 32Mi
fail to start with the following error:
Error: failed to create containerd task: failed to create shim task:
OCI runtime create failed: runc create failed: unable to start container process:
error during container init: error setting cgroup config for procHooks process:
failed to write "13000": write /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-podaf0df499_2151_4b3d_b80d_161533ca5b8e.slice/cri-containerd-xxx-xxx-frontend.scope/cpu.cfs_quota_us:
invalid argument: unknown
The same workload runs correctly on the previous AMI amazon-eks-node-1.32-v20251023.
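The value 13000 in the error is not what a 125m limit converts to directly: with the default 100 ms CFS period, 125m corresponds to a quota of 12500 µs. The sketch below walks through that arithmetic. The rounding step (up to a whole percent of one CPU) is an assumption that happens to reproduce 13000; this report does not confirm that it is what runc 1.3.2 does, and the helper names are illustrative only.

```python
# Sketch of the CFS quota arithmetic behind the error message.
# ASSUMPTION: the "round up to a whole percent of a CPU" step is a guess
# that reproduces the 13000 seen in the error; it is not confirmed here.

CFS_PERIOD_US = 100_000  # default CFS period of 100 ms


def milli_cpu_to_quota(milli_cpu: int, period_us: int = CFS_PERIOD_US) -> int:
    """Return the CFS quota (µs per period) for a CPU limit in millicores."""
    return milli_cpu * period_us // 1000


def round_up_to_whole_percent(quota_us: int, period_us: int = CFS_PERIOD_US) -> int:
    """Round a quota up to the next multiple of 1% of one CPU (assumed step)."""
    step = period_us // 100  # 1% of a CPU within one period
    return -(-quota_us // step) * step  # ceiling division, then scale back


for limit in (100, 120, 125, 250):
    exact = milli_cpu_to_quota(limit)
    rounded = round_up_to_whole_percent(exact)
    print(f"{limit}m: exact={exact} rounded={rounded}")
```

Under this assumption only limits that are not a multiple of 10m change value, which may be worth checking against the limits that actually fail on the affected AMI.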
What you expected to happen:
Pods with fractional CPU limits (e.g. cpu: 120m or cpu: 125m) should start normally as they did on earlier AMIs.
How to reproduce it (as minimally and precisely as possible):
- Launch a node with AMI amazon-eks-node-1.32-v20251103.
- Deploy any Pod with fractional CPU limits:
  resources:
    limits:
      cpu: 125m
      memory: 32Mi
    requests:
      cpu: 125m
      memory: 32Mi
- Observe the Pod fail to start with the cpu.cfs_quota_us: invalid argument error.
- Run the same manifest on a node using AMI amazon-eks-node-1.32-v20251023: the Pod starts successfully.
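For a self-contained reproduction, the steps above can be driven from a small script that emits a complete Pod manifest. This is only a sketch: the Pod name, container name, image, and command are placeholders that do not come from this report; only the resources block matches the failing workload.

```python
import json


def repro_pod(cpu: str = "125m", memory: str = "32Mi") -> dict:
    """Build a minimal Pod manifest using the resource values from this report."""
    resources = {
        "limits": {"cpu": cpu, "memory": memory},
        "requests": {"cpu": cpu, "memory": memory},
    }
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "cfs-quota-repro"},  # hypothetical name
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "app",  # hypothetical container name
                    "image": "busybox",  # placeholder image
                    "command": ["sleep", "3600"],  # placeholder command
                    "resources": resources,
                }
            ],
        },
    }


# kubectl accepts JSON manifests as well as YAML.
print(json.dumps(repro_pod(), indent=2))
```

The output can be piped to `kubectl apply -f -` once against a node on each AMI to compare behaviour.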
Environment:
- AWS Region: eu-central-1
- Instance Type(s): t3a.xlarge
- Cluster Kubernetes version: 1.32
- Node Kubernetes version: 1.32
- AMI Version:
  - Broken: amazon-eks-node-1.32-v20251103
    Kernel: 5.10.245-241.976.amzn2
    containerd: 1.7.27
    runc: 1.3.2
    cgroup fs: tmpfs (v1)
  - Working: amazon-eks-node-1.32-v20251023
    Kernel: 5.10.244-240.970.amzn2
    containerd: 1.7.27
    runc: 1.3.1
    cgroup fs: tmpfs (v1)
Additional context:
This appears to be caused by a regression in runc 1.3.2, which introduces stricter validation of cgroup v1 CPU quotas (AL2 uses cgroup v1): https://github.com/opencontainers/runc/releases/tag/v1.3.2
Workarounds:
- Downgrade runc to 1.3.1
- Round CPU limits to cleaner values (100m, 250m, etc.)
- Use the previous AMI, amazon-eks-node-1.32-v20251023
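The rounding workaround can be applied mechanically. The helper below is a sketch, not a verified fix: the function name is hypothetical, and the default step of 50m is an arbitrary choice that is merely consistent with the 100m and 250m examples. The report does not establish which granularity is actually sufficient.

```python
def round_cpu_limit(milli_cpu: int, step: int = 50) -> int:
    """Round a CPU limit in millicores up to the next multiple of `step`.

    Rounding up (rather than down) avoids giving the container less CPU
    than it was originally granted. The default step is an assumption.
    """
    if milli_cpu <= 0 or step <= 0:
        raise ValueError("milli_cpu and step must be positive")
    return -(-milli_cpu // step) * step  # ceiling division, then scale back


for limit in (100, 120, 125, 250):
    print(f"{limit}m -> {round_cpu_limit(limit)}m")
```

Because the manifest in this report sets requests equal to limits, the request would need the same adjustment to keep the two values identical.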