Description
Hi! I also opened this issue at aws-samples/aws-efa-eks#8 so the two can be tracked in sync. I just updated eksctl, which pulled in the new changes, and we started seeing the problem I'm reporting here. We are creating an EKS cluster with eksctl, specifically with this config:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: flux-cluster
  region: us-east-2
  version: "1.23"

availabilityZones: ["us-east-2b", "us-east-2c"]
managedNodeGroups:
  - name: workers
    instanceType: hpc6a.48xlarge
    minSize: 64
    maxSize: 64
    labels: { "fluxoperator": "true" }
    availabilityZones: ["us-east-2b"]
    efaEnabled: true
    placement:
      groupName: eks-efa-testing

And when I request a job asking for EFA for my pods, e.g. (this is from our operator CRD, which has worked before):

# Resource limits to enable efa
resources:
  limits:
    vpc.amazonaws.com/efa: 1
    memory: "340G"
    cpu: 94

the pods are stuck in Pending. Further inspection reveals:

Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  27s (x11 over 13m)  default-scheduler  0/64 nodes are available: 64 Insufficient vpc.amazonaws.com/efa.

And then I realized I could inspect the pod that is supposed to provide the EFA resource (which is where I found the container name / config that is provided in the manifest folder of this repo) and I saw:

$ kubectl describe pods -n kube-system aws-efa-k8s-device-plugin-daemonset-zpg2s
...
Warning Failed 64m (x12 over 66m) kubelet Error: container has runAsNonRoot and image will run as root (pod: "aws-efa-k8s-device-plugin-daemonset-zpg2s_kube-system(1b46d2ac-c922-449b-b630-bab344976d9f)", container: aws-efa-k8s-device-plugin)
Normal Pulled 115s (x303 over 66m) kubelet Container image "602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/aws-efa-k8s-device-plugin:v0.3.3" already present on machine

I traced that to change 943de83, which must have come in with the updated eksctl. Unless there is a plan to update the container image so it no longer runs as root, I'd suggest removing the added boolean (runAsNonRoot, judging by the error above). The version I was likely using before the update, which worked (and mirrors the one I found in your example repo), is https://github.com/weaveworks/eksctl/blob/7ad54ae5d60d730e6d2ca8741d866f5415bab518/pkg/addons/assets/efa-device-plugin.yaml. Thanks!
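
For reference, here is a minimal sketch of the kind of setting that I believe produces the kubelet error above. It is not copied from the eksctl manifest or from the commit: the field placement (container-level rather than pod-level `securityContext`) and the surrounding structure are assumptions for illustration. Only the container name, image, and the `runAsNonRoot` setting come from the pod events.

```yaml
# Hypothetical excerpt of the device plugin DaemonSet pod template.
# Where exactly runAsNonRoot sits in the real manifest is an assumption.
spec:
  template:
    spec:
      containers:
        - name: aws-efa-k8s-device-plugin
          image: 602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/aws-efa-k8s-device-plugin:v0.3.3
          securityContext:
            # With this set, the kubelet refuses to start a container whose
            # image runs as root (UID 0), which matches the
            # "container has runAsNonRoot and image will run as root" error.
            runAsNonRoot: true
```

If that is what is happening, the plugin never starts, the nodes never advertise vpc.amazonaws.com/efa, and the scheduler reports it as insufficient on all 64 nodes. Dropping the field, or shipping an image that runs as a non-root user, should let the container start again.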