[Bug] Issue with efa device plugin running as root #6222

@vsoch


Hi! I opened the same issue at aws-samples/aws-efa-eks#8 so the two can be tracked in sync. I just updated my version of eksctl, which pulled in the new changes, and we started seeing the issue I'm reporting here. We are creating an EKS cluster with eksctl using this config:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: flux-cluster
  region: us-east-2
  version: "1.23"
  

availabilityZones: ["us-east-2b", "us-east-2c"]
managedNodeGroups:
  - name: workers
    instanceType: hpc6a.48xlarge
    minSize: 64
    maxSize: 64
    labels: { "fluxoperator": "true" }
    availabilityZones: ["us-east-2b"]
    efaEnabled: true
    placement:
      groupName: eks-efa-testing

When I submit a job that requests EFA for my pods, e.g. (this is from our operator CRD, which has worked before):

# Resource limits to enable efa
resources:
    limits:
        vpc.amazonaws.com/efa: 1
        memory: "340G"
        cpu: 94

the pods are stuck in pending. Further inspection reveals:

Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  27s (x11 over 13m)  default-scheduler  0/64 nodes are available: 64 Insufficient vpc.amazonaws.com/efa.
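
To check whether the scheduling failure is independent of the operator, a plain Pod requesting the same extended resource could be submitted. This is only a minimal sketch: the pod name, image, and command are placeholders assumed for illustration, and only the resource name and node label come from the config above.

```yaml
# Hypothetical test pod; name, image, and command are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: efa-test
spec:
  restartPolicy: Never
  nodeSelector:
    fluxoperator: "true"   # label set on the node group above
  containers:
    - name: efa-test
      image: busybox
      command: ["sleep", "3600"]
      resources:
        limits:
          # Extended resources are requested under limits; the request
          # defaults to the same value.
          vpc.amazonaws.com/efa: 1
```

If the device plugin is not advertising the resource on any node, this pod would be expected to stay Pending with the same "Insufficient vpc.amazonaws.com/efa" message.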

I then inspected the pod that is supposed to provide the EFA device (which is where I found the container name and config that are provided in the manifest folder of this repo) and saw:

$ kubectl describe pods -n kube-system aws-efa-k8s-device-plugin-daemonset-zpg2s
...
  Warning  Failed     64m (x12 over 66m)    kubelet            Error: container has runAsNonRoot and image will run as root (pod: "aws-efa-k8s-device-plugin-daemonset-zpg2s_kube-system(1b46d2ac-c922-449b-b630-bab344976d9f)", container: aws-efa-k8s-device-plugin)
  Normal   Pulled     115s (x303 over 66m)  kubelet            Container image "602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/aws-efa-k8s-device-plugin:v0.3.3" already present on machine

I traced that to this change, 943de83, which must have come with the updated eksctl. Unless there is a plan to update the container image, I suggest removing the added boolean. This is likely the version I was using before the update, which worked (and mirrors the one I found in your example repo): https://github.com/weaveworks/eksctl/blob/7ad54ae5d60d730e6d2ca8741d866f5415bab518/pkg/addons/assets/efa-device-plugin.yaml. Thanks!
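
For context, the kubelet error above is raised when a container's security context has runAsNonRoot set to true, no non-zero runAsUser is given, and the image's default user is root. Below is a sketch of what the relevant part of the DaemonSet pod template might look like. Only the container name and image come from the kubectl output above; where exactly the field sits in the eksctl manifest is an assumption.

```yaml
# Illustrative excerpt of a DaemonSet spec, not the exact eksctl asset.
spec:
  template:
    spec:
      containers:
        - name: aws-efa-k8s-device-plugin
          image: 602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/aws-efa-k8s-device-plugin:v0.3.3
          securityContext:
            # With this set to true and an image that runs as root, the
            # kubelet refuses to start the container. Dropping the field
            # (or setting it to false) restores the earlier behavior.
            runAsNonRoot: false
```

Once the plugin container starts, the nodes should advertise vpc.amazonaws.com/efa again and the pending pods should become schedulable.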
