Skip to content

[Feature] Add more supported instances (P5) to EFA Device Plugin DaemonSet - refer to eks-charts #8089

@ytsssun

Description

@ytsssun

What feature/behavior/change do you want?

eksctl would try to install the EFA device plugin if efaEnabled is set to true. The pkg/addons/assets/efa-device-plugin.yaml maintained in this repo is what eksctl used to deploy the EFA device plugin DaemonSet. However, it is pretty outdated:

  • It lacks of the newly supported instance type like p5 instances.
  • It is still pointing to a fairly old image tag /eks/aws-efa-k8s-device-plugin:v0.3.3. The official EFA Device plugin vended by eks-charts is pointing to image tag v0.5.4 already.

Instead of maintain the yaml file, can we just generate it from the latest eks-chart? One can do this

git clone https://github.com/aws/eks-charts.git
cd eks-charts/stable/aws-efa-k8s-device-plugin/
helm template . > efa-device-plugin.yaml

The generated efa-device-plugin.yaml would be like below

---
# Source: aws-efa-k8s-device-plugin/templates/daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: release-name-aws-efa-k8s-device-plugin
  labels:
    helm.sh/chart: aws-efa-k8s-device-plugin-v0.5.7
    app.kubernetes.io/name: aws-efa-k8s-device-plugin
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/version: "v0.5.4"
    app.kubernetes.io/managed-by: Helm
spec:
  selector:
    matchLabels:
      name:  release-name-aws-efa-k8s-device-plugin
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: release-name-aws-efa-k8s-device-plugin
    spec:
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                - key: node.kubernetes.io/instance-type
                  operator: In
                  values:
                    - m5dn.24xlarge
                    - m5dn.metal
                    - m5n.24xlarge
                    - m5n.metal
                    - m5zn.12xlarge
                    - m5zn.metal
                    - m6a.48xlarge
                    - m6a.metal
                    - m6i.32xlarge
                    - m6i.metal
                    - m6id.32xlarge
                    - m6id.metal
                    - m6idn.32xlarge
                    - m6idn.metal
                    - m6in.32xlarge
                    - m6in.metal
                    - m7a.48xlarge
                    - m7a.metal-48xl
                    - m7g.16xlarge
                    - m7g.metal
                    - m7gd.16xlarge
                    - m7i.48xlarge
                    - m7i.metal-48xl
                    - c5n.9xlarge
                    - c5n.18xlarge
                    - c5n.metal
                    - c6a.48xlarge
                    - c6a.metal
                    - c6gn.16xlarge
                    - c6i.32xlarge
                    - c6i.metal
                    - c6id.32xlarge
                    - c6id.metal
                    - c6in.32xlarge
                    - c6in.metal
                    - c7a.48xlarge
                    - c7a.metal-48xl
                    - c7g.16xlarge
                    - c7g.metal
                    - c7gd.16xlarge
                    - c7gn.16xlarge
                    - c7i.48xlarge
                    - c7i.metal-48xl
                    - r5dn.24xlarge
                    - r5dn.metal
                    - r5n.24xlarge
                    - r5n.metal
                    - r6a.48xlarge
                    - r6a.metal
                    - r6i.32xlarge
                    - r6i.metal
                    - r6idn.32xlarge
                    - r6idn.metal
                    - r6in.32xlarge
                    - r6in.metal
                    - r6id.32xlarge
                    - r6id.metal
                    - r7a.48xlarge
                    - r7a.metal-48xl
                    - r7g.16xlarge
                    - r7g.metal
                    - r7gd.16xlarge
                    - r7i.48xlarge
                    - r7i.metal-48xl
                    - r7iz.32xlarge
                    - r7iz.metal-32xl
                    - x2idn.32xlarge
                    - x2idn.metal
                    - x2iedn.32xlarge
                    - x2iedn.metal
                    - x2iezn.12xlarge
                    - x2iezn.metal
                    - i3en.12xlarge
                    - i3en.24xlarge
                    - i3en.metal
                    - i4g.16xlarge
                    - i4i.32xlarge
                    - i4i.metal
                    - im4gn.16xlarge
                    - dl1.24xlarge
                    - dl2q.24xlarge
                    - g4dn.8xlarge
                    - g4dn.12xlarge
                    - g4dn.16xlarge
                    - g4dn.metal
                    - g5.8xlarge
                    - g5.12xlarge
                    - g5.16xlarge
                    - g5.24xlarge
                    - g5.48xlarge
                    - g6.8xlarge
                    - g6.12xlarge
                    - g6.16xlarge
                    - g6.24xlarge
                    - g6.48xlarge
                    - g6e.8xlarge
                    - g6e.12xlarge
                    - g6e.16xlarge
                    - g6e.24xlarge
                    - g6e.48xlarge
                    - gr6.8xlarge
                    - inf1.24xlarge
                    - p3dn.24xlarge
                    - p4d.24xlarge
                    - p4de.24xlarge
                    - p5.48xlarge
                    - p5e.48xlarge
                    - p5en.48xlarge
                    - trn1.32xlarge
                    - trn1n.32xlarge
                    - trn2.48xlarge
                    - vt1.24xlarge
                    - hpc6a.48xlarge
                    - hpc6id.32xlarge
                    - hpc7a.12xlarge
                    - hpc7a.24xlarge
                    - hpc7a.48xlarge
                    - hpc7a.96xlarge
                    - hpc7g.4xlarge
                    - hpc7g.8xlarge
                    - hpc7g.16xlarge
                - key: eks.amazonaws.com/compute-type
                  operator: NotIn
                  values:
                  - auto
      hostNetwork: true
      containers:
        - image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/aws-efa-k8s-device-plugin:v0.5.4
          name: aws-efa-k8s-device-plugin
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
              - ALL
            runAsNonRoot: false
          resources:
            requests:
              cpu: 10m
              memory: 20Mi
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: infiniband-volume
              mountPath: /dev/infiniband/
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: infiniband-volume
          hostPath:
            path: /dev/infiniband/

Why do you want this feature?

The eksctl installed device plugin is outdated and cannot be deployed to a certain instance type that already supports EFA (e.g. p5.48xlarge, trn2.48xlarge).

There has been past PRs/issues asking to update the yaml with more supported instances:

I do see that there were past changes that got reverted - 2f12605. I am fine with just updating the p5 series instances to begin with.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions