Description
What feature/behavior/change do you want?
When efaEnabled is set to true, eksctl installs the EFA device plugin. It deploys the DaemonSet from pkg/addons/assets/efa-device-plugin.yaml, which is maintained in this repo. That manifest is quite outdated:
- It lacks newly supported instance types, such as the p5 instances.
- It still points to a fairly old image tag, /eks/aws-efa-k8s-device-plugin:v0.3.3. The official EFA device plugin vended by eks-charts already points to image tag v0.5.4.
Instead of maintaining the YAML file by hand, can we just generate it from the latest eks-charts? One way to do this:
git clone https://github.com/aws/eks-charts.git
cd eks-charts/stable/aws-efa-k8s-device-plugin/
helm template . > efa-device-plugin.yaml
The generated efa-device-plugin.yaml looks like this:
---
# Source: aws-efa-k8s-device-plugin/templates/daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: release-name-aws-efa-k8s-device-plugin
  labels:
    helm.sh/chart: aws-efa-k8s-device-plugin-v0.5.7
    app.kubernetes.io/name: aws-efa-k8s-device-plugin
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/version: "v0.5.4"
    app.kubernetes.io/managed-by: Helm
spec:
  selector:
    matchLabels:
      name: release-name-aws-efa-k8s-device-plugin
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: release-name-aws-efa-k8s-device-plugin
    spec:
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.kubernetes.io/instance-type
                    operator: In
                    values:
                      - m5dn.24xlarge
                      - m5dn.metal
                      - m5n.24xlarge
                      - m5n.metal
                      - m5zn.12xlarge
                      - m5zn.metal
                      - m6a.48xlarge
                      - m6a.metal
                      - m6i.32xlarge
                      - m6i.metal
                      - m6id.32xlarge
                      - m6id.metal
                      - m6idn.32xlarge
                      - m6idn.metal
                      - m6in.32xlarge
                      - m6in.metal
                      - m7a.48xlarge
                      - m7a.metal-48xl
                      - m7g.16xlarge
                      - m7g.metal
                      - m7gd.16xlarge
                      - m7i.48xlarge
                      - m7i.metal-48xl
                      - c5n.9xlarge
                      - c5n.18xlarge
                      - c5n.metal
                      - c6a.48xlarge
                      - c6a.metal
                      - c6gn.16xlarge
                      - c6i.32xlarge
                      - c6i.metal
                      - c6id.32xlarge
                      - c6id.metal
                      - c6in.32xlarge
                      - c6in.metal
                      - c7a.48xlarge
                      - c7a.metal-48xl
                      - c7g.16xlarge
                      - c7g.metal
                      - c7gd.16xlarge
                      - c7gn.16xlarge
                      - c7i.48xlarge
                      - c7i.metal-48xl
                      - r5dn.24xlarge
                      - r5dn.metal
                      - r5n.24xlarge
                      - r5n.metal
                      - r6a.48xlarge
                      - r6a.metal
                      - r6i.32xlarge
                      - r6i.metal
                      - r6idn.32xlarge
                      - r6idn.metal
                      - r6in.32xlarge
                      - r6in.metal
                      - r6id.32xlarge
                      - r6id.metal
                      - r7a.48xlarge
                      - r7a.metal-48xl
                      - r7g.16xlarge
                      - r7g.metal
                      - r7gd.16xlarge
                      - r7i.48xlarge
                      - r7i.metal-48xl
                      - r7iz.32xlarge
                      - r7iz.metal-32xl
                      - x2idn.32xlarge
                      - x2idn.metal
                      - x2iedn.32xlarge
                      - x2iedn.metal
                      - x2iezn.12xlarge
                      - x2iezn.metal
                      - i3en.12xlarge
                      - i3en.24xlarge
                      - i3en.metal
                      - i4g.16xlarge
                      - i4i.32xlarge
                      - i4i.metal
                      - im4gn.16xlarge
                      - dl1.24xlarge
                      - dl2q.24xlarge
                      - g4dn.8xlarge
                      - g4dn.12xlarge
                      - g4dn.16xlarge
                      - g4dn.metal
                      - g5.8xlarge
                      - g5.12xlarge
                      - g5.16xlarge
                      - g5.24xlarge
                      - g5.48xlarge
                      - g6.8xlarge
                      - g6.12xlarge
                      - g6.16xlarge
                      - g6.24xlarge
                      - g6.48xlarge
                      - g6e.8xlarge
                      - g6e.12xlarge
                      - g6e.16xlarge
                      - g6e.24xlarge
                      - g6e.48xlarge
                      - gr6.8xlarge
                      - inf1.24xlarge
                      - p3dn.24xlarge
                      - p4d.24xlarge
                      - p4de.24xlarge
                      - p5.48xlarge
                      - p5e.48xlarge
                      - p5en.48xlarge
                      - trn1.32xlarge
                      - trn1n.32xlarge
                      - trn2.48xlarge
                      - vt1.24xlarge
                      - hpc6a.48xlarge
                      - hpc6id.32xlarge
                      - hpc7a.12xlarge
                      - hpc7a.24xlarge
                      - hpc7a.48xlarge
                      - hpc7a.96xlarge
                      - hpc7g.4xlarge
                      - hpc7g.8xlarge
                      - hpc7g.16xlarge
                  - key: eks.amazonaws.com/compute-type
                    operator: NotIn
                    values:
                      - auto
      hostNetwork: true
      containers:
        - image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/aws-efa-k8s-device-plugin:v0.5.4
          name: aws-efa-k8s-device-plugin
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            runAsNonRoot: false
          resources:
            requests:
              cpu: 10m
              memory: 20Mi
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: infiniband-volume
              mountPath: /dev/infiniband/
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: infiniband-volume
          hostPath:
            path: /dev/infiniband/
Why do you want this feature?
The device plugin installed by eksctl is outdated and cannot be scheduled onto some instance types that already support EFA (e.g. p5.48xlarge, trn2.48xlarge), because they are missing from its node affinity list.
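To see which instance types a given manifest covers, a quick check along these lines might help. This is only a sketch: `has_instance_type` is a hypothetical helper, not part of eksctl or eks-charts, and it assumes each instance type appears on its own `- <type>` list line, as in the manifest shown here.

```shell
#!/bin/sh
# Hypothetical helper (not part of eksctl or eks-charts).
# Succeeds if <instance-type> appears as a list item in <manifest.yaml>.
# Usage: has_instance_type <manifest.yaml> <instance-type>
has_instance_type() {
    manifest="$1"
    instance_type="$2"
    # Strip leading indentation and the "- " list marker plus trailing
    # whitespace, then require an exact whole-line, fixed-string match so
    # that "p5.48xlarge" does not also match "p5e.48xlarge".
    sed -e 's/^[[:space:]]*-[[:space:]]*//' -e 's/[[:space:]]*$//' "$manifest" \
        | grep -Fxq -- "$instance_type"
}
```

For example, `has_instance_type efa-device-plugin.yaml p5.48xlarge && echo listed` can be run against both the manifest in this repo and the one generated from eks-charts to compare coverage.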
There have been past PRs/issues asking to update the YAML with more supported instances:
- Add additional EFA supported instances. #5274
- Add p4de instance types to EFA plugin #5330
- Added AWS G6 GPU instance support #7819
- Add/hpc7g node arm support #6743
I do see that some past changes were reverted (2f12605). I am fine with adding just the p5 series instances to begin with.