Systemd fix for GPU instances is not working properly #3391

@frejonb

What were you trying to accomplish?

What happened?
I'm on EKS 1.9 and eksctl 0.39.0. Given the fix in #3007, I expected the OPTIONS line in /etc/sysconfig/docker to include the cgroup driver override. However, after SSHing into the node, I saw that the file did not have the --exec-opt flag, so the node was not joining the cluster. I was deploying a g4dn.xlarge spot instance.
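For anyone who wants to run the same check on a node, here is a minimal sketch of how the missing flag can be detected. The function name `has_cgroup_override` is made up for illustration, and the exact flag string is taken from the workaround in this issue rather than from eksctl's source.

```shell
# Reports whether the Docker options file carries the systemd cgroup driver override.
# The path is a parameter so the check can also be tried against a copy of the file;
# it defaults to the location mentioned in this issue.
has_cgroup_override() {
  # -q: no output, -s: stay quiet if the file does not exist
  grep -qs -- '--exec-opt native.cgroupdriver=systemd' "${1:-/etc/sysconfig/docker}"
}

if has_cgroup_override; then
  echo "override present"
else
  echo "override missing"
fi
```

If Docker is running on the node, `docker info --format '{{.CgroupDriver}}'` should additionally show which driver is actually in effect.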

How to reproduce it?

eksctl create nodegroup -f cluster.yaml

with cluster.yaml including:

  - name: mygpu
    minSize: 0
    maxSize: 1
    availabilityZones:
    - "eu-central-1a"
    - "eu-central-1b"
    instancesDistribution:
      maxPrice: 0.5
      instanceTypes:
        - "g4dn.xlarge"
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotInstancePools: 1
    volumeSize: 50
    volumeType: gp2
    privateNetworking: true
    iam:
      withAddonPolicies:
        autoScaler: true
    labels:
      foo: bar
    taints:
      foo: "bar:NoSchedule"
    tags:
      k8s.io/cluster-autoscaler/node-template/label/foo: bar
      k8s.io/cluster-autoscaler/node-template/taint/foo: "bar:NoSchedule"

To debug the problem, I temporarily enabled SSH access and looked at the files on the node. To work around the issue, I manually added the following commands to the nodegroup config:

preBootstrapCommands: 
- "sed -i 's/^OPTIONS=\"/&--exec-opt native.cgroupdriver=systemd /' /etc/sysconfig/docker"
- "systemctl restart docker"
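To illustrate what the sed expression above does, the sketch below applies it to a temporary file instead of the real /etc/sysconfig/docker. The sample OPTIONS line is an assumption (the real file on the node may differ), and the `add_cgroup_flag` helper with its grep guard is an addition for illustration, not part of the original workaround. It assumes GNU sed, as found on Amazon Linux.

```shell
# Hypothetical stand-in for /etc/sysconfig/docker; the real contents may differ.
docker_sysconfig="$(mktemp)"
printf '%s\n' 'OPTIONS="--default-ulimit nofile=1024:4096"' > "$docker_sysconfig"

# Same substitution as in preBootstrapCommands: "&" re-inserts the matched
# 'OPTIONS="' prefix, so the flag ends up first inside the quotes.
# The grep guard keeps a second run from adding the flag twice.
add_cgroup_flag() {
  if ! grep -q -- 'native.cgroupdriver=systemd' "$1"; then
    sed -i 's/^OPTIONS="/&--exec-opt native.cgroupdriver=systemd /' "$1"
  fi
}

add_cgroup_flag "$docker_sysconfig"
add_cgroup_flag "$docker_sysconfig"   # second call leaves the file unchanged
result="$(cat "$docker_sysconfig")"
echo "$result"
rm -f "$docker_sysconfig"
```

The guard matters mostly if the commands might run more than once on the same node; on a fresh boot the plain sed from the workaround behaves the same way.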

Logs

Anything else we need to know?

Versions

$ eksctl version
$ kubectl version
