Skip to content

[Bug]: autoscaler nodes with zram_size hang in cloud-init and never join the cluster #2161

@acschm1d

Description

@acschm1d

Description

Affected versions: 2.19.0, 2.19.1 (introduced in 7cd5843, shipped in v2.19.0)

Symptoms

When zram_size is set on an autoscaler nodepool, newly provisioned nodes boot but never join the k3s cluster. The autoscaler eventually times out and deletes them. No k3s installation takes place.

Root cause

There is a circular systemd ordering dependency in templates/autoscaler-cloudinit.yaml.tpl.

The zram.service unit is written with After=multi-user.target:

[Unit]
Description=Swap with zram
After=multi-user.target

This unit is activated during runcmd via systemctl enable --now zram.service. However, runcmd runs inside cloud-final.service, and multi-user.target waits for cloud-final.service to complete before it becomes active. This creates a deadlock:

cloud-final.service (running runcmd)
  → systemctl enable --now zram.service
    → zram.service waits for multi-user.target (After=)
      → multi-user.target waits for cloud-final.service
        → deadlock - node hangs forever

The node never progresses past this point, so k3s is never installed and the node never joins the cluster.

Why regular nodepools are unaffected: regular nodes set up zram via Terraform remote-exec over SSH after the node has fully booted - at that point multi-user.target is already active and there is no deadlock.

Verification

On a hung autoscaler node, the blocked process is visible:

cloud-final.service: running systemctl enable --now zram.service (PID stuck waiting)

Killing that PID allows runcmd to continue. k3s installs successfully and the node joins the cluster within minutes - confirming the deadlock is the sole blocker.

Kube.tf file

module "k3s-cluster" {
  source  = "kube-hetzner/kube-hetzner/hcloud"
  version = "2.19.1"

  # ... other config ...

  autoscaler_nodepools = [
    {
      name        = "autoscaled-compute"
      server_type = "cx33"
      location    = "hel1"
      min_nodes   = 0
      max_nodes   = 3
      labels      = {}
      taints      = []
      swap_size   = "1G"
      zram_size   = "1G"  # <-- triggers the deadlock
      kubelet_args = []
    }
  ]

  autoscaler_disable_ipv4 = true
  autoscaler_disable_ipv6 = true
}

Screenshots

No response

Platform

Mac

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions