Description
Affected versions: 2.19.0, 2.19.1 (introduced in 7cd5843, shipped in v2.19.0)
Symptoms
When zram_size is set on an autoscaler nodepool, newly provisioned nodes boot but never join the k3s cluster. The autoscaler eventually times out and deletes them. No k3s installation takes place.
Root cause
There is a circular systemd ordering dependency in templates/autoscaler-cloudinit.yaml.tpl.
The zram.service unit is written with After=multi-user.target:
```ini
[Unit]
Description=Swap with zram
After=multi-user.target
```
This unit is activated during runcmd via systemctl enable --now zram.service. However, runcmd runs inside cloud-final.service, and multi-user.target waits for cloud-final.service to complete before it becomes active. This creates a deadlock:
```text
cloud-final.service (running runcmd)
  → systemctl enable --now zram.service
  → zram.service waits for multi-user.target (After=)
  → multi-user.target waits for cloud-final.service (back to the start)
  → deadlock: the node hangs forever
```
The node never progresses past this point, so k3s is never installed and the node never joins the cluster.
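Putting the two pieces together, the generated cloud-init contains roughly the following (reconstructed for illustration; the exact contents of `templates/autoscaler-cloudinit.yaml.tpl` may differ):

```yaml
#cloud-config
write_files:
  - path: /etc/systemd/system/zram.service
    content: |
      [Unit]
      Description=Swap with zram
      # Ordered after a target that itself waits on cloud-final.service:
      After=multi-user.target
runcmd:
  # runcmd executes inside cloud-final.service, before
  # multi-user.target can become active:
  - systemctl enable --now zram.service
```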
Why regular nodepools are unaffected: regular nodes set up zram via Terraform remote-exec over SSH after the node has fully booted - at that point multi-user.target is already active and there is no deadlock.
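One way to break the cycle (a sketch, not necessarily the project's eventual fix) is to drop the ordering dependency entirely, since the `[Install]` section alone is enough to make the unit part of `multi-user.target` once enabled:

```ini
# Sketch: zram.service without the problematic ordering.
# No After=multi-user.target, so starting the unit from runcmd
# does not wait on a target that is itself waiting on cloud-final.
[Unit]
Description=Swap with zram

[Install]
WantedBy=multi-user.target
```

Alternatively, runcmd could run `systemctl enable zram.service` followed by `systemctl start --no-block zram.service`: `--no-block` enqueues the start job without waiting for it to finish, so cloud-final.service is not held up even with the ordering in place.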
Verification
On a hung autoscaler node, the blocked process is visible:
cloud-final.service: running systemctl enable --now zram.service (PID stuck waiting)
Killing that PID allows runcmd to continue. k3s installs successfully and the node joins the cluster within minutes - confirming the deadlock is the sole blocker.
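The stuck state can be confirmed with standard systemd tooling on the hung node (over the serial console or SSH, since k3s never comes up):

```sh
# List queued jobs; both the zram.service start job and
# multi-user.target should show as "waiting":
systemctl list-jobs

# Show the cloud-final.service cgroup, including the blocked
# "systemctl enable --now zram.service" process and its PID:
systemctl status cloud-final.service
```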
Kube.tf file
```hcl
module "k3s-cluster" {
  source  = "kube-hetzner/kube-hetzner/hcloud"
  version = "2.19.1"

  # ... other config ...

  autoscaler_nodepools = [
    {
      name         = "autoscaled-compute"
      server_type  = "cx33"
      location     = "hel1"
      min_nodes    = 0
      max_nodes    = 3
      labels       = {}
      taints       = []
      swap_size    = "1G"
      zram_size    = "1G" # <-- triggers the deadlock
      kubelet_args = []
    }
  ]

  autoscaler_disable_ipv4 = true
  autoscaler_disable_ipv6 = true
}
```
Screenshots
No response
Platform
Mac