
Machines lost network connection #1873

@loomsen

Description

Hi folks,

I'm running several clusters, and suddenly all my Kubernetes machines, workers and control planes alike, have lost connection to the outside world. I have a NAT gateway and a WireGuard machine running in the projects. The NAT GW and the WireGuard machine work fine, but the k8s nodes all say:

# ping google.de
ping: google.de: Temporary failure in name resolution
# ping 185.12.64.1
ping: connect: Network is unreachable

Is anybody else experiencing the same? It happened all of a sudden about 9 hours ago, and I haven't found a way to get the nodes back online permanently. Adding back a default route brought the connectivity back, but only temporarily 🤷‍♂️
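For anyone hitting the same thing, a quick diagnostic sketch (plain iproute2, nothing specific to this setup) to check whether a node is in this state:

```shell
# Diagnostic sketch: an affected node still has its DHCP routes for
# 10.0.0.0/8, but no default route, so every public address is unreachable.
if ip route show default 2>/dev/null | grep -q '^default'; then
    echo "default route: present"
else
    echo "default route: missing"
fi
```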

Fortunately the services are still online, but when running Terraform, I get:

module.kube.module.kube.null_resource.kustomization (remote-exec): + kubectl delete --ignore-not-found -n kube-system helmchart.helm.cattle.io/hcloud-cloud-controller-manager
module.kube.module.kube.null_resource.kustomization (remote-exec): + kubectl apply -k /var/post_install
module.kube.module.kube.null_resource.kustomization (remote-exec): error: accumulating resources: accumulation err='accumulating resources from 'https://github.com/kubereboot/kured/releases/download/1.17.1/kured-1.17.1-dockerhub.yaml': Get "https://github.com/kubereboot/kured/releases/download/1.17.1/kured-1.17.1-dockerhub.yaml": dial tcp: lookup github.com: Try again': failed to run '/usr/bin/git fetch --depth=1 https://github.com/kubereboot/kured HEAD': fatal: unable to access 'https://github.com/kubereboot/kured/': Could not resolve host: github.com
module.kube.module.kube.null_resource.kustomization (remote-exec): : exit status 128
╷
│ Error: remote-exec provisioner error
│ 
│   with module.kube.module.kube.null_resource.kustomization,
│   on ../../../../terraform-modules/terraform-hcloud-kube-hetzner/init.tf line 405, in resource "null_resource" "kustomization":
│  405:   provisioner "remote-exec" {
│ 
│ error executing "/tmp/terraform_756741801.sh": Process exited with status 1

Edit:

Apparently the default route vanished from all the machines:

# ip r s
10.0.0.0/8 via 10.0.0.1 dev eth1 proto dhcp src 10.127.128.5 metric 100 
10.0.0.1 dev eth1 proto dhcp scope link src 10.127.128.5 metric 100 
169.254.169.254 via 10.0.0.1 dev eth1 proto dhcp src 10.127.128.5 metric 100

# ip r add default via 10.0.0.1 dev eth1

# ping -c1 google.de
PING google.de (142.250.184.227) 56(84) bytes of data.
64 bytes from fra24s12-in-f3.1e100.net (142.250.184.227): icmp_seq=1 ttl=114 time=32.9 ms

--- google.de ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 32.869/32.869/32.869/0.000 ms

However, this is not reboot-safe: after a reboot the default route is gone again, and the node is stuck without internet access 🤷‍♂️
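As a stopgap until the real cause is found, one could re-add the route at every boot from a oneshot systemd unit. This is only a sketch: the unit name is my own invention, and the gateway `10.0.0.1` and device `eth1` are taken from the `ip r` output above, so adjust them for your network.

```shell
# Hypothetical stopgap: persist the default route with a oneshot unit.
# Gateway 10.0.0.1 and device eth1 come from the `ip r` output above.
cat > /etc/systemd/system/default-route.service <<'EOF'
[Unit]
Description=Re-add default route via private gateway
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/ip route replace default via 10.0.0.1 dev eth1

[Install]
WantedBy=multi-user.target
EOF

# enable it for future boots (and start it now):
#   systemctl enable --now default-route.service
```

This only papers over the problem, of course; the real question is why the machines stopped getting a default route in the first place.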
