Wait for system deployments and jobs on init #1948

Merged: mysticaltech merged 1 commit into mysticaltech:fix/staging-review-2026-01-11 from vsalomaki:feature/wait-for-deployments on Jan 11, 2026

@vsalomaki vsalomaki commented Oct 30, 2025

Description

Adds wait steps that block until system deployments and jobs have finished during init. Some of the system Helm charts, for example cert-manager and Longhorn, are slow to start completely, and subsequent manifest installations may fail if these charts are not yet fully installed.

To inspect the cluster state, I ran kubectl get pods,deployments,jobs -A right after this line:
https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/blob/ee1b6badba4223ff6dc7f619e52aee7f822db48c/init.tf#L488
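The diff itself is not shown in this conversation, so the following is only a rough sketch of what wait steps of this kind can look like with `kubectl wait`. The function name `wait_for_system_workloads`, the default timeout, and the order of the two waits are assumptions for illustration, not the code merged in this PR.

```shell
# Sketch only: block until every Job is Complete and every Deployment is
# Available, across all namespaces.
# The function name, default timeout, and ordering are illustrative
# assumptions, not taken from the PR diff.
wait_for_system_workloads() {
  local timeout="${1:-180s}"
  # Jobs first: the helm-install jobs create the Deployments waited on next.
  kubectl wait --for=condition=Complete job --all -A --timeout="$timeout" &&
    kubectl wait --for=condition=Available deployment --all -A --timeout="$timeout"
}
```

Calling `wait_for_system_workloads 300s` would use a longer timeout. Waiting on jobs before deployments is one possible ordering: a deployment created by a still-running install job would otherwise not exist yet when the deployment wait starts.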

Before waits
The output below shows that certain deployments and jobs were still in various stages of initialization. (The list of pods is included only for reference; other resource types were excluded from this extract.)

 NAMESPACE         NAME                                                     READY   STATUS              RESTARTS   AGE
 kube-system       pod/cilium-9lfk6                                         1/1     Running             0          39s
 kube-system       pod/cilium-c7n4s                                         1/1     Running             0          39s
 kube-system       pod/cilium-envoy-5f59d                                   1/1     Running             0          39s
 kube-system       pod/cilium-envoy-cxm94                                   1/1     Running             0          39s
 kube-system       pod/cilium-envoy-fbd8z                                   1/1     Running             0          39s
 kube-system       pod/cilium-envoy-jgzfv                                   1/1     Running             0          39s
 kube-system       pod/cilium-envoy-vsxzx                                   1/1     Running             0          39s
 kube-system       pod/cilium-envoy-x4wmd                                   1/1     Running             0          39s
 kube-system       pod/cilium-ffjlf                                         1/1     Running             0          39s
 kube-system       pod/cilium-ldz5p                                         1/1     Running             0          39s
 kube-system       pod/cilium-operator-657f46f59c-ccfc6                     1/1     Running             0          39s
 kube-system       pod/cilium-operator-657f46f59c-kfmqz                     1/1     Running             0          39s
 kube-system       pod/cilium-qlvq5                                         1/1     Running             0          39s
 kube-system       pod/cilium-vj4vx                                         1/1     Running             0          39s
 kube-system       pod/coredns-64fd4b4794-mgpwv                             1/1     Running             0          2m48s
 kube-system       pod/hcloud-cloud-controller-manager-fc85ff7c7-z24sh      1/1     Running             0          42s
 kube-system       pod/hcloud-csi-controller-767b5cf9cd-b2qn8               5/5     Running             0          42s
 kube-system       pod/hcloud-csi-node-7twfd                                3/3     Running             0          42s
 kube-system       pod/hcloud-csi-node-95ms6                                3/3     Running             0          42s
 kube-system       pod/hcloud-csi-node-p92ft                                3/3     Running             0          42s
 kube-system       pod/helm-install-cert-manager-69vmt                      1/1     Running             0          50s
 kube-system       pod/helm-install-cilium-9vfzc                            0/1     Completed           0          50s
 kube-system       pod/helm-install-hcloud-cloud-controller-manager-bpzmt   0/1     Completed           0          50s
 kube-system       pod/helm-install-hcloud-csi-4rp48                        0/1     Completed           0          49s
 kube-system       pod/helm-install-longhorn-xzjvp                          0/1     Completed           0          49s
 kube-system       pod/helm-install-traefik-2zpm2                           0/1     Completed           0          49s
 kube-system       pod/kured-272nr                                          1/1     Running             0          13s
 kube-system       pod/kured-2m8gz                                          1/1     Running             0          13s
 kube-system       pod/kured-8l8fp                                          1/1     Running             0          13s
 kube-system       pod/kured-bvwgs                                          0/1     ContainerCreating   0          10s
 kube-system       pod/kured-knhgh                                          1/1     Running             0          13s
 kube-system       pod/kured-r6gbn                                          1/1     Running             0          13s
 kube-system       pod/metrics-server-7bfffcd44-nbdvg                       1/1     Running             0          2m48s
 longhorn-system   pod/longhorn-driver-deployer-74f45ccf86-brfjp            0/1     Init:0/1            0          1s
 longhorn-system   pod/longhorn-manager-2mkkz                               0/2     ContainerCreating   0          1s
 longhorn-system   pod/longhorn-manager-2n656                               0/2     ContainerCreating   0          1s
 longhorn-system   pod/longhorn-manager-qld5m                               0/2     ContainerCreating   0          1s
 longhorn-system   pod/longhorn-ui-7b8657c6cd-cbdtx                         0/1     ContainerCreating   0          1s
 longhorn-system   pod/longhorn-ui-7b8657c6cd-jq66n                         0/1     ContainerCreating   0          1s
 system-upgrade    pod/system-upgrade-controller-6df5bf54f6-zsrdv           1/1     Running             0          50s
 traefik           pod/traefik-79c7f596d9-k2p2q                             1/1     Running             0          25s
 traefik           pod/traefik-79c7f596d9-tvr9g                             1/1     Running             0          25s
 traefik           pod/traefik-79c7f596d9-zzspd                             1/1     Running             0          40s
 
 NAMESPACE         NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
 cert-manager      deployment.apps/cert-manager                      0/1     0            0           0s
 cert-manager      deployment.apps/cert-manager-cainjector           0/1     0            0           0s
 cert-manager      deployment.apps/cert-manager-webhook              0/1     0            0           0s
 kube-system       deployment.apps/cilium-operator                   2/2     2            2           39s
 kube-system       deployment.apps/coredns                           1/1     1            1           2m51s
 kube-system       deployment.apps/hcloud-cloud-controller-manager   1/1     1            1           42s
 kube-system       deployment.apps/hcloud-csi-controller             1/1     1            1           42s
 kube-system       deployment.apps/metrics-server                    1/1     1            1           2m50s
 longhorn-system   deployment.apps/longhorn-driver-deployer          0/1     1            0           1s
 longhorn-system   deployment.apps/longhorn-ui                       0/2     2            0           1s
 system-upgrade    deployment.apps/system-upgrade-controller         1/1     1            1           50s
 traefik           deployment.apps/traefik                           3/3     3            3           40s
 
 NAMESPACE     NAME                                                     STATUS     COMPLETIONS   DURATION   AGE
 kube-system   job.batch/helm-install-cert-manager                      Running    0/1           50s        50s
 kube-system   job.batch/helm-install-cilium                            Complete   1/1           14s        50s
 kube-system   job.batch/helm-install-hcloud-cloud-controller-manager   Complete   1/1           10s        50s
 kube-system   job.batch/helm-install-hcloud-csi                        Complete   1/1           10s        49s
 kube-system   job.batch/helm-install-longhorn                          Running    0/1           49s        49s
 kube-system   job.batch/helm-install-traefik                           Complete   1/1           12s        49s
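A listing like the one above can be filtered mechanically for workloads that are not yet ready. The following is a small sketch and not part of the PR; the function name `not_ready_deployments` is hypothetical, and it assumes header-less rows in the NAMESPACE NAME READY ... column order shown above.

```shell
# Sketch: read deployment rows (NAMESPACE NAME READY ...) on stdin, without
# the header line, and print those whose READY column (e.g. 0/2) is not full.
# The function name is a hypothetical helper, not from the PR.
not_ready_deployments() {
  awk '{ split($3, r, "/"); if (r[1] != r[2]) print $1, $2, $3 }'
}
```

It could be fed from `kubectl get deployments -A --no-headers | not_ready_deployments`.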

After wait
After waiting for the deployments and jobs as described in this PR, running the same kubectl get command right after those waits yields:

 NAMESPACE         NAME                                                     READY   STATUS      RESTARTS   AGE
 cert-manager      pod/cert-manager-77b74755d9-ffdhb                        1/1     Running     0          23s
 cert-manager      pod/cert-manager-cainjector-65fcfd6ccf-sf7mm             1/1     Running     0          23s
 cert-manager      pod/cert-manager-webhook-9b4dd78-957r9                   1/1     Running     0          23s
 kube-system       pod/cilium-9lfk6                                         1/1     Running     0          62s
 kube-system       pod/cilium-c7n4s                                         1/1     Running     0          62s
 kube-system       pod/cilium-envoy-5f59d                                   1/1     Running     0          62s
 kube-system       pod/cilium-envoy-cxm94                                   1/1     Running     0          62s
 kube-system       pod/cilium-envoy-fbd8z                                   1/1     Running     0          62s
 kube-system       pod/cilium-envoy-jgzfv                                   1/1     Running     0          62s
 kube-system       pod/cilium-envoy-vsxzx                                   1/1     Running     0          62s
 kube-system       pod/cilium-envoy-x4wmd                                   1/1     Running     0          62s
 kube-system       pod/cilium-ffjlf                                         1/1     Running     0          62s
 kube-system       pod/cilium-ldz5p                                         1/1     Running     0          62s
 kube-system       pod/cilium-operator-657f46f59c-ccfc6                     1/1     Running     0          62s
 kube-system       pod/cilium-operator-657f46f59c-kfmqz                     1/1     Running     0          62s
 kube-system       pod/cilium-qlvq5                                         1/1     Running     0          62s
 kube-system       pod/cilium-vj4vx                                         1/1     Running     0          62s
 kube-system       pod/coredns-64fd4b4794-mgpwv                             1/1     Running     0          3m11s
 kube-system       pod/hcloud-cloud-controller-manager-fc85ff7c7-z24sh      1/1     Running     0          65s
 kube-system       pod/hcloud-csi-controller-767b5cf9cd-b2qn8               5/5     Running     0          65s
 kube-system       pod/hcloud-csi-node-7twfd                                3/3     Running     0          65s
 kube-system       pod/hcloud-csi-node-95ms6                                3/3     Running     0          65s
 kube-system       pod/hcloud-csi-node-p92ft                                3/3     Running     0          65s
 kube-system       pod/helm-install-cert-manager-69vmt                      0/1     Completed   0          73s
 kube-system       pod/helm-install-cilium-9vfzc                            0/1     Completed   0          73s
 kube-system       pod/helm-install-hcloud-cloud-controller-manager-bpzmt   0/1     Completed   0          73s
 kube-system       pod/helm-install-hcloud-csi-4rp48                        0/1     Completed   0          72s
 kube-system       pod/helm-install-longhorn-xzjvp                          0/1     Completed   0          72s
 kube-system       pod/helm-install-traefik-2zpm2                           0/1     Completed   0          72s
 kube-system       pod/kured-272nr                                          1/1     Running     0          36s
 kube-system       pod/kured-2m8gz                                          1/1     Running     0          36s
 kube-system       pod/kured-8l8fp                                          1/1     Running     0          36s
 kube-system       pod/kured-bvwgs                                          1/1     Running     0          33s
 kube-system       pod/kured-knhgh                                          1/1     Running     0          36s
 kube-system       pod/kured-r6gbn                                          1/1     Running     0          36s
 kube-system       pod/metrics-server-7bfffcd44-nbdvg                       1/1     Running     0          3m11s
 longhorn-system   pod/engine-image-ei-26bab25d-87sdv                       0/1     Running     0          8s
 longhorn-system   pod/engine-image-ei-26bab25d-pkv5d                       0/1     Running     0          8s
 longhorn-system   pod/engine-image-ei-26bab25d-wttkg                       0/1     Running     0          8s
 longhorn-system   pod/longhorn-driver-deployer-74f45ccf86-brfjp            1/1     Running     0          24s
 longhorn-system   pod/longhorn-manager-2mkkz                               2/2     Running     0          24s
 longhorn-system   pod/longhorn-manager-2n656                               2/2     Running     0          24s
 longhorn-system   pod/longhorn-manager-qld5m                               2/2     Running     0          24s
 longhorn-system   pod/longhorn-ui-7b8657c6cd-cbdtx                         1/1     Running     0          24s
 longhorn-system   pod/longhorn-ui-7b8657c6cd-jq66n                         1/1     Running     0          24s
 system-upgrade    pod/system-upgrade-controller-6df5bf54f6-zsrdv           1/1     Running     0          73s
 traefik           pod/traefik-79c7f596d9-k2p2q                             1/1     Running     0          48s
 traefik           pod/traefik-79c7f596d9-tvr9g                             1/1     Running     0          48s
 traefik           pod/traefik-79c7f596d9-zzspd                             1/1     Running     0          63s

 NAMESPACE         NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
 cert-manager      deployment.apps/cert-manager                      1/1     1            1           23s
 cert-manager      deployment.apps/cert-manager-cainjector           1/1     1            1           23s
 cert-manager      deployment.apps/cert-manager-webhook              1/1     1            1           23s
 kube-system       deployment.apps/cilium-operator                   2/2     2            2           62s
 kube-system       deployment.apps/coredns                           1/1     1            1           3m14s
 kube-system       deployment.apps/hcloud-cloud-controller-manager   1/1     1            1           65s
 kube-system       deployment.apps/hcloud-csi-controller             1/1     1            1           65s
 kube-system       deployment.apps/metrics-server                    1/1     1            1           3m13s
 longhorn-system   deployment.apps/longhorn-driver-deployer          1/1     1            1           24s
 longhorn-system   deployment.apps/longhorn-ui                       2/2     2            2           24s
 system-upgrade    deployment.apps/system-upgrade-controller         1/1     1            1           73s
 traefik           deployment.apps/traefik                           3/3     3            3           63s

 NAMESPACE     NAME                                                     STATUS     COMPLETIONS   DURATION   AGE
 kube-system   job.batch/helm-install-cert-manager                      Complete   1/1           72s        73s
 kube-system   job.batch/helm-install-cilium                            Complete   1/1           14s        73s
 kube-system   job.batch/helm-install-hcloud-cloud-controller-manager   Complete   1/1           10s        73s
 kube-system   job.batch/helm-install-hcloud-csi                        Complete   1/1           10s        72s
 kube-system   job.batch/helm-install-longhorn                          Complete   1/1           51s        72s
 kube-system   job.batch/helm-install-traefik                           Complete   1/1           12s        72s

@gemini-code-assist

Summary of Changes

Hello @vsalomaki, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness of the Kubernetes cluster initialization process. By integrating explicit wait commands for all system deployments and jobs, it ensures that foundational components are fully operational before proceeding with further configurations. This change directly mitigates potential installation failures caused by race conditions with slow-starting services, leading to a more stable and reliable cluster setup.

Highlights

  • Enhanced Cluster Initialization Reliability: Introduced explicit wait steps during cluster initialization to ensure all system deployments are 'Available' and all system jobs are 'Complete' before subsequent configurations are applied.
  • Prevention of Installation Failures: Addresses failures in later manifest installations caused by slow-starting system components such as cert-manager and Longhorn; the wait operations use a 180-second timeout.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces wait steps to ensure system deployments and jobs are fully initialized before proceeding with cluster setup. This is a valuable addition for improving the robustness of the initialization process. My review includes a suggestion to increase the timeout for these waits to better handle slower environments and prevent potential race conditions, making the process more reliable.
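The review thread itself is collapsed here, so the suggested timeout value is not known. As a complementary sketch for slow environments, a wait command can also be wrapped in a small retry helper, since `kubectl wait` can exit with an error straight away when no matching resources exist yet. The `retry` function and its argument order are hypothetical, not taken from the PR.

```shell
# Sketch: retry a command until it succeeds or the attempts run out.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper for illustration; not part of the PR.
retry() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1  # give up after the last allowed attempt
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}
```

For example, `retry 5 10 kubectl wait --for=condition=Available deployment --all -A --timeout=300s` would try up to five times, ten seconds apart; the values are illustrative.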

Comment thread init.tf Outdated
@vsalomaki vsalomaki force-pushed the feature/wait-for-deployments branch from 7c352b1 to 5c54fb7 Compare October 30, 2025 22:49
@vsalomaki

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces waits for system deployments and jobs to complete during cluster initialization, which is a great improvement for the stability and reliability of the setup process. The provided logs clearly demonstrate the effectiveness of this change. I have one suggestion to make the wait condition even more robust by also considering DaemonSet resources, which are used for critical components like the CNI.
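DaemonSets do not typically report an Available condition the way Deployments do, so one common way to wait on them is `kubectl rollout status`. The sketch below assumes that approach; the function name `wait_for_daemonsets` and its arguments are hypothetical, and whether the PR adopted this suggestion is not visible in this conversation.

```shell
# Sketch: wait for every DaemonSet in a namespace to finish rolling out.
# Usage: wait_for_daemonsets NAMESPACE [TIMEOUT]
# Hypothetical helper for illustration; not taken from the PR diff.
wait_for_daemonsets() {
  local ns="$1"
  local timeout="${2:-180s}"
  local ds
  # `-o name` prints entries such as daemonset.apps/<name>, one per line.
  for ds in $(kubectl get daemonset -n "$ns" -o name); do
    kubectl rollout status "$ds" -n "$ns" --timeout="$timeout" || return 1
  done
}
```

For instance, `wait_for_daemonsets kube-system 300s` would check each DaemonSet in kube-system in turn and stop at the first one that does not roll out in time.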

Comment thread init.tf Outdated
@vsalomaki vsalomaki force-pushed the feature/wait-for-deployments branch from 5c54fb7 to 8991a83 Compare October 30, 2025 23:08
@mysticaltech mysticaltech changed the base branch from master to fix/staging-review-2026-01-11 January 11, 2026 18:07
@mysticaltech mysticaltech merged commit ef7026e into mysticaltech:fix/staging-review-2026-01-11 Jan 11, 2026
3 checks passed