
release: promote staging to master for v3 #2147

Open
mysticaltech wants to merge 412 commits into master from staging

Conversation

@mysticaltech (Owner) commented on Feb 17, 2026

v3.0.0 Release PR: staging -> master

Executive Summary

This PR promotes the v3 line from staging to master.

It is a major release candidate, not an incremental feature merge. This release train includes the original ideas_v3 work, the v3 migration contract, Terraform/OpenTofu validation hardening, first-class Tailscale node transport, clearer topology guidance, Cilium Gateway API support, embedded registry mirror support, and final release smoke gates.

No release tag is created by this PR.

Why This Is A Major Version

v3 contains deliberate breaking changes that require explicit operator intent:

  • Removed/renamed public inputs and consolidated older customization surfaces.
  • Terraform/OpenTofu minimums and hcloud provider minimums were raised (see the version-constraint sketch after this list).
  • Default behavior moved toward Leap Micro, per-nodepool networking, stricter validation, and larger-cluster topology modeling.
  • Networking, endpoint, nodepool, autoscaler, NAT/private paths, and addon orchestration have materially changed.
  • Tailscale node transport and multinetwork scale paths introduce new supported topologies that must be configured intentionally.
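
For illustration, a minimal version-constraint block consistent with the raised minimums. The floor values below are taken from the commenter's working kube.tf later in this thread, not from the v3 changelog, so treat them as assumptions and confirm against the release notes:

terraform {
  required_version = ">= 1.10.1" # assumed floor, copied from the kube.tf below
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = ">= 1.51.0" # assumed floor, copied from the kube.tf below
    }
  }
}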

Current v3 Differentiators

  • Leap Micro-first kube-hetzner clusters with k3s and RKE2 support.
  • Terraform and OpenTofu module contracts with extensive cross-variable validation.
  • Tailscale as a supported secure node transport for single-network hardening and 100+ node multinetwork scale.
  • Topology chooser and 10k-node reference examples that respect Hetzner Network and placement-group limits.
  • Cilium Gateway API as an opt-in first-class path.
  • Embedded k3s/RKE2 registry mirror as an opt-in large-cluster pull-pressure reducer.
  • Deterministic endpoint outputs for kubeconfig, node join path, transport mode, and Tailnet MagicDNS hostnames.

Core Changes By Theme

1) OS, Distribution, and Bootstrap

  • Leap Micro support and transactional-update persistence hardened.
  • RKE2 promoted as a first-class distribution path (a distribution-selection sketch follows this list).
  • SELinux defaults and migration inversion documented and validated.
  • Host bootstrap, config updates, kured, and system-upgrade paths made distribution-aware.
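
As a rough sketch of choosing the distribution path, using only variable names that already appear in the commenter's kube.tf later in this thread (kubernetes_distribution_type, install_rke2_version, initial_k3s_channel); the values are illustrative placeholders, not v3 defaults:

module "kube-hetzner" {
  # ... source, providers, token, nodepools ...

  # Select the distribution path; "rke2" opts into the first-class RKE2 path.
  kubernetes_distribution_type = "rke2"

  # Pin the version/channel explicitly for reproducible bootstraps.
  install_rke2_version = "v1.34.6+rke2r3" # placeholder, taken from the kube.tf below
  initial_k3s_channel  = "v1.34"          # only consulted on the k3s path
}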

2) Network, Endpoint, and Transport Topology

  • network_subnet_mode = "per_nodepool" is the v3 default for new clusters (see the subnet-mode sketch after this list).
  • Existing/shared subnet behavior remains available for migration compatibility.
  • Tailscale node transport added for secure node connectivity and supported multinetwork scale.
  • Cilium public-overlay multinetwork remains documented as experimental rather than a production scale claim.
  • Endpoint behavior is documented for direct public, public LB, private LB/NAT, explicit endpoints, and Tailscale MagicDNS.
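
A minimal sketch of the subnet-mode knobs, using the variable names from this PR description and the commenter's kube.tf below (network_subnet_mode, network_ipv4_cidr, subnet_amount); the CIDR and subnet count are illustrative, not sizing guidance:

module "kube-hetzner" {
  # ...

  # New v3 clusters default to one subnet per nodepool.
  network_subnet_mode = "per_nodepool"
  # Clusters migrating from v2 can keep the shared-subnet behavior instead:
  # network_subnet_mode = "legacy"

  # Overall Network CIDR and how many subnets to carve out of it.
  network_ipv4_cidr = "10.0.0.0/16"
  subnet_amount     = 256
}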

3) Nodepools, Autoscaler, and Placement Groups

  • Nodepool schemas expanded with per-node overrides, public network controls, extra networks, volumes, labels, taints, and map/count safety (an example nodepool follows this list).
  • Autoscaler paths support per-network/Tailscale-aware rendering where configured.
  • Placement groups are sharded/validated around Hetzner's spread-group limits.
  • 100+ node and 10k-node reference examples encode Hetzner Network attachment limits instead of pretending a single Network scales forever.
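
An illustrative agent nodepool built only from fields that appear in the commenter's kube.tf later in this thread (name, server_type, location, labels, taints, count, attached_volumes); the other per-node override fields mentioned above are omitted here rather than guessed:

agent_nodepools = [
  {
    name        = "storage"
    server_type = "ccx23"
    location    = "nbg1"
    labels      = ["node.kubernetes.io/role=longhorn-storage"]
    taints      = []
    count       = 3
    # Per-nodepool attached volumes, e.g. Longhorn disks.
    attached_volumes = [
      {
        size       = 200
        mount_path = "/var/longhorn"
        filesystem = "ext4" # ext4 or xfs
      }
    ]
  },
]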

4) Addons and Kubernetes APIs

  • Cilium Gateway API support added via cilium_gateway_api_enabled (sketched after this list).
  • Gateway API standard CRDs are installed automatically when Cilium Gateway API or Traefik Gateway provider mode is enabled.
  • Embedded registry mirror support added for k3s/RKE2 with safe merge behavior over user registries_config.
  • CCM/CSI/ingress/cert-manager/user-kustomization rendering tightened for v3 behavior.
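
A hedged sketch of the opt-in Cilium Gateway API path, combining the flag named in this PR (cilium_gateway_api_enabled) with the CNI and kube-proxy settings already present in the commenter's kube.tf below; the embedded registry mirror toggle is not named in this description, so it is not shown:

module "kube-hetzner" {
  # ...

  # Gateway API via Cilium expects the Cilium CNI with kube-proxy disabled.
  cni_plugin                 = "cilium"
  disable_kube_proxy         = true
  cilium_gateway_api_enabled = true
  # The standard Gateway API CRDs are installed automatically when this is enabled.
}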

5) Migration and Operator Experience

  • MIGRATION.md, docs/v2-to-v3-migration.md, README, kube.tf.example, examples, and skills updated for v3.
  • scripts/v2_to_v3_migration_assistant.py provides guided v2-to-v3 checks.
  • docs/v3-topology-recommendations.md is the topology chooser for new clusters.
  • Kube-hetzner skills now know the v3 migration, testing, topology, Tailscale, Gateway API, registry mirror, and release gates.

Important Supported/Unsupported Boundaries

  • Supported secure scale path: Tailscale node transport with per-network nodepools and route advertisement where needed.
  • Supported Gateway API path: Cilium with kube-proxy disabled and cilium_gateway_api_enabled = true.
  • Supported registry mirror path: opt-in embedded mirror for trusted-node clusters.
  • Not a Talos pivot: Talos remains a different project shape.
  • Not a public-network/IP-query-server scale story: v3 does not claim production-grade 100+ node scale via a public CNI overlay.
  • Cilium public overlay remains experimental until live-proven beyond static planning.

Validation Evidence For Latest Push

Latest pushed commit: 0c03f4b (chore: finalize v3 topology support boundaries).

Local gates run from /Volumes/MysticalTech/Code/kube-hetzner:

  • Latest doc/boundary commit rechecked with terraform fmt -check -recursive, terraform validate, temp-copy tofu init -backend=false && tofu validate, uv run scripts/validate_v3_final_polish_examples.py, uv run scripts/validate_tailscale_large_scale_examples.py, and git diff --check.
  • terraform fmt -recursive
  • terraform-docs markdown . > docs/terraform.md
  • no live null_resource / hashicorp/null provider usage remains
  • terraform init -backend=false -no-color
  • terraform validate -no-color
  • temp-copy tofu init -backend=false -no-color && tofu validate -no-color
  • git diff --check
  • uv run scripts/validate_tailscale_large_scale_examples.py
  • uv run scripts/validate_v3_final_polish_examples.py
  • uv run scripts/smoke_v3_plan_matrix.py

Example parse/validate gates:

  • kube.tf.example with local module source substitution
  • examples/argocd/main.tf with local module source substitution
  • examples/cilium-gateway-api/main.tf with local module source substitution
  • examples/tailscale-node-transport/main.tf with local module source substitution

Disposable plan matrix coverage:

  • default k3s + Cilium
  • Cilium Gateway API valid
  • Cilium Gateway API invalid with Flannel
  • Cilium Gateway API invalid with kube-proxy enabled
  • embedded registry mirror valid for k3s
  • embedded registry mirror valid for RKE2
  • embedded registry mirror invalid duplicate registries
  • embedded registry mirror invalid empty registry set
  • Tailscale + embedded registry + external network valid with private route advertisement
  • Tailscale + embedded registry + external network invalid without private route advertisement

Test workspace smoke:

  • /Volumes/MysticalTech/Code/kube-test
  • terraform init -upgrade -no-color
  • terraform plan -refresh=false -lock=false -input=false -no-color -detailed-exitcode
  • Result: valid create-only plan, 43 to add, 0 to change, 0 to destroy
  • Note: used a temporary local override to pin addon versions during this smoke because unauthenticated GitHub release API quota was exhausted; the override was removed after the plan.

Reviewer Guide

Suggested high-signal review order:

  1. variables.tf
  2. validation-locals.tf
  3. locals.tf
  4. tailscale.tf
  5. init.tf
  6. control_planes.tf
  7. agents.tf
  8. autoscaler-agents.tf
  9. data.tf
  10. output.tf
  11. README.md
  12. kube.tf.example
  13. MIGRATION.md
  14. docs/v3-topology-recommendations.md
  15. examples/tailscale-node-transport/
  16. examples/cilium-gateway-api/
  17. .claude/skills/*/SKILL.md

Release Intent

Merge this PR only when v3 is ready to become the master-line release candidate. Tagging/publishing remains a separate maintainer action after final review.

@tiran133 (Contributor)

Hi,
I was playing around with this. What I notice is that all control plane and agent nodes end up in the same subnet, even though there is an agent subnet and a control plane subnet.
I changed the CIDR just for a test, but it's the same with the default 10.0.0.0/8:

network_ipv4_cidr = "10.0.0.0/16"
subnet_amount     = 256

Here is my kube.tf

locals {
  # You have the choice of setting your Hetzner API token here or define the TF_VAR_hcloud_token env
  # within your shell, such as: export TF_VAR_hcloud_token=xxxxxxxxxxx. Or you can use .tfvars-files.
  # If you choose to define it in the shell, this can be left as is.

  # Your Hetzner token can be found in your Project > Security > API Token (Read & Write is required).
  hcloud_token = ""

  # Credentials for the Hetzner Robot webservice
  robot_user     = ""
  robot_password = ""

  etcd-s3-endpoint        = "fsn1.your-objectstorage.com"
  etcd-s3-access-key      = ""
  etcd-s3-secret-key      = ""
  etcd-s3-bucket          = "backups-01"
  etcd-s3-region          = "fsn1"
  etcd-s3-folder          = "k3s-etcd-snapshots"

  longhorn_volume_size            = 200
}

module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token
  robot_user     = var.robot_user != "" ? var.robot_user : local.robot_user
  robot_password = var.robot_password != "" ? var.robot_password : local.robot_password

  kubernetes_distribution_type = "rke2"
  # network_subnet_mode = "per_nodepool"
  network_subnet_mode = "legacy"

  network_ipv4_cidr = "10.0.0.0/16"
  subnet_amount     = 256

  # source = "kube-hetzner/kube-hetzner/hcloud"
  source = "../terraform-hcloud-kube-hetzner"

 
  ssh_public_key = file("~/.ssh/id_ed25519.pub")
  ssh_private_key = null # Use agent
  network_region = "eu-central" # change to `us-east` if location is ash
  control_plane_nodepools = [
    {
      name        = "control-plane-nbg1",
      server_type = "cpx32",
      location    = "nbg1",
      labels      = [
        "node.kubernetes.io/role=egress",
      ],
      taints      = [],
      count       = 1
      swap_size   = "2G" # remember to add the suffix, examples: 512M, 1G
      zram_size   = "2G" # remember to add the suffix, examples: 512M, 1G
    },
    {
      name        = "control-plane-fsn1",
      server_type = "cpx32",
      location    = "fsn1",
      labels      = [
        "node.kubernetes.io/role=egress",
      ],
      taints      = [],
      count       = 1
      swap_size   = "2G" # remember to add the suffix, examples: 512M, 1G
      zram_size   = "2G" # remember to add the suffix, examples: 512M, 1G
    },
    {
      name        = "control-plane-hel1",
      server_type = "cpx32",
      location    = "hel1",
      labels      = [
        "node.kubernetes.io/role=egress",
      ],
      taints      = [],
      count       = 1
      swap_size   = "2G" # remember to add the suffix, examples: 512M, 1G
      zram_size   = "2G" # remember to add the suffix, examples: 512M, 1G
    }
  ]

  agent_nodepools = [
    {
      name        = "agent-medium",
      server_type = "ccx23",
      location    = "nbg1",
      labels      = [
        "node.kubernetes.io/role=worker",
        "node.longhorn.io/create-default-disk=config",
        "storage-type=fast"
      ],
      taints      = [],
      count       = 3
      swap_size   = "2G" # remember to add the suffix, examples: 512M, 1G
      zram_size   = "2G" # remember to add the suffix, examples: 512M, 1G
    },
    {
      name        = "storage",
      server_type = "ccx23",
      location    = "nbg1",
      labels      = [
        "storage-type=capacity",
        "node.kubernetes.io/role=longhorn-storage",
        "node.longhorn.io/create-default-disk=config",
      ],
      taints      = [],
      count       = 3
      attached_volumes = [
        {
          size       = local.longhorn_volume_size
          mount_path = "/var/longhorn"
          filesystem = "ext4"  # ext4 or xfs
        }
      ]
    },
  ]

  control_planes_custom_config = {
   etcd-expose-metrics = true,
   kube-controller-manager-arg = "bind-address=0.0.0.0",
   # kube-proxy-arg ="metrics-bind-address=0.0.0.0",
   kube-scheduler-arg = "bind-address=0.0.0.0",
  }
  enable_wireguard = true
  load_balancer_type     = "lb11"
  load_balancer_location = "nbg1"
  etcd_s3_backup = {
    etcd-s3-endpoint        = local.etcd-s3-endpoint
    etcd-s3-access-key      = local.etcd-s3-access-key
    etcd-s3-secret-key      = local.etcd-s3-secret-key
    etcd-s3-bucket          = local.etcd-s3-bucket
    etcd-s3-region          = local.etcd-s3-region
    etcd-s3-folder          = local.etcd-s3-folder
  }

  csi_driver_smb_version = "v1.20.1"
  hetzner_ccm_use_helm = true
  ingress_controller = "none"
  system_upgrade_use_drain = true

  install_rke2_version = "v1.34.6+rke2r3"
  initial_k3s_channel = "v1.34"
  cluster_name = "rke2-cluster"
  firewall_kube_api_source   = ["2.2.2.2","1.1.1.1"] # Dummy
  firewall_ssh_source        = ["2.2.2.2","1.1.1.1"]
  cni_plugin = "cilium"

  cilium_merge_values = <<EOT
lbIPAM:
  enabled: false
EOT


  cilium_routing_mode = "native"
  cilium_egress_gateway_enabled = true
  cilium_hubble_enabled = true
  cilium_hubble_metrics_enabled = [
    "policy:sourceContext=app|workload-name|pod|reserved-identity;destinationContext=app|workload-name|pod|dns|reserved-identity;labelsContext=source_namespace,destination_namespace"
  ]
  cilium_loadbalancer_acceleration_mode = "native"
  disable_kube_proxy = true
  enable_cert_manager = false
  dns_servers = [
    "1.1.1.1",
    "9.9.9.9",
    "2606:4700:4700::1111"
  ]
}

provider "hcloud" {
  token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token
}

terraform {
  required_version = ">= 1.10.1"
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = ">= 1.51.0"
    }
    deepmerge = {
      source  = "isometry/deepmerge"
      version = "= 1.2.1"  # or whatever version worked before
    }
  }
}

output "kubeconfig" {
  value     = module.kube-hetzner.kubeconfig
  sensitive = true
}

variable "hcloud_token" {
  sensitive = true
  default   = ""
}

variable "robot_user" {
  sensitive = true
  default   = ""
}

variable "robot_password" {
  sensitive = true
  default   = ""
}

output "k3s_token" {
  value     = module.kube-hetzner.k3s_token
  sensitive = true
}

I also tried

network_subnet_mode = "per_nodepool"


Or am I missing something?

I mean, the cluster works fine, I guess; I just don't understand the different subnets.

If I'm missing something, can you explain?

