Commit 13e84db

fix: make Tailscale network intent plan-known
1 parent a1edbd6 commit 13e84db

22 files changed

Lines changed: 570 additions & 150 deletions

.claude/skills/kh-assistant/SKILL.md

Lines changed: 5 additions & 4 deletions
@@ -281,7 +281,7 @@ Core rules:
 - Invert positive/negative booleans carefully.
 - Remove `network_id = 0`; omitted/null means the primary Network in v3.
 - Remove control-plane `network_id`; control planes stay on the primary Network.
-- For secure Tailnet access or private multinetwork scale, prefer `node_transport_mode = "tailscale"`. For v2-to-v3 upgrades, introduce large multinetwork scale in a separate audited plan after the base upgrade. Tailscale mode keeps Kubernetes node IPs on Hetzner private addresses and can advertise node-private `/32` routes with Tailscale subnet-route SNAT disabled.
+- For secure Tailnet access or private multinetwork scale, prefer `node_transport_mode = "tailscale"`. For v2-to-v3 upgrades, introduce large multinetwork scale in a separate audited plan after the base upgrade. Tailscale mode keeps Kubernetes node IPs on Hetzner private addresses and can advertise node-private `/32` routes with Tailscale subnet-route SNAT disabled. Active agent/autoscaler nodepools in Tailscale mode must set `network_scope = "primary"` or `network_scope = "external"` so invalid same-root external Network configs fail at plan time.
 - Do not suggest Calico with Tailscale node transport yet. Flannel is first supported; Cilium is still explicitly experimental in this transport mode.
 - For Cloudflare, recommend only the external Access/Tunnel pattern for kube API, SSH, Rancher, Grafana, or ingress. Do not suggest Cloudflare Mesh/WARP as kube-hetzner node transport and do not invent Cloudflare provider inputs.
 - Run `terraform fmt -recursive`, `terraform init -upgrade`, `terraform validate`, and `terraform plan -out=v3-upgrade.tfplan`.
@@ -452,17 +452,18 @@ tailscale_node_transport = {
 mode = "auth_key"
 }
 routing = {
-# Single-network clusters may set false; external network_id nodepools need true.
+# Single-network clusters may set false; network_scope = "external" nodepools need true.
 advertise_node_private_routes = false
 }
 }
 ```

 Rules to mention:
-- Tailnet ACLs must auto-approve advertised Hetzner node-private `/32` routes when external `network_id` nodepools are used.
+- Tailscale mode requires explicit `network_scope = "primary"` or `"external"` on every active agent/autoscaler nodepool. Use `"primary"` when `network_id` is omitted/null; use `"external"` with external `network_id`, including same-root `hcloud_network.*.id`.
+- Tailnet ACLs must auto-approve advertised Hetzner node-private `/32` routes when external `network_scope` nodepools are used.
 - The module disables Tailscale subnet-route SNAT for node/CNI traffic.
 - Flannel VXLAN is first supported; Cilium needs the experimental flag; Calico is rejected.
-- Managed Hetzner private LBs are fine for single-primary-network clusters; external `network_id` nodepools need public LB targets or non-Hetzner/private alternatives.
+- Managed Hetzner private LBs are fine for single-primary-network clusters; external `network_scope` nodepools need public LB targets or non-Hetzner/private alternatives.
 - The module NAT router only gives egress to the primary Hetzner Network; external-network Tailscale nodepools need public egress or an external bootstrap path.
 - Large examples live in `examples/tailscale-node-transport/large-scale-200.tf.example` and `examples/tailscale-node-transport/massive-10000-nodes.tf.example`.
 - The 200-node static example is `3 control planes + 97 primary agents + 100 agents on one external Network`; both Networks are exactly at Hetzner's 100-attachment limit and placement groups auto-shard to 21 groups.
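
Reviewer note: the rule above about Tailnet ACLs auto-approving node-private `/32` routes could be illustrated with a policy fragment like the one below. This is only a sketch rendered from Terraform with `jsonencode`; the tag `tag:kube-hetzner`, the CIDR `10.0.0.0/8`, and the local/output names are hypothetical and not taken from this commit. It assumes the Tailscale policy file's `autoApprovers.routes` section approves advertised routes that fall inside a listed CIDR, and that nodes were explicitly opted in to a tag.

```tf
locals {
  # Hypothetical values: replace with the CIDR that covers your Hetzner
  # private node addresses and the tag (or user/group) your nodes join with.
  node_private_cidr = "10.0.0.0/8"
  node_tag          = "tag:kube-hetzner"

  # Fragment of a Tailnet policy file. Merge it into the existing policy
  # by hand; this sketch does not manage the Tailnet ACL itself.
  tailnet_policy_fragment = {
    tagOwners = {
      (local.node_tag) = ["autogroup:admin"]
    }
    autoApprovers = {
      routes = {
        # Routes advertised by nodes carrying node_tag and contained in
        # this CIDR (including /32 host routes) would be auto-approved.
        (local.node_private_cidr) = [local.node_tag]
      }
    }
  }
}

output "tailnet_policy_fragment_json" {
  value = jsonencode(local.tailnet_policy_fragment)
}
```

Without an approver entry of this kind, the README text in this commit says the cluster will wait for manual route approval.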

.claude/skills/migrate-v2-to-v3/SKILL.md

Lines changed: 6 additions & 2 deletions
@@ -168,9 +168,13 @@ terraform show -json v3-upgrade.tfplan \
 - Tailscale mode keeps Kubernetes node IPs on Hetzner private addresses and
 can advertise node-private `/32` routes with Tailscale subnet-route SNAT
 disabled. Single-network clusters may disable route advertisement; external
-`network_id` nodepools require route advertisement and Tailnet auto-approval.
+`network_scope = "external"` nodepools require route advertisement and Tailnet
+auto-approval.
 - External agent/autoscaler Network IDs must be positive Hetzner Network IDs.
-Omit/null means the primary Network.
+Omit/null means the primary Network. In Tailscale mode, active
+agent/autoscaler nodepools must also set `network_scope = "primary"` or
+`network_scope = "external"` so same-root external Network IDs validate
+during `terraform plan`.
 - Do not add new optional v3 features such as `cilium_gateway_api_enabled`,
 `embedded_registry_mirror`, new Tailscale multinetwork shards, or external
 Cloudflare Access/Tunnel routing during the same first in-place v2-to-v3

.claude/skills/prepare-release/SKILL.md

Lines changed: 2 additions & 0 deletions
@@ -237,6 +237,8 @@ For v3, additionally verify the Tailscale node-transport surfaces stay aligned:
 private multinetwork path, Flannel is first supported, Cilium is experimental,
 Calico is rejected, subnet-route SNAT is disabled when advertising routes,
 single-network examples may disable node-private route advertisement, and
+active Tailscale agent/autoscaler nodepools set `network_scope` explicitly so
+same-root external Network IDs are validated during `terraform plan`,
 external-overlay docs still describe only user-owned operator
 access/post-bootstrap features.

CHANGELOG.md

Lines changed: 3 additions & 2 deletions
@@ -46,7 +46,7 @@ This is the v3 major-release line. Before upgrading from any `v2.x` release:
 - **Cilium v3 Dual-Stack Defaults** - Cilium now renders IPv4/IPv6 Helm values from the configured cluster CIDRs and keeps kube-proxy replacement tied to `enable_kube_proxy = false` (#2170, #2178).
 - **Cilium Gateway API Support** - Added `cilium_gateway_api_enabled` to install standard Gateway API CRDs for the selected Cilium line, enable Cilium `gatewayAPI.enabled`, and wire cert-manager Gateway API support. Added `examples/cilium-gateway-api`.
 - **Cilium Multinetwork Public Overlay Preview** - Added gated `multinetwork_mode = "cilium_public_overlay"` plumbing for Cilium-only clusters spanning multiple Hetzner Networks, including public IPv4/IPv6/dual-stack transport selection, WireGuard/tunnel defaults, public load-balancer targeting, control-plane fanout removal, and one Cluster Autoscaler Deployment per effective `network_id`. This path now requires `enable_experimental_cilium_public_overlay = true` and is not production-supported until the live datapath E2E passes.
-- **Tailscale Node Transport** - Added opt-in `node_transport_mode = "tailscale"` for secure single-network clusters and supported private multinetwork scale-out. The module can bootstrap Tailscale, use MagicDNS for Terraform/kubeconfig access, optionally advertise each node's Hetzner private `/32` route with subnet-route SNAT disabled, keep Kubernetes node IPs on Hetzner private addresses, validate Tailnet/firewall/CNI/load-balancer constraints at plan time, and render autoscaler nodes with per-Network Tailscale bootstrap. Flannel is the first supported CNI; Cilium remains gated as experimental for this transport until live datapath coverage promotes it.
+- **Tailscale Node Transport** - Added opt-in `node_transport_mode = "tailscale"` for secure single-network clusters and supported private multinetwork scale-out. The module can bootstrap Tailscale, use MagicDNS for Terraform/kubeconfig access, optionally advertise each node's Hetzner private `/32` route with subnet-route SNAT disabled, keep Kubernetes node IPs on Hetzner private addresses, validate Tailnet/firewall/CNI/load-balancer constraints at plan time with explicit nodepool `network_scope`, and render autoscaler nodes with per-Network Tailscale bootstrap. Flannel is the first supported CNI; Cilium remains gated as experimental for this transport until live datapath coverage promotes it.
 - **Embedded Registry Mirror** - Added `embedded_registry_mirror` for trusted large clusters, enabling k3s/RKE2's embedded Spegel mirror while preserving user `registries_config` entries.
 - **Placement Group Auto-Sharding** - Count-based nodepools without an explicit `placement_group` now shard implicit Hetzner spread placement groups every 10 servers; explicit placement groups still fail validation above Hetzner's 10-server limit.
 - **Large-Scale Tailscale Examples** - Added +100-node and 10,000-total-node Tailscale node-transport reference examples that account for Hetzner Network attachment limits, placement-group limits, autoscaler shards, and the public-IP/Tailnet exposure model.
@@ -66,7 +66,8 @@ This is the v3 major-release line. Before upgrading from any `v2.x` release:
 - **Terraform 1.15 Validation Compatibility** - Moved cross-variable and local-dependent module contract checks from input-variable validation blocks into a hard `terraform_data.validation_contract` precondition surface, preserving plan-time failures while allowing Terraform 1.15.0 to initialize and validate the module.
 - **Tailscale Volume Provisioning Ordering** - Agent Longhorn and attached-volume configuration now waits for Tailscale agent bootstrap before using Tailnet MagicDNS SSH targets.
 - **Tailscale Auth-Key Ergonomics** - `auth_key` mode no longer advertises kube-hetzner tags by default, so simple pre-auth keys work without Tailnet `tagOwners`; tagged nodes remain an explicit opt-in and OAuth mode now validates that tag-scoped auth is configured.
-- **Tailscale Single-Network Ergonomics** - Tailscale mode now cleanly supports ordinary single-network clusters: node-private route advertisement can be disabled when no external `network_id` nodepools are used, private control-plane Load Balancers are allowed, and private managed ingress Load Balancers are rejected only for external-network scale-out.
+- **Tailscale Single-Network Ergonomics** - Tailscale mode now cleanly supports ordinary single-network clusters: node-private route advertisement can be disabled when no `network_scope = "external"` nodepools are used, private control-plane Load Balancers are allowed, and private managed ingress Load Balancers are rejected only for external-network scale-out.
+- **Tailscale Same-Root Network Validation** - Tailscale static agent and autoscaler nodepools now use explicit `network_scope = "primary" | "external"` intent, so invalid same-root external Network configurations fail during `terraform plan` even when `network_id` is not known until apply.
 - **Placement Group Disable/Limit Semantics** - `enable_placement_groups = false` now stops creating unused placement-group resources, and plan-time validation enforces Hetzner's 50-placement-group project limit before large static topologies hit provider errors.
 - **Same-Root Tailscale External Networks** - In Tailscale transport mode, nodepool `network_id` values can come from Hetzner Network resources created in the same Terraform root because control planes no longer need apply-time fanout attachments to every external agent Network.
 - **Cloud-Init Health-Checker Race** - Host and autoscaler cloud-init now masks Leap Micro/MicroOS `health-checker.service` before `cloud-final` to prevent a systemd ordering-cycle race that can skip first-boot Kubernetes bootstrap on autoscaled nodes.
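
Reviewer note: the changelog entries above describe plan-time validation through a `terraform_data` precondition keyed on explicit `network_scope`. The sketch below shows how such a check could look in isolation. It is not the module's actual `validation_contract`; the resource name, the trimmed variable shapes, and the error text are all hypothetical, and it assumes Terraform 1.4 or newer for `terraform_data` and `optional()` attributes.

```tf
variable "node_transport_mode" {
  type    = string
  default = "tailscale"
}

# Trimmed-down, hypothetical nodepool shape for illustration only.
variable "agent_nodepools" {
  type = list(object({
    name          = string
    network_id    = optional(number)
    network_scope = optional(string)
  }))
  default = []
}

resource "terraform_data" "network_scope_contract_sketch" {
  lifecycle {
    precondition {
      # The condition reads only network_scope, which is a literal in the
      # configuration, so it can be evaluated during plan even when
      # network_id is still unknown.
      condition = var.node_transport_mode != "tailscale" || alltrue([
        for pool in var.agent_nodepools :
        contains(["primary", "external"], coalesce(pool.network_scope, "unset"))
      ])
      error_message = "In Tailscale mode, every agent nodepool must set network_scope to \"primary\" or \"external\"."
    }
  }
}
```

The design point is that a condition depending on `network_id` from a same-root `hcloud_network` resource would be unknown until apply, whereas a literal `network_scope` keeps the intent known at plan time.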

MIGRATION.md

Lines changed: 6 additions & 2 deletions
@@ -328,13 +328,17 @@ Current behavior:
 `network_id` is omitted or null. Set a positive Hetzner Network ID only for an
 external network, and use either Tailscale node transport or the Cilium public
 overlay preview for autoscaler external networks.
+- In `node_transport_mode = "tailscale"`, active agent and autoscaler nodepools
+must also set `network_scope = "primary"` or `network_scope = "external"`.
+This makes the topology plan-known when `network_id` comes from a same-root
+`hcloud_network` resource.
 - In default mode, control planes may attach to external agent networks for
 compatibility with the existing private-network behavior.
 - In `node_transport_mode = "tailscale"`, control-plane fanout is disabled,
 Kubernetes keeps Hetzner private node IPs, and Tailscale can advertise each
 node's Hetzner private `/32` route with subnet-route SNAT disabled. Route
 advertisement can be disabled for single-primary-network clusters; it must
-stay enabled for external `network_id` nodepools.
+stay enabled for `network_scope = "external"` nodepools.
 - In `multinetwork_mode = "cilium_public_overlay"`, control-plane fanout is
 disabled and Cilium uses public IPv4/IPv6 transport with WireGuard encryption
 for pod-to-pod reachability across Hetzner Network islands.
@@ -349,7 +353,7 @@ Current behavior:

 Do not turn an existing v2 cluster into a large multinetwork cluster as part of
 the same first v3 apply. Upgrade cleanly first, then introduce Tailscale
-transport or new external `network_id` nodepools in a separate audited plan, or
+transport or new `network_scope = "external"` nodepools in a separate audited plan, or
 use blue/green.

 ## Post-upgrade verification checklist

README.md

Lines changed: 25 additions & 8 deletions
@@ -132,7 +132,7 @@ Only apply after reviewing all planned resource actions.
 | Private-only | `nat_router` plus private control-plane LB on the primary Network. |
 | Secure operator/API access | `node_transport_mode = "tailscale"` with public API/SSH firewall sources closed. |
 | Cloudflare-protected operator/app access | Keep Tailscale or Hetzner private transport underneath; put user-managed Cloudflare Access/Tunnel in front of kube API, SSH, Rancher, or ingress endpoints. |
-| More than 100 Cloud nodes | Tailscale node transport plus external `network_id` shards, one Hetzner Network per 100-node budget. |
+| More than 100 Cloud nodes | Tailscale node transport plus `network_scope = "external"` shards, one Hetzner Network per 100-node budget. |
 | Very large reference | Autoscaler-first Tailscale multinetwork; see the 200-node and 10k-node examples. |
 | Cilium Gateway API | Cilium, `enable_kube_proxy = false`, `cilium_gateway_api_enabled = true`. |
 | Heavy image-pull pressure | `embedded_registry_mirror.enabled = true` on trusted clusters. |
@@ -1154,10 +1154,21 @@ tailscale_node_transport = {
 # Tailscale mode deliberately rejects public world-open API/SSH defaults.
 firewall_kube_api_source = null
 firewall_ssh_source = null
+
+# Every active Tailscale agent/autoscaler nodepool sets network_scope explicitly.
+# Use "primary" when network_id is omitted/null.
+# agent_nodepools = [{
+# name = "agent", server_type = "cx23", location = "nbg1",
+# labels = [], taints = [], count = 2, network_scope = "primary"
+# }]
 ```

-Multinetwork scale-out adds external `network_id` values and requires approved
-node-private routes:
+Multinetwork scale-out adds `network_scope = "external"` nodepools with
+external `network_id` values and requires approved node-private routes. Set
+`network_scope` explicitly in Tailscale mode so
+Terraform can validate primary-vs-external Network intent during `plan`, even
+when a `network_id` comes from an `hcloud_network` resource created in the same
+root.

 ```tf
 node_transport_mode = "tailscale"
@@ -1194,6 +1205,7 @@ agent_nodepools = [
 taints = []
 count = 50
 # network_id omitted/null means the primary kube-hetzner network.
+network_scope = "primary"
 },
 {
 name = "agent-small-b"
@@ -1202,7 +1214,8 @@ agent_nodepools = [
 labels = []
 taints = []
 count = 50
-network_id = 11959154 # existing external private network id
+network_id = 11959154 # existing external private network id
+network_scope = "external"
 },
 ]

@@ -1213,14 +1226,16 @@ autoscaler_nodepools = [
 location = "nbg1"
 min_nodes = 0
 max_nodes = 50
+network_scope = "primary"
 },
 {
 name = "autoscaled-b"
 server_type = "cx23"
 location = "nbg1"
 min_nodes = 0
 max_nodes = 50
-network_id = 11959154
+network_id = 11959154
+network_scope = "external"
 },
 ]
 ```
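
Reviewer note: the README example in this hunk uses a literal ID for an existing external Network. A same-root variant, where the ID is only known after apply, might look like the sketch below. The resource name `shard_b`, the Network name, and the `ip_range` are hypothetical; the nodepool attributes mirror the README example in this diff. Subnet setup and all other required module inputs are omitted, and the registry source shown is an assumption about how the module is consumed.

```tf
# Hypothetical external Hetzner Network created in the same Terraform root.
# Its ID is unknown during plan, which is why network_scope is explicit below.
resource "hcloud_network" "shard_b" {
  name     = "kh-shard-b"
  ip_range = "172.16.0.0/16"
}

module "kube-hetzner" {
  source = "kube-hetzner/kube-hetzner/hcloud"

  # ... other required inputs omitted in this sketch ...

  node_transport_mode = "tailscale"

  agent_nodepools = [
    {
      name        = "agent-primary"
      server_type = "cx23"
      location    = "nbg1"
      labels      = []
      taints      = []
      count       = 2
      # network_id omitted: the pool stays on the primary Network.
      network_scope = "primary"
    },
    {
      name          = "agent-shard-b"
      server_type   = "cx23"
      location      = "nbg1"
      labels        = []
      taints        = []
      count         = 2
      network_id    = hcloud_network.shard_b.id
      network_scope = "external"
    },
  ]
}
```

Per the commit text, declaring `network_scope = "external"` here lets the external-Network constraints (route advertisement, load-balancer targeting) be checked during `terraform plan` instead of surfacing at apply.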
@@ -1245,7 +1260,9 @@ The important constraints are enforced during `terraform plan`:
 - Control planes always stay on the primary kube-hetzner network and no longer
 accept `network_id`.
 - Static agents and autoscaler nodepools may use `network_id` to spread across
-existing Hetzner private Networks.
+existing Hetzner private Networks. In Tailscale mode, every active static
+agent node, agent nodepool, and autoscaler nodepool must set
+`network_scope = "primary"` or `network_scope = "external"`.
 - Control planes are not auto-attached to every external agent Network, avoiding
 Hetzner's 3-Networks-per-server limit.
 - The module can advertise each node's Hetzner private `/32` route through
@@ -1254,8 +1271,8 @@ The important constraints are enforced during `terraform plan`:
 source IP.
 - Single-network clusters may set
 `tailscale_node_transport.routing.advertise_node_private_routes = false` to
-avoid Tailnet route approvals. External `network_id` nodepools require the
-default `true`.
+avoid Tailnet route approvals. Any nodepool with `network_scope = "external"`
+requires the default `true`.
 - For multinetwork clusters, Tailnet ACLs must auto-approve node-private routes
 for the users, groups, or node tags you use, or the cluster will wait for
 manual route approval. Tags are optional in `auth_key` mode, but they are the
