Consul clients cache Kubernetes Service IP addresses and don't retry DNS resolution on connection failures (new IPs) #4657

@ChrisNoSim

Description

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

Consul clients resolve the DNS names in the retry_join configuration only once during startup and cache the resulting IP addresses indefinitely. When Consul servers get new IP addresses (common in Kubernetes environments during node upgrades, pod restarts, and similar events), even newly restarted client pods keep attempting connections to the stale cached IP addresses and never retry DNS resolution, causing permanent connection failures.

This is a critical problem in cloud-native Kubernetes deployments, where pod IP changes are routine. When servers restart and receive new IPs, client pods that restart at the same time (or shortly afterwards) may still resolve stale DNS entries during startup and then get stuck in their init containers indefinitely, because the agent never re-resolves the names on retry, even though a fresh DNS lookup from the same pod already returns the new addresses.

Root Cause: Consul's retry_join mechanism performs DNS resolution only at startup and caches the resolved IPs permanently, without re-resolving on connection failures. This affects both long-running clients and newly started clients that may still pick up stale entries from intermediate DNS caching layers.
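
For illustration only, here is a minimal Go sketch of the resolve-once-and-cache pattern described above. This is not Consul's actual code; the hostnames and the Serf LAN port 8301 are simply taken from this report. The names are looked up a single time, and every retry dials the cached addresses:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Hostnames from the retry_join configuration in this report.
	retryJoin := []string{
		"consul-server-0.consul.example.com.",
		"consul-server-1.consul.example.com.",
		"consul-server-2.consul.example.com.",
	}

	// DNS is resolved once at startup and the answers are kept around.
	cached := map[string][]net.IP{}
	for _, host := range retryJoin {
		ips, err := net.LookupIP(host)
		if err != nil {
			fmt.Printf("initial lookup of %s failed: %v\n", host, err)
			continue
		}
		cached[host] = ips
	}

	// Every retry dials the cached IPs; the names are never resolved again,
	// so servers that came back with new pod IPs are never found.
	for attempt := 1; attempt <= 3; attempt++ {
		for host, ips := range cached {
			for _, ip := range ips {
				addr := net.JoinHostPort(ip.String(), "8301") // Serf LAN port
				conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
				if err != nil {
					fmt.Printf("attempt %d: %s (%s): %v\n", attempt, host, addr, err)
					continue
				}
				conn.Close()
			}
		}
		time.Sleep(time.Second)
	}
}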

Reproduction Steps

  1. Deploy a Consul cluster using the official Helm chart with the following configuration:
global:
  name: consul
  datacenter: dc1
  domain: consul.example.com
  enabled: true
  logLevel: debug

server:
  enabled: true
  replicas: 3
  bootstrapExpect: 3
  extraConfig: |
    {
      "retry_join": [
        "consul-server-0.consul.example.com.",
        "consul-server-1.consul.example.com.",
        "consul-server-2.consul.example.com."
      ]
    }

client:
  enabled: true
  2. Observe initial startup: clients successfully resolve DNS and join the cluster:

    • consul-server-0.consul.example.com. → 10.1.2.10
    • consul-server-1.consul.example.com. → 10.1.2.11
    • consul-server-2.consul.example.com. → 10.1.2.12
  3. Trigger a server StatefulSet restart (simulating a node upgrade or maintenance):

kubectl rollout restart statefulset/consul-server -n consul
  4. New server pods get different IP addresses:

    • consul-server-0 → 10.1.5.20
    • consul-server-1 → 10.1.5.21
    • consul-server-2 → 10.1.5.22
  5. Restart the client pods (simulating a simultaneous restart during maintenance):

kubectl delete pods -l app=consul,component=client -n consul
  6. Verify that DNS resolution works correctly from the new client pods:
kubectl exec -it consul-client-xyz -n consul -- nslookup consul-server-0.consul.example.com.
# Returns: 10.1.5.20 (new correct IP)
  7. Issue: even freshly restarted client pods keep attempting connections to the old cached IP addresses (10.1.2.10, 10.1.2.11, 10.1.2.12) instead of the newly resolved IPs.

  8. Client pods remain stuck in their init containers indefinitely and never perform fresh DNS resolution during retry attempts.

Logs

Freshly Restarted Client Logs During Issue
[DEBUG] agent: Starting Consul agent (fresh restart)
[INFO]  agent: Consul agent running!
[DEBUG] agent: Retry join is supported for the following discovery methods: cluster_addr, aliyun, aws, azure, digitalocean, gce, k8s, linode, mdns, os, scaleway, triton, vsphere
[INFO]  agent: Joining cluster...
[DEBUG] agent: (LAN) joining: [consul-server-0.consul.example.com.:8301 consul-server-1.consul.example.com.:8301 consul-server-2.consul.example.com.:8301]

# Initial DNS resolution during startup - still resolving to OLD IPs
[DEBUG] agent: Resolved consul-server-0.consul.example.com.:8301 to 10.1.2.10:8301
[DEBUG] agent: Resolved consul-server-1.consul.example.com.:8301 to 10.1.2.11:8301
[DEBUG] agent: Resolved consul-server-2.consul.example.com.:8301 to 10.1.2.12:8301

# Connection attempts to cached OLD IPs (servers no longer exist at these addresses)
[ERROR] agent: failed to join: error="dial tcp 10.1.2.10:8301: connect: connection refused" address=10.1.2.10:8301
[ERROR] agent: failed to join: error="dial tcp 10.1.2.11:8301: connect: connection refused" address=10.1.2.11:8301
[ERROR] agent: failed to join: error="dial tcp 10.1.2.12:8301: connect: connection refused" address=10.1.2.12:8301
[WARN]  agent: Join failed: error="3 errors occurred:\n\t* dial tcp 10.1.2.10:8301: connection refused\n\t* dial tcp 10.1.2.11:8301: connection refused\n\t* dial tcp 10.1.2.12:8301: connection refused"

# Retry attempts - NO DNS re-resolution, continues with same cached IPs
[DEBUG] agent: (LAN) joining: [consul-server-0.consul.example.com.:8301 consul-server-1.consul.example.com.:8301 consul-server-2.consul.example.com.:8301]
[ERROR] agent: failed to join: error="dial tcp 10.1.2.10:8301: connect: connection refused" address=10.1.2.10:8301
[ERROR] agent: failed to join: error="dial tcp 10.1.2.11:8301: connect: connection refused" address=10.1.2.11:8301
[ERROR] agent: failed to join: error="dial tcp 10.1.2.12:8301: connect: connection refused" address=10.1.2.12:8301

# Pattern repeats indefinitely - DNS names shown in logs but IPs never re-resolved

DNS Verification Shows Correct Resolution Available
# Manual DNS lookup from same freshly restarted client pod shows correct new IPs:
$ kubectl exec -it consul-client-xyz -n consul -- nslookup consul-server-0.consul.example.com.
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      consul-server-0.consul.example.com.
Address 1: 10.1.5.20 consul-server-0.consul-server.consul.svc.cluster.local

$ kubectl exec -it consul-client-xyz -n consul -- nslookup consul-server-1.consul.example.com.
Name:      consul-server-1.consul.example.com.
Address 1: 10.1.5.21 consul-server-1.consul-server.consul.svc.cluster.local

$ kubectl exec -it consul-client-xyz -n consul -- nslookup consul-server-2.consul.example.com.
Name:      consul-server-2.consul.example.com.
Address 1: 10.1.5.22 consul-server-2.consul-server.consul.svc.cluster.local

# But client continues trying old cached IPs 10.1.2.x
# DNS resolution works perfectly - the issue is Consul's caching behavior

Expected behavior

  • Consul clients should retry DNS resolution when connection attempts fail (see the sketch after this list)
  • When retry_join contains DNS names, DNS lookups should be performed on each retry attempt or periodically
  • Clients should automatically discover new server IP addresses without manual intervention
  • Init containers should eventually succeed when servers become available at new IPs
  • DNS TTL settings should be respected for re-resolution timing
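
As a rough sketch of the first and third expectations above (fresh DNS resolution on every retry attempt), here is a minimal Go example that assumes nothing about Consul's internals; the hostnames and port 8301 come from this report, and the 30-second pause mirrors Consul's documented retry_interval default:

package main

import (
	"fmt"
	"net"
	"time"
)

// tryJoin performs a fresh DNS lookup for every hostname on every attempt
// instead of reusing a cached result, then dials the resolved addresses.
func tryJoin(hosts []string) bool {
	joined := false
	for _, host := range hosts {
		ips, err := net.LookupIP(host)
		if err != nil {
			fmt.Printf("lookup %s: %v\n", host, err)
			continue
		}
		for _, ip := range ips {
			addr := net.JoinHostPort(ip.String(), "8301")
			conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
			if err != nil {
				fmt.Printf("dial %s (%s): %v\n", host, addr, err)
				continue
			}
			conn.Close()
			joined = true
		}
	}
	return joined
}

func main() {
	retryJoin := []string{
		"consul-server-0.consul.example.com.",
		"consul-server-1.consul.example.com.",
		"consul-server-2.consul.example.com.",
	}
	// Because DNS is re-resolved each time, the loop converges as soon as the
	// servers become reachable at their new pod IPs.
	for !tryJoin(retryJoin) {
		time.Sleep(30 * time.Second) // comparable to Consul's retry_interval default
	}
	fmt.Println("joined")
}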

Environment details

  • consul-k8s version: 1.5.5 (Chart version 1.5.5)
  • Kubernetes version: v1.28.x
  • Cloud Provider: Google Kubernetes Engine (GKE)
  • Networking CNI plugin: Default GKE networking

Complete values.yaml:

client:
  enabled: true
connectInject:
  enabled: false
dns:
  enabled: true
global:
  acls:
    bootstrapToken:
      secretKey: token
      secretName: consul-bootstrap
    manageSystemACLs: false
  datacenter: dc1
  domain: consul.example.com
  enabled: true
  gossipEncryption:
    autoGenerate: false
    secretKey: key
    secretName: consul-gossip
  logLevel: debug
  metrics:
    agentMetricsRetentionTime: 1h
    enableAgentMetrics: true
    enabled: true
  name: consul
  tls:
    caCert:
      secretKey: caCert
      secretName: consul-federation
    caKey:
      secretKey: caKey
      secretName: consul-federation
    enableAutoEncrypt: true
    enabled: true
    httpsOnly: false
    serverAdditionalDNSSANs:
    - consul.example.com.
    - consul-ui.example.com.
    - consul-server-join.example.com.
    - server.dc1.consul.example.com
    serverAdditionalIPSANs: []
    verify: true
server:
  bootstrapExpect: 3
  connect: false
  enabled: true
  extraConfig: |
    {
      "retry_join": [
        "consul-server-0.consul.example.com.",
        "consul-server-1.consul.example.com.",
        "consul-server-2.consul.example.com."
      ],
      "limits": {
        "http_max_conns_per_client": 1000,
        "rpc_max_conns_per_client": 1000
      }
    }
  replicas: 3
  resources:
    limits:
      cpu: 1000m
      memory: 4Gi
    requests:
      cpu: 1000m
      memory: 4Gi
  service:
    additionalSpec: |
      publishNotReadyAddresses: true
    annotations: |
      "external-dns.alpha.kubernetes.io/dns-zone": "internal"
      "external-dns.alpha.kubernetes.io/hostname": "consul.example.com."
      "external-dns.alpha.kubernetes.io/ttl": "60"
ui:
  enabled: true
  ingress:
    annotations: |
      'cert-manager.io/cluster-issuer': 'letsencrypt'
      'external-dns.alpha.kubernetes.io/dns-zone': 'external'
      'external-dns.alpha.kubernetes.io/hostname': 'consul-ui.example.com'
      'kubernetes.io/ingress.class': 'gce'
      'networking.gke.io/v1beta1.FrontendConfig': 'frontend-config-consul'
      'kubernetes.io/ingress.allow-http': 'false'
    enabled: true
    hosts:
    - host: consul-ui.example.com
      paths:
      - /
      - /*
    pathType: ImplementationSpecific
    tls:
    - hosts:
      - consul-ui.example.com
      secretName: consul-ingress-cert
  metrics:
    enabled: false
  service:
    annotations: |
      'beta.cloud.google.com/backend-config': '{"default":"consul-backend-config"}'
      'cloud.google.com/app-protocols': '{"https":"HTTPS", "http":"HTTP"}'
      'cloud.google.com/neg': '{"ingress":true}'
    type: ClusterIP

Additional Context

Impact on Production Operations:
This issue severely impacts routine Kubernetes operations:

  • GKE Node Upgrades - Automatic maintenance causes pod IP changes
  • Pod Evictions - Resource pressure or node draining
  • Rolling Updates - Server pod updates get new IPs
  • Cluster Autoscaling - Node scaling operations
  • StatefulSet Restarts - Maintenance operations requiring server restarts

Workarounds Attempted:

  • Using IP addresses instead of DNS names - defeats the purpose of service discovery
  • Manually restarting client pods - a temporary fix, but not sustainable in production

Suggested Solutions:

  1. Implement periodic DNS re-resolution in retry_join logic
  2. Add configuration option to control DNS cache TTL behavior
  3. Retry DNS resolution on connection failures
  4. Respect Kubernetes DNS TTL settings for service discovery (see the TTL sketch after this list)
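
For suggestions 2 and 4, here is a rough sketch of what TTL-aware re-resolution could look like. This is not an existing Consul feature; it uses the third-party github.com/miekg/dns library (an assumption for illustration), and the resolver address 10.96.0.10 is simply the kube-dns address from the nslookup output above:

package main

import (
	"fmt"
	"time"

	"github.com/miekg/dns"
)

// resolveWithTTL queries the A records for host and returns the addresses
// together with the smallest TTL the DNS server advertised for them.
func resolveWithTTL(resolver, host string) ([]string, time.Duration, error) {
	c := new(dns.Client)
	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn(host), dns.TypeA)

	r, _, err := c.Exchange(m, resolver)
	if err != nil {
		return nil, 0, err
	}

	var ips []string
	minTTL := time.Duration(0)
	for _, ans := range r.Answer {
		if a, ok := ans.(*dns.A); ok {
			ips = append(ips, a.A.String())
			ttl := time.Duration(a.Hdr.Ttl) * time.Second
			if minTTL == 0 || ttl < minTTL {
				minTTL = ttl
			}
		}
	}
	return ips, minTTL, nil
}

func main() {
	// kube-dns address taken from the nslookup output in this report.
	const resolver = "10.96.0.10:53"
	host := "consul-server-0.consul.example.com."

	for {
		ips, ttl, err := resolveWithTTL(resolver, host)
		if err != nil || len(ips) == 0 {
			fmt.Printf("lookup %s failed (%v), retrying shortly\n", host, err)
			time.Sleep(5 * time.Second)
			continue
		}
		fmt.Printf("%s -> %v (re-resolving after TTL %s)\n", host, ips, ttl)
		// Schedule the next lookup once the advertised TTL expires instead of
		// caching the answer forever.
		time.Sleep(ttl)
	}
}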

This issue makes Consul unsuitable for production Kubernetes environments where pod IP changes are normal operations.
