@jdheyburn

When dual-channel-replication is enabled and replica-announce-ip is set, the RDB/AOF channel does not announce itself at that address. Instead, the primary sees it at the IP address behind the NAT, which in our case is the Kubernetes Pod IP.

This means that if Sentinel is polling the primary for connected replicas, it will first see the ephemeral pod IP, then revert to the announce-ip - leaving behind the pod IP as a down replica.

This PR configures the RDB/AOF channel to also announce itself at the announce-ip to prevent the stale replica.
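
The failure mode can be illustrated with a small sketch. This is not Valkey's code: `replica_table`, `effective_addr`, and `register_replica` are hypothetical names, and the model assumes only what is described above, namely that the primary lists a connection under its announced IP when one was sent and under the socket's peer address otherwise.

```c
#include <string.h>

#define MAX_REPLICAS 8
#define ADDR_LEN 64

/* Hypothetical model of the replica list a primary exposes (and Sentinel polls). */
typedef struct {
    char addrs[MAX_REPLICAS][ADDR_LEN];
    int count;
} replica_table;

/* Address a connection is listed under: the announced IP if the replica
 * sent one, otherwise the peer address of the socket (assumption). */
static const char *effective_addr(const char *peer_ip, const char *announce_ip) {
    return announce_ip ? announce_ip : peer_ip;
}

/* Record an address if it has not been seen yet.
 * Returns the number of distinct addresses in the table. */
static int register_replica(replica_table *t, const char *peer_ip, const char *announce_ip) {
    const char *addr = effective_addr(peer_ip, announce_ip);
    for (int i = 0; i < t->count; i++) {
        if (strcmp(t->addrs[i], addr) == 0) return t->count;
    }
    if (t->count < MAX_REPLICAS) {
        strncpy(t->addrs[t->count], addr, ADDR_LEN - 1);
        t->addrs[t->count][ADDR_LEN - 1] = '\0';
        t->count++;
    }
    return t->count;
}
```

In this model, registering one channel without an announce IP and the other with it leaves two distinct addresses for the same replica, which corresponds to the stale entry Sentinel keeps. Registering both channels with the announce IP leaves a single address.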

Testing

I considered writing unit tests for this, but I am not sure how we can test with an IP address other than localhost (127.0.0.1) in a way that would fail without the fix. I did test on Kubernetes against the 9.0 tag and verified the fix there too.

Status quo

On 9.0 image tag:

$ kubectl get pods -n valkey-baseline -o custom-columns=NAME:.metadata.name,POD-IP:.status.podIP
NAME                              POD-IP
valkey-primary-5bd78c8566-llb6k   10.244.0.25
valkey-replica-0                  10.244.0.17
valkey-replica-1                  10.244.0.13

$ kubectl get services -n valkey-baseline -o custom-columns=NAME:.metadata.name,CLUSTER-IP:.spec.clusterIP
NAME               CLUSTER-IP
valkey-primary     10.96.147.28
valkey-replica-0   10.96.66.233
valkey-replica-1   10.96.57.230

Logs below show that pod IP for valkey-primary-5bd78c8566-llb6k 10.244.0.25:6379 is being used for dual-channel replication. This should be its cluster IP 10.96.147.28 as this is what is set in replica-announce-ip.

1:M 14 Nov 2025 17:57:51.750 * Replica 10.96.147.28:6379 asks for synchronization
1:M 14 Nov 2025 17:57:51.751 * Replica 10.244.0.25:6379 asks for synchronization
1:M 14 Nov 2025 17:57:56.135 * Dual channel replication: Sending to replica 10.244.0.25:6379 RDB end offset 1763269 and client-id 35
1:M 14 Nov 2025 17:57:56.140 * Replica 10.96.147.28:6379 asks for synchronization

This fix

$ kubectl get pods -n valkey-test -o custom-columns=NAME:.metadata.name,POD-IP:.status.podIP
NAME                              POD-IP
valkey-primary-594c9597b5-qqvdk   10.244.0.26
valkey-replica-0                  10.244.0.10
valkey-replica-1                  10.244.0.18

$ kubectl get services -n valkey-test -o custom-columns=NAME:.metadata.name,CLUSTER-IP:.spec.clusterIP
NAME               CLUSTER-IP
valkey-primary     10.96.125.142
valkey-replica     None
valkey-replica-0   10.96.155.74
valkey-replica-1   10.96.64.111
valkey-sentinel    None

Logs show that the Cluster IP is now being used for dual-channel replication.

1:M 14 Nov 2025 17:57:49.923 * Replica 10.96.125.142:6379 asks for synchronization
1:M 14 Nov 2025 17:57:49.924 * Replica 10.96.125.142:6379 asks for synchronization
1:M 14 Nov 2025 17:57:54.913 * Dual channel replication: Sending to replica 10.96.125.142:6379 RDB end offset 1771247 and client-id 36
1:M 14 Nov 2025 17:57:54.916 * Replica 10.96.125.142:6379 asks for synchronization

Fixes #2338

@jdheyburn changed the title from "Fix stale sentinel replicas when dual-channel-replication is enabled" to "Dual-channel-replication announces itself at replica-announce-ip if configured" on Nov 14, 2025
@ranshid ranshid self-requested a review November 18, 2025 16:17
Member

@ranshid ranshid left a comment

Overall the fix LGTM

Can we please add a tcl test for it?

return C_ERR;
}

if (server.replica_announce_ip) {
Member

Maybe we can ALWAYS include the ip-address in the first REPLCONF and thus avoid having to explicitly handle errors for a second REPLCONF?

Author

I wanted to be consistent with the current pattern for how replication handles replica_announce_ip. I am unsure how the REPLCONF command would handle a null ip-address.

  • valkey/src/replication.c

    Lines 3725 to 3731 in e19ceb7

    /* Set the replica ip, so that primary's INFO command can list the
     * replica IP address port correctly in case of port forwarding or NAT.
     * Skip REPLCONF ip-address if there is no replica-announce-ip option set. */
    if (server.replica_announce_ip) {
        err = sendCommand(conn, "REPLCONF", "ip-address", server.replica_announce_ip, NULL);
        if (err) goto err;
    }

Member

Yes. I meant that we could pass the replica IP address even when the parameter is not configured, but it is not that critical TBH.
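
For illustration only, the idea of carrying ip-address in a single REPLCONF can be sketched as below. `build_replconf` is a hypothetical helper, not a function in replication.c, and the use of `listening-port` as the accompanying option is an assumption made for the example. The sketch keeps the existing pattern of sending ip-address only when replica-announce-ip is configured.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: fills argv with the words of one REPLCONF command.
 * "ip-address <ip>" is appended only when announce_ip is non-NULL, matching
 * the existing pattern of skipping it when replica-announce-ip is unset.
 * Returns the number of words written, or 0 if argv is too small. */
static size_t build_replconf(const char **argv, size_t cap,
                             const char *port, const char *announce_ip) {
    size_t n = 0;
    if (cap < 5) return 0;
    argv[n++] = "REPLCONF";
    argv[n++] = "listening-port"; /* assumed companion option for the example */
    argv[n++] = port;
    if (announce_ip) {
        argv[n++] = "ip-address";
        argv[n++] = announce_ip;
    }
    return n;
}
```

With one command there is one reply to check, which is the reduction in error handling the suggestion is after. Whether the real handshake can be folded this way depends on code outside this thread.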

@jdheyburn
Author

Can we please add a tcl test for it?

@ranshid I am not sure how to test this accurately in a Tcl test. The replica-announce-ip set during the test would have to be a local IP address such as 127.0.0.1, which is the IP address that would be used anyway. I had a Tcl test case before, but it still passed after I removed the code I added.

Is there another means of testing? This is why I put the emphasis on the test I added in the description.

@ranshid
Member

ranshid commented Nov 20, 2025

Can we please add a tcl test for it?

@ranshid I am not sure how to test this accurately in a Tcl test. The replica-announce-ip set during the test would have to be a local IP address such as 127.0.0.1, which is the IP address that would be used anyway. I had a Tcl test case before, but it still passed after I removed the code I added.

Is there another means of testing? This is why I put the emphasis on the test I added in the description.

I was thinking of setting the config to something like
replica-announce-ip 5.5.5.5

and then delaying the full sync (IIRC you can use config set repl-diskless-sync-delay).

During that time the primary's INFO output should indicate that the 'rdb-channel' has IP address 5.5.5.5.
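
The check described here could look roughly like the sketch below. It assumes the primary's INFO replication output lists each connection as a line of comma-separated key=value pairs containing `ip` and `type` fields; that line format, and the helper names `info_field` and `rdb_channel_announced_at`, are assumptions for illustration and are not taken from the Valkey test suite.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copy the value of `key` from a line shaped like "name:k1=v1,k2=v2" into out.
 * Returns false if the key is not present. The line format is an assumption. */
static bool info_field(const char *line, const char *key, char *out, size_t outlen) {
    const char *p = strchr(line, ':');
    size_t klen = strlen(key);
    if (outlen == 0) return false;
    p = p ? p + 1 : line; /* skip the "name:" prefix if there is one */
    while (*p) {
        const char *end = strchr(p, ',');
        size_t flen = end ? (size_t)(end - p) : strlen(p);
        if (flen > klen && strncmp(p, key, klen) == 0 && p[klen] == '=') {
            size_t vlen = flen - klen - 1;
            if (vlen >= outlen) vlen = outlen - 1; /* truncate to fit out */
            memcpy(out, p + klen + 1, vlen);
            out[vlen] = '\0';
            return true;
        }
        if (!end) break;
        p = end + 1;
    }
    return false;
}

/* True if the line describes an rdb-channel connection listed at expected_ip. */
static bool rdb_channel_announced_at(const char *line, const char *expected_ip) {
    char type[32], ip[64];
    return info_field(line, "type", type, sizeof type) &&
           strcmp(type, "rdb-channel") == 0 &&
           info_field(line, "ip", ip, sizeof ip) &&
           strcmp(ip, expected_ip) == 0;
}
```

Because 5.5.5.5 is not an address the test host would use on its own, a check like this would fail without the fix, which addresses the concern that a 127.0.0.1 test passes either way.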

Successfully merging this pull request may close these issues.

[BUG] dual-channel-replication-enabled causes "duplicate" replica on Sentinel

2 participants