Description
In large Valkey clusters we have observed that killing roughly half of the primaries (while leaving a slim majority of them alive) can still leave one or more shards permanently unrecoverable. In one such scenario, we had a cluster of 2000 nodes. We terminated 499/1000 primaries (leaving 501 primaries and all 1000 replicas running), yet 2 shards never promoted their replicas, leaving the cluster stuck in the fail state even though a quorum of primaries was still alive.
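For reference, a quick sketch of the failure-detection arithmetic for this incident, under the assumption that a node upgrades PFAIL to FAIL once a majority of slot-owning primaries has reported the peer as possibly failing:

```python
# Sketch of the FAIL quorum arithmetic for this incident (assumption:
# a node upgrades PFAIL to FAIL once a majority of slot-owning primaries
# has reported the peer as possibly failing).
total_primaries = 1000
killed_primaries = 499
alive_primaries = total_primaries - killed_primaries   # 501

# Majority of primaries needed to reach the FAIL quorum.
needed_quorum = total_primaries // 2 + 1                # 501

print(f"alive primaries: {alive_primaries}, quorum needed: {needed_quorum}")
print("quorum achievable:", alive_primaries >= needed_quorum)  # True
```

So with 501 primaries alive, the quorum of 501 was exactly reachable, yet two shards never promoted their replicas.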
Logs
Replica notices its primary in PFAIL but never upgrades it to FAIL
4048:S 21 May 2025 19:05:33.545 * NODE c8efab28fe995035d5c5c564321c40665bf08841 () possibly failing.
4048:S 21 May 2025 19:06:01.458 # Cluster state changed: fail
4048:S 21 May 2025 19:06:01.458 # Cluster is currently down: I am part of a minority partition.
Other nodes reach quorum and mark primary FAIL
3379:S 21 May 2025 19:05:33.672 * NODE c8efab28fe995035d5c5c564321c40665bf08841 () possibly failing.
3379:S 21 May 2025 19:06:09.358 * Marking node c8efab28fe995035d5c5c564321c40665bf08841 () as failing (quorum reached).
Another primary knows that the shard-primary has failed
3423:S 21 May 2025 19:05:33.567 * NODE c8efab28fe995035d5c5c564321c40665bf08841 () possibly failing.
3423:S 21 May 2025 19:06:10.335 * FAIL message received from dc18b60439f3a737c829501e22a5c6c8e31312d3 () about c8efab28fe995035d5c5c564321c40665bf08841 ()
Cluster node list (from the point of view of another primary) during the failure
c8efab28fe995035d5c5c564321c40665bf08841 10.0.28.216:6379@16379 master,fail - 1747854313519 1747854306000 2426 disconnected 3876-3892
e75567640b70a77b1ffc3e1e4a7b795b02383d61 10.0.25.1:6379@16379 slave c8efab28fe995035d5c5c564321c40665bf08841 0 1747862948000 2426 connected
Cluster node list from the point of view of the replica of the failed primary
c8efab28fe995035d5c5c564321c40665bf08841 10.0.28.216:6379@16379 master,fail? - 1747854318517 1747854311000 2426 disconnected 3876-3892
Since the replica still sees its primary in PFAIL rather than FAIL, it keeps trying to reconnect to its failed primary indefinitely and never initiates a failover, so the cluster never recovers.
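To confirm the divergence, we compared how different nodes reported the failed primary. A minimal sketch of that check, assuming redis-py (which also speaks to Valkey); the host list below is a placeholder:

```python
# Hypothetical diagnostic: ask several nodes how they see the failed primary.
# In CLUSTER NODES output, the "fail?" flag means PFAIL and "fail" means FAIL.
import redis

FAILED_NODE_ID = "c8efab28fe995035d5c5c564321c40665bf08841"
NODES_TO_ASK = ["10.0.25.1", "10.0.28.10"]  # placeholder addresses

for host in NODES_TO_ASK:
    client = redis.Redis(host=host, port=6379)
    nodes_output = client.execute_command("CLUSTER", "NODES").decode()
    for line in nodes_output.splitlines():
        if line.startswith(FAILED_NODE_ID):
            flags = line.split()[2]
            if "fail?" in flags:
                state = "PFAIL"
            elif "fail" in flags:
                state = "FAIL"
            else:
                state = "ok"
            print(f"{host} sees {FAILED_NODE_ID[:8]} as {state} ({flags})")
```

In our case the stuck replica kept reporting `fail?` for its primary while the rest of the cluster reported `fail`.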
Expected behavior
The replicas of the two affected shards should have been promoted and the cluster should have recovered.
Setup
- Number of Nodes: 2000
- Instance Type: r7g.xlarge
- Sharding Strategy: 1 replica per shard
- Engine Version: Valkey 8.1.0
- Cluster node timeout: 15 seconds
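For completeness, a rough sketch of how roughly half of the primaries can be taken down to reproduce the scenario. This assumes redis-py; in our test the instances were terminated at the host level, and SHUTDOWN NOSAVE below is only one way to approximate that:

```python
# Hypothetical reproduction sketch: shut down just under half of the primaries,
# leaving a slim majority alive, then watch whether all shards fail over.
import redis

SEED_HOST = "10.0.28.216"  # placeholder: any reachable cluster node
seed = redis.Redis(host=SEED_HOST, port=6379)

# Collect primary addresses from CLUSTER NODES (flag "master", not yet failed).
primaries = []
for line in seed.execute_command("CLUSTER", "NODES").decode().splitlines():
    fields = line.split()
    if "master" in fields[2] and "fail" not in fields[2]:
        host = fields[1].split("@")[0].rsplit(":", 1)[0]
        primaries.append(host)

# Terminate just under half of them (499 out of 1000 in our incident).
for host in primaries[: len(primaries) // 2 - 1]:
    try:
        redis.Redis(host=host, port=6379).execute_command("SHUTDOWN", "NOSAVE")
    except redis.exceptions.ConnectionError:
        pass  # SHUTDOWN closes the connection, so this error is expected
```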