Description
In large Valkey clusters we have observed that killing roughly half of the primaries (while leaving a slim majority of them alive) can still leave one or more shards permanently unrecoverable. In one such scenario, we had a cluster of 2000 nodes. We terminated 499/1000 primaries (leaving 501 primaries and all 1000 replicas running), yet 2 shards never promoted their replicas, leaving the cluster stuck in the fail state even though a quorum of primaries was still alive.
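For reference, a quick sketch of the failure-detection arithmetic for this incident, under the assumption that a node upgrades PFAIL to FAIL once a majority of slot-owning primaries has reported the peer as possibly failing:

```python
# Sketch of the FAIL quorum arithmetic for this incident (assumption:
# a node upgrades PFAIL to FAIL once a majority of slot-owning primaries
# has reported the peer as possibly failing).
total_primaries = 1000
killed_primaries = 499
alive_primaries = total_primaries - killed_primaries   # 501

# Majority of primaries needed to reach the FAIL quorum.
needed_quorum = total_primaries // 2 + 1                # 501

print(f"alive primaries: {alive_primaries}, quorum needed: {needed_quorum}")
print("quorum achievable:", alive_primaries >= needed_quorum)  # True
```

So with 501 primaries alive, the quorum of 501 was exactly reachable, yet two shards never promoted their replicas.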
Logs
Replica notices its primary in PFAIL but never upgrades it to FAIL
4048:S 21 May 2025 19:05:33.545 * NODE c8efab28fe995035d5c5c564321c40665bf08841 () possibly failing.
4048:S 21 May 2025 19:06:01.458 # Cluster state changed: fail
4048:S 21 May 2025 19:06:01.458 # Cluster is currently down: I am part of a minority partition.
Other nodes reach quorum and mark primary FAIL
3379:S 21 May 2025 19:05:33.672 * NODE c8efab28fe995035d5c5c564321c40665bf08841 () possibly failing.
3379:S 21 May 2025 19:06:09.358 * Marking node c8efab28fe995035d5c5c564321c40665bf08841 () as failing (quorum reached).
Another primary knows that the shard-primary has failed
3423:S 21 May 2025 19:05:33.567 * NODE c8efab28fe995035d5c5c564321c40665bf08841 () possibly failing.
3423:S 21 May 2025 19:06:10.335 * FAIL message received from dc18b60439f3a737c829501e22a5c6c8e31312d3 () about c8efab28fe995035d5c5c564321c40665bf08841 ()
Cluster node list (from the point of view of another primary) during the failure
c8efab28fe995035d5c5c564321c40665bf08841 10.0.28.216:6379@16379 master,fail - 1747854313519 1747854306000 2426 disconnected 3876-3892
e75567640b70a77b1ffc3e1e4a7b795b02383d61 10.0.25.1:6379@16379 slave c8efab28fe995035d5c5c564321c40665bf08841 0 1747862948000 2426 connected
Cluster node list from the point of view of the replica of the failed primary
c8efab28fe995035d5c5c564321c40665bf08841 10.0.28.216:6379@16379 master,fail? - 1747854318517 1747854311000 2426 disconnected 3876-3892
Since the replica still sees its primary in PFAIL rather than FAIL, it keeps trying to reconnect to its failed primary indefinitely and never initiates a failover, so the cluster never recovers.
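To confirm the divergence, we compared how different nodes reported the failed primary. A minimal sketch of that check, assuming redis-py (which also speaks to Valkey); the host list below is a placeholder:

```python
# Hypothetical diagnostic: ask several nodes how they see the failed primary.
# In CLUSTER NODES output, the "fail?" flag means PFAIL and "fail" means FAIL.
import redis

FAILED_NODE_ID = "c8efab28fe995035d5c5c564321c40665bf08841"
NODES_TO_ASK = ["10.0.25.1", "10.0.28.10"]  # placeholder addresses

for host in NODES_TO_ASK:
    client = redis.Redis(host=host, port=6379)
    nodes_output = client.execute_command("CLUSTER", "NODES").decode()
    for line in nodes_output.splitlines():
        if line.startswith(FAILED_NODE_ID):
            flags = line.split()[2]
            if "fail?" in flags:
                state = "PFAIL"
            elif "fail" in flags:
                state = "FAIL"
            else:
                state = "ok"
            print(f"{host} sees {FAILED_NODE_ID[:8]} as {state} ({flags})")
```

In our case the stuck replica kept reporting `fail?` for its primary while the rest of the cluster reported `fail`.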
Expected behavior
The replicas of the two affected shards should have been promoted and the cluster should have recovered.
Setup
- Number of Nodes: 2000
- Instance Type: r7g.xlarge
- Sharding Strategy: 1 replica per shard
- Engine Version: Valkey 8.1.0
- Cluster node timeout: 15 seconds
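For completeness, a rough sketch of how roughly half of the primaries can be taken down to reproduce the scenario. This assumes redis-py; in our test the instances were terminated at the host level, and SHUTDOWN NOSAVE below is only one way to approximate that:

```python
# Hypothetical reproduction sketch: shut down just under half of the primaries,
# leaving a slim majority alive, then watch whether all shards fail over.
import redis

SEED_HOST = "10.0.28.216"  # placeholder: any reachable cluster node
seed = redis.Redis(host=SEED_HOST, port=6379)

# Collect primary addresses from CLUSTER NODES (flag "master", not yet failed).
primaries = []
for line in seed.execute_command("CLUSTER", "NODES").decode().splitlines():
    fields = line.split()
    if "master" in fields[2] and "fail" not in fields[2]:
        host = fields[1].split("@")[0].rsplit(":", 1)[0]
        primaries.append(host)

# Terminate just under half of them (499 out of 1000 in our incident).
for host in primaries[: len(primaries) // 2 - 1]:
    try:
        redis.Redis(host=host, port=6379).execute_command("SHUTDOWN", "NOSAVE")
    except redis.exceptions.ConnectionError:
        pass  # SHUTDOWN closes the connection, so this error is expected
```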