
Cluster fails to recover when ~50 % of primaries are killed (majority still alive) #2181

@sarthakaggarwal97

Description


In large Valkey clusters we have observed that killing roughly half of the primaries (while leaving a slim majority of them alive) can still leave one or more shards permanently unrecoverable. In this scenario the cluster had 2000 nodes. We terminated 499 of the 1000 primaries (leaving 501 primaries and all 1000 replicas running), yet 2 shards never promoted their replicas, leaving the cluster stuck in the fail state even though a majority of primaries remained available to reach quorum.
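For reference, a rough sketch of the termination step we used, not an exact reproduction script. It assumes valkey-cli is on the PATH, a reachable seed node (the host below is hypothetical), and that hard-killing nodes with SHUTDOWN NOSAVE is acceptable for the test:

```python
#!/usr/bin/env python3
"""Rough reproduction sketch: terminate just under half of the primaries."""
import random
import subprocess

SEED_HOST, SEED_PORT = "10.0.0.1", 6379   # hypothetical seed node

def cluster_nodes(host, port):
    out = subprocess.run(
        ["valkey-cli", "-h", host, "-p", str(port), "cluster", "nodes"],
        capture_output=True, text=True, check=True).stdout
    return [line.split() for line in out.splitlines() if line]

# Collect healthy primaries (flags field contains "master" and no fail marker).
primaries = [f for f in cluster_nodes(SEED_HOST, SEED_PORT)
             if "master" in f[2].split(",") and "fail" not in f[2]]

# Kill just under half, leaving a slim majority alive (499 of 1000 in our test).
victims = random.sample(primaries, len(primaries) // 2 - 1)
for fields in victims:
    host, port = fields[1].split("@")[0].rsplit(":", 1)
    subprocess.run(["valkey-cli", "-h", host, "-p", port, "shutdown", "nosave"],
                   check=False)  # the connection drops as the node exits
print(f"terminated {len(victims)} of {len(primaries)} primaries")
```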

Logs

Replica marks the primary as PFAIL but never upgrades it to FAIL

4048:S 21 May 2025 19:05:33.545 * NODE c8efab28fe995035d5c5c564321c40665bf08841 () possibly failing.
4048:S 21 May 2025 19:06:01.458 # Cluster state changed: fail
4048:S 21 May 2025 19:06:01.458 # Cluster is currently down: I am part of a minority partition.

Other nodes reach quorum and mark primary FAIL

3379:S 21 May 2025 19:05:33.672 * NODE c8efab28fe995035d5c5c564321c40665bf08841 () possibly failing.
3379:S 21 May 2025 19:06:09.358 * Marking node c8efab28fe995035d5c5c564321c40665bf08841 () as failing (quorum reached).

Another primary learns that the shard's primary has failed

3423:S 21 May 2025 19:05:33.567 * NODE c8efab28fe995035d5c5c564321c40665bf08841 () possibly failing.
3423:S 21 May 2025 19:06:10.335 * FAIL message received from dc18b60439f3a737c829501e22a5c6c8e31312d3 () about c8efab28fe995035d5c5c564321c40665bf08841 ()

Cluster node list (from the point of view of another primary) during the failure

c8efab28fe995035d5c5c564321c40665bf08841 10.0.28.216:6379@16379 master,fail - 1747854313519 1747854306000 2426 disconnected 3876-3892
e75567640b70a77b1ffc3e1e4a7b795b02383d61 10.0.25.1:6379@16379 slave c8efab28fe995035d5c5c564321c40665bf08841 0 1747862948000 2426 connected

Cluster node list (from the point of view of the failed primary's replica) during the failure

c8efab28fe995035d5c5c564321c40665bf08841 10.0.28.216:6379@16379 master,fail? - 1747854318517 1747854311000 2426 disconnected 3876-3892

Since the replica still considers its primary to be in PFAIL rather than FAIL, it keeps trying to reconnect to its failed primary indefinitely and never starts a failover, so the cluster never recovers.
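For context, this is roughly the gate involved: a replica only begins an election once its primary carries the quorum-confirmed FAIL flag, never on its own local PFAIL suspicion. A minimal Python sketch of that condition (not the actual Valkey C source, just an illustration of the described behavior):

```python
# Sketch of the failover gate: promotion requires FAIL (quorum-confirmed),
# PFAIL (a single node's local suspicion) never opens it.
from dataclasses import dataclass, field

PFAIL, FAIL = "fail?", "fail"      # flags as they appear in CLUSTER NODES

@dataclass
class Node:
    flags: set[str] = field(default_factory=set)
    numslots: int = 0

def replica_may_start_failover(primary: Node, manual: bool = False) -> bool:
    if FAIL not in primary.flags and not manual:
        return False               # PFAIL alone never opens the gate
    if primary.numslots == 0:
        return False               # a primary with no slots is not worth failing over
    return True

# The stuck shard in this report: the replica's local view stays at PFAIL,
# so the gate never opens and it keeps retrying its dead primary instead.
stuck_view = Node(flags={PFAIL}, numslots=17)    # slots 3876-3892
quorum_view = Node(flags={FAIL}, numslots=17)
print(replica_may_start_failover(stuck_view))    # False -> no promotion
print(replica_may_start_failover(quorum_view))   # True  -> election would start
```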

Expected behavior

The cluster should have recovered: the replicas of the terminated primaries should have been promoted, returning the cluster state to ok.
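A quick way to check whether recovery actually happened (hedged sketch; assumes valkey-cli is on the PATH and a surviving node is reachable at the hypothetical address below):

```python
#!/usr/bin/env python3
"""Check recovery: cluster_state should return to "ok" and no slot-owning
node should remain flagged fail/fail? without a promoted replacement."""
import subprocess

HOST, PORT = "127.0.0.1", "6379"   # hypothetical surviving node

def cli(*args):
    return subprocess.run(["valkey-cli", "-h", HOST, "-p", PORT, *args],
                          capture_output=True, text=True, check=True).stdout

info = dict(line.split(":", 1) for line in cli("cluster", "info").splitlines()
            if ":" in line)
print("cluster_state =", info["cluster_state"].strip())

# Slot-owning nodes still carrying a fail/fail? flag indicate a stuck shard.
for line in cli("cluster", "nodes").splitlines():
    fields = line.split()
    if fields and "fail" in fields[2] and len(fields) > 8:
        print("slot-owning node still failing:", fields[0])
```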

Setup

  • Number of Nodes: 2000
  • Instance Type: r7g.xlarge
  • Sharding Strategy: 1 replica per shard
  • Engine Version: Valkey 8.1.0
  • Cluster node timeout: 15 seconds
