We aim to scale Valkey to support large deployments (2000 nodes) in a cluster-enabled setup, with the following criteria:
- Failure detection period: cluster-node-timeout of 15 seconds
- Deployment strategy:
  - Nodes spread across 3 or more isolated data center zones (availability zones)
  - 1 primary / 1 replica per shard, i.e. 1000 primaries / 1000 replicas
  - 1 primary / 2 replicas per shard, i.e. 666 primaries / 1332 replicas
- Resilience against simultaneous failure scenarios:
  - Simultaneous failure of 33% of all nodes
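The deployment-strategy arithmetic above can be sketched with a small helper (illustrative only, not Valkey code). Note that with 1 primary / 2 replicas, 2000 nodes do not divide evenly into 3-node shards, which is why the criteria list 666 primaries and 1332 replicas (1998 nodes total):

```python
def shard_layout(total_nodes: int, replicas_per_primary: int) -> tuple[int, int]:
    """Return (primaries, replicas) when every shard has one primary
    and `replicas_per_primary` replicas. Leftover nodes are unused."""
    primaries = total_nodes // (1 + replicas_per_primary)
    return primaries, primaries * replicas_per_primary

print(shard_layout(2000, 1))  # (1000, 1000)
print(shard_layout(2000, 2))  # (666, 1332)
```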
Expectations:
- Steady and predictable CPU utilization under normal and failure conditions.
- Consistent and bounded failure detection time (~15 seconds).
- Predictable failover time, bounded to minimize downtime.
- Convergence (cluster state stabilization) within bounded limits after failure recovery.
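The ~15-second detection expectation follows from how the cluster bus flags failures: a node is marked PFAIL after going cluster-node-timeout without responding, and promoted to FAIL once a majority of primaries agree via gossip. A back-of-the-envelope timeline, where the gossip propagation budget is an assumption rather than a Valkey constant:

```python
CLUSTER_NODE_TIMEOUT_MS = 15_000  # cluster-node-timeout from the criteria above

def detection_window_ms(node_timeout_ms: int, gossip_budget_ms: int = 1_000) -> int:
    """Rough upper bound on failure detection: one node-timeout to reach
    PFAIL, plus an assumed budget for gossip rounds until a majority of
    primaries agree and the node is marked FAIL."""
    return node_timeout_ms + gossip_budget_ms

print(detection_window_ms(CLUSTER_NODE_TIMEOUT_MS))  # 16000
```

At 2000 nodes, the gossip budget is the part this effort needs to keep bounded; the linked dissemination and failover issues target exactly that term.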
Alternatives to Consider:
- Client-side sharding across multiple smaller clusters.
- Using a proxy solution to route commands/data across multiple smaller clusters.
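The client-side sharding alternative can be sketched as a key-to-cluster routing function. This is a minimal sketch with hypothetical cluster endpoints; it uses CRC32 modulo for illustration, whereas production clients typically prefer consistent hashing so that resizing the cluster set remaps fewer keys:

```python
import zlib

def pick_cluster(key: str, clusters: list[str]) -> str:
    """Client-side sharding sketch: deterministically route a key to one
    of several smaller clusters by hashing the key."""
    return clusters[zlib.crc32(key.encode()) % len(clusters)]

# Hypothetical endpoints for three smaller clusters.
clusters = ["cluster-a:6379", "cluster-b:6379", "cluster-c:6379"]
print(pick_cluster("user:1001", clusters))
```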
We will use this issue as the high-level tracker and will post periodic updates here, including benchmark results.
High Level Areas:
Cluster initialization / setup:
Concurrent node failure detection / failover:
- Logs consuming significant compute during cluster node failure detection #2076 / Avoid log spam about cluster node failure detection by each primary #2010
- [NEW] Excessive connection attempts to failed nodes in clusterCron() cause CPU overhead #2122
- [NEW] Performance bottleneck in clusterNodeCleanupFailureReports under heavy failure load #2139
- Cluster fails to recover when ~50% of primaries are killed (majority still alive) #2181
- Mark primary node as alive immediately if reachable and failover is not possible #1927
- [NEW] Propagate node availability-zone information across cluster and make failover smarter #1937
- [NEW] Faster cluster failover #2023
Convergence / Information dissemination:
Observability:
- Add node pfail and fail count to cluster info metrics #1910
- [NEW] Measure the amount of network in/out over cluster bus #1929
- Log failed cluster node(s) state periodically to capture transient state for debuggability #2011
- Introduce CLUSTERLOG command support #2029
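The metrics additions above extend the existing CLUSTER INFO output, which is a flat key:value text format. A minimal sketch of consuming it on the client side (the sample payload is illustrative):

```python
def parse_cluster_info(raw: str) -> dict[str, str]:
    """Parse the key:value lines returned by CLUSTER INFO into a dict."""
    return dict(line.split(":", 1) for line in raw.strip().splitlines() if ":" in line)

# Illustrative sample; real output contains many more fields.
sample = "cluster_state:ok\r\ncluster_known_nodes:2000\r\ncluster_size:1000"
info = parse_cluster_info(sample)
print(info["cluster_known_nodes"])  # 2000
```

Keeping new pfail/fail counters in this same format would let existing CLUSTER INFO scrapers pick them up without changes.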