We aim to scale Valkey to support large deployments (2000 nodes) in a cluster-enabled setup, with the following criteria:
- Failure detection period: cluster-node-timeout of 15 seconds
- Deployment strategy:
  - Nodes spread across 3 or more isolated data center zones (availability zones)
  - 1 primary / 1 replica per shard, i.e. 1000 primaries / 1000 replicas
  - 1 primary / 2 replicas per shard, i.e. 666 primaries / 1332 replicas
- Resilience against simultaneous failure scenarios:
  - Simultaneous failure of 33% of all nodes
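The deployment-strategy arithmetic above can be sketched with a small helper (illustrative only, not Valkey code). Note that with 1 primary / 2 replicas, 2000 nodes do not divide evenly into 3-node shards, which is why the criteria list 666 primaries and 1332 replicas (1998 nodes total):

```python
def shard_layout(total_nodes: int, replicas_per_primary: int) -> tuple[int, int]:
    """Return (primaries, replicas) when every shard has one primary
    and `replicas_per_primary` replicas. Leftover nodes are unused."""
    primaries = total_nodes // (1 + replicas_per_primary)
    return primaries, primaries * replicas_per_primary

print(shard_layout(2000, 1))  # (1000, 1000)
print(shard_layout(2000, 2))  # (666, 1332)
```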
Expectations:
- Steady and predictable CPU utilization under normal and failure conditions.
- Consistent and bounded failure detection time (~15 seconds).
- Predictable failover time, bounded to minimize downtime.
- Convergence (cluster state stabilization) within bounded limits after failure recovery.
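The ~15-second detection expectation follows from how the cluster bus flags failures: a node is marked PFAIL after going cluster-node-timeout without responding, and promoted to FAIL once a majority of primaries agree via gossip. A back-of-the-envelope timeline, where the gossip propagation budget is an assumption rather than a Valkey constant:

```python
CLUSTER_NODE_TIMEOUT_MS = 15_000  # cluster-node-timeout from the criteria above

def detection_window_ms(node_timeout_ms: int, gossip_budget_ms: int = 1_000) -> int:
    """Rough upper bound on failure detection: one node-timeout to reach
    PFAIL, plus an assumed budget for gossip rounds until a majority of
    primaries agree and the node is marked FAIL."""
    return node_timeout_ms + gossip_budget_ms

print(detection_window_ms(CLUSTER_NODE_TIMEOUT_MS))  # 16000
```

At 2000 nodes, the gossip budget is the part this effort needs to keep bounded; the linked dissemination and failover issues target exactly that term.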
Alternatives to Consider:
- Client-side sharding across multiple smaller clusters.
- Using a proxy solution to route commands/data across multiple smaller clusters.
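The client-side sharding alternative can be sketched as a key-to-cluster routing function. This is a minimal sketch with hypothetical cluster endpoints; it uses CRC32 modulo for illustration, whereas production clients typically prefer consistent hashing so that resizing the cluster set remaps fewer keys:

```python
import zlib

def pick_cluster(key: str, clusters: list[str]) -> str:
    """Client-side sharding sketch: deterministically route a key to one
    of several smaller clusters by hashing the key."""
    return clusters[zlib.crc32(key.encode()) % len(clusters)]

# Hypothetical endpoints for three smaller clusters.
clusters = ["cluster-a:6379", "cluster-b:6379", "cluster-c:6379"]
print(pick_cluster("user:1001", clusters))
```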
We will use this issue as the high-level tracker and will post periodic updates here, including benchmark results.
High Level Areas:
Cluster initialization / setup:
Concurrent node failure detection / failover:
- Logs consuming significant compute during cluster node failure detection #2076 / Avoid log spam about cluster node failure detection by each primary #2010
- [NEW] Excessive connection attempts to failed nodes in clusterCron() cause CPU overhead #2122
- [NEW] Performance bottleneck in clusterNodeCleanupFailureReports under heavy failure load #2139
- Cluster fails to recover when ~50% of primaries are killed (majority still alive) #2181
- Mark primary node as alive immediately if reachable and failover is not possible #1927
- [NEW] Propagate node availability-zone information across cluster and make failover smarter #1937
- [NEW] Faster cluster failover #2023
Convergence / Information dissemination:
Observability:
- Add node pfail and fail count to cluster info metrics #1910
- [NEW] Measure the amount of network in/out over cluster bus #1929
- Log failed cluster node(s) state periodically to capture transient state for debuggability #2011
- Introduce CLUSTERLOG command support #2029
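The metrics additions above extend the existing CLUSTER INFO output, which is a flat key:value text format. A minimal sketch of consuming it on the client side (the sample payload is illustrative):

```python
def parse_cluster_info(raw: str) -> dict[str, str]:
    """Parse the key:value lines returned by CLUSTER INFO into a dict."""
    return dict(line.split(":", 1) for line in raw.strip().splitlines() if ":" in line)

# Illustrative sample; real output contains many more fields.
sample = "cluster_state:ok\r\ncluster_known_nodes:2000\r\ncluster_size:1000"
info = parse_cluster_info(sample)
print(info["cluster_known_nodes"])  # 2000
```

Keeping new pfail/fail counters in this same format would let existing CLUSTER INFO scrapers pick them up without changes.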