Support Large Valkey Cluster #2281

@hpatro

We aim to scale Valkey to support large deployments (2000 nodes) in a cluster-enabled setup, with the following criteria:

  • Failure detection period (cluster-node-timeout): 15 seconds

  • Deployment strategy

    • Nodes spread across 3 or more isolated data center zones (availability zones)
      • 1 primary / 1 replica per shard, i.e. 1000 primaries / 1000 replicas
      • 1 primary / 2 replicas per shard, i.e. 666 primaries / 1332 replicas
  • Resilience against simultaneous failure scenarios:

    • Simultaneous failure of 33% of all nodes
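
As a sanity check on the numbers above, here is a minimal sketch of the arithmetic behind the two shard layouts. The function and field names (`plan_cluster`, `ClusterPlan`) are hypothetical and not part of Valkey. The `failover_quorum` field assumes a failover needs votes from a majority of primaries, and `failed_nodes` simply takes 33% of the total node count without modelling which zones or roles are hit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterPlan:
    """Illustrative summary of one shard layout (hypothetical helper type)."""
    shards: int
    primaries: int
    replicas: int
    unused_nodes: int
    failover_quorum: int
    failed_nodes: int


def plan_cluster(total_nodes: int, replicas_per_shard: int,
                 failure_pct: int = 33) -> ClusterPlan:
    """Split total_nodes into shards of 1 primary + replicas_per_shard replicas."""
    nodes_per_shard = 1 + replicas_per_shard
    shards = total_nodes // nodes_per_shard      # whole shards only
    primaries = shards                           # one primary per shard
    replicas = shards * replicas_per_shard
    return ClusterPlan(
        shards=shards,
        primaries=primaries,
        replicas=replicas,
        unused_nodes=total_nodes - primaries - replicas,
        # Assumption: failover authorization needs a majority of primaries.
        failover_quorum=primaries // 2 + 1,
        # Node count corresponding to the simultaneous-failure target.
        failed_nodes=total_nodes * failure_pct // 100,
    )


if __name__ == "__main__":
    for replicas_per_shard in (1, 2):
        print(plan_cluster(2000, replicas_per_shard))
```

With 2 replicas per shard, 2000 nodes do not divide evenly, which is why that layout ends up at 666 primaries and 1332 replicas (1998 nodes in total).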

Expectations:

  • Steady and predictable CPU utilization under normal and failure conditions.

  • Consistent and bounded failure detection time (~15 seconds).

  • Predictable failover time, bounded to keep downtime minimal.

  • Convergence (cluster state stabilization) within a bounded time after failure recovery.

Alternatives to Consider:

  • Client-side sharding across multiple smaller clusters.

  • Using a proxy solution to route commands/data across multiple smaller clusters.

We will use this issue as the high-level tracker and will periodically post updates here, including benchmark results.

High Level Areas:

Cluster initialization / setup:

Concurrent node failure detection / failover:

Convergence / Information dissemination:

Observability:

Steady state [CPU utilization]:
