
[NEW] Atomic slot migration HLD  #23

@PingXie

Description


Last Updated: Jan 12, 2025

Problem Statement

Efficient slot migration is critical for maintaining the scalability and resilience of Valkey clusters, but the current client-driven migration approach presents significant challenges.

Operators currently need to perform multiple manual steps to migrate slots, including setting migration states, transferring data, and updating ownership, all of which are prone to errors and inefficiencies.

While existing migration methods ensure availability during the process, they rely on mechanisms like redirection (-MOVED, -ASK) to maintain client access to data. These mechanisms introduce significant client-side complexity, requiring clients to handle redirections, retries, and reconfigurations during migration. Ensuring atomicity and consistency during migrations remains a challenge, especially for operations involving multiple keys or large datasets.

Compounding these challenges, limited observability leaves operators without adequate insight into the migration process, making it difficult to monitor progress or resolve issues efficiently.

Command Definitions

CLUSTER IMPORT SLOTS

  • Syntax: CLUSTER IMPORT SLOTS <slot_start> <slot_end> [<slot_start> <slot_end>...]
  • Arguments:
    • <slot_start>: The starting slot number of a range (integer between 0 and 16383).
    • <slot_end>: The ending slot number of a range (integer between 0 and 16383 and equal to or greater than <slot_start>), inclusive.
  • Return Value:
    • <op_id> on queuing success, where <op_id> is a UUID identifying the long-running import operation.
    • -ERR <error message> on failure (e.g., invalid slot range).
  • Description: This command asynchronously imports the specified slot ranges into the target primary node, allowing the import to be managed in the background. Slots already owned by the target node and slots whose source cannot be determined are skipped. Note that although the import ID is ephemeral and is lost if the target primary terminates, atomic slot migration ensures the cluster remains consistent, with the keys in a slot always residing on a single shard.

CLUSTER IMPORT STATUS

  • Syntax: CLUSTER IMPORT STATUS OPID <op_id>
  • Arguments:
    • <op_id>: The unique identifier of the import operation, as returned by CLUSTER IMPORT SLOTS.
  • Return Value (RESP2 and RESP3): A map with the following information, or -ERR <error message> on failure (e.g., invalid operation ID):
    • Total number of slots requested for import.
    • Number of slots successfully imported.
    • Number of slots that failed to import.
    • Number of slots currently being imported.
    • Number of slots whose import was canceled.
    • Other relevant statistics, such as the total number of keys or bytes moved.
  • Description: This command checks the status and progress of an import. The format of the returned information depends on the client's RESP protocol version: RESP3 clients receive a map, while RESP2 clients receive a nested array of labeled data points. Server logs contain more detail about failed slot migrations.

CLUSTER IMPORT CANCEL

  • Syntax: CLUSTER IMPORT CANCEL OPID <op_id>
  • Arguments:
    • <op_id>: The unique identifier of the import operation, as returned by CLUSTER IMPORT SLOTS.
  • Return Value:
    • +OK
    • -ERR <error message> on failure (e.g., invalid operation ID).
  • Description: This command cancels queued or in-progress import operations, rolling back any in-progress slot transfers. The import session is not immediately deleted upon cancellation or completion; the server retains a FIFO queue of a certain number of completed sessions.
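
For illustration, the three operator-facing commands map naturally onto thin client-side wrappers. The sketch below is Python and assumes a redis-py-compatible client (for example valkey-py) whose generic execute_command passes arbitrary commands through verbatim; the commands themselves are the ones proposed in this design and are not part of any released client API.

import valkey  # assumption: a redis-py-compatible client package

# Usage (assuming valkey-py's redis-py-compatible constructor):
# client = valkey.Valkey(host="10.0.0.3", port=6379, protocol=3, decode_responses=True)

def import_slots(client, *ranges):
    """Queue an import of inclusive (start, end) slot ranges; returns the op ID."""
    args = [n for pair in ranges for n in pair]  # flatten the range pairs
    return client.execute_command("CLUSTER", "IMPORT", "SLOTS", *args)

def import_status(client, op_id):
    """Fetch the progress map for a previously queued import."""
    return client.execute_command("CLUSTER", "IMPORT", "STATUS", "OPID", op_id)

def import_cancel(client, op_id):
    """Cancel a queued or in-progress import; rolls back in-flight slots."""
    return client.execute_command("CLUSTER", "IMPORT", "CANCEL", "OPID", op_id)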

SYNCSLOTS (Internal)

  • Syntax: SYNCSLOTS SLOTS <slot_start> <slot_end> [<slot_start> <slot_end> ...]
  • Arguments:
    • <slot_start>: The starting slot number of a range (integer between 0 and 16383).
    • <slot_end>: The ending slot number of a range (integer between 0 and 16383 and equal to or greater than <slot_start>), inclusive.
  • Return Value:
    • <session_id> on success, where <session_id> is a UUID for the long running full sync operation.
    • -ERR <error message> on failure. Possible error messages include:
      • -ERR Slot range <slot_start>-<slot_end> is already being imported.
  • Description: This command is used internally by Valkey to initiate the transfer of data for specified slot ranges from a source primary node to a target primary node during a slot migration. It is not intended to be used directly by users. Upon receiving this command, the source node first checks whether any of the requested slots are already involved in an ongoing import operation with a different target node. If so, the command fails with an appropriate error message indicating the conflicting slot range.

TRANSFERSLOTS (Internal)

  • Syntax: TRANSFERSLOTS SESSIONID <session_id>
  • Arguments:
    • <session_id>: The unique identifier of the slot sync session, previously provided by the source node in response to a SYNCSLOTS command.
  • Return Value:
    • <tracked_changes_size> on success, where <tracked_changes_size> is the number of bytes of remaining tracked changes.
    • -ERR <error message> on failure. Possible error messages include:
      • -ERR Invalid session ID.
      • -ERR Slot transfer already in progress.
  • Description: This command is used internally by Valkey to signal the source primary node to finalize the transfer of slots that were previously requested via the SYNCSLOTS command. It is not intended to be used directly by users. The target primary node sends this command to the source node after it has finished processing the initial AOF snapshot of the slots and has sufficiently caught up with the ongoing changes.

Operator Workflow

Consider a Valkey cluster comprising three nodes (A, B, and C), each serving a distinct range of hash slots. To optimize resource utilization, an operator decides to migrate slots 1000-2000 from Node A to Node C. This migration is initiated by connecting to Node C via valkey-cli and issuing the command CLUSTER IMPORT SLOTS 1000 2000. Node C responds with a unique identifier for the import operation.

The operator can then periodically monitor the progress of this migration using the command CLUSTER IMPORT STATUS followed by the unique identifier. This command reports high-level slot migration status, allowing the operator to track the migration's progress and identify any potential issues.

Should the need arise to halt the migration, perhaps due to an unforeseen event or a change in operational requirements, the operator can issue the command CLUSTER IMPORT CANCEL along with the corresponding import identifier. This command signals Node C to stop the ongoing migration process.

Upon completion of the migration, the operator can verify the successful transfer of slots 1000-2000 to Node C by inspecting the cluster's slot distribution using commands like CLUSTER NODES. This confirms that the data has been correctly redistributed within the cluster as intended.

valkey-cli -h 10.0.0.3 -p 6379
> HELLO 3
> CLUSTER IMPORT SLOTS 1000 2000 
c7e98d63-5717-479f-8b8a-4c1d49720286 
> CLUSTER IMPORT STATUS OPID c7e98d63-5717-479f-8b8a-4c1d49720286 
1# "requested-slots" => (integer) 1001 
2# "completed-slots" => (integer) 800 
3# "failed-slots" => (integer) 50 
4# "importing-slots" => (integer) 15 
5# "canceled-slots" => (integer) 0 
> CLUSTER IMPORT CANCEL OPID c7e98d63-5717-479f-8b8a-4c1d49720286
+OK
> CLUSTER IMPORT STATUS OPID c7e98d63-5717-479f-8b8a-4c1d49720286 
1# "requested-slots" => (integer) 1001 
2# "completed-slots" => (integer) 800 
3# "failed-slots" => (integer) 50 
4# "importing-slots" => (integer) 0 
5# "canceled-slots" => (integer) 151 

High-Level Migration Flow

[Image: high-level migration flow diagram]

When a target primary receives a CLUSTER IMPORT SLOTS command, it groups the slots for import by source primary node. It then connects to the source primaries and starts the migration using a new internal command, SYNCSLOTS. During migration, the target primary conceptually acts as a "slot-replica" of the source primary. Upon receiving SYNCSLOTS, the source primary forks a child process to create a slot-specific AOF command sequence containing only the keys in the specified slots. This sequence is streamed to the target primary, which executes it on the main thread and replicates it normally to the target replicas. To ensure all changes to the migrating slots are captured, incremental updates (deltas) from the source shard are also streamed to the target primary, just as a replica handles live updates from its primary. Optimizations similar to dual-channel full sync could also be considered.

During atomic slot migration, the target primary takes ownership of the slots only after fully catching up with the source primary. This is done by freezing writes on the source primary for the migrating slots once the target has received the full slot snapshots. Minimizing the delta between the primaries before the ownership transfer is important to keep the write-unavailable window (due to the pause) short. The target continues to receive delta changes during the write pause. Upon achieving full synchronization, the target bumps up its config epoch without consensus. Note that the source primary's write pause is bounded by a timeout (potentially the same as the cluster node timeout). The source primary resumes writes upon detecting the ownership transfer (via the target's cluster message carrying a higher config epoch). If the target does not claim the slots before the timeout, the source unpauses writes and retains ownership. While this may cause data loss in some rare cases, it is acceptable for this design; further solutions are explored in issue #1355.
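
A minimal sketch of the source-side handover logic described above, in Python for readability; every helper name (pause_writes, slots_claimed_by_higher_epoch, and so on) is hypothetical, since the real logic would live inside the Valkey server:

import time

def finalize_handover(slots, pause_timeout_s):
    pause_writes(slots)              # freeze writes to the migrating slots only
    start = time.monotonic()
    while time.monotonic() - start < pause_timeout_s:
        if slots_claimed_by_higher_epoch(slots):
            # The target announced ownership with a higher config epoch:
            # relinquish the slots, drop their data, redirect clients.
            relinquish_slots(slots)
            return True
        time.sleep(0.01)
    # Timeout: the target never claimed the slots. Resume writes and keep
    # ownership, accepting the rare data-loss window discussed in issue #1355.
    unpause_writes(slots)
    return False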

If the source primary fails over before migration completes, the target primary can retry with exponential backoff or proceed to the next source. This implementation detail can be discussed further in the PR review.

The migration concludes when the source shard relinquishes ownership of the migrated slots, removes the associated data, and redirects client traffic to the target shard.

Neither the source nor target replicas are aware of the ongoing slot migration. Target nodes can either be new or serving client traffic. If a source or target replica fails during this process, it will simply perform a full synchronization from its respective primary as though no migration is underway.

Client traffic, including both read and write commands, to the slots in question continues to be directed to the source shard throughout the migration process, with no awareness of the migration.

Inter-Node Slot Import Protocol

The process of transferring slots from a source node to a target node involves an orchestrated sequence of steps to ensure data consistency and availability. This sequence begins with the target primary node sending a SYNCSLOTS command to the source primary, specifying the desired slot range. The source node responds with a unique session ID to identify this specific import operation.

Upon receiving the SYNCSLOTS command and acknowledging with the session ID, the source node forks a child process dedicated to streaming an AOF snapshot of the requested slots to the target. Concurrently, the source node begins tracking changes made to these slots only. This granular tracking mechanism allows the source node to maintain a record of all modifications to the migrating slots without having to pause writes to the entire keyspace, thereby preserving write availability for unaffected slots. This represents a trade-off: increased memory consumption in exchange for enhanced write availability.
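
A minimal sketch of such slot-scoped change tracking, assuming the usual CRC16-based key-to-slot mapping and hypothetical hook names:

from collections import defaultdict

class SlotChangeTracker:
    def __init__(self, migrating_slots):
        self.migrating = set(migrating_slots)
        self.buffers = defaultdict(list)      # slot -> buffered AOF commands

    def on_write(self, key, aof_command):
        slot = key_hash_slot(key)             # hypothetical: CRC16(key) % 16384
        if slot in self.migrating:            # only migrating slots are tracked;
            self.buffers[slot].append(aof_command)  # other writes proceed untouched

    def tracked_bytes(self):
        """Bytes the target must still receive once writes are frozen."""
        return sum(len(cmd) for cmds in self.buffers.values() for cmd in cmds)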

Once the target primary has finished processing the AOF snapshot and has sufficiently caught up with the ongoing changes, it sends a TRANSFERSLOTS SESSIONID <session id> command to the source. This command signals the source to finalize the slot transfer.

In response to TRANSFERSLOTS, the source node pauses all further writes to the slots being migrated. This effectively freezes the tracked changes, ensuring that no new modifications are made to the data while the final transfer is in progress. The source then replies to the TRANSFERSLOTS command with the size of the tracked changes, indicating the amount of data the target node needs to receive to be fully synchronized.

The target node, upon receiving the size of the tracked changes, starts counting the number of bytes received from the source. As soon as the expected number of bytes has been received, the target node is certain that it has a complete and up-to-date copy of the migrating slots. At this point, the target node proceeds to claim ownership of the slots and bumps up its config epoch.
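
Put together, the target's side of the finalization could look like the following sketch; the connection and helper names are hypothetical, but the byte-counting logic mirrors the protocol described above:

def finish_import(conn, session_id):
    # TRANSFERSLOTS freezes writes on the source and returns the frozen
    # tracked-changes size in bytes.
    remaining = int(conn.send_command("TRANSFERSLOTS", "SESSIONID", session_id))
    while remaining > 0:
        chunk = conn.read_chunk()         # next piece of the tracked changes
        apply_aof_commands(chunk)         # executed and replicated as usual
        remaining -= len(chunk)
    claim_slots_and_bump_epoch()          # consensusless config epoch bump
    broadcast_new_ownership()             # cluster bus announcement to all nodes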

Immediately after claiming the slots and updating its epoch, the target node broadcasts this information to the entire cluster. This broadcast ensures that all other nodes are aware of the new slot ownership and can redirect client requests accordingly.

Major Design Considerations

CLUSTER IMPORT Execution Model (ASYNC)

The proposed CLUSTER IMPORT command uses an asynchronous approach for better client compatibility and resilience to network issues. This requires two additional commands for progress monitoring (CLUSTER IMPORT STATUS) and cancellation support (CLUSTER IMPORT CANCEL).

Epoch Bumping Strategies (Consensusless)

In Valkey, a node's configuration epoch determines slot ownership. When a slot is migrated from one node to another, the target node must increase its epoch to a value higher than any other node in the cluster. This ensures a clear and unambiguous handover of ownership, as the node with the highest epoch is considered the rightful owner of the slot.

There are two primary approaches to bumping a node's epoch:

  1. Consensusless Bumping: The target node directly increases its epoch without requiring agreement from other nodes. This method is efficient but carries the risk of epoch collisions if multiple nodes attempt to claim the same slot concurrently (see the sketch following this list).

  2. Consensus-Based Bumping: The target node proposes an epoch increase and requires a majority of nodes in the cluster to approve this change before it can take effect. This approach reduces the risk of collisions but introduces complexity and potential delays.
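
The consensusless variant is simple to state: the claimant picks a config epoch strictly greater than any epoch it has observed cluster-wide. A minimal sketch:

def bump_epoch_consensusless(my_epoch, observed_epochs):
    # Claim an epoch higher than every epoch this node has seen; collisions
    # are possible if two nodes bump concurrently, and are resolved by the
    # cluster's existing config epoch collision handling.
    return max(my_epoch, *observed_epochs) + 1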

For atomic slot migration in Valkey, it is argued that consensus-based epoch bumping is not necessary. This argument rests on the observation that consensus does not inherently eliminate the risk of data loss during slot migration.

Consider a scenario where slots are being migrated from node A to node B. Node B initiates the process, and node A pauses writes to the slots being transferred. After node B fully catches up with the data, it initiates a vote to increase its epoch. However, due to network issues or other unforeseen circumstances, node A might not receive or process this vote request.

Despite this lack of acknowledgment from node A, node B might still secure enough votes from other nodes to bump its epoch. Subsequently, node B claims ownership of the slots. However, if node B's cluster messages fail to reach node A before its pause timeout expires, node A will resume accepting writes to the slots it no longer owns. This leads to an inconsistency where both nodes believe they own the slots, and writes directed to node A after the timeout will ultimately be lost when it eventually receives node B's updated epoch.

This scenario highlights the inherent trade-off between data consistency and availability. While a consensus-based approach might seem to offer stronger consistency guarantees, it cannot prevent data loss in situations where network partitions or node failures occur. Ultimately, the decision of whether to unpause writes on node A without explicit confirmation from node B rests on a delicate balance between ensuring data consistency and maintaining availability.

This complex issue of balancing availability and consistency, particularly in the context of write pauses and timeout mechanisms, is best addressed within the broader discussion of data durability. Therefore, further exploration of this topic and potential solutions are deferred to issue #1355, which focuses on enhancing data durability guarantees in Valkey.

Streaming Format (AOF)

When migrating data between nodes in Valkey, a fundamental design decision involves choosing the appropriate format for representing and transferring that data. Two primary candidates emerge: AOF, which logs Valkey commands, and RDB, a snapshot of the in-memory data. This section analyzes these options, considering the constraints of atomic data types and the implications for memory management on the receiving end.

Regardless of whether AOF, RDB, or a chunked variant of RDB is used, the receiver can only apply a change once the corresponding data is fully received. These changes can be primitive data types like strings and integers or composite data types like SETs, HASHes, etc. The primitives are indivisible and cannot be processed partially.

When a composite data structure is transferred, the receiver must buffer the entire structure in memory for processing. This remains the case even with chunked data unless the chunks are aligned with the atomic data type boundaries within the structure (akin to AOF).

Consequently, streaming the atomic data types (strings and integers) emerges as the most memory-efficient and straightforward approach. This strategy minimizes the buffering and tracking requirements on the receiver, as each atomic unit can be processed and discarded immediately upon reception.
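
To make the memory argument concrete, here is an illustrative sketch of how a large hash could be emitted as a sequence of small AOF-style HSET commands, so the receiver can apply and free each command immediately; the batch size is arbitrary:

def emit_hash_as_aof(key, fields, batch=128):
    """Yield AOF-style commands covering one hash, a bounded batch at a time."""
    items = list(fields.items())
    for i in range(0, len(items), batch):
        chunk = items[i:i + batch]
        args = [part for pair in chunk for part in pair]  # flatten field/value pairs
        yield ["HSET", key] + args    # self-contained; applied, then discarded

Each yielded command is complete on its own, so the target never has to buffer the whole hash, unlike a monolithic RDB-encoded value.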

This approach aligns with the AOF format, which essentially represents data as a sequence of Valkey commands operating on these atomic data types. Using AOF for data transfer offers several advantages:

  • Existing Support: Valkey already has robust mechanisms for handling AOF files, simplifying implementation.
  • Efficient Processing: Multiplexing the loading of AOF commands on the main thread has minimal impact on overall performance.
  • Simplified Replication: Replicating AOF commands to replicas requires little to no additional work.

In contrast, RDB, even when streamed in chunks, presents challenges. Unless the chunks align perfectly with the atomic data type boundaries, the receiver still needs to buffer potentially large segments of data, increasing memory pressure. This makes RDB or chunked RDB less suitable for efficient data transfer in Valkey.

Therefore, based on the constraints of atomic data types and the need to minimize memory pressure on the receiver, AOF emerges as the preferred solution for data transfer in Valkey. Its inherent alignment with atomic units, combined with existing support and efficient processing capabilities, makes it a more suitable choice compared to RDB or its chunked variants.

Observability for Atomic Slot Migration

To enhance the observability of the atomic slot migration process and give operators better visibility into its progress and potential issues, the following detailed metrics can be integrated into CLUSTER INFO:

Metrics on the Source Primary

  • Track the total number of migration requests received.
  • Monitor the current number of slots actively being migrated to a target node.
  • Count the migration operations that failed when the source acted as the sender.
  • Track specific error types encountered during migration (e.g., fork failures, network issues).

Metrics on the Target Primary

  • Track the total number of migration requests initiated.
  • Monitor the current number of slots being imported.
  • Count the migration operations that failed when the target primary acted as the requester.
  • Track specific errors encountered during the import process.
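
For illustration only (none of these field names are finalized by this design), the metrics might surface in CLUSTER INFO along these lines:

> CLUSTER INFO
...
slot_migration_requests_received:4
slot_migration_slots_sending:120
slot_migration_failures_as_source:1
slot_import_requests_initiated:2
slot_import_slots_importing:120
slot_import_failures_as_target:0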
