Commit bb7e5bf
pizhenweisigemptyFujiZzhuojiang123zhangyiming1201
committed
Introduce Valkey Over RDMA protocol
RDMA (remote direct memory access) is a technology that lets computers in a network exchange data in main memory without involving the processor, cache, or operating system of either computer. As a result, RDMA can outperform TCP: test results show that Valkey Over RDMA achieves ~2.5X the QPS with lower latency.

In recent years, RDMA has become popular in data centers, and the RoCE (RDMA over Converged Ethernet) architecture in particular is widely used.

Introduce the Valkey Over RDMA protocol as a new transport for Valkey. For now, 4 commands are defined:
- GetServerFeature & SetClientFeature: these two commands negotiate features for further extension. No features are defined in this version. Flow control and multi-buffer may be supported in the future, which requires feature negotiation.
- Keepalive
- RegisterXferMemory: the core command for transferring the real payload.

The 'TX buffer' and 'RX buffer' are built on RDMA remote memory with RDMA write/write with imm. This is similar (but not identical) to several mechanisms introduced by papers:
- Socksdirect: datacenter sockets can be fast and compatible
  <https://dl.acm.org/doi/10.1145/3341302.3342071>
- LITE Kernel RDMA Support for Datacenter Applications
  <https://dl.acm.org/doi/abs/10.1145/3132747.3132762>
- FaRM: Fast Remote Memory
  <https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-dragojevic.pdf>

Co-authored-by: Xinhao Kong <[email protected]>
Co-authored-by: Huaping Zhou <[email protected]>
Co-authored-by: zhuo jiang <[email protected]>
Co-authored-by: Yiming Zhang <[email protected]>
Co-authored-by: Jianxi Ye <[email protected]>
Signed-off-by: zhenwei pi <[email protected]>
1 parent 6cff0d6 commit bb7e5bf

RDMA.md

Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
RDMA Support
============

Connections
-----------

RDMA operations also go through a connection abstraction layer that hides
I/O and read/write event handling from the caller.

Valkey uses a stream-oriented protocol, whereas RDMA is message-oriented, so additional work is required to support RDMA-based Valkey.

## Protocol
Valkey Over RDMA separates the control plane (which exchanges control messages) from the data plane (which
transfers the real payload for Valkey).

### Control message
Control messages are fixed at 32 bytes and use the following structures:
```
#include <stdint.h>

typedef struct ValkeyRdmaFeature {
    /* one of the opcodes defined in the Opcodes table */
    uint16_t opcode;
    /* select features */
    uint16_t select;
    uint8_t rsvd[20];
    /* feature bits */
    uint64_t features;
} ValkeyRdmaFeature;

typedef struct ValkeyRdmaKeepalive {
    /* one of the opcodes defined in the Opcodes table */
    uint16_t opcode;
    uint8_t rsvd[30];
} ValkeyRdmaKeepalive;

typedef struct ValkeyRdmaMemory {
    /* one of the opcodes defined in the Opcodes table */
    uint16_t opcode;
    uint8_t rsvd[14];
    /* address of a transfer buffer which is used to receive remote streaming data,
     * aka 'RX buffer address'. The remote side should use this as 'TX buffer address' */
    uint64_t addr;
    /* length of the 'RX buffer' */
    uint32_t length;
    /* the RDMA remote key of 'RX buffer' */
    uint32_t key;
} ValkeyRdmaMemory;

typedef union ValkeyRdmaCmd {
    ValkeyRdmaFeature feature;
    ValkeyRdmaKeepalive keepalive;
    ValkeyRdmaMemory memory;
} ValkeyRdmaCmd;
```
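
The fixed 32-byte size can be checked at compile time. The sketch below repeats the structures so that it stands alone, then asserts the size and field offsets that follow from the declarations on a typical ABI with natural alignment. The helper `valkey_rdma_make_register` is a hypothetical name used only for illustration, and it fills the fields in host byte order because this section does not state the byte order used on the wire.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Structures repeated from the protocol description so the sketch is self-contained. */
typedef struct ValkeyRdmaFeature {
    uint16_t opcode;
    uint16_t select;
    uint8_t rsvd[20];
    uint64_t features;
} ValkeyRdmaFeature;

typedef struct ValkeyRdmaKeepalive {
    uint16_t opcode;
    uint8_t rsvd[30];
} ValkeyRdmaKeepalive;

typedef struct ValkeyRdmaMemory {
    uint16_t opcode;
    uint8_t rsvd[14];
    uint64_t addr;
    uint32_t length;
    uint32_t key;
} ValkeyRdmaMemory;

typedef union ValkeyRdmaCmd {
    ValkeyRdmaFeature feature;
    ValkeyRdmaKeepalive keepalive;
    ValkeyRdmaMemory memory;
} ValkeyRdmaCmd;

/* Every control message must occupy exactly 32 bytes. */
_Static_assert(sizeof(ValkeyRdmaFeature) == 32, "feature message must be 32 bytes");
_Static_assert(sizeof(ValkeyRdmaKeepalive) == 32, "keepalive message must be 32 bytes");
_Static_assert(sizeof(ValkeyRdmaMemory) == 32, "memory message must be 32 bytes");
_Static_assert(sizeof(ValkeyRdmaCmd) == 32, "control message must be 32 bytes");

/* The reserved padding places the payload fields at these offsets. */
_Static_assert(offsetof(ValkeyRdmaFeature, features) == 24, "features at byte 24");
_Static_assert(offsetof(ValkeyRdmaMemory, addr) == 16, "addr at byte 16");
_Static_assert(offsetof(ValkeyRdmaMemory, length) == 24, "length at byte 24");
_Static_assert(offsetof(ValkeyRdmaMemory, key) == 28, "key at byte 28");

/* Hypothetical helper: build a RegisterXferMemory message (opcode 3).
 * Assumption: fields are left in host byte order; the wire byte order
 * is not specified in this section. */
ValkeyRdmaCmd valkey_rdma_make_register(uint64_t addr, uint32_t length, uint32_t key)
{
    ValkeyRdmaCmd cmd;

    memset(&cmd, 0, sizeof(cmd)); /* reserved bytes are zeroed */
    cmd.memory.opcode = 3;        /* RegisterXferMemory */
    cmd.memory.addr = addr;
    cmd.memory.length = length;
    cmd.memory.key = key;
    return cmd;
}
```

Zeroing the whole union before filling it keeps the reserved bytes deterministic, which may matter if they are given a meaning in a later protocol version.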

### Opcodes
| Command | Value | Description |
| :----: | :----: | :----: |
| GetServerFeature | 0 | required, get the features offered by the Valkey server |
| SetClientFeature | 1 | required, negotiate features and set them on the Valkey server |
| Keepalive | 2 | required, detect unexpectedly orphaned connections |
| RegisterXferMemory | 3 | required, tell the remote side the 'RX transfer buffer' information; the remote side uses it as its 'TX transfer buffer' |
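
The opcode table maps naturally onto a C enumeration. The following is a minimal sketch: the enumerator names and the function `valkey_rdma_opcode_name` are hypothetical and chosen for illustration, while the numeric values come from the table.

```c
#include <assert.h> /* used only by callers that want to assert on the result */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical enumerator names; the values are those in the opcode table. */
typedef enum ValkeyRdmaOpcode {
    VALKEY_RDMA_GET_SERVER_FEATURE = 0,
    VALKEY_RDMA_SET_CLIENT_FEATURE = 1,
    VALKEY_RDMA_KEEPALIVE = 2,
    VALKEY_RDMA_REGISTER_XFER_MEMORY = 3
} ValkeyRdmaOpcode;

/* Return the command name for an opcode value, or NULL if the value
 * is not one of the four commands defined in this protocol version. */
const char *valkey_rdma_opcode_name(uint16_t opcode)
{
    switch (opcode) {
    case VALKEY_RDMA_GET_SERVER_FEATURE:
        return "GetServerFeature";
    case VALKEY_RDMA_SET_CLIENT_FEATURE:
        return "SetClientFeature";
    case VALKEY_RDMA_KEEPALIVE:
        return "Keepalive";
    case VALKEY_RDMA_REGISTER_XFER_MEMORY:
        return "RegisterXferMemory";
    default:
        return NULL;
    }
}
```

Returning NULL for unknown values gives a receiver a simple way to reject control messages it does not understand.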

### Operations of RDMA
- To send a control message, call '**ibv_post_send**' with opcode '**IBV_WR_SEND**' and a
  'ValkeyRdmaCmd' structure as the payload.
- To receive a control message, call '**ibv_post_recv**'; the receive buffer
  size should be the size of 'ValkeyRdmaCmd'.
- To transfer stream data, call '**ibv_post_send**' with opcode '**IBV_WR_RDMA_WRITE**' (optional) and
  '**IBV_WR_RDMA_WRITE_WITH_IMM**' (required). Data segments are written into a connection as
  RDMA [WRITE][WRITE][WRITE]...[WRITE WITH IMM], and the total length of the written data is carried in the
  immediate data (an unsigned 32-bit integer).
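
The [WRITE]...[WRITE WITH IMM] pattern can be illustrated without any RDMA hardware by planning the segments first. The sketch below is plain C and does not call libibverbs; the types `SegKind` and `Segment`, the function `plan_stream_write`, and the per-segment size limit `max_seg` are all hypothetical. It assumes that only the final segment carries immediate data and that the immediate value is the total number of bytes in the burst. Real code would also have to choose a byte order for the immediate value, which this section does not specify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for IBV_WR_RDMA_WRITE and IBV_WR_RDMA_WRITE_WITH_IMM. */
typedef enum SegKind { SEG_WRITE, SEG_WRITE_WITH_IMM } SegKind;

typedef struct Segment {
    SegKind kind;
    uint32_t offset; /* offset of this segment within the burst */
    uint32_t length; /* bytes in this segment */
    uint32_t imm;    /* total burst length; meaningful only for SEG_WRITE_WITH_IMM */
} Segment;

/* Split `total` bytes into segments of at most `max_seg` bytes each.
 * All segments except the last are plain writes; the last one is a
 * write with immediate data carrying the total length.
 * Returns the number of segments stored in `out`, or 0 if the input is
 * invalid or `cap` is too small to hold the plan. */
size_t plan_stream_write(uint32_t total, uint32_t max_seg, Segment *out, size_t cap)
{
    if (total == 0 || max_seg == 0)
        return 0;

    /* 64-bit arithmetic avoids overflow in the rounding-up division. */
    size_t count = (size_t)(((uint64_t)total + max_seg - 1) / max_seg);
    if (count > cap)
        return 0;

    uint32_t offset = 0;
    for (size_t i = 0; i < count; i++) {
        uint32_t remaining = total - offset;
        uint32_t length = remaining < max_seg ? remaining : max_seg;
        int last = (i + 1 == count);

        out[i].kind = last ? SEG_WRITE_WITH_IMM : SEG_WRITE;
        out[i].offset = offset;
        out[i].length = length;
        out[i].imm = last ? total : 0;
        offset += length;
    }
    return count;
}
```

A payload that fits in one segment produces a single write with immediate data, which matches the table's marking of the plain write as optional and the write with immediate as required.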


### Maximum WQE(s) of RDMA
This protocol currently defines no specific restriction. The recommended number of WQEs is 1024.
Flow control for WQEs MAY be defined/implemented in the future.


### The workflow of this protocol
```
                                                            valkey-server
                                                            listen RDMA port
valkey-client
    -------------------RDMA connect-------------------->
                                                            accept connection
    <--------------- Establish RDMA --------------------

    --------Get server feature [@IBV_WR_SEND] --------->

    --------Set client feature [@IBV_WR_SEND] --------->
                                                            setup RX buffer
    <---- Register transfer memory [@IBV_WR_SEND] ------
[@ibv_post_recv]
setup TX buffer
    ----- Register transfer memory [@IBV_WR_SEND] ----->
                                                            [@ibv_post_recv]
                                                            setup TX buffer
    -- Valkey commands [@IBV_WR_RDMA_WRITE_WITH_IMM] -->
    <- Valkey response [@IBV_WR_RDMA_WRITE_WITH_IMM] ---
                      .......
    -- Valkey commands [@IBV_WR_RDMA_WRITE_WITH_IMM] -->
    <- Valkey response [@IBV_WR_RDMA_WRITE_WITH_IMM] ---
                      .......


RX is full
    ----- Register transfer memory [@IBV_WR_SEND] ----->
                                                            [@ibv_post_recv]
                                                            setup TX buffer
    <- Valkey response [@IBV_WR_RDMA_WRITE_WITH_IMM] ---
                      .......

                                                            RX is full
    <---- Register transfer memory [@IBV_WR_SEND] ------
[@ibv_post_recv]
setup TX buffer
    -- Valkey commands [@IBV_WR_RDMA_WRITE_WITH_IMM] -->
    <- Valkey response [@IBV_WR_RDMA_WRITE_WITH_IMM] ---
                      .......

    -------------------RDMA disconnect----------------->
    <------------------RDMA disconnect------------------
```
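
In the workflow above, each side treats the peer's registered RX buffer as its own TX buffer and sets it up again whenever a new RegisterXferMemory message arrives. One way a sender might track this is sketched below. The type `TxWindow` and its functions are hypothetical, and the sketch assumes (the section does not say so) that the sender fills the remote buffer sequentially from offset 0 and simply stops writing once it is full, until the peer registers a buffer again.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sender-side view of the peer's registered RX buffer. */
typedef struct TxWindow {
    uint64_t addr;   /* remote address from RegisterXferMemory */
    uint32_t length; /* remote buffer length */
    uint32_t key;    /* RDMA remote key of the buffer */
    uint32_t offset; /* bytes already written into the remote buffer */
} TxWindow;

/* "setup TX buffer": adopt a newly registered remote buffer and start
 * writing from its beginning again. */
void tx_window_register(TxWindow *tx, uint64_t addr, uint32_t length, uint32_t key)
{
    tx->addr = addr;
    tx->length = length;
    tx->key = key;
    tx->offset = 0;
}

/* Bytes that can still be written before the remote RX buffer is full. */
uint32_t tx_window_space(const TxWindow *tx)
{
    return tx->length - tx->offset;
}

/* Reserve up to `want` bytes for the next write. Stores the remote
 * address to write to in `*remote_addr` and returns the number of bytes
 * actually reserved, which is 0 when the remote buffer is full. */
uint32_t tx_window_reserve(TxWindow *tx, uint32_t want, uint64_t *remote_addr)
{
    uint32_t space = tx_window_space(tx);
    uint32_t granted = want < space ? want : space;

    *remote_addr = tx->addr + tx->offset;
    tx->offset += granted;
    return granted;
}
```

When `tx_window_reserve` returns 0, the sender in this sketch would wait for the next RegisterXferMemory message, which corresponds to the "RX is full" steps in the diagram.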
