
Conversation

@pizhenwei
Contributor

@pizhenwei pizhenwei commented May 9, 2024

Adds an option to build RDMA support as a module:

make BUILD_RDMA=module

To start valkey-server with RDMA, use a command line like the following:

./src/valkey-server --loadmodule src/valkey-rdma.so \
    port=6379 bind=xx.xx.xx.xx
  • Implement the server side of the connection module only; this means we can
    NOT compile RDMA support as a built-in.

  • Add necessary information in README.md

  • Support 'CONFIG SET/GET', for example 'CONFIG SET rdma.port 6380'; this can
    then be checked with 'rdma res show cm_id' and with valkey-cli (with RDMA
    support, which is not implemented in this patch).

  • The full listener list looks like:

    listener0:name=tcp,bind=*,bind=-::*,port=6379
    listener1:name=unix,bind=/var/run/valkey.sock
    listener2:name=rdma,bind=xx.xx.xx.xx,bind=yy.yy.yy.yy,port=6379
    listener3:name=tls,bind=*,bind=-::*,port=16379
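As an illustrative session only, the runtime reconfiguration described above could be exercised like this (the xx.xx.xx.xx address is the placeholder from above; valkey-cli connects over plain TCP here, since an RDMA-capable valkey-cli is not part of this patch):

```shell
# Change the RDMA listening port at runtime over an ordinary TCP connection.
valkey-cli -h xx.xx.xx.xx -p 6379 CONFIG SET rdma.port 6380

# Confirm the RDMA CM listener moved to the new port.
rdma res show cm_id
```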
    

Because TCL lacks RDMA support, a simple C program is used to test
Valkey Over RDMA (under tests/rdma/). This is a fairly raw version with basic
library dependencies: libpthread, libibverbs, librdmacm. Run it using the script:

./runtest-rdma [ OPTIONS ]

To run RDMA in GitHub Actions, the RXE kernel module for emulated soft RDMA needs
to be installed. The kernel module source code is fetched from a repo containing only
the RXE driver from the Linux kernel; it is stored in a separate repo to
avoid cloning the whole Linux kernel tree.
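As a sketch of that soft-RDMA setup (the device and netdev names are example assumptions; this requires root, a kernel built with the rxe driver, and the iproute2 rdma tool):

```shell
# Load the soft-RoCE (RXE) kernel module.
sudo modprobe rdma_rxe

# Attach an emulated RDMA device to an existing network interface
# (rxe0 and eth0 are example names).
sudo rdma link add rxe0 type rxe netdev eth0

# Verify the software RDMA device is available.
rdma link show
```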


In June 2021, I created a PR for a Redis Over RDMA proposal. I then did some work to fully abstract the connection layer and make TLS dynamically loadable; since Redis 7.2.0, a new connection type can be built into Redis statically, or as a separate shared library loaded by Redis on startup.

Based on the new connection framework, I created a new PR, which several people (@xiezhq-hermann @zhangyiming1201 @JSpewock @uvletter @FujiZ) noticed, tried, and tested. However, due to a lack of time and knowledge on the maintainers' side, that PR was left pending for about 2 years.

Related doc: Introduce Valkey Over RDMA specification. (The specification is the same as for Redis, and it should stay the same.)

Changes in this PR:

  • implement Valkey Over RDMA (adapted to the Valkey code style)

Finally, if this feature is accepted for merging, I volunteer to maintain it.

@codecov

codecov bot commented May 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.24%. Comparing base (9948f07) to head (9d24d15).

Additional details and impacted files
@@            Coverage Diff            @@
##           unstable     #477   +/-   ##
=========================================
  Coverage     70.23%   70.24%           
=========================================
  Files           112      112           
  Lines         60602    60602           
=========================================
+ Hits          42566    42568    +2     
+ Misses        18036    18034    -2     

see 9 files with indirect coverage changes

@pizhenwei
Contributor Author

This PR can be tested with a client.

To build client with RDMA:

make BUILD_RDMA=yes -j16

To test by commands:

Config of server: appendonly no, port 6379, rdma-port 6379,
                  server_cpulist 12, bgsave_cpulist 16.
For RDMA: ./redis-benchmark -h HOST -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100 --rdma \
	  --server_cpulist 2 --bio_cpulist 3 --aof_rewrite_cpulist 3 --bgsave_cpulist 3
For TCP: ./redis-benchmark -h HOST -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100

@madolson madolson added the major-decision-pending Major decision pending by TSC team label May 12, 2024
@hz-cheng

hz-cheng commented May 20, 2024

Many cloud providers offer RDMA acceleration on their cloud platforms, and I think that there is a foundational basis for the application of Valkey over RDMA. We performed some performance tests on this PR on the 8th generation ECS instances (g8ae.4xlarge, 16 vCPUs, 64G DDR) provided by Alibaba Cloud. Test results indicate that, compared to TCP sockets, the use of RDMA can significantly enhance performance.

Test command of server side:

./src/valkey-server --port 6379 \
  --loadmodule src/valkey-rdma.so port=6380 bind=11.0.0.114 \
  --loglevel verbose --protected-mode no \
  --server_cpulist 12 --bgsave_cpulist 16 --appendonly no 

Test command of client side:

# Test for RDMA
./src/redis-benchmark -h 11.0.0.114 -p 6380 -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100 --rdma

# Test for TCP socket
./src/redis-benchmark -h 11.0.0.114 -p 6379 -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100

The performance test results are shown in the following table. Apart from LRANGE_100 (which improved, but not substantially), in the other scenarios (PING, SET, GET) throughput increased by at least 76%, and the average (AVG) and P99 latencies were reduced by at least 40%.

| Test | Metric | RDMA | TCP | RDMA/TCP |
| --- | --- | --- | --- | --- |
| PING_INLINE | Throughput | 666577.81 | 366394.31 | 181.93% |
| | Latency-AVG | 0.044 | 0.08 | 55.00% |
| | Latency-P99 | 0.063 | 0.127 | 49.61% |
| PING_MBULK | Throughput | 688657.81 | 395397.56 | 174.17% |
| | Latency-AVG | 0.042 | 0.073 | 57.53% |
| | Latency-P99 | 0.063 | 0.119 | 52.94% |
| SET | Throughput | 434744.78 | 157726.22 | 275.63% |
| | Latency-AVG | 0.068 | 0.188 | 36.17% |
| | Latency-P99 | 0.111 | 0.183 | 60.66% |
| GET | Throughput | 562587.94 | 319478.59 | 176.10% |
| | Latency-AVG | 0.052 | 0.091 | 57.14% |
| | Latency-P99 | 0.079 | 0.151 | 52.32% |
| LRANGE | Throughput | 526260.38 | 211434.36 | 248.90% |
| | Latency-AVG | 0.056 | 0.14 | 40.00% |
| | Latency-P99 | 0.079 | 0.159 | 49.69% |
| LRANGE_100 | Throughput | 57106.96 | 49498.34 | 115.37% |
| | Latency-AVG | 0.427 | 0.499 | 85.57% |
| | Latency-P99 | 4.207 | 13.367 | 31.47% |

@pizhenwei
Contributor Author

Many cloud providers offer RDMA acceleration on their cloud platforms, and I think that there is a foundational basis for the application of Valkey over RDMA. […]

Hi, @hz-cheng

I noticed that you are the author of the alibaba-cloud erdma driver for both the Linux kernel and rdma-core. Cooooooooool!

@hz-cheng

hz-cheng commented May 21, 2024

Furthermore, if necessary, I could try reaching out to relevant colleagues to see if we can offer some Alibaba Cloud ECS instances to the community for free, so that the community can use and test Valkey over RDMA, and for future CI/CD purposes.

@baronwangr

Is there a corresponding client that enables RDMA?

@pizhenwei
Contributor Author

Is there a corresponding client that enables RDMA?

See this comment please.

@pizhenwei
Contributor Author

Many cloud providers offer RDMA acceleration on their cloud platforms […] I notice that you are the author of the alibaba-cloud erdma driver for both the Linux kernel and rdma-core. Cooooooooool!

Hi @madolson ,
The feedback from the cloud vendor (alibaba-cloud) side shows the improvement, which means lots of end users will be able to enjoy it easily. Please let me know about any concerns with this feature.

@zuiderkwast
Contributor

Almost doubled throughput is impressive. I don't know much about RDMA. It's many lines of code, but all of it is in the module. That's great, but what are the risks of breaking it if we change something in the connection abstractions? We need to be aware that when we merge this, we will have to keep maintaining it.

Is it possible to use TLS with RDMA?

@madolson
Member

@pizhenwei The numbers do look great. I haven't gotten a chance to look at it yet, hopefully some time this week.

@pizhenwei
Contributor Author

Almost doubled throughput is impressive. I don't know much about RDMA. It's many lines of code, but all of it is the module. That's great, but what are the risks of breaking it if we change something in the connection abstractions? We need to be aware that when we merge this, we will have to keep maintaining this.

Hi,

Because valkey-rdma.so (if built as a module) uses the struct ConnectionType as its ABI, the RDMA support must change together with the core connection abstractions.

To avoid the risks of a mismatched struct ConnectionType, the module side checks the version strictly (so does valkey-tls.so), like:

    /* Connection modules MUST be part of the same build as valkey. */
    if (strcmp(REDIS_BUILD_ID_RAW, serverBuildIdRaw())) {
        serverLog(LL_NOTICE, "Connection type %s was not built together with the valkey-server used.", CONN_TYPE_RDMA);
        return VALKEYMODULE_ERR;
    }

Once the core connection abstraction changes, all the connection types have to do compatibility work; this rule also applies to RDMA. I volunteer to maintain the RDMA support.

PS: I have experience in open source communities such as the Linux kernel, QEMU, Redis, SPDK, libiscsi, tgt, atop, util-linux and procps-ng.

Is it possible to use TLS with RDMA?

As far as I can see, we can't use TLS with RDMA currently. I read the OpenSSL document on the Abstract Record Layer; TLS over RDMA is workable in theory, but it would be a large amount of work.

@hwware
Member

hwware commented May 28, 2024

@pizhenwei Thanks for your contribution, and @hz-cheng thanks for your impressive numbers.
First, I need to say I like this feature. But I have 3 questions here:

  1. In RDMA.md, you mentioned that Valkey Over RDMA is only supported on Linux. I am not sure whether it is supported on CentOS
    and macOS? Or do you mean it is not supported on Windows?

  2. Is it possible to integrate with the core code directly instead of working as a module in Valkey? Any risks or difficulties?

  3. You mention both RDMA and TCP enabled at the same time? Is there any benefit to it? Does it mean some specific clients or replica nodes connect to the RDMA port? Could you please describe a little bit more? Thanks

Let the core team members discuss this important feature and send you feedback ASAP. Thanks

@pizhenwei
Contributor Author

@pizhenwei Thanks for your contribution, and @hz-cheng thanks for your impressive numbers. First, I need to say I like this feature. But I have 3 questions here:

  1. In RDMA.md, you mentioned that Valkey Over RDMA is only supported on Linux. I am not sure whether it is supported on CentOS
    and macOS? Or do you mean it is not supported on Windows?

There are two parts to this PR:

  • The Valkey Over RDMA protocol. This defines the transmission type RC (like TCP), the communication commands, and the payload exchange mechanism. It depends only on the RDMA (aka InfiniBand) specification and is OS and hardware independent.
  • The Linux implementation. I developed and tested this feature on Ubuntu 20.04/22.04 and Debian 9/10. I guess @hz-cheng tested it on a newer Linux distribution, because erdma support landed recently. I haven't tested it on CentOS/RHEL/SUSE, but I believe it should work fine if the hardware driver is ready.

I have no experience with RDMA on Windows. I read the documentation and found that Windows does support RDMA, but not the Linux-style Verbs API. This means we would need a Windows version in the future (I imagine an rdma-windows.c would be needed).

  2. Is it possible to integrate with the core code directly instead of working as a module in Valkey? Any risks or difficulties?

It's quite easy to build RDMA support into Valkey with a few lines of change. If so, valkey-server has to link libibverbs.so and librdmacm.so.

Let's look at the dynamic shared libraries of module version:

# ldd valkey-server 
	linux-vdso.so.1 (0x00007ffdbc546000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f41f0fa1000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f41f0800000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f41f10a8000)
# ldd valkey-rdma.so 
	linux-vdso.so.1 (0x00007fff83bbb000)
	librdmacm.so.1 => /lib/x86_64-linux-gnu/librdmacm.so.1 (0x00007f7e7c29e000)
	libibverbs.so.1 => /lib/x86_64-linux-gnu/libibverbs.so.1 (0x00007f7e7c27b000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f7e7c000000)
	libnl-3.so.200 => /lib/x86_64-linux-gnu/libnl-3.so.200 (0x00007f7e7c258000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f7e7c2f5000)
	libnl-route-3.so.200 => /lib/x86_64-linux-gnu/libnl-route-3.so.200 (0x00007f7e7bf7d000)

If a user starts valkey-server with the RDMA module, valkey-server loads the additional shared libraries on demand.

If building RDMA into valkey is necessary, please let me know.

  3. You mention both RDMA and TCP enabled at the same time? Is there any benefit to it? Does it mean some specific clients or replica nodes connect to the RDMA port? Could you please describe a little bit more? Thanks

Let the core team members discuss this important feature and send you feedback ASAP. Thanks

Currently, valkey-server supports 3 connection types:

  • unix domain socket
  • TCP/IP
  • TLS over TCP/IP (can't use the same port as TCP/IP)

Run valkey-server with: ./valkey-server --unixsocket /run/valkey.sock --port 6379 --tls-port 6380 ...
and valkey-server listens on /run/valkey.sock and TCP ports 6379 & 6380 together. These 3 transports can be used together in theory. However, in a production environment, we usually prefer TCP/IP in a trusted environment for better performance, or TLS in an untrusted one for security.

Once the RDMA module is loaded: ./valkey-server --unixsocket /run/valkey.sock --port 6379 --tls-port 6380 ... --loadmodule valkey-rdma.so port=6379 ...
valkey-server listens on /run/valkey.sock, TCP ports 6379 & 6380, and RDMA port 6379 together. These 4 transports can be used together in theory.

RDMA has better performance in a good network environment, as @hz-cheng's and my test reports show. But when I tested mlx5 with a packet drop rate of 0.001, TCP performance was affected only slightly while RDMA performance dropped a lot. I imagine a topology like:

DC: data center

valkey-client   valkey-client
         |            |
        TCP          RDMA
         |            |
         +----- valkey-server  -----TCP-----  valkey-server(replica)

  DC A            DC B                              DC C

It's possible to use RDMA over a short distance, or TCP over a long distance.


@zuiderkwast zuiderkwast left a comment


I'd like to get this merged. It's a good contribution. I like that it's a module. When it's merged, we can let clients implement it and test it.

The RDMA port is neither a TCP port nor a UDP port? We've been talking about the possibility of adding QUIC in the future (optional dependency, maybe as a module too), and that can be on the same port too, right, since it's UDP?

@pizhenwei
Contributor Author

I'd like to get this merged. It's a good contribution. I like that it's a module. When it's merged, we can let clients implement it and test it.

Actually, the client side (for C only) is ready (as you can see, several people and I have already produced test reports). Once the server side is merged, I'll create a PR for the client as soon as possible.

The RDMA port is neither a TCP port nor a UDP port?

Right.

We've been talking about the possibility of adding QUIC in the future (optional dependency, maybe as a module too) and that can be on the same port too, right, since it's UDP?

Right.

@coderyanghang

RDMA has better performance in a good network environment, as @hz-cheng's and my test reports show. But when I tested mlx5 with a packet drop rate of 0.001, TCP performance was affected only slightly while RDMA performance dropped a lot. I imagine a topology like:

@pizhenwei Actually, the latest RDMA technology, such as Alibaba Cloud Elastic RDMA, doesn't encounter this performance drop at a packet drop rate of 0.001, because the latest RDMA technology widely supports SACK lossy optimization.

@pizhenwei
Contributor Author

RDMA has better performance in a good network environment, as @hz-cheng's and my test reports show. But when I tested mlx5 with a packet drop rate of 0.001, TCP performance was affected only slightly while RDMA performance dropped a lot. I imagine a topology like:

@pizhenwei Actually, the latest RDMA technology, such as Alibaba Cloud Elastic RDMA, doesn't encounter this performance drop at a packet drop rate of 0.001, because the latest RDMA technology widely supports SACK lossy optimization.

Hi,
As far as I know, SACK is not part of the IB specification, and existing hardware doesn't support SACK. I don't think valkey-server should be limited to deployment on only the latest hardware.

Let's focus on 'why does Valkey need to enable both TCP/IP and RDMA together', or 'is enabling both TCP/IP and RDMA useful in real scenarios', rather than extending the topic to 'the latest RDMA technology' here.

@pizhenwei
Contributor Author

Hi @zuiderkwast ,

I created a new PR for the documentation part, and force-pushed a new version here.


@zuiderkwast zuiderkwast left a comment


Looks good now. Just some minor comments.

pizhenwei added a commit to pizhenwei/valkey-doc that referenced this pull request May 31, 2024
RDMA is the abbreviation of remote direct memory access. It is a
technology that enables computers in a network to exchange data in
main memory without involving the processor, cache, or operating
system of either computer. This means RDMA has better performance
than TCP; the test results show Valkey Over RDMA achieves ~2.5x the
QPS with lower latency.

In recent years, RDMA has become popular in data centers; in
particular, the RoCE (RDMA over Converged Ethernet) architecture has
been widely deployed. Cloud vendors have also started to offer RDMA
instances to accelerate networking performance, so end users can
enjoy the improvement easily.

Introduce the Valkey Over RDMA protocol as a new transport for Valkey.
For now, we define 4 commands:
- GetServerFeature & SetClientFeature: these two commands are used to
  negotiate features for further extension. There are no feature
  definitions in this version. Flow control and multi-buffer may be
  supported in the future; this will need feature negotiation.
- Keepalive
- RegisterXferMemory: the heart of transferring the real payload.

The 'TX buffer' and 'RX buffer' are built on RDMA remote memory with
RDMA write/write-with-imm. This is similar to (but not the same as)
several mechanisms introduced in papers:
- Socksdirect: datacenter sockets can be fast and compatible
  <https://dl.acm.org/doi/10.1145/3341302.3342071>
- LITE Kernel RDMA Support for Datacenter Applications
  <https://dl.acm.org/doi/abs/10.1145/3132747.3132762>
- FaRM: Fast Remote Memory
  <https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-dragojevic.pdf>

Link: valkey-io/valkey#477
Co-authored-by: Xinhao Kong <[email protected]>
Co-authored-by: Huaping Zhou <[email protected]>
Co-authored-by: zhuo jiang <[email protected]>
Co-authored-by: Yiming Zhang <[email protected]>
Co-authored-by: Jianxi Ye <[email protected]>
Signed-off-by: zhenwei pi <[email protected]>
@enjoy-binbin
Copy link
Member

Great, it finally got merged. Although I did not participate in the review (I'm not familiar with it and also don't have enough time to dive in), thank you for your time and the work.

@zuiderkwast
Copy link
Contributor

@enjoy-binbin Yeah, I merged it. :) I'm not very familiar with RDMA either, but all of the code is in separate files and it is not compiled by default, so I'm confident it doesn't break anything. In the future, if we want it to be officially supported (not experimental), we should probably...

  • think about whether we want it as a module or in the core;
  • harden it by rejecting any commands that involve fork, such as PSYNC, over an RDMA connection.

@pizhenwei
Copy link
Contributor Author

Thanks to all the folks (@zuiderkwast @enjoy-binbin @PingXie @madolson @hwware @daniel-house @coderyanghang @zvi-code @hz-cheng @baronwangr) in this long and interesting journey!

Please feel free to contact me [email protected] on any issue and feedback.

@madolson
Copy link
Member

Thanks @pizhenwei ! It's a cool feature, and hopefully we can get other folks to use it in production soon.

@pizhenwei
Copy link
Contributor Author

Thanks @pizhenwei ! It's a cool feature, and hopefully we can get other folks to use it in production soon.

Sure, I'm working on libvalkey to support RDMA ASAP.

@JSpewock
Copy link

Nice work @pizhenwei , glad to see it finally got merged!

@asafpamzn
Copy link

I'm happy to see this commit merged. @pizhenwei, I opened an issue to support this as part of valkey-glide: valkey-io/valkey-glide#1963. Would you be interested in collaborating? The main advantage of valkey-glide is that we can implement it once and it becomes available in all currently supported languages (Java, Python, and Node.js), with more to come in the future.

Feel free to share your ideas for the client implementation at valkey-glide valkey-io/valkey-glide#1963

@pizhenwei
Copy link
Contributor Author

pizhenwei commented Jul 18, 2024

I'm happy to see this commit merged. @pizhenwei, I opened an issue to support this as part of valkey-glide: valkey-io/valkey-glide#1963. Would you be interested in collaborating? The main advantage of valkey-glide is that we can implement it once and it becomes available in all currently supported languages (Java, Python, and Node.js), with more to come in the future.

Feel free to share your ideas for the client implementation at valkey-glide valkey-io/valkey-glide#1963

Hi,
This would help a lot for this new transport; glad to see it. Frankly, I'm not familiar with RDMA support in the Rust ecosystem (so far I have only read the GLIDE README), but I think I can dive into Rust RDMA and GLIDE in about two weeks (I'm working on libvalkey support now).

pizhenwei added a commit to pizhenwei/libvalkey that referenced this pull request Jul 23, 2024
Valkey Over RDMA[1] has been supported as an experimental feature since
Valkey 8.0. Support the RDMA transport on the client side.

RDMA is not a built-in feature; it is supported as a module only, so
test.sh has to be run with the additional arguments @VALKEY_RDMA_MODULE
and @VALKEY_RDMA_ADDR.

An example of running test.sh:
VALKEY_RDMA_MODULE=/path/to/valkey-rdma.so VALKEY_RDMA_ADDR=192.168.122.1 TEST_RDMA=1 ./test.sh

 ...
 Testing against RDMA connection (192.168.122.1:56379):
 #138 Is able to deliver commands: PASSED
 #139 Is a able to send commands verbatim: PASSED
 #140 %s String interpolation works: PASSED
 #141 %b String interpolation works: PASSED
 #142 Binary reply length is correct: PASSED
 #143 Can parse nil replies: PASSED
 #144 Can parse integer replies: PASSED
 #145 Can parse multi bulk replies: PASSED
 #146 Can handle nested multi bulk replies: PASSED
 #147 Send command by passing argc/argv: PASSED
 #148 Can pass NULL to valkeyGetReply: PASSED
 #149 RESP3 PUSH messages are handled out of band by default: PASSED
 #150 We can set a custom RESP3 PUSH handler: PASSED
 #151 We properly handle a NIL invalidation payload: PASSED
 #152 With no handler, PUSH replies come in-band: PASSED
 #153 With no PUSH handler, no replies are lost: PASSED
 #154 We set a default RESP3 handler for valkeyContext: PASSED
 #155 We don't set a default RESP3 push handler for valkeyAsyncContext: PASSED
 #156 Our VALKEY_OPT_NO_PUSH_AUTOFREE flag works: PASSED
 #157 We can use valkeyOptions to set a custom PUSH handler for valkeyContext: PASSED
 #158 We can use valkeyOptions to set a custom PUSH handler for valkeyAsyncContext: PASSED
 #159 We can use valkeyOptions to set privdata: PASSED
 #160 Our privdata destructor fires when we free the context: PASSED
 #161 Successfully completes a command when the timeout is not exceeded: PASSED
 #162 Does not return a reply when the command times out: SKIPPED
 #163 Reconnect properly reconnects after a timeout: PASSED
 #164 Reconnect properly uses owned parameters: PASSED
 #165 Returns I/O error when the connection is lost: PASSED
 #166 Returns I/O error on socket timeout: PASSED
 #167 Set error when an invalid timeout usec value is used during connect: PASSED
 #168 Set error when an invalid timeout sec value is used during connect: PASSED
 #169 Append format command: PASSED
 #170 Throughput:
	(1000x PING: 0.010s)
	(1000x LRANGE with 500 elements: 0.060s)
	(1000x INCRBY: 0.012s)
	(10000x PING (pipelined): 0.066s)
	(10000x LRANGE with 500 elements (pipelined): 0.523s)
	(10000x INCRBY (pipelined): 0.024s)
 ...

Link[1]: valkey-io/valkey#477
Signed-off-by: zhenwei pi <[email protected]>
pizhenwei added a commit to pizhenwei/libvalkey that referenced this pull request Jul 23, 2024
pizhenwei added a commit to pizhenwei/libvalkey that referenced this pull request Jul 24, 2024
pizhenwei added a commit to pizhenwei/libvalkey that referenced this pull request Jul 26, 2024
Thanks to Michael Grunder for lots of review suggestions!
pizhenwei added a commit to pizhenwei/libvalkey that referenced this pull request Jul 26, 2024
pizhenwei added a commit to pizhenwei/libvalkey that referenced this pull request Jul 30, 2024
michael-grunder pushed a commit to valkey-io/libvalkey that referenced this pull request Aug 1, 2024
michael-grunder pushed a commit to michael-grunder/libvalkey that referenced this pull request Aug 1, 2024
@pizhenwei pizhenwei deleted the feature-rdma branch August 15, 2024 03:46
pizhenwei added a commit to pizhenwei/valkey-doc that referenced this pull request Aug 27, 2024
Thanks to Daniel House for review suggestions!
@become-nice
Copy link

Many cloud providers offer RDMA acceleration on their cloud platforms, so I think there is a solid foundation for applying Valkey over RDMA. We ran performance tests of this PR on 8th-generation ECS instances (g8ae.4xlarge, 16 vCPUs, 64 GB DDR) provided by Alibaba Cloud. The test results indicate that, compared to TCP sockets, using RDMA can significantly enhance performance.

Test command of server side:

./src/valkey-server --port 6379 \
  --loadmodule src/valkey-rdma.so port=6380 bind=11.0.0.114 \
  --loglevel verbose --protected-mode no \
  --server_cpulist 12 --bgsave_cpulist 16 --appendonly no 

Test command of client side:

# Test for RDMA
./src/redis-benchmark -h 11.0.0.114 -p 6380 -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100 --rdma

# Test for TCP socket
./src/redis-benchmark -h 11.0.0.114 -p 6379 -c 30 -n 10000000 -r 1000000000 \
          --threads 8 -d 512 -t ping,set,get,lrange_100

The performance test results are shown in the table below. Apart from LRANGE_100 (which improved, but not substantially), in the other scenarios (PING, SET, GET) throughput increased by at least 76%, and the average (AVG) and P99 latencies were reduced by at least 40%.

| Test | Metric | RDMA | TCP | RDMA/TCP |
| --- | --- | --- | --- | --- |
| PING_INLINE | Throughput (req/s) | 666577.81 | 366394.31 | 181.93% |
| PING_INLINE | Latency AVG (ms) | 0.044 | 0.08 | 55.00% |
| PING_INLINE | Latency P99 (ms) | 0.063 | 0.127 | 49.61% |
| PING_MBULK | Throughput (req/s) | 688657.81 | 395397.56 | 174.17% |
| PING_MBULK | Latency AVG (ms) | 0.042 | 0.073 | 57.53% |
| PING_MBULK | Latency P99 (ms) | 0.063 | 0.119 | 52.94% |
| SET | Throughput (req/s) | 434744.78 | 157726.22 | 275.63% |
| SET | Latency AVG (ms) | 0.068 | 0.188 | 36.17% |
| SET | Latency P99 (ms) | 0.111 | 0.183 | 60.66% |
| GET | Throughput (req/s) | 562587.94 | 319478.59 | 176.10% |
| GET | Latency AVG (ms) | 0.052 | 0.091 | 57.14% |
| GET | Latency P99 (ms) | 0.079 | 0.151 | 52.32% |
| LRANGE | Throughput (req/s) | 526260.38 | 211434.36 | 248.90% |
| LRANGE | Latency AVG (ms) | 0.056 | 0.14 | 40.00% |
| LRANGE | Latency P99 (ms) | 0.079 | 0.159 | 49.69% |
| LRANGE_100 | Throughput (req/s) | 57106.96 | 49498.34 | 115.37% |
| LRANGE_100 | Latency AVG (ms) | 0.427 | 0.499 | 85.57% |
| LRANGE_100 | Latency P99 (ms) | 4.207 | 13.367 | 31.47% |
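As a sanity check, the RDMA/TCP percentage column can be recomputed from the raw throughput and latency numbers (a quick Python sketch using a few representative rows from the results posted above):

```python
# Recompute the RDMA/TCP ratio column from a few raw benchmark rows
# posted above: (row name, RDMA value, TCP value, reported percentage).
rows = [
    ("PING_INLINE throughput", 666577.81, 366394.31, 181.93),
    ("PING_INLINE avg latency", 0.044, 0.08, 55.00),
    ("SET throughput", 434744.78, 157726.22, 275.63),
    ("GET avg latency", 0.052, 0.091, 57.14),
    ("LRANGE_100 p99 latency", 4.207, 13.367, 31.47),
]

for name, rdma, tcp, reported in rows:
    ratio = round(rdma / tcp * 100, 2)
    print(f"{name}: {ratio}% (reported {reported}%)")
    # Each recomputed ratio matches the posted RDMA/TCP column.
    assert abs(ratio - reported) < 0.01
```

Note that for throughput a ratio above 100% is an improvement, while for latency a ratio below 100% is an improvement (RDMA latency divided by TCP latency).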

When I enabled the multi-threaded I/O model, I observed that the performance of SET and GET operations in TCP/IP mode was twice as fast as RDMA. Surprisingly, under the multi-threaded I/O configuration, RDMA performed even worse than with a single main thread.

@pizhenwei
Copy link
Contributor Author

> Many cloud providers offer RDMA acceleration on their cloud platforms, and I think that there is a foundational basis for the application of Valkey over RDMA. We performed some performance tests on this PR on the 8th generation ECS instances (g8ae.4xlarge, 16 vCPUs, 64G DDR) provided by Alibaba Cloud. Test results indicate that, compared to TCP
>
> When I enabled the multi-threaded I/O model, I observed that the performance of SET and GET operations in TCP/IP mode was twice as fast as RDMA. Surprisingly, under the multi-threaded I/O configuration, RDMA performed even worse than a single main thread.

Frankly, I have never tested Valkey over RDMA with multi-threaded I/O. In general, multiple threads need to synchronize state and send signals to wake up other workers, which leads to a performance drop under high throughput. I'll dive into it later.

@become-nice

> Many cloud providers offer RDMA acceleration on their cloud platforms, and I think that there is a foundational basis for the application of Valkey over RDMA. We performed some performance tests on this PR on the 8th generation ECS instances (g8ae.4xlarge, 16 vCPUs, 64G DDR) provided by Alibaba Cloud. Test results indicate that, compared to TCP
>
> When I enabled the multi-threaded I/O model, I observed that the performance of SET and GET operations in TCP/IP mode was twice as fast as RDMA. Surprisingly, under the multi-threaded I/O configuration, RDMA performed even worse than a single main thread.
>
> Frankly, I have never tested Valkey over RDMA with multi-threaded I/O. In general, multiple threads need to synchronize state and send signals to wake up other workers, which leads to a performance drop under high throughput. I'll dive into it later.

Thank you for your reply. May I ask: When you tested the single-threaded performance, was PFC enabled?

@pizhenwei
Contributor Author

> Many cloud providers offer RDMA acceleration on their cloud platforms, and I think that there is a foundational basis for the application of Valkey over RDMA. We performed some performance tests on this PR on the 8th generation ECS instances (g8ae.4xlarge, 16 vCPUs, 64G DDR) provided by Alibaba Cloud. Test results indicate that, compared to TCP
>
> When I enabled the multi-threaded I/O model, I observed that the performance of SET and GET operations in TCP/IP mode was twice as fast as RDMA. Surprisingly, under the multi-threaded I/O configuration, RDMA performed even worse than a single main thread.
>
> Frankly, I have never tested Valkey over RDMA with multi-threaded I/O. In general, multiple threads need to synchronize state and send signals to wake up other workers, which leads to a performance drop under high throughput. I'll dive into it later.
>
> Thank you for your reply. May I ask: When you tested the single-threaded performance, was PFC enabled?

Yes.

@maobaolong

@pizhenwei Thanks for this great feature. Is there a doc on how to use RDMA with Docker?

@pizhenwei
Contributor Author

> @pizhenwei Thanks for this great feature. Is there a doc on how to use RDMA with Docker?

For example:

    docker run --net=host --device=/dev/infiniband/uverbs0 \
        --device=/dev/infiniband/rdma_cm -t -i centos /bin/bash
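To then start valkey-server with RDMA inside such a container, something like the following should work (a sketch only: it assumes the image already contains a Valkey tree built with `make BUILD_RDMA=module`, and the image name `my-valkey-rdma` and address `xx.xx.xx.xx` are placeholders):

```shell
# Hypothetical sketch: run valkey-server with the RDMA module inside a container.
# Assumes the image "my-valkey-rdma" contains src/valkey-server and
# src/valkey-rdma.so built with `make BUILD_RDMA=module`; xx.xx.xx.xx is the
# host's RDMA-capable address. --net=host and the --device flags expose the
# host network stack and RDMA character devices to the container.
docker run --net=host \
    --device=/dev/infiniband/uverbs0 \
    --device=/dev/infiniband/rdma_cm \
    -t -i my-valkey-rdma \
    ./src/valkey-server --loadmodule src/valkey-rdma.so \
        port=6379 bind=xx.xx.xx.xx
```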

@maobaolong

> @pizhenwei Thanks for this great feature. Is there a doc on how to use RDMA with Docker?
>
> For example, `docker run --net=host --device=/dev/infiniband/uverbs0 --device=/dev/infiniband/rdma_cm -t -i centos /bin/bash`

Thanks for your reply, and sorry for the confusion. What I actually meant was how to start a Valkey server in RDMA mode and run a usage example inside a Docker container. It would be nice if there were an existing Docker image; if not, I'll have to build one myself.

@become-nice

> Many cloud providers offer RDMA acceleration on their cloud platforms, and I think that there is a foundational basis for the application of Valkey over RDMA. We performed some performance tests on this PR on the 8th generation ECS instances (g8ae.4xlarge, 16 vCPUs, 64G DDR) provided by Alibaba Cloud. Test results indicate that, compared to TCP
>
> When I enabled the multi-threaded I/O model, I observed that the performance of SET and GET operations in TCP/IP mode was twice as fast as RDMA. Surprisingly, under the multi-threaded I/O configuration, RDMA performed even worse than a single main thread.
>
> Frankly, I have never tested Valkey over RDMA with multi-threaded I/O. In general, multiple threads need to synchronize state and send signals to wake up other workers, which leads to a performance drop under high throughput. I'll dive into it later.
>
> Thank you for your reply. May I ask: When you tested the single-threaded performance, was PFC enabled?
>
> Yes.

I'm currently facing an issue during testing. I've observed that regardless of whether I'm running in single-thread or multi-thread mode, the aeMain function is always called in the main thread. This part of the code execution is not being delegated to IO threads, which appears to be causing performance degradation with RDMA in multi-threaded mode.

In the documentation, I read that Valkey supposedly delegates event processing to IO threads, but I haven't been able to find the relevant function calls in the actual code.

Could someone help me understand this discrepancy? Is event processing actually being delegated to IO threads in Valkey/Redis, and if so, where is this implemented in the code?
(screenshot: 2025-04-25 15:23:17)

@pizhenwei
Contributor Author

pizhenwei commented Apr 27, 2025

> Many cloud providers offer RDMA acceleration on their cloud platforms, and I think that there is a foundational basis for the application of Valkey over RDMA. We performed some performance tests on this PR on the 8th generation ECS instances (g8ae.4xlarge, 16 vCPUs, 64G DDR) provided by Alibaba Cloud. Test results indicate that, compared to TCP
>
> When I enabled the multi-threaded I/O model, I observed that the performance of SET and GET operations in TCP/IP mode was twice as fast as RDMA. Surprisingly, under the multi-threaded I/O configuration, RDMA performed even worse than a single main thread.
>
> Frankly, I have never tested Valkey over RDMA with multi-threaded I/O. In general, multiple threads need to synchronize state and send signals to wake up other workers, which leads to a performance drop under high throughput. I'll dive into it later.
>
> Thank you for your reply. May I ask: When you tested the single-threaded performance, was PFC enabled?
>
> Yes.
>
> I'm currently facing an issue during testing. I've observed that regardless of whether I'm running in single-thread or multi-thread mode, the aeMain function is always called in the main thread. This part of the code execution is not being delegated to IO threads, which appears to be causing performance degradation with RDMA in multi-threaded mode.
>
> In the documentation, I read that Valkey supposedly delegates event processing to IO threads, but I haven't been able to find the relevant function calls in the actual code.
>
> Could someone help me understand this discrepancy? Is event processing actually being delegated to IO threads in Valkey/Redis, and if so, where is this implemented in the code? (screenshot)

Please open an issue to track this problem and mention @pizhenwei so I get notified; I am on vacation and will dive into it later.

Valkey over RDMA provides much higher performance. Do you hit a performance limit in real production, or are you only running benchmarks? From my point of view, the best practice is (see the RDMA section in valkey.conf):

  • use a single thread (this reduces the overhead of inter-thread communication; IPC brings higher latency in a virtual machine, see LAPIC ICR virtualization on x86 processors)
  • pin the server thread
  • pin hardware IRQs
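The tuning advice above can be sketched with the knobs already used in the benchmark commands (an illustrative sketch only: the CPU numbers and the IRQ number are placeholders to adapt to your machine, not recommendations):

```shell
# Single-threaded server pinned to one CPU, with the bgsave child on another
# (server_cpulist/bgsave_cpulist as in the earlier benchmark command):
./src/valkey-server --loadmodule src/valkey-rdma.so port=6379 bind=xx.xx.xx.xx \
    --io-threads 1 --server_cpulist 12 --bgsave_cpulist 16 --appendonly no

# Pin the RDMA completion-vector IRQ to a CPU near the server thread.
# IRQ 45 is a placeholder; find the real one in /proc/interrupts.
echo 13 > /proc/irq/45/smp_affinity_list
```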

Labels

major-decision-approved Major decision approved by TSC team

Projects

Status: Done
