
[NEW] Reply Offload #1353

@alexander-shabanov

Description

Reply Offload

The idea for this proposal was brought up by @touitou-dan and @uriyage.

Problem

In Valkey, when the main thread builds a reply to a command, it copies data from a robj into the client's reply buffer (i.e. the client cob). Later, when reply buffers are written to the client's connection, this data is copied again from the reply buffer by write/writev. So, robj data is copied twice.

Proposition

We suggest optimizing reply handling to eliminate the data copy performed on the main thread, as follows. If IO threads are active, this also completely eliminates the expensive memory access to the robj value (robj->ptr) on the main thread.

The main thread will write a pointer to the robj into a reply buffer instead of writing the robj data. The thread writing to the client's connection, either an IO thread or the main thread if IO threads are inactive, will write the corresponding part of the reply to the client's connection directly from the robj object. Since regular data and pointers will be mixed within the reply buffers, a serialization approach will be necessary to organize the data in the reply buffers.

The writing thread will need to build offloaded replies from robj pointers on the fly and use writev to write to the client's connection, because the reply data will be scattered: part in the reply buffers (i.e. regular, non-offloaded replies) and part in robj objects (i.e. offloaded replies). For example, if a GET greeting command is issued and the greeting key is associated with the value hello, then Valkey is expected to reply $5\r\nhello\r\n. Simplified code in the writing thread will look like this:

robj *value_obj;
/* Recover the offloaded pointer from the reply buffer. */
memcpy(&value_obj, c->buf + some_pos, sizeof(value_obj));
char *str = value_obj->ptr;
size_t str_len = stringObjectLen(value_obj);

/* Scatter the RESP reply: length prefix, value bytes read directly
 * from the robj, and the trailing CRLF. */
struct iovec iov[3];
char *prefix = "$5\r\n"; /* hardcoded for the 5-byte "hello" value */
char *suffix = "\r\n";
iov[0].iov_base = prefix;
iov[0].iov_len = 4;
iov[1].iov_base = str;
iov[1].iov_len = str_len;
iov[2].iov_base = suffix;
iov[2].iov_len = 2;

connWritev(c->conn, iov, 3);

A proper, generalized implementation will write the content of all replies, regular and offloaded, to the client's connection using a single writev call.

The performance improvement has been measured using a proof-of-concept implementation and the setup described in this article. The TPS for GET commands with 512-byte values increased from 1.07 million to 1.3 million requests per second; with 4096-byte values it increased from 750,000 to 900,000. With IO threads disabled, the TPS for GET commands with 512-byte values showed no noticeable change: around 190,000 both with and without the optimization.

The Reply Offload technique is based on ideas outlined in Reaching 1 million requests per second on a single Valkey instance and provides an additional improvement on top of the major ones implemented in #758, #763, and #861.

Scope

This document proposes to apply Reply Offload to string objects: specifically, to commands that use addReplyBulk to build replies from robj objects of type OBJ_STRING and encoding OBJ_ENCODING_RAW. Reply Offload is straightforward for this case and will benefit frequently used commands like GET and MGET. In the future, Reply Offload can be extended to more complex object types.

Implementation

The existing _addReplyToBuffer and _addReplyProtoToList functions will be extended to prepend raw data written into the reply buffers with a payload header: a CLIENT_REPLY_PAYLOAD_DATA type and the corresponding size.

Additionally, new _addReplyOffloadToBuffer and _addReplyOffloadToList functions will be introduced to pack robj pointers into the reply buffers using a payload header with the CLIENT_REPLY_PAYLOAD_ROBJ_PTR type.

The main thread will detect replies eligible for offloading (i.e. robj values with OBJ_ENCODING_RAW encoding), increment the robj reference counter, and offload them using _addReplyOffloadToBuffer / _addReplyOffloadToList. The reference counter will be decremented back on the main thread when the write completes, in the postWriteToClient callback.
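
As a rough sketch of this main-thread side (the else branch mirrors the existing copy path in addReplyBulk; reply_offload_enabled and the fallback-on-overflow detail are assumptions of this sketch, not the final API):

void addReplyBulk(client *c, robj *obj) {
    /* Offload only raw-encoded string objects, per the Scope section.
     * reply_offload_enabled() is a hypothetical predicate combining the
     * config setting and the client's state. */
    if (reply_offload_enabled(c) && obj->type == OBJ_STRING &&
        obj->encoding == OBJ_ENCODING_RAW) {
        incrRefCount(obj); /* decremented in postWriteToClient after the write */
        if (_addReplyOffloadToBuffer(c, obj) != C_OK)
            _addReplyOffloadToList(c, obj); /* c->buf full: spill to c->reply */
    } else {
        /* Existing path: copy the data into the reply buffers. */
        addReplyBulkLen(c, obj);
        addReply(c, obj);
        addReplyProto(c, "\r\n", 2);
    }
}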

A new header will be inserted only if the _addReply functions need to write a payload type different from the last one; otherwise, the last header will be updated and the raw data or pointer appended after it.

In the diagram below, reply buffer [16k] is c->buf in the code and reply list is c->reply.

[Diagram: ReplyOffloadSerialization — payload headers interleaving raw data and robj pointers in the reply buffers]

typedef enum {
    CLIENT_REPLY_PAYLOAD_DATA = 1,
    CLIENT_REPLY_PAYLOAD_ROBJ_PTR = 2,
} clientReplyPayloadType;

/* Reply payload header */
typedef struct payloadHeader {
    uint8_t type;
    uint32_t size;
} payloadHeader;
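
To make the serialization concrete, here is a minimal sketch of _addReplyOffloadToBuffer under this scheme. The function name comes from this proposal, but the body is illustrative only; in particular, c->last_header (offset of the most recently written header, or -1) is a hypothetical bookkeeping field, not part of the current client struct:

/* Append a robj pointer to c->buf behind a ROBJ_PTR payload header,
 * coalescing with the previous header when the payload type is unchanged. */
static int _addReplyOffloadToBuffer(client *c, robj *obj) {
    payloadHeader hdr;
    int coalesce = 0;
    if (c->last_header >= 0) {
        memcpy(&hdr, c->buf + c->last_header, sizeof(hdr));
        coalesce = (hdr.type == CLIENT_REPLY_PAYLOAD_ROBJ_PTR);
    }
    size_t needed = sizeof(robj *) + (coalesce ? 0 : sizeof(hdr));
    if (c->bufpos + needed > c->buf_usable_size) return C_ERR; /* caller spills to c->reply */

    if (coalesce) {
        hdr.size += sizeof(robj *); /* extend the last header */
        memcpy(c->buf + c->last_header, &hdr, sizeof(hdr));
    } else {
        hdr.type = CLIENT_REPLY_PAYLOAD_ROBJ_PTR; /* start a new header */
        hdr.size = sizeof(robj *);
        c->last_header = c->bufpos;
        memcpy(c->buf + c->bufpos, &hdr, sizeof(hdr));
        c->bufpos += sizeof(hdr);
    }
    memcpy(c->buf + c->bufpos, &obj, sizeof(obj)); /* store the pointer itself */
    c->bufpos += sizeof(obj);
    return C_OK;
}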

In the writing thread, either an IO thread or the main thread if IO threads are inactive, if a client is in reply offload mode, the _writeToClient function will always choose the writevToClient flow. writevToClient will process the data in the reply buffers according to their headers. Specifically, it will pack offloaded reply data (robj->ptr) directly into iov (an array of iovec) as explained in the Proposition section.
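
A simplified sketch of that flow, covering only the fixed reply buffer and ignoring partial writes and iovec overflow handling (the function name and the prefix scratch storage are illustrative):

static ssize_t writevOffloadedBuffer(client *c) {
    struct iovec iov[IOV_MAX];
    char prefixes[IOV_MAX][32]; /* scratch for "$<len>\r\n" prefixes built on the fly */
    int iovcnt = 0;

    char *pos = c->buf;
    char *end = c->buf + c->bufpos;
    while (pos < end && iovcnt + 3 <= IOV_MAX) {
        payloadHeader hdr;
        memcpy(&hdr, pos, sizeof(hdr));
        pos += sizeof(hdr);
        if (hdr.type == CLIENT_REPLY_PAYLOAD_DATA) {
            /* Regular reply bytes: reference them in place, no copy. */
            iov[iovcnt].iov_base = pos;
            iov[iovcnt].iov_len = hdr.size;
            iovcnt++;
        } else { /* CLIENT_REPLY_PAYLOAD_ROBJ_PTR: hdr.size bytes of robj pointers */
            for (size_t off = 0; off < hdr.size && iovcnt + 3 <= IOV_MAX;
                 off += sizeof(robj *)) {
                robj *o;
                memcpy(&o, pos + off, sizeof(o));
                size_t len = stringObjectLen(o);
                iov[iovcnt].iov_len = snprintf(prefixes[iovcnt], 32, "$%zu\r\n", len);
                iov[iovcnt].iov_base = prefixes[iovcnt];
                iov[iovcnt + 1].iov_base = o->ptr; /* value bytes straight from the robj */
                iov[iovcnt + 1].iov_len = len;
                iov[iovcnt + 2].iov_base = "\r\n";
                iov[iovcnt + 2].iov_len = 2;
                iovcnt += 3;
            }
        }
        pos += hdr.size;
    }
    return connWritev(c->conn, iov, iovcnt);
}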

Configuration

An io-threads-reply-offload config setting will be introduced to enable or disable the reply offload optimization. It should be applied gracefully (i.e. switched on or off for a specific client only when the client has no in-flight replies).
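
For illustration, enabling the optimization might then look like this in valkey.conf (the setting name comes from this proposal; the io-threads count and the default shown are assumptions):

io-threads 8
io-threads-reply-offload yes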

Implementation Challenges

The challenges for possible Reply Offload implementations are:

  • mixing raw data and pointers inside the reply buffers
  • maintaining a strict order of replies
  • minimizing the increase in memory consumption by client output buffers
  • eliminating or minimizing performance degradation for use cases (commands) not suitable for reply offload
  • minimizing the complexity of the code changes

Alternative Implementation

Above, we suggested an implementation that strives to address all of these challenges optimally. Below is a short description of a less optimal alternative.

A simpler alternative implementation would introduce a flag field on the clientReplyBlock struct, with possible values CLIENT_REPLY_PAYLOAD_RAW_DATA and CLIENT_REPLY_PAYLOAD_RAW_OBJ_PTR, and put either raw data or robj pointer(s) into the buf of a clientReplyBlock, never mixing data and pointers in the same buf. Each time a payload different from the last one has to be added to the reply buffers, a new clientReplyBlock would be allocated and added to the reply list. The default buf on the client struct can be used the same way, either for raw data or for robj pointer(s).
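
For comparison, a minimal sketch of that alternative: clientReplyBlock as it exists today, plus the proposed flag field (value names as described above):

typedef struct clientReplyBlock {
    size_t size, used;
    int flag;   /* CLIENT_REPLY_PAYLOAD_RAW_DATA or CLIENT_REPLY_PAYLOAD_RAW_OBJ_PTR */
    char buf[]; /* holds either raw reply bytes or packed robj pointers, never both */
} clientReplyBlock;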

The alternative implementation has a more profound negative impact on memory consumption by client output buffers and on performance in mixed workloads. For example, a sequence cmd1, cmd2, cmd3, cmd4, where cmd1 and cmd3 are suitable for offload and cmd2 and cmd4 are not, would require creating at least 3 clientReplyBlock objects.
