Skip to content

Commit 1f8fd82

Browse files
authored
perf(umbp): optimize PoolClient BatchPut/BatchGet (#403)
- Per-pair RDMA transfer merge (get + put): collapse pages sharing a (localMR, remoteMR) pair into one BatchRead/BatchWrite with a single TransferStatus/uid; per-pair failures map back to per-key. Also fixes an unlocked dram_memories read race and exposes RDMA knobs via MORI_IO_* env. - BatchGet: split into PartitionBatchGetTargets + ExecuteBatchGetPlan on one Submit/Wait path; the zero-copy path submits all peers and runs local DRAM/SSD + remote SSD inside the in-flight window, then waits all. - BatchPut: mirror the same submit/wait split (SubmitRemoteBatchPut / WaitRemoteBatchPut); multi-peer writes posted before waiting, deferred local puts run in the in-flight window, staging stays per-peer serial, slot commit/abort finalized at wait. - Cleanup: regroup the Put/Get hot paths into clear sections; best-effort logging on BatchAbortSlots RPC failure.
1 parent 1db01d8 commit 1f8fd82

6 files changed

Lines changed: 1529 additions & 476 deletions

File tree

docs/MORI-IO-GUIDE.md

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -166,9 +166,11 @@ See `examples/io/example.py` for more complete examples including batch transfer
166166
| Setting | Description |
167167
|---------|-------------|
168168
| `set_log_level(level)` | Set MORI-IO log verbosity level |
169-
| `RdmaBackendConfig.qp_per_transfer` | Queue pairs per transfer (default 1, increase for higher bandwidth; with multi-NIC these QPs are spread across NICs, so aim for ≥2 QP per NIC) |
170-
| `RdmaBackendConfig.poll_cq_mode` | CQ polling mode: `POLLING` (busy-wait, lower latency) or `EVENT` (interrupt-driven) |
171-
| `RdmaBackendConfig.enable_notification` | Enable target-side completion notifications (default `True`) |
169+
| `RdmaBackendConfig.qp_per_transfer` | Queue pairs per transfer (default 1, increase for higher bandwidth; with multi-NIC these QPs are spread across NICs, so aim for ≥2 QP per NIC). Env: `MORI_IO_QP_PER_TRANSFER` |
170+
| `RdmaBackendConfig.poll_cq_mode` | CQ polling mode: `POLLING` (busy-wait, lower latency) or `EVENT` (interrupt-driven). Env: `MORI_IO_POLL_CQ_MODE` (`0`/`polling`, `1`/`event`) |
171+
| `RdmaBackendConfig.post_batch_size` | WRs per `ibv_post_send` chain (default `-1` = auto). Env: `MORI_IO_POST_BATCH_SIZE` |
172+
| `RdmaBackendConfig.num_worker_threads` | Worker threads for batch posting (default `1`; ignored when chunking / multi-NIC forces inline posting). Env: `MORI_IO_NUM_WORKER_THREADS` |
173+
| `RdmaBackendConfig.enable_notification` | Enable target-side completion notifications (default `True`). Env: `MORI_IO_ENABLE_NOTIFICATION` |
172174
| `RdmaBackendConfig.enable_transfer_chunking` | Split a large single transfer into `chunk_bytes` chunks pipelined across QPs (default `False`). Lifts single-transfer bandwidth from single-outstanding (~28 GB/s) to NIC line rate. Forces single-thread inline posting (ignores `num_worker_threads`). Env: `MORI_IO_ENABLE_CHUNKING` |
173175
| `RdmaBackendConfig.chunk_bytes` | Chunk size when chunking is on (default `65536` = 64 KB). Messages ≤ this are unchanged. Env: `MORI_IO_CHUNK_BYTES` |
174176
| `RdmaBackendConfig.max_chunks_per_transfer` | Cap on chunks per transfer to bound WR/SQ usage (default `64`). Env: `MORI_IO_MAX_CHUNKS` |

src/io/rdma/backend_impl.cpp

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -26,8 +26,10 @@
2626

2727
#include <algorithm>
2828
#include <cstdlib>
29+
#include <cstring>
2930
#include <limits>
3031
#include <memory>
32+
#include <optional>
3133
#include <shared_mutex>
3234
#include <stdexcept>
3335
#include <string>
@@ -73,6 +75,27 @@ bool UsesInlineOnly(const RdmaBackendConfig& config) {
7375
return config.enableTransferChunking || config.numNicsPerTransfer > 1;
7476
}
7577

78+
// Parse MORI_IO_POLL_CQ_MODE: "0"/"polling" or "1"/"event" (case-insensitive).
79+
// Returns nullopt for anything else so env::Override warns and keeps the
80+
// config value (no silent fallback).
81+
std::optional<PollCqMode> ParsePollCqMode(const char* raw) {
82+
if (std::strcmp(raw, "0") == 0 || mori::env::detail::EqualsIgnoreCase(raw, "polling")) {
83+
return PollCqMode::POLLING;
84+
}
85+
if (std::strcmp(raw, "1") == 0 || mori::env::detail::EqualsIgnoreCase(raw, "event")) {
86+
return PollCqMode::EVENT;
87+
}
88+
return std::nullopt;
89+
}
90+
91+
// Parse MORI_IO_POST_BATCH_SIZE: only -1 (auto) or a positive int are valid.
92+
// Rejects 0 / negatives (which the post path would silently clamp to 1) so
93+
// env::Override warns and keeps the config value.
94+
std::optional<int> ParsePostBatchSize(const char* raw) {
95+
if (std::strcmp(raw, "-1") == 0) return -1;
96+
return mori::env::detail::ParsePositiveInt(raw);
97+
}
98+
7699
int ResolveRequestedNics(const RdmaBackendConfig& config, const TopoKey& local,
77100
const TopoKey& remote) {
78101
if (local.loc == MemoryLocationType::GPU || remote.loc == MemoryLocationType::GPU) {
@@ -1235,6 +1258,14 @@ RdmaBackend::RdmaBackend(EngineKey k, const IOEngineConfig& engConfig,
12351258
mori::env::detail::ParsePositiveInt);
12361259
env::Override("MORI_IO_NUM_NICS_PER_TRANSFER", config.numNicsPerTransfer,
12371260
mori::env::detail::ParsePositiveInt);
1261+
// Perf-sweep / ops overrides for knobs that otherwise only have a config
1262+
// field. postBatchSize accepts any int (-1 = auto), the rest are positive.
1263+
env::Override("MORI_IO_QP_PER_TRANSFER", config.qpPerTransfer,
1264+
mori::env::detail::ParsePositiveInt);
1265+
env::Override("MORI_IO_POST_BATCH_SIZE", config.postBatchSize, ParsePostBatchSize);
1266+
env::Override("MORI_IO_NUM_WORKER_THREADS", config.numWorkerThreads,
1267+
mori::env::detail::ParsePositiveInt);
1268+
env::Override("MORI_IO_POLL_CQ_MODE", config.pollCqMode, ParsePollCqMode);
12381269
ValidateRdmaNotificationConfig(config);
12391270
ValidateRdmaTransferConfig(config);
12401271

0 commit comments

Comments
 (0)