Summary
PoolClient::ExecuteRemotePutTransfers / ExecuteRemoteGetTransfers (src/umbp/distributed/pool_client.cpp) currently issue one IOEngine::BatchWrite/BatchRead element per page — each element carries a single (local_offset, remote_offset, size). For a value spanning N pages (× many keys per peer), this produces N single-transfer elements.
I'd like to propose grouping transfers that share the same (local_desc.id, remote_desc.id) into one multi-offset BatchWrite/BatchRead element, and check whether the maintainers see any reason not to.
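To make the proposal concrete, here is a minimal sketch of the grouping step. `PageTransfer`, `GroupedTransfer`, and `GroupByDescriptorPair` are hypothetical stand-ins invented for illustration, not the real UMBP or IOEngine types, and the descriptor ids are assumed to be plain integers. The actual prototype may be structured differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for one per-page transfer (not the real UMBP type).
struct PageTransfer {
  std::uint64_t local_desc_id;
  std::uint64_t remote_desc_id;
  std::size_t local_offset;
  std::size_t remote_offset;
  std::size_t size;
};

// Hypothetical stand-in for one multi-offset batch element.
struct GroupedTransfer {
  std::uint64_t local_desc_id;
  std::uint64_t remote_desc_id;
  std::vector<std::size_t> local_offsets;
  std::vector<std::size_t> remote_offsets;
  std::vector<std::size_t> sizes;
};

// Collects all pages that share (local_desc_id, remote_desc_id) into one
// element. Groups appear in first-seen order and pages keep their input
// order inside each group.
std::vector<GroupedTransfer> GroupByDescriptorPair(
    const std::vector<PageTransfer>& pages) {
  std::map<std::pair<std::uint64_t, std::uint64_t>, std::size_t> index;
  std::vector<GroupedTransfer> groups;
  for (const PageTransfer& p : pages) {
    assert(p.size > 0);  // a zero-length page would be a caller bug
    const auto key = std::make_pair(p.local_desc_id, p.remote_desc_id);
    auto it = index.find(key);
    if (it == index.end()) {
      it = index.emplace(key, groups.size()).first;
      groups.push_back(
          GroupedTransfer{p.local_desc_id, p.remote_desc_id, {}, {}, {}});
    }
    GroupedTransfer& g = groups[it->second];
    g.local_offsets.push_back(p.local_offset);
    g.remote_offsets.push_back(p.remote_offset);
    g.sizes.push_back(p.size);
  }
  return groups;
}
```

Each resulting group would then map to a single multi-offset BatchWrite/BatchRead element, so the element count scales with the number of distinct descriptor pairs rather than with the number of pages.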
Why it matters
- No contiguous-WR coalescing today. RdmaBatchReadWrite merges adjacent remote offsets into fewer/larger WRs, but only within a single BatchReadWrite call. Feeding pages one-at-a-time defeats that, so contiguous pages become N separate WRs on the wire.
- Per-page control overhead. The per-element engine path re-runs SelectBackend + GetOrCreateSessionCached + allocates a CqCallbackMeta for every page, even though all pages of a Put target the same peer/session.
- Completion load. N completions per Put for the single CQ-poll thread instead of a few.
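The coalescing point can be illustrated with a small sketch. This is not the actual RdmaBatchReadWrite logic: `Segment` and `CoalesceAdjacent` are hypothetical names, and the sketch assumes a merge requires both the local and the remote ranges to be contiguous and ignores any per-WR size limit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical (local_offset, remote_offset, size) triple within one
// descriptor pair; not a real mori/UMBP type.
struct Segment {
  std::size_t local_offset;
  std::size_t remote_offset;
  std::size_t size;
};

// Merges each segment into its predecessor when it starts exactly where the
// predecessor ends on both the local and the remote side. Only neighbours
// inside the same input vector (i.e. the same call) can ever be merged.
std::vector<Segment> CoalesceAdjacent(const std::vector<Segment>& segs) {
  std::vector<Segment> out;
  for (const Segment& s : segs) {
    assert(s.size > 0);  // zero-length segments are treated as a caller bug
    if (!out.empty()) {
      Segment& last = out.back();
      const bool local_adjacent = last.local_offset + last.size == s.local_offset;
      const bool remote_adjacent = last.remote_offset + last.size == s.remote_offset;
      if (local_adjacent && remote_adjacent) {
        last.size += s.size;  // extend the previous range instead of adding one
        continue;
      }
    }
    out.push_back(s);
  }
  return out;
}
```

Passing four contiguous 4 KiB pages in one call yields a single 16 KiB range, whereas four separate one-page calls each return one range, which mirrors the per-page behaviour described above.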
For comparison, SGLang's mori connector already issues one sess.batch_write(...) per group via the session API, so it doesn't hit this — UMBP's PoolClient looks like the outlier here.
Preliminary data (loopback micro-bench, not yet cross-machine)
Prototyped the grouping locally and measured bench_pool_client_batch_put --scenario all_zc (in-process master+peer loopback), wall_ms / 30 iters:
| per-item-pages | batch | before | after | reduction |
|---|---|---|---|---|
| 64 | 256 | 10.95 ms | 1.63 ms | ~85% |
| 128 | 256 | 24.55 ms | 3.21 ms | ~87% |
Full ctest -R umbp stays green (18/18, incl. umbp_cross_node_smoke round-trip).
Test environment: AMD MI308X (gfx942), ROCm 7.2.0, mori built BUILD_UMBP=ON BUILD_TESTS=ON at commit edc18f10; the numbers above are from the in-process loopback bench (single host, master + peer in one process), so they isolate the host-side software/WR-issue path rather than real NIC bandwidth.
Caveat: loopback overstates the win (no wire time). On real cross-machine RDMA the wire dominates and the control-overhead savings shrink — though the WR-coalescing benefit should still help, especially for many small pages. I haven't measured this on real hardware yet.
Questions for maintainers
- Is there a reason PoolClient intentionally uses the per-element engine API rather than the session/batch API here (ordering, failure-attribution granularity, something subtle in the allocator/page layout)?
- Would a per-(local,remote) grouping be welcome, given the loopback signal? If so, what would you consider sufficient validation (a cross-machine PD-bench number, a specific scenario)?
- Any preference on failure granularity becoming per-group rather than per-page?
Happy to open a PR with the change + a cross-machine measurement if this direction sounds good. Local prototype diff is ~+123/-54 in pool_client.cpp, no public header/proto/ABI change.