[UMBP] PoolClient issues one RDMA WR per page — batch per (local,remote) to enable WR coalescing

### Summary
`PoolClient::ExecuteRemotePutTransfers` / `ExecuteRemoteGetTransfers` (`src/umbp/distributed/pool_client.cpp`) currently issue **one `IOEngine::BatchWrite`/`BatchRead` element per page**: each element carries a single `(local_offset, remote_offset, size)`. For a value spanning N pages, this produces N single-transfer elements, and that cost repeats for every key sent to the same peer.

I'd like to propose grouping transfers that share the same `(local_desc.id, remote_desc.id)` into **one multi-offset `BatchWrite`/`BatchRead` element**, and check whether the maintainers see any reason not to.
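To make the proposal concrete, here is a minimal sketch of the grouping step. It is not the prototype diff: `PageTransfer`, `GroupedTransfer`, and `GroupByDescPair` are hypothetical names invented for illustration, and the real UMBP/mori descriptor and element types will differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for one per-page transfer (not the real UMBP type).
struct PageTransfer {
  uint32_t local_id;     // assumed to mirror local_desc.id
  uint32_t remote_id;    // assumed to mirror remote_desc.id
  size_t local_offset;
  size_t remote_offset;
  size_t size;
};

// Hypothetical multi-offset element: one per (local_id, remote_id) pair.
struct GroupedTransfer {
  uint32_t local_id;
  uint32_t remote_id;
  std::vector<size_t> local_offsets;
  std::vector<size_t> remote_offsets;
  std::vector<size_t> sizes;
  std::vector<size_t> page_index;  // position of each page in the input list
};

// Bucket per-page transfers by descriptor pair, keeping the order in which
// each pair first appears and the order of pages inside each group.
std::vector<GroupedTransfer> GroupByDescPair(
    const std::vector<PageTransfer>& pages) {
  std::map<std::pair<uint32_t, uint32_t>, size_t> slot_of;
  std::vector<GroupedTransfer> groups;
  for (size_t i = 0; i < pages.size(); ++i) {
    const PageTransfer& p = pages[i];
    const auto key = std::make_pair(p.local_id, p.remote_id);
    auto it = slot_of.find(key);
    if (it == slot_of.end()) {
      it = slot_of.emplace(key, groups.size()).first;
      groups.push_back(GroupedTransfer{p.local_id, p.remote_id, {}, {}, {}, {}});
    }
    GroupedTransfer& g = groups[it->second];
    g.local_offsets.push_back(p.local_offset);
    g.remote_offsets.push_back(p.remote_offset);
    g.sizes.push_back(p.size);
    g.page_index.push_back(i);
  }
  return groups;
}
```

Each resulting group would then become a single multi-offset `BatchWrite`/`BatchRead` element. The `page_index` field is kept only so that a per-group result can be traced back to the pages it covers.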

### Why it matters
- **No contiguous-WR coalescing today.** `RdmaBatchReadWrite` merges adjacent remote offsets into fewer/larger WRs, but only *within a single `BatchReadWrite` call*. Feeding pages one-at-a-time defeats that, so contiguous pages become N separate WRs on the wire.
- **Per-page control overhead.** The per-element engine path re-runs `SelectBackend` + `GetOrCreateSessionCached` + allocates a `CqCallbackMeta` for every page, even though all pages of a Put target the same peer/session.
- **Completion load.** Each Put generates N completions for the single CQ-poll thread to process, instead of a few.
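The coalescing point can be illustrated with a small sketch. This is not `RdmaBatchReadWrite` itself: `Segment` and `CoalesceContiguous` are hypothetical, and the sketch assumes a merge is only valid when consecutive entries are contiguous on both the local and the remote side. It shows why the merge can only happen when all pages arrive in the same call.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical (local_offset, remote_offset, size) triple within one
// (local, remote) descriptor pair.
struct Segment {
  size_t local_offset;
  size_t remote_offset;
  size_t size;
};

// Merge consecutive entries that are contiguous on both sides into one
// larger segment. Input order is preserved; no sorting is attempted.
std::vector<Segment> CoalesceContiguous(const std::vector<Segment>& segs) {
  std::vector<Segment> out;
  for (const Segment& s : segs) {
    if (!out.empty()) {
      Segment& last = out.back();
      const bool local_adjacent = last.local_offset + last.size == s.local_offset;
      const bool remote_adjacent = last.remote_offset + last.size == s.remote_offset;
      if (local_adjacent && remote_adjacent) {
        last.size += s.size;  // extend the previous segment
        continue;
      }
    }
    out.push_back(s);
  }
  return out;
}
```

Called once with four contiguous 4 KiB pages, this yields one 16 KiB segment. Called four times with one page each, as the per-page path effectively does, every call sees a single entry and nothing can be merged.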

For comparison, SGLang's mori connector already issues one `sess.batch_write(...)` per group via the session API, so it doesn't hit this — UMBP's PoolClient looks like the outlier here.

### Preliminary data (loopback micro-bench, not yet cross-machine)
I prototyped the grouping locally and measured `bench_pool_client_batch_put --scenario all_zc` (in-process master+peer loopback), wall_ms / 30 iters:

| per-item-pages | batch | before | after | reduction |
|---:|---:|---:|---:|---:|
| 64  | 256 | 10.95 ms | 1.63 ms | ~85% |
| 128 | 256 | 24.55 ms | 3.21 ms | ~87% |

Full `ctest -R umbp` stays green (18/18, incl. `umbp_cross_node_smoke` round-trip).

**Test environment:** AMD MI308X (gfx942), ROCm 7.2.0, mori built `BUILD_UMBP=ON BUILD_TESTS=ON` at commit `edc18f10`; the numbers above are from the **in-process loopback** bench (single host, master + peer in one process), so they isolate the host-side software/WR-issue path rather than real NIC bandwidth.

**Caveat:** loopback overstates the win (no wire time). On real cross-machine RDMA the wire dominates and the control-overhead savings shrink — though the WR-coalescing benefit should still help, especially for many small pages. I haven't measured this on real hardware yet.

### Questions for maintainers
1. Is there a reason `PoolClient` intentionally uses the per-element engine API rather than the session/batch API here (ordering, failure-attribution granularity, something subtle in the allocator/page layout)?
2. Would a per-`(local,remote)` grouping be welcome, given the loopback signal — and what would you consider sufficient validation (a cross-machine PD-bench number, a specific scenario)?
3. Any preference on failure granularity becoming per-group rather than per-page?
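On question 3, one option is to keep a per-page result surface for callers while the engine reports per group. A rough sketch, assuming each group remembers which input pages it covers (`GroupResult` and `FanOutToPages` are hypothetical names, not existing UMBP code):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical outcome of one grouped transfer.
struct GroupResult {
  bool ok;                         // whole-group success or failure
  std::vector<size_t> page_index;  // input pages covered by this group
};

// Expand per-group outcomes into one flag per original page. Pages not
// covered by any group stay marked as failed.
std::vector<bool> FanOutToPages(const std::vector<GroupResult>& results,
                                size_t page_count) {
  std::vector<bool> per_page(page_count, false);
  for (const GroupResult& r : results) {
    for (size_t idx : r.page_index) {
      if (idx < page_count) per_page[idx] = r.ok;
    }
  }
  return per_page;
}
```

The trade-off is that a single bad page marks every page in its group as failed, so attribution becomes coarser even though the caller-facing shape stays the same.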

Happy to open a PR with the change + a cross-machine measurement if this direction sounds good. Local prototype diff is ~`+123/-54` in `pool_client.cpp`, no public header/proto/ABI change.

