
[UMBP] PoolClient issues one RDMA WR per page — batch per (local,remote) to enable WR coalescing #400

Description

@staryxchen

Summary

PoolClient::ExecuteRemotePutTransfers / ExecuteRemoteGetTransfers (src/umbp/distributed/pool_client.cpp) currently issue one IOEngine::BatchWrite/BatchRead element per page — each element carries a single (local_offset, remote_offset, size). For a value spanning N pages (× many keys per peer), this produces N single-transfer elements.

I'd like to propose grouping transfers that share the same (local_desc.id, remote_desc.id) into one multi-offset BatchWrite/BatchRead element, and check whether the maintainers see any reason not to.
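A minimal sketch of the proposed grouping, to make the idea concrete. `PageTransfer`, `GroupedTransfer`, and `GroupByDescriptorPair` are hypothetical names invented for illustration; they are not the real UMBP or IOEngine types, and the actual prototype may be structured differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for one per-page transfer (not the real UMBP type).
struct PageTransfer {
  std::uint64_t local_id;    // stands in for local_desc.id
  std::uint64_t remote_id;   // stands in for remote_desc.id
  std::size_t local_offset;
  std::size_t remote_offset;
  std::size_t size;
};

// Hypothetical multi-offset batch element: one per (local_id, remote_id).
struct GroupedTransfer {
  std::uint64_t local_id;
  std::uint64_t remote_id;
  std::vector<std::size_t> local_offsets;
  std::vector<std::size_t> remote_offsets;
  std::vector<std::size_t> sizes;
};

// Collects per-page transfers into one element per descriptor pair.
// Groups appear in first-seen order; page order inside a group is preserved.
std::vector<GroupedTransfer> GroupByDescriptorPair(
    const std::vector<PageTransfer>& pages) {
  std::map<std::pair<std::uint64_t, std::uint64_t>, std::size_t> index;
  std::vector<GroupedTransfer> groups;
  for (const PageTransfer& p : pages) {
    const auto key = std::make_pair(p.local_id, p.remote_id);
    auto it = index.find(key);
    if (it == index.end()) {
      it = index.emplace(key, groups.size()).first;
      groups.push_back(GroupedTransfer{p.local_id, p.remote_id, {}, {}, {}});
    }
    GroupedTransfer& g = groups[it->second];
    g.local_offsets.push_back(p.local_offset);
    g.remote_offsets.push_back(p.remote_offset);
    g.sizes.push_back(p.size);
    // The three parallel arrays must stay in lockstep.
    assert(g.local_offsets.size() == g.sizes.size());
  }
  return groups;
}
```

With this shape, a value spanning N pages on one peer becomes a single element carrying N offsets, rather than N single-offset elements.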

Why it matters

  • No contiguous-WR coalescing today. RdmaBatchReadWrite merges adjacent remote offsets into fewer/larger WRs, but only within a single BatchReadWrite call. Feeding pages one-at-a-time defeats that, so contiguous pages become N separate WRs on the wire.
  • Per-page control overhead. The per-element engine path re-runs SelectBackend + GetOrCreateSessionCached + allocates a CqCallbackMeta for every page, even though all pages of a Put target the same peer/session.
  • Completion load. Each Put generates N completions for the single CQ-poll thread to process, instead of a few.
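To illustrate why batching matters for coalescing, here is a rough sketch of merging adjacent ranges inside one batch. It is not the actual RdmaBatchReadWrite logic: `Range` and `CoalesceContiguous` are hypothetical names, and the sketch assumes a merge is only valid when both the local and the remote offsets are adjacent.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical (local_offset, remote_offset, size) triple within one group.
struct Range {
  std::size_t local_offset;
  std::size_t remote_offset;
  std::size_t size;
};

// Merges ranges that are back-to-back on both sides into larger ranges.
// Assumption: each output range would map to one work request.
std::vector<Range> CoalesceContiguous(std::vector<Range> ranges) {
  std::sort(ranges.begin(), ranges.end(),
            [](const Range& a, const Range& b) {
              return a.remote_offset < b.remote_offset;
            });
  std::vector<Range> merged;
  for (const Range& r : ranges) {
    assert(r.size > 0);  // zero-length transfers are not expected here
    if (!merged.empty()) {
      Range& last = merged.back();
      const bool remote_adjacent =
          last.remote_offset + last.size == r.remote_offset;
      const bool local_adjacent =
          last.local_offset + last.size == r.local_offset;
      if (remote_adjacent && local_adjacent) {
        last.size += r.size;  // extend the previous range instead of adding one
        continue;
      }
    }
    merged.push_back(r);
  }
  return merged;
}
```

Called once per page, a routine like this always sees a single range and can never merge anything. Called once per group, N contiguous pages can collapse into a single larger range.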

For comparison, SGLang's mori connector already issues one sess.batch_write(...) per group via the session API, so it doesn't hit this — UMBP's PoolClient looks like the outlier here.

Preliminary data (loopback micro-bench, not yet cross-machine)

Prototyped the grouping locally and measured bench_pool_client_batch_put --scenario all_zc (in-process master+peer loopback), wall_ms / 30 iters:

| per-item-pages | batch | before | after | reduction |
|---|---|---|---|---|
| 64 | 256 | 10.95 ms | 1.63 ms | ~85% |
| 128 | 256 | 24.55 ms | 3.21 ms | ~87% |

Full ctest -R umbp stays green (18/18, incl. umbp_cross_node_smoke round-trip).

Test environment: AMD MI308X (gfx942), ROCm 7.2.0, mori built with BUILD_UMBP=ON and BUILD_TESTS=ON at commit edc18f10. The numbers above are from the in-process loopback bench (single host, master + peer in one process), so they isolate the host-side software/WR-issue path rather than real NIC bandwidth.

Caveat: loopback overstates the win (no wire time). On real cross-machine RDMA the wire dominates and the control-overhead savings shrink — though the WR-coalescing benefit should still help, especially for many small pages. I haven't measured this on real hardware yet.

Questions for maintainers

  1. Is there a reason PoolClient intentionally uses the per-element engine API rather than the session/batch API here (ordering, failure-attribution granularity, something subtle in the allocator/page layout)?
  2. Would a per-(local,remote) grouping be welcome, given the loopback signal — and what would you consider sufficient validation (a cross-machine PD-bench number, a specific scenario)?
  3. Any preference on failure granularity becoming per-group rather than per-page?
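To make question 3 concrete, here is a small sketch of what per-group failure attribution could look like from the caller's side. `GroupResult` and `FanOutGroupStatus` are hypothetical names, and the assumption that the engine reports one status per group is mine, not something confirmed by the current code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result of one grouped transfer: which pages it carried
// and a single ok/failed status for the whole group.
struct GroupResult {
  std::vector<std::size_t> page_indices;
  bool ok;
};

// Fans one status per group back out to one status per page.
// Pages not covered by any group are reported as failed.
std::vector<bool> FanOutGroupStatus(const std::vector<GroupResult>& groups,
                                    std::size_t page_count) {
  std::vector<bool> page_ok(page_count, false);
  for (const GroupResult& g : groups) {
    for (std::size_t idx : g.page_indices) {
      assert(idx < page_count);
      page_ok[idx] = g.ok;  // every page inherits its group's status
    }
  }
  return page_ok;
}
```

The trade-off shows up directly: if one page in a group fails, every page in that group is reported as failed, even those that may have completed.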

Happy to open a PR with the change + a cross-machine measurement if this direction sounds good. Local prototype diff is ~+123/-54 in pool_client.cpp, no public header/proto/ABI change.
