Skip to content

refactor(core,application): include hygiene + host/device separation#405

Merged
jhchouuu merged 18 commits into
mainfrom
jiahzhou/core-include-refactor
Jun 17, 2026
Merged

refactor(core,application): include hygiene + host/device separation#405
jhchouuu merged 18 commits into
mainfrom
jiahzhou/core-include-refactor

Conversation

@jhchouuu

@jhchouuu jhchouuu commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Summary

Standalone include-hygiene refactor of include/mori/core/ and include/mori/application/ — IWYU cleanup with strict host/device header separation and fixed layering inversions. Logic-preserving, with one bnxt codegen optimization (§5). This benefits main directly: device translation units parse less, the layering is enforceable, and the RDMA device path no longer drags system RDMA / host-driver headers.

What changes

1. Host/device separation + IWYU

  • core/ no longer includes anything from application/ (layering inversion removed); moved the shared bits (udma_barrier.h, sdma anvil_device/pkt structs) down into core/.
  • Device-safe headers stop dragging host-only headers — notably removed <hsakmt/hsakmt.h> (host KFD driver API) that was riding into every device TU via the sdma path.
  • Added the missing transitive includes that were being leeched (<cassert> / <cstring> / <cstdint> / <endian.h> / …), so headers are self-contained (e.g. the shmem JIT compiles internal.hpp as a standalone device TU).

2. Device RDMA path freed of system RDMA / host headers

  • Device-safe WcStatus mirror of ibv_wc_status → device CQE decoders no longer pull <infiniband/verbs.h>. Value parity guarded by static_assert in a host TU.
  • Vendored mlx5 WQE/CQE device ABI → drops <infiniband/mlx5dv.h> from device code (mori-prefixed structs, sizeof/offsetof parity guard).
  • bnxt/ionic firmware ABI made self-contained via <linux/types.h> + <stdint.h>.
  • Net: device TUs that include the RDMA path now pull 0 of verbs.h / mlx5dv.h / hsakmt / application/.

3. Dead-code removal

  • Removed dead Torch bootstrap (torch_bootstrap.{hpp,cpp} + binding); mori_application no longer links libtorch.
  • Removed dead host RDMA primitives (host_primitives.hpp, mlx5_host_primitives.hpp) and dead host helpers in ionic_fw.h (so the firmware ABI header no longer carries <infiniband/verbs.h>).
  • Removed a duplicate snake_case warp/lane helper API in core/utils that mirrored the in-use CamelCase one (−115 lines).

4. Structure / naming

  • core/utils.hppcore/utils/utils.hpp (matches the X/X.hpp aggregator convention).
  • Host-only *_dv.h shims moved from core/ to application/, so core/ is strictly device-safe.
  • bnxt provides its own WQE-slot copy instead of pulling the p2p device header.
  • mlx5 vendored names are mori-prefixed (Mlx5* / MORI_MLX5_*) while bnxt/ionic keep upstream names — intentional: the system mlx5dv.h defines the same ::mlx5_* names and a single RDMA TU can see both, so the vendored copies must not collide; bnxt/ionic have no installed userspace ABI header, so their upstream names are unambiguous.

5. perf (BNXT)

  • WQE control segments are written as single aligned 128-bit stores (global_store_dwordx4, branch-free), replacing a generic variable-length copy loop. static_assert enforces one-segment-per-slot; the source is read via memcpy to avoid a misaligned load on the packed ABI structs.

Verification

  • Build green on BNXT (gfx942), all modules.
  • Runtime IBGDA (P2P-off) green: concurrent_put / put_imm / put_signal / atomic_nonfetch / atomic_fetch (with the CI RoCE env MORI_RDMA_SL=3 / MORI_RDMA_TC=104). ccl SDMA allgather/all2all 8/8.
  • pip install . (full build incl. UMBP+gRPC) succeeds.
  • mlx5 ABI vendoring built + verified on an MLX5 host; CI covers mlx5/AINIC runtime, ionic, internode, and JAX.

jhchouuu and others added 14 commits June 17, 2026 02:47
Clean up the core/ and application/ include graph; no logic changes.

- Layering: core/ no longer includes application/. Relocate the low-level SDMA
  device handle + DMA-barrier headers down into core (anvil_device.hpp,
  sdma_pkt_struct.h, udma_barrier.h).
- Dead code: remove the unused C++ Torch bootstrap path (also drops the libtorch
  link from libmori_application) and the dead host-side RDMA primitives
  (host_primitives.hpp / mlx5_host_primitives.hpp).
- Host/device separation: keep host-only deps out of device TUs — drop
  <hsakmt/hsakmt.h> (host KFD driver API) and the host ibverbs-DV bnxt_re_dv.h
  out of the device headers; split providers/utils.h so each provider's CQE
  decoder lives in its own provider header; drop the hsakmt/anvil drag from
  application_device_types.hpp; make primitives.hpp a pure ibverbs-handle header.
- IWYU: add direct includes for transitively-relied-on symbols, remove unused
  includes across core + application, add a device-only rdma_device.hpp aggregator.

Validated on BNXT: full build (all modules) clean; shmem P2P/IBGDA, ccl SDMA
allgather/all2all, ops intranode all pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…erbs.h> out of device TUs

The provider CQE error decoders (Mlx5/Bnxt/IonicHandleErrorCqe) returned
ibv_wc_status and providers/utils.h's status stringifier took it too, so every
provider device header dragged the host <infiniband/verbs.h> into device
translation units (via utils.h / a direct include).

Add a device-safe mori::core::WcStatus enum (a value-for-value mirror of
ibv_wc_status) in core_device_types.hpp and switch the decoders +
WcStatusString to it. Value parity with ibv_wc_status is guarded by
static_asserts in src/application/transport/rdma/rdma.cpp (a host TU that sees
both). shmem's only caller uses `auto` + stringify, so it's unaffected (renamed
IbvWcStatusString -> WcStatusString).

Result (verified via hipcc -M): the bnxt and ionic device headers no longer pull
infiniband/verbs.h at all; mlx5 still does, but only because mlx5dv.h (its WQE/CQE
ABI) requires it. On BNXT the device RDMA code is now verbs.h-free. Full build
clean; IBGDA get over BNXT passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Plan to make the mlx5 device code <infiniband/verbs.h>-free by vendoring the mlx5
WQE/CQE hardware-ABI structs + constants (like bnxt/ionic), with the exact
inventory, the mori-prefixed naming to avoid clashing with system mlx5dv, the
sizeof/offsetof parity guard, and the on-hardware verification steps. Temporary —
remove once the refactor lands.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nd/mlx5dv.h>

The mlx5 device header pulled in <infiniband/mlx5dv.h>, dragging the system
<infiniband/verbs.h> into every mlx5 device translation unit. Vendor the small,
stable subset of mlx5 WQE/CQE hardware-ABI structs + constants the device code
uses (mori::core::Mlx5*, MORI_MLX5_*) so the mlx5 device path becomes verbs-free,
matching how bnxt/ionic already get their ABI from repo-vendored firmware headers.

mori-prefixed names avoid the ambiguity a same-name copy hit: host/device/example
TUs mix the system ::mlx5_* with the vendored types. The one shmem device call
site (Mlx5CollapsedCqDrain) is moved onto the vendored names too.

A dedicated host TU (mlx5_abi_parity.cpp) includes both the system header and the
vendored one and static_asserts size/offset parity, turning any layout drift into
a compile error — these structs overlay NIC hardware memory.

Verified on MLX5 hardware (gfx942): parity asserts pass; the mlx5 device header is
verbs-free (hipcc -M = 0). IBGDA/RDMA data path green — async_ll IBGDA (68 passed),
shmem API (6 passed), internode dispatch/combine (180 GB/s), allgather (8/8).
Pre-existing concurrent_get thread-scope / concurrent_put loopback failures were
confirmed identical on a stash-baseline build, so they are unrelated to this change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…/verbs.h>

The GPU device translation units (ops/shmem/collective .hip kernels) transitively
pulled the system <infiniband/verbs.h> and <infiniband/mlx5dv.h> through two chains:

  A) core/transport/rdma/primitives.hpp included <infiniband/verbs.h> for the
     host-only IBVerbsHandle (ibv_qp*/ibv_cq*/... pointers).
  B) the shmem/ops device headers pulled application.hpp (the host RDMA stack ->
     mlx5.hpp -> mlx5_dv.h -> system mlx5dv.h).

Chain A: forward-declare the ibv_* structs in primitives.hpp instead of including
verbs.h (the handle only holds pointers); host TUs that dereference them include
verbs.h directly. The vendored firmware headers (bnxt_re_hsi.h, ionic_fw.h) used
__u*/__le*/IBV_* that used to arrive via that verbs.h, so make them self-contained
(<linux/types.h>) and move ionic's two host-only ibv<->ionic mapping helpers (dead
code referencing IBV_ACCESS_*/IBV_WC_*) behind a host guard.

Chain B: shmem_api.hpp / dispatch_combine.hpp include the full application.hpp only
on the host; device and mixed-hipcc compiles get the device-safe
application_device_types.hpp (plus a forward decl of host-only BootstrapNetwork).
The shmem device kernel headers drop their host-only shmem_api.hpp include. Move
ATOMIC_IBUF_SLOT_SIZE to application_device_types.hpp so device kernels can use it
without the host stack. A few host-logic TUs that had been getting <memory> /
CHECK_HIP_ERROR transitively via application.hpp now include them directly.

Result (hipcc -M, gfx942): every device .hip -- shmem_kernels, ep_*, cast,
ccl_kernels -- and core.hpp / rdma_device.hpp pull zero infiniband/{verbs,mlx5dv}.h.
Full build green (host+device+collective+ops+io+examples). Verified on MLX5
hardware: async_ll IBGDA (68 passed), shmem API (6 passed), internode
dispatch/combine (185 GB/s).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two cosmetic cleanups, no behavior change:
- providers/utils.h holds only provider-agnostic helpers (WcStatusString,
  get_num_wqes_in_atomic), so move it up out of providers/ and give it the C++
  extension: providers/utils.h -> transport/rdma/utils.hpp.
- primitives.hpp defines just the host IBVerbsHandle wrapper -> rename to
  ibverbs_handle.hpp so the name matches its content.

Includers updated. Full build (BNXT) clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
bnxt_device_primitives.hpp used p2p's ThreadCopy (a trivial vectorized memcpy) to
lay out WQE slots, which dragged the whole p2p/device_primitives.hpp — and its
<hip/hip_fp8.h>/<hip/hip_bfloat16.h>/data_types — into the bnxt device path for
one function. Inline a local BnxtThreadCopy (distinct name to avoid clashing with
p2p's ThreadCopy in TUs that include both) and drop the p2p include.

bnxt device header no longer pulls p2p/device_primitives/hip_fp8/hip_bfloat16/
data_types (hipcc -M = 0). Behavior identical (verbatim copy logic). Full build
(BNXT) clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the generic variable-length BnxtThreadCopy loop with BnxtWriteSlot/
BnxtZeroSlot helpers that emit one aligned uint4 store per 16-byte WQE segment
(static_asserted to one slot), guaranteeing branch-free dwordx4 codegen.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Matches the directory-aggregator convention (core/core.hpp,
application/application.hpp); resolves utils.hpp coexisting beside the
core/utils/ folder. Updates all 7 includers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
utils.hpp carried two parallel APIs: the in-use CamelCase set
(GetActiveLaneNum/IsFirstActiveLane/...) and a near-dead snake_case mirror
(get_active_lane_num/is_first_active_lane/get_flat_*/lowerID/...). Only
spin_lock_*_shared used the snake helpers; rewrite those to the CamelCase
equivalents and remove the duplicated set (-115 lines).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The handoff is done (vendored in ccaeba3); drop the temporary note.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d-code tidy

Three provider PollCq call sites still referenced the old IbvWcStatusString
name (hidden inside MORI_PRINTF, which is a no-op by default — so the gap only
surfaced under -DMORI_ENABLE_DEBUG_PRINTF). Also collapse the now-dead
BUILD_TORCH_BOOTSTRAP rpath branch and drop a stale TorchBootstrapNetwork
comment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… build

The shmem JIT/bitcode path compiles internal.hpp as a standalone device TU.
Previously core/utils/utils.hpp (blockDim/threadIdx/__ballot/...) and
internal.hpp (assert in device code) leeched <hip/hip_runtime.h>/<cassert>
transitively via application_device_types.hpp; the IWYU cleanup removed that
chain, breaking the JIT compile. Include them directly where used.

Reproduced + verified against the exact JIT clang invocation (shim.bc builds).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors MORI core/application headers for stricter include hygiene and host/device separation, while removing the (now-dead) PyTorch bootstrap path and de-verbifying device RDMA headers via vendored ABI mirrors + parity guards.

Changes:

  • Removed Torch bootstrap init surface (C++ + pybind + CMake) and simplified RPATH handling accordingly.
  • Enforced host/device header separation (forward decls + device-safe type mirrors) to keep verbs.h / mlx5dv.h / hsakmt.h out of device TUs.
  • Vendored mlx5 WQE/CQE device ABI + added compile-time parity guards; made bnxt/ionic firmware ABIs self-contained.

Reviewed changes

Copilot reviewed 67 out of 69 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/shmem/init.cpp Removes Torch process-group init entrypoint.
src/pybind/pybind_shmem.cpp Drops pybind exposure for Torch bootstrap init.
src/pybind/CMakeLists.txt Simplifies pybind RPATH handling (no torch path).
src/io/rdma/common.cpp Adds explicit verbs.h include for host dereferences.
src/io/rdma/backend_impl.cpp Adds explicit verbs.h include for host dereferences.
src/application/transport/rdma/rdma.cpp Adds WcStatusibv_wc_status parity static_asserts.
src/application/transport/rdma/providers/mlx5/mlx5_abi_parity.cpp New TU that asserts vendored mlx5 ABI layout parity vs system headers.
src/application/transport/rdma/providers/ibverbs/ibverbs.cpp Adds explicit verbs.h include for host dereferences.
src/application/CMakeLists.txt Removes Torch bootstrap option/linking; adds mlx5 ABI parity TU.
src/application/bootstrap/torch_bootstrap.cpp Deletes Torch bootstrap implementation.
include/mori/shmem/shmem_sdma_kernels.hpp Removes unnecessary include to reduce device include surface.
include/mori/shmem/shmem_p2p_kernels.hpp Removes unnecessary include to reduce device include surface.
include/mori/shmem/shmem_ibgda_kernels.hpp Switches device CQE decoding to vendored mlx5 + WcStatusString.
include/mori/shmem/shmem_api.hpp Host/device include split to avoid pulling host RDMA stack into device compiles.
include/mori/shmem/internal.hpp Updates utils include path; adds missing cassert for device asserts.
include/mori/ops/dispatch_combine/dispatch_combine.hpp Uses device-safe application types under device compilation.
include/mori/core/utils/utils.hpp Moves/cleans device helpers; makes HIP runtime inclusion explicit for device builds.
include/mori/core/utils/udma_barrier.h Moves vendored udma barriers into core utils.
include/mori/core/transport/sdma/sdma_pkt_struct.h Adds SDMA packet struct definitions into core transport.
include/mori/core/transport/sdma/device_primitives.hpp Points to core SDMA device handle + core utils.
include/mori/core/transport/sdma/anvil_device.hpp Drops host-only hsakmt.h; uses hsakmttypes.h only.
include/mori/core/transport/rdma/utils.hpp New provider-agnostic RDMA helpers (device-safe WcStatusString).
include/mori/core/transport/rdma/rdma.hpp Light-weight aggregator; avoids dragging host verbs into device.
include/mori/core/transport/rdma/rdma_device.hpp Device-only RDMA aggregator header.
include/mori/core/transport/rdma/providers/utils.h Removes old mixed host/device provider utilities header.
include/mori/core/transport/rdma/providers/mlx5/mlx5_host_primitives.hpp Removes dead mlx5 host primitives header.
include/mori/core/transport/rdma/providers/mlx5/mlx5_device_primitives.hpp Switches device mlx5 path to vendored ABI + WcStatus.
include/mori/core/transport/rdma/providers/mlx5/mlx5_defs.hpp Vendored mlx5 WQE/CQE subset + constants (mori-prefixed).
include/mori/core/transport/rdma/providers/ionic/ionic_fw.h Makes ionic FW header self-contained; gates host-only verbs helpers.
include/mori/core/transport/rdma/providers/ionic/ionic_device_primitives.hpp Adds device-safe CQE status mapping to WcStatus.
include/mori/core/transport/rdma/providers/bnxt/bnxt_re_hsi.h Makes bnxt FW header self-contained via <linux/types.h>.
include/mori/core/transport/rdma/providers/bnxt/bnxt_device_primitives.hpp Adds fast 128-bit slot writes + WcStatus mapping.
include/mori/core/transport/rdma/ibverbs_handle.hpp Forward-declares ibv types to keep verbs.h out of device path.
include/mori/core/transport/rdma/host_primitives.hpp Removes dead host primitives header.
include/mori/core/transport/rdma/device_primitives.hpp Removes stale include comment; keeps device primitives focused.
include/mori/core/transport/rdma/core_device_types.hpp Adds device-safe WcStatus mirror + endian include.
include/mori/core/transport/p2p/device_primitives.hpp Updates utils include path; include adjustments.
include/mori/core/profiler/kernel_profiler.hpp Adds missing standard includes to be self-contained.
include/mori/core/core.hpp Updates utils include path; pulls SDMA/P2P only for device compiles.
include/mori/collective/core/allreduce_manager.hpp Adds missing <memory> include.
include/mori/collective/core/all2all_manager.hpp Adds missing <memory> include.
include/mori/collective/allreduce/twoshot_allreduce_sdma_class.hpp Adds missing <atomic> include.
include/mori/application/utils/check.hpp Adds missing libc headers for errno/stdio/strerror usage.
include/mori/application/transport/sdma/anvil.hpp Updates include path for core anvil device header; adds missing std headers.
include/mori/application/transport/rdma/rdma.hpp Adds missing standard includes; clarifies ATOMIC_IBUF_SLOT_SIZE location.
include/mori/application/transport/p2p/p2p.hpp Adds missing fixed-width type includes.
include/mori/application/topology/system.hpp Fixes/adjusts includes (string).
include/mori/application/topology/pci.hpp Adds missing <utility> include.
include/mori/application/topology/node.hpp Adds missing <cstdint> include.
include/mori/application/topology/gpu.hpp Adds missing <cstdint> include.
include/mori/application/memory/va_manager.hpp Removes unused <vector> include.
include/mori/application/memory/symmetric_memory.hpp Adjusts includes (adds <map>, drops linux/types).
include/mori/application/context/context.hpp Adds missing <memory> include.
include/mori/application/bootstrap/local_bootstrap.hpp Removes outdated mention of Torch bootstrap.
include/mori/application/bootstrap/bootstrap.hpp Removes Torch bootstrap include wiring.
include/mori/application/application_device_types.hpp Adds device-safe RDMA types exposure + forward decls; moves constants; type tweaks.
examples/utils/args_parser.hpp Switches to core device types include (atomicType).
examples/shmem/put_thread_allgather.cpp Adds explicit anvil include for CHECK_HIP_ERROR.
examples/sdma/sdma_rate_kernel.h Updates include path for core anvil device header.
examples/sdma/sdma_latency_kernel.h Updates include path for core anvil device header.
examples/sdma/sdma_bw_kernel.h Updates include path for core anvil device header.
examples/local_rdma_ops/write_inline_gpu.cpp Updates udma barrier include path.
examples/local_rdma_ops/write_gpu.cpp Updates udma barrier include path.
examples/local_rdma_ops/send_recv.cpp Updates udma barrier include path.
examples/local_rdma_ops/send_recv_gpu.cpp Updates udma barrier include path.
examples/local_rdma_ops/atomic_gpu.cpp Updates udma barrier include path.
examples/dist_rdma_ops/dist_write.cpp Updates udma barrier include path.
examples/benchmarks/atomic_perf.cpp Updates udma barrier include path.
examples/application/ibverbs_test.cpp Updates udma barrier include path.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread include/mori/core/transport/p2p/device_primitives.hpp
Comment thread include/mori/core/transport/rdma/rdma.hpp Outdated
jhchouuu and others added 4 commits June 17, 2026 08:50
…28-bit load

The bnxt ABI segments are __attribute__((packed)) (alignof 1), so reading &seg
through a uint4* was a misaligned 128-bit load (UB). memcpy the bytes into a
16B-aligned temp (folds away under -O) then do one aligned 128-bit store. ISA
verified: one global_store_dwordx4 per slot, no loads.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Move the host-only *_dv.h shims (bnxt_re_dv/ionic_dv/mlx5_dv — they pull system
verbs/mlx5dv and are used only by application/) out of core/ into
application/transport/rdma/providers/. Drop the dead host helpers
to_ionic_mr_flags/ionic_to_ibv_status from ionic_fw.h (no callers) so the
firmware ABI header no longer carries <infiniband/verbs.h>; add <stdint.h> for
the uint*_t it uses directly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Compress the mlx5_defs.hpp header block; unname the unused params of
get_num_wqes_in_atomic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… comment

p2p/device_primitives.hpp uses fabsf/fmaxf but only included <cassert> after the
IWYU pass — relying on transitive HIP math decls is fragile, so include <cmath>
directly. Also update the stale rdma.hpp header comment (the host primitives
layer it mentioned was removed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jhchouuu jhchouuu merged commit 241b56a into main Jun 17, 2026
13 checks passed
@jhchouuu jhchouuu deleted the jiahzhou/core-include-refactor branch June 17, 2026 09:28
jhchouuu added a commit that referenced this pull request Jun 17, 2026
Merge latest `main` into `dev/cco` and adapt cco's GDA layer to the changes it
brings (notably #395 bnxt collapsed-CQ + #405 include-hygiene refactor).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants