refactor(core,application): include hygiene + host/device separation#405
Merged
Conversation
Clean up the core/ and application/ include graph; no logic changes. - Layering: core/ no longer includes application/. Relocate the low-level SDMA device handle + DMA-barrier headers down into core (anvil_device.hpp, sdma_pkt_struct.h, udma_barrier.h). - Dead code: remove the unused C++ Torch bootstrap path (also drops the libtorch link from libmori_application) and the dead host-side RDMA primitives (host_primitives.hpp / mlx5_host_primitives.hpp). - Host/device separation: keep host-only deps out of device TUs — drop <hsakmt/hsakmt.h> (host KFD driver API) and the host ibverbs-DV bnxt_re_dv.h out of the device headers; split providers/utils.h so each provider's CQE decoder lives in its own provider header; drop the hsakmt/anvil drag from application_device_types.hpp; make primitives.hpp a pure ibverbs-handle header. - IWYU: add direct includes for transitively-relied-on symbols, remove unused includes across core + application, add a device-only rdma_device.hpp aggregator. Validated on BNXT: full build (all modules) clean; shmem P2P/IBGDA, ccl SDMA allgather/all2all, ops intranode all pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…erbs.h> out of device TUs The provider CQE error decoders (Mlx5/Bnxt/IonicHandleErrorCqe) returned ibv_wc_status and providers/utils.h's status stringifier took it too, so every provider device header dragged the host <infiniband/verbs.h> into device translation units (via utils.h / a direct include). Add a device-safe mori::core::WcStatus enum (a value-for-value mirror of ibv_wc_status) in core_device_types.hpp and switch the decoders + WcStatusString to it. Value parity with ibv_wc_status is guarded by static_asserts in src/application/transport/rdma/rdma.cpp (a host TU that sees both). shmem's only caller uses `auto` + stringify, so it's unaffected (renamed IbvWcStatusString -> WcStatusString). Result (verified via hipcc -M): the bnxt and ionic device headers no longer pull infiniband/verbs.h at all; mlx5 still does, but only because mlx5dv.h (its WQE/CQE ABI) requires it. On BNXT the device RDMA code is now verbs.h-free. Full build clean; IBGDA get over BNXT passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Plan to make the mlx5 device code <infiniband/verbs.h>-free by vendoring the mlx5 WQE/CQE hardware-ABI structs + constants (like bnxt/ionic), with the exact inventory, the mori-prefixed naming to avoid clashing with system mlx5dv, the sizeof/offsetof parity guard, and the on-hardware verification steps. Temporary — remove once the refactor lands. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nd/mlx5dv.h> The mlx5 device header pulled in <infiniband/mlx5dv.h>, dragging the system <infiniband/verbs.h> into every mlx5 device translation unit. Vendor the small, stable subset of mlx5 WQE/CQE hardware-ABI structs + constants the device code uses (mori::core::Mlx5*, MORI_MLX5_*) so the mlx5 device path becomes verbs-free, matching how bnxt/ionic already get their ABI from repo-vendored firmware headers. mori-prefixed names avoid the ambiguity a same-name copy hit: host/device/example TUs mix the system ::mlx5_* with the vendored types. The one shmem device call site (Mlx5CollapsedCqDrain) is moved onto the vendored names too. A dedicated host TU (mlx5_abi_parity.cpp) includes both the system header and the vendored one and static_asserts size/offset parity, turning any layout drift into a compile error — these structs overlay NIC hardware memory. Verified on MLX5 hardware (gfx942): parity asserts pass; the mlx5 device header is verbs-free (hipcc -M = 0). IBGDA/RDMA data path green — async_ll IBGDA (68 passed), shmem API (6 passed), internode dispatch/combine (180 GB/s), allgather (8/8). Pre-existing concurrent_get thread-scope / concurrent_put loopback failures were confirmed identical on a stash-baseline build, so they are unrelated to this change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…/verbs.h>
The GPU device translation units (ops/shmem/collective .hip kernels) transitively
pulled the system <infiniband/verbs.h> and <infiniband/mlx5dv.h> through two chains:
A) core/transport/rdma/primitives.hpp included <infiniband/verbs.h> for the
host-only IBVerbsHandle (ibv_qp*/ibv_cq*/... pointers).
B) the shmem/ops device headers pulled application.hpp (the host RDMA stack ->
mlx5.hpp -> mlx5_dv.h -> system mlx5dv.h).
Chain A: forward-declare the ibv_* structs in primitives.hpp instead of including
verbs.h (the handle only holds pointers); host TUs that dereference them include
verbs.h directly. The vendored firmware headers (bnxt_re_hsi.h, ionic_fw.h) used
__u*/__le*/IBV_* that used to arrive via that verbs.h, so make them self-contained
(<linux/types.h>) and move ionic's two host-only ibv<->ionic mapping helpers (dead
code referencing IBV_ACCESS_*/IBV_WC_*) behind a host guard.
Chain B: shmem_api.hpp / dispatch_combine.hpp include the full application.hpp only
on the host; device and mixed-hipcc compiles get the device-safe
application_device_types.hpp (plus a forward decl of host-only BootstrapNetwork).
The shmem device kernel headers drop their host-only shmem_api.hpp include. Move
ATOMIC_IBUF_SLOT_SIZE to application_device_types.hpp so device kernels can use it
without the host stack. A few host-logic TUs that had been getting <memory> /
CHECK_HIP_ERROR transitively via application.hpp now include them directly.
Result (hipcc -M, gfx942): every device .hip -- shmem_kernels, ep_*, cast,
ccl_kernels -- and core.hpp / rdma_device.hpp pull zero infiniband/{verbs,mlx5dv}.h.
Full build green (host+device+collective+ops+io+examples). Verified on MLX5
hardware: async_ll IBGDA (68 passed), shmem API (6 passed), internode
dispatch/combine (185 GB/s).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two cosmetic cleanups, no behavior change: - providers/utils.h holds only provider-agnostic helpers (WcStatusString, get_num_wqes_in_atomic), so move it up out of providers/ and give it the C++ extension: providers/utils.h -> transport/rdma/utils.hpp. - primitives.hpp defines just the host IBVerbsHandle wrapper -> rename to ibverbs_handle.hpp so the name matches its content. Includers updated. Full build (BNXT) clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
bnxt_device_primitives.hpp used p2p's ThreadCopy (a trivial vectorized memcpy) to lay out WQE slots, which dragged the whole p2p/device_primitives.hpp — and its <hip/hip_fp8.h>/<hip/hip_bfloat16.h>/data_types — into the bnxt device path for one function. Inline a local BnxtThreadCopy (distinct name to avoid clashing with p2p's ThreadCopy in TUs that include both) and drop the p2p include. bnxt device header no longer pulls p2p/device_primitives/hip_fp8/hip_bfloat16/ data_types (hipcc -M = 0). Behavior identical (verbatim copy logic). Full build (BNXT) clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the generic variable-length BnxtThreadCopy loop with BnxtWriteSlot/ BnxtZeroSlot helpers that emit one aligned uint4 store per 16-byte WQE segment (static_asserted to one slot), guaranteeing branch-free dwordx4 codegen. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Matches the directory-aggregator convention (core/core.hpp, application/application.hpp); resolves utils.hpp coexisting beside the core/utils/ folder. Updates all 7 includers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
utils.hpp carried two parallel APIs: the in-use CamelCase set (GetActiveLaneNum/IsFirstActiveLane/...) and a near-dead snake_case mirror (get_active_lane_num/is_first_active_lane/get_flat_*/lowerID/...). Only spin_lock_*_shared used the snake helpers; rewrite those to the CamelCase equivalents and remove the duplicated set (-115 lines). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The handoff is done (vendored in ccaeba3); drop the temporary note. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d-code tidy Three provider PollCq call sites still referenced the old IbvWcStatusString name (hidden inside MORI_PRINTF, which is a no-op by default — so the gap only surfaced under -DMORI_ENABLE_DEBUG_PRINTF). Also collapse the now-dead BUILD_TORCH_BOOTSTRAP rpath branch and drop a stale TorchBootstrapNetwork comment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… build The shmem JIT/bitcode path compiles internal.hpp as a standalone device TU. Previously core/utils/utils.hpp (blockDim/threadIdx/__ballot/...) and internal.hpp (assert in device code) leeched <hip/hip_runtime.h>/<cassert> transitively via application_device_types.hpp; the IWYU cleanup removed that chain, breaking the JIT compile. Include them directly where used. Reproduced + verified against the exact JIT clang invocation (shim.bc builds). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Refactors MORI core/application headers for stricter include hygiene and host/device separation, while removing the (now-dead) PyTorch bootstrap path and de-verbifying device RDMA headers via vendored ABI mirrors + parity guards.
Changes:
- Removed Torch bootstrap init surface (C++ + pybind + CMake) and simplified RPATH handling accordingly.
- Enforced host/device header separation (forward decls + device-safe type mirrors) to keep
verbs.h/mlx5dv.h/hsakmt.hout of device TUs. - Vendored mlx5 WQE/CQE device ABI + added compile-time parity guards; made bnxt/ionic firmware ABIs self-contained.
Reviewed changes
Copilot reviewed 67 out of 69 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/shmem/init.cpp | Removes Torch process-group init entrypoint. |
| src/pybind/pybind_shmem.cpp | Drops pybind exposure for Torch bootstrap init. |
| src/pybind/CMakeLists.txt | Simplifies pybind RPATH handling (no torch path). |
| src/io/rdma/common.cpp | Adds explicit verbs.h include for host dereferences. |
| src/io/rdma/backend_impl.cpp | Adds explicit verbs.h include for host dereferences. |
| src/application/transport/rdma/rdma.cpp | Adds WcStatus ↔ ibv_wc_status parity static_asserts. |
| src/application/transport/rdma/providers/mlx5/mlx5_abi_parity.cpp | New TU that asserts vendored mlx5 ABI layout parity vs system headers. |
| src/application/transport/rdma/providers/ibverbs/ibverbs.cpp | Adds explicit verbs.h include for host dereferences. |
| src/application/CMakeLists.txt | Removes Torch bootstrap option/linking; adds mlx5 ABI parity TU. |
| src/application/bootstrap/torch_bootstrap.cpp | Deletes Torch bootstrap implementation. |
| include/mori/shmem/shmem_sdma_kernels.hpp | Removes unnecessary include to reduce device include surface. |
| include/mori/shmem/shmem_p2p_kernels.hpp | Removes unnecessary include to reduce device include surface. |
| include/mori/shmem/shmem_ibgda_kernels.hpp | Switches device CQE decoding to vendored mlx5 + WcStatusString. |
| include/mori/shmem/shmem_api.hpp | Host/device include split to avoid pulling host RDMA stack into device compiles. |
| include/mori/shmem/internal.hpp | Updates utils include path; adds missing cassert for device asserts. |
| include/mori/ops/dispatch_combine/dispatch_combine.hpp | Uses device-safe application types under device compilation. |
| include/mori/core/utils/utils.hpp | Moves/cleans device helpers; makes HIP runtime inclusion explicit for device builds. |
| include/mori/core/utils/udma_barrier.h | Moves vendored udma barriers into core utils. |
| include/mori/core/transport/sdma/sdma_pkt_struct.h | Adds SDMA packet struct definitions into core transport. |
| include/mori/core/transport/sdma/device_primitives.hpp | Points to core SDMA device handle + core utils. |
| include/mori/core/transport/sdma/anvil_device.hpp | Drops host-only hsakmt.h; uses hsakmttypes.h only. |
| include/mori/core/transport/rdma/utils.hpp | New provider-agnostic RDMA helpers (device-safe WcStatusString). |
| include/mori/core/transport/rdma/rdma.hpp | Light-weight aggregator; avoids dragging host verbs into device. |
| include/mori/core/transport/rdma/rdma_device.hpp | Device-only RDMA aggregator header. |
| include/mori/core/transport/rdma/providers/utils.h | Removes old mixed host/device provider utilities header. |
| include/mori/core/transport/rdma/providers/mlx5/mlx5_host_primitives.hpp | Removes dead mlx5 host primitives header. |
| include/mori/core/transport/rdma/providers/mlx5/mlx5_device_primitives.hpp | Switches device mlx5 path to vendored ABI + WcStatus. |
| include/mori/core/transport/rdma/providers/mlx5/mlx5_defs.hpp | Vendored mlx5 WQE/CQE subset + constants (mori-prefixed). |
| include/mori/core/transport/rdma/providers/ionic/ionic_fw.h | Makes ionic FW header self-contained; gates host-only verbs helpers. |
| include/mori/core/transport/rdma/providers/ionic/ionic_device_primitives.hpp | Adds device-safe CQE status mapping to WcStatus. |
| include/mori/core/transport/rdma/providers/bnxt/bnxt_re_hsi.h | Makes bnxt FW header self-contained via <linux/types.h>. |
| include/mori/core/transport/rdma/providers/bnxt/bnxt_device_primitives.hpp | Adds fast 128-bit slot writes + WcStatus mapping. |
| include/mori/core/transport/rdma/ibverbs_handle.hpp | Forward-declares ibv types to keep verbs.h out of device path. |
| include/mori/core/transport/rdma/host_primitives.hpp | Removes dead host primitives header. |
| include/mori/core/transport/rdma/device_primitives.hpp | Removes stale include comment; keeps device primitives focused. |
| include/mori/core/transport/rdma/core_device_types.hpp | Adds device-safe WcStatus mirror + endian include. |
| include/mori/core/transport/p2p/device_primitives.hpp | Updates utils include path; include adjustments. |
| include/mori/core/profiler/kernel_profiler.hpp | Adds missing standard includes to be self-contained. |
| include/mori/core/core.hpp | Updates utils include path; pulls SDMA/P2P only for device compiles. |
| include/mori/collective/core/allreduce_manager.hpp | Adds missing <memory> include. |
| include/mori/collective/core/all2all_manager.hpp | Adds missing <memory> include. |
| include/mori/collective/allreduce/twoshot_allreduce_sdma_class.hpp | Adds missing <atomic> include. |
| include/mori/application/utils/check.hpp | Adds missing libc headers for errno/stdio/strerror usage. |
| include/mori/application/transport/sdma/anvil.hpp | Updates include path for core anvil device header; adds missing std headers. |
| include/mori/application/transport/rdma/rdma.hpp | Adds missing standard includes; clarifies ATOMIC_IBUF_SLOT_SIZE location. |
| include/mori/application/transport/p2p/p2p.hpp | Adds missing fixed-width type includes. |
| include/mori/application/topology/system.hpp | Fixes/adjusts includes (string). |
| include/mori/application/topology/pci.hpp | Adds missing <utility> include. |
| include/mori/application/topology/node.hpp | Adds missing <cstdint> include. |
| include/mori/application/topology/gpu.hpp | Adds missing <cstdint> include. |
| include/mori/application/memory/va_manager.hpp | Removes unused <vector> include. |
| include/mori/application/memory/symmetric_memory.hpp | Adjusts includes (adds <map>, drops linux/types). |
| include/mori/application/context/context.hpp | Adds missing <memory> include. |
| include/mori/application/bootstrap/local_bootstrap.hpp | Removes outdated mention of Torch bootstrap. |
| include/mori/application/bootstrap/bootstrap.hpp | Removes Torch bootstrap include wiring. |
| include/mori/application/application_device_types.hpp | Adds device-safe RDMA types exposure + forward decls; moves constants; type tweaks. |
| examples/utils/args_parser.hpp | Switches to core device types include (atomicType). |
| examples/shmem/put_thread_allgather.cpp | Adds explicit anvil include for CHECK_HIP_ERROR. |
| examples/sdma/sdma_rate_kernel.h | Updates include path for core anvil device header. |
| examples/sdma/sdma_latency_kernel.h | Updates include path for core anvil device header. |
| examples/sdma/sdma_bw_kernel.h | Updates include path for core anvil device header. |
| examples/local_rdma_ops/write_inline_gpu.cpp | Updates udma barrier include path. |
| examples/local_rdma_ops/write_gpu.cpp | Updates udma barrier include path. |
| examples/local_rdma_ops/send_recv.cpp | Updates udma barrier include path. |
| examples/local_rdma_ops/send_recv_gpu.cpp | Updates udma barrier include path. |
| examples/local_rdma_ops/atomic_gpu.cpp | Updates udma barrier include path. |
| examples/dist_rdma_ops/dist_write.cpp | Updates udma barrier include path. |
| examples/benchmarks/atomic_perf.cpp | Updates udma barrier include path. |
| examples/application/ibverbs_test.cpp | Updates udma barrier include path. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…28-bit load The bnxt ABI segments are __attribute__((packed)) (alignof 1), so reading &seg through a uint4* was a misaligned 128-bit load (UB). memcpy the bytes into a 16B-aligned temp (folds away under -O) then do one aligned 128-bit store. ISA verified: one global_store_dwordx4 per slot, no loads. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Move the host-only *_dv.h shims (bnxt_re_dv/ionic_dv/mlx5_dv — they pull system verbs/mlx5dv and are used only by application/) out of core/ into application/transport/rdma/providers/. Drop the dead host helpers to_ionic_mr_flags/ionic_to_ibv_status from ionic_fw.h (no callers) so the firmware ABI header no longer carries <infiniband/verbs.h>; add <stdint.h> for the uint*_t it uses directly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Compress the mlx5_defs.hpp header block; unname the unused params of get_num_wqes_in_atomic. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… comment p2p/device_primitives.hpp uses fabsf/fmaxf but only included <cassert> after the IWYU pass — relying on transitive HIP math decls is fragile, so include <cmath> directly. Also update the stale rdma.hpp header comment (the host primitives layer it mentioned was removed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jhchouuu
added a commit
that referenced
this pull request
Jun 17, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Standalone include-hygiene refactor of
include/mori/core/andinclude/mori/application/— IWYU cleanup with strict host/device header separation and fixed layering inversions. Logic-preserving, with one bnxt codegen optimization (§5). This benefitsmaindirectly: device translation units parse less, the layering is enforceable, and the RDMA device path no longer drags system RDMA / host-driver headers.What changes
1. Host/device separation + IWYU
core/no longer includes anything fromapplication/(layering inversion removed); moved the shared bits (udma_barrier.h, sdmaanvil_device/pkt structs) down intocore/.<hsakmt/hsakmt.h>(host KFD driver API) that was riding into every device TU via the sdma path.<cassert>/<cstring>/<cstdint>/<endian.h>/ …), so headers are self-contained (e.g. the shmem JIT compilesinternal.hppas a standalone device TU).2. Device RDMA path freed of system RDMA / host headers
WcStatusmirror ofibv_wc_status→ device CQE decoders no longer pull<infiniband/verbs.h>. Value parity guarded bystatic_assertin a host TU.<infiniband/mlx5dv.h>from device code (mori-prefixed structs,sizeof/offsetofparity guard).<linux/types.h>+<stdint.h>.verbs.h/mlx5dv.h/hsakmt/application/.3. Dead-code removal
torch_bootstrap.{hpp,cpp}+ binding);mori_applicationno longer links libtorch.host_primitives.hpp,mlx5_host_primitives.hpp) and dead host helpers inionic_fw.h(so the firmware ABI header no longer carries<infiniband/verbs.h>).snake_casewarp/lane helper API incore/utilsthat mirrored the in-use CamelCase one (−115 lines).4. Structure / naming
core/utils.hpp→core/utils/utils.hpp(matches theX/X.hppaggregator convention).*_dv.hshims moved fromcore/toapplication/, socore/is strictly device-safe.Mlx5*/MORI_MLX5_*) while bnxt/ionic keep upstream names — intentional: the systemmlx5dv.hdefines the same::mlx5_*names and a single RDMA TU can see both, so the vendored copies must not collide; bnxt/ionic have no installed userspace ABI header, so their upstream names are unambiguous.5. perf (BNXT)
global_store_dwordx4, branch-free), replacing a generic variable-length copy loop.static_assertenforces one-segment-per-slot; the source is read viamemcpyto avoid a misaligned load on the packed ABI structs.Verification
concurrent_put/put_imm/put_signal/atomic_nonfetch/atomic_fetch(with the CI RoCE envMORI_RDMA_SL=3/MORI_RDMA_TC=104). ccl SDMA allgather/all2all 8/8.pip install .(full build incl. UMBP+gRPC) succeeds.