
Fix multicast handle leak, cuMemMap offset handling, and rename NVLS allreduce algorithms #759

Open
Binyang2014 wants to merge 7 commits into main from binyli/switch-channel

Conversation

@Binyang2014 (Contributor)

Summary

This PR addresses a multicast resource leak, fixes cuMemMap offset handling for multicast handles, renames NVLS allreduce algorithm classes for clarity, and adds a new unit test for SwitchChannel.

Bug Fixes

1. Fix multicast allocation handle leak in createMulticast() (gpu_ipc_mem.cc)

GpuIpcMemHandle::createMulticast() called cuMulticastCreate(&allocHandle, ...) but never released the local allocHandle after exporting it to shareable handles (POSIX FD / Fabric). This caused a reference count leak — the multicast object was never freed even after all mappings and imported handles were released.

Per the CUDA Driver API docs for cuMemRelease:

"The memory allocation will be freed when all outstanding mappings to the memory are unmapped and when all outstanding references to the handle (including its shareable counterparts) are also released."

The fix adds cuMemRelease(allocHandle) after export, matching the existing pattern used for regular allocations in GpuIpcMemHandle::create().

Impact: Without this fix, repeated creation/destruction of NVLS connections causes OOM after ~120 iterations when allocating 1GB multicast buffers on H100.

2. Fix cuMemMap offset for multicast handles (gpu_ipc_mem.cc)

cuMemMap requires offset=0 for multicast handles. Previously, the code attempted to map at a non-zero offset within the multicast object, leading to errors when binding multiple buffers to the same NvlsConnection. The fix maps the entire range [0, mcOffset + bufferSize) and returns the pointer offset by mcOffset. This only consumes extra virtual address space; no additional physical memory is used.

Refactoring

3. Rename NVLS allreduce algorithm classes

Renamed for clarity:

  • AllreduceNvls → AllreduceNvlsZeroCopy
  • AllreduceNvlsWithCopy → AllreduceNvlsWarpPipeline
  • AllreduceNvlsWithCopy2 → AllreduceNvlsBlockPipeline

Updated all references in builder, selector, docs, and examples.

4. Move nvlsConnections setup to initialize()

Removed the nvlsConnections_ member from AlgorithmCtx; each algorithm class now owns its NVLS connections as a member, initialized in its initialize() method.

Tests

5. Add TwoChannelsSameConnection test

New unit test that creates two SwitchChannel instances from the same NvlsConnection, performs reduce operations on both, and verifies correctness. This exercises the multi-bind path that triggered the cuMemMap offset fix.

Files Changed

  • src/core/gpu_ipc_mem.cc — multicast handle leak fix + cuMemMap offset fix
  • src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu (renamed)
  • src/ext/collectives/allreduce/allreduce_nvls_warp_pipeline.cu (renamed)
  • src/ext/collectives/allreduce/allreduce_nvls_block_pipeline.cu (renamed)
  • src/ext/collectives/allreduce/allreduce_nvls_packet.cu — nvlsConnections fix
  • src/ext/collectives/include/allreduce/*.hpp — renamed headers
  • src/ext/collectives/algorithm_collection_builder.cc — updated references
  • src/ext/nccl/algorithm_selector.cc — updated algorithm names
  • test/mp_unit/switch_channel_tests.cu — new test
  • docs/guide/mscclpp-torch-integration.md — updated names
  • examples/torch-integration/customized_comm_with_default_algo.py — updated names

Copilot AI left a comment

Pull request overview

This PR fixes NVLS multicast resource lifetime and mapping behavior in the CUDA driver-path, renames NVLS allreduce variants for clearer intent, and adds a unit test that exercises binding multiple buffers to a single NvlsConnection.

Changes:

  • Fix multicast allocation handle lifetime in GpuIpcMemHandle::createMulticast() and correct cuMemMap usage for multicast handles by mapping from offset 0 and returning an adjusted pointer.
  • Rename NVLS allreduce algorithms (zero-copy / warp-pipeline / block-pipeline) and update selector/builder/docs/examples accordingly; move NVLS connection ownership into algorithm instances.
  • Add TwoChannelsSameConnection unit test to cover the multi-bind NVLS path.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.

Summary per file:

  • src/core/gpu_ipc_mem.cc — Releases the multicast allocation handle after export; fixes multicast mapping to comply with cuMemMap offset constraints.
  • src/ext/nccl/algorithm_selector.cc — Updates selection keys to the renamed NVLS algorithm names.
  • src/ext/collectives/algorithm_collection_builder.cc — Registers the renamed NVLS algorithms and updates includes.
  • src/ext/collectives/include/collective_utils.hpp — Removes nvlsConnections from AlgorithmCtx to match the new ownership model.
  • src/ext/collectives/include/allreduce/allreduce_nvls_zero_copy.hpp — Adjusts NVLS zero-copy builder state, including larger NVLS buffer sizing and per-algorithm NVLS connection members.
  • src/ext/collectives/include/allreduce/allreduce_nvls_warp_pipeline.hpp — Renames the warp-pipeline NVLS allreduce builder and adds per-algorithm NVLS connection storage.
  • src/ext/collectives/include/allreduce/allreduce_nvls_block_pipeline.hpp — Renames the block-pipeline NVLS allreduce builder and adds per-algorithm NVLS connection storage.
  • src/ext/collectives/include/allreduce/allreduce_nvls_packet.hpp — Adds per-algorithm NVLS connection storage for the packet variant.
  • src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu — Uses algorithm-owned NVLS connections; updates the registered algorithm name.
  • src/ext/collectives/allreduce/allreduce_nvls_warp_pipeline.cu — Renames implementation symbols/types and uses algorithm-owned NVLS connections.
  • src/ext/collectives/allreduce/allreduce_nvls_block_pipeline.cu — Renames implementation symbols/types and uses algorithm-owned NVLS connections.
  • src/ext/collectives/allreduce/allreduce_nvls_packet.cu — Initializes and uses algorithm-owned NVLS connections.
  • test/mp_unit/switch_channel_tests.cu — Adds a multi-bind NVLS unit test using two SwitchChannels from the same connection.
  • docs/guide/mscclpp-torch-integration.md — Updates algorithm name references to the renamed NVLS variants.
  • examples/torch-integration/customized_comm_with_default_algo.py — Updates the algorithm name lookup to the renamed NVLS warp-pipeline variant.
Comments suppressed due to low confidence (3)

src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu:182

  • Inside initAllreduceContext, auto nvlsOutConnections = this->nvlsOutConnections_; is declared but never used. This will trigger an unused-variable warning (and can fail builds if warnings are treated as errors). Remove the local variable or use it in the subsequent setupNvlsChannels(...) call.

src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu:195

  • The algorithm is registered under the new name default_allreduce_nvls_zero_copy, but the builder/implementation type is still AllreduceNvls. This makes the NVLS rename inconsistent and harder to reason about. Consider renaming the class/type to AllreduceNvlsZeroCopy (and updating all references) so the type name matches the file and algorithm name.

src/ext/collectives/include/allreduce/allreduce_nvls_zero_copy.hpp:32

  • nvlsBufferSize_ was increased to 1UL << 34 (16 GiB) and is used as the multicast handle size when creating NVLS connections. This is a very large default and could increase driver/kernel resource usage or fail on systems with tighter limits, even if the bound buffers are much smaller. Consider making this size derived from the actual maximum buffer size(s) you expect to bind (or configurable via an env/config), and ensure failures surface with an actionable error message.

@Binyang2014 Binyang2014 marked this pull request as ready for review March 1, 2026 23:06
@Binyang2014 (Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).
