
Fix multicast handle leak, cuMemMap offset handling, and rename NVLS allreduce algorithms #759

Open
Binyang2014 wants to merge 7 commits into main from binyli/switch-channel

Conversation

@Binyang2014 (Contributor)

Summary

This PR addresses a multicast resource leak, fixes cuMemMap offset handling for multicast handles, renames NVLS allreduce algorithm classes for clarity, and adds a new unit test for SwitchChannel.

Bug Fixes

1. Fix multicast allocation handle leak in createMulticast() (gpu_ipc_mem.cc)

GpuIpcMemHandle::createMulticast() called cuMulticastCreate(&allocHandle, ...) but never released the local allocHandle after exporting it to shareable handles (POSIX FD / Fabric). This caused a reference count leak — the multicast object was never freed even after all mappings and imported handles were released.

Per the CUDA Driver API docs for cuMemRelease:

"The memory allocation will be freed when all outstanding mappings to the memory are unmapped and when all outstanding references to the handle (including its shareable counterparts) are also released."

The fix adds cuMemRelease(allocHandle) after export, matching the existing pattern used for regular allocations in GpuIpcMemHandle::create().

Impact: Without this fix, repeated creation/destruction of NVLS connections causes OOM after ~120 iterations when allocating 1GB multicast buffers on H100.

2. Fix cuMemMap offset for multicast handles (gpu_ipc_mem.cc)

cuMemMap requires offset=0 for multicast handles. Previously, the code attempted to map at a non-zero offset within the multicast object, leading to errors when binding multiple buffers to the same NvlsConnection. The fix maps the entire range [0, mcOffset + bufferSize) and returns the pointer offset by mcOffset. This only consumes extra virtual address space; no additional physical memory is used.

Refactoring

3. Rename NVLS allreduce algorithm classes

Renamed for clarity:

  • AllreduceNvls → AllreduceNvlsZeroCopy
  • AllreduceNvlsWithCopy → AllreduceNvlsWarpPipeline
  • AllreduceNvlsWithCopy2 → AllreduceNvlsBlockPipeline

Updated all references in builder, selector, docs, and examples.

4. Move nvlsConnections setup to initialize()

Removed the nvlsConnections_ member from AlgorithmCtx; each algorithm class now owns its NVLS connections as a member, initialized in its initialize() method.

Tests

5. Add TwoChannelsSameConnection test

New unit test that creates two SwitchChannel instances from the same NvlsConnection, performs reduce operations on both, and verifies correctness. This exercises the multi-bind path that triggered the cuMemMap offset fix.

Files Changed

  • src/core/gpu_ipc_mem.cc — multicast handle leak fix + cuMemMap offset fix
  • src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu (renamed)
  • src/ext/collectives/allreduce/allreduce_nvls_warp_pipeline.cu (renamed)
  • src/ext/collectives/allreduce/allreduce_nvls_block_pipeline.cu (renamed)
  • src/ext/collectives/allreduce/allreduce_nvls_packet.cu — nvlsConnections fix
  • src/ext/collectives/include/allreduce/*.hpp — renamed headers
  • src/ext/collectives/algorithm_collection_builder.cc — updated references
  • src/ext/nccl/algorithm_selector.cc — updated algorithm names
  • test/mp_unit/switch_channel_tests.cu — new test
  • docs/guide/mscclpp-torch-integration.md — updated names
  • examples/torch-integration/customized_comm_with_default_algo.py — updated names

Copilot AI left a comment

Pull request overview

This PR fixes NVLS multicast resource lifetime and mapping behavior in the CUDA driver-path, renames NVLS allreduce variants for clearer intent, and adds a unit test that exercises binding multiple buffers to a single NvlsConnection.

Changes:

  • Fix multicast allocation handle lifetime in GpuIpcMemHandle::createMulticast() and correct cuMemMap usage for multicast handles by mapping from offset 0 and returning an adjusted pointer.
  • Rename NVLS allreduce algorithms (zero-copy / warp-pipeline / block-pipeline) and update selector/builder/docs/examples accordingly; move NVLS connection ownership into algorithm instances.
  • Add TwoChannelsSameConnection unit test to cover the multi-bind NVLS path.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.

Summary per file:

  • src/core/gpu_ipc_mem.cc — Releases the multicast allocation handle after export; fixes multicast mapping to comply with cuMemMap offset constraints.
  • src/ext/nccl/algorithm_selector.cc — Updates selection keys to the renamed NVLS algorithm names.
  • src/ext/collectives/algorithm_collection_builder.cc — Registers the renamed NVLS algorithms and updates includes.
  • src/ext/collectives/include/collective_utils.hpp — Removes nvlsConnections from AlgorithmCtx to match the new ownership model.
  • src/ext/collectives/include/allreduce/allreduce_nvls_zero_copy.hpp — Adjusts NVLS zero-copy builder state, including larger NVLS buffer sizing and per-algorithm NVLS connection members.
  • src/ext/collectives/include/allreduce/allreduce_nvls_warp_pipeline.hpp — Renames the warp-pipeline NVLS allreduce builder and adds per-algorithm NVLS connection storage.
  • src/ext/collectives/include/allreduce/allreduce_nvls_block_pipeline.hpp — Renames the block-pipeline NVLS allreduce builder and adds per-algorithm NVLS connection storage.
  • src/ext/collectives/include/allreduce/allreduce_nvls_packet.hpp — Adds per-algorithm NVLS connection storage for the packet variant.
  • src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu — Uses algorithm-owned NVLS connections; updates the registered algorithm name.
  • src/ext/collectives/allreduce/allreduce_nvls_warp_pipeline.cu — Renames implementation symbols/types and uses algorithm-owned NVLS connections.
  • src/ext/collectives/allreduce/allreduce_nvls_block_pipeline.cu — Renames implementation symbols/types and uses algorithm-owned NVLS connections.
  • src/ext/collectives/allreduce/allreduce_nvls_packet.cu — Initializes and uses algorithm-owned NVLS connections.
  • test/mp_unit/switch_channel_tests.cu — Adds a multi-bind NVLS unit test using two SwitchChannels from the same connection.
  • docs/guide/mscclpp-torch-integration.md — Updates algorithm name references to the renamed NVLS variants.
  • examples/torch-integration/customized_comm_with_default_algo.py — Updates the algorithm name lookup to the renamed NVLS warp-pipeline variant.
Comments suppressed due to low confidence (3)

src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu:182

  • Inside initAllreduceContext, auto nvlsOutConnections = this->nvlsOutConnections_; is declared but never used. This will trigger an unused-variable warning (and can fail builds if warnings are treated as errors). Remove the local variable or use it in the subsequent setupNvlsChannels(...) call.

src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu:195

  • The algorithm is registered under the new name default_allreduce_nvls_zero_copy, but the builder/implementation type is still AllreduceNvls. This makes the NVLS rename inconsistent and harder to reason about. Consider renaming the class/type to AllreduceNvlsZeroCopy (and updating all references) so the type name matches the file and algorithm name.

src/ext/collectives/include/allreduce/allreduce_nvls_zero_copy.hpp:32

  • nvlsBufferSize_ was increased to 1UL << 34 (16 GiB) and is used as the multicast handle size when creating NVLS connections. This is a very large default and could increase driver/kernel resource usage or fail on systems with tighter limits, even if the bound buffers are much smaller. Consider making this size derived from the actual maximum buffer size(s) you expect to bind (or configurable via an env/config), and ensure failures surface with an actionable error message.

@Binyang2014 Binyang2014 marked this pull request as ready for review March 1, 2026 23:06
@Binyang2014 (Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).
