Fix multicast handle leak, cuMemMap offset handling, and rename NVLS allreduce algorithms#759
Open
Binyang2014 wants to merge 7 commits intomainfrom
Open
Fix multicast handle leak, cuMemMap offset handling, and rename NVLS allreduce algorithms#759Binyang2014 wants to merge 7 commits intomainfrom
Binyang2014 wants to merge 7 commits intomainfrom
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes NVLS multicast resource lifetime and mapping behavior in the CUDA driver-path, renames NVLS allreduce variants for clearer intent, and adds a unit test that exercises binding multiple buffers to a single NvlsConnection.
Changes:
- Fix multicast allocation handle lifetime in
GpuIpcMemHandle::createMulticast()and correctcuMemMapusage for multicast handles by mapping from offset 0 and returning an adjusted pointer. - Rename NVLS allreduce algorithms (zero-copy / warp-pipeline / block-pipeline) and update selector/builder/docs/examples accordingly; move NVLS connection ownership into algorithm instances.
- Add
TwoChannelsSameConnectionunit test to cover the multi-bind NVLS path.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
src/core/gpu_ipc_mem.cc |
Releases multicast allocation handle after export; fixes multicast mapping to comply with cuMemMap offset constraints. |
src/ext/nccl/algorithm_selector.cc |
Updates selection keys to the renamed NVLS algorithm names. |
src/ext/collectives/algorithm_collection_builder.cc |
Registers renamed NVLS algorithms and updates includes. |
src/ext/collectives/include/collective_utils.hpp |
Removes nvlsConnections from AlgorithmCtx to match new ownership model. |
src/ext/collectives/include/allreduce/allreduce_nvls_zero_copy.hpp |
Adjusts NVLS zero-copy builder state, including larger NVLS buffer sizing and per-algorithm NVLS connection members. |
src/ext/collectives/include/allreduce/allreduce_nvls_warp_pipeline.hpp |
Renames the warp-pipeline NVLS allreduce builder and adds per-algorithm NVLS connection storage. |
src/ext/collectives/include/allreduce/allreduce_nvls_block_pipeline.hpp |
Renames the block-pipeline NVLS allreduce builder and adds per-algorithm NVLS connection storage. |
src/ext/collectives/include/allreduce/allreduce_nvls_packet.hpp |
Adds per-algorithm NVLS connection storage for the packet variant. |
src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu |
Uses algorithm-owned NVLS connections; updates registered algorithm name. |
src/ext/collectives/allreduce/allreduce_nvls_warp_pipeline.cu |
Renames implementation symbols/types and uses algorithm-owned NVLS connections. |
src/ext/collectives/allreduce/allreduce_nvls_block_pipeline.cu |
Renames implementation symbols/types and uses algorithm-owned NVLS connections. |
src/ext/collectives/allreduce/allreduce_nvls_packet.cu |
Initializes and uses algorithm-owned NVLS connections. |
test/mp_unit/switch_channel_tests.cu |
Adds a multi-bind NVLS unit test using two SwitchChannels from the same connection. |
docs/guide/mscclpp-torch-integration.md |
Updates algorithm name references to renamed NVLS variants. |
examples/torch-integration/customized_comm_with_default_algo.py |
Updates algorithm name lookup to the renamed NVLS warp-pipeline variant. |
Comments suppressed due to low confidence (3)
src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu:182
- Inside
initAllreduceContext,auto nvlsOutConnections = this->nvlsOutConnections_;is declared but never used. This will trigger an unused-variable warning (and can fail builds if warnings are treated as errors). Remove the local variable or use it in the subsequentsetupNvlsChannels(...)call.
src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu:195 - The algorithm is registered under the new name
default_allreduce_nvls_zero_copy, but the builder/implementation type is stillAllreduceNvls. This makes the NVLS rename inconsistent and harder to reason about. Consider renaming the class/type toAllreduceNvlsZeroCopy(and updating all references) so the type name matches the file and algorithm name.
src/ext/collectives/include/allreduce/allreduce_nvls_zero_copy.hpp:32 nvlsBufferSize_was increased to1UL << 34(16 GiB) and is used as the multicast handle size when creating NVLS connections. This is a very large default and could increase driver/kernel resource usage or fail on systems with tighter limits, even if the bound buffers are much smaller. Consider making this size derived from the actual maximum buffer size(s) you expect to bind (or configurable via an env/config), and ensure failures surface with an actionable error message.
Contributor
Author
|
/azp run |
|
Azure Pipelines successfully started running 5 pipeline(s). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR addresses a multicast resource leak, fixes
cuMemMapoffset handling for multicast handles, renames NVLS allreduce algorithm classes for clarity, and adds a new unit test forSwitchChannel.Bug Fixes
1. Fix multicast allocation handle leak in
createMulticast()(gpu_ipc_mem.cc)GpuIpcMemHandle::createMulticast()calledcuMulticastCreate(&allocHandle, ...)but never released the localallocHandleafter exporting it to shareable handles (POSIX FD / Fabric). This caused a reference count leak — the multicast object was never freed even after all mappings and imported handles were released.Per the CUDA Driver API docs for
cuMemRelease:The fix adds
cuMemRelease(allocHandle)after export, matching the existing pattern used for regular allocations inGpuIpcMemHandle::create().Impact: Without this fix, repeated creation/destruction of NVLS connections causes OOM after ~120 iterations when allocating 1GB multicast buffers on H100.
2. Fix
cuMemMapoffset for multicast handles (gpu_ipc_mem.cc)cuMemMaprequiresoffset=0for multicast handles. Previously, the code attempted to map at a non-zero offset within the multicast object, leading to errors when binding multiple buffers to the sameNvlsConnection. The fix maps the entire range[0, mcOffset + bufferSize)and returns the pointer offset bymcOffset. This only consumes extra virtual address space; no additional physical memory is used.Refactoring
3. Rename NVLS allreduce algorithm classes
Renamed for clarity:
AllreduceNvls→AllreduceNvlsZeroCopyAllreduceNvlsWithCopy→AllreduceNvlsWarpPipelineAllreduceNvlsWithCopy2→AllreduceNvlsBlockPipelineUpdated all references in builder, selector, docs, and examples.
4. Move
nvlsConnectionssetup toinitialize()Moved
nvlsConnections_fromAlgorithmCtx(which no longer has this member) to individual algorithm class members, initialized in theirinitialize()methods.Tests
5. Add
TwoChannelsSameConnectiontestNew unit test that creates two
SwitchChannelinstances from the sameNvlsConnection, performs reduce operations on both, and verifies correctness. This exercises the multi-bind path that triggered thecuMemMapoffset fix.Files Changed
src/core/gpu_ipc_mem.cc— multicast handle leak fix + cuMemMap offset fixsrc/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu(renamed)src/ext/collectives/allreduce/allreduce_nvls_warp_pipeline.cu(renamed)src/ext/collectives/allreduce/allreduce_nvls_block_pipeline.cu(renamed)src/ext/collectives/allreduce/allreduce_nvls_packet.cu— nvlsConnections fixsrc/ext/collectives/include/allreduce/*.hpp— renamed headerssrc/ext/collectives/algorithm_collection_builder.cc— updated referencessrc/ext/nccl/algorithm_selector.cc— updated algorithm namestest/mp_unit/switch_channel_tests.cu— new testdocs/guide/mscclpp-torch-integration.md— updated namesexamples/torch-integration/customized_comm_with_default_algo.py— updated names