
Change snmg index to use updated multi gpu resource API#869

Merged
rapids-bot[bot] merged 23 commits intorapidsai:branch-25.06from
jinsolp:mg-change-for-multi-gpu-resource
May 15, 2025

Conversation

@jinsolp
Contributor

@jinsolp jinsolp commented May 2, 2025

Description

These changes depend on this breaking PR in raft.
This PR itself introduces no new features; it only updates the existing code to work with the breaking changes in that raft PR.

@jinsolp jinsolp requested a review from a team as a code owner May 2, 2025 23:50
@github-actions github-actions Bot added the cpp label May 2, 2025
@jinsolp jinsolp requested a review from viclafargue May 2, 2025 23:51
@jinsolp jinsolp self-assigned this May 2, 2025
@jinsolp jinsolp added non-breaking Introduces a non-breaking change improvement Improves an existing functionality labels May 2, 2025
@jinsolp jinsolp requested a review from a team as a code owner May 2, 2025 23:58
@github-actions github-actions Bot added the CMake label May 2, 2025
Comment thread cpp/cmake/thirdparty/get_raft.cmake Outdated
Comment on lines +48 to +49
GIT_REPOSITORY https://github.com/jinsolp/raft.git
GIT_TAG multi-gpu-resource
Contributor Author


This needs to be reverted

Signed-off-by: jinsolp <[email protected]>
Contributor

@viclafargue viclafargue left a comment


Thanks @jinsolp, LGTM apart from the benchmark files. Could you switch the raft::device_resources_snmg to raft::device_resources_snmg_nccl?
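The requested swap is essentially a one-line type change wherever the benchmarks construct the multi-GPU resource. A minimal sketch: only the two raft type names come from this review comment; the surrounding struct, member name, and header path are illustrative assumptions.

```cpp
// Sketch of the requested change (assumed header location).
#include <raft/core/device_resources_snmg.hpp>

struct cuvs_mg_cagra_bench {  // hypothetical benchmark wrapper
  // Before: raft::device_resources_snmg res_;
  raft::device_resources_snmg_nccl res_;  // after: NCCL-backed variant
};
```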

@jinsolp
Contributor Author

jinsolp commented May 5, 2025

@viclafargue Thanks for pointing that out, Victor! I also had to change the function calls in snmg.cuh so that it works with the changed function names in raft!

@jinsolp jinsolp changed the title from "Change snmg index to use nccl-dependent device resource" to "Change snmg index to use devices_resources_snmg_nccl" May 5, 2025
@cjnolet
Member

cjnolet commented May 8, 2025

@jinsolp don't forget to revert the change in get_raft.cmake before this is merged.

@github-actions github-actions Bot removed the CMake label May 8, 2025
@github-actions github-actions Bot added the CMake label May 12, 2025
@jinsolp jinsolp changed the title from "Change snmg index to use devices_resources_snmg_nccl" to "Change snmg index to use updated multi gpu resource API" May 12, 2025
@achirkin
Contributor

Hi Jinsol, did you have a chance to run the ann_bench with the new changes?
I've just tried this branch on a system with two GPUs, but limited it to only one via the env variable CUDA_VISIBLE_DEVICES=0:

(cuml_dev) CUDA_VISIBLE_DEVICES=0 ./cpp/build/bench/ann/ANN_BENCH \
  --search \
  --mode=latency \
  --benchmark_filter=cuvs_mg_cagra.dim32_vpq \
  --benchmark_min_time=3s \
  --benchmark_min_warmup_time=0.01 \
  --benchmark_counters_tabular \
  --override_kv=dataset_memory_type:\"host\" \
  --override_kv=query_memory_type:\"pinned\" \
  --override_kv=graph_memory_type:\"host_pinned\" \
  --override_kv=persistent:false \
  --data_prefix=/tmp/data \
  --index_prefix=/ann/index \
  /ann/conf/deep-100M-cagra-persistent.json
[I] [12:00:04.876774] Using the query file '/tmp/data/deep-1B/query.public.10K.fbin'
[I] [12:00:04.876805] Using the ground truth file '/tmp/data/deep-100M/groundtruth.neighbors.ibin'
2025-05-13T12:00:04+00:00
Running ./cpp/build/bench/ann/ANN_BENCH
Run on (32 X 3729.49 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x4)
Load Average: 0.21, 4.55, 12.52
command_line: ./cpp/build/bench/ann/ANN_BENCH --search --mode=latency --benchmark_filter=cuvs_mg_cagra.dim32_vpq --benchmark_min_time=3s --benchmark_min_warmup_time=0.01 --benchmark_counters_tabular --override_kv=dataset_memory_type:"host" --override_kv=query_memory_type:"pinned" --override_kv=graph_memory_type:"host_pinned" --override_kv=persistent:false --data_prefix=/tmp/data --index_prefix=/ann/index /ann/conf/deep-100M-cagra-persistent.json
dataset: deep-100M
dim: 96
distance: euclidean
gpu_driver_version: 13.0
gpu_gpuDirectRDMASupported: 1758070544
gpu_hostNativeAtomicSupported: 0
gpu_mem_bus_width: 6016
gpu_mem_freq: 2619000000.000000
gpu_mem_global_size: 99951181824
gpu_mem_shared_size: 233472
gpu_name: NVIDIA H800 NVL
gpu_pageableMemoryAccess: 1
gpu_pageableMemoryAccessUsesHostPageTables: 0
gpu_runtime_version: 12.8
gpu_sm_count: 132
gpu_sm_freq: 1785000000.000000
host_cores_used: 16
host_cpu_freq_max: 3000000000
host_cpu_freq_min: 1500000000
host_pagesize: 4096
host_processors_sysconf: 32
host_processors_used: 32
host_total_ram_size: 270107455488
host_total_swap_size: 0
max_k: 100
max_n_queries: 10000
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
terminate called after throwing an instance of 'raft::cuda_error'
  what():  CUDA error encountered at: file=~/workspace/raft/cpp/include/raft/core/device_resources_snmg.hpp line=102: call='cudaSetDevice(main_gpu_id_)', Reason=cudaErrorInvalidDevice:invalid device ordinal
Obtained 18 stack frames
#1 in /tmp/build/bench/ann/libcuvs_mg_ann_bench.so: raft::cuda_error::cuda_error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) +0x5a [0x7f0a3c323c6a]
#2 in /tmp/build/bench/ann/libcuvs_mg_ann_bench.so: raft::device_resources_snmg::has_resource_factory(raft::resource::resource_type) const +0x1fe [0x7f0a3c32530e]
#3 in /tmp/build/bench/ann/libcuvs_mg_ann_bench.so: raft::resource::get_cuda_stream(raft::resources const&) +0x14 [0x7f0a3c32abd4]
#4 in /tmp/build/bench/ann/libcuvs_mg_ann_bench.so: non-virtual thunk to cuvs::bench::cuvs_mg_cagra<float, unsigned int>::get_sync_stream() const +0xd [0x7f0a3c32af7d]
#5 in ./cpp/build/bench/ann/ANN_BENCH: void cuvs::bench::bench_search<float>(benchmark::State&, cuvs::bench::configuration::index const&, unsigned long, std::shared_ptr<cuvs::bench::dataset<float, int> const>, bool) +0x662 [0x5557863a4342]
#6 in ./cpp/build/bench/ann/ANN_BENCH: benchmark::internal::LambdaBenchmark<benchmark::RegisterBenchmark<void (&)(benchmark::State&, cuvs::bench::configuration::index const&, unsigned long, std::shared_ptr<cuvs::bench::dataset<float, int> const>, bool), cuvs::bench::configuration::index&, unsigned long&, std::shared_ptr<cuvs::bench::dataset<float, int> const>&, bool&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void (&)(benchmark::State&, cuvs::bench::configuration::index const&, unsigned long, std::shared_ptr<cuvs::bench::dataset<float, int> const>, bool), cuvs::bench::configuration::index&, unsigned long&, std::shared_ptr<cuvs::bench::dataset<float, int> const>&, bool&)::{lambda(benchmark::State&)#1}>::Run(benchmark::State&) +0x54 [0x555786364cc4]
#7 in ./cpp/build/bench/ann/ANN_BENCH(+0xc0721) [0x5557863e7721]
#8 in ./cpp/build/bench/ann/ANN_BENCH(+0xa66b0) [0x5557863cd6b0]
#9 in ./cpp/build/bench/ann/ANN_BENCH(+0xa703d) [0x5557863ce03d]
#10 in ./cpp/build/bench/ann/ANN_BENCH(+0xa7626) [0x5557863ce626]
#11 in ./cpp/build/bench/ann/ANN_BENCH(+0xa886e) [0x5557863cf86e]
#12 in ./cpp/build/bench/ann/ANN_BENCH(+0x87d95) [0x5557863aed95]
#13 in ./cpp/build/bench/ann/ANN_BENCH(+0x8992d) [0x5557863b092d]
#14 in ./cpp/build/bench/ann/ANN_BENCH(+0x89a97) [0x5557863b0a97]
#15 in ./cpp/build/bench/ann/ANN_BENCH: cuvs::bench::run_main(int, char**) +0xfa1 [0x5557863ab411]
#16 in /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7f0a42d4d1ca]
#17 in /usr/lib/x86_64-linux-gnu/libc.so.6: __libc_start_main +0x8b [0x7f0a42d4d28b]
#18 in ./cpp/build/bench/ann/ANN_BENCH: _start +0x2e [0x5557863584de]

Aborted (core dumped)
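The stack trace points at `device_resources_snmg::has_resource_factory()` calling `cudaSetDevice(main_gpu_id_)` on an ordinal that is hidden by `CUDA_VISIBLE_DEVICES=0`: the process sees exactly one device, so any cached physical ordinal >= 1 fails with `cudaErrorInvalidDevice`. A guard of roughly this shape illustrates the needed check; it is a sketch, not the actual raft fix, and `set_device_checked` is a hypothetical helper name.

```cpp
#include <cuda_runtime_api.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: refuse to switch to a GPU ordinal that is not visible
// to this process (e.g. because CUDA_VISIBLE_DEVICES hides it).
void set_device_checked(int device_id) {
  int visible = 0;
  if (cudaGetDeviceCount(&visible) != cudaSuccess || device_id >= visible) {
    throw std::runtime_error("GPU ordinal " + std::to_string(device_id) +
                             " is not visible to this process");
  }
  if (cudaSetDevice(device_id) != cudaSuccess) {
    throw std::runtime_error("cudaSetDevice failed");
  }
}
```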

@jinsolp
Contributor Author

jinsolp commented May 13, 2025

@achirkin I only verified the gtests; let me try running the ANN_BENCH. Thanks for pointing this out : )

@jinsolp
Copy link
Copy Markdown
Contributor Author

jinsolp commented May 14, 2025

Should be fixed now

rapids-bot Bot pushed a commit to rapidsai/raft that referenced this pull request May 14, 2025
The existing `nccl_clique.hpp` assumes NCCL is used whenever multiple GPUs are in play. However, this is not always the case, and we want to generalize the multi-GPU resource by decoupling it from NCCL usage.
This PR is mostly a refactoring of existing code to lazy-load nccl for `device_resources_snmg` resource.

### Dependencies
**The merge of this PR will break cuVS**. Thankfully (and hopefully) it will not break anything else since `nccl_clique.hpp` was recently introduced and is only used for snmg neighbors index related code. 
[This PR in cuVS](rapidsai/cuvs#869) makes the corresponding changes (renaming included header files, etc.) so that this code works with cuVS. There are no functional differences.

Authors:
  - Jinsol Park (https://github.com/jinsolp)

Approvers:
  - Victor Lafargue (https://github.com/viclafargue)
  - Artem M. Chirkin (https://github.com/achirkin)
  - Divye Gala (https://github.com/divyegala)

URL: #2647
@github-actions github-actions Bot removed the CMake label May 14, 2025
@achirkin
Contributor

/merge

@rapids-bot rapids-bot Bot merged commit 2c4e34c into rapidsai:branch-25.06 May 15, 2025
75 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Unstructured Data Processing May 15, 2025
@jinsolp jinsolp deleted the mg-change-for-multi-gpu-resource branch May 15, 2025 06:06

Labels

cpp improvement Improves an existing functionality non-breaking Introduces a non-breaking change

Projects

Status: Done
