
Change snmg index to use updated multi gpu resource API#869

Merged
rapids-bot[bot] merged 23 commits intorapidsai:branch-25.06from
jinsolp:mg-change-for-multi-gpu-resource
May 15, 2025

Conversation

@jinsolp
Contributor

@jinsolp jinsolp commented May 2, 2025

Description

These changes depend on this breaking PR in raft.
This PR itself introduces no new features; it only updates the existing code to work with the breaking changes in that raft PR.

@jinsolp jinsolp requested a review from a team as a code owner May 2, 2025 23:50
@github-actions github-actions Bot added the cpp label May 2, 2025
@jinsolp jinsolp requested a review from viclafargue May 2, 2025 23:51
@jinsolp jinsolp self-assigned this May 2, 2025
@jinsolp jinsolp added non-breaking Introduces a non-breaking change improvement Improves an existing functionality labels May 2, 2025
@jinsolp jinsolp requested a review from a team as a code owner May 2, 2025 23:58
@github-actions github-actions Bot added the CMake label May 2, 2025
Comment thread cpp/cmake/thirdparty/get_raft.cmake Outdated
Comment on lines +48 to +49
GIT_REPOSITORY https://github.com/jinsolp/raft.git
GIT_TAG multi-gpu-resource
Contributor Author


This needs to be reverted

Signed-off-by: jinsolp <[email protected]>
Contributor

@viclafargue viclafargue left a comment


Thanks @jinsolp, LGTM apart from the benchmark files. Could you switch the raft::device_resources_snmg to raft::device_resources_snmg_nccl?
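The requested swap is essentially a one-line type change wherever the benchmarks construct the multi-GPU resource. A minimal sketch: only the two raft type names come from this review comment; the surrounding struct, member name, and header path are illustrative assumptions.

```cpp
// Sketch of the requested change (assumed header location).
#include <raft/core/device_resources_snmg.hpp>

struct cuvs_mg_cagra_bench {  // hypothetical benchmark wrapper
  // Before: raft::device_resources_snmg res_;
  raft::device_resources_snmg_nccl res_;  // after: NCCL-backed variant
};
```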

@jinsolp
Contributor Author

jinsolp commented May 5, 2025

@viclafargue Thanks for pointing that out, Victor! I also had to change the function calls in snmg.cuh so that it works with the changed function names in raft!

@jinsolp jinsolp changed the title from "Change snmg index to use nccl-dependent device resource" to "Change snmg index to use devices_resources_snmg_nccl" May 5, 2025
@cjnolet
Member

cjnolet commented May 8, 2025

@jinsolp don't forget to revert the change in get_raft.cmake before this is merged.

@github-actions github-actions Bot removed the CMake label May 8, 2025
@github-actions github-actions Bot added the CMake label May 12, 2025
@jinsolp jinsolp changed the title from "Change snmg index to use devices_resources_snmg_nccl" to "Change snmg index to use updated multi gpu resource API" May 12, 2025
@achirkin
Contributor

Hi Jinsol, did you have a chance to run the ann_bench with the new changes?
I've just tried this branch on a system with two GPUs, but limited it to only one via the env variable CUDA_VISIBLE_DEVICES=0:

(cuml_dev) CUDA_VISIBLE_DEVICES=0 ./cpp/build/bench/ann/ANN_BENCH \
  --search \
  --mode=latency \
  --benchmark_filter=cuvs_mg_cagra.dim32_vpq \
  --benchmark_min_time=3s \
  --benchmark_min_warmup_time=0.01 \
  --benchmark_counters_tabular \
  --override_kv=dataset_memory_type:\"host\" \
  --override_kv=query_memory_type:\"pinned\" \
  --override_kv=graph_memory_type:\"host_pinned\" \
  --override_kv=persistent:false \
  --data_prefix=/tmp/data \
  --index_prefix=/ann/index \
  /ann/conf/deep-100M-cagra-persistent.json
[I] [12:00:04.876774] Using the query file '/tmp/data/deep-1B/query.public.10K.fbin'
[I] [12:00:04.876805] Using the ground truth file '/tmp/data/deep-100M/groundtruth.neighbors.ibin'
2025-05-13T12:00:04+00:00
Running ./cpp/build/bench/ann/ANN_BENCH
Run on (32 X 3729.49 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x4)
Load Average: 0.21, 4.55, 12.52
command_line: ./cpp/build/bench/ann/ANN_BENCH --search --mode=latency --benchmark_filter=cuvs_mg_cagra.dim32_vpq --benchmark_min_time=3s --benchmark_min_warmup_time=0.01 --benchmark_counters_tabular --override_kv=dataset_memory_type:"host" --override_kv=query_memory_type:"pinned" --override_kv=graph_memory_type:"host_pinned" --override_kv=persistent:false --data_prefix=/tmp/data --index_prefix=/ann/index /ann/conf/deep-100M-cagra-persistent.json
dataset: deep-100M
dim: 96
distance: euclidean
gpu_driver_version: 13.0
gpu_gpuDirectRDMASupported: 1758070544
gpu_hostNativeAtomicSupported: 0
gpu_mem_bus_width: 6016
gpu_mem_freq: 2619000000.000000
gpu_mem_global_size: 99951181824
gpu_mem_shared_size: 233472
gpu_name: NVIDIA H800 NVL
gpu_pageableMemoryAccess: 1
gpu_pageableMemoryAccessUsesHostPageTables: 0
gpu_runtime_version: 12.8
gpu_sm_count: 132
gpu_sm_freq: 1785000000.000000
host_cores_used: 16
host_cpu_freq_max: 3000000000
host_cpu_freq_min: 1500000000
host_pagesize: 4096
host_processors_sysconf: 32
host_processors_used: 32
host_total_ram_size: 270107455488
host_total_swap_size: 0
max_k: 100
max_n_queries: 10000
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
terminate called after throwing an instance of 'raft::cuda_error'
  what():  CUDA error encountered at: file=~/workspace/raft/cpp/include/raft/core/device_resources_snmg.hpp line=102: call='cudaSetDevice(main_gpu_id_)', Reason=cudaErrorInvalidDevice:invalid device ordinal
Obtained 18 stack frames
#1 in /tmp/build/bench/ann/libcuvs_mg_ann_bench.so: raft::cuda_error::cuda_error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) +0x5a [0x7f0a3c323c6a]
#2 in /tmp/build/bench/ann/libcuvs_mg_ann_bench.so: raft::device_resources_snmg::has_resource_factory(raft::resource::resource_type) const +0x1fe [0x7f0a3c32530e]
#3 in /tmp/build/bench/ann/libcuvs_mg_ann_bench.so: raft::resource::get_cuda_stream(raft::resources const&) +0x14 [0x7f0a3c32abd4]
#4 in /tmp/build/bench/ann/libcuvs_mg_ann_bench.so: non-virtual thunk to cuvs::bench::cuvs_mg_cagra<float, unsigned int>::get_sync_stream() const +0xd [0x7f0a3c32af7d]
#5 in ./cpp/build/bench/ann/ANN_BENCH: void cuvs::bench::bench_search<float>(benchmark::State&, cuvs::bench::configuration::index const&, unsigned long, std::shared_ptr<cuvs::bench::dataset<float, int> const>, bool) +0x662 [0x5557863a4342]
#6 in ./cpp/build/bench/ann/ANN_BENCH: benchmark::internal::LambdaBenchmark<benchmark::RegisterBenchmark<void (&)(benchmark::State&, cuvs::bench::configuration::index const&, unsigned long, std::shared_ptr<cuvs::bench::dataset<float, int> const>, bool), cuvs::bench::configuration::index&, unsigned long&, std::shared_ptr<cuvs::bench::dataset<float, int> const>&, bool&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void (&)(benchmark::State&, cuvs::bench::configuration::index const&, unsigned long, std::shared_ptr<cuvs::bench::dataset<float, int> const>, bool), cuvs::bench::configuration::index&, unsigned long&, std::shared_ptr<cuvs::bench::dataset<float, int> const>&, bool&)::{lambda(benchmark::State&)#1}>::Run(benchmark::State&) +0x54 [0x555786364cc4]
#7 in ./cpp/build/bench/ann/ANN_BENCH(+0xc0721) [0x5557863e7721]
#8 in ./cpp/build/bench/ann/ANN_BENCH(+0xa66b0) [0x5557863cd6b0]
#9 in ./cpp/build/bench/ann/ANN_BENCH(+0xa703d) [0x5557863ce03d]
#10 in ./cpp/build/bench/ann/ANN_BENCH(+0xa7626) [0x5557863ce626]
#11 in ./cpp/build/bench/ann/ANN_BENCH(+0xa886e) [0x5557863cf86e]
#12 in ./cpp/build/bench/ann/ANN_BENCH(+0x87d95) [0x5557863aed95]
#13 in ./cpp/build/bench/ann/ANN_BENCH(+0x8992d) [0x5557863b092d]
#14 in ./cpp/build/bench/ann/ANN_BENCH(+0x89a97) [0x5557863b0a97]
#15 in ./cpp/build/bench/ann/ANN_BENCH: cuvs::bench::run_main(int, char**) +0xfa1 [0x5557863ab411]
#16 in /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7f0a42d4d1ca]
#17 in /usr/lib/x86_64-linux-gnu/libc.so.6: __libc_start_main +0x8b [0x7f0a42d4d28b]
#18 in ./cpp/build/bench/ann/ANN_BENCH: _start +0x2e [0x5557863584de]

Aborted (core dumped)
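The stack trace points at `device_resources_snmg::has_resource_factory()` calling `cudaSetDevice(main_gpu_id_)` on an ordinal that is hidden by `CUDA_VISIBLE_DEVICES=0`: the process sees exactly one device, so any cached physical ordinal >= 1 fails with `cudaErrorInvalidDevice`. A guard of roughly this shape illustrates the needed check; it is a sketch, not the actual raft fix, and `set_device_checked` is a hypothetical helper name.

```cpp
#include <cuda_runtime_api.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: refuse to switch to a GPU ordinal that is not visible
// to this process (e.g. because CUDA_VISIBLE_DEVICES hides it).
void set_device_checked(int device_id) {
  int visible = 0;
  if (cudaGetDeviceCount(&visible) != cudaSuccess || device_id >= visible) {
    throw std::runtime_error("GPU ordinal " + std::to_string(device_id) +
                             " is not visible to this process");
  }
  if (cudaSetDevice(device_id) != cudaSuccess) {
    throw std::runtime_error("cudaSetDevice failed");
  }
}
```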

@jinsolp
Contributor Author

jinsolp commented May 13, 2025

@achirkin I only verified the gtests; let me try running the ANN_BENCH. Thanks for pointing this out : )

@jinsolp
Copy link
Copy Markdown
Contributor Author

jinsolp commented May 14, 2025

Should be fixed now

rapids-bot Bot pushed a commit to rapidsai/raft that referenced this pull request May 14, 2025
The existing `nccl_clique.hpp` assumes NCCL is used whenever multiple GPUs are in play. However, this is not always the case, and we want to generalize the multi-GPU resource by decoupling it from NCCL usage.
This PR is mostly a refactoring of existing code to lazy-load nccl for `device_resources_snmg` resource.

### Dependencies
**The merge of this PR will break cuVS**. Thankfully (and hopefully) it will not break anything else since `nccl_clique.hpp` was recently introduced and is only used for snmg neighbors index related code. 
[This PR in cuVS](rapidsai/cuvs#869) makes the corresponding changes (renaming included header files, etc.) so that this code works with cuVS. There are no functional differences.

Authors:
  - Jinsol Park (https://github.com/jinsolp)

Approvers:
  - Victor Lafargue (https://github.com/viclafargue)
  - Artem M. Chirkin (https://github.com/achirkin)
  - Divye Gala (https://github.com/divyegala)

URL: #2647
@github-actions github-actions Bot removed the CMake label May 14, 2025
@achirkin
Contributor

/merge

@rapids-bot rapids-bot Bot merged commit 2c4e34c into rapidsai:branch-25.06 May 15, 2025
75 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Unstructured Data Processing May 15, 2025
@jinsolp jinsolp deleted the mg-change-for-multi-gpu-resource branch May 15, 2025 06:06

Labels

cpp improvement Improves an existing functionality non-breaking Introduces a non-breaking change

Projects

Status: Done
