Commit 32e9ea6

achirkin authored and enp1s0 committed
ANN_BENCH: integrate NVTX statistics (rapidsai#1529)
Add aggregate reporting of NVTX ranges to the output of the benchmark executable.

### Usage

```bash
# Measure the CPU and GPU runtime of all NVTX ranges
nsys launch --trace=cuda,nvtx <ANN_BENCH with arguments>

# Measure only the CPU runtime of all NVTX ranges
nsys launch --trace=nvtx <ANN_BENCH with arguments>

# Do not measure/report any NVTX ranges
<ANN_BENCH with arguments>

# Do not measure/report any NVTX ranges within benchmark, but use nsys profiling as usual
nsys profile ... <ANN_BENCH with arguments>
```

### Implementation

The PR adds a single module `nvtx_stats.hpp` to the benchmark executable; there are no changes to the library at all.
The program uses the NVIDIA Nsight Systems CLI to collect and export NVTX statistics, and then the SQLite API to aggregate them into the benchmark state:

1. Detect if run via `nsys launch`; if so, call `nsys start` / `nsys stop` around the benchmark loop; otherwise do nothing.
2. If the report is generated, read it and query all NVTX events and the GPU correlation data using SQLite.
3. Aggregate the NVTX events by their short names (without arguments, to reduce the number of columns).
4. Add them to the benchmark performance counters with the same averaging strategy as the global CPU/GPU runtime.

### Performance cost

If the benchmark is **not** run using `nsys launch`, there's virtually zero overhead in the new functionality.
Otherwise, there are overheads:

1. Usual nsys profiling overheads (minimized by disabling unused information via the `nsys start` CLI internally). This affects the reported performance the same way as normal nsys profiling does (especially if CUDA tracing is enabled).
2. One or more data collection/exporting events per benchmark case. These add some extra time to the benchmark run, but do not affect the counters (they are not part of the benchmark loop).

Closes rapidsai#1367

Authors:
- Artem M. Chirkin (https://github.com/achirkin)

Approvers:
- Tamas Bela Feher (https://github.com/tfeher)

URL: rapidsai#1529
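Steps 3 and 4 of the implementation notes (aggregating by short name, then averaging) can be illustrated with a small standalone sketch. This is not the code from `nvtx_stats.hpp`: the `nvtx_event` record, the `short_name` and `aggregate` helpers, the rule that arguments start at the first `(`, and the choice of a per-iteration average are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical record for one NVTX range read from the exported report.
struct nvtx_event {
  std::string name;     // full range name, possibly with arguments
  double duration_sec;  // duration of the range in seconds
};

// Assumption: arguments start at the first '(' in the range name.
// Everything before it (minus trailing spaces) is the short name.
inline std::string_view short_name(std::string_view full)
{
  auto out = full.substr(0, full.find('('));  // npos keeps the whole name
  while (!out.empty() && out.back() == ' ') { out.remove_suffix(1); }
  return out;
}

// Sum the durations per short name, then divide by the number of benchmark
// iterations (an assumed averaging strategy) to get one value per column.
inline std::map<std::string, double> aggregate(const std::vector<nvtx_event>& events,
                                               std::size_t iterations)
{
  std::map<std::string, double> totals;
  for (const auto& e : events) {
    totals[std::string{short_name(e.name)}] += e.duration_sec;
  }
  if (iterations > 0) {
    for (auto& [name, value] : totals) { value /= static_cast<double>(iterations); }
  }
  return totals;
}
```

With this grouping, ranges such as `search(k=10)` and `search(k=20)` collapse into one `search` column, which keeps the number of reported counters small.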
1 parent 7c8645d commit 32e9ea6

4 files changed

Lines changed: 590 additions & 2 deletions

cpp/bench/ann/CMakeLists.txt

Lines changed: 3 additions & 1 deletion
@@ -93,6 +93,7 @@ if(CUVS_ANN_BENCH_USE_HNSWLIB OR CUVS_ANN_BENCH_USE_CUVS_CAGRA_HNSWLIB)
 endif()

 include(cmake/thirdparty/get_nlohmann_json)
+include(cmake/thirdparty/get_sqlite)

 if(CUVS_ANN_BENCH_USE_GGNN)
 include(cmake/thirdparty/get_ggnn)
@@ -144,6 +145,7 @@ function(ConfigureAnnBench)
 ${BENCH_NAME}
 PRIVATE ${ConfigureAnnBench_LINKS}
 nlohmann_json::nlohmann_json
+sqlite3
 Threads::Threads
 $<$<BOOL:${GPU_BUILD}>:CUDA::cudart_static>
 $<TARGET_NAME_IF_EXISTS:OpenMP::OpenMP_CXX>
@@ -358,7 +360,7 @@ if(CUVS_ANN_BENCH_SINGLE_EXE)
 target_include_directories(ANN_BENCH PRIVATE ${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES})

 target_link_libraries(
-ANN_BENCH PRIVATE raft::raft nlohmann_json::nlohmann_json benchmark::benchmark dl
+ANN_BENCH PRIVATE raft::raft nlohmann_json::nlohmann_json sqlite3 benchmark::benchmark dl
 $<$<TARGET_EXISTS:CUDA::nvtx3>:CUDA::nvtx3>
 )
 set_target_properties(

cpp/bench/ann/src/common/benchmark.hpp

Lines changed: 4 additions & 1 deletion
@@ -1,12 +1,13 @@
 /*
- * SPDX-FileCopyrightText: Copyright (c) 2023-2024, NVIDIA CORPORATION.
+ * SPDX-FileCopyrightText: Copyright (c) 2023-2025, NVIDIA CORPORATION.
  * SPDX-License-Identifier: Apache-2.0
  */
 #pragma once

 #include "ann_types.hpp"
 #include "conf.hpp"
 #include "dataset.hpp"
+#include "nvtx_stats.hpp"
 #include "util.hpp"

 #include <benchmark/benchmark.h>
@@ -138,6 +139,7 @@ void bench_build(::benchmark::State& state,

 cuda_timer gpu_timer{algo};
 {
+nvtx_stats nvtx_stats{state};
 nvtx_case nvtx{state.name()};
 /* Note: GPU timing
@@ -293,6 +295,7 @@ void bench_search(::benchmark::State& state,
 auto* distances_ptr = reinterpret_cast<float*>(neighbors_ptr + result_elem_count);

 {
+nvtx_stats nvtx_stats{state};
 nvtx_case nvtx{state.name()};

 std::unique_ptr<algo<T>> a{nullptr};
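Both hunks above declare `nvtx_stats nvtx_stats{state};` as the first object in the benchmark scope, which suggests a scope-guard (RAII) pattern: start collection in the constructor, publish results to the benchmark state in the destructor. The sketch below shows only that pattern. `fake_state` and `scoped_stats` are hypothetical stand-ins, not the real `::benchmark::State` or the real `nvtx_stats` class, and the single `scope_time` counter is an invented placeholder for the aggregated NVTX counters.

```cpp
#include <chrono>
#include <map>
#include <string>

// Hypothetical stand-in for ::benchmark::State: just a named counter map.
struct fake_state {
  std::map<std::string, double> counters;
};

// Hypothetical scope guard; NOT the real nvtx_stats from nvtx_stats.hpp.
// The constructor marks the start of collection, the destructor publishes
// the result into the state when the enclosing scope ends.
class scoped_stats {
  using timer_clock = std::chrono::steady_clock;

 public:
  explicit scoped_stats(fake_state& state) : state_{state}, start_{timer_clock::now()} {}

  // Non-copyable: exactly one object owns the collection window.
  scoped_stats(const scoped_stats&)            = delete;
  scoped_stats& operator=(const scoped_stats&) = delete;

  ~scoped_stats()
  {
    std::chrono::duration<double> elapsed = timer_clock::now() - start_;
    state_.counters["scope_time"] = elapsed.count();  // placeholder counter
  }

 private:
  fake_state& state_;
  timer_clock::time_point start_;
};
```

Because C++ destroys local objects in reverse order of declaration, an object declared before `nvtx_case` is destroyed after it. In the diff this means the per-case NVTX range is already closed when the statistics object runs its destructor, which is presumably why the new line is placed first in the scope.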
