Commit bd603a9
Fixing small typo in cuvs bench docs (#586)
Authors:
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Ben Frederickson (https://github.com/benfred)

URL: #586
1 parent 6371aa3 commit bd603a9

1 file changed: docs/source/cuvs_bench/index.rst
Lines changed: 3 additions & 192 deletions
@@ -24,16 +24,6 @@ This tool offers several benefits, including
 
 * `Docker`_
 
-- `How benchmarks are run`_
-
-* `Step 1: Prepare the dataset`_
-
-* `Step 2: Build and search index`_
-
-* `Step 3: Data export`_
-
-* `Step 4: Plot the results`_
-
 - `Running the benchmarks`_
 
 * `End-to-end: smaller-scale benchmarks (<1M to 10M)`_
@@ -75,7 +65,7 @@ Conda
     conda activate cuvs_benchmarks
 
     # to install GPU package:
-    conda install -c rapidsai -c conda-forge -c nvidia cuvs-ann-bench=<rapids_version> cuda-version=11.8*
+    conda install -c rapidsai -c conda-forge -c nvidia cuvs-bench=<rapids_version> cuda-version=11.8*
 
     # to install CPU package for usage in CPU-only systems:
     conda install -c rapidsai -c conda-forge cuvs-bench-cpu
@@ -99,7 +89,7 @@ The following command pulls the nightly container for Python version 3.10, CUDA
 
 .. code-block:: bash
 
-    docker pull rapidsai/cuvs-bench:24.12a-cuda12.5-py3.10 #substitute cuvs-bench for the exact desired container.
+    docker pull rapidsai/cuvs-bench:24.12a-cuda12.5-py3.10 # substitute cuvs-bench for the exact desired container.
 
 The CUDA and python versions can be changed for the supported values:
 - Supported CUDA versions: 11.8 and 12.5
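As a usage note not part of the commit itself: once pulled, the image would typically be started with GPU access enabled. A minimal sketch, assuming the tag shown in the hunk above, a host with the NVIDIA Container Toolkit installed, and an assumed in-container datasets path:

```shell
# Illustrative only: build the docker command as a string so the image tag
# and bind mount can be reviewed before anything actually runs.
IMAGE="rapidsai/cuvs-bench:24.12a-cuda12.5-py3.10"
DATA_DIR="$PWD/datasets"  # hypothetical host directory to persist datasets/results
# The mount target /data/benchmarks/datasets is an assumption, not from the diff.
CMD="docker run --gpus all --rm -it -v $DATA_DIR:/data/benchmarks/datasets $IMAGE"
echo "$CMD"  # review, then run manually with: eval "$CMD"
```

The `--gpus all` flag requires the NVIDIA Container Toolkit noted in the diff context below; the bind mount keeps downloaded datasets and benchmark results on the host.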
@@ -112,185 +102,6 @@ You can see the exact versions as well in the dockerhub site:
 
 **Note:** GPU containers use the CUDA toolkit from inside the container, the only requirement is a driver installed on the host machine that supports that version. So, for example, CUDA 11.8 containers can run in systems with a CUDA 12.x capable driver. Please also note that the Nvidia-Docker runtime from the `Nvidia Container Toolkit <https://github.com/NVIDIA/nvidia-docker>`_ is required to use GPUs inside docker containers.
 
-How benchmarks are run
-======================
-
-The `cuvs-bench` package contains lightweight Python scripts to run the benchmarks. There are 4 general steps to running the benchmarks and visualizing the results.
-
-#. Prepare Dataset
-
-#. Build Index and Search Index
-
-#. Data Export
-
-#. Plot Results
-
-Step 1: Prepare the dataset
----------------------------
-
-The script `cuvs_bench.get_dataset` will download and unpack the dataset in directory that the user provides. As of now, only million-scale datasets are supported by this script. For more information on :doc:`datasets and formats <datasets>`.
-
-The usage of this script is:
-
-.. code-block:: bash
-
-    usage: get_dataset.py [-h] [--name NAME] [--dataset-path DATASET_PATH] [--normalize]
-
-    options:
-      -h, --help            show this help message and exit
-      --dataset DATASET     dataset to download (default: glove-100-angular)
-      --dataset-path DATASET_PATH
-                            path to download dataset (default: ${RAPIDS_DATASET_ROOT_DIR})
-      --normalize           normalize cosine distance to inner product (default: False)
-
-When option `normalize` is provided to the script, any dataset that has cosine distances
-will be normalized to inner product. So, for example, the dataset `glove-100-angular`
-will be written at location `datasets/glove-100-inner/`.
-
-Step 2: Build and search index
-------------------------------
-
-The script `cuvs_bench.run` will build and search indices for a given dataset and its
-specified configuration.
-
-The usage of the script `cuvs_bench.run` is:
-
-.. code-block:: bash
-
-    usage: __main__.py [-h] [--subset-size SUBSET_SIZE] [-k COUNT] [-bs BATCH_SIZE] [--dataset-configuration DATASET_CONFIGURATION] [--configuration CONFIGURATION] [--dataset DATASET]
-                       [--dataset-path DATASET_PATH] [--build] [--search] [--algorithms ALGORITHMS] [--groups GROUPS] [--algo-groups ALGO_GROUPS] [-f] [-m SEARCH_MODE]
-
-    options:
-      -h, --help            show this help message and exit
-      --subset-size SUBSET_SIZE
-                            the number of subset rows of the dataset to build the index (default: None)
-      -k COUNT, --count COUNT
-                            the number of nearest neighbors to search for (default: 10)
-      -bs BATCH_SIZE, --batch-size BATCH_SIZE
-                            number of query vectors to use in each query trial (default: 10000)
-      --dataset-configuration DATASET_CONFIGURATION
-                            path to YAML configuration file for datasets (default: None)
-      --configuration CONFIGURATION
-                            path to YAML configuration file or directory for algorithms Any run groups found in the specified file/directory will automatically override groups of the same name
-                            present in the default configurations, including `base` (default: None)
-      --dataset DATASET     name of dataset (default: glove-100-inner)
-      --dataset-path DATASET_PATH
-                            path to dataset folder, by default will look in RAPIDS_DATASET_ROOT_DIR if defined, otherwise a datasets subdirectory from the calling directory (default:
-                            os.getcwd()/datasets/)
-      --build
-      --search
-      --algorithms ALGORITHMS
-                            run only comma separated list of named algorithms. If parameters `groups` and `algo-groups` are both undefined, then group `base` is run by default (default: None)
-      --groups GROUPS       run only comma separated groups of parameters (default: base)
-      --algo-groups ALGO_GROUPS
-                            add comma separated <algorithm>.<group> to run. Example usage: "--algo-groups=cuvs_cagra.large,hnswlib.large" (default: None)
-      -f, --force           re-run algorithms even if their results already exist (default: False)
-      -m SEARCH_MODE, --search-mode SEARCH_MODE
-                            run search in 'latency' (measure individual batches) or 'throughput' (pipeline batches and measure end-to-end) mode (default: throughput)
-      -t SEARCH_THREADS, --search-threads SEARCH_THREADS
-                            specify the number threads to use for throughput benchmark. Single value or a pair of min and max separated by ':'. Example --search-threads=1:4. Power of 2 values between 'min' and 'max' will be used. If only 'min' is
-                            specified, then a single test is run with 'min' threads. By default min=1, max=<num hyper threads>. (default: None)
-      -r, --dry-run         dry-run mode will convert the yaml config for the specified algorithms and datasets to the json format that's consumed by the lower-level c++ binaries and then print the command to run execute the benchmarks but
-                            will not actually execute the command. (default: False)
-
-`dataset`: name of the dataset to be searched in `datasets.yaml`_
-
-`dataset-configuration`: optional filepath to custom dataset YAML config which has an entry for arg `dataset`
-
-`configuration`: optional filepath to YAML configuration for an algorithm or to directory that contains YAML configurations for several algorithms. Refer to `Dataset.yaml config`_ for more info.
-
-`algorithms`: runs all algorithms that it can find in YAML configs found by `configuration`. By default, only `base` group will be run.
-
-`groups`: run only specific groups of parameters configurations for an algorithm. Groups are defined in YAML configs (see `configuration`), and by default run `base` group
-
-`algo-groups`: this parameter is helpful to append any specific algorithm+group combination to run the benchmark for in addition to all the arguments from `algorithms` and `groups`. It is of the format `<algorithm>.<group>`, or for example, `cuvs_cagra.large`
-
-For every algorithm run by this script, it outputs an index build statistics JSON file in `<dataset-path/<dataset>/result/build/<{algo},{group}.json>`
-and an index search statistics JSON file in `<dataset-path/<dataset>/result/search/<{algo},{group},k{k},bs{batch_size}.json>`. NOTE: The filenames will not have ",{group}" if `group = "base"`.
-
-For every algorithm run by this script, it outputs an index build statistics JSON file in `<dataset-path/<dataset>/result/build/<{algo},{group}.json>`
-and an index search statistics JSON file in `<dataset-path/<dataset>/result/search/<{algo},{group},k{k},bs{batch_size}.json>`. NOTE: The filenames will not have ",{group}" if `group = "base"`.
-
-`dataset-path` :
-#. data is read from `<dataset-path>/<dataset>`
-#. indices are built in `<dataset-path>/<dataset>/index`
-#. build/search results are stored in `<dataset-path>/<dataset>/result`
-
-`build` and `search` : if both parameters are not supplied to the script then it is assumed both are `True`.
-
-`indices` and `algorithms` : these parameters ensure that the algorithm specified for an index is available in `algos.yaml` and not disabled, as well as having an associated executable.
-
-Step 3: Data export
--------------------
-
-The script `cuvs_bench.data_export` will convert the intermediate JSON outputs produced by `cuvs_bench.run` to more easily readable CSV files, which are needed to build charts made by `cuvs_bench.plot`.
-
-.. code-block:: bash
-
-    usage: data_export.py [-h] [--dataset DATASET] [--dataset-path DATASET_PATH]
-
-    options:
-      -h, --help            show this help message and exit
-      --dataset DATASET     dataset to download (default: glove-100-inner)
-      --dataset-path DATASET_PATH
-                            path to dataset folder (default: ${RAPIDS_DATASET_ROOT_DIR})
-
-Build statistics CSV file is stored in `<dataset-path/<dataset>/result/build/<{algo},{group}.csv>`
-and index search statistics CSV file in `<dataset-path/<dataset>/result/search/<{algo},{group},k{k},bs{batch_size},{suffix}.csv>`, where suffix has three values:
-#. `raw`: All search results are exported
-#. `throughput`: Pareto frontier of throughput results is exported
-#. `latency`: Pareto frontier of latency results is exported
-
-Step 4: Plot the results
-------------------------
-
-The script `cuvs_bench.plot` will plot results for all algorithms found in index search statistics CSV files `<dataset-path/<dataset>/result/search/*.csv`.
-
-The usage of this script is:
-
-.. code-block:: bash
-
-    usage: [-h] [--dataset DATASET] [--dataset-path DATASET_PATH] [--output-filepath OUTPUT_FILEPATH] [--algorithms ALGORITHMS] [--groups GROUPS] [--algo-groups ALGO_GROUPS]
-           [-k COUNT] [-bs BATCH_SIZE] [--build] [--search] [--x-scale X_SCALE] [--y-scale {linear,log,symlog,logit}] [--x-start X_START] [--mode {throughput,latency}]
-           [--time-unit {s,ms,us}] [--raw]
-
-    options:
-      -h, --help            show this help message and exit
-      --dataset DATASET     dataset to plot (default: glove-100-inner)
-      --dataset-path DATASET_PATH
-                            path to dataset folder (default: /home/coder/cuvs/datasets/)
-      --output-filepath OUTPUT_FILEPATH
-                            directory for PNG to be saved (default: /home/coder/cuvs)
-      --algorithms ALGORITHMS
-                            plot only comma separated list of named algorithms. If parameters `groups` and `algo-groups` are both undefined, then group `base` is plot by default
-                            (default: None)
-      --groups GROUPS       plot only comma separated groups of parameters (default: base)
-      --algo-groups ALGO_GROUPS
-                            add comma separated <algorithm>.<group> to plot. Example usage: "--algo-groups=cuvs_cagra.large,hnswlib.large" (default: None)
-      -k COUNT, --count COUNT
-                            the number of nearest neighbors to search for (default: 10)
-      -bs BATCH_SIZE, --batch-size BATCH_SIZE
-                            number of query vectors to use in each query trial (default: 10000)
-      --build
-      --search
-      --x-scale X_SCALE     Scale to use when drawing the X-axis. Typically linear, logit or a2 (default: linear)
-      --y-scale {linear,log,symlog,logit}
-                            Scale to use when drawing the Y-axis (default: linear)
-      --x-start X_START     Recall values to start the x-axis from (default: 0.8)
-      --mode {throughput,latency}
-                            search mode whose Pareto frontier is used on the y-axis (default: throughput)
-      --time-unit {s,ms,us}
-                            time unit to plot when mode is latency (default: ms)
-      --raw                 Show raw results (not just Pareto frontier) of mode arg (default: False)
-
-`mode`: plots pareto frontier of `throughput` or `latency` results exported in the previous step
-
-`algorithms`: plots all algorithms that it can find results for the specified `dataset`. By default, only `base` group will be plotted.
-
-`groups`: plot only specific groups of parameters configurations for an algorithm. Groups are defined in YAML configs (see `configuration`), and by default run `base` group
-
-`algo-groups`: this parameter is helpful to append any specific algorithm+group combination to plot results for in addition to all the arguments from `algorithms` and `groups`. It is of the format `<algorithm>.<group>`, or for example, `cuvs_cagra.large`
-
 Running the benchmarks
 ======================

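The section removed in the hunk above walked through four steps: prepare the dataset, build and search the index, export the data, and plot the results. As a consolidated sketch of that workflow using the module entry points named in the removed text (dataset names and paths are illustrative, and the flags are taken from the help output shown above):

```shell
# Sketch of the four-step workflow described in the removed section.
# Assumes a cuvs-bench install; commands are echoed for review rather than
# executed here -- drop the leading 'echo' to actually run each step.
DATASET_PATH="$PWD/datasets"
# Step 1: download and normalize (glove-100-angular is written as glove-100-inner)
echo python -m cuvs_bench.get_dataset --dataset glove-100-angular --normalize --dataset-path "$DATASET_PATH"
# Step 2: build and search indices for the chosen algorithm group
echo python -m cuvs_bench.run --dataset glove-100-inner --dataset-path "$DATASET_PATH" --algorithms cuvs_cagra
# Step 3: convert the intermediate JSON results to CSV
echo python -m cuvs_bench.data_export --dataset glove-100-inner --dataset-path "$DATASET_PATH"
# Step 4: plot the CSV results
echo python -m cuvs_bench.plot --dataset glove-100-inner --dataset-path "$DATASET_PATH"
```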
@@ -576,7 +387,7 @@ Creating and customizing dataset configurations
 
 A single configuration will often define a set of algorithms, with associated index and search parameters, that can be generalize across datasets. We use YAML to define dataset specific and algorithm specific configurations.
 
-A default `datasets.yaml` is provided by CUVS in `${CUVS_HOME}/python/cuvs-ann-bench/src/cuvs_bench/run/conf` with configurations available for several datasets. Here's a simple example entry for the `sift-128-euclidean` dataset:
+A default `datasets.yaml` is provided by CUVS in `${CUVS_HOME}/python/cuvs_bench/src/cuvs_bench/run/conf` with configurations available for several datasets. Here's a simple example entry for the `sift-128-euclidean` dataset:
 
 .. code-block:: yaml