Reduce device memory usage for CAGRA's graph optimization process (2-hop detour counting) #822
Merged
rapids-bot[bot] merged 9 commits into rapidsai:branch-25.06 on May 27, 2025
Conversation
Member
/ok to test

@cjnolet, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
tfeher approved these changes on Apr 23, 2025

tfeher (Contributor) left a comment:
Thanks Akira, it is great to have this feature! I have a few comments and suggestions below.
tfeher requested changes on Apr 23, 2025

tfeher (Contributor) left a comment:
Clicked on the wrong button, I meant to request changes.
tfeher requested changes on May 6, 2025
tfeher approved these changes on May 25, 2025

tfeher (Contributor) left a comment:
Thanks Akira for the updates, the PR looks good to me.
Member
/merge
mythrocks pushed a commit to mythrocks/cuvs that referenced this pull request on Jun 3, 2025:
…hop detour counting) (rapidsai#822)
CAGRA takes the initial kNN graph as input and optimizes it to create a search graph. Several types of processing are performed during graph optimization, the most memory-intensive of which is the counting of 2-hop detours.
Currently, the counting of 2-hop detours is performed on the GPU to speed up processing, which requires that the entire initial kNN graph reside in device memory. In general, the initial kNN graph is 2x the size of the search graph, so in the current implementation roughly half the device memory size is the upper limit on the search graph that can be created. As a result, creating search graphs for huge datasets requires a GPU with a large amount of device memory, which is not practical.
To address this issue, this PR adds a CPU implementation of 2-hop detour counting and falls back to it when device memory is insufficient.
The CPU implementation is thread-parallel and optimized to minimize conditional branches, making it sufficiently fast. It is of course slower than the GPU implementation, but only by a factor of about 3 to 4. Since 2-hop detour counting on the GPU accounts for approximately 10% of total indexing time, overall indexing time increases by 20-30% when the CPU implementation is used, which is well within the practical range.