Reduce device memory usage for CAGRA's graph optimization process (2-hop detour counting) #822
Merged
rapids-bot[bot] merged 9 commits into rapidsai:branch-25.06 on May 27, 2025
Conversation
Member
/ok to test

@cjnolet, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
tfeher approved these changes on Apr 23, 2025

tfeher (Contributor) left a comment:
Thanks Akira, it is great to have this feature! I have a few comments and suggestions below.
tfeher requested changes on Apr 23, 2025

tfeher (Contributor) left a comment:
Clicked on the wrong button, I meant to request changes.
tfeher requested changes on May 6, 2025
tfeher approved these changes on May 25, 2025

tfeher (Contributor) left a comment:
Thanks Akira for the updates, the PR looks good to me.
Member
/merge
mythrocks pushed a commit to mythrocks/cuvs that referenced this pull request on Jun 3, 2025:
…hop detour counting) (rapidsai#822)
CAGRA takes the initial kNN graph as input and optimizes it to create a search graph. Several types of processing are performed during graph optimization, the most memory-intensive of which is the counting of 2-hop detours.
Currently, the counting of 2-hop detours is performed on the GPU to speed up processing, which requires that the entire initial kNN graph reside in device memory. In general, the initial kNN graph is 2x the size of the search graph, so in the current implementation roughly half the device memory size is the upper limit on the search graph that can be created. As a result, creating search graphs for huge datasets requires a GPU with a large amount of device memory, which is not practical.
To address this issue, this PR adds a CPU implementation of 2-hop detour counting and falls back to it when device memory is insufficient.
The CPU implementation is thread-parallel and optimized to minimize conditional branches, making it sufficiently fast. It is of course slower than the GPU implementation, but only by a factor of about 3 to 4. Since 2-hop detour counting on the GPU accounts for approximately 10% of total indexing time, overall indexing time increases by 20-30% when the CPU implementation is used, which is well within the practical range.