
Reduce device memory usage for CAGRA's graph optimization process (2-hop detour counting)#822

Merged
rapids-bot[bot] merged 9 commits into rapidsai:branch-25.06 from anaruse:branch-25.06.improve_graph_optimization
May 27, 2025
Conversation

Contributor

@anaruse anaruse commented Apr 15, 2025

CAGRA takes the initial knn graph as input and optimizes it to create a search graph. Several types of processing are performed in the graph optimization, the most memory-intensive of which is the counting of 2-hop detours.

Currently, the counting of 2-hop detours is performed on the GPU to speed up processing, and this requires that the entire initial knn graph be placed in device memory. In general, the initial knn graph is 2x the size of the search graph, so in the current implementation the search graph that can be created is limited to roughly half the device memory size. As a result, creating search graphs for huge datasets requires a GPU with a large amount of device memory, which is not practical.
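To make the memory limit concrete, here is a minimal back-of-the-envelope sketch (the dataset size and degree below are hypothetical examples, not taken from the PR):

```python
def knn_graph_bytes(n_rows: int, graph_degree: int, idx_bytes: int = 4) -> int:
    """Device memory needed to hold a knn graph: one index per edge,
    n_rows rows of graph_degree neighbors each."""
    return n_rows * graph_degree * idx_bytes

# Hypothetical example: 100M vectors, search-graph degree 64, 4-byte indices.
search_graph = knn_graph_bytes(100_000_000, 64)       # 25.6 GB
initial_graph = knn_graph_bytes(100_000_000, 2 * 64)  # 51.2 GB (the 2x input)
```

With the initial graph at twice the search graph's size, the GPU-only path needs the full 51.2 GB resident in device memory, which already exceeds most single GPUs.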

To address this issue, this PR adds a CPU implementation of 2-hop detour counting and uses this CPU implementation to count 2-hop detours when device memory is insufficient.

The CPU implementation supports thread parallelism, is optimized to reduce conditional branches, and is sufficiently fast. It is slower than the GPU implementation, taking about 3 to 4 times as long to count 2-hop detours. Since counting 2-hop detours on the GPU is approximately 10% of the total indexing time, the overall time increases by 20-30% when the CPU implementation is used, which is well within the practical range.
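For readers unfamiliar with the operation, here is a minimal sketch of what "2-hop detour counting" means: for each directed edge (u, v) in the knn graph, count the intermediate nodes w such that edges (u, w) and (w, v) both exist, i.e. the 2-hop paths that could substitute for the direct edge. This is a simplification for illustration only; the actual CAGRA rule in `graph_core.cuh` also takes neighbor ranks/distances into account when deciding which detours count, and the function name below is hypothetical.

```python
import numpy as np

def count_two_hop_detours(knn_graph: np.ndarray) -> np.ndarray:
    """For each edge (u, v) in a (n_rows x degree) knn graph, count the
    neighbors w of u (w != v) that also have v in their neighbor list.
    Simplified illustration of 2-hop detour counting."""
    n, k = knn_graph.shape
    counts = np.zeros((n, k), dtype=np.int32)
    # Precompute neighbor sets for O(1) membership tests.
    neighbor_sets = [set(row) for row in knn_graph]
    for u in range(n):
        for j, v in enumerate(knn_graph[u]):
            counts[u, j] = sum(
                1 for w in knn_graph[u] if w != v and v in neighbor_sets[w]
            )
    return counts
```

Edges with many detours are the prunable ones during graph optimization. The real implementation parallelizes the outer loop over threads (on CPU) or thread blocks (on GPU), which works because each row of `counts` is written independently.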

@anaruse anaruse requested a review from a team as a code owner April 15, 2025 09:30

copy-pr-bot Bot commented Apr 15, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the cpp label Apr 15, 2025
@anaruse anaruse changed the title Reduce device memory usage for CAGRA's graph optimization process Reduce device memory usage for CAGRA's graph optimization process (1) Apr 21, 2025
@anaruse anaruse changed the title Reduce device memory usage for CAGRA's graph optimization process (1) Reduce device memory usage for CAGRA's graph optimization process (2-hop detour counting) Apr 21, 2025
@cjnolet cjnolet added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Apr 22, 2025
@cjnolet cjnolet moved this to In Progress in Unstructured Data Processing Apr 22, 2025
Member

cjnolet commented Apr 22, 2025

/ok to test


copy-pr-bot Bot commented Apr 22, 2025

/ok to test

@cjnolet, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

Contributor

@tfeher tfeher left a comment


Thanks Akira, it is great to have this feature! I have a few comments and suggestions below.

Comment thread cpp/src/neighbors/detail/cagra/graph_core.cuh Outdated
Comment thread cpp/src/neighbors/detail/cagra/graph_core.cuh
Comment thread cpp/src/neighbors/detail/cagra/graph_core.cuh Outdated
Comment thread cpp/src/neighbors/detail/cagra/graph_core.cuh Outdated
Comment thread cpp/src/neighbors/detail/cagra/graph_core.cuh Outdated
Comment thread cpp/src/neighbors/detail/cagra/graph_core.cuh Outdated
Contributor

@tfeher tfeher left a comment


Clicked on the wrong button, I meant to request changes.

Contributor

@tfeher tfeher left a comment


Thanks @anaruse for the updates! I have a few more requests below.

Comment thread cpp/src/neighbors/detail/cagra/graph_core.cuh Outdated
Comment thread cpp/src/neighbors/detail/cagra/graph_core.cuh Outdated
Comment thread cpp/src/neighbors/detail/cagra/graph_core.cuh Outdated
Comment thread cpp/src/neighbors/detail/cagra/cagra_build.cuh
Contributor

@tfeher tfeher left a comment


Thanks Akira for the updates, the PR looks good to me.

Member

cjnolet commented May 27, 2025

/merge

@rapids-bot rapids-bot Bot merged commit c62666e into rapidsai:branch-25.06 May 27, 2025
75 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Unstructured Data Processing May 27, 2025
mythrocks pushed a commit to mythrocks/cuvs that referenced this pull request Jun 3, 2025
…hop detour counting) (rapidsai#822)

Authors:
  - Akira Naruse (https://github.com/anaruse)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: rapidsai#822
Labels

cpp, improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change)

Projects

Status: Done

3 participants