
HDBSCAN with NN Descent build option #7339

Merged
rapids-bot[bot] merged 4 commits into rapidsai:main from jinsolp:hdbscan-nn-descent
Oct 23, 2025

Conversation

@jinsolp
Contributor

@jinsolp jinsolp commented Oct 14, 2025

Closes #6836

@jinsolp jinsolp self-assigned this Oct 14, 2025
@jinsolp jinsolp requested review from a team as code owners October 14, 2025 18:12
@jinsolp jinsolp requested a review from viclafargue October 14, 2025 18:12
@jinsolp jinsolp added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change algo: hdbscan labels Oct 14, 2025
@github-actions github-actions Bot added Cython / Python Cython or Python issue CUDA/C++ labels Oct 14, 2025
@jinsolp jinsolp added feature request New feature or request and removed improvement Improvement / enhancement to an existing function labels Oct 14, 2025
Contributor

@viclafargue viclafargue left a comment


Thanks @jinsolp, great work! It looks like we could maybe consolidate the UMAP and HDBSCAN C++ All-neighbors and NN Descent structs. Unless there is a reason we do not expose the intermediate graph degree and termination threshold parameters in HDBSCAN. Also, it looks like the distinction between knn and nnd parameters might prove useful to improve understanding, could be nice if we could do the same for UMAP. Just a bunch of ideas for follow-up PRs. Again, great work!

Comment thread python/cuml/tests/test_hdbscan.py Outdated
random_state=42,
)

umap_handle = None
hdbscan_handle?

@jinsolp
Contributor Author

jinsolp commented Oct 22, 2025

@viclafargue thanks for the review!

like we could maybe consolidate the UMAP and HDBSCAN C++ All-neighbors and NN Descent structs. Unless there is a reason we do not expose the intermediate graph degree and termination threshold parameters in HDBSCAN.

That is a good idea! The reason I don't have intermediate graph degree and termination threshold exposed in HDBSCAN is that they don't affect the results as much as graph degree and max iterations, but we might as well expose them since we're already doing that for UMAP!

Also, it looks like the distinction between knn and nnd parameters might prove useful to improve understanding, could be nice if we could do the same for UMAP.

This is also something I had in mind, so probably will have to deprecate existing parameters if we choose to do so!

@jinsolp
Contributor Author

jinsolp commented Oct 22, 2025

@viclafargue exposed other nn descent parameters to match umap! would be nice if you could take a final look before we merge this 🙂


@greptile-apps greptile-apps Bot left a comment


Greptile Overview

Greptile Summary

This PR adds NN Descent as an alternative KNN graph-building algorithm for HDBSCAN, mirroring the existing UMAP functionality. The implementation introduces a new build_algo parameter (brute_force/nn_descent) with configurable options via build_kwds, enabling faster clustering on large datasets. The change propagates through the stack: C++ headers define the new GRAPH_BUILD_ALGO enum and parameter structs, the runner dispatches to cuVS's all_neighbors API based on build algorithm and data location (device/host), Cython bindings expose the new types, and the Python API validates parameters while auto-adjusting incompatible configurations (e.g., graph_degree >= min_samples + 1). Memory-type selection logic now uses host memory when knn_n_clusters > 1 to support datasets larger than GPU memory via overlapping cluster partitioning.
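The selection logic described in the summary can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from this PR: `select_build_plan` and `BuildPlan` are hypothetical names, and the default of `knn_n_clusters = 1` is assumed. Only `build_algo`, `build_kwds`, `knn_n_clusters`, and the `brute_force`/`nn_descent` values come from the summary above.

```python
from dataclasses import dataclass

# Build algorithms named in the summary above.
VALID_BUILD_ALGOS = ("brute_force", "nn_descent")


@dataclass(frozen=True)
class BuildPlan:
    """Hypothetical container for the resolved graph-build settings."""
    build_algo: str
    memory_type: str  # "host" or "device"
    knn_n_clusters: int


def select_build_plan(build_algo="brute_force", build_kwds=None):
    """Validate build_algo and pick a memory type (illustrative only)."""
    build_kwds = dict(build_kwds or {})
    if build_algo not in VALID_BUILD_ALGOS:
        raise ValueError(
            f"build_algo must be one of {VALID_BUILD_ALGOS}, got {build_algo!r}"
        )

    # Assumption: knn_n_clusters defaults to 1 (no partitioning).
    n_clusters = int(build_kwds.get("knn_n_clusters", 1))
    if n_clusters < 1:
        raise ValueError("knn_n_clusters must be >= 1")

    # Per the summary: keep data on the host when the dataset is split into
    # more than one overlapping cluster, so it may exceed GPU memory.
    memory_type = "host" if n_clusters > 1 else "device"
    return BuildPlan(build_algo, memory_type, n_clusters)
```

For example, `select_build_plan("nn_descent", {"knn_n_clusters": 4})` resolves to host memory, while the defaults resolve to device memory.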

Critical Issues

  1. Compilation Error (cpp/include/cuml/cluster/hdbscan.hpp:138): Missing comma between CLUSTER_SELECTION_METHOD and GRAPH_BUILD_ALGO enum definitions will cause build failure.

  2. Test Validation Gap (test_hdbscan.py:1223-1248): The new test passes build_kwds={"knn_n_clusters": n_clusters, "nnd_graph_degree": 32} to both brute_force and nn_descent algorithms. knn_n_clusters is documented as applying to both, but nnd_graph_degree (NN Descent specific) may be silently ignored for brute_force, reducing test effectiveness. Additionally, the 0.9 ARI threshold is permissive and may not catch subtle regressions.

  3. Parameter Namespace Mismatch Risk (headers.pxd:23-34): Cython declares nn_descent_params_hdbscan and graph_build_params under nested namespace ML::HDBSCAN::Common::graph_build_params, but the C++ header places them under ML::HDBSCAN::Common. Any mismatch will cause silent memory corruption or segfaults at runtime.

  4. User Confusion from Auto-Adjustment (runner.h:113-121): When graph_degree < min_samples + 1, the code silently increases both graph_degree and intermediate_graph_degree (to 2×graph_degree). Users who explicitly set intermediate_graph_degree may be surprised their value is overridden without error.
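The auto-adjustment raised in item 4 could look roughly like the sketch below. It is a pure-Python illustration, not the logic in runner.h: `adjust_graph_degrees` is a hypothetical helper, raising `graph_degree` to exactly `min_samples + 1` is an assumption, and the warning is an addition that the described behavior does not have (the summary says the override is silent).

```python
import warnings


def adjust_graph_degrees(min_samples, graph_degree, intermediate_graph_degree=None):
    """Return (graph_degree, intermediate_graph_degree) after adjustment.

    Illustrative only: mirrors the rule graph_degree >= min_samples + 1 and
    the 2x intermediate degree described in the review summary.
    """
    required = min_samples + 1
    if graph_degree >= required:
        # Nothing to adjust; user-provided values are kept as-is.
        return graph_degree, intermediate_graph_degree

    # Assumption: raise graph_degree to the smallest valid value.
    new_graph_degree = required
    new_intermediate = 2 * new_graph_degree

    if (
        intermediate_graph_degree is not None
        and intermediate_graph_degree != new_intermediate
    ):
        # Unlike the silent override described above, tell the user.
        warnings.warn(
            f"graph_degree raised from {graph_degree} to {new_graph_degree}; "
            f"intermediate_graph_degree overridden from "
            f"{intermediate_graph_degree} to {new_intermediate}",
            UserWarning,
        )
    return new_graph_degree, new_intermediate
```

With `min_samples=40`, `graph_degree=32`, and an explicit `intermediate_graph_degree=48`, this sketch returns `(41, 82)` and emits one warning, which makes the override visible instead of surprising.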

Confidence: 2/5 - The compilation error and potential Cython namespace mismatch require immediate attention before merge. The test coverage gaps and auto-adjustment behavior need validation to ensure correctness.

5 files reviewed, no comments


Contributor

@viclafargue viclafargue left a comment


LGTM!

@jinsolp
Contributor Author

jinsolp commented Oct 23, 2025

/merge

@rapids-bot rapids-bot Bot merged commit 772fb22 into rapidsai:main Oct 23, 2025
197 of 202 checks passed
@jinsolp jinsolp deleted the hdbscan-nn-descent branch October 23, 2025 22:25

Labels

algo: hdbscan CUDA/C++ Cython / Python Cython or Python issue feature request New feature or request non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[TRACKER] HDBSCAN with NN Descent as knn build option

3 participants