HDBSCAN with NN Descent build option #7339
Conversation
viclafargue
left a comment
Thanks @jinsolp, great work! It looks like we could maybe consolidate the UMAP and HDBSCAN C++ all-neighbors and NN Descent structs, unless there is a reason we do not expose the intermediate graph degree and termination threshold parameters in HDBSCAN. Also, the distinction between knn and nnd parameters might help readability; it would be nice to do the same for UMAP. Just a bunch of ideas for follow-up PRs. Again, great work!
@viclafargue thanks for the review!
That is a good idea! The reason I don't have intermediate graph degree and termination threshold exposed in HDBSCAN is that they don't affect the results as much as graph degree and max iterations, but we might as well expose them since we're already doing that for UMAP!
This is also something I had in mind, so we will probably have to deprecate the existing parameters if we choose to do so!
|
@viclafargue exposed the other NN Descent parameters to match UMAP! Would be nice if you could take a final look before we merge this 🙂
Greptile Overview
Greptile Summary
This PR adds NN Descent as an alternative KNN graph-building algorithm for HDBSCAN, mirroring the existing UMAP functionality. The implementation introduces a new `build_algo` parameter (`brute_force`/`nn_descent`) with configurable options via `build_kwds`, enabling faster clustering on large datasets. The change propagates through the stack:
- C++ headers define the new `GRAPH_BUILD_ALGO` enum and parameter structs,
- the runner dispatches to cuVS's all-neighbors API based on build algorithm and data location (device/host),
- Cython bindings expose the new types, and
- the Python API validates parameters while auto-adjusting incompatible configurations (e.g., `graph_degree >= min_samples + 1`).

Memory-type selection logic now uses host memory when `knn_n_clusters > 1` to support datasets larger than GPU memory via overlapping cluster partitioning.
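The memory-type selection described above can be illustrated with a small standalone sketch. This is not cuML's actual code; the function name and return values are hypothetical, and only the decision rule (host memory once `knn_n_clusters > 1`) follows the summary.

```python
# Illustrative sketch (not cuML's implementation) of the memory-type
# selection logic: host memory is chosen when knn_n_clusters > 1 so the
# dataset can exceed GPU memory via overlapping cluster partitioning.
def select_memory_type(knn_n_clusters: int) -> str:
    if knn_n_clusters > 1:
        return "host"    # batched graph build over data partitions
    return "device"      # whole dataset resident on the GPU


print(select_memory_type(1))  # single partition: keep data on device
print(select_memory_type(4))  # partitioned build: stage data in host memory
```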
Critical Issues
-
Compilation Error (cpp/include/cuml/cluster/hdbscan.hpp:138): Missing comma between
CLUSTER_SELECTION_METHODandGRAPH_BUILD_ALGOenum definitions will cause build failure. -
Test Validation Gap (test_hdbscan.py:1223-1248): The new test passes
build_kwds={"knn_n_clusters": n_clusters, "nnd_graph_degree": 32}to both brute_force and nn_descent algorithms.knn_n_clustersis documented as applying to both, butnnd_graph_degree(NN Descent specific) may be silently ignored for brute_force, reducing test effectiveness. Additionally, the 0.9 ARI threshold is permissive and may not catch subtle regressions. -
Parameter Namespace Mismatch Risk (headers.pxd:23-34): Cython declares
nn_descent_params_hdbscanandgraph_build_paramsunder nested namespaceML::HDBSCAN::Common::graph_build_params, but the C++ header places them underML::HDBSCAN::Common. Any mismatch will cause silent memory corruption or segfaults at runtime. -
User Confusion from Auto-Adjustment (runner.h:113-121): When
graph_degree < min_samples + 1, the code silently increases bothgraph_degreeandintermediate_graph_degree(to 2×graph_degree). Users who explicitly setintermediate_graph_degreemay be surprised their value is overridden without error.
Confidence: 2/5 - The compilation error and potential Cython namespace mismatch require immediate attention before merge. The test coverage gaps and auto-adjustment behavior need validation to ensure correctness.
5 files reviewed, no comments
/merge
Closes #6836