Add NearestNeighbors.radius_neighbors_graph#7811
Add NearestNeighbors.radius_neighbors_graph#7811rapids-bot[bot] merged 5 commits intorapidsai:mainfrom
NearestNeighbors.radius_neighbors_graph#7811Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
b7ca26d to
b0cae30
Compare
NearestNeighbors.radius_neighbors_graphNearestNeighbors.radius_neighbors_graph
📝 WalkthroughSummary by CodeRabbit
WalkthroughAdds GPU radius-based neighbor-graph support: new C++/CUDA API Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Tip Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/include/cuml/neighbors/knn.hpp`:
- Around line 74-82: Add a complete Doxygen comment block for the new public API
function rbc_radius_neighbors_graph describing purpose, parameters,
return/outputs and the two-phase semantics: explain that when adj_cols==nullptr
and nnz==0 it performs a count-only pass writing per-row counts to adj_rows, and
on the fill pass callers must allocate adj_cols with size nnz and pass the same
adj_rows to receive neighbor column indices; document meanings and units of
handle, rbc_index, query, n_query, dim, radius, adj_rows, adj_cols, and nnz, any
preconditions (e.g., non-null pointers, how nnz is interpreted), and reference
similar documented functions like brute_force_knn / rbc_free_index for style and
phrasing so the header is consistent with existing Doxygen conventions.
In `@python/cuml/cuml/accel/_wrappers/sklearn/neighbors.py`:
- Around line 23-36: In _gpu_radius_neighbors_graph, accessing
self.effective_metric_ can raise AttributeError on unfitted proxies and bypass
the ProxyBase CPU-fallback; wrap the metric check to first guard for a fitted
state (e.g., use hasattr(self, "effective_metric_") or getattr(self,
"effective_metric_", None)) and if it's missing raise UnsupportedOnGPU with a
clear message (e.g., "estimator is not fitted" or similar) so
ProxyBase._call_method will catch it and fall back to the CPU path; update the
check in the _gpu_radius_neighbors_graph method accordingly.
In `@python/cuml/cuml/neighbors/nearest_neighbors.pyx`:
- Around line 350-353: The connectivity data array is created with cp.ones(nnz)
which yields float64 by default, causing a dtype mismatch with kneighbors_graph
(which uses np.float32); change the data creation in nearest_neighbors.pyx to
cp.ones(nnz, dtype=cp.float32) (or cp.float32) so the returned
cupyx.scipy.sparse.csr_matrix has float32 connectivity values and matches
kneighbors_graph's dtype.
- Around line 1165-1167: Update the radius_neighbors_graph signature to include
the sklearn-compatible parameters mode='connectivity' and sort_results=False
(i.e., def radius_neighbors_graph(self, X=None, radius=None,
mode='connectivity', sort_results=False) -> SparseCumlArray) so callers passing
those kwargs won't get TypeError; inside the method (radius_neighbors_graph)
validate mode and raise a clear error if mode != 'connectivity' (or map behavior
if you implement others), and accept sort_results (no-op if unsupported) so API
matches scikit-learn while preserving existing connectivity-only behavior.
- Around line 304-354: The radius_neighbors_graph method has a race: after the
first with nogil call to rbc_radius_neighbors_graph (which populates indptr) the
code reads indptr[-1].item() without synchronizing the RAFT handle; add a call
to handle.sync() (using the same handle obtained via get_handle()) immediately
after the first with nogil block and before computing nnz to ensure the GPU work
writing indptr is finished; follow the same pattern used in RBCIndex.build and
RBCIndex.kneighbors to locate where to insert handle.sync() in
radius_neighbors_graph.
In `@python/cuml/tests/test_nearest_neighbors.py`:
- Around line 560-582: The test declares metric but never passes it to the KNN
models, so haversine cases are not exercised; update the
test_radius_neighbors_graph to pass metric=metric into both skKNN and cuKNN when
constructing sk_model and cu_model, and add a guard to skip the haversine branch
for non-2D inputs (e.g., if metric == "haversine" and n_features != 2:
pytest.skip(...)) so the test only runs haversine with 2 features.
b0cae30 to
dfae43f
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
python/cuml/cuml/neighbors/nearest_neighbors.pyx (1)
304-316: Unused variablen_query; consider removing.
n_queryis assigned on Line 312 but never used —n_rows(Line 316, same value) is what's actually passed to the C call. This is a harmless leftover but adds noise.- cdef int64_t n_query = X.shape[0] - indptr = cp.empty(n_query + 1, dtype=np.int64) + indptr = cp.empty(n_rows + 1, dtype=np.int64)#!/bin/bash # Verify n_query is not used elsewhere in the method rg -n 'n_query' python/cuml/cuml/neighbors/nearest_neighbors.pyx | head -20🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/cuml/neighbors/nearest_neighbors.pyx` around lines 304 - 316, The variable n_query in the radius_neighbors_graph function is unused and duplicates n_rows; remove the declaration/assignment "cdef int64_t n_query = X.shape[0]" and any references to n_query so the code only uses n_rows (or consolidate to n_query if you prefer that name) to avoid the dead variable while leaving the rest of the function (handle setup, indptr creation, X_ptr, n_rows usage) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@python/cuml/cuml/neighbors/nearest_neighbors.pyx`:
- Around line 356-361: Add the same dimension guard used in RBCIndex.kneighbors
to RBCIndex.radius_neighbors_graph: check X.shape[1] and if it's > 3 raise a
ValueError with the same message ("The rbc algorithm is not supported for >3
dimensions currently.") so calls that invoke the underlying ball_cover logic are
prevented from running with unsupported dimensionality; update the
RBCIndex.radius_neighbors_graph method to perform this check before any
ball_cover operations.
---
Duplicate comments:
In `@python/cuml/cuml/accel/_wrappers/sklearn/neighbors.py`:
- Around line 32-36: Don't access self.effective_metric_ directly; guard that
attribute before the metric check in the radius_neighbors_graph wrapper so an
unfitted proxy doesn't raise AttributeError unexpectedly. Replace the direct
access in the check (the code that currently does if self.effective_metric_ not
in ["l2","euclidean"]) with a presence-safe check such as using hasattr(self,
"effective_metric_") or metric = getattr(self, "effective_metric_", None) and
only perform the UnsupportedOnGPU check when metric is present; leave behavior
unchanged for the call to self._gpu.radius_neighbors_graph(X=..., radius=...)
and ensure ProxyBase._call_method can still catch unfitted-state errors.
In `@python/cuml/cuml/neighbors/nearest_neighbors.pyx`:
- Line 350: The array created for connectivity currently uses cp.ones(nnz) which
defaults to float64 and conflicts with kneighbors_graph's use of
dtype=np.float32; change the creation of data to explicitly use float32 (e.g.,
cp.ones(nnz, dtype=cp.float32) or equivalent) so the dtype matches
kneighbors_graph and avoids inconsistencies between the two graph methods.
- Around line 1165-1167: The cuML method radius_neighbors_graph lacks the
sklearn-compatible parameters mode and sort_results, causing unexpected-argument
errors; update the radius_neighbors_graph signature in nearest_neighbors.pyx to
accept mode='connectivity' and sort_results=False (matching sklearn defaults) in
addition to X and radius, then thread these parameters into the existing
implementation or validation logic used by the accel wrapper (ensure the
function forwards them to the same internal checks/dispatch that lives in
neighbors.py or handles them the same way), and maintain the existing decorators
(`@insert_into_docstring`, `@reflect`) and return type SparseCumlArray.
- Around line 320-333: The GPU-written indptr is read on the host without
synchronizing the RAFT/handle stream; after the first with nogil block that
calls rbc_radius_neighbors_graph (and likewise after the second with nogil that
writes device outputs) insert a handle synchronization before any host reads of
indptr (e.g., call handle_[0].sync() or the project’s RAFT handle.sync()
equivalent) so that the device kernel completes before executing
indptr[-1].item() and any subsequent host-side transfers.
---
Nitpick comments:
In `@python/cuml/cuml/neighbors/nearest_neighbors.pyx`:
- Around line 304-316: The variable n_query in the radius_neighbors_graph
function is unused and duplicates n_rows; remove the declaration/assignment
"cdef int64_t n_query = X.shape[0]" and any references to n_query so the code
only uses n_rows (or consolidate to n_query if you prefer that name) to avoid
the dead variable while leaving the rest of the function (handle setup, indptr
creation, X_ptr, n_rows usage) unchanged.
This adds support for `NearestNeighbors.radius_neighbors_graph`, built on `cuvs::neighbors::ball_cover::eps_nn`.
dfae43f to
cee586c
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
python/cuml/cuml/neighbors/nearest_neighbors.pyx (1)
312-316: Optional:n_rowsduplicatesn_query— one can be removed.Both
n_query(Line 312) andn_rows(Line 316) holdX.shape[0]. The rest of the method usesn_rowswhilen_queryis only used forindptrallocation. Collapsing to a single variable reduces cognitive load.♻️ Suggested simplification
- cdef int64_t n_query = X.shape[0] - indptr = cp.empty(n_query + 1, dtype=np.int64) cdef float* X_ptr = <float*><uintptr_t>X.ptr cdef int64_t n_rows = X.shape[0] + cdef int64_t n_query = n_rows cdef int64_t n_cols = X.shape[1]or simply use
n_rowseverywhere and dropn_query.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/cuml/neighbors/nearest_neighbors.pyx` around lines 312 - 316, n_query duplicates n_rows; remove the redundant variable and use n_rows everywhere: delete the cdef int64_t n_query declaration and change the indptr allocation to use n_rows + 1 (indptr = cp.empty(n_rows + 1, dtype=np.int64)); ensure any other uses of n_query are replaced with n_rows (the method currently uses n_rows elsewhere) so there is a single authoritative row-count variable in nearest_neighbors.pyx.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/include/cuml/neighbors/knn.hpp`:
- Around line 79-82: Update the Doxygen comment that incorrectly uses Python
negative-index notation: replace occurrences of adj_indptr[-1] with the
C++-appropriate adj_indptr[n_query] (since adj_indptr has size n_query+1) and
ensure the explanation for computing nnz uses adj_indptr[n_query]; check the
related mentions near the adj_indices and nnz description (the same pattern
appears again around the later comment block) and update them to the same
adj_indptr[n_query] form.
In `@python/cuml/cuml/neighbors/nearest_neighbors.pyx`:
- Around line 1274-1275: The current guard if radius <= 0 in
nearest_neighbors.pyx doesn't reject NaN or Inf; update the check to validate
finiteness and positivity (e.g., import math and replace the guard with if not
math.isfinite(radius) or radius <= 0: raise ValueError(...)) so NaN/inf are
rejected before hitting the C++ layer; ensure the error message mentions
invalid/non-finite radius and keep the check around the same scope where the
variable radius is validated.
In `@python/cuml/tests/test_nearest_neighbors.py`:
- Around line 600-602: Add a second assertion in the test to exercise the
zero-radius branch: call model.radius_neighbors_graph(radius=0) and assert it
raises the same ValueError with message "Expected `radius > 0`, got 0" (or use
match "Expected `radius > 0`" to be less brittle); this ensures the
radius_neighbors_graph guard against radius <= 0 is covered. Reference the
existing test block around model.radius_neighbors_graph to insert the new
pytest.raises check for radius=0.
---
Nitpick comments:
In `@python/cuml/cuml/neighbors/nearest_neighbors.pyx`:
- Around line 312-316: n_query duplicates n_rows; remove the redundant variable
and use n_rows everywhere: delete the cdef int64_t n_query declaration and
change the indptr allocation to use n_rows + 1 (indptr = cp.empty(n_rows + 1,
dtype=np.int64)); ensure any other uses of n_query are replaced with n_rows (the
method currently uses n_rows elsewhere) so there is a single authoritative
row-count variable in nearest_neighbors.pyx.
viclafargue
left a comment
There was a problem hiding this comment.
Thanks! LGTM, just minor questions. Thanks for removing radius search from KNN classifier and regressor. I think that NearestNeighborsMG also inherits from the NearestNeighbor class too, but this is not directly exposed to user, so less of a concern.
| cdef int64_t n_query = X.shape[0] | ||
|
|
||
| indptr = cp.empty(n_query + 1, dtype=np.int64) | ||
| cdef float* X_ptr = <float*><uintptr_t>X.ptr | ||
| cdef int64_t n_rows = X.shape[0] |
There was a problem hiding this comment.
Could we fuse n_rows and n_query as n_queries?
There was a problem hiding this comment.
Agree we have an extraneous local here, will fix in the followup.
| This method is most efficient when the instance is fit with | ||
| `algorithm="rbc"`. Other algorithms will build a temporary RBC index | ||
| per-call, which adds a small overhead. |
There was a problem hiding this comment.
Should we replicate sklearn behavior exactly here or instead ensure that the user always have optimal use of VRAM by demanding the RBC index to be built during fit and thus avoid uselessly storing the dataset in VRAM in addition to the index?
There was a problem hiding this comment.
I don't think we should error. If anything, we could add a log or warning in this case, but IMO if we can make something just work then we should. Happy to discuss and maybe add a warning/log in the followup.
| >>> import cupy as cp | ||
| >>> from cuml.neighbors import NearestNeighbors | ||
| >>> X = cp.array([[0], [3], [1]]) | ||
| >>> nn = NearestNeighbors().fit(X) |
There was a problem hiding this comment.
Shouldn't the minimal doc example demonstrate use with algorithm="rbc"?
There was a problem hiding this comment.
Agreed - I'll add that in the followup.
|
Thanks for the review! I agree with a few of your comments. Will fix them in the followup PR while expanding support; none seemed worth spending an extra CI cycle here when I'll be touching this function again in a followup. |
|
/merge |
This adds support for
NearestNeighbors.radius_neighbors_graph, built oncuvs::neighbors::ball_cover::eps_nn.This only adds support for
mode="connectivity",mode="distance"requires rapidsai/cuvs#1815 fromcuvsand will be handled in a followup. Likewise forNearestNeighbors.radius_neighbors.Part of #262, Part of #7759.