Spectral Embedding argument affinity={"precomputed", "nearest_neighbors"} #7117

Merged
rapids-bot[bot] merged 18 commits into rapidsai:branch-25.10 from aamijar:precomputed-spectral-embedding
Aug 26, 2025

Conversation

Member

@aamijar aamijar commented Aug 14, 2025

Resolves #7081

@aamijar aamijar requested review from a team as code owners August 14, 2025 23:30
@github-actions github-actions Bot added the Cython / Python and CUDA/C++ labels Aug 14, 2025
@aamijar aamijar removed the request for review from teju85 August 14, 2025 23:31
@aamijar aamijar added the non-breaking and improvement labels Aug 14, 2025
@aamijar aamijar changed the title from 'Spectral Embedding precomputed argument affinity="precomputed"' to 'Spectral Embedding argument affinity={"precomputed", "nearest_neighbors"}' Aug 14, 2025
Contributor

@viclafargue viclafargue left a comment

Thanks a lot @aamijar! It's looking good. Could you update the docstrings? Also, here are a few small change requests.

Comment thread cpp/src/spectral/spectral_embedding.cu
Comment on lines +63 to +66
raft::device_vector_view<int, int> rows,
raft::device_vector_view<int, int> cols,
raft::device_vector_view<float, int> vals,
raft::device_matrix_view<float, int, raft::col_major> embedding);
Contributor

It looks like the API won't work with datasets having more elements (nnz) than std::numeric_limits<int>::max. Would be great to update the cuVS and cuML APIs to allow larger matrices (extent as uint64_t). Maybe as a follow-up PR?

Member Author

Yes, tracking here rapidsai/cuvs#1243.
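
The limit raised in this thread can be sketched in a few lines. This is only an illustration of the arithmetic, not code from the PR: `fits_int32_extent` is a hypothetical helper, and the sample sizes are made up.

```python
import numpy as np

# Largest value a signed 32-bit extent can hold (what `int` is on the
# platforms this API targets).
INT32_MAX = np.iinfo(np.int32).max


def fits_int32_extent(n_rows: int, n_cols: int, nnz: int) -> bool:
    """Hypothetical check: can the shape and nnz all be indexed by an int32?"""
    return max(n_rows, n_cols, nnz) <= INT32_MAX


# A k-NN graph over 50 million samples with 64 neighbors each stores
# 3.2 billion entries, which overflows a 32-bit extent.
print(fits_int32_extent(50_000_000, 50_000_000, 50_000_000 * 64))  # False
# One million samples with 64 neighbors each is comfortably in range.
print(fits_int32_extent(1_000_000, 1_000_000, 1_000_000 * 64))  # True
```

The number of samples can be well below the limit while nnz exceeds it, which is why the comment singles out nnz.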

Comment on lines +152 to +163
rows = A.row
cols = A.col
vals = A.data
n_samples = A.shape[0]
nnz = A.nnz

rows = input_to_cuml_array(rows, order="C",
                           check_dtype=np.int32, convert_to_dtype=cp.int32)[0]
cols = input_to_cuml_array(cols, order="C",
                           check_dtype=np.int32, convert_to_dtype=cp.int32)[0]
vals = input_to_cuml_array(vals, order="C",
                           check_dtype=np.float32, convert_to_dtype=cp.float32)[0]
Contributor

Could you add a check to ensure that this is a COO matrix, and maybe convert otherwise? You could perhaps reuse the extract_knn_graph function. Additionally, asserts on the lengths of the arrays would be nice to have.

Member Author

Added the assert in ddf2a26. In what case would we need to convert?

Contributor

If the user provides a sparse format other than COO, or maybe even a dense precomputed graph.

Member Author

Addressed in 706b843
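
For readers following this thread, here is a minimal CPU-only sketch of the kind of input handling being requested: accept a dense array or any SciPy sparse format, convert to COO, and assert that the three arrays agree in length. It is not the code from 706b843. `to_coo_triplets` is a hypothetical helper, it uses `scipy.sparse` rather than the cuML/CuPy types used in the PR, and it does not handle dtype conversion.

```python
import numpy as np
import scipy.sparse as sp


def to_coo_triplets(A):
    """Hypothetical helper: return (rows, cols, vals, n_samples) for a square graph."""
    # Sparse input in any format (CSR, CSC, COO, ...) is converted to COO;
    # dense input is wrapped as a COO matrix holding its nonzero entries.
    coo = A.tocoo() if sp.issparse(A) else sp.coo_matrix(np.asarray(A))
    n_rows, n_cols = coo.shape
    if n_rows != n_cols:
        raise ValueError(f"expected a square graph, got shape {coo.shape}")
    # The three COO arrays must hold exactly one entry per stored element.
    assert len(coo.row) == len(coo.col) == len(coo.data) == coo.nnz
    return coo.row, coo.col, coo.data, n_rows


dense = np.array([[0, 1, 0], [1, 0, 2], [0, 2, 0]], dtype=np.float32)
for A in (dense, sp.csr_matrix(dense), sp.csc_matrix(dense), sp.coo_matrix(dense)):
    rows, cols, vals, n = to_coo_triplets(A)
    rebuilt = sp.coo_matrix((vals, (rows, cols)), shape=(n, n)).toarray()
    assert np.array_equal(rebuilt, dense)
```

Every input format in the loop yields triplets that rebuild the same matrix, which is the property a "convert otherwise" branch needs to preserve.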

input_to_cuml_array(A, order="C", check_dtype=np.float32,
                    convert_to_dtype=cp.float32)
A_ptr = <uintptr_t>A.ptr
n_samples = A.shape[0]
Contributor

Isn't n_samples the same as _n_rows? Safer to avoid accessing with the shape attribute and leave it to the input_to_cuml_array function to determine the number of samples.

Member Author

Addressed in 63f7fa6 and ddf2a26

transform(
deref(h), config,
make_device_matrix_view[float, int, row_major](
<float *>A_ptr, <int> n_samples, <int> A.shape[1]),
Contributor

Please use _n_cols rather than A.shape[1].

Member Author

Addressed in 63f7fa6

Comment on lines 82 to 84
def test_spectral_embedding_trustworthiness(
dataset_loader, n_samples, min_trustworthiness
dataset_loader, n_samples, affinity
):
Contributor

Would be great to quickly check whether it also behaves as expected with a smooth KNN, such as one produced by the fuzzy_simplicial_set function.

Member Author

Addressed in 9373046

Comment on lines +73 to +77
[
("nearest_neighbors", None), # Use built-in nearest_neighbors affinity
("precomputed", "binary_knn"), # Precomputed binary k-NN graph
("precomputed", "fuzzy_knn"), # Precomputed fuzzy k-NN graph from UMAP
],
Contributor

Could we also add ("precomputed", "regular_knn") with mode="distance" to check that it is as good as ("nearest_neighbors", None)?

Member Author

Addressed in ba3601d
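
To make the difference between a binary k-NN graph and a mode="distance" graph concrete, here is a brute-force NumPy/SciPy sketch. `knn_graph` is a hypothetical stand-in, not the helper used in the PR's tests; its `mode` names are chosen to mirror `sklearn.neighbors.kneighbors_graph`, and the random data is illustrative only.

```python
import numpy as np
import scipy.sparse as sp


def knn_graph(X, k, mode="connectivity"):
    """Brute-force k-NN graph as a COO matrix (illustrative, O(n^2) memory)."""
    n = X.shape[0]
    # Pairwise Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    nbrs = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest per row
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    if mode == "connectivity":
        vals = np.ones(n * k, dtype=np.float32)  # binary graph
    elif mode == "distance":
        vals = dist[rows, cols].astype(np.float32)  # edge weight = distance
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return sp.coo_matrix((vals, (rows, cols)), shape=(n, n))


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
binary = knn_graph(X, k=4, mode="connectivity")
weighted = knn_graph(X, k=4, mode="distance")

# Both graphs share the same sparsity pattern; only the stored values differ.
assert np.array_equal(binary.row, weighted.row)
assert np.array_equal(binary.col, weighted.col)
```

Either graph could then be passed as a precomputed input. How the estimator interprets the stored values is not covered by this sketch.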

affinity="precomputed",
random_state=42,
)
X_sklearn = sk_spectral.fit_transform(graph_dense.get())
Contributor

Doesn't the Scikit-Learn implementation handle sparse arrays here?

Member Author

Yes, addressed in 4560a49.

Contributor

@viclafargue viclafargue left a comment

Thanks @aamijar! LGTM, just two small comments.

Comment on lines +214 to +220
# Use deepcopy=True to ensure we don't modify the original arrays
rows = input_to_cuml_array(rows, order="C", deepcopy=True,
                           check_dtype=np.int32, convert_to_dtype=cp.int32)[0]
cols = input_to_cuml_array(cols, order="C", deepcopy=True,
                           check_dtype=np.int32, convert_to_dtype=cp.int32)[0]
vals = input_to_cuml_array(vals, order="C", deepcopy=True,
                           check_dtype=np.float32, convert_to_dtype=cp.float32)[0]
Contributor

Does the C++ side update the input COO matrix?

Member Author

@aamijar aamijar Aug 25, 2025

Yes, I think it's because we are doing coo_sort in place on the input view.

Member

It'd be nice to be able to avoid these copies here.

If we could move the sorting out to be handled by the caller (the creation of the initial coo here should already do that) that'd be cleaner IMO.

If we can't, then I think avoiding the copy is still fine. Sorting is a canonicalization step (the same one that cupyx.scipy.sparse will do). A mutation like that won't make an input coo matrix invalid, and should be fine IMO. Still have a preference to move the sorting out of the routine though.
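
The canonicalization argument can be checked on a small example. This is a CPU-only sketch with `scipy.sparse` and `np.lexsort`, not the `coo_sort` routine discussed in the thread; the triplets are made up.

```python
import numpy as np
import scipy.sparse as sp

# An unsorted COO description of a 3x3 matrix.
rows = np.array([2, 0, 1, 0], dtype=np.int32)
cols = np.array([1, 2, 0, 1], dtype=np.int32)
vals = np.array([4.0, 2.0, 3.0, 1.0], dtype=np.float32)
before = sp.coo_matrix((vals, (rows, cols)), shape=(3, 3)).toarray()

# np.lexsort treats the LAST key as primary: sort by row, then by column.
order = np.lexsort((cols, rows))
rows_s, cols_s, vals_s = rows[order], cols[order], vals[order]
after = sp.coo_matrix((vals_s, (rows_s, cols_s)), shape=(3, 3)).toarray()

# Reordering the triplets changes the storage order, not the matrix.
assert np.array_equal(before, after)
print(rows_s.tolist(), cols_s.tolist(), vals_s.tolist())
# [0, 0, 1, 2] [1, 2, 0, 1] [1.0, 2.0, 3.0, 4.0]
```

This only holds when the three arrays belong to a genuine COO matrix. It does not cover arrays that alias the buffers of a matrix stored in another format.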

Member Author

Hmm, something fishy happens if I remove the copying. I was running into this last week as well. It happens specifically when the input sparse matrix is CSC. The flow is: the user passes in a CSC sparse matrix, which gets converted to a COO matrix in place in the Python code. Then the C++ code also performs a coo_sort operation, which corrupts the original input data. So in the pytest, the second call to spectral_embedding fails because the input was modified.

Member Author

@aamijar aamijar Aug 26, 2025

I already tried sorting on the Python side and removing the coo_sort on the C++ side, but that didn't work for the CSC input.

Member Author

@aamijar aamijar Aug 26, 2025

Addressed in 884e282.
I fixed it by copying when the input is CSC. I am also doing the sorting on the Python side. I confirmed that the CSC-to-COO conversion plus sorting in Python also corrupts the input data, so we need to copy.
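
One way the reported corruption can arise is sketched below on the CPU with SciPy. It assumes that the COO row and value arrays alias the CSC matrix's own `indices` and `data` buffers, which is forced explicitly here. Whether the CuPy conversion in the PR shares buffers in the same way is not shown in this thread. `csc_as_coo_views` and `sort_coo_inplace` are hypothetical helpers.

```python
import numpy as np
import scipy.sparse as sp


def csc_as_coo_views(csc):
    """Expose a CSC matrix as COO triplets WITHOUT copying rows or values."""
    # Expand indptr into one column index per stored element.
    cols = np.repeat(np.arange(csc.shape[1], dtype=csc.indices.dtype),
                     np.diff(csc.indptr))
    # rows and vals are the CSC matrix's own buffers, not copies.
    return csc.indices, cols, csc.data


def sort_coo_inplace(rows, cols, vals):
    """Sort triplets by (row, col), writing the result back into the inputs."""
    order = np.lexsort((cols, rows))
    rows[:], cols[:], vals[:] = rows[order], cols[order], vals[order]


dense = np.array([[0, 1], [2, 0]], dtype=np.float32)

# Unsafe: the in-place sort reorders buffers that the CSC matrix still
# interprets in column-major order, so the CSC matrix is silently changed.
csc = sp.csc_matrix(dense)
sort_coo_inplace(*csc_as_coo_views(csc))
corrupted = csc.toarray()

# Safe: copy the triplets first, then sort the copies.
csc = sp.csc_matrix(dense)
rows, cols, vals = (a.copy() for a in csc_as_coo_views(csc))
sort_coo_inplace(rows, cols, vals)
intact = csc.toarray()

assert not np.array_equal(corrupted, dense)
assert np.array_equal(intact, dense)
```

In the unsafe path the original matrix `[[0, 1], [2, 0]]` ends up reading as `[[1, 0], [0, 2]]`, which matches the symptom of a second call failing on modified input.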

Comment on lines +74 to +80
# Handle scipy sparse matrices
if scipy_issparse(A):
    return A.tocoo()

# Handle cupy sparse matrices
if cupy_issparse(A):
    return A.tocoo()
Contributor

sort_indices should guarantee order for CSR/CSC. We should probably update the extract_knn_graph function too, maybe in another PR.

Member Author

Which sort_indices part are you referring to?

Contributor

Member Author

Okay, does that get called as part of .tocoo()? Is it a problem? I am not sure what to update.

Contributor

Actually, never mind: we should still get contiguous row blocks without sorting, and that should be fine.
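
The point about contiguous row blocks can be illustrated for CSR input with SciPy on the CPU. This is a sketch under that assumption and does not exercise the cuML or CuPy code path. The matrix is built with deliberately unsorted column indices inside its first row.

```python
import numpy as np
import scipy.sparse as sp

# CSR arrays for [[2, 0, 1], [0, 3, 0]], with row 0's columns stored as 2, 0.
data = np.array([1.0, 2.0, 3.0], dtype=np.float32)
indices = np.array([2, 0, 1], dtype=np.int32)
indptr = np.array([0, 2, 3], dtype=np.int32)
csr = sp.csr_matrix((data, indices, indptr), shape=(2, 3))

# Converting to COO keeps each row's entries together, so the row array is
# non-decreasing even though columns inside a row may be out of order.
coo = csr.tocoo()
rows_grouped = bool(np.all(np.diff(coo.row) >= 0))

# sort_indices() orders the columns within each row, in place, on a copy.
csr_sorted = csr.copy()
csr_sorted.sort_indices()

assert rows_grouped
assert np.array_equal(csr.toarray(), csr_sorted.toarray())
```

The same grouping does not hold for CSC input, where entries are contiguous by column rather than by row.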

Comment thread python/cuml/cuml/manifold/spectral_embedding.pyx Outdated
aamijar and others added 3 commits August 26, 2025 01:50
- Simplify affinity matrix input handling
- Raise error on invalid `affinity` value
- A few cleanups to docstrings and tests
Member

@jcrist jcrist left a comment

:shipit:

Comment thread python/cuml/cuml/manifold/spectral_embedding.pyx
Fixes a bug where non-float32 cupy sparse matrices were mishandled. Adds
a test for float64 inputs across all input types.
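
A float64 check of the kind described might look like the following CPU-only sketch. It uses `scipy.sparse` in place of CuPy sparse matrices, and `as_float32_coo` is a hypothetical helper rather than the function changed in this commit.

```python
import numpy as np
import scipy.sparse as sp


def as_float32_coo(A):
    """Hypothetical helper: COO version of a sparse matrix with float32 values."""
    # astype returns a new matrix, so the caller's float64 input is untouched.
    return A.tocoo().astype(np.float32)


dense64 = np.array([[0.0, 0.5], [0.25, 0.0]], dtype=np.float64)
for make in (sp.csr_matrix, sp.csc_matrix, sp.coo_matrix):
    A = make(dense64)
    out = as_float32_coo(A)
    assert out.format == "coo"
    assert out.dtype == np.float32
    assert A.dtype == np.float64  # the input keeps its original dtype
    assert np.array_equal(out.toarray(), dense64.astype(np.float32))
```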
Member

jcrist commented Aug 26, 2025

/merge

@rapids-bot rapids-bot Bot merged commit c2de322 into rapidsai:branch-25.10 Aug 26, 2025
76 checks passed
Labels

CUDA/C++, Cython / Python, improvement, non-breaking

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enable passing of pre-computed knn graph to SpectralEmbedding

4 participants