Spectral Embedding argument affinity={"precomputed", "nearest_neighbors"} #7117
Conversation
Title changed from affinity="precomputed" to affinity={"precomputed", "nearest_neighbors"}
viclafargue
left a comment
Thanks a lot @aamijar! It's looking good. Could you update the docstrings? Also, here are a number of small change requests.
```cpp
raft::device_vector_view<int, int> rows,
raft::device_vector_view<int, int> cols,
raft::device_vector_view<float, int> vals,
raft::device_matrix_view<float, int, raft::col_major> embedding);
```
It looks like the API won't work with datasets having more elements (nnz) than std::numeric_limits<int>::max(). It would be great to update the cuVS and cuML APIs to allow larger matrices (extent as uint64_t). Maybe as a follow-up PR?
```python
rows = A.row
cols = A.col
vals = A.data
n_samples = A.shape[0]
nnz = A.nnz

rows = input_to_cuml_array(rows, order="C",
                           check_dtype=np.int32, convert_to_dtype=cp.int32)[0]
cols = input_to_cuml_array(cols, order="C",
                           check_dtype=np.int32, convert_to_dtype=cp.int32)[0]
vals = input_to_cuml_array(vals, order="C",
                           check_dtype=np.float32, convert_to_dtype=cp.float32)[0]
```
Could you add a check to ensure that this is a COO matrix, and maybe convert otherwise? You could perhaps reuse the extract_knn_graph function. Additionally, asserts on the lengths of the arrays would be nice to have too.
Added the assert here ddf2a26. In what case would we need to convert?
If the user provides other sparse formats than COO and maybe even a dense pre-computed graph.
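A defensive conversion along these lines could look like the following. This is a minimal sketch using scipy only; `to_coo_graph` and the dense branch are hypothetical illustrations, not the actual cuML code:

```python
import numpy as np
import scipy.sparse as sp

def to_coo_graph(A):
    # Hypothetical normalization: accept any scipy sparse format or a
    # dense precomputed graph, and always hand back a COO matrix.
    if sp.issparse(A):
        return A.tocoo() if A.format != "coo" else A
    return sp.coo_matrix(np.asarray(A))

# CSR, CSC, and dense inputs all come out as COO:
A = sp.random(10, 10, density=0.2, format="csr", random_state=0)
coo = to_coo_graph(A)
```

The real code would additionally handle cupy sparse matrices and device arrays.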
```cython
input_to_cuml_array(A, order="C", check_dtype=np.float32,
                    convert_to_dtype=cp.float32)
A_ptr = <uintptr_t>A.ptr
n_samples = A.shape[0]
```
Isn't n_samples the same as _n_rows? It's safer to avoid accessing the shape attribute and to leave it to the input_to_cuml_array function to determine the number of samples.
```cython
transform(
    deref(h), config,
    make_device_matrix_view[float, int, row_major](
        <float *>A_ptr, <int> n_samples, <int> A.shape[1]),
```
Please use _n_cols rather than A.shape[1].
```diff
 def test_spectral_embedding_trustworthiness(
-    dataset_loader, n_samples, min_trustworthiness
+    dataset_loader, n_samples, affinity
 ):
```
It would be great to quickly check that it also behaves as expected with a smooth KNN graph, such as one produced by the fuzzy_simplicial_set function.
```python
[
    ("nearest_neighbors", None),  # Use built-in nearest_neighbors affinity
    ("precomputed", "binary_knn"),  # Precomputed binary k-NN graph
    ("precomputed", "fuzzy_knn"),  # Precomputed fuzzy k-NN graph from UMAP
],
```
Could we also add ("precomputed", "regular_knn") with mode="distance", to check that it is as good as ("nearest_neighbors", None)?
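A distance-weighted k-NN graph for such a case could be built roughly like this. This is a brute-force numpy/scipy sketch; `knn_graph_distance` is a hypothetical helper mirroring what sklearn.neighbors.kneighbors_graph(..., mode="distance") would produce, not code from the PR:

```python
import numpy as np
import scipy.sparse as sp

def knn_graph_distance(X, n_neighbors):
    # Brute-force pairwise Euclidean distances.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)  # exclude self-edges
    idx = np.argsort(D, axis=1)[:, :n_neighbors]
    rows = np.repeat(np.arange(X.shape[0]), n_neighbors)
    cols = idx.ravel()
    vals = D[rows, cols]
    G = sp.coo_matrix((vals, (rows, cols)), shape=(X.shape[0],) * 2)
    # Symmetrize, since a precomputed affinity graph should be symmetric.
    return G.maximum(G.T)

X = np.random.RandomState(0).rand(20, 3).astype(np.float32)
graph = knn_graph_distance(X, n_neighbors=5)
```

The resulting graph could then be passed with affinity="precomputed".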
```python
    affinity="precomputed",
    random_state=42,
)
X_sklearn = sk_spectral.fit_transform(graph_dense.get())
```
Doesn't the Scikit-Learn implementation handle sparse arrays here?
viclafargue
left a comment
Thanks @aamijar! LGTM, just two small comments.
```python
# Use deepcopy=True to ensure we don't modify the original arrays
rows = input_to_cuml_array(rows, order="C", deepcopy=True,
                           check_dtype=np.int32, convert_to_dtype=cp.int32)[0]
cols = input_to_cuml_array(cols, order="C", deepcopy=True,
                           check_dtype=np.int32, convert_to_dtype=cp.int32)[0]
vals = input_to_cuml_array(vals, order="C", deepcopy=True,
                           check_dtype=np.float32, convert_to_dtype=cp.float32)[0]
```
Does the C++ side update the input COO matrix?
Yes, I think it's because we are doing coo_sort in place on the input view.
It'd be nice to be able to avoid these copies here.
If we could move the sorting out to be handled by the caller (the creation of the initial COO here should already do that), that'd be cleaner IMO.
If we can't, then I think avoiding the copy is still fine. Sorting is a canonicalization step (the same one that cupyx.scipy.sparse will do); a mutation like that won't make an input COO matrix invalid and should be fine IMO. I still have a preference for moving the sorting out of the routine, though.
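Moving the canonicalization out to the caller could be as simple as a lexsort on the Python side. This is a scipy sketch of the idea, not the actual cuML code path:

```python
import numpy as np
import scipy.sparse as sp

A = sp.random(8, 8, density=0.3, format="coo", random_state=0)

# Sort entries by (row, col) up front so a downstream coo_sort becomes a
# no-op and never needs to mutate the caller's arrays. np.lexsort uses the
# last key as the primary sort key, hence (col, row).
order = np.lexsort((A.col, A.row))
rows, cols, vals = A.row[order], A.col[order], A.data[order]
```

The fancy-indexing here already produces fresh arrays, so the caller's COO matrix is left untouched.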
Hmm, something fishy happens if I remove the copying; I ran into this last week as well. It happens specifically when the input sparse matrix is CSC. The flow is: the user passes in a CSC sparse matrix, which gets converted to a COO matrix in place in the Python code; then the C++ code also performs a coo_sort operation. This corrupts the original input data, so in the pytest the second call to spectral_embedding fails since the input was modified.
I already tried sorting on the Python side and removing the coo_sort on the C++ side, but that didn't work for the CSC input.
Addressed in 884e282.
I fixed it by handling CSC input with copying, and I am also doing the sorting on the Python side. I confirmed that the CSC-to-COO conversion and the sorting in Python actually also corrupt the input data, so we need to copy.
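The fix described above might look roughly like this on the Python side. This is a scipy-only sketch; `canonical_coo` is a hypothetical name, and the real code also handles cupy sparse inputs:

```python
import numpy as np
import scipy.sparse as sp

def canonical_coo(A):
    # Copy CSC inputs first: converting/sorting them in place was observed
    # to corrupt the caller's matrix, so a defensive copy is required.
    if sp.issparse(A) and A.format == "csc":
        A = A.copy()
    A = A.tocoo()
    order = np.lexsort((A.col, A.row))  # canonical (row, col) order
    return A.row[order], A.col[order], A.data[order]

A = sp.random(10, 10, density=0.2, format="csc", random_state=1)
data_before = A.data.copy()
rows, cols, vals = canonical_coo(A)
```

Only the CSC branch pays for a copy; COO and CSR inputs pass through without one.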
```python
# Handle scipy sparse matrices
if scipy_issparse(A):
    return A.tocoo()

# Handle cupy sparse matrices
if cupy_issparse(A):
    return A.tocoo()
```
sort_indices should guarantee order for CSR/CSC. We should probably update the extract_knn_graph function too, maybe in another PR.
Which sort_indices part are you referring to?
Okay, does that get called as part of the .tocoo()? Is it a problem? I'm not sure what to update.
Actually, never mind; we should still get contiguous row blocks without sorting, and that should be fine.
Fixes a bug where non-float32 cupy sparse matrices were mishandled. Adds a test for float64 inputs across all input types.
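The float64 path can be exercised with a scipy-only sketch. The real test goes through input_to_cuml_array with convert_to_dtype=cp.float32; here astype stands in for that conversion:

```python
import numpy as np
import scipy.sparse as sp

base = sp.random(10, 10, density=0.2, dtype=np.float64, random_state=0)

# Convert each sparse format's float64 values to float32, mimicking the
# convert_to_dtype path that the float64 test now exercises per input type.
converted = {fmt: base.asformat(fmt).astype(np.float32)
             for fmt in ("coo", "csr", "csc")}
```

Conversion should change only the value dtype, never the sparsity structure or the format.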
/merge |
Resolves #7081