UMAP with new Spectral Embedding initialization#7063
UMAP with new Spectral Embedding initialization#7063rapids-bot[bot] merged 18 commits intorapidsai:branch-25.10from
Conversation
viclafargue
left a comment
There was a problem hiding this comment.
Thanks Anupam, this looks great! This was a longstanding issue, so it's great to see it addressed. LGTM overall. One point though: the reference implementation initializes the embeddings within a (-10, 10) bounding box, regardless of the initialization method. Could we retain that behavior?
| auto tmp_embedding_view = raft::make_device_matrix_view<float, int, raft::col_major>( | ||
| tmp_embedding.data_handle(), n, params->n_components); |
There was a problem hiding this comment.
Just a detail, but you could use .view() to create a view.
EDIT: Unless raft::col_major is important here.
|
@aamijar in addition to the viz above, please also benchmark the old and new spectral initialization on multiple datasets or different shapes and sizes so that we can get an idea of the perf gap (if any) between them. it’s very very important that we characterize the delta here very carefully. |
cjnolet
left a comment
There was a problem hiding this comment.
We are noticing some potential recent perf regressions in UMAP, so I'd just like to make sure we're not introducing significantly more with these changes. We need to collect some UMAP benchmarks (before and after this change) before this is merged.
I think the impact to perf could be justified if not major just because of the huge quality improvements. But we'll need to at least assess where we are first.
|
Thanks for the review @viclafargue! I have addressed your comments and added the bounding box initialization constraint. |
viclafargue
left a comment
There was a problem hiding this comment.
LGTM! But, could you test again the quality of the initialization after these changes (the bounding box and random noise addition) ?
|
Yes, I have added the new visualizations to the PR description. |
|
I think that the number of epochs has to be set to 1 to get a good visualization of spectral initialization within UMAP. Setting it to 0 makes UMAP use a default value. |
|
I am not running the simpl_set part since I comment out the part I mentioned in the PR description. |
Oh, I see, then it should be good. But, why are there only 3 points visible in the visualization then? |
|
I think because the dataset has 2 features. The embedding clusters the points on top of each other. The umap-learn version also does the same if you plot out the spectral initialization. |
|
@aamijar thanks for updating the visualizations in the description. I'm a little confused by the second set of visualizations, though. Can you verify the "updated" visualization is what we would expect from the cpu version as well? You should be able to verify this by setting the number of epochs to 0, or printing the points in the cpu version right after initialization. |
|
Hi @cjnolet, I have added the benchmark results the PR description. I have also added additional plots to show the CPU spectral initialization. The CPU one also has 3 points if you run sklearn SpectralEmbedding (I edited the code to do this) as the spectral initialization. If you run the usual CPU spectral initialization you will get a different plot. |
| auto connectivity_graph_view = raft::make_device_coo_matrix_view<float, int, int, int>( | ||
| coo->vals(), | ||
| raft::make_device_coordinate_structure_view<int, int, int>( | ||
| coo->rows(), coo->cols(), n, n, coo->nnz)); |
There was a problem hiding this comment.
Hi @jinsolp, I think we can create a follow up issue for nnz_t types since I would need to change the cuvs api too.
|
/merge |
Resolves #7052, Depends on rapidsai/cuvs#1197
Use the new Spectral Embedding algorithm from
cuvs::preprocessing::spectral_embeddingfor spectral initialization in UMAP.The previous spectral initialization has had longstanding issues mentioned in issues such as #5782.
Visualization Analysis
Code beyond the initialization step in umap is commented out to obtain the plots https://github.com/rapidsai/cuml/blob/branch-25.08/cpp/src/umap/runner.cuh#L245C3-L258C40
1 = old spectral initialization
2 = new spectral initialization
3 = cpu spectral initialization using SpectralEmbedding()
4 = cpu spectral initialization
Benchmarking 1x L4
The following python script is used to obtain the plots.