UMAP with new Spectral Embedding initialization #7063

Merged
rapids-bot[bot] merged 18 commits into rapidsai:branch-25.10 from aamijar:umap-spectral-embedding
Aug 12, 2025

@aamijar
Member

@aamijar aamijar commented Jul 30, 2025

Resolves #7052, Depends on rapidsai/cuvs#1197

This PR uses the new Spectral Embedding algorithm from cuvs::preprocessing::spectral_embedding for spectral initialization in UMAP.
The previous spectral initialization has longstanding problems, documented in issues such as #5782.
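
For readers unfamiliar with the idea, here is a minimal, dependency-free sketch of what a spectral embedding computes: low-dimensional coordinates taken from eigenvectors of a graph Laplacian. It is not the cuVS implementation. The function name `fiedler_vector`, the use of the unnormalized Laplacian, and the power-iteration solver are all assumptions chosen to keep the example short. It produces only one coordinate per node; a 2-D initialization would use two eigenvectors.

```python
def fiedler_vector(edges, n_nodes, iterations=500):
    """Approximate the eigenvector for the second-smallest eigenvalue of L = D - A.

    Illustrative only: dense matrices and plain power iteration, not what cuVS does.
    """
    # Dense adjacency matrix of an undirected, unweighted graph.
    adjacency = [[0.0] * n_nodes for _ in range(n_nodes)]
    for a, b in edges:
        adjacency[a][b] = adjacency[b][a] = 1.0
    degrees = [sum(row) for row in adjacency]
    # 2 * max degree is an upper bound on the largest eigenvalue of L.
    shift = 2.0 * max(degrees)

    def center_and_normalize(v):
        # Subtracting the mean removes the trivial constant eigenvector.
        mean = sum(v) / n_nodes
        centered = [x - mean for x in v]
        norm = sum(x * x for x in centered) ** 0.5
        return [x / norm for x in centered]

    v = center_and_normalize([float(i) for i in range(n_nodes)])
    for _ in range(iterations):
        laplacian_v = [
            degrees[i] * v[i] - sum(adjacency[i][j] * v[j] for j in range(n_nodes))
            for i in range(n_nodes)
        ]
        # Iterating with (shift * I - L) turns the smallest eigenvalues of L
        # into the dominant ones, so power iteration converges to them.
        v = center_and_normalize([shift * v[i] - laplacian_v[i] for i in range(n_nodes)])
    return v


# Two triangles (nodes 0-2 and 3-5) joined by a single bridge edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
coords = fiedler_vector(edges, n_nodes=6)
print([round(c, 3) for c in coords])
```

In this toy graph the two triangles end up with coordinates of opposite sign, which is the kind of coarse cluster structure a spectral initialization hands to the UMAP optimizer.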

Visualization Analysis

To obtain the plots, the code after the initialization step in UMAP is commented out: https://github.com/rapidsai/cuml/blob/branch-25.08/cpp/src/umap/runner.cuh#L245C3-L258C40

1 = old spectral initialization
2 = new spectral initialization
3 = CPU spectral initialization using sklearn SpectralEmbedding()
4 = usual CPU spectral initialization

[Two images: plots of the initializations listed above]

Benchmarking 1x L4

| Dataset | Samples | Features | New Time (s) | New Trustworthiness | Old Time (s) | Old Trustworthiness |
|---|---|---|---|---|---|---|
| deep-image-96-angular | 50,000 | 96 | 0.321±0.004 | 0.8945±0.0007 | 0.341±0.006 | 0.8961±0.0013 |
| fashion-mnist-784-euclidean | 50,000 | 784 | 1.247±0.002 | 0.9743±0.0003 | 1.307±0.006 | 0.9726±0.0022 |
| gist-960-euclidean | 50,000 | 960 | 1.642±0.003 | 0.7704±0.0010 | 1.702±0.017 | 0.7718±0.0011 |
| glove-25-angular | 50,000 | 25 | 0.314±0.002 | 0.8146±0.0008 | 0.320±0.002 | 0.8191±0.0027 |
| mnist-784-euclidean | 50,000 | 784 | 1.247±0.004 | 0.9522±0.0011 | 1.246±0.010 | 0.9528±0.0013 |
| sift-128-euclidean | 50,000 | 128 | 0.326±0.003 | 0.9191±0.0008 | 0.325±0.003 | 0.9180±0.0009 |

| Dataset | Samples | Features | New Time (s) | New Trustworthiness | Old Time (s) | Old Trustworthiness |
|---|---|---|---|---|---|---|
| deep-image-96-angular | 250,000 | 96 | 5.495±0.012 | 0.8891±0.0013 | 5.619±0.071 | 0.8873±0.0021 |
| fashion-mnist-784-euclidean | 60,000 | 784 | 1.782±0.004 | 0.9732±0.0003 | 1.872±0.027 | 0.9724±0.0014 |
| gist-960-euclidean | 250,000 | 960 | 40.270±0.176 | 0.7656±0.0021 | 40.468±0.141 | 0.7660±0.0015 |
| glove-25-angular | 250,000 | 25 | 5.128±0.004 | 0.8180±0.0024 | 5.201±0.015 | 0.8194±0.0023 |
| mnist-784-euclidean | 60,000 | 784 | 1.803±0.010 | 0.9480±0.0022 | 1.794±0.006 | 0.9513±0.0017 |
| sift-128-euclidean | 250,000 | 128 | 5.726±0.085 | 0.9156±0.0005 | 5.617±0.008 | 0.9150±0.0006 |
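
To make the time delta easier to read, the sketch below computes the relative change in mean runtime for the 50,000-sample runs. The numbers are copied from the benchmark results above; the helper name `percent_change` is hypothetical, and the calculation ignores the reported standard deviations.

```python
# (new_time_s, old_time_s): mean times from the 50,000-sample benchmark rows.
times = {
    "deep-image-96-angular": (0.321, 0.341),
    "fashion-mnist-784-euclidean": (1.247, 1.307),
    "gist-960-euclidean": (1.642, 1.702),
    "glove-25-angular": (0.314, 0.320),
    "mnist-784-euclidean": (1.247, 1.246),
    "sift-128-euclidean": (0.326, 0.325),
}


def percent_change(new, old):
    """Relative change of `new` with respect to `old`, in percent (negative = faster)."""
    return (new - old) / old * 100.0


changes = {name: percent_change(new, old) for name, (new, old) in times.items()}
for name, pct in changes.items():
    print(f"{name:30s} {pct:+.2f}%")
```

On these rows the mean runtime of the new initialization stays within about 6% of the old one in either direction.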

The following Python script was used to obtain the plots.

import numpy as np
from sklearn.datasets import fetch_openml
import matplotlib.pyplot as plt
from cuml.manifold import UMAP, SpectralEmbedding
# Imported for the CPU comparison; not used in the code below.
from sklearn.manifold import SpectralEmbedding as skSpectralEmbedding

# Load MNIST as NumPy arrays (as_frame=False); labels arrive as strings.
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X_mnist = mnist.data.astype('float32')
y_mnist = mnist.target.astype(int)

# subset_size is defined but not applied: the full dataset is embedded.
# Slice with X_mnist[:subset_size] and y_mnist[:subset_size] to use a subset.
subset_size = 5000
X = X_mnist
y = y_mnist

# Compute spectral embedding
spectral_model = SpectralEmbedding(n_components=2, n_neighbors=15)
X_spectral = spectral_model.fit_transform(X)

# Compute UMAP embedding.
# The plots rely on the post-initialization code being commented out in
# runner.cuh (see the PR description), not on the n_epochs value.
umap_model = UMAP(n_neighbors=15, n_components=2, random_state=42, n_epochs=0, init="spectral")
X_umap = umap_model.fit_transform(X)

# Plot side-by-side
fig, axs = plt.subplots(1, 2, figsize=(18, 7))

scatter1 = axs[0].scatter(X_spectral[:, 0], X_spectral[:, 1], c=y, cmap='tab10', s=2)
axs[0].set_title('cuML Spectral Embedding of MNIST')
axs[0].set_xlabel('Spectral Component 1')
axs[0].set_ylabel('Spectral Component 2')

scatter2 = axs[1].scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='tab10', s=2)
axs[1].set_title('cuML UMAP Projection of MNIST')
axs[1].set_xlabel('UMAP 1')
axs[1].set_ylabel('UMAP 2')

plt.tight_layout()
plt.show()

@aamijar aamijar requested a review from a team as a code owner July 30, 2025 08:24
@aamijar aamijar requested review from teju85 and vyasr July 30, 2025 08:24
@aamijar aamijar added the non-breaking (Non-breaking change) and improvement (Improvement / enhancement to an existing function) labels Jul 30, 2025
@aamijar aamijar moved this from Todo to In Progress in Unstructured Data Processing Jul 30, 2025
@aamijar aamijar removed request for teju85 and vyasr July 30, 2025 08:29
Comment thread cpp/src/umap/init_embed/spectral_algo.cuh Outdated
Contributor

@viclafargue viclafargue left a comment

Thanks Anupam, this looks great! This was a longstanding issue, so it's great to see it addressed. LGTM overall. One point though: the reference implementation initializes the embeddings within a (-10, 10) bounding box, regardless of the initialization method. Could we retain that behavior?
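
As a rough illustration of the bounding-box behavior requested here, the sketch below rescales an embedding so that its largest absolute coordinate equals 10 and then adds a small amount of Gaussian noise. The function name `scale_to_box`, the exact scaling rule, and the noise scale of 1e-4 are assumptions made for this example; they are not taken from the PR's C++ code.

```python
import random


def scale_to_box(embedding, bound=10.0, noise_scale=1e-4, seed=42):
    """Rescale coordinates into roughly (-bound, bound) and add small Gaussian noise.

    Assumed behavior for illustration only. Because noise is added after scaling,
    a coordinate can end up marginally outside the box.
    """
    rng = random.Random(seed)
    max_abs = max(abs(x) for row in embedding for x in row)
    # Guard against an all-zero embedding, which would divide by zero.
    expansion = bound / max_abs if max_abs > 0 else 1.0
    return [
        [x * expansion + rng.gauss(0.0, noise_scale) for x in row]
        for row in embedding
    ]


raw = [[0.01, -0.02], [0.005, 0.04]]
for row in scale_to_box(raw):
    print([round(x, 3) for x in row])
```

A single global scale factor keeps the relative geometry of the spectral layout intact, which is why it is used here rather than per-axis normalization.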

Comment on lines +67 to +68
auto tmp_embedding_view = raft::make_device_matrix_view<float, int, raft::col_major>(
  tmp_embedding.data_handle(), n, params->n_components);
Contributor

@viclafargue viclafargue Jul 31, 2025

Just a detail, but you could use .view() to create a view.
EDIT: Unless raft::col_major is important here.
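
The caveat about raft::col_major matters because the same flat buffer describes a different matrix depending on the layout it is read with. The following sketch shows this with plain Python lists; it does not use RAFT, and the helper names are hypothetical.

```python
def index_col_major(i, j, n_rows):
    """Flat offset of element (i, j) when columns are stored contiguously."""
    return i + j * n_rows


def index_row_major(i, j, n_cols):
    """Flat offset of element (i, j) when rows are stored contiguously."""
    return i * n_cols + j


n_rows, n_cols = 3, 2
buffer = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]  # one flat allocation

col_view = [[buffer[index_col_major(i, j, n_rows)] for j in range(n_cols)]
            for i in range(n_rows)]
row_view = [[buffer[index_row_major(i, j, n_cols)] for j in range(n_cols)]
            for i in range(n_rows)]

print(col_view)
print(row_view)
```

Reading the buffer with the wrong layout silently scrambles the embedding coordinates rather than raising an error, so a view has to agree with the layout the data was written in.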

Member Author

Addressed in 8f32544

@cjnolet
Member

cjnolet commented Jul 31, 2025

@aamijar in addition to the viz above, please also benchmark the old and new spectral initialization on multiple datasets of different shapes and sizes so that we can get an idea of the perf gap (if any) between them.

It's very important that we characterize the delta here carefully.
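
A before/after comparison of this kind needs repeated timings reported as mean and standard deviation. The harness below is a generic sketch using only the standard library; the function name `benchmark` and the repeat and warmup counts are assumptions, and the workload is a placeholder rather than a UMAP call.

```python
import statistics
import time


def benchmark(fn, repeats=5, warmup=1):
    """Return (mean, stdev) of the wall-clock time of fn() in seconds.

    For GPU workloads, fn should block until the device work has finished,
    otherwise only the launch overhead is measured.
    """
    for _ in range(warmup):
        fn()  # warm-up runs absorb one-time costs such as compilation or caching
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


# Placeholder workload; in practice this would wrap a fit_transform call.
mean_s, std_s = benchmark(lambda: sum(i * i for i in range(100_000)))
print(f"{mean_s:.3f}±{std_s:.3f} s")
```

Running the same harness on the old and new builds with identical data and seeds yields directly comparable numbers.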

Member

@cjnolet cjnolet left a comment

We are noticing some potential recent perf regressions in UMAP, so I'd just like to make sure we're not introducing significantly more with these changes. We need to collect some UMAP benchmarks (before and after this change) before this is merged.

I think the impact to perf could be justified if not major just because of the huge quality improvements. But we'll need to at least assess where we are first.

@aamijar
Member Author

aamijar commented Aug 3, 2025

Thanks for the review @viclafargue! I have addressed your comments and added the bounding box initialization constraint.

Contributor

@viclafargue viclafargue left a comment

LGTM! But could you test the quality of the initialization again after these changes (the bounding box and random noise addition)?
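
One way to quantify embedding quality is the trustworthiness score reported in the PR's benchmark results. The sketch below is a small pure-Python version that assumes the standard definition (the one used by scikit-learn's `sklearn.manifold.trustworthiness`) with Euclidean distances. It is quadratic in the number of points and intended only for tiny inputs, not as the benchmark's actual implementation.

```python
import math


def _ranked_neighbors(points, i):
    """Indices of all other points, sorted by Euclidean distance from point i."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))


def trustworthiness(X, embedding, k=5):
    """Penalize points that are k-nearest neighbors in the embedding
    but rank worse than k in the original space. 1.0 is a perfect score."""
    n = len(X)
    if not k < n / 2:
        raise ValueError("k must be smaller than n / 2")
    penalty = 0
    for i in range(n):
        # 1-based rank of every other point in the original space.
        rank_in_input = {j: r + 1 for r, j in enumerate(_ranked_neighbors(X, i))}
        for j in _ranked_neighbors(embedding, i)[:k]:
            penalty += max(0, rank_in_input[j] - k)
    return 1.0 - 2.0 / (n * k * (2 * n - 3 * k - 1)) * penalty


X = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
shuffled = [[0.0], [5.0], [1.0], [4.0], [2.0], [3.0]]  # neighborhoods scrambled
print(trustworthiness(X, X, k=2))
print(trustworthiness(X, shuffled, k=2))
```

An embedding that preserves every neighborhood scores exactly 1.0, while the scrambled one scores lower.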

@aamijar
Member Author

aamijar commented Aug 5, 2025

Yes, I have added the new visualizations to the PR description.

@viclafargue
Contributor

I think that the number of epochs has to be set to 1 to get a good visualization of spectral initialization within UMAP. Setting it to 0 makes UMAP use a default value.

@aamijar
Member Author

aamijar commented Aug 5, 2025

I am not running the simpl_set part since I commented out the part I mentioned in the PR description.

@viclafargue
Contributor

> I am not running the simpl_set part since I commented out the part I mentioned in the PR description.

Oh, I see, then it should be good. But, why are there only 3 points visible in the visualization then?

@aamijar
Member Author

aamijar commented Aug 5, 2025

I think it is because the dataset has 2 features. The embedding places the points on top of each other. The umap-learn version does the same if you plot the spectral initialization.

@cjnolet
Member

cjnolet commented Aug 6, 2025

@aamijar thanks for updating the visualizations in the description. I'm a little confused by the second set of visualizations, though. Can you verify the "updated" visualization is what we would expect from the cpu version as well? You should be able to verify this by setting the number of epochs to 0, or printing the points in the cpu version right after initialization.

@aamijar
Member Author

aamijar commented Aug 8, 2025

Hi @cjnolet, I have added the benchmark results to the PR description. I have also added plots showing the CPU spectral initialization. The CPU version also shows 3 points if you use sklearn SpectralEmbedding as the spectral initialization (I edited the code to do this). If you run the usual CPU spectral initialization, you get a different plot.

Comment on lines +62 to +65
auto connectivity_graph_view = raft::make_device_coo_matrix_view<float, int, int, int>(
  coo->vals(),
  raft::make_device_coordinate_structure_view<int, int, int>(
    coo->rows(), coo->cols(), n, n, coo->nnz));
Contributor

Thanks @aamijar, could you make sure that this uses nnz_t for the nnz type instead of hardwiring it to int? I think that should fix this issue.
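
The concern behind this request is that the number of nonzeros in the graph can exceed what a 32-bit signed integer holds. The sketch below checks a simple upper bound; the helper names are hypothetical, and the factor of 2 (for symmetrizing a k-nearest-neighbor graph) is an assumption about how the graph is built.

```python
INT32_MAX = 2**31 - 1


def fits_int32(nnz):
    """True if a nonzero count can be stored in a signed 32-bit integer."""
    return 0 <= nnz <= INT32_MAX


def knn_graph_nnz_upper_bound(n_samples, n_neighbors):
    # Each sample contributes n_neighbors edges; symmetrizing the graph
    # can at most double that count (assumed bound for illustration).
    return 2 * n_samples * n_neighbors


for n_samples in (250_000, 100_000_000):
    nnz = knn_graph_nnz_upper_bound(n_samples, n_neighbors=15)
    print(n_samples, nnz, fits_int32(nnz))
```

The benchmark-sized case fits comfortably, while the larger hypothetical input needs a wider nnz type.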

Member Author

Hi @jinsolp, I think we can create a follow up issue for nnz_t types since I would need to change the cuvs api too.

Member Author

Tracking here: rapidsai/cuvs#1243

Contributor

sounds good, thank you!

@cjnolet
Member

cjnolet commented Aug 12, 2025

/merge

@rapids-bot rapids-bot Bot merged commit a5f3914 into rapidsai:branch-25.10 Aug 12, 2025
74 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Unstructured Data Processing Aug 12, 2025

Labels

algo: umap, CUDA/C++, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Integrate new Spectral Embedding in UMAP spectral initialization

6 participants