Skip to content

Fix UMAP outlier issue by checking for outliers and shuffling#7131

Merged
rapids-bot[bot] merged 20 commits intorapidsai:branch-25.10from
jinsolp:fix-umap-outlier
Oct 1, 2025
Merged

Fix UMAP outlier issue by checking for outliers and shuffling#7131
rapids-bot[bot] merged 20 commits intorapidsai:branch-25.10from
jinsolp:fix-umap-outlier

Conversation

@jinsolp
Copy link
Copy Markdown
Contributor

@jinsolp jinsolp commented Aug 25, 2025

Closing #6454

Main difference between out simplicial set embedding and CPU UMAP was in negative sampling.
We should use updated values (value after adding gradients) in the negative sampling stage.

Dispatched to two kernels (and three usages) based on `n_components. Fixed like below.

  • optimize_batch_kernel_reg (n_components=2): update the current_reg register value (used later in the negative sampling stage) along with grads
  • optimize_batch_kernel (with shared memory): distinguish current_buffer (which used to JUST hold the gradient) from the grad_buffer. Now current_buffer and grad_buffer corresponds to the current_reg and grads registers in the register-approch kernel.
  • optimize_batch_kernel (without shared memory): untouched because the grads are applied directly to global memory. This updated value in global memory is read directly for negative sampling later on.

Visualizations 2D

50K samples random selected for plotting.
From the left

  • CPU KNN + CPU UMAP
  • GPU KNN + CPU UMAP
  • GPU KNN + GPU UMAP Before fix
  • GPU KNN + GPU UMAP After fix in this PR

Using dataset 639K x 384
unique_embeddings_Beauty_comparison

Using dataset 1.8M x 384
unique_embeddings_Appliances_comparison

Visualizations 3D

50K samples random selected for plotting.
Plotting the same dataset with n_components=3 (Which uses the second kernel).
From the left

  • GPU KNN + CPU UMAP
  • GPU KNN + GPU UMAP Before fix
  • GPU KNN + GPU UMAP After fix in this PR

Using dataset 639K x 384 (was already doing pretty well without outliers, still doing well)
Screenshot 2025-08-25 at 1 16 37 PM

Using dataset 1.8M x 384
before fix had outliers.
Screenshot 2025-08-25 at 1 22 41 PM

@jinsolp jinsolp requested a review from a team as a code owner August 25, 2025 20:23
@jinsolp jinsolp requested review from bdice and dantegd August 25, 2025 20:23
@jinsolp jinsolp added bug Something isn't working non-breaking Non-breaking change labels Aug 25, 2025
@jinsolp jinsolp changed the title Fix UMAP outlier issue by using updated gradients in negative sampling. Fix UMAP outlier issue by using updated values in negative sampling. Aug 25, 2025
@jinsolp jinsolp removed request for bdice and dantegd August 25, 2025 20:30
Copy link
Copy Markdown
Contributor

@viclafargue viclafargue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jinsolp! It looks great.

Comment thread cpp/src/umap/simpl_set_embed/optimize_batch_kernel.cuh
Comment thread cpp/src/umap/simpl_set_embed/optimize_batch_kernel.cuh
Comment on lines -334 to +358
truncate_gradient(rounding, current_buffer[d * TPB_X]));
raft::myAtomicAdd<T>((T*)cur_write + d, truncate_gradient(rounding, grads_buffer[d * TPB_X]));
Copy link
Copy Markdown
Contributor

@viclafargue viclafargue Aug 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Importantly, when random_state is set, current != cur_write and other != oth_write as updates accumulate in separate buffer to allow high precision deterministic accumulation of updates. It looks like we may still have outliers in this case? But, I guess that is acceptable for now.

Comment thread cpp/src/umap/simpl_set_embed/optimize_batch_kernel.cuh
@jinsolp jinsolp changed the title Fix UMAP outlier issue by using updated values in negative sampling. Fix UMAP outlier issue by checking for outliers and shuffling Sep 5, 2025
@jinsolp
Copy link
Copy Markdown
Contributor Author

jinsolp commented Sep 5, 2025

Have to add unit test for outlier checking. Plan is to grab a large enough dataset that originally fails (i.e. has outliers).
We can use the all-neighbors python API to build the knn graph quickly, give that as precomputed graphs to both CPU and cuML UMAP, and compare the resulting embeddings.

e.g. get min/max values of the CPU embedding, and check if all values in out embedding is within a certain threshold of that range.

Copy link
Copy Markdown
Contributor

@viclafargue viclafargue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jinsolp! It looks great! However, I believe that we would have to apply the shuffling before make_epochs_per_sample is called (see comment).

Comment thread cpp/src/umap/simpl_set_embed/algo.cuh Outdated
Comment thread cpp/src/umap/simpl_set_embed/algo.cuh
@jinsolp jinsolp requested a review from a team as a code owner September 22, 2025 22:19
Comment thread python/cuml/tests/test_umap.py Outdated
import pytest
import scipy.sparse as scipy_sparse
import umap
from cuvs.neighbors import all_neighbors, nn_descent
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not directly import the cuvs Python API in cuML. If we do, then we need to cuvs (not just libcuvs) to our test dependencies. CC @divyegala

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can get rid of this, but if we do, we have to run the full e2e cpu umap on a not-so-small dataset for comparison (because outliers don't show up with small datasets)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can consider adding cuvs to the test dependencies, but then let's make sure to guard the cuvs import with pytest.importorskip. I'll let @divyegala chime in on this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it is fine as a testing dependency. Please point me to the commit that adds cuvs as so, I will verify that we don't leak the dependency by mistake.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using importorskip for now. Left an issue: #7279

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you are still using a direct import here?

Comment thread python/cuml/tests/test_umap.py Outdated
Comment thread cpp/src/umap/simpl_set_embed/algo.cuh Outdated
Comment thread cpp/src/umap/simpl_set_embed/algo.cuh Outdated
Comment thread cpp/src/umap/simpl_set_embed/algo.cuh Outdated
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Sep 29, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Copy link
Copy Markdown
Contributor

@csadorf csadorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!

@divyegala
Copy link
Copy Markdown
Member

/merge

@rapids-bot rapids-bot Bot merged commit e736d05 into rapidsai:branch-25.10 Oct 1, 2025
200 of 202 checks passed
@jinsolp jinsolp deleted the fix-umap-outlier branch October 1, 2025 00:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working CUDA/C++ Cython / Python Cython or Python issue non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cuml.UMAP embeddings result in outliers

7 participants