Fix UMAP outliers when random_state is given #7597

rapids-bot[bot] merged 20 commits into rapidsai:release/26.02

Conversation
viclafargue left a comment:
Thanks for working on this! LGTM
dantegd left a comment:
The approach is sound and well-motivated, and should address the root cause. My only real concern is the condition for applying the chunking heuristic, which could leave some deterministic cases unprotected.
  }

- if (has_outlier) {
+ if (has_outlier || params->deterministic) {
Question: If deterministic=true but has_outlier=false, then no additional chunking is applied (num_chunks stays at 1). Is there a chance that the outlier detection (check_outliers) may miss edge cases, since it is a heuristic at the end of the day?
That is possible, but preventing it would require being overly conservative. We could default to a larger num_chunks when deterministic=true (like 4, maybe?). This has been working well so far with the synthetic/real datasets I've been working on, but you're right that it's difficult to be 100% confident that this covers all edge cases.
I wonder if it might be worth adding a "strict" mode that always does this, so that a user can turn it on explicitly, with documentation saying it shouldn't be needed in general and is just meant as a last resort?
  @pytest.mark.parametrize("n_components", [2, 5])
+ @pytest.mark.parametrize("random_state", [None, 42])
  def test_umap_outliers(n_neighbors, n_components):
      n_rows = 50_000
I wonder if it would be worth adding some edge case tests; I was thinking of something like the following (see the sketch after this list):
- Very small datasets where chunking might have odd effects
- Datasets near the threshold boundaries (e.g., nnz close to 100000 or 10000)
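A rough sketch of what such edge-case tests might look like (the dataset sizes, outlier check, and test name here are illustrative, not part of the PR):

```python
import numpy as np
import pytest
from sklearn.datasets import make_blobs

from cuml.manifold import UMAP


# Hypothetical edge cases: tiny datasets where chunking could behave oddly,
# plus sizes chosen so that nnz (~ n_rows * n_neighbors) lands near the
# 100000 threshold used by the chunking heuristic.
@pytest.mark.parametrize("n_rows", [25, 500, 6_500, 6_700])
@pytest.mark.parametrize("random_state", [None, 42])
def test_umap_outliers_edge_cases(n_rows, random_state):
    X, _ = make_blobs(
        n_samples=n_rows, n_features=32, centers=5, random_state=0
    )
    emb = UMAP(n_neighbors=15, random_state=random_state).fit_transform(X)
    # Flag gross outliers: points farther than 6 sigma from the mean distance.
    dist = np.linalg.norm(emb - emb.mean(axis=0), axis=1)
    assert (dist < dist.mean() + 6 * dist.std()).all()
```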
/merge
Closes #7176
This PR fixes outliers when `random_state` is given.

High-level explanation
Why we had issues in the previous implementation
All threads read the embedding values as they were at the start of the epoch and then write their gradient updates out to a separate buffer.

This ensures determinism because each thread computes its gradient on the same value across different runs (the value of the embedding at that epoch) instead of on nondeterministic values (say, if another thread writes its update into the embedding, we can't be sure whether this thread will read the updated value or the value before the update).
In pseudocode, it looks roughly like this:
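(The following is a minimal 1-D Python sketch with a toy `attractive_grad`; the real implementation is a CUDA kernel, so this only illustrates the read-snapshot / write-to-separate-buffer structure.)

```python
def attractive_grad(a, b, lr):
    # Toy 1-D gradient that pulls point a towards point b.
    return lr * (b - a)


def deterministic_epoch(embedding, edges, lr=0.1):
    snapshot = list(embedding)        # embedding values at the start of the epoch
    grads = [0.0] * len(embedding)    # separate gradient buffer
    for i, j in edges:                # conceptually, one GPU thread per edge
        grads[i] += attractive_grad(snapshot[i], snapshot[j], lr)
    for i, g in enumerate(grads):     # apply all updates at once -> deterministic
        embedding[i] += g
    return embedding
```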
Although this ensures deterministic behavior, it results in outliers, because gradients should be computed cumulatively: an update to the i-th vector in the embedding should be taken into account when another thread computes a gradient involving the i-th vector.

This is already achieved when we don't require determinism: threads write back into the embedding directly, so there is a greater chance of computing a gradient on an already-updated value.
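(For contrast, a sketch of the non-deterministic path in the same toy setup, reusing `attractive_grad` from above; again an illustration rather than the actual kernel.)

```python
def nondeterministic_epoch(embedding, edges, lr=0.1):
    # Updates land directly in the embedding, so a later iteration ("thread")
    # may compute its gradient on a value an earlier one has already moved.
    for i, j in edges:
        embedding[i] += attractive_grad(embedding[i], embedding[j], lr)
    return embedding
```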
Fixes in this PR
To keep it deterministic but allow threads to read a somewhat updated value, this PR splits a single epoch into more fine-grained chunks.
Now, after the kernel returns for a chunk, the next chunk of threads starts off with an embedding that includes the updates from the previous chunk of threads.

It helps to think of a larger `n_chunks` as meaning more serial behavior, and therefore a closer approximation of the desired sequential implementation. To keep this efficient, I added a bitwise flag so that each chunk's sparse updates can be applied cheaply.
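(A rough sketch of the chunked epoch, building on the toy functions above; the `touched` list stands in for the bitwise flags and is illustrative rather than the real data structure.)

```python
def chunked_deterministic_epoch(embedding, edges, n_chunks, lr=0.1):
    chunk_size = (len(edges) + n_chunks - 1) // n_chunks  # ceildiv
    for c in range(n_chunks):
        chunk = edges[c * chunk_size:(c + 1) * chunk_size]
        snapshot = list(embedding)          # snapshot per chunk, not per epoch
        grads = [0.0] * len(embedding)
        touched = [False] * len(embedding)  # stands in for the bitwise flags
        for i, j in chunk:
            grads[i] += attractive_grad(snapshot[i], snapshot[j], lr)
            touched[i] = True
        for i, was_touched in enumerate(touched):
            if was_touched:                 # sparse apply: only rows this chunk touched
                embedding[i] += grads[i]
    return embedding
```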
Benchmarks (ncomp=2)
(as of commit 1606616)
Green slots indicate the cases where we don't see outliers (i.e. with large n_chunks)
Amazon food data (5M x 384)
Amazon Sports data (13M x 384)
Appliances (1.8M x 384) and Beauty (640K x 384)
These didn't have outliers in the first place

Chosen heuristics and Performance Implications
Increasing `n_chunks` doesn't increase the optimize runtime (thanks to the sparse updates). Thus, I have conservatively chosen `num_chunks = raft::ceildiv(nnz, static_cast<nnz_t>(100000))`, based on looking at when the results start to be free from outliers.

Our original implementation with `random_state` (numbers in red in the table above) takes up about 0.2% of the end-to-end runtime. Thus, even a 2x slowdown in the optimize step doesn't really affect the e2e perf.
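(For a sense of scale, a Python equivalent of the heuristic; the 100,000 chunk size is the value from the PR, while the example nnz values are illustrative.)

```python
from math import ceil

CHUNK_EDGES = 100_000  # threshold used by the num_chunks heuristic

def num_chunks(nnz):
    # Python equivalent of raft::ceildiv(nnz, static_cast<nnz_t>(100000))
    return max(1, ceil(nnz / CHUNK_EDGES))

print(num_chunks(5_000_000))  # a ~5M-edge graph is optimized in 50 chunks per epoch
print(num_chunks(50_000))     # a small graph keeps a single chunk
```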