
Fix UMAP outliers when random_state is given#7597

Merged
rapids-bot[bot] merged 20 commits intorapidsai:release/26.02from
jinsolp:fix-umap-deterministic-outlier
Jan 26, 2026

Conversation

@jinsolp
Contributor

@jinsolp jinsolp commented Dec 11, 2025

Closes #7176

This PR fixes outliers when random_state is given.

High-level explanation

Why we had issues in the previous implementation

All threads read the embedding values from the start of the epoch and then write their gradient updates to a separate buffer.
This ensures determinism, because each thread computes its gradient on the same values across different runs (the embedding as it was at the start of that epoch) rather than on nondeterministic values (if another thread wrote its update directly into the embedding, we couldn't be sure whether this thread would read the value before or after that update).

In pseudocode:

# Existing implementation given random_state
for epoch in epochs:
    # === start kernel launch nnz threads
        grad = compute_grad(embedding[i], embedding[j])
        atomicAdd(out_buff[i], grad)
        atomicAdd(out_buff[j], grad)
    # === end kernel
    embedding += out_buff
    out_buff = 0
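The buffered scheme above can be simulated on the host. The following NumPy sketch is purely illustrative (the names and the toy attractive-force gradient are assumptions, not the actual CUDA kernel), but it shows why buffering makes the result independent of thread ordering:

```python
import numpy as np

# Host-side toy simulation of the buffered (deterministic) epoch update.
rng = np.random.default_rng(0)
n_rows, n_dims, nnz = 100, 2, 500
embedding = rng.normal(size=(n_rows, n_dims))
edges = rng.integers(0, n_rows, size=(nnz, 2))  # one "thread" per edge

def epoch_buffered(embedding, edges, lr=0.01):
    # Every thread reads the same snapshot of the embedding and accumulates
    # into a separate buffer, so the result does not depend on thread order.
    out_buff = np.zeros_like(embedding)
    for i, j in edges:
        grad = lr * (embedding[j] - embedding[i])  # toy attractive force
        out_buff[i] += grad
        out_buff[j] -= grad
    return embedding + out_buff

a = epoch_buffered(embedding, edges)
b = epoch_buffered(embedding, edges[::-1])  # reversed "thread" order
assert np.allclose(a, b)  # same result regardless of processing order
```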

Although this ensures deterministic behavior, it produces outliers, because gradients should be computed cumulatively: an update to the i-th embedding vector should be taken into account when another thread computes a gradient involving the i-th vector.

This is already achieved when we don't require determinism: threads write back to the embedding directly, so there is a good chance a gradient is computed on an already-updated value.

# Existing implementation when we don't care about determinism
for epoch in epochs:
    # === start kernel launch nnz threads
        grad = compute_grad(embedding[i], embedding[j])
        atomicAdd(embedding[i], grad)
        atomicAdd(embedding[j], grad)
    # === end kernel

Fixes in this PR

To keep the optimization deterministic while still letting threads read partially updated values, this PR splits a single epoch into more fine-grained chunks.

for epoch in epochs:
    for chunk in n_chunks:
        # === start kernel launch nnz threads
            grad = compute_grad(embedding[i], embedding[j])
            atomicAdd(out_buff[i], grad)
            atomicAdd(out_buff[j], grad)
        # === end kernel
        embedding += out_buff
        out_buff = 0

Now, after the kernel returns for a chunk, the next chunk of threads starts from an embedding that includes the updates from the previous chunk of threads.

Intuitively, a larger n_chunks means more serial behavior, and therefore a closer approximation of the desired sequential implementation.
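A host-side sketch of the chunked scheme (function and variable names are illustrative; the real implementation is a CUDA kernel):

```python
import numpy as np

def optimize_chunked(embedding, edges, n_chunks, lr=0.01):
    # Within a chunk, updates are buffered (deterministic); between chunks
    # they are applied, so later chunks see partially updated embeddings.
    for chunk in np.array_split(edges, n_chunks):
        out_buff = np.zeros_like(embedding)
        for i, j in chunk:
            grad = lr * (embedding[j] - embedding[i])  # toy gradient
            out_buff[i] += grad
            out_buff[j] -= grad
        embedding += out_buff  # the next chunk sees these updates
    return embedding
```

With n_chunks = 1 this recovers the old buffered epoch; with n_chunks equal to the number of edges it degenerates to a fully sequential update, so growing n_chunks interpolates between the two.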

To keep this efficient, I added a bitwise flag so that only the rows actually touched in a chunk are applied and reset.
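The per-chunk sparse update could look roughly like this (a hypothetical helper with a boolean mask standing in for the kernel's per-row bit flags; not the PR's actual code):

```python
import numpy as np

def apply_sparse_update(embedding, out_buff, touched):
    # 'touched' stands in for the kernel's per-row bit flags: only rows
    # written during this chunk are applied and reset, so the per-chunk
    # epilogue cost scales with the chunk size rather than with n_rows.
    rows = np.flatnonzero(touched)
    embedding[rows] += out_buff[rows]
    out_buff[rows] = 0.0
    touched[:] = False
```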

Benchmarks ncomp=2

(as of commit 1606616)
Green cells indicate the cases where we don't see outliers (i.e., with large n_chunks).

Amazon food data (5M x 384)

(screenshot: benchmark table)

Amazon Sports data (13M x 384)

(screenshot: benchmark table)

Appliances (1.8M x 384) and Beauty (640K x 384)

These didn't have outliers in the first place
(screenshot: benchmark table)

Chosen heuristics and Performance Implications

Increasing n_chunks doesn't increase the optimize runtime (thanks to the sparse updates). Thus, I have conservatively chosen num_chunks = raft::ceildiv(nnz, static_cast<nnz_t>(100000)) based on observing when the results become free of outliers.
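In Python terms, the chosen heuristic is equivalent to a ceiling division (the function name is illustrative; the constant is taken from the PR):

```python
def num_chunks_for(nnz, edges_per_chunk=100_000):
    # Ceiling division, mirroring raft::ceildiv(nnz, 100000):
    # each chunk processes at most ~100k edges.
    return (nnz + edges_per_chunk - 1) // edges_per_chunk
```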

Our original implementation with random_state (numbers in red in the table above) takes up about 0.2% of the end-to-end runtime. Thus, even a 2x slowdown in the optimize step doesn't meaningfully affect end-to-end performance.

@jinsolp jinsolp self-assigned this Dec 11, 2025
@jinsolp jinsolp requested a review from a team as a code owner December 11, 2025 01:51
@jinsolp jinsolp requested a review from hcho3 December 11, 2025 01:51
@jinsolp jinsolp added the bug Something isn't working label Dec 11, 2025
@jinsolp jinsolp requested a review from jcrist December 11, 2025 01:51
@jinsolp jinsolp added non-breaking Non-breaking change algo: umap labels Dec 11, 2025
@jinsolp jinsolp marked this pull request as draft December 11, 2025 01:51
@copy-pr-bot

copy-pr-bot Bot commented Dec 11, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@github-actions github-actions Bot added the Cython / Python Cython or Python issue label Dec 18, 2025
@jinsolp jinsolp marked this pull request as ready for review December 18, 2025 00:25
@jinsolp jinsolp requested a review from a team as a code owner December 18, 2025 00:25
@jinsolp jinsolp requested review from divyegala and jcrist and removed request for hcho3 and jcrist December 18, 2025 00:25
@jinsolp jinsolp requested a review from hcho3 December 18, 2025 00:26
Contributor

@viclafargue viclafargue left a comment


Thanks for working on this! LGTM

Comment thread cpp/src/umap/simpl_set_embed/optimize_batch_kernel.cuh
Comment thread cpp/src/umap/simpl_set_embed/optimize_batch_kernel.cuh Outdated
Member

@dantegd dantegd left a comment


The approach is sound and well-motivated, and should address the root cause. The only real concern I had is the condition for applying the chunking heuristic, which could leave some deterministic cases unprotected.

  }

- if (has_outlier) {
+ if (has_outlier || params->deterministic) {
Member

@dantegd dantegd Jan 15, 2026


Question: If deterministic=true but has_outlier=false, then no additional chunking is applied (num_chunks stays at 1). Could the outlier detection (check_outliers) miss edge cases, since it is ultimately a heuristic?

Contributor Author


That is possible, but preventing it would require being overly conservative. We could default to a larger num_chunks when deterministic=true (like 4, maybe?). The current setup has been working well with the synthetic and real datasets I've been testing on, but you're right that it's difficult to be 100% confident it covers all edge cases.

Member


I wonder if it might be worth adding a "strict" mode that always does this, so that a user can turn it on explicitly, with documentation that it shouldn't be needed in general and is just a "last resort"?

Comment thread cpp/src/umap/simpl_set_embed/algo.cuh
Comment thread cpp/src/umap/simpl_set_embed/optimize_batch_kernel.cuh
Comment thread cpp/src/umap/simpl_set_embed/algo.cuh
@pytest.mark.parametrize("random_state", [None, 42])
@pytest.mark.parametrize("n_components", [2, 5])
def test_umap_outliers(n_neighbors, n_components):
    n_rows = 50_000
Member


I wonder if it would be worth it to add some edge case tests, was thinking something like:

  • Very small datasets where chunking might have odd effects
  • Datasets near the threshold boundaries (e.g., nnz close to 100000 or 10000)

@jinsolp jinsolp changed the base branch from main to release/26.02 January 16, 2026 20:30
@jinsolp
Contributor Author

jinsolp commented Jan 26, 2026

@dantegd @csadorf Left a related issue for possible improvements: #7716

@dantegd
Member

dantegd commented Jan 26, 2026

/merge

@rapids-bot rapids-bot Bot merged commit b22be54 into rapidsai:release/26.02 Jan 26, 2026
117 checks passed

Labels

algo: umap bug Something isn't working CUDA/C++ Cython / Python Cython or Python issue non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] UMAP with random state seed produces noisy results.

4 participants