Fix UMAP outliers when random_state is given #7597

rapids-bot[bot] merged 20 commits into rapidsai:release/26.02

Conversation
viclafargue left a comment:
Thanks for working on this! LGTM
dantegd left a comment:
The approach is sound and well-motivated, and should address the root cause. My only real concern is the condition for applying the chunking heuristic, which could leave some deterministic cases unprotected.
  }

- if (has_outlier) {
+ if (has_outlier || params->deterministic) {
Question: If deterministic=true but has_outlier=false, then no additional chunking is applied (num_chunks stays at 1). Is there a chance that the outlier detection (check_outliers) may miss edge cases, since it is a heuristic at the end of the day?
That is possible, but preventing it would require being overly conservative. We could default to a larger num_chunks when deterministic=true (like 4, maybe?). This has been working well so far with the synthetic/real datasets I've been working on, but you're right that it's difficult to be 100% confident that this covers all edge cases.
I wonder if it might be worth adding a "strict" mode that always does this, so that a user can turn it on explicitly, with documentation saying it shouldn't be needed in general and is just meant as a last resort?
  @pytest.mark.parametrize("n_components", [2, 5])
+ @pytest.mark.parametrize("random_state", [None, 42])
  def test_umap_outliers(n_neighbors, n_components):
      n_rows = 50_000
I wonder if it would be worth adding some edge case tests; I was thinking of something like the following (see the sketch after this list):
- Very small datasets where chunking might have odd effects
- Datasets near the threshold boundaries (e.g., nnz close to 100000 or 10000)
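A rough sketch of what such edge-case tests might look like (the dataset sizes, outlier check, and test name here are illustrative, not part of the PR):

```python
import numpy as np
import pytest
from sklearn.datasets import make_blobs

from cuml.manifold import UMAP


# Hypothetical edge cases: tiny datasets where chunking could behave oddly,
# plus sizes chosen so that nnz (~ n_rows * n_neighbors) lands near the
# 100000 threshold used by the chunking heuristic.
@pytest.mark.parametrize("n_rows", [25, 500, 6_500, 6_700])
@pytest.mark.parametrize("random_state", [None, 42])
def test_umap_outliers_edge_cases(n_rows, random_state):
    X, _ = make_blobs(
        n_samples=n_rows, n_features=32, centers=5, random_state=0
    )
    emb = UMAP(n_neighbors=15, random_state=random_state).fit_transform(X)
    # Flag gross outliers: points farther than 6 sigma from the mean distance.
    dist = np.linalg.norm(emb - emb.mean(axis=0), axis=1)
    assert (dist < dist.mean() + 6 * dist.std()).all()
```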
/merge
Closes #7176
This PR fixes outliers when `random_state` is given.

High-level explanation
Why we had issues in the previous implementation
All threads read the embedding values as they were at the start of the epoch and then write their gradient updates out to a separate buffer.

This ensures determinism because each thread computes its gradient on the same value across different runs (the value of the embedding at that epoch) instead of on nondeterministic values (say, if another thread writes its update into the embedding, we can't be sure whether this thread will read the updated value or the value before the update).
In pseudocode, it looks roughly like this:
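(The following is a minimal 1-D Python sketch with a toy `attractive_grad`; the real implementation is a CUDA kernel, so this only illustrates the read-snapshot / write-to-separate-buffer structure.)

```python
def attractive_grad(a, b, lr):
    # Toy 1-D gradient that pulls point a towards point b.
    return lr * (b - a)


def deterministic_epoch(embedding, edges, lr=0.1):
    snapshot = list(embedding)        # embedding values at the start of the epoch
    grads = [0.0] * len(embedding)    # separate gradient buffer
    for i, j in edges:                # conceptually, one GPU thread per edge
        grads[i] += attractive_grad(snapshot[i], snapshot[j], lr)
    for i, g in enumerate(grads):     # apply all updates at once -> deterministic
        embedding[i] += g
    return embedding
```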
Although this ensures deterministic behavior, it results in outliers, because gradients should be computed cumulatively: an update to the i-th vector in the embedding should be taken into account when another thread computes a gradient involving the i-th vector.

This is already achieved when we don't require determinism: threads write back into the embedding directly, so there is a greater chance of computing a gradient on an already-updated value.
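(For contrast, a sketch of the non-deterministic path in the same toy setup, reusing `attractive_grad` from above; again an illustration rather than the actual kernel.)

```python
def nondeterministic_epoch(embedding, edges, lr=0.1):
    # Updates land directly in the embedding, so a later iteration ("thread")
    # may compute its gradient on a value an earlier one has already moved.
    for i, j in edges:
        embedding[i] += attractive_grad(embedding[i], embedding[j], lr)
    return embedding
```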
Fixes in this PR
To keep it deterministic but allow threads to read a somewhat updated value, this PR splits a single epoch into more fine-grained chunks.
Now, after the kernel returns for a chunk, the next chunk of threads starts off with an embedding that includes the updates from the previous chunk of threads.

It helps to think of a larger `n_chunks` as meaning more serial behavior, and therefore a closer approximation of the desired sequential implementation. To keep this efficient, I added a bitwise flag so that each chunk's sparse updates can be applied cheaply.
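(A rough sketch of the chunked epoch, building on the toy functions above; the `touched` list stands in for the bitwise flags and is illustrative rather than the real data structure.)

```python
def chunked_deterministic_epoch(embedding, edges, n_chunks, lr=0.1):
    chunk_size = (len(edges) + n_chunks - 1) // n_chunks  # ceildiv
    for c in range(n_chunks):
        chunk = edges[c * chunk_size:(c + 1) * chunk_size]
        snapshot = list(embedding)          # snapshot per chunk, not per epoch
        grads = [0.0] * len(embedding)
        touched = [False] * len(embedding)  # stands in for the bitwise flags
        for i, j in chunk:
            grads[i] += attractive_grad(snapshot[i], snapshot[j], lr)
            touched[i] = True
        for i, was_touched in enumerate(touched):
            if was_touched:                 # sparse apply: only rows this chunk touched
                embedding[i] += grads[i]
    return embedding
```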
Benchmarks (ncomp=2)
(as of commit 1606616)
Green slots indicate the cases where we don't see outliers (i.e. with large n_chunks)
Amazon food data (5M x 384)
Amazon Sports data (13M x 384)
Appliances (1.8M x 384) and Beauty (640K x 384)
These didn't have outliers in the first place

Chosen heuristics and Performance Implications
Increasing `n_chunks` doesn't increase the optimize runtime (thanks to the sparse updates). Thus, I have conservatively chosen `num_chunks = raft::ceildiv(nnz, static_cast<nnz_t>(100000))`, based on looking at when the results start to be free from outliers.

Our original implementation with `random_state` (numbers in red in the table above) takes up about 0.2% of the end-to-end runtime. Thus, even a 2x slowdown in the optimize step doesn't really affect the e2e perf.
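(For a sense of scale, a Python equivalent of the heuristic; the 100,000 chunk size is the value from the PR, while the example nnz values are illustrative.)

```python
from math import ceil

CHUNK_EDGES = 100_000  # threshold used by the num_chunks heuristic

def num_chunks(nnz):
    # Python equivalent of raft::ceildiv(nnz, static_cast<nnz_t>(100000))
    return max(1, ceil(nnz / CHUNK_EDGES))

print(num_chunks(5_000_000))  # a ~5M-edge graph is optimized in 50 chunks per epoch
print(num_chunks(50_000))     # a small graph keeps a single chunk
```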