Fix UMAP outliers when random_state is given
#7597
Merged: rapids-bot merged 20 commits into rapidsai:release/26.02 from jinsolp:fix-umap-deterministic-outlier on Jan 26, 2026 (+311 −142)
Commits
edd7b03 fix deterministic mode outliers (jinsolp)
36b1355 shuffle (jinsolp)
374d168 Merge branch 'main' into fix-umap-deterministic-outlier (jinsolp)
30cc173 resolve merge conflict (jinsolp)
9b210d8 bit flags for sparse apply (jinsolp)
1606616 Merge branch 'main' into fix-umap-deterministic-outlier (jinsolp)
4b3f1f9 add comment and heuristic (jinsolp)
c7ef2be comment (jinsolp)
4e9b037 Merge branch 'main' into fix-umap-deterministic-outlier (jinsolp)
d217093 cleanup (jinsolp)
f1329ab fix threshold (jinsolp)
779aba4 Add previously failing test (jinsolp)
dcea750 Merge branch 'main' into fix-umap-deterministic-outlier (jinsolp)
50b6ddc static cast (jinsolp)
90c2f6f Merge branch 'main' into fix-umap-deterministic-outlier (jinsolp)
30c2eed style check (jinsolp)
f589cca Merge branch 'main' into fix-umap-deterministic-outlier (jinsolp)
42f15ff Merge branch 'release/26.02' into fix-umap-deterministic-outlier (jinsolp)
f54f73d add detailed comments (jinsolp)
636ddad tail buffer assignment when move_other=true (jinsolp)
Question: If deterministic=true but has_outlier=false then no additional chunking is applied (num_chunks stays at 1), but is there a chance that the outlier detection (check_outliers) may miss edge cases, since it is a heuristic at the end of the day?
That is possible, but to prevent it we would have to be overly conservative. We could default to a larger num_chunks when deterministic=true (like 4 maybe?). The current heuristic has been working well so far on the synthetic and real datasets that I've been testing with, but you're right that it's difficult to be 100% confident that it will cover all edge cases.
I wonder if it might be worth adding a super "strict" mode that always does this, so that a user can turn it on explicitly, with documentation that it shouldn't be needed in general and is only to be used as a "last resort"?
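To make the options in this thread concrete, here is a minimal Python sketch of the chunk-count decision being discussed. It is not the code from this PR: the function name choose_num_chunks, the strict flag, and the outlier_chunks parameter are hypothetical, and the value 4 is only borrowed from the "like 4 maybe?" suggestion above. Only num_chunks, deterministic, and has_outlier come from the discussion. The behavior for deterministic=false is also an assumption, since the thread does not cover it.

```python
def choose_num_chunks(
    deterministic: bool,
    has_outlier: bool,
    strict: bool = False,      # hypothetical opt-in flag proposed in the thread
    outlier_chunks: int = 4,   # placeholder value, not taken from the PR
) -> int:
    """Return how many chunks to use, under the assumptions stated above."""
    if not deterministic:
        # Assumption: chunking is only relevant in deterministic mode.
        return 1
    if strict:
        # Proposed "last resort" mode: always chunk, ignoring the heuristic.
        return outlier_chunks
    if has_outlier:
        # The heuristic (check_outliers in the thread) flagged outliers.
        return outlier_chunks
    # Described in the question: no additional chunking, num_chunks stays at 1.
    return 1
```

In this sketch the strict flag trades some performance for not having to trust the heuristic, which is why the thread suggests documenting it as something that should rarely be needed.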