Forward-merge release/26.02 into main #7720
Merged
rapids-bot[bot] merged 7 commits into rapidsai:main on Jan 27, 2026
Conversation
Closes rapidsai#7176

This PR fixes outliers when `random_state` is given.

# High-level explanation

### Why we had issues in the previous implementation

All threads **read the embedding value** and then write the gradient update out to a **separate buffer**. This ensures determinism because each thread computes the gradient on the same value across different runs (the value of the embedding at that epoch) instead of on nondeterministic values (say, if another thread writes its update into the embedding, we can't be sure whether this thread will read the updated value or the value before the update). In pseudocode:

```
# Existing implementation given random_state
for epoch in epochs:
    # === start kernel launch nnz threads
    grad = compute_grad(embedding[i], embedding[j])
    atomicAdd(out_buff[i], grad)
    atomicAdd(out_buff[j], grad)
    # === end kernel
    embedding += out_buff
    out_buff = 0
```

Although this ensures deterministic behavior, it results in outliers because the gradient should be computed cumulatively, i.e. an update to the i-th vector in the embedding should be taken into account when another thread computes the gradient for the i-th vector. This is already achieved when we don't require determinism: by writing back to the embedding directly, there are more chances of computing the gradient on an updated value.

```
# Existing implementation when we don't care about determinism
for epoch in epochs:
    # === start kernel launch nnz threads
    grad = compute_grad(embedding[i], embedding[j])
    atomicAdd(embedding[i], grad)
    atomicAdd(embedding[j], grad)
    # === end kernel
```

### Fixes in this PR

To keep the optimization deterministic while still letting threads read reasonably up-to-date values, this PR splits a single epoch into more fine-grained chunks.
```
for epoch in epochs:
    for chunk in n_chunks:
        # === start kernel launch nnz threads
        grad = compute_grad(embedding[i], embedding[j])
        atomicAdd(out_buff[i], grad)
        atomicAdd(out_buff[j], grad)
        # === end kernel
        embedding += out_buff
        out_buff = 0
```

Now, after the kernel returns for a chunk, the next chunk of threads starts off with an embedding that includes the updates from the previous chunk. Intuitively, a larger `n_chunks` means more serial behavior, and therefore a closer approximation of the desired sequential implementation. To be more efficient, I added a bitwise flag to efficiently apply sparse updates per chunk.

# Benchmarks

ncomp=2 (as of commit rapidsai@1606616). Green slots indicate the cases where we don't see outliers (i.e. with large n_chunks).

### Amazon food data (5M x 384)

<img width="814" height="194" alt="Screenshot 2025-12-15 at 4 59 21 PM" src="https://github.com/user-attachments/assets/d6934e84-4085-47d6-9b10-da2882098d4a" />

### Amazon Sports data (13M x 384)

<img width="815" height="323" alt="Screenshot 2025-12-15 at 5 00 08 PM" src="https://github.com/user-attachments/assets/98b53584-7375-4d42-9c48-ef4337e6ab13" />

### Appliances (1.8M x 384) and Beauty (640K x 384)

These didn't have outliers in the first place.

<img width="816" height="307" alt="Screenshot 2025-12-15 at 5 01 27 PM" src="https://github.com/user-attachments/assets/40632d9f-7b68-4a72-8a9c-c0ab11eda358" />

# Chosen heuristics and Performance Implications

Increasing `n_chunks` doesn't increase the optimize runtime (this is due to the sparse updates). Thus, I have conservatively chosen `num_chunks = raft::ceildiv(nnz, static_cast<nnz_t>(100000))` based on looking at when the results start to be free from outliers. Our original implementation with random_state (numbers in red in the table above) **takes up about 0.2% of the end-to-end** runtime. Thus, a 2x slowdown in the optimize step doesn't really affect the e2e perf.
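The chunked scheme above can be sketched in plain Python/NumPy. This is illustrative only: the toy attractive-force gradient stands in for cuML's actual UMAP gradients, and the real implementation runs each chunk as a CUDA kernel with atomic adds, but the key property is the same: gradients within a chunk are computed against a frozen snapshot (deterministic), and updates are flushed to the embedding between chunks.

```python
import numpy as np

def optimize_chunked(embedding, pairs, lr, n_chunks, n_epochs):
    """Illustrative chunked optimizer: within a chunk, all gradients read a
    frozen snapshot of the embedding; updates accumulate in a separate
    buffer and are applied between chunks."""
    for _ in range(n_epochs):
        for chunk in np.array_split(np.asarray(pairs), n_chunks):
            out_buff = np.zeros_like(embedding)
            for i, j in chunk:
                # Toy attractive gradient pulling i and j together
                # (stand-in for the real UMAP gradient).
                grad = lr * (embedding[j] - embedding[i])
                out_buff[i] += grad
                out_buff[j] -= grad
            embedding += out_buff  # flush accumulated updates between chunks
    return embedding

rng = np.random.default_rng(42)
emb0 = rng.normal(size=(8, 2))
pairs = [(i, (i + 1) % 8) for i in range(8)]

# Same inputs -> bitwise-identical results, regardless of how many times we run.
a = optimize_chunked(emb0.copy(), pairs, lr=0.1, n_chunks=4, n_epochs=5)
b = optimize_chunked(emb0.copy(), pairs, lr=0.1, n_chunks=4, n_epochs=5)
assert np.array_equal(a, b)

# Fewer chunks = gradients computed on a staler snapshot; more chunks
# approaches the fully sequential update order.
c = optimize_chunked(emb0.copy(), pairs, lr=0.1, n_chunks=1, n_epochs=5)
```

With `n_chunks` equal to the number of pairs, every gradient would see all prior updates, recovering the sequential behavior; `n_chunks = 1` recovers the old fully-buffered (outlier-prone) behavior.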
Authors:
  - Jinsol Park (https://github.com/jinsolp)

Approvers:
  - Victor Lafargue (https://github.com/viclafargue)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#7597
fix(build): build package on merge to `release/*` branch
…ests and do not fail test runs on integration tests (rapidsai#7715)

## Summary

Explicitly install `requests` in the BERTopic integration test and add `continue-on-error: true` to wheel integration tests, to prevent external dependency failures from blocking nightly CI.

## Motivation

Wheel integration tests verify compatibility with external packages (e.g., BERTopic) but should not block CI when those packages have regressions outside our control.

**Current failure:** the BERTopic test fails due to a missing `requests` dependency in `sentence-transformers==5.2.1` (released today, 2026-01-26):

```python
ModuleNotFoundError: No module named 'requests'
  File "sentence_transformers/util/file_io.py", line 7
```

This is an upstream bug in sentence-transformers, not a cuML issue. See also: huggingface/sentence-transformers#3617

Authors:
  - Simon Adorf (https://github.com/csadorf)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - James Lamb (https://github.com/jameslamb)

URL: rapidsai#7715
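For illustration, a GitHub Actions step with this behavior might look like the following sketch. The job name, step names, and test path here are hypothetical; only `continue-on-error: true` and the explicit `requests` install reflect the change described above.

```yaml
jobs:
  wheel-integration-tests:        # hypothetical job name
    runs-on: ubuntu-latest
    steps:
      - name: Install test dependencies
        # Explicitly install `requests` to work around the missing
        # dependency in sentence-transformers 5.2.1.
        run: pip install requests bertopic
      - name: Run BERTopic integration tests
        # Don't fail the whole run on external-package regressions.
        continue-on-error: true
        run: pytest tests/integration/test_bertopic.py   # hypothetical path
```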
Closes rapidsai#4249

The [GPU-accelerated version of Delaunay](https://docs.cupy.dev/en/latest/reference/generated/cupyx.scipy.spatial.Delaunay.html) will be released in CuPy soon.

Authors:
  - Victor Lafargue (https://github.com/viclafargue)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)
  - Simon Adorf (https://github.com/csadorf)

URL: rapidsai#7674
dantegd
approved these changes
Jan 26, 2026
dantegd
approved these changes
Jan 26, 2026
Contributor (Author):

/merge nosquash
Contributor:

Commit history integrity check failed: not all commits from original PR #7718 appear to be present individually in this PR's history. This usually happens if commits were squashed during the manual resolution process. Please ensure all original commits are preserved individually. You can fix this and try the
Force-pushed 0aa51e0 to df6a931 (Compare)
jameslamb
approved these changes
Jan 26, 2026
Contributor (Author):

/merge nosquash
…a-cuda Fallback to numba-cuda with no extra CUDA packages if 'cuda_suffixed' isn't true
Contributor:

Commit history integrity check failed: not all commits from original PR #7718 appear to be present individually in this PR's history. This usually happens if commits were squashed during the manual resolution process. Please ensure all original commits are preserved individually. You can fix this and try the
Force-pushed df6a931 to 0aa9676 (Compare)
Force-pushed 0aa9676 to c6b7cff (Compare)
Closes rapidsai#7648

Authors:
  - Victor Lafargue (https://github.com/viclafargue)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#7616
jameslamb
approved these changes
Jan 27, 2026
Force-pushed c6b7cff to ee46427 (Compare)
Contributor (Author):

/merge nosquash