
Forward-merge release/26.02 into main#7720

Merged
rapids-bot[bot] merged 7 commits into rapidsai:main from csadorf:main-merge-release/26.02
Jan 27, 2026

Conversation

@csadorf
Contributor

@csadorf csadorf commented Jan 26, 2026

No description provided.

jinsolp and others added 2 commits January 26, 2026 17:55
Closes rapidsai#7176

This PR fixes outliers when `random_state` is given.

# High-level explanation
### Why we had issues in the previous implementation
All threads **read the embedding value** and then write the gradient update out to a **separate buffer**.
This ensures determinism because each thread computes the gradient on the same value across different runs (the value of the embedding at that epoch) instead of on nondeterministic values (say, if another thread writes its update directly into the embedding, we can't be sure whether this thread will read the updated value or the value before the update).

In pseudocode it looks like this:
```
# Existing implementation given random_state
for epoch in epochs:
    # === start kernel launch nnz threads
        grad = compute_grad(embedding[i], embedding[j])
        atomicAdd(out_buff[i], grad)
        atomicAdd(out_buff[j], grad)
    # === end kernel
    embedding += out_buff
    out_buff = 0
```
Although this ensures deterministic behavior, it results in outliers because gradients should be computed cumulatively, i.e. an update to the i-th vector in the embedding should be taken into consideration when computing the gradient involving the i-th vector in another thread.

This is already achieved when we don't require determinism: we write back to the embedding directly, so there is a greater chance of computing the gradient on an updated value.
```
# Existing implementation when we don't care about determinism
for epoch in epochs:
    # === start kernel launch nnz threads
        grad = compute_grad(embedding[i], embedding[j])
        atomicAdd(embedding[i], grad)
        atomicAdd(embedding[j], grad)
    # === end kernel
```

### Fixes in this PR
To keep it deterministic but allow threads to read a somewhat updated value, this PR splits a single epoch into more fine-grained chunks.

```
for epoch in epochs:
    for chunk in n_chunks:
        # === start kernel launch nnz threads
            grad = compute_grad(embedding[i], embedding[j])
            atomicAdd(out_buff[i], grad)
            atomicAdd(out_buff[j], grad)
        # === end kernel
        embedding += out_buff
        out_buff = 0
```

Now, after the kernel returns for a chunk, the next chunk of threads starts off with an embedding that includes the updates from the previous chunk of threads.

Intuitively, a larger `n_chunks` means more serial behavior, and therefore a closer approximation of the desired sequential implementation.
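The chunked scheme above can be sketched in plain NumPy. This is an illustrative toy, not the actual CUDA implementation: the edge list, learning rate, and the attractive-only gradient are made up for demonstration; the point is that every edge in a chunk reads the same embedding snapshot (deterministic), while later chunks see the earlier chunks' applied updates.

```python
import numpy as np

def optimize(embedding, edges, n_epochs, n_chunks, lr=0.1):
    """Deterministic chunked update: buffer gradients per chunk,
    then apply them before the next chunk starts."""
    n_edges = len(edges)
    chunk_size = -(-n_edges // n_chunks)  # ceiling division
    out_buff = np.zeros_like(embedding)
    for _ in range(n_epochs):
        for c in range(n_chunks):
            # "kernel": every edge in this chunk reads the *same*
            # embedding snapshot, so accumulation order can't change
            # the values threads compute gradients on.
            for i, j in edges[c * chunk_size:(c + 1) * chunk_size]:
                grad = embedding[j] - embedding[i]  # toy attractive gradient
                out_buff[i] += lr * grad
                out_buff[j] -= lr * grad
            # apply buffered updates before the next chunk, so later
            # chunks work on a partially updated embedding
            embedding += out_buff
            out_buff[:] = 0.0
    return embedding
```

Running this twice on the same inputs yields bitwise-identical results, which is the determinism property the buffered scheme is designed to preserve.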

For efficiency, I added a bitwise flag to apply sparse updates per chunk.
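The sparse-update idea can be sketched as follows. This is a hedged approximation in NumPy (the PR tracks touched rows with a bitwise flag; a boolean array plays that role here): only rows that a chunk actually wrote to get applied and zeroed, so the per-chunk apply/reset cost scales with touched rows rather than the full embedding.

```python
import numpy as np

def apply_sparse(embedding, out_buff, touched):
    """Apply and clear only the rows this chunk touched."""
    rows = np.flatnonzero(touched)     # indices of rows written this chunk
    embedding[rows] += out_buff[rows]  # sparse apply
    out_buff[rows] = 0.0               # sparse reset of the buffer
    touched[rows] = False              # clear the flags for the next chunk
    return embedding
```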

# Benchmarks ncomp=2
(as of commit rapidsai@1606616)
Green cells indicate the cases where we don't see outliers (i.e. with large `n_chunks`).
### Amazon food data (5M x 384)
<img width="814" height="194" alt="Screenshot 2025-12-15 at 4 59 21 PM" src="https://github.com/user-attachments/assets/d6934e84-4085-47d6-9b10-da2882098d4a" />

### Amazon Sports data (13M x 384)
<img width="815" height="323" alt="Screenshot 2025-12-15 at 5 00 08 PM" src="https://github.com/user-attachments/assets/98b53584-7375-4d42-9c48-ef4337e6ab13" />

### Appliances (1.8M x 384) and Beauty (640K x 384)
These didn't have outliers in the first place
<img width="816" height="307" alt="Screenshot 2025-12-15 at 5 01 27 PM" src="https://github.com/user-attachments/assets/40632d9f-7b68-4a72-8a9c-c0ab11eda358" />


# Chosen heuristics and Performance Implications
Increasing `n_chunks` doesn't increase the optimize runtime (thanks to the sparse updates). Thus, we have conservatively chosen `num_chunks = raft::ceildiv(nnz, static_cast<nnz_t>(100000))` based on observing when the results become free from outliers.
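For reference, the chosen heuristic amounts to one chunk per 100k edges, rounded up; a Python equivalent of that `raft::ceildiv` call:

```python
def num_chunks(nnz, edges_per_chunk=100_000):
    """One chunk per 100k nonzeros, rounded up (ceiling division)."""
    return -(-nnz // edges_per_chunk)
```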

Our original implementation with random_state (numbers in red in the table above) **takes up about 0.2% of the end-to-end** runtime. Thus, even a 2x slowdown in the optimize step doesn't meaningfully affect end-to-end performance.

Authors:
  - Jinsol Park (https://github.com/jinsolp)

Approvers:
  - Victor Lafargue (https://github.com/viclafargue)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#7597
fix(build): build package on merge to `release/*` branch
@csadorf csadorf requested review from a team as code owners January 26, 2026 19:08
@github-actions github-actions Bot added Cython / Python Cython or Python issue CUDA/C++ labels Jan 26, 2026
@csadorf csadorf changed the title Main merge release/26.02 Forward-merge release/26.02 into main Jan 26, 2026
@csadorf csadorf added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jan 26, 2026
csadorf and others added 2 commits January 26, 2026 19:46
…ests and do not fail test runs on integration tests (rapidsai#7715)

## Summary

Explicitly install `requests` in BERTopic integration test and add `continue-on-error: true` to wheel integration tests to prevent external dependency failures from blocking nightly CI.

## Motivation

Wheel integration tests verify compatibility with external packages (e.g., BERTopic) but should not block CI when those packages have regressions outside our control.

**Current failure:** BERTopic test fails due to missing `requests` dependency in `sentence-transformers==5.2.1` (released today, 2026-01-26):
```python
ModuleNotFoundError: No module named 'requests'
  File "sentence_transformers/util/file_io.py", line 7
```

This is an upstream bug in sentence-transformers, not a cuML issue.

See also: huggingface/sentence-transformers#3617

Authors:
  - Simon Adorf (https://github.com/csadorf)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - James Lamb (https://github.com/jameslamb)

URL: rapidsai#7715
@csadorf
Contributor Author

csadorf commented Jan 26, 2026

/merge nosquash

@rapids-bot
Contributor

rapids-bot Bot commented Jan 26, 2026

Commit history integrity check failed: not all commits from original PR #7718 appear to be present individually in this PR's history. This usually happens if commits were squashed during the manual resolution process. Please ensure all original commits are preserved individually. You can fix this and try the /merge nosquash command again.

@csadorf csadorf force-pushed the main-merge-release/26.02 branch from 0aa51e0 to df6a931 Compare January 26, 2026 22:21
@csadorf csadorf requested a review from a team as a code owner January 26, 2026 22:21
@csadorf csadorf requested a review from msarahan January 26, 2026 22:21
@github-actions github-actions Bot added the ci label Jan 26, 2026
@csadorf
Contributor Author

csadorf commented Jan 26, 2026

/merge nosquash

…a-cuda

Fallback to numba-cuda with no extra CUDA packages if 'cuda_suffixed' isn't true
@rapids-bot
Contributor

rapids-bot Bot commented Jan 26, 2026

Commit history integrity check failed: not all commits from original PR #7718 appear to be present individually in this PR's history. This usually happens if commits were squashed during the manual resolution process. Please ensure all original commits are preserved individually. You can fix this and try the /merge nosquash command again.

@csadorf csadorf force-pushed the main-merge-release/26.02 branch from df6a931 to 0aa9676 Compare January 27, 2026 14:52
@github-actions github-actions Bot removed the ci label Jan 27, 2026
@csadorf csadorf force-pushed the main-merge-release/26.02 branch from 0aa9676 to c6b7cff Compare January 27, 2026 14:55
@csadorf csadorf requested a review from a team as a code owner January 27, 2026 14:55
@github-actions github-actions Bot added the ci label Jan 27, 2026
@csadorf csadorf force-pushed the main-merge-release/26.02 branch from c6b7cff to ee46427 Compare January 27, 2026 15:36
@csadorf
Contributor Author

csadorf commented Jan 27, 2026

/merge nosquash

@rapids-bot rapids-bot Bot merged commit f9928c4 into rapidsai:main Jan 27, 2026
117 checks passed

Labels

ci CUDA/C++ Cython / Python Cython or Python issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants