
[cuda.coop]: add device-side coop.warp.sum benchmark with pynvbench #6846

Merged
NaderAlAwar merged 9 commits into NVIDIA:main from NaderAlAwar:python-bench-warp-reduce
Feb 3, 2026

@NaderAlAwar
Contributor

Description

closes #6606

This adds a Python benchmark for coop.warp.sum using pynvbench. The same benchmark was already written in C++ in #6431; this PR reimplements it in Python.

Comparing the two, we get these results from the pre-existing C++ benchmark (I deleted the results for types we don't support in cuda.coop right now):

| T{ct} | Samples |  CPU Time  | Noise |  GPU Time  | Noise |
|-------|---------|------------|-------|------------|-------|
|    I8 |    438x |  41.403 us | 1.11% |  33.060 us | 1.21% |
|   I16 |    308x |  41.496 us | 1.34% |  32.183 us | 1.32% |
|   I32 |    472x |  17.428 us | 1.44% |   9.120 us | 3.03% |
|   I64 |    652x |  61.092 us | 3.58% |  52.017 us | 3.93% |
|   F16 |    562x |  37.591 us | 0.80% |  29.853 us | 1.12% |
|   F32 |    462x |  38.499 us | 0.46% |  29.664 us | 1.23% |
|   F64 |    416x | 226.925 us | 0.14% | 218.994 us | 0.14% |

and these results from the new Python benchmark:

| T{ct} | Samples |  CPU Time  | Noise  |  GPU Time  | Noise |
|-------|---------|------------|--------|------------|-------|
|    I8 |    436x |  53.733 us |  4.15% |  33.017 us | 1.26% |
|   I16 |    326x |  53.525 us |  5.77% |  32.181 us | 1.40% |
|   I32 |    664x |  28.928 us |  4.75% |   9.101 us | 3.42% |
|   I64 |    486x |  77.020 us |  2.57% |  55.843 us | 0.81% |
|   F16 |    624x |  54.038 us | 16.46% |  32.318 us | 1.85% |
|   F32 |    428x |  50.040 us |  2.32% |  29.422 us | 1.36% |
|   F64 |    474x | 238.258 us |  0.78% | 218.305 us | 0.21% |

The GPU times are close for every type (within 10%), but the Python implementation has consistently larger CPU times, likely due to overhead introduced by the numba-cuda kernel call.
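To make the CPU-time gap concrete, here is a small sketch that subtracts GPU time from CPU time for each type, using only the numbers from the two tables above. Treating that difference as host-side launch overhead is an assumption made for illustration, not something the benchmark reports directly; the names `cpp`, `python`, and `host_overhead` are hypothetical.

```python
# Times in microseconds, copied from the two tables above: (CPU time, GPU time).
cpp = {
    "I8": (41.403, 33.060), "I16": (41.496, 32.183), "I32": (17.428, 9.120),
    "I64": (61.092, 52.017), "F16": (37.591, 29.853), "F32": (38.499, 29.664),
    "F64": (226.925, 218.994),
}
python = {
    "I8": (53.733, 33.017), "I16": (53.525, 32.181), "I32": (28.928, 9.101),
    "I64": (77.020, 55.843), "F16": (54.038, 32.318), "F32": (50.040, 29.422),
    "F64": (238.258, 218.305),
}


def host_overhead(results):
    """CPU time minus GPU time per type (assumed proxy for host-side cost)."""
    return {t: round(cpu - gpu, 3) for t, (cpu, gpu) in results.items()}


cpp_overhead = host_overhead(cpp)
py_overhead = host_overhead(python)

for t in cpp:
    print(f"{t:>4}: C++ {cpp_overhead[t]:7.3f} us | Python {py_overhead[t]:7.3f} us")
```

With these numbers the difference stays under 10 us for every C++ row and lands at roughly 20 to 22 us for every Python row, regardless of element type, which is consistent with a fixed per-launch cost rather than a slower kernel.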

One other thing of note is that the SASS for the two versions is identical, except for an extra NOP in the Python version that appears after the REDUX instruction:

REDUX.SUM.S32 UR4, R0 ;
NOP ; <- not present in the C++ version

I spent some time investigating this and have a minimal reproducer showing that the issue comes from numba-cuda. I will open an issue to track this.
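For readers without context on the operation being benchmarked, the following is a CPU-side model of a warp-wide sum in plain Python. It uses an XOR-butterfly exchange, which is one common software formulation of a warp reduction. It is not the cuda.coop implementation and it does not model how REDUX works in hardware; `WARP_SIZE` and `warp_sum_model` are hypothetical names for this sketch.

```python
WARP_SIZE = 32  # lanes per warp on current NVIDIA GPUs


def warp_sum_model(lane_values):
    """Return the per-lane results of an XOR-butterfly sum over one warp."""
    vals = list(lane_values)
    assert len(vals) == WARP_SIZE
    offset = WARP_SIZE // 2
    while offset > 0:
        # Every lane adds the value held by its partner lane (lane XOR offset).
        vals = [vals[lane] + vals[lane ^ offset] for lane in range(WARP_SIZE)]
        offset //= 2
    return vals


result = warp_sum_model(range(WARP_SIZE))
print(result[0])  # sum of 0..31
```

In this model every lane ends up holding the total after five exchange steps. CUB-style warp reductions typically guarantee the aggregate only in lane 0, so the all-lanes result here is a property of the butterfly pattern rather than of the benchmarked API.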

This PR will also introduce pynvbench as an optional dependency (haven't implemented this yet).

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

creating an intrinsic that mimics the memcpy used to generate the random input data b) calculating the grid size instead of using grid 1 c) not passing the input stream
@copy-pr-bot
Contributor

copy-pr-bot bot commented Dec 2, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Dec 2, 2025
@NaderAlAwar NaderAlAwar marked this pull request as ready for review January 30, 2026 16:24
@NaderAlAwar NaderAlAwar requested review from a team as code owners January 30, 2026 16:24
@NaderAlAwar NaderAlAwar requested a review from shwina January 30, 2026 16:24
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Jan 30, 2026
@shwina
Contributor

shwina commented Jan 30, 2026

I think it would be helpful to write a brief README.md in the benchmarks directory for cuda.coop describing the approach we're taking. Someone reading the benchmark without context may not understand why it is written the way it is (e.g., using custom intrinsics and a manually unrolled loop).

@NaderAlAwar NaderAlAwar requested a review from a team as a code owner January 30, 2026 18:55
@NaderAlAwar NaderAlAwar requested a review from gonidelis January 30, 2026 18:55
@NaderAlAwar NaderAlAwar enabled auto-merge (squash) January 30, 2026 19:22
@NaderAlAwar NaderAlAwar disabled auto-merge February 2, 2026 18:50
Collaborator

@gevtushenko gevtushenko left a comment

I'd suggest moving the benchmarking facilities to a common header for later reuse.

@NaderAlAwar
Contributor Author

Updated benchmark numbers following the REDUX optimization for integer types smaller than 4 bytes (we upcast to int32 so that REDUX can be used):

C++

### [0] NVIDIA RTX 6000 Ada Generation

| T{ct} | Samples |  CPU Time  | Noise |  GPU Time  | Noise |
|-------|---------|------------|-------|------------|-------|
|    I8 |    394x |  20.624 us | 4.41% |   9.309 us | 3.13% |
|   I16 |    620x |  21.963 us | 1.40% |  10.052 us | 3.66% |
|   I32 |    374x |  20.359 us | 2.30% |   9.158 us | 3.00% |
|   I64 |    368x |  66.403 us | 1.28% |  54.405 us | 0.75% |
|   F16 |    686x |  41.971 us | 1.31% |  30.574 us | 1.22% |
|   F32 |    326x |  39.606 us | 0.90% |  28.800 us | 1.09% |
|   F64 |    324x | 229.889 us | 0.18% | 219.088 us | 0.21% |

Python

### [0] NVIDIA RTX 6000 Ada Generation

| T{ct} | Samples |  CPU Time  | Noise  |  GPU Time  | Noise |
|-------|---------|------------|--------|------------|-------|
|    I8 |    400x |  29.527 us |  7.86% |   9.161 us | 3.20% |
|   I16 |    506x |  30.929 us |  8.81% |   9.934 us | 4.17% |
|   I32 |    424x |  29.689 us | 16.27% |   9.158 us | 3.28% |
|   I64 |    522x |  77.199 us |  5.87% |  56.194 us | 0.88% |
|   F16 |    340x |  52.305 us |  2.29% |  32.289 us | 1.57% |
|   F32 |    494x |  52.827 us |  3.82% |  31.874 us | 1.21% |
|   F64 |    332x | 257.336 us |  2.38% | 236.187 us | 0.20% |
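The upcast to int32 mentioned above is safe for sums as long as the narrow type is expected to wrap around on overflow, because addition modulo 2^32 followed by truncation gives the same answer as addition modulo 2^8 or 2^16. The sketch below checks this in plain Python under that wraparound assumption. It only models the arithmetic; `wrap`, `narrow_sum`, and `upcast_sum` are hypothetical helpers, not cuda.coop or numba-cuda functions.

```python
import random


def wrap(value, bits):
    """Interpret value modulo 2**bits as a signed two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def narrow_sum(lanes, bits):
    """Sum with wraparound after every addition, as narrow arithmetic would."""
    total = 0
    for x in lanes:
        total = wrap(total + x, bits)
    return total


def upcast_sum(lanes, bits):
    """Sum as 32-bit signed values (one wide reduction), then truncate back."""
    total32 = wrap(sum(wrap(x, 32) for x in lanes), 32)
    return wrap(total32, bits)


random.seed(0)
for bits in (8, 16):
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    for _ in range(1000):
        lanes = [random.randint(lo, hi) for _ in range(32)]
        assert narrow_sum(lanes, bits) == upcast_sum(lanes, bits)
```

This matches the tables above, where I8 and I16 now show GPU times close to I32 in both the C++ and Python benchmarks.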

@NaderAlAwar NaderAlAwar enabled auto-merge (squash) February 2, 2026 21:25
@github-actions
Contributor

github-actions bot commented Feb 3, 2026

🥳 CI Workflow Results

🟩 Finished in 14h 13m: Pass: 100%/56 | Total: 18h 15m | Max: 57m 20s

See results here.

@NaderAlAwar NaderAlAwar merged commit b0e3135 into NVIDIA:main Feb 3, 2026
156 of 162 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Feb 3, 2026
fbusato pushed a commit to fbusato/cccl that referenced this pull request Feb 19, 2026
Development

Successfully merging this pull request may close these issues.

Add WarpReduce Device-Side Benchmarks for cuda.coop
