[cuda.coop]: add device-side `coop.warp.sum` benchmark with pynvbench by NaderAlAwar · Pull Request #6846 · NVIDIA/cccl

NaderAlAwar · 2025-12-02T20:37:16Z

Description

This adds a Python benchmark for coop.warp.sum using pynvbench. The same benchmark already exists in C++ (added in #6431); this PR reimplements it in Python.

Comparing the two, we get these results from the pre-existing C++ benchmark (I deleted the results for types we don't support in cuda.coop right now):

| T{ct} | Samples |  CPU Time  | Noise |  GPU Time  | Noise |
|-------|---------|------------|-------|------------|-------|
|    I8 |    438x |  41.403 us | 1.11% |  33.060 us | 1.21% |
|   I16 |    308x |  41.496 us | 1.34% |  32.183 us | 1.32% |
|   I32 |    472x |  17.428 us | 1.44% |   9.120 us | 3.03% |
|   I64 |    652x |  61.092 us | 3.58% |  52.017 us | 3.93% |
|   F16 |    562x |  37.591 us | 0.80% |  29.853 us | 1.12% |
|   F32 |    462x |  38.499 us | 0.46% |  29.664 us | 1.23% |
|   F64 |    416x | 226.925 us | 0.14% | 218.994 us | 0.14% |

and these results from the new Python benchmark:

| T{ct} | Samples |  CPU Time  | Noise  |  GPU Time  | Noise |
|-------|---------|------------|--------|------------|-------|
|    I8 |    436x |  53.733 us |  4.15% |  33.017 us | 1.26% |
|   I16 |    326x |  53.525 us |  5.77% |  32.181 us | 1.40% |
|   I32 |    664x |  28.928 us |  4.75% |   9.101 us | 3.42% |
|   I64 |    486x |  77.020 us |  2.57% |  55.843 us | 0.81% |
|   F16 |    624x |  54.038 us | 16.46% |  32.318 us | 1.85% |
|   F32 |    428x |  50.040 us |  2.32% |  29.422 us | 1.36% |
|   F64 |    474x | 238.258 us |  0.78% | 218.305 us | 0.21% |

The GPU times are nearly identical between the two, but the Python implementation has larger CPU times, likely due to overhead introduced by the numba-cuda kernel call.
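One rough way to read the two tables above is to subtract GPU time from CPU time for each type. Treating that difference as a proxy for host-side launch overhead is an assumption made here for illustration, not something the benchmark reports directly. The numbers are copied from the tables; the helper name `host_overhead` is hypothetical.

```python
# (CPU time, GPU time) in microseconds, copied from the two tables above.
cpp = {
    "I8": (41.403, 33.060),
    "I16": (41.496, 32.183),
    "I32": (17.428, 9.120),
    "I64": (61.092, 52.017),
    "F16": (37.591, 29.853),
    "F32": (38.499, 29.664),
    "F64": (226.925, 218.994),
}
python = {
    "I8": (53.733, 33.017),
    "I16": (53.525, 32.181),
    "I32": (28.928, 9.101),
    "I64": (77.020, 55.843),
    "F16": (54.038, 32.318),
    "F32": (50.040, 29.422),
    "F64": (238.258, 218.305),
}


def host_overhead(results):
    """Return CPU time minus GPU time per type, in microseconds.

    Assumption: this gap approximates host-side overhead per launch.
    """
    return {t: round(cpu - gpu, 3) for t, (cpu, gpu) in results.items()}


cpp_overhead = host_overhead(cpp)
py_overhead = host_overhead(python)

for t in cpp:
    print(f"{t:>4}: C++ {cpp_overhead[t]:7.3f} us | Python {py_overhead[t]:7.3f} us")
```

With these numbers the gap is roughly 8 to 9 us for every C++ row and roughly 20 to 22 us for every Python row, independent of the element type. That fits a fixed per-launch cost on the host rather than a difference in the generated kernel.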

One other thing of note is that the SASS for the two versions is identical, except for an extra NOP in the Python version that appears after the REDUX instruction:

REDUX.SUM.S32 UR4, R0 ;
NOP ; <- not present in the C++ version

I spent some time investigating this and have a minimal reproducer showing that the issue comes from numba-cuda. I will open an issue to track this.

This PR will also introduce pynvbench as an optional dependency (haven't implemented this yet).

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2025-12-02T20:37:20Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

shwina · 2026-01-30T16:34:30Z

I think it would be helpful to write a brief README.md in the benchmarks directory for cuda.coop describing the approach we're taking. Someone reading the benchmark without context may not understand why we are writing it the way we are (e.g., using custom intrinsics and a manually unrolled loop).

gevtushenko

I'd suggest moving the benchmarking facilities to a common header for later reuse.

NaderAlAwar · 2026-02-02T19:27:44Z

Updated benchmark numbers following the REDUX optimization for integers smaller than 4 bytes (we upcast to int32 to be able to use REDUX):

C++

### [0] NVIDIA RTX 6000 Ada Generation

| T{ct} | Samples |  CPU Time  | Noise |  GPU Time  | Noise |
|-------|---------|------------|-------|------------|-------|
|    I8 |    394x |  20.624 us | 4.41% |   9.309 us | 3.13% |
|   I16 |    620x |  21.963 us | 1.40% |  10.052 us | 3.66% |
|   I32 |    374x |  20.359 us | 2.30% |   9.158 us | 3.00% |
|   I64 |    368x |  66.403 us | 1.28% |  54.405 us | 0.75% |
|   F16 |    686x |  41.971 us | 1.31% |  30.574 us | 1.22% |
|   F32 |    326x |  39.606 us | 0.90% |  28.800 us | 1.09% |
|   F64 |    324x | 229.889 us | 0.18% | 219.088 us | 0.21% |

Python

### [0] NVIDIA RTX 6000 Ada Generation

| T{ct} | Samples |  CPU Time  | Noise  |  GPU Time  | Noise |
|-------|---------|------------|--------|------------|-------|
|    I8 |    400x |  29.527 us |  7.86% |   9.161 us | 3.20% |
|   I16 |    506x |  30.929 us |  8.81% |   9.934 us | 4.17% |
|   I32 |    424x |  29.689 us | 16.27% |   9.158 us | 3.28% |
|   I64 |    522x |  77.199 us |  5.87% |  56.194 us | 0.88% |
|   F16 |    340x |  52.305 us |  2.29% |  32.289 us | 1.57% |
|   F32 |    494x |  52.827 us |  3.82% |  31.874 us | 1.21% |
|   F64 |    332x | 257.336 us |  2.38% | 236.187 us | 0.20% |
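The upcast mentioned above is safe for sums because two's-complement addition is modular: summing in 32 bits and truncating gives the same result as wrapping in the narrow type at every step. The sketch below models only that arithmetic in plain Python, assuming wrapping semantics for the narrow type. It does not show how cuda.coop or the C++ implementation performs the reduction, and the function names are hypothetical.

```python
import random


def wrap(value, bits):
    """Interpret value as a two's-complement signed integer of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def warp_sum_narrow(lanes, bits):
    """Accumulate directly in the narrow type, wrapping after every add."""
    total = 0
    for x in lanes:
        total = wrap(total + x, bits)
    return total


def warp_sum_upcast(lanes, bits):
    """Upcast each lane to int32, sum in 32 bits, then truncate to the narrow type."""
    total32 = 0
    for x in lanes:
        total32 = wrap(total32 + wrap(x, 32), 32)
    return wrap(total32, bits)


random.seed(0)
lanes = [random.randint(-128, 127) for _ in range(32)]  # one int8 value per lane
assert warp_sum_narrow(lanes, 8) == warp_sum_upcast(lanes, 8)
```

The same check holds for 16-bit inputs by passing `bits=16`, which matches the I8 and I16 rows being the ones whose GPU time dropped to the I32 level.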

github-actions · 2026-02-03T15:12:31Z

🥳 CI Workflow Results

🟩 Finished in 14h 13m: Pass: 100%/56 | Total: 18h 15m | Max: 57m 20s

See results here.

NaderAlAwar added 3 commits December 2, 2025 12:21

Add initial version of warp reduce benchmark in python

30fa617

Follow C++ benchmark more closely by a) creating an intrinsic that mimics the memcpy used to generate the random input data b) calculating the grid size instead of using grid 1 c) not passing the input stream

394e1a0

Use an axis for types more consistent with C++

99ffdbb

github-project-automation bot added this to CCCL Dec 2, 2025

github-project-automation bot moved this to Todo in CCCL Dec 2, 2025

cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Dec 2, 2025

Add cuda-bench to test dependencies

7d6b246

NaderAlAwar marked this pull request as ready for review January 30, 2026 16:24

NaderAlAwar requested review from a team as code owners January 30, 2026 16:24

NaderAlAwar requested a review from shwina January 30, 2026 16:24

cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Jan 30, 2026

NaderAlAwar added 2 commits January 30, 2026 12:09

Merge branch 'main' into python-bench-warp-reduce

532baa4

Add README explaining warp reduce benchmark

0160173

NaderAlAwar requested a review from a team as a code owner January 30, 2026 18:55

NaderAlAwar requested a review from gonidelis January 30, 2026 18:55

shwina approved these changes Jan 30, 2026

NaderAlAwar enabled auto-merge (squash) January 30, 2026 19:22

Move cuda-bench to new set of optional dependencies

2b07507

Merge branch 'main' into python-bench-warp-reduce

eb7dbce

NaderAlAwar disabled auto-merge February 2, 2026 18:50

gevtushenko approved these changes Feb 2, 2026

oleksandr-pavlyk approved these changes Feb 2, 2026

Move kernel creation to separate file for later reuse

dc95e94

NaderAlAwar enabled auto-merge (squash) February 2, 2026 21:25

NaderAlAwar merged commit b0e3135 into NVIDIA:main Feb 3, 2026
156 of 162 checks passed

github-project-automation bot moved this from In Review to Done in CCCL Feb 3, 2026

fbusato pushed a commit to fbusato/cccl that referenced this pull request Feb 19, 2026

[cuda.coop]: add device-side coop.warp.sum benchmark with pynvbench (…

9b09fd4

…NVIDIA#6846) * Add cuda-bench to benchmark dependencies

NaderAlAwar merged 9 commits into NVIDIA:main from NaderAlAwar:python-bench-warp-reduce

4 participants
