[cuda.coop]: add device-side coop.warp.sum benchmark with pynvbench#6846
Merged
NaderAlAwar merged 9 commits into NVIDIA:main on Feb 3, 2026
a) creating an intrinsic that mimics the memcpy used to generate the random input data, b) calculating the grid size instead of using grid 1, c) not passing the input stream
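For the grid-size point, a minimal sketch of the usual ceiling-division calculation follows. The function name `grid_size` and its parameters are hypothetical and are not taken from the PR; the benchmark itself may compute the launch configuration differently.

```python
def grid_size(num_elements: int, threads_per_block: int) -> int:
    """Return the number of blocks needed so that every element gets a thread.

    Hypothetical helper for illustration only; not part of cuda.coop or pynvbench.
    """
    if num_elements < 0:
        raise ValueError("num_elements must be non-negative")
    if threads_per_block <= 0:
        raise ValueError("threads_per_block must be positive")
    # Ceiling division: round up so a partially filled last block is still launched.
    return (num_elements + threads_per_block - 1) // threads_per_block


# Example: 1000 elements with 256 threads per block need 4 blocks.
blocks = grid_size(1000, 256)
```

With a fixed grid of 1, only a single block would run regardless of the input size, so the measured throughput would not reflect a fully occupied device.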
Contributor
I think it's helpful to write a brief
shwina
approved these changes
Jan 30, 2026
gevtushenko
approved these changes
Feb 2, 2026
Collaborator
gevtushenko
left a comment
I'd suggest moving the benchmarking facilities to a common header for later reuse.
oleksandr-pavlyk
approved these changes
Feb 2, 2026
Contributor
Author
Updated benchmark numbers following C++ Python
Contributor
🥳 CI Workflow Results: 🟩 Finished in 14h 13m: Pass: 100%/56 | Total: 18h 15m | Max: 57m 20s
fbusato pushed a commit to fbusato/cccl that referenced this pull request on Feb 19, 2026
…NVIDIA#6846) * Add cuda-bench to benchmark dependencies
Description
closes #6606
This adds a Python benchmark for `coop.warp.sum` using pynvbench. The same benchmark already exists in C++ (#6431); this PR reimplements it in Python.

Comparing the two, we get these results from the pre-existing C++ benchmark (I deleted the results for types we don't support in cuda.coop right now):
and these results from the new Python benchmark:
The GPU Times for both are identical but the Python implementation has larger CPU Times, likely due to overhead introduced by the numba-cuda kernel call.
One other thing of note is that the SASS for the two versions is identical, except for an extra NOP in the Python version that appears after the redux instruction:
I spent some time investigating this and have a minimal reproducer showing that the issue comes from numba-cuda. I will open an issue to track this.
This PR will also introduce pynvbench as an optional dependency (haven't implemented this yet).
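As a reference for what the benchmarked primitive computes, here is a host-side sketch that emulates a warp-wide sum in pure Python with a butterfly (XOR) exchange pattern. It does not call cuda.coop, numba-cuda, or pynvbench, and it is not how the device code is implemented (the SASS above uses a redux instruction). The name `emulated_warp_sum` is hypothetical, and the sketch makes no claim about which lanes hold a valid result in `coop.warp.sum`.

```python
WARP_SIZE = 32  # threads (lanes) per warp on NVIDIA GPUs


def emulated_warp_sum(lane_values):
    """Emulate a warp-wide sum on the host with a butterfly all-reduce.

    Illustrative only: takes one value per lane and returns a list in which,
    in this emulation, every lane ends up holding the total.
    """
    vals = list(lane_values)
    if len(vals) != WARP_SIZE:
        raise ValueError(f"expected {WARP_SIZE} lane values, got {len(vals)}")

    offset = WARP_SIZE // 2
    while offset > 0:
        # Each lane adds the value held by its partner lane (lane XOR offset).
        # Building a new list models all lanes exchanging values simultaneously.
        vals = [vals[lane] + vals[lane ^ offset] for lane in range(WARP_SIZE)]
        offset //= 2
    return vals


# Lanes hold 0..31, so the total is 0 + 1 + ... + 31 = 496.
result = emulated_warp_sum(range(WARP_SIZE))
```

A host-side reference like this can be useful for checking device results on small inputs, independent of the timing measurements.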
Checklist