
Segmented scan mvp #6022

Merged
oleksandr-pavlyk merged 31 commits into NVIDIA:main from oleksandr-pavlyk:segmented_scan_mvp
Nov 17, 2025

Conversation

@oleksandr-pavlyk
Contributor

@oleksandr-pavlyk oleksandr-pavlyk commented Sep 25, 2025

Description

closes #5598

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Contributor

copy-pr-bot bot commented Sep 25, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Sep 25, 2025
@oleksandr-pavlyk
Contributor Author

/ok to test

@oleksandr-pavlyk oleksandr-pavlyk self-assigned this Oct 8, 2025
@oleksandr-pavlyk oleksandr-pavlyk added the cub For all items related to CUB label Oct 8, 2025
@oleksandr-pavlyk
Contributor Author

/ok to test

@oleksandr-pavlyk
Contributor Author

Current status:

  • Implemented APIs

    • cub::DeviceSegmentedScan::InclusiveSegmentedScan
    • cub::DeviceSegmentedScan::ExclusiveSegmentedScan
    • cub::DeviceSegmentedScan::InclusiveSegmentedScanInit
    • cub::DeviceSegmentedScan::InclusiveSegmentedSum
    • cub::DeviceSegmentedScan::ExclusiveSegmentedSum
  • Catch2 tests are written for each *SegmentedScan* variant, but not for *SegmentedSum*.

  • Doxygen docstrings are in place.

The current APIs are based on the following dispatch signature:

  cudaError_t Dispatch(
       void *temp_storage,
       size_t &temp_storage_bytes,
       InputIteratorT d_in,
       OutputIteratorT d_out,
       int64_t num_segments,
       BeginOffsetsIteratorInputT d_in_begin_offsets,
       EndOffsetsIteratorInputT d_in_end_offsets,
       BeginOffsetsIteratorOutputT d_out_begin_offsets,
       ScanOpT scan_op,
       InitialValueT init_value,
       cudaStream_t stream = 0);

A common situation is d_in_end_offsets == d_in_begin_offsets + 1, i.e., the input segments are adjacent to one another. In this case d_out_begin_offsets is typically the same as d_in_begin_offsets. For in-place work, i.e., d_out == d_in, we would also typically use d_out_begin_offsets == d_in_begin_offsets.
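To make the role of the three offset sequences concrete, here is a minimal host-side sketch of the intended semantics. It is not the CUB implementation: reference_inclusive_segmented_scan and adjacent_segments_example are hypothetical names introduced for illustration, and plain pointers stand in for the offset iterators.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical host-side reference (not part of CUB): inclusive scan of each
// segment [in_begin[i], in_end[i]) of `in`, written to `out` starting at
// position out_begin[i].
template <typename T, typename OffsetT, typename ScanOp>
void reference_inclusive_segmented_scan(
  const std::vector<T>& in,
  std::vector<T>& out,
  std::int64_t num_segments,
  const OffsetT* in_begin,
  const OffsetT* in_end,
  const OffsetT* out_begin,
  ScanOp scan_op)
{
  for (std::int64_t seg = 0; seg < num_segments; ++seg)
  {
    const OffsetT size = in_end[seg] - in_begin[seg];
    assert(size >= 0);
    if (size == 0)
    {
      continue; // empty segment: nothing is read or written
    }
    T running           = in[in_begin[seg]];
    out[out_begin[seg]] = running;
    for (OffsetT k = 1; k < size; ++k)
    {
      running                 = scan_op(running, in[in_begin[seg] + k]);
      out[out_begin[seg] + k] = running;
    }
  }
}

// Adjacent segments: a single offsets array with num_segments + 1 entries
// serves as begin offsets, end offsets (shifted by one), and output offsets.
std::vector<int> adjacent_segments_example()
{
  const std::vector<int> in{1, 2, 3, 4, 5, 6, 7};
  const std::vector<int> offsets{0, 3, 3, 7}; // segments [0,3), [3,3), [3,7)
  std::vector<int> out(in.size(), 0);
  const int* begin = offsets.data();
  reference_inclusive_segmented_scan(in, out, 3, begin, begin + 1, begin, std::plus<int>{});
  return out; // {1, 3, 6, 4, 9, 15, 22}
}
```

The call passes begin + 1 as the end offsets, which is the adjacent-segments relation described above, and reuses begin for the output offsets.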

In general, when d_in_end_offsets == d_in_begin_offsets + 1 does not hold, a sensible default for d_out_begin_offsets would be the exclusive sum of the segment sizes, where the size of segment i is d_in_end_offsets[i] - d_in_begin_offsets[i].

Computing d_out_begin_offsets this way requires a temporary allocation. I would therefore like to illustrate this use case with an example, rather than provide an API that combines the two algorithms.
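As a rough host-side illustration of that default, the sketch below computes output offsets as the exclusive sum of segment sizes. The helper name packed_output_offsets is hypothetical, and a hand-written loop replaces any library scan. On the device, the same step would presumably be an exclusive sum over the sizes (for example with cub::DeviceScan::ExclusiveSum), whose result is the temporary allocation mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: out_begin[i] is the total size of segments 0..i-1,
// so the scanned segments are written back to back in the output.
std::vector<int> packed_output_offsets(const std::vector<int>& in_begin, const std::vector<int>& in_end)
{
  assert(in_begin.size() == in_end.size());
  std::vector<int> out_begin(in_begin.size());
  int running = 0;
  for (std::size_t i = 0; i < in_begin.size(); ++i)
  {
    out_begin[i] = running;                // exclusive: excludes segment i itself
    running += in_end[i] - in_begin[i];    // add the size of segment i
  }
  return out_begin;
}
```

For input segments [0,3), [5,7), [9,12) the sizes are 3, 2, 3, so the helper returns {0, 3, 5}.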

We could also provide an API that omits d_out_begin_offsets and contains CCCL_ASSERT(d_in_end_offsets == d_in_begin_offsets + 1, "Input segments must be adjacent to one another");.



@oleksandr-pavlyk oleksandr-pavlyk force-pushed the segmented_scan_mvp branch 3 times, most recently from da9743a to 6082574 Compare October 17, 2025 13:35
@oleksandr-pavlyk
Contributor Author

/ok to test


@oleksandr-pavlyk
Contributor Author

/ok to test

@oleksandr-pavlyk
Contributor Author

/ok to test


@oleksandr-pavlyk oleksandr-pavlyk marked this pull request as ready for review October 21, 2025 16:23
@oleksandr-pavlyk oleksandr-pavlyk requested a review from a team as a code owner October 21, 2025 16:23
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Oct 21, 2025
@oleksandr-pavlyk oleksandr-pavlyk force-pushed the segmented_scan_mvp branch 2 times, most recently from 826c40f to ba9ffc3 Compare October 21, 2025 21:08
@oleksandr-pavlyk oleksandr-pavlyk requested a review from a team as a code owner October 21, 2025 21:08
Add file test/catch2_test_segmented_scan_api.cu, which contains several examples
of using the two-offsets and three-offsets APIs.

Some of these examples are referenced from docstrings.
Use examples from catch2_test_segmented_scan_api.cu using
literalinclude in docstrings of DeviceSegmentedScan algorithm.
std::exclusive_scan is not available in the numeric header in GCC 7.
This works around an issue with CTK 13.0 support for extended lambdas
and CCCL testing harness.
Adhere to C++ coding style.
Renamed dispatch/kernels/segmented_scan.cuh to
dispatch/kernels/kernel_segmented_scan.cuh to follow the corresponding
change upstream.
Reimplemented tuning for segmented scan based on an
excellent review suggestion to set AgentSegmentedScanPolicyT
template values based on the AgentScanPolicyT instance extracted
from the same node in the chain of policies.

The assumption is that the chain detail::segmented_scan::policy_hub
is the same as the chain in detail::scan::policy_hub.
When AgentSegmentedScanPolicyT is built from parameters of a
scan_hub::PolicyXY::ScanPolicyT that was already mem-bound scaled,
we should not apply mem-bound scaling again.

So cub::detail::segmented_scan::policy_hub implements private NoScaling
and applies it in translate_agent helper type.
Per PR review feedback, use a prefix operator to track the exclusive prefix
value across iterations.

Improved readability of the iteration loop by introducing ScanFirstTile
and ScanLaterTile private methods that hide away the if constexpr logic.

While working on the refactoring, fixed a bug that the refactoring uncovered,
where the exclusive_prefix computed for ExclusiveScan was not accounting
for init_value.
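
The kind of bug described here can be modeled on the host. The sketch below is an assumption about the general shape of a tile-by-tile exclusive scan, not the agent code from this PR: tiled_exclusive_scan and tiled_exclusive_scan_example are hypothetical names, and a sequential loop stands in for the per-tile block scan. The point it shows is that the prefix carried into later tiles has to combine init_value with the tile aggregates, not the aggregates alone.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical host-side model of scanning one segment tile by tile.
template <typename T, typename ScanOp>
std::vector<T> tiled_exclusive_scan(const std::vector<T>& in, T init_value, ScanOp scan_op, std::size_t tile_size)
{
  assert(tile_size > 0);
  std::vector<T> out(in.size());
  T exclusive_prefix = init_value; // prefix seen by the first tile
  for (std::size_t tile_begin = 0; tile_begin < in.size(); tile_begin += tile_size)
  {
    const std::size_t tile_end = std::min(in.size(), tile_begin + tile_size);

    // Aggregate of this tile's inputs alone (what a block-wide scan reports).
    T aggregate = in[tile_begin];
    for (std::size_t i = tile_begin + 1; i < tile_end; ++i)
    {
      aggregate = scan_op(aggregate, in[i]);
    }

    // Exclusive scan of the tile, seeded with the carried prefix.
    T running = exclusive_prefix;
    for (std::size_t i = tile_begin; i < tile_end; ++i)
    {
      out[i]  = running;
      running = scan_op(running, in[i]);
    }

    // Carry for the next tile. Using `aggregate` alone after the first tile
    // would drop init_value from every later tile's output.
    exclusive_prefix = scan_op(exclusive_prefix, aggregate);
  }
  return out;
}

std::vector<int> tiled_exclusive_scan_example()
{
  // Tiles of size 2 over {1, 2, 3, 4, 5} with init_value 10.
  return tiled_exclusive_scan(std::vector<int>{1, 2, 3, 4, 5}, 10, std::plus<int>{}, 2);
  // {10, 11, 13, 16, 20}
}
```

With the carry computed from the aggregate alone, the second tile in this example would start at 3 instead of 13.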
Instead, start with a default single-setting policy.
@oleksandr-pavlyk oleksandr-pavlyk enabled auto-merge (squash) November 17, 2025 19:07
@github-actions
Contributor

🥳 CI Workflow Results

🟩 Finished in 4h 35m: Pass: 100%/81 | Total: 3d 06h | Max: 4h 35m | Hits: 79%/75140

See results here.

@oleksandr-pavlyk oleksandr-pavlyk merged commit 5d85f26 into NVIDIA:main Nov 17, 2025
93 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Nov 17, 2025
@oleksandr-pavlyk oleksandr-pavlyk deleted the segmented_scan_mvp branch November 17, 2025 23:49
@oleksandr-pavlyk oleksandr-pavlyk mentioned this pull request Nov 20, 2025
2 tasks

Labels

cub For all items related to CUB

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

MVP segmented scan implementation

5 participants