Description
Is this a duplicate?
- I confirmed there appear to be no duplicate issues for this bug and that I agree to the Code of Conduct
Type of Bug
Compile-time Error
Component
CUB
Describe the bug
For this test:
#include <cstdint>
#include <execution>
#include <numeric>
#include <vector>

int32_t main()
{
    std::vector<double> in(1000);
    std::vector<double> out(1000);
    // Compiling this parallel scan is enough to trigger the ptxas error.
    auto orr = std::inclusive_scan(std::execution::par, in.begin(), in.end(), out.begin());
    return 0;
}
We’re seeing a ptxas register mismatch error when compiling for sm_120 (Blackwell) that didn’t occur on earlier architectures (e.g. sm_80).
This appears to be related to changes introduced in commit 94bd6e4, specifically the new get_device_scan_launch_bounds() function and its warpspeed path.
ptxas error : Entry function '_ZN3cub22_V_300400_SM_120_NVHPC6detail4scan16DeviceScanKernelINS2_10policy_hubIdddmN4cuda3std3__44plusIvEEE10Policy1000EPdSC_NS0_13ScanTileStateIdLb1EEES9_NS0_8NullTypeEmdLb0ESF_EEvT0_T1_NS2_23tile_state_kernel_arg_tIT2_T6_EEiT3_T4_T5_i' with max regcount of 168 calls function '_ZN3cub22_V_300400_SM_120_NVHPC6detail4scan10kernelBodyINS2_10policy_hubIdddmN4cuda3std3__44plusIvEEE10Policy100015WarpspeedPolicyEdddS9_NS0_8NullTypeELb0EEEvNS1_9warpspeed5SquadENSE_16SpecialRegistersERKNS2_16scanKernelParamsIT0_T1_T2_EET3_T4_' with regcount of 254
The new get_device_scan_launch_bounds() call yields 352 threads per block on sm_120, which caps the register count at 168 per thread. The warpspeed path leads to a different policy selection and a higher thread count on sm_120. That lowers the registers available per thread below the 254 that the called kernelBody function requires, which triggers the error.
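To make the 352-thread / 168-register relationship concrete, here is a rough sketch of how a per-thread register cap could follow from the block size. The constants are assumptions, not values taken from CUB or from ptxas: a 65536-entry register file per SM, warps allocated in groups of 4, registers rounded down to a multiple of 8 per thread, and a hard limit of 255. The function name max_regs_per_thread is hypothetical. Under these assumptions the arithmetic reproduces the 168 reported in the error, but the actual rule ptxas applies may differ.

```cpp
#include <cassert>

// All constants below are ASSUMPTIONS for illustration, not taken from CUB or ptxas.
constexpr int kRegistersPerSM        = 65536; // assumed register file size per SM
constexpr int kWarpSize              = 32;
constexpr int kWarpAllocGranularity  = 4;     // assumed: warps are allocated in groups of 4
constexpr int kRegAllocUnitPerThread = 8;     // assumed: per-thread count rounds down to a multiple of 8
constexpr int kMaxRegsPerThread      = 255;   // architectural per-thread limit

// Hypothetical helper: estimates the per-thread register cap implied by a
// launch bound of `threads_per_block` threads with one resident block per SM.
inline int max_regs_per_thread(int threads_per_block)
{
    assert(threads_per_block > 0);
    // Round the block up to whole warps, then up to the warp allocation granularity.
    const int warps = (threads_per_block + kWarpSize - 1) / kWarpSize;
    const int alloc_warps =
        (warps + kWarpAllocGranularity - 1) / kWarpAllocGranularity * kWarpAllocGranularity;
    // Split the register file evenly across the allocated threads.
    int regs = kRegistersPerSM / (alloc_warps * kWarpSize);
    // Round down to the allocation unit and clamp to the hard limit.
    regs = regs / kRegAllocUnitPerThread * kRegAllocUnitPerThread;
    return regs < kMaxRegsPerThread ? regs : kMaxRegsPerThread;
}
```

With these assumed constants, max_regs_per_thread(352) evaluates to 168 (11 warps rounded up to 12, 65536 / 384 = 170, rounded down to 168), which is below the 254 registers the callee needs, while a 256-thread block would be clamped at 255.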
I found that a similar issue was previously discussed and addressed here: #902.
How to Reproduce
nvc++ -stdpar -fast --c++17 scan.cpp
Expected behavior
The test should compile without the ptxas error once the issue is fixed.
Reproduction link
No response
Operating System
No response
nvidia-smi output
No response
NVCC version
No response