Releases: ROCm/rocPRIM
rocPRIM 4.2.0 for ROCm 7.2.0
Added
- Added missing benchmarks, such that every autotuned specialization is now benchmarked.
- Added a new CMake option, `BENCHMARK_USE_AMDSMI`. It is set to `OFF` by default. When this option is set to `ON`, it lets benchmarks use AMD SMI to output more GPU statistics.
- Added the first tested example program for `device_search`, which is linked in the documentation.
- Added `apply_config_improvements.py`, which generates improved configs by taking the best specializations from old and new configs.
  - Run the script with `--help` for usage instructions, and see `projects/rocprim/docs/concepts/tuning.rst` for documentation.
- Kernel Tuner proof-of-concept.
- Enhanced SPIR-V support and performance.
Optimizations
- Improved performance of the `device_radix_sort` onesweep variant.
Resolved issues
- Fixed the issue where `rocprim::device_scan_by_key` failed when performing an "in-place" inclusive scan by reusing "keys" as output, by adding a buffer to store the last keys of each block (excluding the last block). This fix only affects the specific case of reusing "keys" as output in an inclusive scan, and does not affect other cases.
- Fixed benchmark build error on Windows.
- Fixed offload compress build option.
- Fixed `float_bit_mask` for `rocprim::half`.
- Fixed handling of undefined behaviour when `__builtin_clz`, `__builtin_ctz`, and similar builtins are called.
- Fixed potential build error with `rocprim::detail::histogram_impl`.
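For context on the `__builtin_clz` and `__builtin_ctz` entry: in GCC and Clang these builtins have an undefined result when the argument is zero. The following is a minimal host-only sketch of one common way to guard that case. The helper names `safe_clz` and `safe_ctz` are hypothetical and are not rocPRIM functions, and returning the full bit width for zero is an assumption made for illustration, not a description of the actual fix.

```cpp
#include <cassert>
#include <climits>

// Hypothetical helpers, not part of rocPRIM. They only illustrate guarding
// the zero input, for which the GCC/Clang builtins have an undefined result.
constexpr int bit_width_of_unsigned = static_cast<int>(sizeof(unsigned int) * CHAR_BIT);

// Count leading zeros; returns the full bit width for x == 0 (assumed convention).
inline int safe_clz(unsigned int x)
{
    return x == 0 ? bit_width_of_unsigned : __builtin_clz(x);
}

// Count trailing zeros; returns the full bit width for x == 0 (assumed convention).
inline int safe_ctz(unsigned int x)
{
    return x == 0 ? bit_width_of_unsigned : __builtin_ctz(x);
}
```

The branch keeps the builtin from ever seeing a zero argument, so the result no longer depends on the target or the optimization level.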
Known issues
- Potential hang with `rocprim::partition_threeway` with large input data sizes on later ROCm builds. A workaround is currently in place.
rocPRIM 4.1.0 for ROCm 7.1.1
rocPRIM code for ROCm 7.1.1 did not change. The library was rebuilt for the updated ROCm 7.1.1 stack.
rocPRIM 4.1.0 for ROCm 7.1.0
Added
- Added `get_sreg_lanemask_lt`, `get_sreg_lanemask_le`, `get_sreg_lanemask_gt` and `get_sreg_lanemask_ge`.
- Added `rocprim::transform_output_iterator` and `rocprim::make_transform_output_iterator`.
- Added experimental support for SPIR-V, to use the correct tuned config for part of the applicable algorithms.
- Added a new CMake option, `BUILD_OFFLOAD_COMPRESS`. When rocPRIM is built with this option enabled, the `--offload-compress` switch is passed to the compiler. This causes the compiler to compress the binary that it generates. Compression can be useful when you are compiling for a large number of targets, since this often results in a large binary. Without compression, the generated binary may in some cases become so large that symbols are placed out of range, resulting in linking errors. The new `BUILD_OFFLOAD_COMPRESS` option is set to `ON` by default.
- Added a new CMake option, `-DUSE_SYSTEM_LIB`, to allow tests to be built from `ROCm` libraries provided by the system.
- Added `rocprim::apply`, which applies a function to a `rocprim::tuple`.
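As an illustration of what applying a function to a tuple means, the sketch below uses the standard library's `std::apply` with `std::tuple` as a host-only stand-in. The exact signature of `rocprim::apply` is not given in these notes; it is assumed here to be analogous. The function name `sum_of_three` is made up for the example.

```cpp
#include <cassert>
#include <tuple>

// Host-only stand-in: std::apply unpacks a std::tuple into the arguments of
// a callable. rocprim::apply is assumed (not confirmed by these notes) to
// behave the same way for rocprim::tuple.
inline int sum_of_three(const std::tuple<int, int, int>& values)
{
    return std::apply([](int a, int b, int c) { return a + b + c; }, values);
}
```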
Changed
- Changed tests to support `ptr-to-const` output in `/test/rocprim/test_device_batch_memcpy.cpp`.
Optimizations
- Improved performance of many algorithms, by updating their tuned configs.
- 891 specializations have been improved.
- 399 specializations have been added.
Upcoming changes
- Deprecated the `->` operator for the `zip_iterator`.
Resolved issues
- Fixed `device_select`, `device_merge`, and `device_merge_sort` not allocating the correct amount of virtual shared memory on the host.
- Fixed the `->` operator for the `transform_iterator`, the `texture_cache_iterator` and the `arg_index_iterator`, by now returning a proxy pointer.
  - The `arg_index_iterator` also now only returns the internal iterator for the `->`.
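The proxy pointer mentioned for the `->` operator addresses a general problem: an iterator that computes its value on the fly has no stored object whose address it could return. The sketch below shows the usual pattern in plain C++. All names (`arrow_proxy`, `point`, `generated_point_iterator`) are hypothetical, and this is not rocPRIM's implementation.

```cpp
#include <cassert>
#include <utility>

// Hypothetical names throughout; this is not rocPRIM's implementation.
// The proxy owns a copy of a computed value so that operator-> has an
// object to point at for the duration of the full expression.
template <class T>
class arrow_proxy
{
public:
    explicit arrow_proxy(T value) : value_(std::move(value)) {}
    T* operator->() { return &value_; }

private:
    T value_;
};

struct point
{
    int x;
    int y;
};

// A toy iterator that produces values on the fly (a point built from an
// index), so there is no stored object whose address could be returned.
class generated_point_iterator
{
public:
    explicit generated_point_iterator(int index) : index_(index) {}

    point operator*() const { return point{index_, 2 * index_}; }

    // Return the proxy by value; the language then applies -> to the proxy.
    arrow_proxy<point> operator->() const { return arrow_proxy<point>(**this); }

private:
    int index_;
};
```

With this pattern, `it->y` reads a member of a temporary that stays alive until the end of the full expression, instead of dereferencing a dangling pointer.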
rocPRIM 4.0.1 for ROCm 7.0.2
rocPRIM code for ROCm 7.0.2 did not change. The library was rebuilt for the updated ROCm 7.0.2 stack.
rocPRIM 4.0.0 for ROCm 7.0.1
rocPRIM code for ROCm 7.0.1 did not change. The library was rebuilt for the updated ROCm 7.0.1 stack.
rocPRIM 4.0.0 for ROCm 7.0.0
Added
- Added `rocprim::accumulator_t` to ensure parity with CCCL.
- Added test for `rocprim::accumulator_t`.
- Added `rocprim::invoke_result_r` to ensure parity with CCCL.
- Added function `is_build_in` into `rocprim::traits::get`.
- Added virtual shared memory as a fallback option in `rocprim::device_merge` when it exceeds shared memory capacity, similar to `rocprim::device_select`, `rocprim::device_partition`, and `rocprim::device_merge_sort`, which already include this feature.
- Added initial value support to device level inclusive scans.
- Added new optimization to the backend for `device_transform` when the input and output are pointers.
- Added `LoadType` to `transform_config`, which is used for the `device_transform` when the input and output are pointers.
- Added a `rocprim::device_transform` API for n-ary transform operations, which takes `n` input iterators inside a `rocprim::tuple`.
- Added gfx950 support.
- Added `rocprim::key_value_pair::operator==`.
- Added the `rocprim::unrolled_copy` thread function to copy multiple items inside a thread.
- Added the `rocprim::unrolled_thread_load` function to load multiple items inside a thread using `rocprim::thread_load`.
- Added `rocprim::int128_t` and `rocprim::uint128_t` to benchmarks for improved performance evaluation on 128-bit integers.
- Added `rocprim::int128_t` to the supported autotuning types to improve performance for 128-bit integers.
- Added the `rocprim::merge_inplace` function for merging in-place.
- Added initial value support for warp- and block-level inclusive scan.
- Added support for building tests with device-side random data generation, making them finish faster. This requires rocRAND, and is enabled with the `WITH_ROCRAND=ON` build flag.
- Added tests and documentation to `lookback_scan_state`. It is still in the `detail` namespace.
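To illustrate what initial value support means for an inclusive scan, the sketch below uses the C++17 standard library's `std::inclusive_scan` overload that takes an initial value. It runs on the host only and is not the rocPRIM device, warp or block API; the wrapper name `inclusive_scan_with_init` is made up for the example.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Host-only illustration using the C++17 standard library, not rocPRIM:
// an inclusive scan seeded with an initial value folds that value into
// every output element.
inline std::vector<int> inclusive_scan_with_init(const std::vector<int>& input, int init)
{
    std::vector<int> output(input.size());
    std::inclusive_scan(input.begin(), input.end(), output.begin(), std::plus<>{}, init);
    return output;
}
```

For the input `{1, 2, 3}` and an initial value of `10`, the first output element is `10 + 1`, the second `10 + 1 + 2`, and so on.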
Optimizations
- Improved performance of `rocprim::device_select` and `rocprim::device_partition` when using multiple streams on the MI3XX architecture.
Changed
- Renamed the `segmented_radix_sort` parameters `long_radix_bits` and `LongRadixBits` to `radix_bits` and `RadixBits` respectively.
- Marked the initialisation constructor of `rocprim::reverse_iterator<Iter>` `explicit`; use `rocprim::make_reverse_iterator`.
- Merged `radix_key_codec` into the type_traits system.
- Renamed `type_traits_interface.hpp` to `type_traits.hpp`, and renamed the original `type_traits.hpp` to `type_traits_functions.hpp`.
- The default scan accumulator types for device-level scan algorithms have changed. This is a breaking change. The previous default accumulator types could lead to situations in which unexpected overflow occurred, such as when the input or initial type was smaller than the output type.
  - This is a complete list of affected functions and how their default accumulator types are changing:
    - `rocprim::inclusive_scan`
      - Previous default: `class AccType = typename std::iterator_traits<InputIterator>::value_type>`
      - Current default: `class AccType = rocprim::accumulator_t<BinaryFunction, typename std::iterator_traits<InputIterator>::value_type>`
    - `rocprim::deterministic_inclusive_scan`
      - Previous default: `class AccType = typename std::iterator_traits<InputIterator>::value_type>`
      - Current default: `class AccType = rocprim::accumulator_t<BinaryFunction, typename std::iterator_traits<InputIterator>::value_type>`
    - `rocprim::exclusive_scan`
      - Previous default: `class AccType = detail::input_type_t<InitValueType>>`
      - Current default: `class AccType = rocprim::accumulator_t<BinaryFunction, rocprim::detail::input_type_t<InitValueType>>`
    - `rocprim::deterministic_exclusive_scan`
      - Previous default: `class AccType = detail::input_type_t<InitValueType>>`
      - Current default: `class AccType = rocprim::accumulator_t<BinaryFunction, rocprim::detail::input_type_t<InitValueType>>`
- Undeprecated internal `detail::raw_storage`.
- New versions of `rocprim::thread_load` and `rocprim::thread_store` replace the deprecated `rocprim::thread_load` and `rocprim::thread_store` functions. The new versions avoid inline assembly where possible, and don't hinder the optimizer as much as a result.
- Renamed `rocprim::load_cs` to `rocprim::load_nontemporal` and `rocprim::store_cs` to `rocprim::store_nontemporal` to express the intent of these load and store methods better.
- All kernels now have hidden symbol visibility. All symbols now have inline namespaces that include the library version, for example, `rocprim::ROCPRIM_300400_NS::symbol` instead of `rocprim::symbol`, letting the user link multiple libraries built with different versions of rocPRIM.
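The accumulator type change in this list can be illustrated with a host-only sketch. It assumes that the idea behind `rocprim::accumulator_t` is to accumulate in the result type of the binary function rather than in the input type; the alias `accumulator_like_t`, the functor `widening_plus` and the helper `last_of_inclusive_scan` are hypothetical names for this example and are not rocPRIM code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for the idea behind rocprim::accumulator_t: use the
// (decayed) result type of the binary function instead of the input type.
template <class BinaryFunction, class T>
using accumulator_like_t = std::decay_t<std::invoke_result_t<BinaryFunction, T, T>>;

// A binary function whose result type is wider than the 8-bit inputs.
struct widening_plus
{
    std::int32_t operator()(std::int32_t a, std::int32_t b) const { return a + b; }
};

// Sequential inclusive scan that accumulates in AccType and returns the
// last scanned value. Requires a non-empty input.
template <class AccType, class BinaryFunction>
AccType last_of_inclusive_scan(const std::vector<std::uint8_t>& input, BinaryFunction op)
{
    assert(!input.empty());
    AccType acc = static_cast<AccType>(input.front());
    for (std::size_t i = 1; i < input.size(); ++i)
    {
        // Each partial result is narrowed to AccType, as a scan would store it.
        acc = static_cast<AccType>(op(acc, input[i]));
    }
    return acc;
}
```

With the 8-bit inputs `{200, 100}`, accumulating in `std::uint8_t` wraps around to 44, whereas accumulating in `accumulator_like_t<widening_plus, std::uint8_t>` (which is `std::int32_t`) yields 300.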
Upcoming changes
- `rocprim::invoke_result_binary_op` and `rocprim::invoke_result_binary_op_t` are deprecated. Use `rocprim::accumulator_t` instead.
Removed
- Removed `rocprim::detail::float_bit_mask` and related tests. Use `rocprim::traits::float_bit_mask` instead.
- Removed `rocprim::traits::is_fundamental`. Use `rocprim::traits::get<T>::is_fundamental()` directly.
- Removed the deprecated parameters `short_radix_bits` and `ShortRadixBits` from the `segmented_radix_sort` config. They were unused, so this is only an API change.
- Removed the deprecated `operator<<` from the iterators.
- Removed the deprecated `TwiddleIn` and `TwiddleOut`. Use `radix_key_codec` instead.
- Removed the deprecated flags API of `block_adjacent_difference`. Use `subtract_left()` or `block_discontinuity::flag_heads()` instead.
- Removed the deprecated `to_exclusive` functions in the warp scans.
- Removed the `rocprim::load_cs` from the `cache_load_modifier` enum. Use `rocprim::load_nontemporal` instead.
- Removed the `rocprim::store_cs` from the `cache_store_modifier` enum. Use `rocprim::store_nontemporal` instead.
- Removed the deprecated header file `rocprim/detail/match_result_type.hpp`. Include `rocprim/type_traits.hpp` instead.
  - This header included `rocprim::detail::invoke_result`. Use `rocprim::invoke_result` instead.
  - This header included `rocprim::detail::invoke_result_binary_op`. Use `rocprim::invoke_result_binary_op` instead.
  - This header included `rocprim::detail::match_result_type`. Use `rocprim::invoke_result_binary_op_t` instead.
- Removed the deprecated `rocprim::detail::radix_key_codec` function. Use `rocprim::radix_key_codec` instead.
- Removed `rocprim/detail/radix_sort.hpp`. Its functionality can now be found in `rocprim/thread/radix_key_codec.hpp`.
- Removed C++14 support. Only C++17 is supported.
- Due to the removal of `__AMDGCN_WAVEFRONT_SIZE` in the compiler, the following deprecated warp size-related symbols have been removed:
  - `rocprim::device_warp_size()`
    - For compile-time constants, this is replaced with `rocprim::arch::wavefront::min_size()` and `rocprim::arch::wavefront::max_size()`. Use this when allocating global or shared memory.
    - For run-time constants, this is replaced with `rocprim::arch::wavefront::size()`.
  - `rocprim::warp_size()`
    - Use `rocprim::host_warp_size()`, `rocprim::arch::wavefront::min_size()` or `rocprim::arch::wavefront::max_size()` instead.
  - `ROCPRIM_WAVEFRONT_SIZE`
    - Use `rocprim::arch::wavefront::min_size()` or `rocprim::arch::wavefront::max_size()` instead.
  - `__AMDGCN_WAVEFRONT_SIZE`
    - This was a fallback define for the compiler's removed symbol, having the same name.
- This release removes support for custom builds on gfx940 and gfx941.
Resolved issues
- Fixed an issue where `device_batch_memcpy` reported benchmarking throughput being 2x lower than it was in reality.
- Fixed an issue where `device_segmented_reduce` reported autotuning throughput being 5x lower than it was in reality.
- Fixed device radix sort not returning the correct required temporary storage when a double buffer contains `nullptr`.
- Fixed constness of equality operators (`==` and `!=`) in `rocprim::key_value_pair`.
- Fixed an issue for the comparison operators in `arg_index_iterator` and `texture_cache_iterator`, where `<` and `>` comparators were swapped.
- Fixed an issue where `rocprim::thread_reduce` did not work correctly with a prefix value.
Known issues
- When using `rocprim::deterministic_inclusive_scan_by_key` and `rocprim::deterministic_exclusive_scan_by_key`, the intermediate values can change order on Navi3x.
  - However, if a commutative scan operator is used, then the final scan value (output array) will still always be consistent between runs.
rocPRIM 3.4.1 for ROCm 6.4.4
rocPRIM code for ROCm 6.4.4 did not change. The library was rebuilt for the updated ROCm 6.4.4 stack.
rocPRIM 3.4.1 for ROCm 6.4.3
rocPRIM code for ROCm 6.4.3 did not change. The library was rebuilt for the updated ROCm 6.4.3 stack.
rocPRIM 3.4.1 for ROCm 6.4.2
Upcoming changes
- Changes to the template parameters of warp and block algorithms will be made in an upcoming release.
Deprecations
- Due to an upcoming compiler change, the following warp size-related symbols will be removed in the next major release and are thus marked as deprecated:
  - `rocprim::device_warp_size()`
    - For compile-time constants, this is replaced with `rocprim::arch::wavefront::min_size()` and `rocprim::arch::wavefront::max_size()`. Use this when allocating global or shared memory.
    - For run-time constants, this is replaced with `rocprim::arch::wavefront::size()`.
  - `rocprim::warp_size()`
  - `ROCPRIM_WAVEFRONT_SIZE`
rocPRIM 3.4.0 for ROCm 6.4.1
rocPRIM code for ROCm 6.4.1 did not change. The library was rebuilt for the updated ROCm 6.4.1 stack.