
Commit 833f699

yzhaiustc and yuzhai authored
v3.8.0 update (NVIDIA#2082)
* 3.8 update

* fix Markus' name

---------

Co-authored-by: yuzhai <[email protected]>
1 parent affd1b6 commit 833f699

168 files changed: +24945, -3436 lines


CHANGELOG.md

Lines changed: 5 additions & 3 deletions
@@ -14,7 +14,7 @@
   - [Pipelines that implement Blackwell specific synchronization](./include/cutlass/pipeline/sm100_pipeline.hpp).
   - [Cluster launch control API supporting preferred and fallback cluster shapes](./include/cutlass/cluster_launch.hpp).
   - Data types including NVFP4, MXFP4, MXFP6, and MXFP8 and all their supported element and scale factor types.
-  - Tile schedulers using [Blackwell's Cluster Launch Control (CLC) feature](./cutlass/media/docs/blackwell_cluster_launch_control.md) to implement dynamic persistence scheduling for [GEMMs](./include/cutlass/gemm/kernel/sm100_tile_scheduler.hpp), and [stream-K](./include/cutlass/gemm/kernel/sm100_tile_scheduler_stream_k.hpp).
+  - Tile schedulers using [Blackwell's Cluster Launch Control (CLC) feature](./media/docs/blackwell_cluster_launch_control.md) to implement dynamic persistence scheduling for [GEMMs](./include/cutlass/gemm/kernel/sm100_tile_scheduler.hpp), and [stream-K](./include/cutlass/gemm/kernel/sm100_tile_scheduler_stream_k.hpp).
   - Extensions to testbeds and reference check code for unit tests and CUTLASS profiler.
 * Full support for Blackwell SM100 kernels in CUTLASS 3.x API:
   - [Blackwell specific kernel layers](./include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp) that
@@ -32,6 +32,7 @@
 * CUTLASS library and profiler integration for block scaled data types for kernel emission, profiling, and verification.
   - Support for preferred and fallback cluster shapes via profiler command line arguments parsing to set dynamic cluster shapes.
   - Support for dynamic datatypes by parsing profiler via profiler command line arguments parsing to set dynamic datatype setting in TCGen05 MMA instruction descriptors.
+* New CUTLASS profiler flag `use-cuda-graphs` to reduce overheads when benchmarking launch-bound kernels.
 * Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM100 architecture:
   - [Basic FP16 and FP8 GEMMs with minimal changes from Hopper examples](./examples/70_blackwell_gemm/), demonstrating ease of migration for off the shelf kernels using the 3.x collective builder API.
   - GEMM with [opt-in collective builder schedules showcasing available recipes](./examples/71_blackwell_gemm_with_collective_builder/71_blackwell_gemm_with_collective_builder.cu) for Blackwell.
@@ -46,14 +47,15 @@
   - [Fused multi-head attention fprop kernel](./examples/77_blackwell_fmha/77_blackwell_fmha.cu) supporting fp16/bf16/fp8 data types across head dims of 32,64, and 128.
 * Documentation updates:
   - [Quickstart - instantiating a Blackwell block-scaled GEMM](./media/docs/quickstart.md#instantiating-a-blackwell-gemm-kernel).
-  - Detailed [Blackwell block-scaled GEMM functionality documentation](./media/docs/narrow_and_mixed_precision_gemms.md)
+  - Detailed [Blackwell block-scaled GEMM functionality documentation](./media/docs/blackwell_functionality.md)
   - A new [functionality documentation](./media/docs/functionality.md) specifically for 3.x API comprehensively documenting all supported kernel types, data types, kernel features, minimum CUDA tookit support etc for 3.x supported architectures.
   - Updates to [compatibility](./README.md#compatibility) section regarding supported compilers, operating systems, CUDA Toolkits, Hardware Architectures, and [Target Architecture](./README.md#Target-Architecture).
+  - Support grouped GEMM in the CUTLASS profiler (`./cutlass_profiler --operation=GroupedGemm --help` for details).
 
 ## [3.7.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.7.0) (2025-01-11)
 - [Hopper blockwise scaling FP8 GEMM](./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) uses 2D scaling tensor, assigning one value per threadblock. This allows a finer-grained scaling to be applied for each output tile per gemm-k iteration. The operands and scaling tensors are loaded from global memory to shared memory using TMA and cp_async, respectively. The scaling is applied inside the mainloop. Details with figures are [here](https://github.com/NVIDIA/cutlass/pull/1932#issue-2645398439).
 - [Distributed GEMM](./examples/65_distributed_gemm/65_distributed_gemm.cu) is a new (experimental) API which can turn existing CUTLASS GEMM kernels into pipelined Tensor Parallel GEMMs that run efficiently on NVLink-based network of GPUs. Its pipelining schedules can hide most of the communication behind computation, and relies on point-to-point communication, which can simply use CUDA runtime's peer device access feature. It also utilizes remote TMA loads and memcopies with CUDA graphs to handle communication primarily through the Copy Engine, leaving all SMs free for Hopper's persistent kernels. For more details you can refer to the [DistGEMM blog post](https://blog.shi-labs.com/distributed-gemm-88be6a481e2b).
-- Improved persistent grid launch for Hopper kernels with large cluster sizes (>= size of 4) using the new `make_kernel_hardware_info` API as shown in [example 48](./examples/48_hopper_warp_specialized_gemm/48_hopper_warp_specialized_gemm.cu).
+- Improved persistent grid launch for Hopper kernels with large cluster sizes (>= size of 4) using the new `make_kernel_hardware_info` API as shown in [example 48](./examples/48_hopper_warp_specialized_gemm/48_hopper_warp_specialized_gemm.cu).
 - Enabled high precision accumulation for Hopper FP8 Sparse GEMM.
 - Potential API breaking changes:
   + Fix `cute::UniversalCopy` for type safety.

CMakeLists.txt

Lines changed: 95 additions & 6 deletions
@@ -114,6 +114,13 @@ set(CUTLASS_TEST_LEVEL "0" CACHE STRING "Level of tests to compile.")
 find_package(Python3 3.5 COMPONENTS Interpreter REQUIRED)
 
 ################################################################################
+
+
+include(customConfigs.cmake)
+
+################################################################################
+
+
 set(CUTLASS_ENABLE_HEADERS_ONLY OFF CACHE BOOL "Enable only the header library")
 
 if(CUTLASS_ENABLE_HEADERS_ONLY)
@@ -395,12 +402,6 @@ endif()
 #
 ###################################################################################################
 
-if (CUDA_VERSION VERSION_GREATER_EQUAL 12.8)
-  list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUDA_BLACKWELL_TMA_SWIZZLE_ENABLED=1)
-
-  list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUDA_ENABLE_PREFERRED_CLUSTER=1)
-endif()
-
 
 
 # Warnings-as-error exceptions and warning suppressions for Clang builds
@@ -978,6 +979,94 @@ function(cutlass_add_executable_tests NAME TARGET)
 
 endfunction()
 
+
+
+function(cutlass_generate_profiler_tests NAME)
+
+  set(options)
+  set(oneValueArgs)
+  set(multiValueArgs DEPENDS DEPENDEES CUTLASS_PROFILER_EXTRA_OPTIONS)
+  cmake_parse_arguments(_ "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
+
+  if (NOT CUTLASS_BUILD_FOR_PROFILER_REGRESSIONS AND NOT CUTLASS_BUILD_FOR_PROFILER_PERFORMANCE_REGRESSIONS)
+    return()
+  endif()
+
+  install(
+    FILES ${CUTLASS_PROFILER_REGRESSION_LIST_FILE}
+    DESTINATION ${CMAKE_INSTALL_INFODIR}/cutlass/
+    RENAME profiler_regressions.csv
+  )
+
+  # Generate cmake test targets for each entry in the testlist csv
+
+  if (NOT EXISTS "${CUTLASS_PROFILER_REGRESSION_LIST_FILE}")
+    message(SEND_ERROR "Profiler unit tests list path is invalid: CUTLASS_PROFILER_REGRESSION_LIST_FILE = ${CUTLASS_PROFILER_REGRESSION_LIST_FILE}")
+  else()
+    message(STATUS "Using ${CUTLASS_PROFILER_REGRESSION_LIST_FILE} to generate profiler-based tests.")
+  endif()
+
+  file(STRINGS ${CUTLASS_PROFILER_REGRESSION_LIST_FILE} TEST_LIST)
+
+  foreach(TEST IN LISTS TEST_LIST)
+
+    if ("${TEST}" MATCHES " *cutlass_profiler.*")
+
+      # Generate a flattened name for the test from the test command line.
+      string(REPLACE "," ";" TEST_NAME_LIST ${TEST})
+      list(GET TEST_NAME_LIST 0 TEST)
+      string(REGEX MATCHALL "[a-zA-Z0-9_=]+" TEST_NAME "${TEST}")
+      list(FILTER TEST_NAME EXCLUDE REGEX "cutlass_profiler|mode=trace|providers=cutlass")
+      list(JOIN TEST_NAME "_" TEST_NAME)
+      string(REGEX REPLACE "_verification_required=(true|false)" "" TEST_NAME "${TEST_NAME}")
+      string(REPLACE "_verification_providers=device" "" TEST_NAME "${TEST_NAME}")
+      string(REPLACE "batch_count=" "batch" TEST_NAME "${TEST_NAME}")
+      string(REPLACE "cluster_m=" "" TEST_NAME "${TEST_NAME}")
+      string(REPLACE "_cluster_n=" "x" TEST_NAME "${TEST_NAME}")
+      string(REGEX REPLACE "_cluster_k=[0-9]+" "" TEST_NAME "${TEST_NAME}")
+      string(REPLACE "cluster_m_fallback=" "" TEST_NAME "${TEST_NAME}")
+      string(REPLACE "_cluster_n_fallback=" "x" TEST_NAME "${TEST_NAME}")
+      string(REGEX REPLACE "_cluster_k_fallback=[0-9]+" "" TEST_NAME "${TEST_NAME}")
+      string(REPLACE "runtime_input_datatype_a=" "" TEST_NAME "${TEST_NAME}")
+      string(REPLACE "runtime_input_datatype_b=" "" TEST_NAME "${TEST_NAME}")
+      string(REPLACE "=" "" TEST_NAME "${TEST_NAME}")
+      string(REPLACE "_error_on_no_match" "" TEST_NAME "${TEST_NAME}")
+      string(REPLACE "_error_if_nothing_is_profiled" "" TEST_NAME "${TEST_NAME}")
+      string(REPLACE "kernels" "" TEST_NAME "${TEST_NAME}")
+      string(REPLACE "operation" "" TEST_NAME "${TEST_NAME}")
+
+      if (__DO_NOT_LOWERCASE_TEST_NAME)
+        string(TEST_NAME_LOWER "${TEST_NAME}")
+      else()
+        string(TOLOWER "${TEST_NAME}" TEST_NAME_LOWER)
+      endif()
+
+      # Munge the test command
+      string(REPLACE "cutlass_profiler" "" TEST "${TEST}")
+      set(TEST "${TEST}" ${__CUTLASS_PROFILER_EXTRA_OPTIONS} "--junit-output=${TEST_NAME_LOWER}")
+      set(TEST_COMMAND_${TEST_NAME_LOWER} "${TEST}")
+      list(APPEND TEST_COMMAND_VARS ${TEST_NAME_LOWER})
+
+    endif()
+
+  endforeach()
+
+  cutlass_add_executable_tests(
+    ${NAME} cutlass_profiler
+    DEPENDS ${__DEPENDS}
+    DEPENDEES ${__DEPENDEES}
+    TEST_COMMAND_OPTIONS ${TEST_COMMAND_VARS}
+    TEST_COMMAND_OPTIONS_PREFIX TEST_COMMAND_
+    DISABLE_EXECUTABLE_INSTALL_RULE
+    # Uncomment the following line when alloc/dealloc tracking
+    # is fixed for all configurations.
+    # TEST_SETS_SUPPORTED tmem_alloc_tracking
+  )
+
+endfunction()
+
+
+
 if (CUTLASS_ENABLE_TOOLS)
   add_subdirectory(tools)
   if (CUTLASS_ENABLE_PROFILER)
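
The new cutlass_generate_profiler_tests function reads each `cutlass_profiler ...` entry from the regression CSV referenced by CUTLASS_PROFILER_REGRESSION_LIST_FILE, flattens that entry's options into a test name, and registers the resulting command through cutlass_add_executable_tests. A minimal sketch of how it might be invoked from a profiler CMake file; the target name below is an illustrative assumption, not part of this commit:

# Hypothetical wiring; assumes customConfigs.cmake has set
# CUTLASS_PROFILER_REGRESSION_LIST_FILE and CUTLASS_BUILD_FOR_PROFILER_REGRESSIONS is ON
# (otherwise the function returns without generating any tests).
cutlass_generate_profiler_tests(
  cutlass_profiler_functional_regressions  # NAME: forwarded to cutlass_add_executable_tests
  DEPENDS cutlass_profiler                 # generated test commands run the cutlass_profiler binary
)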

README.md

Lines changed: 3 additions & 3 deletions
@@ -87,11 +87,11 @@ For a background on Blackwell's new features, please consult the PTX documentati
   - [Fused multi-head attention fprop kernel](./examples/77_blackwell_fmha/77_blackwell_fmha.cu) supporting fp16/bf16/fp8 data types across head dims of 32,64, and 128.
 * Documentation updates:
   - [Quickstart - instantiating a Blackwell block-scaled GEMM](./media/docs/quickstart.md#instantiating-a-blackwell-gemm-kernel).
-  - Detailed [Blackwell block-scaled GEMM functionality documentation](./media/docs/narrow_and_mixed_precision_gemms.md)
+  - Detailed [Blackwell block-scaled GEMM functionality documentation](./media/docs/blackwell_functionality.md)
   - A new [functionality documentation](./media/docs/functionality.md) specifically for 3.x API comprehensively documenting all supported kernel types, data types, kernel features, minimum CUDA tookit support etc for 3.x supported architectures.
   - Updates to [compatibility](./README.md#compatibility) section regarding supported compilers, operating systems, CUDA Toolkits, Hardware Architectures, and [Target Architecture](./README.md#Target-Architecture).
 
-Note: CUTLASS 3.x builds are known to be broken on Windows platforms for all CUDA toolkits.
+Note: CUTLASS 3.x builds are known to be down on Windows platforms for all CUDA toolkits.
 CUTLASS team is working on a fix.
 
 **See the [CHANGELOG](CHANGELOG.md) for details of all past releases and updates.**
@@ -162,7 +162,7 @@ We have tested the following environments.
 
 Note: GCC 8.5.0 has known regressions regarding fold expressions and overloaded operators. Using GCC 7.5.0 or (preferred) GCC >= 9 is recommended.
 
-Note: CUTLASS 3.x builds are known to be broken on Windows platforms for all CUDA toolkits.
+Note: CUTLASS 3.x builds are known to be down on Windows platforms for all CUDA toolkits.
 CUTLASS team is working on a fix.
 
 ## Hardware

customConfigs.cmake

Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
+# Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+
+
+
+
+# Profiler based functional testing
+set(CUTLASS_BUILD_FOR_PROFILER_REGRESSIONS OFF CACHE BOOL "Utilize profiler-based functional regressions")
+set(CUTLASS_PROFILER_REGRESSION_TEST_LEVEL ${CUTLASS_TEST_LEVEL} CACHE STRING "Profiler functional regression test level")
+
+find_package(Python3 3.5 COMPONENTS Interpreter REQUIRED)
+
+function(cutlass_generate_kernel_filter_and_testlists_files)
+
+  set(options)
+  set(oneValueArgs TEST_SET_NAME)
+  set(multiValueArgs)
+  cmake_parse_arguments(_ "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
+
+  execute_process(
+    COMMAND ${CMAKE_COMMAND} -E env PYTHONPATH=${CUTLASS_LIBRARY_PACKAGE_DIR}
+      ${Python3_EXECUTABLE} ${CUTLASS_SOURCE_DIR}/python/cutlass_library/generator.py
+      --generator-target=${__TEST_SET_NAME}
+      --cuda-version=${CUTLASS_GENERATOR_CUDA_COMPILER_VERSION}
+      --architectures=${CUTLASS_NVCC_ARCHS}
+      --kernels=\*
+      --disable-cutlass-package-imports
+    WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}
+    RESULT_VARIABLE cutlass_FILTER_GENERATION_RESULT
+    OUTPUT_VARIABLE cutlass_FILTER_GENERATION_OUTPUT
+    OUTPUT_FILE ${CMAKE_CURRENT_BINARY_DIR}/library_filter_generation.log
+    ERROR_FILE ${CMAKE_CURRENT_BINARY_DIR}/library_filter_generation.log
+  )
+
+  if(NOT cutlass_FILTER_GENERATION_RESULT EQUAL 0)
+    message(FATAL_ERROR "Error generating kernel filters and testlists files. See ${CMAKE_CURRENT_BINARY_DIR}/library_filter_generation.log")
+  endif()
+endfunction()
+
+if(CUTLASS_BUILD_FOR_PROFILER_REGRESSIONS)
+
+  set(PROFILER_ARCH_LIST 100a)
+  foreach(ARCH IN LISTS CUTLASS_NVCC_ARCHS)
+    if(NOT (ARCH IN_LIST PROFILER_ARCH_LIST))
+      message(FATAL_ERROR "Only SM100a compute capability is supported with profiler-based unit tests")
+    endif()
+  endforeach()
+
+  if(CUTLASS_PROFILER_REGRESSION_TEST_LEVEL EQUAL 0)
+
+    message(STATUS "Building for L0 profiler-based functional regressions")
+    cutlass_generate_kernel_filter_and_testlists_files(TEST_SET_NAME kernel_testlist_l0)
+    set(KERNEL_FILTER_FILE ${CMAKE_CURRENT_BINARY_DIR}/FK_functional_L0_testlist_SM${CUTLASS_NVCC_ARCHS}_cutlass3x_gemm_kernel_filter.list CACHE STRING "Kernel set")
+    set(CUTLASS_PROFILER_REGRESSION_LIST_FILE ${CMAKE_CURRENT_BINARY_DIR}/FK_functional_L0_testlist_SM${CUTLASS_NVCC_ARCHS}_cutlass3x_gemm.csv CACHE STRING "Regression set")
+
+  elseif (CUTLASS_PROFILER_REGRESSION_TEST_LEVEL EQUAL 1)
+
+    message(STATUS "Building for L1 profiler-based functional regressions")
+    cutlass_generate_kernel_filter_and_testlists_files(TEST_SET_NAME kernel_testlist_l1)
+    set(KERNEL_FILTER_FILE ${CMAKE_CURRENT_BINARY_DIR}/FK_functional_L1_testlist_SM${CUTLASS_NVCC_ARCHS}_cutlass3x_gemm_kernel_filter.list CACHE STRING "Kernel set")
+    set(CUTLASS_PROFILER_REGRESSION_LIST_FILE ${CMAKE_CURRENT_BINARY_DIR}/FK_functional_L1_testlist_SM${CUTLASS_NVCC_ARCHS}_cutlass3x_gemm.csv CACHE STRING "Regression set")
+
+  endif()
+endif()
+
+
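
customConfigs.cmake gates profiler-based regression builds on SM100a and, for test level 0 or 1, runs cutlass_library/generator.py to produce a kernel filter list and a regression CSV. A rough sketch of a configuration that exercises the L0 path; the cache values below are illustrative assumptions (the file itself defaults CUTLASS_BUILD_FOR_PROFILER_REGRESSIONS to OFF):

# Hypothetical cache settings, e.g. supplied via -D options at configure time.
set(CUTLASS_NVCC_ARCHS 100a CACHE STRING "Target architectures")                       # only 100a passes the arch check above
set(CUTLASS_BUILD_FOR_PROFILER_REGRESSIONS ON CACHE BOOL "Profiler-based regressions")
set(CUTLASS_PROFILER_REGRESSION_TEST_LEVEL 0 CACHE STRING "L0 functional regressions")

# With these values, include(customConfigs.cmake) from the top-level CMakeLists.txt
# invokes generator.py with --generator-target=kernel_testlist_l0 and sets, roughly:
#   KERNEL_FILTER_FILE                    = ${CMAKE_CURRENT_BINARY_DIR}/FK_functional_L0_testlist_SM100a_cutlass3x_gemm_kernel_filter.list
#   CUTLASS_PROFILER_REGRESSION_LIST_FILE = ${CMAKE_CURRENT_BINARY_DIR}/FK_functional_L0_testlist_SM100a_cutlass3x_gemm.csv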

examples/48_hopper_warp_specialized_gemm/48_hopper_warp_specialized_gemm.cu

Lines changed: 2 additions & 7 deletions
@@ -483,18 +483,13 @@ int main(int argc, char const **args) {
   CUDA_CHECK(cudaGetDevice(&current_device_id));
   CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
   cudaError_t error = cudaGetDeviceProperties(&props, 0);
-  if (props.major < 9) {
-    std::cerr
-      << "This example requires a GPU of NVIDIA's Hopper Architecture or "
-      << "later (compute capability 90 or greater).\n";
-    return 0;
-  }
-
   if (props.major != 9 || props.minor != 0) {
     std::cerr
       << "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
     return 0;
   }
+
+
 
 
   //

examples/54_hopper_fp8_warp_specialized_gemm/54_hopper_fp8_warp_specialized_gemm.cu

Lines changed: 3 additions & 7 deletions
@@ -566,17 +566,13 @@ int main(int argc, char const **args) {
   CUDA_CHECK(cudaGetDevice(&current_device_id));
   CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
   cudaError_t error = cudaGetDeviceProperties(&props, 0);
-  if (props.major < 9) {
+  if (props.major != 9 || props.minor != 0) {
     std::cerr
-      << "This example requires a GPU of NVIDIA's Hopper Architecture or "
-      << "later (compute capability 90 or greater).\n";
+      << "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
     return 0;
   }
+
 
-  else if (props.major != 9 || props.minor != 0) {
-    std::cerr << "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
-    return 0;
-  }
 
 
   //

0 commit comments

Comments
 (0)