[NVIDIA] Blackwell Family #24673
Code Review
This pull request updates the CMake configuration to support the new NVIDIA Blackwell architecture family, aligning with CUDA 12.9+ features. The changes introduce new architecture codes and update the minimum required CUDA version for Blackwell-specific kernels. While this is a necessary update, I've identified a critical issue with how the new architecture suffixes are handled, which will likely cause build failures. Additionally, there's a potential regression for users on CUDA 12.8 that should be addressed.
|
No ciflow labels are configured for this repo. |
Signed-off-by: Johnny <[email protected]>
|
Thanks for the PR! Could you please add some more information to the PR description about what this enables, for example:
|
This correctly enables: |
|
@johnnynunez do you want to update cutlass separately? 4.2 tag hasn't been made yet |
hello, cutlass v4.2.0 is out today! |
|
cc @Aidyn-A could you review this PR? |
Aidyn-A
left a comment
I have left some comments. I would like you to test the build, make sure it is 100% successful, and double-check all the kernels on all machines. I doubt that those CUTLASS kernels need extra arch flags, as CUTLASS kernels tend to be arch-specific (e.g. a CUTLASS kernel written for sm_100 can be unusable on sm_120 or sm_110). Please keep in mind that not all arch-conditional instructions can be replaced with family-conditional ones.
Additionally, I have not dug into it, but I am pretty sure the function cuda_archs_loose_intersection needs to be modified to handle family-conditional flags the way it already handles arch-conditional ones:
Lines 315 to 325 in f4cd80f
    set(_CUDA_ARCHS)
    foreach(_arch ${_SRC_CUDA_ARCHS})
      if(_arch MATCHES "\\a$")
        list(REMOVE_ITEM _SRC_CUDA_ARCHS "${_arch}")
        string(REPLACE "a" "" _base "${_arch}")
        if ("${_base}" IN_LIST TGT_CUDA_ARCHS)
          list(REMOVE_ITEM _TGT_CUDA_ARCHS "${_base}")
          list(APPEND _CUDA_ARCHS "${_arch}")
        endif()
      endif()
    endforeach()
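For illustration, the suffix handling quoted above could be generalized so that both arch-conditional (`a`) and family-conditional (`f`) entries are matched against their base arch. This is only a minimal standalone sketch, runnable with `cmake -P`. `select_suffixed_archs` is a hypothetical helper, not the actual `cuda_archs_loose_intersection`, and it deliberately leaves out everything else that function does.

```cmake
cmake_minimum_required(VERSION 3.12)

# Hypothetical helper (NOT vLLM's cuda_archs_loose_intersection).
# For every suffixed entry in SRC_ARCHS ("10.0a", "12.0f"), keep it when its
# base arch ("10.0", "12.0") is requested in TGT_ARCHS.
function(select_suffixed_archs OUT_VAR SRC_ARCHS TGT_ARCHS)
  set(_selected)
  foreach(_arch IN LISTS SRC_ARCHS)
    # Capture the numeric base and accept either an "a" or an "f" suffix.
    if(_arch MATCHES "^([0-9]+\\.[0-9]+)([af])$")
      set(_base "${CMAKE_MATCH_1}")
      if(_base IN_LIST TGT_ARCHS)
        list(APPEND _selected "${_arch}")
      endif()
    endif()
  endforeach()
  set(${OUT_VAR} "${_selected}" PARENT_SCOPE)
endfunction()

# Example: 10.0 is not requested, so 10.0a is dropped.
select_suffixed_archs(_out "9.0a;10.0a;12.0f" "9.0;12.0")
message(STATUS "selected: ${_out}")
```

With the example inputs, `9.0a` and `12.0f` are kept because `9.0` and `12.0` appear in the target list. Anchoring the regex on the full entry avoids stripping an `a` or `f` from anywhere other than the suffix.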
Lastly, the regular arch flags and their family-conditional counterparts conflict with each other. For example,
    nvcc code.cu -gencode arch=compute_120,code=sm_120 -gencode arch=compute_120f,code=sm_120f
will end up failing:
    nvcc fatal : The same GPU code (`sm_120`) generated for non family-specific and family-specific GPU arch
That being said, it is important to be very precise about which flags to apply. I would modify cuda_archs_loose_intersection to exclude the basic arch flag if its family-conditional counterpart is being passed.
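One way to express that exclusion, as a rough sketch only: filter the final arch list so that a plain entry such as `12.0` is dropped whenever `12.0f` is also present. `drop_base_archs_shadowed_by_family` is a hypothetical name, and the sketch assumes arch entries are written as `X.Y`, `X.Ya` or `X.Yf`. It is a standalone `cmake -P` script, not a patch to `cuda_archs_loose_intersection`.

```cmake
cmake_minimum_required(VERSION 3.12)

# Hypothetical helper (not part of vLLM): remove a base arch such as "12.0"
# when its family-specific form "12.0f" is in the same list, so that nvcc is
# never asked to generate sm_120 both ways.
function(drop_base_archs_shadowed_by_family OUT_VAR)
  set(_archs ${ARGN})
  set(_result)
  foreach(_arch IN LISTS _archs)
    # Only unsuffixed entries can be shadowed by a family-specific entry.
    if(NOT _arch MATCHES "[af]$" AND "${_arch}f" IN_LIST _archs)
      continue()
    endif()
    list(APPEND _result "${_arch}")
  endforeach()
  set(${OUT_VAR} "${_result}" PARENT_SCOPE)
endfunction()

# Example: "12.0" is dropped because "12.0f" is present.
drop_base_archs_shadowed_by_family(_out "10.0a;12.0;12.0f")
message(STATUS "filtered: ${_out}")
```

For the example list the result is `10.0a;12.0f`. Whether the base arch or the family arch should win in a real build is a policy decision for the maintainers. This sketch simply prefers the family-specific entry, as suggested in the comment above.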
|
Does |
This code is strange: #if defined CUDA_VERSION. I need to analyze and test it. @jasl, do you have some simple code for testing whether the capability is active and runs? |
|
@DrStone71 My testing machine is an x86 box with an RTX Pro 6000 (SM120). I'm not sure this kernel labeled And I have a patch that works for me: jasl@b52d720. Because SM120 doesn't contain SM100 features, it's difficult to disable the compilation for non-SM100 platforms. |
|
Is a check with CUDA 13 and sm_120 needed? I use a physical machine. Is a particular environment needed for testing (software versions or pip packages)? DrStone71 |
Test MoE, gptoss etc |
Signed-off-by: Johnny <[email protected]> Signed-off-by: johnnynunez <[email protected]> Signed-off-by: Johnny <[email protected]> Signed-off-by: Salvatore Cena <[email protected]> Co-authored-by: Aidyn-A <[email protected]> Co-authored-by: Salvatore Cena <[email protected]>
Signed-off-by: Johnny <[email protected]> Signed-off-by: johnnynunez <[email protected]> Signed-off-by: Johnny <[email protected]> Signed-off-by: Salvatore Cena <[email protected]> Co-authored-by: Aidyn-A <[email protected]> Co-authored-by: Salvatore Cena <[email protected]> Signed-off-by: yewentao256 <[email protected]>
Signed-off-by: Johnny <[email protected]> Signed-off-by: johnnynunez <[email protected]> Signed-off-by: Johnny <[email protected]> Signed-off-by: Salvatore Cena <[email protected]> Co-authored-by: Aidyn-A <[email protected]> Co-authored-by: Salvatore Cena <[email protected]> Signed-off-by: Tomer Asida <[email protected]>
Signed-off-by: Johnny <[email protected]> Signed-off-by: johnnynunez <[email protected]> Signed-off-by: Johnny <[email protected]> Signed-off-by: Salvatore Cena <[email protected]> Co-authored-by: Aidyn-A <[email protected]> Co-authored-by: Salvatore Cena <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
    # moe_data.cu is used by all CUTLASS MoE kernels.
    cuda_archs_loose_intersection(CUTLASS_MOE_DATA_ARCHS "9.0a;10.0a" "${CUDA_ARCHS}")
    if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 13.0)
Why is it 13.0 rather than 12.9?
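To make the threshold easy to experiment with, the version gate could take the minimum CUDA version as a parameter instead of hard-coding it. The following is a hedged sketch, not the PR's code: `pick_moe_data_archs` is a hypothetical helper, and the arch list `"9.0a;10.0f"` is an illustrative placeholder rather than the list the PR actually uses. It runs standalone with `cmake -P`.

```cmake
cmake_minimum_required(VERSION 3.12)

# Hypothetical helper: choose family-specific archs only when the CUDA
# compiler is at least MIN_FAMILY_VERSION. The arch lists are placeholders.
function(pick_moe_data_archs OUT_VAR CUDA_VERSION MIN_FAMILY_VERSION)
  if(CUDA_VERSION VERSION_GREATER_EQUAL MIN_FAMILY_VERSION)
    set(${OUT_VAR} "9.0a;10.0f" PARENT_SCOPE)   # family-conditional variant
  else()
    set(${OUT_VAR} "9.0a;10.0a" PARENT_SCOPE)   # arch-conditional fallback
  endif()
endfunction()

# Compare what a CUDA 12.9 toolchain would get under each threshold.
pick_moe_data_archs(_with_12_9 "12.9" "12.9")
pick_moe_data_archs(_with_13_0 "12.9" "13.0")
message(STATUS "threshold 12.9: ${_with_12_9}")
message(STATUS "threshold 13.0: ${_with_13_0}")
```

With a 12.9 compiler, the first call selects the family-conditional list and the second falls back to the arch-conditional one, which is the difference the question is about. In a real build the first argument would come from `CMAKE_CUDA_COMPILER_VERSION`.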


https://developer.nvidia.com/blog/nvidia-blackwell-and-nvidia-cuda-12-9-introduce-family-specific-architecture-features/
cc @simon-mo