[NVIDIA] Blackwell Family #24673

johnnynunez · 2025-09-11T15:54:14Z

https://developer.nvidia.com/blog/nvidia-blackwell-and-nvidia-cuda-12-9-introduce-family-specific-architecture-features/

cc @simon-mo

gemini-code-assist

Code Review

This pull request updates the CMake configuration to support the new NVIDIA Blackwell architecture family, aligning with CUDA 12.9+ features. The changes introduce new architecture codes and update the minimum required CUDA version for Blackwell-specific kernels. While this is a necessary update, I've identified a critical issue with how the new architecture suffixes are handled, which will likely cause build failures. Additionally, there's a potential regression for users on CUDA 12.8 that should be addressed.

CMakeLists.txt

pytorch-bot · 2025-09-13T20:56:37Z

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

Signed-off-by: Johnny <[email protected]>

hmellor · 2025-09-15T08:21:22Z

Thanks for the PR! Could you please add some more information to the PR description about what this enables, for example:

Which new hardware does this enable?
What effect does it have on the wheel size?

johnnynunez · 2025-09-15T08:30:57Z

> Thanks for the PR! Could you please add some more information to the PR description about what this enables, for example:
>
> Which new hardware does this enable?
> What effect does it have on the wheel size?

This correctly enables:
Spark
Thor
GB300
without increasing binary size.

CMakeLists.txt

mgoin · 2025-09-15T16:35:52Z

@johnnynunez do you want to update cutlass separately? 4.2 tag hasn't been made yet

[2025-09-13T21:11:55Z] #25 13.59 CMake Error at cutlass-subbuild/cutlass-populate-prefix/tmp/cutlass-populate-gitclone.cmake:61 (message):
[2025-09-13T21:11:55Z] #25 13.59   Failed to checkout tag: 'v4.2.0'

johnnynunez · 2025-09-15T20:19:41Z

> @johnnynunez do you want to update cutlass separately? 4.2 tag hasn't been made yet
>
> [2025-09-13T21:11:55Z] #25 13.59 CMake Error at cutlass-subbuild/cutlass-populate-prefix/tmp/cutlass-populate-gitclone.cmake:61 (message):
> [2025-09-13T21:11:55Z] #25 13.59   Failed to checkout tag: 'v4.2.0'

hello, cutlass v4.2.0 is out today!

NVIDIA/cutlass@6a35b4d

johnnynunez · 2025-09-16T02:22:29Z

@mgoin it is out now: https://github.com/NVIDIA/cutlass/releases/tag/v4.2.0

johnnynunez · 2025-09-17T14:48:43Z

cc @Aidyn-A could you review this PR?

Aidyn-A

I have left some comments. I would like you to test the build, make sure it is 100% successful, and double-check all the kernels on all machines. I doubt that those CUTLASS kernels need extra arch flags, as CUTLASS kernels tend to be arch-specific (e.g. a CUTLASS kernel written for sm_100 can be unusable on sm_120 or sm_110). Please keep in mind that not all arch-conditional instructions can be replaced with family-conditional ones.

Additionally, I have not dug into it, but I am pretty sure the function cuda_archs_loose_intersection needs to be modified to handle family-conditional flags the way it already handles arch-conditional ones:

vllm/cmake/utils.cmake

Lines 315 to 325 in f4cd80f

    
    set(_CUDA_ARCHS)
    foreach(_arch ${_SRC_CUDA_ARCHS})
      if(_arch MATCHES "\\a$")
        list(REMOVE_ITEM _SRC_CUDA_ARCHS "${_arch}")
        string(REPLACE "a" "" _base "${_arch}")
        if ("${_base}" IN_LIST TGT_CUDA_ARCHS)
          list(REMOVE_ITEM _TGT_CUDA_ARCHS "${_base}")
          list(APPEND _CUDA_ARCHS "${_arch}")
        endif()
      endif()
    endforeach()

Lastly, the regular arch flags and their family-conditional counterparts conflict with each other. For example:

nvcc code.cu -gencode arch=compute_120,code=sm_120 -gencode arch=compute_120f,code=sm_120f

will end up failing:

nvcc fatal   : The same GPU code (`sm_120`) generated for non family-specific and family-specific GPU arch

That being said, it is important to be very precise about which flags to apply. I would modify cuda_archs_loose_intersection to exclude the basic arch flag if its family-conditional counterpart is being passed.
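A minimal sketch of that exclusion, written in Python only to model the list logic; the actual change would belong in cmake/utils.cmake. The function name drop_conflicting_base_archs is hypothetical, and the sketch assumes architectures are written as strings such as "12.0" and "12.0f":

```python
def drop_conflicting_base_archs(archs):
    """Drop a plain arch (e.g. "12.0") when its family-conditional
    counterpart (e.g. "12.0f") is also requested.

    Hypothetical helper: it models the filtering idea only and is not
    the code that was merged.
    """
    # Bases that have a family-conditional entry: "12.0f" -> "12.0".
    family_bases = {arch[:-1] for arch in archs if arch.endswith("f")}
    # Keep everything except the plain arches shadowed by a family entry.
    return [arch for arch in archs if arch not in family_bases]


if __name__ == "__main__":
    print(drop_conflicting_base_archs(["9.0a", "12.0", "12.0f"]))
    # prints ['9.0a', '12.0f']
```

With a filter like this, a list containing both "12.0" and "12.0f" would yield only the family-conditional entry, so the two -gencode flags from the failing nvcc example would not be emitted together. Arch-conditional entries such as "9.0a" are left untouched here.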

CMakeLists.txt

jasl · 2025-09-28T21:41:59Z

https://github.com/vllm-project/vllm/pull/24673/files#diff-c1cdf2ae7a3604efb26f96752f519b9d86fe38fd05dacca78301c522fb20d819R256-R262

Does cutlass_moe_mm_sm100 work on SM120?

DrStone1971 · 2025-09-29T07:35:55Z

> https://github.com/vllm-project/vllm/pull/24673/files#diff-c1cdf2ae7a3604efb26f96752f519b9d86fe38fd05dacca78301c522fb20d819R256-R262
>
> Does cutlass_moe_mm_sm100 work on SM120?

This code looks strange:

#if defined CUDA_VERSION
  if (cuda_device_capability >= 100) {
    return CUDA_VERSION >= 12080;
  }
  if (cuda_device_capability >= 90) {
    return CUDA_VERSION >= 12030;
  }
#endif

I need to analyze and test it.

@jasl do you have some simple code for testing whether the capability is active and runs?

jasl · 2025-09-29T07:50:27Z

https://github.com/vllm-project/vllm/pull/24673/files#diff-c1cdf2ae7a3604efb26f96752f519b9d86fe38fd05dacca78301c522fb20d819R256-R262

@DrStone71

My testing machine is an x86 with RTX Pro 6000 (SM120)

I'm not sure this kernel labeled _sm100 will work on SM120.
Looking at the CMakeLists https://github.com/vllm-project/vllm/blob/main/CMakeLists.txt#L678-L696 it only works for SM100.
So I guess we need an upper bound check here.

Also, cutlass_moe_mm_sm100 causes trouble for me when running vllm. I got ImportError: /home/jasl/.venv/lib/python3.12/site-packages/vllm/_C.abi3.so: undefined symbol: _Z20cutlass_moe_mm_sm100RN2at6TensorERKS0_S3_S3_S3_S3_S3_S3_S3_S3_bb

I have a patch that works for me jasl@b52d720
(I'm not sure this is correct)

Because SM120 doesn't contain SM100 features, it's difficult to disable the compilation for non-SM100 platforms
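To illustrate the upper-bound idea, here is a rough Python model of the capability gate quoted earlier in the thread. It is a sketch under assumptions, not the patch referenced above: the function name moe_kernel_supported is hypothetical, and the cut-off at 110 is an assumption based on the review remark that SM100 kernels may be unusable on sm_110 and sm_120.

```python
def moe_kernel_supported(capability, cuda_version):
    """Model of the version gate with an explicit upper bound.

    capability:   compute capability as an integer, e.g. 100 for SM100.
    cuda_version: CUDA_VERSION-style integer, e.g. 12080 for CUDA 12.8.
    Hypothetical helper; the bound of 110 is an assumption.
    """
    # SM100-range kernels: only 10.x devices qualify (assumed bound).
    if 100 <= capability < 110:
        return cuda_version >= 12080
    # Hopper-range kernels, same CUDA threshold as the quoted snippet.
    if 90 <= capability < 100:
        return cuda_version >= 12030
    # Everything else, including SM110 and SM120, is rejected here.
    return False


if __name__ == "__main__":
    print(moe_kernel_supported(100, 12080))  # prints True
    print(moe_kernel_supported(120, 13000))  # prints False
```

Without the upper bound, a capability of 120 would take the first branch of the quoted snippet and report the SM100 kernel as available, which matches the undefined-symbol failure described above.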

jasl · 2025-09-29T10:27:07Z

With my additional patch, vllm@CUDA 13 is working on my RTX Pro 6000

I'm testing on my Thor now.

Signed-off-by: johnnynunez <[email protected]>

DrStone1971 · 2025-10-01T07:46:08Z

Is a check needed with CUDA 13 and sm_120?

I am using a physical machine. Is a particular environment needed for testing (software versions or pip packages)?

DrStone71

johnnynunez · 2025-10-01T08:04:17Z

> Is a check needed with CUDA 13 and sm_120?
>
> I am using a physical machine. Is a particular environment needed for testing (software versions or pip packages)?
>
> DrStone71

Test MoE, gptoss etc

Signed-off-by: Johnny <[email protected]> Signed-off-by: johnnynunez <[email protected]> Signed-off-by: Johnny <[email protected]> Signed-off-by: Salvatore Cena <[email protected]> Co-authored-by: Aidyn-A <[email protected]> Co-authored-by: Salvatore Cena <[email protected]>

Signed-off-by: Johnny <[email protected]> Signed-off-by: johnnynunez <[email protected]> Signed-off-by: Johnny <[email protected]> Signed-off-by: Salvatore Cena <[email protected]> Co-authored-by: Aidyn-A <[email protected]> Co-authored-by: Salvatore Cena <[email protected]> Signed-off-by: yewentao256 <[email protected]>

Signed-off-by: Johnny <[email protected]> Signed-off-by: johnnynunez <[email protected]> Signed-off-by: Johnny <[email protected]> Signed-off-by: Salvatore Cena <[email protected]> Co-authored-by: Aidyn-A <[email protected]> Co-authored-by: Salvatore Cena <[email protected]> Signed-off-by: Tomer Asida <[email protected]>

Signed-off-by: Johnny <[email protected]> Signed-off-by: johnnynunez <[email protected]> Signed-off-by: Johnny <[email protected]> Signed-off-by: Salvatore Cena <[email protected]> Co-authored-by: Aidyn-A <[email protected]> Co-authored-by: Salvatore Cena <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>

youkaichao · 2025-10-14T02:29:40Z

CMakeLists.txt


  # moe_data.cu is used by all CUTLASS MoE kernels.
-  cuda_archs_loose_intersection(CUTLASS_MOE_DATA_ARCHS "9.0a;10.0a" "${CUDA_ARCHS}")
+  if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 13.0)


why is it 13.0 rather than 12.9?

@johnnynunez

johnnynunez requested review from LucasWilkinson and tlrmchlsmth as code owners September 11, 2025 15:54

mergify bot added the ci/build label Sep 11, 2025

gemini-code-assist bot reviewed Sep 11, 2025

johnnynunez mentioned this pull request Sep 13, 2025

[NVIDIA] Enable Thor and Spark with CUDA 13 #23469

Closed

pytorch-bot bot removed the ci/build label Sep 13, 2025

mergify bot added the ci/build label Sep 13, 2025

johnnynunez and others added 2 commits September 13, 2025 22:56

Update CMakeLists.txt

90df7a0

Signed-off-by: Johnny <[email protected]>

Merge branch 'vllm-project:main' into main

7e0ad57

johnnynunez requested a review from ProExpertProg September 13, 2025 20:58

pytorch-bot bot removed the ci/build label Sep 13, 2025

mergify bot added the ci/build label Sep 13, 2025

pytorch-bot bot removed the ci/build label Sep 13, 2025

johnnynunez requested a review from DrStone1971 September 13, 2025 21:02

mergify bot added the ci/build label Sep 13, 2025

johnnynunez changed the title ~~Blackwell Family~~ [NVIDIA] Blackwell Family Sep 13, 2025

ProExpertProg reviewed Sep 15, 2025

huydhn mentioned this pull request Sep 15, 2025

Enable pytorchbot on vLLM pytorch/test-infra#7103

Closed

johnnynunez mentioned this pull request Sep 15, 2025

Fix for compile on CUDA 13 -- Fix for make a correct Compile (Merge of Pull Request) #24809

Closed

Aidyn-A suggested changes Sep 18, 2025

johnnynunez requested a review from ProExpertProg September 18, 2025 08:40

khluu removed the ready ONLY add when PR is ready to merge/full CI is needed label Sep 29, 2025

henrylhtsang mentioned this pull request Sep 29, 2025

[cutlass-4][take 2] upgrade to cutlass 4.2.1 pytorch/pytorch#164159

Closed

johnnynunez and others added 3 commits October 1, 2025 06:38

Merge branch 'vllm-project:main' into patch-1

ac301ea

fix correct support

82ee77a

Signed-off-by: johnnynunez <[email protected]>

Merge branch 'vllm-project:main' into patch-1

56b0ff7

mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 1, 2025

mgoin approved these changes Oct 1, 2025

vllm-bot merged commit 5234dc7 into vllm-project:main Oct 1, 2025
83 of 85 checks passed

ZJY0516 mentioned this pull request Oct 2, 2025

[Compile] Fix import error of vllm._C #26077

Closed

5 tasks

daniel-fahey mentioned this pull request Oct 5, 2025

python3Packages.vllm: disable Blackwell GPU support to fix CUDA build NixOS/nixpkgs#448965

Merged

13 tasks

youkaichao reviewed Oct 14, 2025

15 participants
