[BugFix] Ray with multiple nodes #28873
Conversation
Signed-off-by: Julien Denize <[email protected]>
Code Review
This pull request aims to fix a crash when running vLLM with Ray on multiple nodes. The crash is caused by an assertion that checks if the number of workers on a node exceeds the number of available GPUs. The fix in this PR is to move this assertion inside an if block that is skipped for Ray, which resolves the immediate issue. However, my review finds that this change is too broad and incorrectly disables this important sanity check for other valid configurations, which could lead to other crashes. I've provided a critical comment explaining the issue and suggesting a more targeted fix.
    visible_device_count = (
        torch.cuda.device_count() if torch.cuda.is_available() else 0
    )
    assert self.parallel_config.local_world_size <= visible_device_count, (
        f"local_world_size ({self.parallel_config.local_world_size}) must "
        f"be less than or equal to the number of visible devices "
        f"({visible_device_count})."
    )
This change moves the assertion for local_world_size inside the if block that handles a specific single-node data parallelism setup. While this fixes the issue for multi-node Ray where local_world_size might be miscalculated, it incorrectly disables this important sanity check for other valid configurations, such as multi-node setups without Ray or single-node setups without data parallelism.
The assertion self.parallel_config.local_world_size <= visible_device_count is a general check to ensure that the number of workers on a node does not exceed the number of available GPUs. It should not be confined to the specific data parallelism case.
A more targeted fix would be to skip this check only for Ray, or to fix the underlying issue with the calculation of local_world_size for Ray environments. Disabling this check for all other configurations could hide potential resource allocation issues and lead to crashes in other scenarios.
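One of the more targeted directions suggested above would be to keep the assertion in general and skip it only when the Ray executor is in use. A minimal sketch of that idea, assuming it sits in the same method as the diff quoted above and that self.parallel_config.distributed_executor_backend reports the executor backend name; this illustrates the reviewer's suggestion, not the change this PR actually makes:

    visible_device_count = (
        torch.cuda.device_count() if torch.cuda.is_available() else 0
    )
    # Keep the general sanity check, but skip it for Ray, where
    # local_world_size can be miscalculated across nodes (the bug this PR
    # works around).
    if self.parallel_config.distributed_executor_backend != "ray":
        assert self.parallel_config.local_world_size <= visible_device_count, (
            f"local_world_size ({self.parallel_config.local_world_size}) must "
            f"be less than or equal to the number of visible devices "
            f"({visible_device_count})."
        )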
Thanks @juliendenize... looks like this was introduced by #23691? cc @luccafong
luccafong left a comment:
lgtm, thanks for the fix!
Signed-off-by: Julien Denize <[email protected]> (cherry picked from commit cdeec2e)
Signed-off-by: Julien Denize <[email protected]>
Signed-off-by: Julien Denize <[email protected]> Signed-off-by: LuminolT <[email protected]>
Signed-off-by: Julien Denize <[email protected]> Signed-off-by: jiang1.li <[email protected]>
Signed-off-by: Julien Denize <[email protected]>
…sible device count error (#4457)
### What this PR does / why we need it?
Fixes the Ray startup failure caused by the local_world_size vs. visible device count assertion error; details in issue #4456. The fix is copied from the vLLM change in PR [#28873](vllm-project/vllm#28873).
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
Signed-off-by: leo-pony <[email protected]>
Still not working for me:

    (APIServer pid=3167402) Value error, Tensor parallel size (10) cannot be larger than the number of available GPUs (8). [type=value_error, input_value=ArgsKwargs((), {'pipeline...'_api_process_rank': 0}), input_type=ArgsKwargs]

However, this is the output of ray status showing 12 GPUs available: ray status
Signed-off-by: Julien Denize <[email protected]>
Signed-off-by: Julien Denize <[email protected]>
Still not working for me either
Signed-off-by: Julien Denize <[email protected]> Signed-off-by: Xingyu Liu <[email protected]>
Signed-off-by: Julien Denize <[email protected]>
Purpose
This is a hotfix so that vLLM can be launched across multiple nodes with Ray.
To reproduce on a 2x8 (nodes x gpus) Ray cluster:
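The exact reproduction command is not preserved in this page. As an illustration only (model name and parallelism values are placeholders, not taken from the PR), one way to hit the same code path from Python on such a cluster would be:

    from vllm import LLM

    # Illustrative sketch: request more tensor-parallel workers than any single
    # node has GPUs, so that workers are spread across both Ray nodes.
    llm = LLM(
        model="Qwen/Qwen2.5-72B-Instruct",   # placeholder model
        tensor_parallel_size=16,             # 2 nodes x 8 GPUs
        distributed_executor_backend="ray",  # use the Ray executor backend
    )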
Error:
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.