[BugFix] Ray with multiple nodes #28873
Conversation
Signed-off-by: Julien Denize <[email protected]>
Code Review
This pull request aims to fix a crash when running vLLM with Ray on multiple nodes. The crash is caused by an assertion that checks if the number of workers on a node exceeds the number of available GPUs. The fix in this PR is to move this assertion inside an if block that is skipped for Ray, which resolves the immediate issue. However, my review finds that this change is too broad and incorrectly disables this important sanity check for other valid configurations, which could lead to other crashes. I've provided a critical comment explaining the issue and suggesting a more targeted fix.
    visible_device_count = (
        torch.cuda.device_count() if torch.cuda.is_available() else 0
    )
    assert self.parallel_config.local_world_size <= visible_device_count, (
        f"local_world_size ({self.parallel_config.local_world_size}) must "
        f"be less than or equal to the number of visible devices "
        f"({visible_device_count})."
    )
This change moves the assertion for local_world_size inside the if block that handles a specific single-node data parallelism setup. While this fixes the issue for multi-node Ray where local_world_size might be miscalculated, it incorrectly disables this important sanity check for other valid configurations, such as multi-node setups without Ray or single-node setups without data parallelism.
The assertion self.parallel_config.local_world_size <= visible_device_count is a general check to ensure that the number of workers on a node does not exceed the number of available GPUs. It should not be confined to the specific data parallelism case.
A more targeted fix would be to skip this check only for Ray, or to fix the underlying issue with the calculation of local_world_size for Ray environments. Disabling this check for all other configurations could hide potential resource allocation issues and lead to crashes in other scenarios.
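One of the more targeted directions suggested above would be to keep the assertion in general and skip it only when the Ray executor is in use. A minimal sketch of that idea, assuming it sits in the same method as the diff quoted above and that self.parallel_config.distributed_executor_backend reports the executor backend name; this illustrates the reviewer's suggestion, not the change this PR actually makes:

    visible_device_count = (
        torch.cuda.device_count() if torch.cuda.is_available() else 0
    )
    # Keep the general sanity check, but skip it for Ray, where
    # local_world_size can be miscalculated across nodes (the bug this PR
    # works around).
    if self.parallel_config.distributed_executor_backend != "ray":
        assert self.parallel_config.local_world_size <= visible_device_count, (
            f"local_world_size ({self.parallel_config.local_world_size}) must "
            f"be less than or equal to the number of visible devices "
            f"({visible_device_count})."
        )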
Thanks @juliendenize... looks like this was introduced by #23691? cc @luccafong
luccafong left a comment:
lgtm, thanks for the fix!
Signed-off-by: Julien Denize <[email protected]> (cherry picked from commit cdeec2e)
Signed-off-by: Julien Denize <[email protected]>
Signed-off-by: Julien Denize <[email protected]> Signed-off-by: LuminolT <[email protected]>
Signed-off-by: Julien Denize <[email protected]> Signed-off-by: jiang1.li <[email protected]>
Signed-off-by: Julien Denize <[email protected]>
…sible device count error (#4457)
### What this PR does / why we need it?
Fixes the Ray startup failure caused by the local_world_size vs. visible device count assertion error; details in issue #4456. The fix is copied from the vLLM change in PR [#28873](vllm-project/vllm#28873).
- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2
Signed-off-by: leo-pony <[email protected]>
Still not working for me:

    (APIServer pid=3167402) Value error, Tensor parallel size (10) cannot be larger than the number of available GPUs (8). [type=value_error, input_value=ArgsKwargs((), {'pipeline...'_api_process_rank': 0}), input_type=ArgsKwargs]

However, this is the output of ray status showing 12 GPUs available: ray status
Signed-off-by: Julien Denize <[email protected]>
Signed-off-by: Julien Denize <[email protected]>
Still not working for me either
Signed-off-by: Julien Denize <[email protected]> Signed-off-by: Xingyu Liu <[email protected]>
Signed-off-by: Julien Denize <[email protected]>
Purpose
This is a hotfix so that vLLM can be launched across multiple nodes with Ray.
To reproduce on a 2x8 (nodes x gpus) Ray cluster:
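The exact reproduction command is not preserved in this page. As an illustration only (model name and parallelism values are placeholders, not taken from the PR), one way to hit the same code path from Python on such a cluster would be:

    from vllm import LLM

    # Illustrative sketch: request more tensor-parallel workers than any single
    # node has GPUs, so that workers are spread across both Ray nodes.
    llm = LLM(
        model="Qwen/Qwen2.5-72B-Instruct",   # placeholder model
        tensor_parallel_size=16,             # 2 nodes x 8 GPUs
        distributed_executor_backend="ray",  # use the Ray executor backend
    )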
Error:
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.