vllm/v1/worker/gpu_worker.py (16 changes: 8 additions & 8 deletions)
@@ -204,14 +204,14 @@ def init_device(self):
                 assert self.local_rank < torch.cuda.device_count(), (
                     f"DP adjusted local rank {self.local_rank} is out of bounds. "
                 )
-            visible_device_count = (
-                torch.cuda.device_count() if torch.cuda.is_available() else 0
-            )
-            assert self.parallel_config.local_world_size <= visible_device_count, (
-                f"local_world_size ({self.parallel_config.local_world_size}) must be "
-                f"less than or equal to the number of visible devices "
-                f"({visible_device_count})."
-            )
+                visible_device_count = (
+                    torch.cuda.device_count() if torch.cuda.is_available() else 0
+                )
+                assert self.parallel_config.local_world_size <= visible_device_count, (
+                    f"local_world_size ({self.parallel_config.local_world_size}) must "
+                    f"be less than or equal to the number of visible devices "
+                    f"({visible_device_count})."
+                )
Comment on lines +207 to +214 (Contributor):
critical

This change moves the assertion for local_world_size inside the if block that handles a specific single-node data parallelism setup. While this fixes the issue for multi-node Ray where local_world_size might be miscalculated, it incorrectly disables this important sanity check for other valid configurations, such as multi-node setups without Ray or single-node setups without data parallelism.

The assertion self.parallel_config.local_world_size <= visible_device_count is a general check to ensure that the number of workers on a node does not exceed the number of available GPUs. It should not be confined to the specific data parallelism case.

A more targeted fix would be to skip this check only for Ray, or to fix the underlying issue with the calculation of local_world_size for Ray environments. Disabling this check for all other configurations could hide potential resource allocation issues and lead to crashes in other scenarios.
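
A minimal sketch of that more targeted approach: keep the sanity check for every configuration and skip it only when Ray manages worker placement. The helper name, the use_ray flag, and the idea of deriving that flag from something like parallel_config.distributed_executor_backend == "ray" are assumptions for illustration, not code from this PR.

import torch


def check_local_world_size(local_world_size: int, use_ray: bool) -> None:
    # General sanity check: a node must not schedule more workers than it
    # has visible GPUs. Skipped only for Ray, where local_world_size may be
    # miscalculated on multi-node deployments (the issue this PR targets).
    if use_ray:
        return
    visible_device_count = (
        torch.cuda.device_count() if torch.cuda.is_available() else 0
    )
    assert local_world_size <= visible_device_count, (
        f"local_world_size ({local_world_size}) must be less than or equal "
        f"to the number of visible devices ({visible_device_count})."
    )


# Example: on a 4-GPU node, requesting 8 local workers should fail fast here
# rather than crash later during device assignment.
# check_local_world_size(local_world_size=8, use_ray=False)

This would keep the guard in place for multi-node setups without Ray and for single-node runs without data parallelism, while still avoiding the false positive the PR works around.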

             self.device = torch.device(f"cuda:{self.local_rank}")
             current_platform.set_device(self.device)
