Description
Your current environment
I'm working on a dev setup, so running collect_env.py is not an option; however, I traced the root cause to a commit from 2 hours ago, which explains why I had no issues 2 hours ago.
🐛 Describe the bug
When running vLLM in headless mode on a worker pod in a Kubernetes cluster, the command fails with the following error:
ValueError: data_parallel_rank is not applicable in headless mode
despite not explicitly setting the --data-parallel-rank or --data-parallel-start-rank flags in the command. The error originates from the run_headless function in vllm/entrypoints/cli/serve.py, specifically at this check:
https://github.com/vllm-project/vllm/blame/657f2f301a431542a731719fa8c6326deacc317d/vllm/entrypoints/cli/serve.py#L130
```python
if parallel_config.data_parallel_rank is not None:
    raise ValueError("data_parallel_rank is not applicable in "
                     "headless mode")
```

parallel_config.data_parallel_rank is being set to a non-None value internally, even though no rank-related flags are provided on the command line. This prevents the headless worker from starting and connecting to the master node (vllm-inference-pytorchjob-final-master-0:29500) for data-parallel coordination. The issue appears to have been introduced by commit 657f2f3, since the same setup worked two hours before that change was pushed.
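To illustrate the failure mode, here is a minimal, self-contained sketch (the class and function names below are simplified stand-ins, not vLLM's actual internals): a guard that rejects any non-None rank cannot distinguish "the user passed --data-parallel-rank" from "an earlier code path filled in a value as a side effect", which is what appears to be happening here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParallelConfig:
    # None means "the user never set a rank"; the bug is that something
    # upstream assigns a value even when no rank flag was given.
    data_parallel_rank: Optional[int] = None


def run_headless(cfg: ParallelConfig) -> str:
    # Mirrors the check quoted above: any non-None value is treated as
    # an explicit user setting and is fatal in headless mode.
    if cfg.data_parallel_rank is not None:
        raise ValueError("data_parallel_rank is not applicable in "
                         "headless mode")
    return "worker started"


cfg = ParallelConfig()
# No CLI flag was passed, but suppose an internal setup step assigns a
# rank anyway (as the commit linked above seems to do):
cfg.data_parallel_rank = 0

try:
    run_headless(cfg)
except ValueError as exc:
    print(f"startup failed: {exc}")
```

A guard that instead tracked whether the flag was explicitly provided (e.g. a separate boolean, or a sentinel default) would let internal assignments coexist with headless mode.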
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.