32 changes: 31 additions & 1 deletion core/helm-charts/vllm/xeon-values.yaml
@@ -243,6 +243,36 @@ modelConfigs:
tensor_parallel_size: "{{ .Values.tensor_parallel_size }}"
pipeline_parallel_size: "{{ .Values.pipeline_parallel_size }}"

"Qwen/Qwen2.5-VL-7B-Instruct":
Collaborator
Thanks for adding support for this VL model. For better performance on Xeon, include these additional variables and extra command arguments. Also, tensor parallelism is calculated dynamically based on the system configuration where the models are deployed.

configMapValues:
  VLLM_CPU_KVCACHE_SPACE: "40"
  VLLM_RPC_TIMEOUT: "100000"
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1"
  VLLM_ENGINE_ITERATION_TIMEOUT_S: "120"
  VLLM_CPU_NUM_OF_RESERVED_CPU: "0"
  VLLM_CPU_SGL_KERNEL: "1"
  HF_HUB_DISABLE_XET: "1"
extraCmdArgs:
  [
    "--block-size",
    "128",
    "--dtype",
    "bfloat16",
    "--distributed_executor_backend",
    "mp",
    "--enable_chunked_prefill",
    "--enforce-eager",
    "--max-model-len",
    "33024",
    "--max-num-batched-tokens",
    "2048",
    "--max-num-seqs",
    "256",
  ]
tensor_parallel_size: "{{ .Values.tensor_parallel_size }}"
pipeline_parallel_size: "{{ .Values.pipeline_parallel_size }}"
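The reviewer notes that tensor parallelism is sized dynamically from the system the models are deployed on. The chart's actual logic is not shown in this thread, so the following is purely an illustrative sketch: the helper name and the one-worker-per-socket, power-of-two policy are assumptions, not the chart's real behavior.

```python
def pick_tensor_parallel(total_cores: int, cores_per_socket: int) -> int:
    """Hypothetical policy: largest power of two not exceeding the socket count.

    A real chart would probe the host (e.g. lscpu/NUMA topology) rather than
    take these numbers as arguments; this just shows the shape of the idea.
    """
    sockets = max(1, total_cores // cores_per_socket)
    tp = 1
    while tp * 2 <= sockets:
        tp *= 2
    return tp

# A dual-socket Xeon with 64 cores per socket would get tensor_parallel_size=2
print(pick_tensor_parallel(128, 64))
```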

Author
Thanks for the suggestions!
I’ve updated xeon-values.yaml to include the additional configMap values and extra command arguments as suggested.
Please let me know if anything else needs adjustment.

configMapValues:
  VLLM_SKIP_WARMUP: "true"
  VLLM_CPU_KVCACHE_SPACE: "40"
  VLLM_RPC_TIMEOUT: "100000"
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1"
  VLLM_ENGINE_ITERATION_TIMEOUT_S: "120"
  VLLM_CPU_NUM_OF_RESERVED_CPU: "0"
  VLLM_CPU_SGL_KERNEL: "1"
@prernanookala-ai, have you tried running Qwen2.5-VL-7B-Instruct with this patch? When I used these settings before, the model either failed to start or the server crashed on a /chat/completions request.

For testing, you can use this curl command:

curl -X POST "http:///v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ]
  }'
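For scripted testing, the same request body can be built in Python. This sketch only constructs the OpenAI-style multimodal payload; the server host is left out, matching the curl command above, and would need to be filled in before sending.

```python
import json

def build_chat_payload(image_url: str, prompt: str) -> dict:
    """Build an OpenAI-style multimodal /chat/completions payload."""
    return {
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

payload = build_chat_payload(
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
    "Describe this image in one sentence.",
)
body = json.dumps(payload)  # send this as the POST body
```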

Contributor

@zahidulhaque - You're right, this patch is not working for me either.

Below are the values I've tested and found to be stable for Qwen2.5-VL-7B-Instruct; setting VLLM_CPU_KVCACHE_SPACE to 16 and disabling Triton resolved the issue on my end.

"Qwen/Qwen2.5-VL-7B-Instruct":
  configMapValues:
    VLLM_SKIP_WARMUP: "true"
    VLLM_CPU_KVCACHE_SPACE: "16"
    VLLM_DISABLE_TRITON: "1"
  extraCmdArgs: ["--max-model-len", "8192"]
  tensor_parallel_size: "1"

Let me know if I can update the PR with these values.


Sure @HarikaDev296, if things are working fine with the above configuration, you can go ahead and update the code.


Also, VLLM_CPU_KVCACHE_SPACE: "16" might be too small for multimodal models. Try setting it to at least 40. Also make sure to test with the curl command once the server is up.
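A rough back-of-envelope calculation shows why a larger KV cache matters. VLLM_CPU_KVCACHE_SPACE is the cache budget in GiB; the model dimensions below are assumptions for Qwen2.5-VL-7B's language model (28 layers, 4 KV heads under GQA, head dim 128, bf16 = 2 bytes per element) and should be checked against the model's config.json before relying on the numbers.

```python
def kv_bytes_per_token(layers=28, kv_heads=4, head_dim=128, dtype_bytes=2):
    """Bytes of KV cache per token: key + value tensors at every layer.

    Assumed Qwen2.5-VL-7B dimensions; verify against the model config.
    """
    return 2 * layers * kv_heads * head_dim * dtype_bytes  # ~56 KiB/token

def max_cached_tokens(kvcache_gib: int) -> int:
    """Approximate total tokens the KV cache budget can hold."""
    return (kvcache_gib * 1024**3) // kv_bytes_per_token()

# Compare how many tokens fit in a 16 GiB vs 40 GiB cache; image inputs
# expand into many tokens, so multimodal requests eat the budget quickly.
print(max_cached_tokens(16), max_cached_tokens(40))
```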

Contributor

@zahidulhaque - I tested the Qwen model with the config below, increased VLLM_CPU_KVCACHE_SPACE to 40, and was successfully able to get inference.
"Qwen/Qwen2.5-VL-7B-Instruct":
  configMapValues:
    VLLM_SKIP_WARMUP: "true"
    VLLM_CPU_KVCACHE_SPACE: "40"
    VLLM_DISABLE_TRITON: "1"
  extraCmdArgs: ["--max-model-len", "8192"]
  tensor_parallel_size: "1"

HF_HUB_DISABLE_XET: "1"
extraCmdArgs:
[
"--block-size",
"128",
"--dtype",
"bfloat16",
"--distributed_executor_backend",
"mp",
"--enable_chunked_prefill",
"--enforce-eager",
"--max-model-len",
"33024",
"--max-num-batched-tokens",
"2048",
"--max-num-seqs",
"256",
]
tensor_parallel_size: "{{ .Values.tensor_parallel_size }}"
pipeline_parallel_size: "{{ .Values.pipeline_parallel_size }}"

defaultModelConfigs:
configMapValues:
VLLM_CPU_KVCACHE_SPACE: "40"
@@ -270,4 +300,4 @@ defaultModelConfigs:
"256",
]
tensor_parallel_size: "{{ .Values.tensor_parallel_size }}"
pipeline_parallel_size: "{{ .Values.pipeline_parallel_size }}"