Add parameters for Qwen2.5-vl-7b-instruct model #47
@@ -243,6 +243,36 @@ modelConfigs:
     tensor_parallel_size: "{{ .Values.tensor_parallel_size }}"
     pipeline_parallel_size: "{{ .Values.pipeline_parallel_size }}"

+  "Qwen/Qwen2.5-VL-7B-Instruct":
+    configMapValues:
+      VLLM_SKIP_WARMUP: true
+      VLLM_CPU_KVCACHE_SPACE: "40"
+      VLLM_RPC_TIMEOUT: "100000"
+      VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1"
+      VLLM_ENGINE_ITERATION_TIMEOUT_S: "120"
+      VLLM_CPU_NUM_OF_RESERVED_CPU: "0"
+      VLLM_CPU_SGL_KERNEL: "1"
zahidulhaque commented:

@prernanookala-ai, have you tried running Qwen2.5-VL-7B-Instruct with this patch? When I used these settings before, the model either failed to start or the server crashed on a /chat/completions request. For testing you can use the curl command: curl -X POST "http:///v1/chat/completions"

HarikaDev296 (Contributor) commented:

@zahidulhaque - You're right, this patch is not working for me either. Below are the values I've tested and found to be stable for Qwen2.5-VL-7B-Instruct: setting VLLM_CPU_KVCACHE_SPACE to 16 and disabling Triton resolved the issue on my end.

"Qwen/Qwen2.5-VL-7B-Instruct":

Let me know if I can update the PR with these values.

zahidulhaque commented:

Sure @HarikaDev296, if things are working fine with the above configuration, you can go ahead and update the code.

zahidulhaque commented:

Also, VLLM_CPU_KVCACHE_SPACE: "16" might be too small for multimodal models. Try setting it to at least 40. Also make sure to test with the curl command once the server is up.

HarikaDev296 (Contributor) commented:

@zahidulhaque - I tested the Qwen model with the config below, increased VLLM_CPU_KVCACHE_SPACE to 40, and was successfully able to get inference.
+      HF_HUB_DISABLE_XET: "1"
+    extraCmdArgs:
+      [
+        "--block-size",
+        "128",
+        "--dtype",
+        "bfloat16",
+        "--distributed_executor_backend",
+        "mp",
+        "--enable_chunked_prefill",
+        "--enforce-eager",
+        "--max-model-len",
+        "33024",
+        "--max-num-batched-tokens",
+        "2048",
+        "--max-num-seqs",
+        "256",
+      ]
+    tensor_parallel_size: "{{ .Values.tensor_parallel_size }}"
+    pipeline_parallel_size: "{{ .Values.pipeline_parallel_size }}"
 defaultModelConfigs:
   configMapValues:
     VLLM_CPU_KVCACHE_SPACE: "40"

@@ -270,4 +300,4 @@ defaultModelConfigs:
       "256",
     ]
     tensor_parallel_size: "{{ .Values.tensor_parallel_size }}"
-    pipeline_parallel_size: "{{ .Values.pipeline_parallel_size }}"
+    pipeline_parallel_size: "{{ .Values.pipeline_parallel_size }}"
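The reviewers' back-and-forth over VLLM_CPU_KVCACHE_SPACE (16 vs. 40 GiB) can be sanity-checked with rough arithmetic. Below is a minimal sketch of that estimate; the architecture numbers (28 decoder layers, 4 KV heads via grouped-query attention, head dimension 128, bf16 cache) are assumptions about the Qwen2.5-VL-7B language decoder, not values taken from this PR.

```python
# Rough estimate: how many tokens fit in a 40 GiB CPU KV cache for a
# Qwen2.5-VL-7B-class model. All architecture numbers below are
# assumptions for illustration, not values from the PR diff.
layers, kv_heads, head_dim = 28, 4, 128
bytes_per_elem = 2  # bfloat16 cache

# Per token, each layer stores a key tensor and a value tensor.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem

cache_gib = 40  # VLLM_CPU_KVCACHE_SPACE
total_tokens = cache_gib * 1024**3 // kv_bytes_per_token

print(kv_bytes_per_token)  # 57344 bytes (~56 KiB) per token
print(total_tokens)        # roughly 7.5e5 tokens of cache capacity
```

Under these assumptions a 40 GiB cache holds on the order of hundreds of thousands of tokens, while 16 GiB holds well under half that, which is consistent with the reviewer's point that 16 is tight for multimodal workloads where image inputs expand into many tokens.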
Thanks for adding support for this VL model. For better performance on Xeon, include these additional variables and extra command arguments. Also, tensor parallelism is calculated dynamically based on the system configuration where the models are deployed.
configMapValues:
VLLM_CPU_KVCACHE_SPACE: "40"
VLLM_RPC_TIMEOUT: "100000"
VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1"
VLLM_ENGINE_ITERATION_TIMEOUT_S: "120"
VLLM_CPU_NUM_OF_RESERVED_CPU: "0"
VLLM_CPU_SGL_KERNEL: "1"
HF_HUB_DISABLE_XET: "1"
extraCmdArgs:
[
"--block-size",
"128",
"--dtype",
"bfloat16",
"--distributed_executor_backend",
"mp",
"--enable_chunked_prefill",
"--enforce-eager",
"--max-model-len",
"33024",
"--max-num-batched-tokens",
"2048",
"--max-num-seqs",
"256",
]
tensor_parallel_size: "{{ .Values.tensor_parallel_size }}"
pipeline_parallel_size: "{{ .Values.pipeline_parallel_size }}"
Thanks for the suggestions!
I’ve updated xeon-values.yaml to include the additional configMap values and extra command arguments as suggested.
Please let me know if anything else needs adjustment.
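The smoke test the reviewers mention (a curl POST to /v1/chat/completions once the server is up) can be sketched as follows. This builds an illustrative request body for vLLM's OpenAI-compatible chat endpoint; the image URL and the host/port in the comment are placeholders, not values from this PR.

```python
import json

# Illustrative /v1/chat/completions payload for a vision-language model.
# The image URL is a placeholder for whatever test image you use.
payload = {
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/sample.jpg"},
                },
            ],
        }
    ],
    "max_tokens": 128,
}

# Save and send with curl once the server is up, e.g.:
#   curl -X POST "http://<host>:<port>/v1/chat/completions" \
#        -H "Content-Type: application/json" -d @payload.json
print(json.dumps(payload, indent=2))
```

If the server returns a normal chat completion rather than crashing, the configuration has passed the same check the reviewers used to validate the VLLM_CPU_KVCACHE_SPACE change.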