If user set acceleratorResource, respect the value #141
Conversation
Also added a temporary, basic test suite to validate the chart's rendering logic. It is intended as a stopgap until a more formal testing framework is in place.
@@ -0,0 +1,14 @@
# Test values for default accelerator resource behavior.
# The chart should automatically set the GPU count to match tensor parallelism.
Should the default match tensor parallelism x data parallelism?
modelCommand: vllmServe
resources:
  limits:
    nvidia.com/gpu: "8" # User-defined value
Would this still work if I want to set gpu to 0? For example, for a vLLM simulator that doesn't require GPUs but whose args still use tensor-parallel-size=2.
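A hedged sketch of that scenario, for illustration only. The `modelCommand` and `resources` keys come from the diff above; the args key is an assumption and may not match the chart's actual API:

```yaml
# Illustrative only: a GPU-free vLLM simulator that still passes a tensor-parallel flag.
modelCommand: vllmServe
resources:
  limits:
    nvidia.com/gpu: "0"          # explicitly zero; the chart should respect this, not override it
vllmArgs:                        # hypothetical key name for engine args
  - "--tensor-parallel-size=2"   # parallelism arg kept even though no GPUs are requested
```

The question is whether an explicit "0" is treated as "user set a value" (respected) or as "unset" (overwritten by the computed default).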
echo "Running Helm template rendering tests..."
echo "========================================"
This is really nice. I wonder if you want to include this as part of the Lint/Test Chart GitHub Actions workflow.
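If that sounds useful, a rough sketch of what the addition could look like. The job name and the test script path are assumptions and would need to match the repo's actual layout:

```yaml
# Hypothetical job for the existing Lint/Test Chart workflow.
jobs:
  template-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/setup-helm@v4
      - name: Run Helm template rendering tests
        run: bash ./tests/run-tests.sh   # assumed script location
```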
Agree that there is a problem. However, I think the solution is incomplete. My understanding of the concepts and relationships:

- Data parallelism (dp) indicates the number of replicas of the full model. Each replica corresponds to a vLLM engine.
- These vLLM engines can run in the same pod ("single node") or in different pods ("multi-node"). In all cases, the total number of GPUs needed is dp * tp.
- In a multi-node scenario there is also the dp local (dpl) size (a vLLM option), the number of engines per pod, and the number of worker pods (w). In the case of a single pod (w = 1), dp = dpl.
- For a given pod, the number of GPUs required is dpl * tp.

There are 4 variables: tp, dp, dpl, w. tp is always required (default 1). Any 2 of the remaining 3 allow us to compute the third and the number of GPUs per pod. Today modelservice allows specifying only 2 of these (tp, dp), which is sufficient for the single-node case. I propose allowing the user to specify dpl and w as well. Only 2 are required (the default for all is 1). It is easiest if the user specifies dpl and w; then dp = dpl * w and GPUs per pod = dpl * tp (see the sketch below). If the user specifies other combinations, they have to be sure to get the ratios correct.
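A minimal numeric sketch of those relationships, assuming hypothetical values-file field names (the chart's actual keys may differ):

```yaml
# Illustrative only: field names below are hypothetical, not the chart's current API.
# With tp = 2, dpl = 2, w = 2:
#   dp         = dpl * w  = 4   # total engine replicas
#   GPUs/pod   = dpl * tp = 4   # what each pod's nvidia.com/gpu limit should be
#   total GPUs = dp * tp  = 8
parallelism:
  tensor: 2      # tp
  dataLocal: 2   # dpl, engines per pod
  workers: 2     # w, number of pods
```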
These changes have been incorporated into #159. Closing.
Fixes #140