
Conversation

@yangligt2

Fixes #140

@yangligt2 (Author)

Also added a temporary, basic test suite to validate the chart's rendering logic. It is intended as a stopgap solution until a more formal testing framework is in place.
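
As a rough illustration, a rendering test of this kind might look like the sketch below. The chart path, values file name, and expected GPU count are placeholders, not necessarily the actual files in this PR:

#!/usr/bin/env bash
# Stopgap rendering test: render the chart with a test values file and
# assert on the generated resources. Paths and expected values are examples only.
set -euo pipefail

CHART_DIR="charts/llm-d-modelservice"            # hypothetical chart location
VALUES="tests/values-default-accelerator.yaml"   # hypothetical test values file

rendered=$(helm template test-release "$CHART_DIR" -f "$VALUES")

# With tensorParallelism=2 in the values file, the GPU limit should default to 2.
if echo "$rendered" | grep -q 'nvidia.com/gpu: "2"'; then
  echo "PASS: GPU limit defaults to tensor parallelism"
else
  echo "FAIL: expected nvidia.com/gpu: \"2\" in rendered output" >&2
  exit 1
fi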

@@ -0,0 +1,14 @@
# Test values for default accelerator resource behavior.
# The chart should automatically set the GPU count to match tensor parallelism.
Collaborator

Should the default match tensor x data?
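
To make the question concrete, a hypothetical values snippet (key names are illustrative, not necessarily the chart's actual ones):

# Hypothetical values: tensor parallelism 2, data parallelism 2.
tensorParallelism: 2
dataParallelism: 2
# If the default were tensor x data, the derived limit would be:
#   resources:
#     limits:
#       nvidia.com/gpu: "4"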

modelCommand: vllmServe
resources:
  limits:
    nvidia.com/gpu: "8" # User-defined value
Collaborator

Would this still work if I want to set the GPU count to 0? For example, for vLLM simulators that don't require GPUs but whose args would still use tensor-parallel-size=2.
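
For example, a simulator-style configuration might look like this (key names are illustrative; the point is that an explicit user-set limit, even 0, should not be overridden by the derived default):

# Hypothetical simulator values: no GPUs requested, but vLLM args still use TP=2.
modelCommand: vllmServe
args:
  - "--tensor-parallel-size=2"
resources:
  limits:
    nvidia.com/gpu: "0"   # explicit user value; should win over the chart's default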

Comment on lines +15 to +16
echo "Running Helm template rendering tests..."
echo "========================================"
Collaborator

This is really nice. I wonder if you want to include this as part of the Lint/Test Chart GitHub Actions workflow.
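
Something along these lines could be added to the existing workflow (job name, action versions, and script path are assumptions; adjust to the repo's actual layout):

# Hypothetical addition to the Lint/Test Chart workflow (.github/workflows/*.yaml)
jobs:
  rendering-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Helm
        uses: azure/setup-helm@v4
      - name: Run Helm template rendering tests
        run: ./tests/run-rendering-tests.sh   # assumed path of the script added in this PR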

@kalantar
Collaborator

Agree that there is a problem. However, I think the solution is incomplete: it does not take into account local data parallelism, nor multi-node scenarios (multinode: true).

My understanding of the concepts and relationships is:
Tensor parallelism (tp) indicates how many GPUs a model (engine) is distributed over.

Data parallelism (dp) indicates the number of replicas of the full model. Each replica corresponds to a vllm engine. These vllm engines can run in the same pod ("single node") or in different pods ("multi-node").

In all cases, the total number of GPUs needed is dp * tp.

However, in a multi-node scenario there is also the data-parallel local size (dpl, vLLM option --data-parallel-size-local), which indicates the number of vLLM instances on a single pod. The sum of dpl over the nodes equals dp. In principle, the dpl for each pod could be different; however, since modelservice implements this using LeaderWorkerSets, we will assume it is the same for every pod. In this case, w * dpl = dp, where w is the number of workers.

In the case of a single pod (w=1), dp = dpl

For a given pod, the number of GPUs required is dpl * tp.

There are 4 variables: tp, dp, dpl, w. tp is always required (default 1). Any 2 of the remaining 3 allow us to compute the third and the number of GPUs per pod.

Today modelservice allows specifying only 2 of these (tp, dp), which is sufficient for the single-node case.

I propose allowing the user to specify dpl and w as well. Only 2 are required (the default for all is 1).

It is easiest if the user specifies dpl and w; then dp = dpl * w and #gpu/pod = dpl * tp (see the worked example below).

If the user specifies other combinations, they have to be sure to get the ratios correct:

  • If the user specifies dp and w, then dp/w = dpl must be an integer.
  • If the user specifies tp, dp, and dpl, then dp/dpl = w must be an integer.
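
A worked example under these definitions (values keys are hypothetical; only the arithmetic is the point):

# Hypothetical multi-node values
parallelism:
  tensor: 2        # tp: GPUs per vLLM engine
  dataLocal: 4     # dpl: engines per pod
multinode: true
workers: 2         # w: pods, via LeaderWorkerSet

# Derived quantities:
#   dp         = dpl * w  = 4 * 2 = 8   (total engine replicas)
#   GPUs/pod   = dpl * tp = 4 * 2 = 8   -> resources.limits."nvidia.com/gpu": "8"
#   GPUs total = dp * tp  = 8 * 2 = 16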

@kalantar
Collaborator

These changes have been incorporated into #159. Closing.

@kalantar kalantar closed this Nov 12, 2025


Successfully merging this pull request may close these issues.

acceleratorResource is enforced to be equal to tensorParallelism
