The Workload Variant Autoscaler (WVA) is a Kubernetes-based global autoscaler for inference model servers serving LLMs. WVA works alongside the standard Kubernetes Horizontal Pod Autoscaler (HPA) and external autoscalers such as KEDA to scale any object that supports the scale subresource. The high-level details of the algorithm are described here. WVA determines optimal replica counts for a given request traffic load by considering constraints such as GPU count (cluster resources), energy budget, and performance budget (latency/throughput).
In WVA, a variant is a way of serving a given model: a scale target (Deployment, StatefulSet, or LWS) with a particular combination of hardware, runtime, and serving approach. Variants of the same model share the same base model (e.g. meta/llama-3.1-8b); LoRA adapters can differ per variant. Each variant is a distinct setup, e.g. different accelerators (A100, H100, L4), parallelism, or performance requirements. Create one VariantAutoscaling resource per variant; when several variants serve the same model, WVA chooses which one to scale (e.g. add capacity on the cheapest variant, remove it from the most expensive one). See Configuration and Saturation Analyzer for details.
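As an illustration, two variants of the same base model on different accelerators each get their own VariantAutoscaling resource. This is a sketch only; the names, namespace, and cost values below are hypothetical:

```yaml
# Variant 1: llama-3.1-8b served on A100 (hypothetical names and costs)
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: llama-8b-a100
  namespace: llm-inference
spec:
  scaleTargetRef:
    kind: Deployment
    name: llama-8b-a100
  modelID: "meta/llama-3.1-8b"
  variantCost: "10.0"
---
# Variant 2: the same base model on H100, here assumed more expensive
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: llama-8b-h100
  namespace: llm-inference
spec:
  scaleTargetRef:
    kind: Deployment
    name: llama-8b-h100
  modelID: "meta/llama-3.1-8b"
  variantCost: "25.0"
```

With both resources in place, WVA can, for example, add capacity on the cheaper A100 variant first and shed it from the H100 variant when load drops.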
- Intelligent Autoscaling: Optimizes replica count by observing the current state of the system
- Cost Optimization: Minimizes infrastructure costs by picking the correct accelerator variant
- Platform admin deploys llm-d infrastructure (including model servers) and waits for servers to warm up and start serving requests
- Platform admin creates a VariantAutoscaling CR for the running deployment
- WVA continuously monitors request rates and server performance via Prometheus metrics
- Capacity model obtains KV cache utilization and queue depth of inference servers with slack capacity to determine replicas
- Actuator emits optimization metrics to Prometheus and updates VariantAutoscaling status
- External autoscaler (HPA/KEDA) reads the metrics and scales the deployment accordingly
Important Notes:
- WVA handles the creation order gracefully - you can create the VA before or after the deployment
- If a deployment is deleted, the VA status is immediately updated to reflect the missing deployment
- When the deployment is recreated, the VA automatically resumes operation
- Configure the HPA stabilization window (120s+ recommended) for gradual scaling behavior
- WVA updates the VA status with current and desired allocations every reconciliation cycle
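The recommended stabilization window is set in the HPA's `behavior` section. Below is a minimal sketch of an HPA that consumes a WVA-emitted metric as an external metric; the metric name `wva_desired_replicas` and the HPA name are assumptions for illustration (use the metric WVA actually emits in your setup):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-8b-hpa          # hypothetical name
  namespace: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-8b
  minReplicas: 1
  maxReplicas: 8
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 120   # 120s+ recommended for gradual scaling
    scaleUp:
      stabilizationWindowSeconds: 120
  metrics:
    - type: External
      external:
        metric:
          name: wva_desired_replicas    # hypothetical metric name
        target:
          type: AverageValue
          averageValue: "1"
```

The stabilization window makes the HPA use the highest (for scale-down) recommendation over the window, smoothing out short-lived fluctuations in the optimizer's output.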
```yaml
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: llama-8b-autoscaler
  namespace: llm-inference
spec:
  scaleTargetRef:
    kind: Deployment
    name: llama-8b
  modelID: "meta/llama-3.1-8b"
  variantCost: "10.0" # Optional, defaults to "10.0"
```

More examples in `config/samples/`.
Important: Helm does not automatically update CRDs during helm upgrade. When upgrading WVA to a new version with CRD changes, you must manually apply the updated CRDs first:
```shell
# Apply the latest CRDs before upgrading
kubectl apply -f charts/workload-variant-autoscaler/crds/

# Then upgrade the Helm release
helm upgrade workload-variant-autoscaler ./charts/workload-variant-autoscaler \
  --namespace workload-variant-autoscaler-system \
  [your-values...]
```

- VariantAutoscaling CRD: Added `scaleTargetRef` as a required field. v0.4.1 VariantAutoscaling resources without `scaleTargetRef` must be updated before upgrading:
  - Impact on Scale-to-Zero: VAs without `scaleTargetRef` will not scale to zero properly, even with HPAScaleToZero enabled and HPA `minReplicas: 0`, because the HPA cannot reference the target deployment.
  - Migration: Update existing VAs to include `scaleTargetRef`:

    ```yaml
    spec:
      scaleTargetRef:
        kind: Deployment
        name: <your-deployment-name>
    ```

  - Validation: After the CRD update, VAs without `scaleTargetRef` will fail validation.
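One possible way to apply the migration in place is a merge patch on each existing VA. This is a sketch; the resource name, namespace, and deployment name below are placeholders to substitute with your own:

```shell
# Add scaleTargetRef to an existing VariantAutoscaling via a merge patch.
# All names here are placeholders.
kubectl patch variantautoscaling llama-8b-autoscaler -n llm-inference \
  --type merge \
  -p '{"spec":{"scaleTargetRef":{"kind":"Deployment","name":"llama-8b"}}}'
```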
To check if your cluster has the latest CRD schema:
```shell
# Check the CRD fields
kubectl get crd variantautoscalings.llmd.ai -o jsonpath='{.spec.versions[0].schema.openAPIV3Schema.properties.spec.properties}' | jq 'keys'
```

We welcome contributions! See the llm-d Contributing Guide for guidelines.
Join the llm-d autoscaling community meetings to get involved.
Apache 2.0 - see LICENSE for details.
For detailed documentation, visit the docs directory.