The SPNL CLI provides commands to easily deploy and manage vLLM inference servers on Kubernetes or Google Compute Engine.
```shell
# Bring up a vLLM server on Kubernetes (requires HuggingFace token)
spnl vllm up my-deployment --target k8s --hf-token YOUR_HF_TOKEN

# Optionally specify a different model from HuggingFace (default: ibm-granite/granite-3.3-8b-instruct)
spnl vllm up my-deployment --target k8s --model meta-llama/Llama-3.1-8B-Instruct --hf-token YOUR_HF_TOKEN

# Bring down the vLLM server
spnl vllm down my-deployment --target k8s
```

The `up` command deploys a vLLM server with a model from HuggingFace and automatically sets up port forwarding to localhost:8000. You can customize the number of GPUs with `--gpus` and the ports with `--local-port` and `--remote-port`. The `down` command tears down the deployment.
For Google Compute Engine (--target gce), you must set the following environment variables:
- `GCP_PROJECT` or `GOOGLE_CLOUD_PROJECT`: Your GCP project ID
- `GCP_SERVICE_ACCOUNT`: Service account name for the instance
- `GOOGLE_APPLICATION_CREDENTIALS` (optional): Path to your service account key file, only needed if not already logged in via `gcloud auth login` (see GCP authentication docs)
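Putting the variables together, a minimal GCE bring-up might look like the following sketch; the project ID, service account name, and key path are placeholder values you would replace with your own:

```shell
# Placeholder values -- substitute your own project, service account, and key path
export GCP_PROJECT="my-gcp-project"
export GCP_SERVICE_ACCOUNT="vllm-runner"
# Only needed if you have not already authenticated via `gcloud auth login`
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/.gcp/sa-key.json"

# Then bring the server up (and later down) on GCE:
# spnl vllm up my-deployment --target gce --hf-token YOUR_HF_TOKEN
# spnl vllm down my-deployment --target gce
```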