vllm-project · simon-mo · Oct 7, 2024 · Sep 13, 2024 · Sep 24, 2024 · simon-mo
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -79,6 +79,7 @@ Documentation
 
    serving/openai_compatible_server
    serving/deploying_with_docker
+   serving/deploying_with_k8s
    serving/distributed_serving
    serving/metrics
    serving/env_vars

diff --git a/docs/source/serving/deploying_with_k8s.rst b/docs/source/serving/deploying_with_k8s.rst
@@ -0,0 +1,173 @@
+.. _deploying_with_k8s:
+
+Deploying with Kubernetes
+==========================
+
+Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.
+
+Prerequisites
+-------------
+Before you begin, ensure that you have the following:
+
+- A running Kubernetes cluster
+- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at `https://github.com/NVIDIA/k8s-device-plugin/`
+- Available GPU resources in your cluster
+
+Deployment Steps
+----------------
+
+1.  **Create a PVC , Secret and Deployment for vLLM**
+
+Create a Kubernetes deployment file that includes a PersistentVolumeClaim (PVC) and a Secret necessary for pulling the Hugging Face token. Here is an example deployment file:
+
+.. code-block:: yaml
+
+    apiVersion: v1
+    kind: PersistentVolumeClaim
+    metadata:
+      name: mistral-7b
+      namespace: llm
+    spec:
+      accessModes:
+      - ReadWriteOnce
+      resources:
+        requests:
+          storage: 50Gi
+      storageClassName: default
+      volumeMode: Filesystem
+    ---
+    apiVersion: v1
+    kind: Secret
+    metadata:
+      name: hf-token-secret
+      namespace: llm
+    type: Opaque
+    data:
+      token: *******
+    ---
+    apiVersion: apps/v1
+    kind: Deployment
+    metadata:
+      name: mistral-7b
+      namespace: llm
+      labels:
+        app: mistral-7b
+    spec:
+      replicas: 1
+      revisionHistoryLimit: 1
-      revisionHistoryLimit: 1
-      revisionHistoryLimit: 1
+      selector:
+        matchLabels:
+          app: mistral-7b
+      template:
+        metadata:
+          labels:
+            app: mistral-7b
+        spec:
+          volumes:
+          - name: cache-volume
+            persistentVolumeClaim:
+              claimName: mistral-7b
+          - name: shm
+            emptyDir:
+              medium: Memory
+              sizeLimit: "2Gi"
+          containers:
+          - name: mistral-7b
+            image: vllm/vllm-openai:latest
+            command: ["/bin/sh", "-c"]
+            args: [
+              "vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
+            ]
+            env:
+            - name: HUGGING_FACE_HUB_TOKEN
+              valueFrom:
+                secretKeyRef:
+                  name: hf-token-secret
+                  key: token
+            - name: VLLM_NO_USAGE_STATS
+              value: "1"
+            - name: DO_NOT_TRACK
+              value: "1"
+            ports:
+            - containerPort: 8000
+            resources:
+              limits:
+                cpu: "10"
+                memory: 20G
+                nvidia.com/gpu: "1"
+              requests:
+                cpu: "2"
+                memory: 6G
+                nvidia.com/gpu: "1"
+            volumeMounts:
+            - mountPath: /root/.cache/huggingface
+              name: cache-volume
+            - name: shm
+              mountPath: /dev/shm
+            livenessProbe:
+              httpGet:
+                path: /health
+                port: 8000
+              initialDelaySeconds: 60
+              periodSeconds: 10
+            readinessProbe:
+              httpGet:
+                path: /health
+                port: 8000
+              initialDelaySeconds: 60
+              periodSeconds: 5
+    ---
+
+2. **Create a Kubernetes Service for vLLM**
+
+Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:
+
+.. code-block:: yaml
+
+    apiVersion: v1
+    kind: Service
+    metadata:
+      name: mistral-7b
+      namespace: llm
+    spec:
+      internalTrafficPolicy: Cluster
+      ipFamilies:
+      - IPv4
+      ipFamilyPolicy: SingleStack
+      ports:
+      - name: http-mistral-7b
+        port: 80
+        protocol: TCP
+        targetPort: 8000
+      selector:
+        app: mistral-7b
+      sessionAffinity: None
+      type: ClusterIP
+
+3. **Deploy and Test**
+
+Apply the deployment and service configurations using ``kubectl apply -f <filename>``:
+
+.. code-block:: console
+
+    kubectl apply -f deployment.yaml
+    kubectl apply -f service.yaml
+
+To test the deployment, run the following ``curl`` command:
+
+.. code-block:: console
+
+    curl http://mistral-7b.llm.svc.cluster.local/v1/completions \
+      -H "Content-Type: application/json" \
+      -d '{
+            "model": "facebook/opt-125m",
+            "prompt": "San Francisco is a",
+            "max_tokens": 7,
+            "temperature": 0
+          }'
+
+If the service is correctly deployed, you should receive a response from the vLLM model.
+
+Conclusion
+----------
+Deploying vLLM with Kubernetes allows for efficient scaling and management of ML models leveraging GPU resources. By following the steps outlined above, you should be able to set up and test a vLLM deployment within your Kubernetes cluster. If you encounter any issues or have suggestions, please feel free to contribute to the documentation.