7 changes: 5 additions & 2 deletions mkdocs.yml
@@ -61,9 +61,12 @@ nav:
   - Guides:
     - User Guides:
       - Getting started: guides/index.md
+      - Use Cases:
+        - Serve Multiple GenAI models: guides/serve-multiple-genai-models.md
+        - Serve Multiple LoRA adapters: guides/serve-multiple-lora-adapters.md
       - Rollout:
-      - Adapter Rollout: guides/adapter-rollout.md
-      - InferencePool Rollout: guides/inferencepool-rollout.md
+        - Adapter Rollout: guides/adapter-rollout.md
+        - InferencePool Rollout: guides/inferencepool-rollout.md
       - Metrics: guides/metrics.md
     - Implementer's Guide: guides/implementers.md
   - Performance:
70 changes: 70 additions & 0 deletions site-src/guides/serve-multiple-genai-models.md
@@ -0,0 +1,70 @@
# Serve multiple generative AI models
A company wants to deploy multiple large language models (LLMs) to serve different workloads.
For example, they might deploy a Gemma3 model for a chatbot interface and a DeepSeek model for a recommendation application.
The company needs to ensure optimal serving performance for these LLMs.
Using the Gateway API Inference Extension, you can deploy these LLMs on your cluster with your chosen accelerator configuration in an `InferencePool`.
You can then route requests based on the model name (such as "chatbot" and "recommender") and the `Criticality` property.

The following diagram illustrates how Gateway API Inference Extension routes requests to different models based on the model name.
![Serving multiple generative AI models](../images/serve-mul-gen-AI-models.png)

The following conceptual example shows how to use an `HTTPRoute` object to route requests to the appropriate `InferencePool` based on the model name, such as "chatbot" or "recommender".
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: routes-to-llms
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - matches:
        - headers:
            # Body-Based Routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md)
            # is used to copy the model name from the request body to this header.
            - type: Exact
              name: X-Gateway-Model-Name
              value: chatbot
          path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: gemma3
          kind: InferencePool
    - matches:
        - headers:
            # Body-Based Routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md)
            # is used to copy the model name from the request body to this header.
            - type: Exact
              name: X-Gateway-Model-Name
              value: recommender
          path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: deepseek-r1
          kind: InferencePool
```
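
The header matches above send each model name to its own `InferencePool` backend.
To attach a serving priority to those model names, you can also define `InferenceModel` resources that map each name to its pool and set its `Criticality`.
The following is a minimal sketch: the pool names `gemma3` and `deepseek-r1` come from the route above, while the `Criticality` values are illustrative assumptions.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chatbot
spec:
  modelName: chatbot
  criticality: Critical      # illustrative: latency-sensitive chat traffic
  poolRef:
    name: gemma3
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: recommender
spec:
  modelName: recommender
  criticality: Standard      # illustrative: less latency-sensitive recommendations
  poolRef:
    name: deepseek-r1
```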

Try it out:

1. Get the gateway IP:
```bash
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=80
```
2. Send a few requests to model "chatbot" as follows:
```bash
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
"model": "chatbot",
"prompt": "What is the color of the sky",
"max_tokens": 100,
"temperature": 0
}'
```
3. Send a few requests to model "recommender" as follows:
```bash
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
"model": "chatbot",
"prompt": "Give me restaurant recommendations in Paris",
"max_tokens": 100,
"temperature": 0
}'
```
99 changes: 99 additions & 0 deletions site-src/guides/serve-multiple-lora-adapters.md
@@ -0,0 +1,99 @@
# Serve LoRA adapters on a shared pool
A company wants to serve LLMs for document analysis to audiences in multiple languages, such as English and Spanish.
They have a fine-tuned LoRA adapter for each language, but need to use their GPU and TPU capacity efficiently.
You can use the Gateway API Inference Extension to deploy a dynamically loaded, fine-tuned LoRA adapter for each language (for example, `english-bot` and `spanish-bot`) on a common base model and accelerator.
This lets you reduce the number of required accelerators by densely packing multiple models into a shared pool.

The following diagram illustrates how Gateway API Inference Extension serves multiple LoRA adapters on a shared pool.
![Serving LoRA adapters on a shared pool](../images/serve-LoRA-adapters.png)

This example illustrates how you can densely serve multiple LoRA adapters with distinct workload performance objectives on a common `InferencePool`.
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: gemma3
spec:
  selector:
    pool: gemma3
```
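
The `selector` above matches the labels on the model server Pods that back the pool.
As a rough illustration of that relationship, the following Deployment sketch assumes vLLM as the model server and loads the language-specific LoRA adapters at startup; the image, base model, and adapter paths are illustrative placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma3-vllm
spec:
  replicas: 2
  selector:
    matchLabels:
      pool: gemma3
  template:
    metadata:
      labels:
        pool: gemma3                       # must match the InferencePool selector above
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # illustrative image tag
          args:
            - --model=google/gemma-3-1b-it          # illustrative base model
            - --enable-lora
            - --lora-modules
            - english-bot=/adapters/english-bot     # illustrative adapter paths
            - spanish-bot=/adapters/spanish-bot
          ports:
            - containerPort: 8000
```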
Suppose you have two LoRA adapters, `english-bot` and `spanish-bot`, for the Gemma3 base model.
You can create an `InferenceModel` resource for each adapter and associate it with the relevant `InferencePool` resource.
In this case, both adapters are associated with the `gemma3` `InferencePool` created above.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: english-bot
spec:
  modelName: english-bot
  criticality: Standard
  poolRef:
    name: gemma3
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: spanish-bot
spec:
  modelName: spanish-bot
  criticality: Critical
  poolRef:
    name: gemma3
```
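
To confirm that the pool and the adapter models are registered, you can list the resources; this quick check assumes the Inference Extension CRDs are installed in the cluster.

```bash
kubectl get inferencepools
kubectl get inferencemodels
```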
Now, you can route your requests from the gateway using the `HTTPRoute` object.
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: inference-gateway   # illustrative; use the GatewayClass provided by your Gateway implementation
  listeners:
    - protocol: HTTP
      port: 80
      name: http
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: routes-to-llms
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: gemma3
          kind: InferencePool
```
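
Before sending traffic, it can help to confirm that the gateway has been assigned an address and that the route was accepted; this is a quick sanity check using standard Gateway API status fields.

```bash
kubectl get gateway inference-gateway
kubectl describe httproute routes-to-llms
```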

Try it out:

1. Get the gateway IP:
```bash
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=80
```
2. Send a few requests to model "english-bot" as follows:
```bash
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
"model": "english-bot",
"prompt": "What is the color of the sky",
"max_tokens": 100,
"temperature": 0
}'
```
3. Send a few requests to model "spanish-bot" as follows:
```bash
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
"model": "spanish-bot",
"prompt": "¿De qué color es...?",
"max_tokens": 100,
"temperature": 0
}'
```
Binary file added site-src/images/serve-LoRA-adapters.png
Binary file added site-src/images/serve-mul-gen-AI-models.png