From 936f9b3fbc27deec373811de5a888e1c55ccaec1 Mon Sep 17 00:00:00 2001
From: Your Name
Date: Thu, 6 Nov 2025 15:19:23 -0800
Subject: [PATCH 1/2] Update docs on prefix cache plugin related metrics

---
 .../003-model-server-protocol/README.md      |  8 +--
 .../guides/epp-configuration/prefix-aware.md | 67 +++++++++++--------
 2 files changed, 43 insertions(+), 32 deletions(-)

diff --git a/docs/proposals/003-model-server-protocol/README.md b/docs/proposals/003-model-server-protocol/README.md
index 65e440a12..0f1ef0786 100644
--- a/docs/proposals/003-model-server-protocol/README.md
+++ b/docs/proposals/003-model-server-protocol/README.md
@@ -24,13 +24,13 @@ Note the requirements here are aligned with the
 [model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
 effort.
 
-The corresponding metrics in vLLM are also shown in the table below, as vLLM is already integrated
-into the reference endpoint picker implementation.
 
 | Metric | Type | Description | vLLM metric | Triton TensorRT-LLM| SGLang |
-| ----- | ---- | ---- | ---- | ---- | ---- |
+| ----- | ---- | ------------ | ---- | ---- | ---- |
 | TotalQueuedRequests | Gauge | The current total number of requests in the queue.| `vllm:num_requests_waiting`| `nv_trt_llm_request_metrics{request_type=waiting}`| `sglang:num_queue_reqs`
 | KVCacheUtilization| Gauge | The current KV cache utilization in percentage.| `vllm:gpu_cache_usage_perc`| `nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}`| `sglang:token_usage`
+| [Optional] BlockSize | Labeled | The block size in tokens to allocate memory. If this metrics is not available, it can be configured via the [prefix plugin config](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/#customize-the-prefix-cache-plugin).| name: `vllm:cache_config_info`, label name: `block_size`| |
+| [Optional] KVCacheUtilization| Labeled | The total number of blocks in the HBM KV cache. If this metrics is not available, it can be configured via the [prefix plugin config](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/#customize-the-prefix-cache-plugin).| name: `vllm:cache_config_info`, label name: `num_gpu_blocks`| |
 
 ### LoRA Adapter Serving
 
@@ -60,4 +60,4 @@ The model server MUST expose the following LoRA adapter metrics via the same Pro
 Starting from [v0.4.0](https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/tag/v0.4.0),
 the EPP supports [prefix cache optimized request scheduling](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/).
 To benefit from the optimal prefix aware request scheduling, model servers SHOULD support prefix
-cache reuse, such as the [vllm automatic prefix caching](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html) feature.
\ No newline at end of file
+cache reuse, such as the [vllm automatic prefix caching](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html) feature.
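+
+For example, with vLLM this can be done by turning on automatic prefix caching when the engine is
+created. The snippet below is an illustrative sketch only (the model name is a placeholder, and
+newer vLLM releases may enable prefix caching by default):
+
+```python
+from vllm import LLM
+
+# enable_prefix_caching is a vLLM engine argument; the equivalent flag on the
+# OpenAI-compatible server is --enable-prefix-caching.
+llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_prefix_caching=True)
+outputs = llm.generate(["The Gateway API Inference Extension is"])
+```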
diff --git a/site-src/guides/epp-configuration/prefix-aware.md b/site-src/guides/epp-configuration/prefix-aware.md
index 88573c466..07913f2ec 100644
--- a/site-src/guides/epp-configuration/prefix-aware.md
+++ b/site-src/guides/epp-configuration/prefix-aware.md
@@ -15,43 +15,54 @@ Like any other plugins, the prefix cache aware plugin can be enabled/disabled vi
 
 The prefix cache plugin exposes the following advanced configuration parameters:
 
 * `blockSize`: The plugin matches prefixes in the unit of blocks. This is the size
-of each block in number of bytes. vLLM default block size is 16 tokens. Assume 4 characters per token, the default
-is set to 64 in EPP. The default is recommended unless performance is critical for use cases with
-extremely long inputs.
+of each block in number of bytes. At runtime, EPP can dynamically fetch this information from the
+inference engine metrics, so this config is only used when such a metric is not available. In
+vLLM, the metric name is `vllm:cache_config_info` and the metric label is `block_size`. See the
+[model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol)
+for more details.
+
+ The vLLM default block size is 16 tokens. Assuming 4 characters per token, the default
+ is set to 64 in EPP. The default is recommended unless performance is critical for use cases with
+ extremely long inputs.
 
 * `maxPrefixBlocksToMatch`: The maximum number of blocks to find prefix match. The default is 256 (or 256*64=16384
 characters, or roughly 4096 tokens). This is useful to tradeoff prefix match accuracy for performance.
 
-* `lruCapacityPerServer`: Maximum capacity the prefix LRU cache in number of block hashes per server (pod). Below
-shows a detailed analysis on how to estimate this.
+* `lruCapacityPerServer`: Maximum capacity of the prefix LRU cache in number of block hashes per server (pod).
+Similar to `blockSize`, EPP can dynamically fetch this from the inference engine metrics endpoints.
+In vLLM, the metric name is `vllm:cache_config_info` and the metric label is `num_gpu_blocks`. See the
+[model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol)
+for more details.
+
+ If such a metric is not available, you can follow the guide below to estimate this.
 
- The prefix cache plugin estimates the prefix cache indexes in model server HBMs. In the perfect
- scenario, EPP has the exact same prefix cache entries per model server as their HBM cache entries. If
- the EPP cache is smaller than HBM cache, a positive EPP cache match is more accurate, but there are more
- false cache misses. If the EPP cache is larger than the HBM cache, then there are more false cache hits.
- Therefore **the EPP prefix cache indexer size should be as close as possible to the HBM cache size.**
+ The prefix cache plugin estimates the prefix cache indexes in model server HBMs. In the perfect
+ scenario, EPP has the exact same prefix cache entries per model server as their HBM cache entries. If
+ the EPP cache is smaller than the HBM cache, a positive EPP cache match is more accurate, but there are more
+ false cache misses. If the EPP cache is larger than the HBM cache, then there are more false cache hits.
+ Therefore **the EPP prefix cache indexer size should be as close as possible to the HBM cache size.**
 
- NOTE: EPP builds prefix cache based on characters, while model server maintains prefix cache entries
- in tokens, a conversion between character <-> token is needed.
+ NOTE: EPP builds its prefix cache based on characters, while the model server maintains prefix cache
+ entries in tokens, so a conversion between characters and tokens is needed.
 
- Below are the formulas to estimate the EPP prefix indexer size:
+ Below are the formulas to estimate the EPP prefix indexer size:
 
- ```
- max_kv_tokens_per_server = (HBM_size - model_size)/ kv_size_per_token
- lru_indexer_capacity_per_server = (max_kv_tokens_per_server * avg_chars_per_token)/prefix_indexer_hash_block_size
- ```
+ ```
+ max_kv_tokens_per_server = (HBM_size - model_size) / kv_size_per_token
+ lru_indexer_capacity_per_server = (max_kv_tokens_per_server * avg_chars_per_token) / prefix_indexer_hash_block_size
+ ```
 
- Let's take an example:
+ Let's take an example:
 
- * Model: llama3 8B
- * Accelerator: Nvidia H100 80GB
- * Num replicas: 3
- * Estimated # characters per token: 4 ([source](https://genai.stackexchange.com/questions/34/how-long-is-a-token))
+ * Model: llama3 8B
+ * Accelerator: Nvidia H100 80GB
+ * Num replicas: 3
+ * Estimated # characters per token: 4 ([source](https://genai.stackexchange.com/questions/34/how-long-is-a-token))
 
- ```
- max_kv_tokens_per_server = (80GB - 16GB) / 128KB = 500,000
- # assume avg_chars_per_token = 4, prefix_indexer_hash_block_size = 64 (default)
- # each entry is about 358KB, so the memory footrpint is abut 11 MB per server
- lru_indexer_capacity_per_server = 500,000*4/64 = 31250
- ```
\ No newline at end of file
+ ```
+ max_kv_tokens_per_server = (80GB - 16GB) / 128KB = 500,000
+ # assume avg_chars_per_token = 4, prefix_indexer_hash_block_size = 64 (default)
+ # each entry is about 358 bytes, so the memory footprint is about 11 MB per server
+ lru_indexer_capacity_per_server = 500,000*4/64 = 31250
+ ```
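+
+ For convenience, the same estimate can be scripted. The snippet below is purely illustrative:
+ it only reproduces the arithmetic above, using decimal GB/KB units and the example numbers
+ (16GB of model weights, 128KB of KV cache per token) rather than measured values:
+
+ ```python
+ def lru_capacity_per_server(hbm_bytes, model_bytes, kv_bytes_per_token,
+                             avg_chars_per_token=4, block_size=64):
+     """Estimate the EPP prefix LRU capacity (in block hashes) for one server."""
+     max_kv_tokens = (hbm_bytes - model_bytes) // kv_bytes_per_token
+     return (max_kv_tokens * avg_chars_per_token) // block_size
+
+ GB, KB = 1000**3, 1000
+ # H100 80GB, ~16GB llama3 8B weights, ~128KB KV cache per token -> 31250 block hashes
+ print(lru_capacity_per_server(80 * GB, 16 * GB, 128 * KB))
+ ```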
From 0ceb3a9c9146da8fe7f8db9c5d2025c51f537dbe Mon Sep 17 00:00:00 2001
From: Your Name
Date: Fri, 7 Nov 2025 12:36:24 -0800
Subject: [PATCH 2/2] Address comment

---
 docs/proposals/003-model-server-protocol/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/proposals/003-model-server-protocol/README.md b/docs/proposals/003-model-server-protocol/README.md
index 0f1ef0786..5cdd43d39 100644
--- a/docs/proposals/003-model-server-protocol/README.md
+++ b/docs/proposals/003-model-server-protocol/README.md
@@ -29,8 +29,8 @@ effort.
 | ----- | ---- | ------------ | ---- | ---- | ---- |
 | TotalQueuedRequests | Gauge | The current total number of requests in the queue.| `vllm:num_requests_waiting`| `nv_trt_llm_request_metrics{request_type=waiting}`| `sglang:num_queue_reqs`
 | KVCacheUtilization| Gauge | The current KV cache utilization in percentage.| `vllm:gpu_cache_usage_perc`| `nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}`| `sglang:token_usage`
-| [Optional] BlockSize | Labeled | The block size in tokens to allocate memory. If this metrics is not available, it can be configured via the [prefix plugin config](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/#customize-the-prefix-cache-plugin).| name: `vllm:cache_config_info`, label name: `block_size`| |
-| [Optional] KVCacheUtilization| Labeled | The total number of blocks in the HBM KV cache. If this metrics is not available, it can be configured via the [prefix plugin config](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/#customize-the-prefix-cache-plugin).| name: `vllm:cache_config_info`, label name: `num_gpu_blocks`| |
+| [Optional] BlockSize | Labeled | The size of a KV cache block, in tokens; used by the prefix cache scorer. If this metric is not available, the BlockSize will be derived from the [prefix plugin config](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/#customize-the-prefix-cache-plugin).| name: `vllm:cache_config_info`, label name: `block_size`| |
+| [Optional] NumGPUBlocks| Labeled | The total number of blocks in the HBM KV cache; used by the prefix cache scorer. If this metric is not available, the NumGPUBlocks will be derived from the [prefix plugin config](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/#customize-the-prefix-cache-plugin).| name: `vllm:cache_config_info`, label name: `num_gpu_blocks`| |
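+
+As an illustration of the `Labeled` type above, EPP reads these values from metric labels rather
+than from the sample value. A vLLM scrape might contain a line like the following (label values
+are examples only and depend on the model and accelerator; additional labels are omitted):
+
+```
+# other vllm:cache_config_info labels omitted for brevity
+vllm:cache_config_info{block_size="16",num_gpu_blocks="27183"} 1.0
+```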
 
 ### LoRA Adapter Serving