From 936f9b3fbc27deec373811de5a888e1c55ccaec1 Mon Sep 17 00:00:00 2001
From: Your Name
Date: Thu, 6 Nov 2025 15:19:23 -0800
Subject: [PATCH 1/2] Update docs on prefix cache plugin related metrics

---
 .../003-model-server-protocol/README.md      |  8 +--
 .../guides/epp-configuration/prefix-aware.md | 67 +++++++++++--------
 2 files changed, 43 insertions(+), 32 deletions(-)

diff --git a/docs/proposals/003-model-server-protocol/README.md b/docs/proposals/003-model-server-protocol/README.md
index 65e440a12..0f1ef0786 100644
--- a/docs/proposals/003-model-server-protocol/README.md
+++ b/docs/proposals/003-model-server-protocol/README.md
@@ -24,13 +24,13 @@ Note the requirements here are aligned with the
 [model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
 effort.
 
-The corresponding metrics in vLLM are also shown in the table below, as vLLM is already integrated
-into the reference endpoint picker implementation.
 
 | Metric | Type | Description | vLLM metric | Triton TensorRT-LLM| SGLang |
-| ----- | ---- | ---- | ---- | ---- | ---- |
+| ----- | ---- | ------------ | ---- | ---- | ---- |
 | TotalQueuedRequests | Gauge | The current total number of requests in the queue.| `vllm:num_requests_waiting`| `nv_trt_llm_request_metrics{request_type=waiting}`| `sglang:num_queue_reqs`
 | KVCacheUtilization| Gauge | The current KV cache utilization in percentage.| `vllm:gpu_cache_usage_perc`| `nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}`| `sglang:token_usage`
+| [Optional] BlockSize | Labeled | The block size in tokens to allocate memory. If this metrics is not available, it can be configured via the [prefix plugin config](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/#customize-the-prefix-cache-plugin).| name: `vllm:cache_config_info`, label name: `block_size`| |
+| [Optional] KVCacheUtilization| Labeled | The total number of blocks in the HBM KV cache. If this metrics is not available, it can be configured via the [prefix plugin config](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/#customize-the-prefix-cache-plugin).| name: `vllm:cache_config_info`, label name: `num_gpu_blocks`| |
 
 ### LoRA Adapter Serving
 
@@ -60,4 +60,4 @@ The model server MUST expose the following LoRA adapter metrics via the same Pro
 Starting from [v0.4.0](https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/tag/v0.4.0),
 the EPP supports [prefix cache optimized request scheduling](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/).
 To benefit from the optimal prefix aware request scheduling, model servers SHOULD support prefix
-cache reuse, such as the [vllm automatic prefix caching](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html) feature.
\ No newline at end of file
+cache reuse, such as the [vllm automatic prefix caching](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html) feature.
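+
+For example, with vLLM this can be done by turning on automatic prefix caching when the engine is
+created. The snippet below is an illustrative sketch only (the model name is a placeholder, and
+newer vLLM releases may enable prefix caching by default):
+
+```python
+from vllm import LLM
+
+# enable_prefix_caching is a vLLM engine argument; the equivalent flag on the
+# OpenAI-compatible server is --enable-prefix-caching.
+llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_prefix_caching=True)
+outputs = llm.generate(["The Gateway API Inference Extension is"])
+```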
diff --git a/site-src/guides/epp-configuration/prefix-aware.md b/site-src/guides/epp-configuration/prefix-aware.md
index 88573c466..07913f2ec 100644
--- a/site-src/guides/epp-configuration/prefix-aware.md
+++ b/site-src/guides/epp-configuration/prefix-aware.md
@@ -15,43 +15,54 @@ Like any other plugins, the prefix cache aware plugin can be enabled/disabled vi
 
 The prefix cache plugin exposes the following advanced configuration parameters:
 
 * `blockSize`: The plugin matches prefixes in the unit of blocks. This is the size
-of each block in number of bytes. vLLM default block size is 16 tokens. Assume 4 characters per token, the default
-is set to 64 in EPP. The default is recommended unless performance is critical for use cases with
-extremely long inputs.
+of each block in number of bytes. At runtime, EPP can dynamically fetch this information from the
+inference engine metrics, so this config is only used when such a metric is not available. In
+vLLM, the metric name is `vllm:cache_config_info` and the metric label is `block_size`. See the
+[model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol)
+for more details.
+
+ The vLLM default block size is 16 tokens. Assuming 4 characters per token, the default
+ is set to 64 in EPP. The default is recommended unless performance is critical for use cases with
+ extremely long inputs.
 
 * `maxPrefixBlocksToMatch`: The maximum number of blocks to find prefix match. The default is 256 (or 256*64=16384
 characters, or roughly 4096 tokens). This is useful to tradeoff prefix match accuracy for performance.
 
-* `lruCapacityPerServer`: Maximum capacity the prefix LRU cache in number of block hashes per server (pod). Below
-shows a detailed analysis on how to estimate this.
+* `lruCapacityPerServer`: Maximum capacity of the prefix LRU cache in number of block hashes per server (pod).
+Similar to `blockSize`, EPP can dynamically fetch this from the inference engine metrics endpoints.
+In vLLM, the metric name is `vllm:cache_config_info` and the metric label is `num_gpu_blocks`. See the
+[model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol)
+for more details.
+
+ If such a metric is not available, you can follow the guide below to estimate this.
 
- The prefix cache plugin estimates the prefix cache indexes in model server HBMs. In the perfect
- scenario, EPP has the exact same prefix cache entries per model server as their HBM cache entries. If
- the EPP cache is smaller than HBM cache, a positive EPP cache match is more accurate, but there are more
- false cache misses. If the EPP cache is larger than the HBM cache, then there are more false cache hits.
- Therefore **the EPP prefix cache indexer size should be as close as possible to the HBM cache size.**
+ The prefix cache plugin estimates the prefix cache indexes in model server HBMs. In the perfect
+ scenario, EPP has the exact same prefix cache entries per model server as their HBM cache entries. If
+ the EPP cache is smaller than the HBM cache, a positive EPP cache match is more accurate, but there are more
+ false cache misses. If the EPP cache is larger than the HBM cache, then there are more false cache hits.
+ Therefore **the EPP prefix cache indexer size should be as close as possible to the HBM cache size.**
 
- NOTE: EPP builds prefix cache based on characters, while model server maintains prefix cache entries
- in tokens, a conversion between character <-> token is needed.
+ NOTE: EPP builds its prefix cache based on characters, while the model server maintains prefix cache
+ entries in tokens, so a conversion between characters and tokens is needed.
 
- Below are the formulas to estimate the EPP prefix indexer size:
+ Below are the formulas to estimate the EPP prefix indexer size:
 
- ```
- max_kv_tokens_per_server = (HBM_size - model_size)/ kv_size_per_token
- lru_indexer_capacity_per_server = (max_kv_tokens_per_server * avg_chars_per_token)/prefix_indexer_hash_block_size
- ```
+ ```
+ max_kv_tokens_per_server = (HBM_size - model_size) / kv_size_per_token
+ lru_indexer_capacity_per_server = (max_kv_tokens_per_server * avg_chars_per_token) / prefix_indexer_hash_block_size
+ ```
 
- Let's take an example:
+ Let's take an example:
 
- * Model: llama3 8B
- * Accelerator: Nvidia H100 80GB
- * Num replicas: 3
- * Estimated # characters per token: 4 ([source](https://genai.stackexchange.com/questions/34/how-long-is-a-token))
+ * Model: llama3 8B
+ * Accelerator: Nvidia H100 80GB
+ * Num replicas: 3
+ * Estimated # characters per token: 4 ([source](https://genai.stackexchange.com/questions/34/how-long-is-a-token))
 
- ```
- max_kv_tokens_per_server = (80GB - 16GB) / 128KB = 500,000
- # assume avg_chars_per_token = 4, prefix_indexer_hash_block_size = 64 (default)
- # each entry is about 358KB, so the memory footrpint is abut 11 MB per server
- lru_indexer_capacity_per_server = 500,000*4/64 = 31250
- ```
\ No newline at end of file
+ ```
+ max_kv_tokens_per_server = (80GB - 16GB) / 128KB = 500,000
+ # assume avg_chars_per_token = 4, prefix_indexer_hash_block_size = 64 (default)
+ # each entry is about 358 bytes, so the memory footprint is about 11 MB per server
+ lru_indexer_capacity_per_server = 500,000*4/64 = 31250
+ ```
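+
+ For convenience, the same estimate can be scripted. The snippet below is purely illustrative:
+ it only reproduces the arithmetic above, using decimal GB/KB units and the example numbers
+ (16GB of model weights, 128KB of KV cache per token) rather than measured values:
+
+ ```python
+ def lru_capacity_per_server(hbm_bytes, model_bytes, kv_bytes_per_token,
+                             avg_chars_per_token=4, block_size=64):
+     """Estimate the EPP prefix LRU capacity (in block hashes) for one server."""
+     max_kv_tokens = (hbm_bytes - model_bytes) // kv_bytes_per_token
+     return (max_kv_tokens * avg_chars_per_token) // block_size
+
+ GB, KB = 1000**3, 1000
+ # H100 80GB, ~16GB llama3 8B weights, ~128KB KV cache per token -> 31250 block hashes
+ print(lru_capacity_per_server(80 * GB, 16 * GB, 128 * KB))
+ ```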
From 0ceb3a9c9146da8fe7f8db9c5d2025c51f537dbe Mon Sep 17 00:00:00 2001
From: Your Name
Date: Fri, 7 Nov 2025 12:36:24 -0800
Subject: [PATCH 2/2] Address comment

---
 docs/proposals/003-model-server-protocol/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/proposals/003-model-server-protocol/README.md b/docs/proposals/003-model-server-protocol/README.md
index 0f1ef0786..5cdd43d39 100644
--- a/docs/proposals/003-model-server-protocol/README.md
+++ b/docs/proposals/003-model-server-protocol/README.md
@@ -29,8 +29,8 @@ effort.
 | ----- | ---- | ------------ | ---- | ---- | ---- |
 | TotalQueuedRequests | Gauge | The current total number of requests in the queue.| `vllm:num_requests_waiting`| `nv_trt_llm_request_metrics{request_type=waiting}`| `sglang:num_queue_reqs`
 | KVCacheUtilization| Gauge | The current KV cache utilization in percentage.| `vllm:gpu_cache_usage_perc`| `nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}`| `sglang:token_usage`
-| [Optional] BlockSize | Labeled | The block size in tokens to allocate memory. If this metrics is not available, it can be configured via the [prefix plugin config](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/#customize-the-prefix-cache-plugin).| name: `vllm:cache_config_info`, label name: `block_size`| |
-| [Optional] KVCacheUtilization| Labeled | The total number of blocks in the HBM KV cache. If this metrics is not available, it can be configured via the [prefix plugin config](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/#customize-the-prefix-cache-plugin).| name: `vllm:cache_config_info`, label name: `num_gpu_blocks`| |
+| [Optional] BlockSize | Labeled | The size of a KV cache block, in tokens; used by the prefix cache scorer. If this metric is not available, the BlockSize will be derived from the [prefix plugin config](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/#customize-the-prefix-cache-plugin).| name: `vllm:cache_config_info`, label name: `block_size`| |
+| [Optional] NumGPUBlocks| Labeled | The total number of blocks in the HBM KV cache; used by the prefix cache scorer. If this metric is not available, the NumGPUBlocks will be derived from the [prefix plugin config](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/#customize-the-prefix-cache-plugin).| name: `vllm:cache_config_info`, label name: `num_gpu_blocks`| |
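+
+As an illustration of the `Labeled` type above, EPP reads these values from metric labels rather
+than from the sample value. A vLLM scrape might contain a line like the following (label values
+are examples only and depend on the model and accelerator; additional labels are omitted):
+
+```
+# other vllm:cache_config_info labels omitted for brevity
+vllm:cache_config_info{block_size="16",num_gpu_blocks="27183"} 1.0
+```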
 
 ### LoRA Adapter Serving