FP8 KV-cache quantization for PagedAttention (#1400)
* Add most of paged attn kv quant
* It builds a bit
* All the functionality at least
* Small fix
* Add a scale
* Fix bf16 usage
* Make k_v_scale optional
* Collector
* Tweak collection
* Refactor
* Add to apis
* Add cuda impl
* Fix compilation
* Fixes
* Handle ENABLE_FP8
* Format
* Tweak
* Fix scaled_convert usage
* Fix cache_t size
* Fixed scale collection
* Actual fix
* Fix fp8 for CC<8
* Fix the usual String != &str bit (#1483)
Co-authored-by: RageLtMan <rageltman [at] sempervictus>
* chore: `Dockerfile` - Drop runtime rayon thread ENV (#1465)
* chore: Dockerfile - Remove rayon threads env
* chore: Dockerfile - Improve formatting for `apt-get`
* Remove duplicate calls for api_dir_list (#1474)
* Remove duplicate calls for api_dir_list
* Support local cache for api_dir_list
* Fix home folder for metal
* Capitalized
* Fix transient pyo3 dep (#1478)
Co-authored-by: Eric Buehler <[email protected]>
* Fix objc dep with non macos (#1480)
* Fix phi 3/4 + nccl issue (#1481)
* Fix log
* Fix n kv heads
* Fix phi3.5 moe (#1482)
* Fix phi3.5 moe accum device
* Fix again
* Fix again
* Support GLM4 model! (#1437)
* Support GLM4 model
* Mention GLM4 model in ReadMe
* glm4 type hint
* Typo fix
* Fix unsupported chat_template function
* Clippy fix
* Refactor distributed backend (#1484)
* Refactor distributed backend, check power of 2
* Fix compilation
* Cap metal paged attn kv allocation (#1485)
* Better paged attn metal cap (#1486)
* Better paged attn metal cap
* Small fix
* Comment
* Small fix
* Refactor
* Server core: consolidate and unify route handlers and API surface (#1423)
* Start working on consolidating completion and chat_completion underlying implementations
* Move response channel to util mod for now (since it's used with streaming and non streaming)
* More work on consolidating completions and chat completions
* More WIP consolidation of server core handlers
* More WIP consolidation of server core handlers
* More WIP consolidation of server core handlers
* Update docs and restrict completion core visibility
* CodeRabbit feedback: remove logprobs warn from route handler since parse request also checks this
* Use consistent var name for completions mod
* Make route handler modules public API consistent (same fn names, etc.) and provide proxy fn that wrap core fns so core mod doesn't have to be pub
Make lib.rs example compile checked and update example
* Code formatting
* Typo
* Sync fork
* Sync fork
* Docs example fix
* Support qwen3 gguf (#1488)
* Add qwen3 gguf
* Template fixup
* Make bos/eos token IDs optional (#1493)
* Remove python deps from CUDA dockerfiles (#1487)
* Handle USE_FP8 for cuda
* Fix cuda warn
* Add readme
* Saturating sub in sequence state
---------
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: RageLtMan <[email protected]>
Co-authored-by: Brennan Kinney <[email protected]>
Co-authored-by: Guoqing Bao <[email protected]>
Co-authored-by: Matthew Haynes <[email protected]>
Our PagedAttention implementation has two inputs: the GPU KV cache memory size and the block size. This gives you fine-grained control over the available context length by configuring the memory available for the KV cache. When using a CUDA device, PagedAttention is activated by default, but it can be disabled with `no_paged_attn` for Python or `no-paged-attn` for the CLI tools.
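To make the relationship concrete, here is a small illustrative sketch (not part of mistral.rs; the helper name, model shape, and constants are assumptions) of how the reserved KV cache memory and the block size bound the number of cacheable tokens, and therefore the usable context length:

```rust
/// Rough estimate of how many tokens fit in a PagedAttention KV cache.
/// Illustrative only; the real allocator accounts for more details.
fn kv_cache_token_capacity(
    cache_mem_bytes: usize, // memory reserved for the KV cache
    block_size: usize,      // tokens per block (e.g. 32)
    num_layers: usize,
    num_kv_heads: usize,
    head_dim: usize,
    bytes_per_elem: usize, // 2 for f16/bf16, 1 for f8e4m3
) -> usize {
    // Each cached token stores one key and one value vector per layer.
    let bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem;
    let bytes_per_block = block_size * bytes_per_token;
    // The cache is allocated in whole blocks.
    (cache_mem_bytes / bytes_per_block) * block_size
}

fn main() {
    // Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, f16 cache.
    let tokens = kv_cache_token_capacity(8 * 1024 * 1024 * 1024, 32, 32, 8, 128, 2);
    println!("~{tokens} tokens fit in an 8 GiB f16 KV cache");
}
```

Halving `bytes_per_elem` (as FP8 quantization does) roughly doubles the token capacity for the same memory budget.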
## KV Cache Quantization
PagedAttention now supports KV cache quantization to reduce memory usage and potentially improve performance. The KV cache can be quantized to FP8 (F8E4M3 format) instead of using the model's native dtype, significantly reducing memory requirements while maintaining model quality.
**Available cache types:**
- `auto` (default): Uses the model's native dtype for the KV cache
- `f8e4m3`: Quantizes the KV cache to 8-bit floating point (E4M3 format)
When using FP8 quantization, the memory usage for KV cache is approximately halved compared to FP16, allowing for longer context lengths with the same GPU memory allocation.
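The sketch below illustrates the general idea behind scaled FP8 (E4M3) storage, echoing the scale collection and scaled conversion mentioned in the commit history above. It is a conceptual CPU sketch only; the helper names are made up, and the real mistral.rs kernels run on the GPU and store actual 8-bit codes.

```rust
// Values are divided by a collected per-tensor scale and clamped to the E4M3
// representable range before being stored in 8 bits; dequantization
// multiplies the scale back in.
const F8E4M3_MAX: f32 = 448.0; // largest finite E4M3 value

fn collect_scale(values: &[f32]) -> f32 {
    // Pick a scale so the largest magnitude maps onto the E4M3 range.
    let absmax = values.iter().fold(0.0_f32, |m, v| m.max(v.abs()));
    (absmax / F8E4M3_MAX).max(f32::MIN_POSITIVE)
}

fn to_e4m3_range(values: &[f32], scale: f32) -> Vec<f32> {
    // A real kernel would round these to 8-bit E4M3 codes; this only shows
    // the scaling and clamping that makes the values representable.
    values
        .iter()
        .map(|v| (v / scale).clamp(-F8E4M3_MAX, F8E4M3_MAX))
        .collect()
}

fn main() {
    let keys = [0.02_f32, -1.5, 3.7, 900.0];
    let scale = collect_scale(&keys);
    let quantized = to_e4m3_range(&keys, scale);
    println!("scale = {scale}, quantized (pre-rounding) = {quantized:?}");
}
```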
> Note: If not specified, the default block size is 32.
> Note: If an OOM error occurs (this can be caused by a variety of factors, including adapter activation, re-ISQ, and others), it is likely because the PagedAttention KV cache has already been allocated. To counter this, either set the KV cache memory to a lower amount or usage percentage (recommended), or disable PagedAttention entirely for a dynamically allocated cache.
Add the `--pa-gpu-mem`/`--pa-gpu-mem-usage` and `--pa-blk-size` parameters before the model kind selector. The GPU memory is specified in MB, and the block size is the number of tokens per block. These parameters may be passed for any supported model type.
To enable KV cache quantization, use the `--pa-cache-type` parameter with either `auto` (default) or `f8e4m3`.
```
cargo run --release --features cuda -- -i --pa-gpu-mem 8192 --pa-blk-size 32 --isq Q4K plain -m microsoft/Phi-3-mini-128k-instruct
```
```
cargo run --release --features cuda -- -i --pa-gpu-mem-usage .95 --pa-blk-size 32 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
```
Example with FP8 KV cache quantization:
```
cargo run --release --features metal -- -i --pa-gpu-mem 4096 --pa-blk-size 32 --pa-cache-type f8e4m3 plain -m microsoft/Phi-3-mini-128k-instruct
```
## Using the Rust API
You can find this example [here](../mistralrs/examples/paged_attn/main.rs).
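For reference, here is a condensed sketch of what such a setup can look like. It is based on my reading of the `mistralrs` builder API (`TextModelBuilder`, `PagedAttentionMetaBuilder`); the model ID and exact option values are placeholders, and the linked example remains the authoritative version.

```rust
use anyhow::Result;
use mistralrs::{
    IsqType, MemoryGpuConfig, PagedAttentionMetaBuilder, TextMessageRole, TextMessages,
    TextModelBuilder,
};

#[tokio::main]
async fn main() -> Result<()> {
    // Build a model with PagedAttention enabled: 32-token blocks and a cache
    // sized for the given context length (method names per my understanding).
    let model = TextModelBuilder::new("microsoft/Phi-3-mini-128k-instruct")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .with_paged_attn(|| {
            PagedAttentionMetaBuilder::default()
                .with_block_size(32)
                .with_gpu_memory(MemoryGpuConfig::ContextSize(1024))
                .build()
        })?
        .build()
        .await?;

    // Send a simple chat request through the paged-attention-backed model.
    let messages =
        TextMessages::new().add_message(TextMessageRole::User, "Hello! How are you?");
    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    Ok(())
}
```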