Describe the bug
Edit: this only seems to be the case when using Llama 3.x, and only under certain configs. Further testing notes below.
When using --paged-attn, the sampler appears to hit an infinite loop, at least in my quick testing. This is a follow-up of #1376 (comment)
curl command
curl -X 'POST' \
'http://localhost:1234/v1/chat/completions' \
-H 'accept: */*' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta-llama/Llama-3.2-1B-Instruct",
"messages": [{
"role": "user",
"content": "hi!"
}]
}'
No Paged Attention
$ cargo run --features metal '--' --port 1234 plain -m meta-llama/Llama-3.2-1B-Instruct
When running without paged attention, subsequent requests work as expected.
Paged Attention
$ cargo run --features metal '--' --port 1234 --paged-attn plain -m meta-llama/Llama-3.2-1B-Instruct
When running with paged attention, the first request works, but the second request hangs. Canceling the request and then sending a third request causes a panic.
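For convenience, here's a minimal two-request repro loop as a sketch; it assumes the `reqwest` crate (with the `blocking` and `json` features) plus `serde_json`, and the server started with the command above:

```rust
// Hypothetical repro: send the same chat completion twice in a row.
// With --paged-attn, the second request never returns and hits the timeout.
use std::time::Duration;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::builder()
        .timeout(Duration::from_secs(30))
        .build()?;
    let body = serde_json::json!({
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "messages": [{ "role": "user", "content": "hi!" }]
    });
    for i in 1..=2 {
        // Request 1 succeeds; request 2 hangs until the timeout fires.
        let resp = client
            .post("http://localhost:1234/v1/chat/completions")
            .json(&body)
            .send()?;
        println!("request {i}: status {}", resp.status());
    }
    Ok(())
}
```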
mistral-server-sampler-no-pa.mp4 (video attachment)
mistral-server-sampler-pa.mp4 (video attachment)
Print debugging
I added two debugging statements (see the diff below) to show that the sampler picks the `!` token forever.
Also, I'm not sure if this is a red herring, but at the start of the second request the sequence shows a different set of tokens than without paged attention; the shorter context looks like just the tail of the prompt. After the first step the sequence matches, but it continues on forever.
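To see what those two contexts actually decode to, something like this sketch works (assuming the `tokenizers` crate and a local copy of the Llama 3.2 `tokenizer.json`; the first token array is abbreviated from the dumps below):

```rust
// Decode the two ctx_clone dumps to see what the sampler is conditioning on.
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tok = Tokenizer::from_file("tokenizer.json")?;

    // Second request without paged attention: the full chat-templated prompt.
    let no_pa: &[u32] = &[128000, 128006, 9125, 128007, 271, /* ... */ 78191, 128007, 271];
    // Second request with paged attention: only the tail of the prompt.
    let pa: &[u32] = &[271, 6151, 0, 128009, 128006, 78191, 128007, 271];

    // Keep special tokens visible so the chat-template markers show up.
    println!("no-pa ctx: {}", tok.decode(no_pa, false)?);
    println!("pa ctx:    {}", tok.decode(pa, false)?);
    Ok(())
}
```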
no paged attn second request sequence ctx
2025-05-28T21:52:46.463856Z INFO mistralrs_core::engine::logger: Throughput (T/s) 0.80, Prefix cache hitrate 0.00%, 0 running, 0 waiting
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 9906,
logprob: 1.0,
bytes: Some(
"Hello",
),
top_logprobs: None,
}
paged attn second request sequence ctx
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 9906,
logprob: 1.0,
bytes: Some(
"Hello",
),
top_logprobs: None,
}
diff
diff --git a/mistralrs-core/src/pipeline/sampling.rs b/mistralrs-core/src/pipeline/sampling.rs
index 05bf6883..b610cceb 100644
--- a/mistralrs-core/src/pipeline/sampling.rs
+++ b/mistralrs-core/src/pipeline/sampling.rs
@@ -323,6 +323,7 @@ pub async fn sample_and_add_toks(
for (sampled, seq) in std::iter::zip(sampled_vec, seqs.iter_mut()) {
let next_token = crate::handle_seq_error_stateaware_ok!(sampled, seq);
+ dbg!(&next_token);
let metadata = this.get_metadata();
let eos_tok = if disable_eos_stop {
@@ -352,6 +353,7 @@ pub async fn sample_sequence(
let sampler = seq.sampler();
let ctx_clone = seq.get_toks().to_vec();
+ dbg!(&ctx_clone);
let rng_clone = rng.clone();
let logits_clone = logits.clone();
let first_lobprobs_response = if use_async_pool {
full command output
matt@Matts-MacBook-Pro-2024 ~ /Users/matt/Code/matthewhaynes/mistral.rs/mistralrs-server
matt@Matts-MacBook-Pro-2024 ~/Code/matthewhaynes/mistral.rs/mistralrs-server master ± clear
matt@Matts-MacBook-Pro-2024 ~/Code/matthewhaynes/mistral.rs/mistralrs-server master ± git status
On branch master
Your branch is up to date with 'origin/master'.
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: ../mistralrs-core/src/pipeline/sampling.rs
no changes added to commit (use "git add" and/or "git commit -a")
matt@Matts-MacBook-Pro-2024 ~/Code/matthewhaynes/mistral.rs/mistralrs-server master ± git --no-pager log -1 --oneline
ec9ee690 (HEAD -> master, origin/master, origin/HEAD) Improved generate_uqff_card
matt@Matts-MacBook-Pro-2024 ~/Code/matthewhaynes/mistral.rs/mistralrs-server master ± git diff
matt@Matts-MacBook-Pro-2024 ~/Code/matthewhaynes/mistral.rs/mistralrs-server master ± cargo run --features metal '--' --port 1234 plain -m meta-llama/Llama-3.2-1B-Instruct
Finished `dev` profile [optimized + debuginfo] target(s) in 0.14s
Running `/Users/matt/Code/matthewhaynes/mistral.rs/target/debug/mistralrs-server --port 1234 plain -m meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:52:40.170867Z INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2025-05-28T21:52:40.170965Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-05-28T21:52:40.171029Z INFO mistralrs_server: Model kind is: normal (no adapters)
2025-05-28T21:52:40.171234Z INFO hf_hub: Using token file found "/Users/matt/.cache/huggingface/token"
2025-05-28T21:52:40.171659Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:52:40.171734Z INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:52:40.334680Z INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model.safetensors"]
2025-05-28T21:52:40.388299Z INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:52:40.524822Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:52:40.573661Z INFO mistralrs_quant::utils::log: Automatic loader type determined to be `llama`
2025-05-28T21:52:40.573774Z INFO mistralrs_core::pipeline::normal: Prompt chunk size is 1024.
2025-05-28T21:52:40.579251Z INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-05-28T21:52:40.589449Z INFO mistralrs_core::pipeline::loaders: Using automatic device mapping parameters: text[max_seq_len: 4096, max_batch_size: 1].
2025-05-28T21:52:40.589559Z INFO mistralrs_quant::utils::log: Model has 16 repeating layers.
2025-05-28T21:52:40.589600Z INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-05-28T21:52:40.589650Z INFO mistralrs_quant::utils::log: Layers 0-15: metal[4294968389] (36 GB)
2025-05-28T21:52:40.591081Z INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-05-28T21:52:40.591095Z INFO mistralrs_core::pipeline::normal: Model config: Config { hidden_act: Silu, hidden_size: 2048, intermediate_size: 8192, vocab_size: 128256, num_hidden_layers: 16, num_attention_heads: 32, num_key_value_heads: 8, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 131072, rope_scaling: Some(Llama3RopeConfig { factor: 32.0, low_freq_factor: 1.0, high_freq_factor: 4.0, original_max_position_embeddings: 8192, rope_type: Llama3 }), quantization_config: None, tie_word_embeddings: true }
2025-05-28T21:52:40.592352Z INFO mistralrs_core::pipeline::normal: Applying ISQ to None
2025-05-28T21:52:40.592594Z INFO mistralrs_core::utils::varbuilder_utils: Loading model using mmap strategy.
2025-05-28T21:52:41.450399Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2025-05-28T21:52:41.460379Z INFO mistralrs_server: Model loaded.
2025-05-28T21:52:41.460496Z INFO mistralrs_core: Beginning dummy run.
2025-05-28T21:52:41.460760Z INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
15339,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 11,
logprob: 1.0,
bytes: Some(
",",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
15339,
11,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 1268,
logprob: 1.0,
bytes: Some(
" how",
),
top_logprobs: None,
}
2025-05-28T21:52:41.542654Z INFO mistralrs_core: Dummy run completed in 0.082136333s.
2025-05-28T21:52:41.544387Z INFO mistralrs_server: Serving on http://0.0.0.0:1234.
2025-05-28T21:52:46.463856Z INFO mistralrs_core::engine::logger: Throughput (T/s) 0.80, Prefix cache hitrate 0.00%, 0 running, 0 waiting
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 9906,
logprob: 1.0,
bytes: Some(
"Hello",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 0,
logprob: 1.0,
bytes: Some(
"!",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 2650,
logprob: 1.0,
bytes: Some(
" How",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 649,
logprob: 1.0,
bytes: Some(
" can",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 358,
logprob: 1.0,
bytes: Some(
" I",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 7945,
logprob: 1.0,
bytes: Some(
" assist",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 499,
logprob: 1.0,
bytes: Some(
" you",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
499,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 3432,
logprob: 1.0,
bytes: Some(
" today",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
499,
3432,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 30,
logprob: 1.0,
bytes: Some(
"?",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
499,
3432,
30,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 128009,
logprob: 1.0,
bytes: Some(
"<|eot_id|>",
),
top_logprobs: None,
}
2025-05-28T21:52:51.465783Z INFO mistralrs_core::engine::logger: Throughput (T/s) 10.00, Prefix cache hitrate 50.00%, 0 running, 0 waiting
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 9906,
logprob: 1.0,
bytes: Some(
"Hello",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 0,
logprob: 1.0,
bytes: Some(
"!",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 2650,
logprob: 1.0,
bytes: Some(
" How",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 649,
logprob: 1.0,
bytes: Some(
" can",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 358,
logprob: 1.0,
bytes: Some(
" I",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 7945,
logprob: 1.0,
bytes: Some(
" assist",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 499,
logprob: 1.0,
bytes: Some(
" you",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
499,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 3432,
logprob: 1.0,
bytes: Some(
" today",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
499,
3432,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 30,
logprob: 1.0,
bytes: Some(
"?",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
499,
3432,
30,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 128009,
logprob: 1.0,
bytes: Some(
"<|eot_id|>",
),
top_logprobs: None,
}
2025-05-28T21:52:56.470522Z INFO mistralrs_core::engine::logger: Throughput (T/s) 10.00, Prefix cache hitrate 33.33%, 0 running, 0 waiting
^C
✘ matt@Matts-MacBook-Pro-2024 ~/Code/matthewhaynes/mistral.rs/mistralrs-server master ± cargo run --features metal '--' --port 1234 --paged-attn plain -m meta-llama/Llama-3.2-1B-Instruct
Finished `dev` profile [optimized + debuginfo] target(s) in 0.13s
Running `/Users/matt/Code/matthewhaynes/mistral.rs/target/debug/mistralrs-server --port 1234 --paged-attn plain -m meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:53:07.630667Z INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2025-05-28T21:53:07.630753Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-05-28T21:53:07.630813Z INFO mistralrs_server: Model kind is: normal (no adapters)
2025-05-28T21:53:07.631007Z INFO hf_hub: Using token file found "/Users/matt/.cache/huggingface/token"
2025-05-28T21:53:07.631413Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:53:07.631477Z INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:53:07.821719Z INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model.safetensors"]
2025-05-28T21:53:07.896532Z INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:53:08.046795Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:53:08.138120Z INFO mistralrs_quant::utils::log: Automatic loader type determined to be `llama`
2025-05-28T21:53:08.138223Z INFO mistralrs_core::pipeline::normal: Prompt chunk size is 1024.
2025-05-28T21:53:08.144127Z INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-05-28T21:53:08.152848Z INFO mistralrs_core::pipeline::loaders: Using automatic device mapping parameters: text[max_seq_len: 4096, max_batch_size: 1].
2025-05-28T21:53:08.152941Z INFO mistralrs_quant::utils::log: Model has 16 repeating layers.
2025-05-28T21:53:08.152984Z INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-05-28T21:53:08.153029Z INFO mistralrs_quant::utils::log: Layers 0-15: metal[4294968389] (36 GB)
2025-05-28T21:53:08.154436Z INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-05-28T21:53:08.154449Z INFO mistralrs_core::pipeline::normal: Model config: Config { hidden_act: Silu, hidden_size: 2048, intermediate_size: 8192, vocab_size: 128256, num_hidden_layers: 16, num_attention_heads: 32, num_key_value_heads: 8, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 131072, rope_scaling: Some(Llama3RopeConfig { factor: 32.0, low_freq_factor: 1.0, high_freq_factor: 4.0, original_max_position_embeddings: 8192, rope_type: Llama3 }), quantization_config: None, tie_word_embeddings: true }
2025-05-28T21:53:08.155722Z INFO mistralrs_core::pipeline::normal: Applying ISQ to None
2025-05-28T21:53:08.155969Z INFO mistralrs_core::utils::varbuilder_utils: Loading model using mmap strategy.
2025-05-28T21:53:08.697797Z INFO mistralrs_core::paged_attention: Allocating 128 MB for PagedAttention KV cache per GPU
2025-05-28T21:53:08.697808Z INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 128 GPU blocks: available context length is 4096 tokens
2025-05-28T21:53:09.011410Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2025-05-28T21:53:09.021301Z INFO mistralrs_server: Model loaded.
2025-05-28T21:53:09.021418Z INFO mistralrs_core: Beginning dummy run.
2025-05-28T21:53:09.021672Z INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
15339,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 11,
logprob: 1.0,
bytes: Some(
",",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
15339,
11,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 1268,
logprob: 1.0,
bytes: Some(
" how",
),
top_logprobs: None,
}
2025-05-28T21:53:09.104622Z INFO mistralrs_core: Dummy run completed in 0.083181958s.
2025-05-28T21:53:09.106300Z INFO mistralrs_server: Serving on http://0.0.0.0:1234.
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 9906,
logprob: 1.0,
bytes: Some(
"Hello",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 0,
logprob: 1.0,
bytes: Some(
"!",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 2650,
logprob: 1.0,
bytes: Some(
" How",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 649,
logprob: 1.0,
bytes: Some(
" can",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 358,
logprob: 1.0,
bytes: Some(
" I",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 7945,
logprob: 1.0,
bytes: Some(
" assist",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 499,
logprob: 1.0,
bytes: Some(
" you",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
499,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 3432,
logprob: 1.0,
bytes: Some(
" today",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
499,
3432,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 30,
logprob: 1.0,
bytes: Some(
"?",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
499,
3432,
30,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 128009,
logprob: 1.0,
bytes: Some(
"<|eot_id|>",
),
top_logprobs: None,
}
2025-05-28T21:53:14.026452Z INFO mistralrs_core::engine::logger: Throughput (T/s) 10.80, Prefix cache hitrate 0.00%, 0 running, 0 waiting
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 9906,
logprob: 1.0,
bytes: Some(
"Hello",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 0,
logprob: 1.0,
bytes: Some(
"!",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 0,
logprob: 1.0,
bytes: Some(
"!",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
0,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 0,
logprob: 1.0,
bytes: Some(
"!",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
0,
0,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 0,
logprob: 1.0,
bytes: Some(
"!",
),
top_logprobs: None,
}
If I can chip in at all / if you have any pointers, just let me know!
Further testing notes
This actually only seems to impact Llama 3 (3.1 and 3.2 tested), and only under certain configs with paged attention. For example, with no quantization the infinite loop occurs, but with ISQ8 it does not (works as expected); GGUF hits the infinite loop too. I tested Gemma 3 and Phi 3 and they don't seem to be impacted.
| model | quant | paged attention | works? | command |
|---|---|---|---|---|
| llama 3.1 8b | none | no | ✅ | `cargo run --features metal '--' --port 1234 plain -m meta-llama/Llama-3.1-8B-Instruct` |
| llama 3.1 8b | none | yes | ❌ | `cargo run --features metal '--' --port 1234 --paged-attn plain -m meta-llama/Llama-3.1-8B-Instruct` |
| llama 3.1 8b | isq8 | no | ✅ | `cargo run --features metal '--' --port 1234 --isq 8 plain -m meta-llama/Llama-3.1-8B-Instruct` |
| llama 3.1 8b | isq8 | yes | ✅ | `cargo run --features metal '--' --port 1234 --isq 8 --paged-attn plain -m meta-llama/Llama-3.1-8B-Instruct` |
| llama 3.2 1b | none | no | ✅ | `cargo run --features metal '--' --port 1234 plain -m meta-llama/Llama-3.2-1B-Instruct` |
| llama 3.2 1b | none | yes | ❌ | `cargo run --features metal '--' --port 1234 --paged-attn plain -m meta-llama/Llama-3.2-1B-Instruct` |
| llama 3.2 1b | isq8 | no | ✅ | `cargo run --features metal '--' --port 1234 --isq 8 plain -m meta-llama/Llama-3.2-1B-Instruct` |
| llama 3.2 1b | isq8 | yes | ✅ | `cargo run --features metal '--' --port 1234 --isq 8 --paged-attn plain -m meta-llama/Llama-3.2-1B-Instruct` |
| llama 3.2 1b | gguf | no | ✅ | `cargo run --features metal '--' --port 1234 gguf -m bartowski/Llama-3.2-1B-Instruct-GGUF -f Llama-3.2-1B-Instruct-Q4_K_M.gguf` |
| llama 3.2 1b | gguf | yes | ❌ | `cargo run --features metal '--' --port 1234 --paged-attn gguf -m bartowski/Llama-3.2-1B-Instruct-GGUF -f Llama-3.2-1B-Instruct-Q4_K_M.gguf` |
| llama 3.2 3b | none | no | ✅ | `cargo run --features metal '--' --port 1234 plain -m meta-llama/Llama-3.2-3B-Instruct` |
| llama 3.2 3b | none | yes | ❌ | `cargo run --features metal '--' --port 1234 --paged-attn plain -m meta-llama/Llama-3.2-3B-Instruct` |
| llama 3.2 3b | isq8 | no | ✅ | `cargo run --features metal '--' --port 1234 --isq 8 plain -m meta-llama/Llama-3.2-3B-Instruct` |
| llama 3.2 3b | isq8 | yes | ✅ | `cargo run --features metal '--' --port 1234 --isq 8 --paged-attn plain -m meta-llama/Llama-3.2-3B-Instruct` |
| gemma-3-4b | none | no | ✅ | `cargo run --features metal '--' --port 1234 vision-plain -m google/gemma-3-4b-it` |
| gemma-3-4b | none | yes | ✅ | `cargo run --features metal '--' --port 1234 --paged-attn vision-plain -m google/gemma-3-4b-it` |
| gemma-3-4b | isq8 | no | ✅ | `cargo run --features metal '--' --port 1234 --isq 8 vision-plain -m google/gemma-3-4b-it` |
| gemma-3-4b | isq8 | yes | ✅ | `cargo run --features metal '--' --port 1234 --isq 8 --paged-attn vision-plain -m google/gemma-3-4b-it` |
| phi-3-mini | none | no | ✅ | `cargo run --features metal '--' --port 1234 plain -m microsoft/Phi-3-mini-4k-instruct` |
| phi-3-mini | none | yes | ✅ | `cargo run --features metal '--' --port 1234 --paged-attn plain -m microsoft/Phi-3-mini-4k-instruct` |
| phi-3-mini | isq8 | no | ✅ | `cargo run --features metal '--' --port 1234 --isq 8 plain -m microsoft/Phi-3-mini-4k-instruct` |
| phi-3-mini | isq8 | yes | ✅ | `cargo run --features metal '--' --port 1234 --isq 8 --paged-attn plain -m microsoft/Phi-3-mini-4k-instruct` |