Mistral Server Llama 3 Paged Attention Sampler Infinite Loop? #1383

@matthewhaynesonline

Description

Describe the bug

Edit: this only seems to happen with Llama 3.x, and only under certain configurations. Further testing notes are below.

When using --paged-attn, the sampler appears to hit an infinite loop, at least in my quick testing. This is a follow-up to #1376 (comment)

curl command
curl -X 'POST' \
  'http://localhost:1234/v1/chat/completions' \
  -H 'accept: */*' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "meta-llama/Llama-3.2-1B-Instruct",
  "messages": [{
    "role": "user",
    "content": "hi!"
  }]
}'

No Paged Attention

$ cargo run --features metal '--' --port 1234 plain -m meta-llama/Llama-3.2-1B-Instruct

When running without paged attention, subsequent requests work as expected.

Paged Attention

$ cargo run --features metal '--' --port 1234 --paged-attn plain -m meta-llama/Llama-3.2-1B-Instruct

When running with paged attention, the first request works, but the second request hangs. Canceling the request and then sending a third request causes a panic.

mistral-server-sampler-no-pa.mp4

Print debugging

mistral-server-sampler-pa.mp4

I added two debug statements (diff below) to show that the sampler picks the `!` token forever.

Also, I'm not sure if this is a red herring, but at the start of the second request the sequence shows a different set of tokens than without paged attention. After the first step the sequences match, but generation continues forever.

no paged attn second request sequence ctx
2025-05-28T21:52:46.463856Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 0.80, Prefix cache hitrate 0.00%, 0 running, 0 waiting
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 9906,
    logprob: 1.0,
    bytes: Some(
        "Hello",
    ),
    top_logprobs: None,
}
paged attn second request sequence ctx
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 9906,
    logprob: 1.0,
    bytes: Some(
        "Hello",
    ),
    top_logprobs: None,
}
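As an aside, the truncated paged-attn context above appears to be exactly the tail of the full no-paged-attn context, i.e. the sequence is missing its prefix rather than containing different tokens. A quick sanity check (Python, token IDs copied verbatim from the two dumps above):

```python
# Token IDs from the no-paged-attn second-request dump above.
no_pa_ctx = [
    128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25,
    6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 1591,
    11, 3297, 11, 220, 2366, 20, 271, 128009, 128006, 882,
    128007, 271, 6151, 0, 128009, 128006, 78191, 128007, 271,
]
# Token IDs from the paged-attn second-request dump above.
pa_ctx = [271, 6151, 0, 128009, 128006, 78191, 128007, 271]

# The paged-attn context is just the last len(pa_ctx) tokens of the full context.
assert pa_ctx == no_pa_ctx[-len(pa_ctx):]
print(f"paged-attn ctx == last {len(pa_ctx)} tokens of the full ctx")
```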
diff
diff --git a/mistralrs-core/src/pipeline/sampling.rs b/mistralrs-core/src/pipeline/sampling.rs
index 05bf6883..b610cceb 100644
--- a/mistralrs-core/src/pipeline/sampling.rs
+++ b/mistralrs-core/src/pipeline/sampling.rs
@@ -323,6 +323,7 @@ pub async fn sample_and_add_toks(

     for (sampled, seq) in std::iter::zip(sampled_vec, seqs.iter_mut()) {
         let next_token = crate::handle_seq_error_stateaware_ok!(sampled, seq);
+        dbg!(&next_token);

         let metadata = this.get_metadata();
         let eos_tok = if disable_eos_stop {
@@ -352,6 +353,7 @@ pub async fn sample_sequence(

     let sampler = seq.sampler();
     let ctx_clone = seq.get_toks().to_vec();
+    dbg!(&ctx_clone);
     let rng_clone = rng.clone();
     let logits_clone = logits.clone();
     let first_lobprobs_response = if use_async_pool {
full command output
 matt@Matts-MacBook-Pro-2024  ~  /Users/matt/Code/matthewhaynes/mistral.rs/mistralrs-server

 matt@Matts-MacBook-Pro-2024  ~/Code/matthewhaynes/mistral.rs/mistralrs-server   master ±  clear
 matt@Matts-MacBook-Pro-2024  ~/Code/matthewhaynes/mistral.rs/mistralrs-server   master ±  git status
On branch master
Your branch is up to date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   ../mistralrs-core/src/pipeline/sampling.rs

no changes added to commit (use "git add" and/or "git commit -a")
 matt@Matts-MacBook-Pro-2024  ~/Code/matthewhaynes/mistral.rs/mistralrs-server   master ±  git --no-pager log -1 --oneline
ec9ee690 (HEAD -> master, origin/master, origin/HEAD) Improved generate_uqff_card
 matt@Matts-MacBook-Pro-2024  ~/Code/matthewhaynes/mistral.rs/mistralrs-server   master ±  git diff
 matt@Matts-MacBook-Pro-2024  ~/Code/matthewhaynes/mistral.rs/mistralrs-server   master ±  cargo run --features metal '--' --port 1234 plain -m meta-llama/Llama-3.2-1B-Instruct
    Finished `dev` profile [optimized + debuginfo] target(s) in 0.14s
     Running `/Users/matt/Code/matthewhaynes/mistral.rs/target/debug/mistralrs-server --port 1234 plain -m meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:52:40.170867Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2025-05-28T21:52:40.170965Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-05-28T21:52:40.171029Z  INFO mistralrs_server: Model kind is: normal (no adapters)
2025-05-28T21:52:40.171234Z  INFO hf_hub: Using token file found "/Users/matt/.cache/huggingface/token"
2025-05-28T21:52:40.171659Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:52:40.171734Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:52:40.334680Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model.safetensors"]
2025-05-28T21:52:40.388299Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:52:40.524822Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:52:40.573661Z  INFO mistralrs_quant::utils::log: Automatic loader type determined to be `llama`
2025-05-28T21:52:40.573774Z  INFO mistralrs_core::pipeline::normal: Prompt chunk size is 1024.
2025-05-28T21:52:40.579251Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-05-28T21:52:40.589449Z  INFO mistralrs_core::pipeline::loaders: Using automatic device mapping parameters: text[max_seq_len: 4096, max_batch_size: 1].
2025-05-28T21:52:40.589559Z  INFO mistralrs_quant::utils::log: Model has 16 repeating layers.
2025-05-28T21:52:40.589600Z  INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-05-28T21:52:40.589650Z  INFO mistralrs_quant::utils::log: Layers 0-15: metal[4294968389] (36 GB)
2025-05-28T21:52:40.591081Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-05-28T21:52:40.591095Z  INFO mistralrs_core::pipeline::normal: Model config: Config { hidden_act: Silu, hidden_size: 2048, intermediate_size: 8192, vocab_size: 128256, num_hidden_layers: 16, num_attention_heads: 32, num_key_value_heads: 8, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 131072, rope_scaling: Some(Llama3RopeConfig { factor: 32.0, low_freq_factor: 1.0, high_freq_factor: 4.0, original_max_position_embeddings: 8192, rope_type: Llama3 }), quantization_config: None, tie_word_embeddings: true }
2025-05-28T21:52:40.592352Z  INFO mistralrs_core::pipeline::normal: Applying ISQ to None
2025-05-28T21:52:40.592594Z  INFO mistralrs_core::utils::varbuilder_utils: Loading model using mmap strategy.
2025-05-28T21:52:41.450399Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2025-05-28T21:52:41.460379Z  INFO mistralrs_server: Model loaded.
2025-05-28T21:52:41.460496Z  INFO mistralrs_core: Beginning dummy run.
2025-05-28T21:52:41.460760Z  INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    15339,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 11,
    logprob: 1.0,
    bytes: Some(
        ",",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    15339,
    11,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 1268,
    logprob: 1.0,
    bytes: Some(
        " how",
    ),
    top_logprobs: None,
}
2025-05-28T21:52:41.542654Z  INFO mistralrs_core: Dummy run completed in 0.082136333s.
2025-05-28T21:52:41.544387Z  INFO mistralrs_server: Serving on http://0.0.0.0:1234.
2025-05-28T21:52:46.463856Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 0.80, Prefix cache hitrate 0.00%, 0 running, 0 waiting
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 9906,
    logprob: 1.0,
    bytes: Some(
        "Hello",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 0,
    logprob: 1.0,
    bytes: Some(
        "!",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 2650,
    logprob: 1.0,
    bytes: Some(
        " How",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    2650,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 649,
    logprob: 1.0,
    bytes: Some(
        " can",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    2650,
    649,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 358,
    logprob: 1.0,
    bytes: Some(
        " I",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    2650,
    649,
    358,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 7945,
    logprob: 1.0,
    bytes: Some(
        " assist",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    2650,
    649,
    358,
    7945,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 499,
    logprob: 1.0,
    bytes: Some(
        " you",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    2650,
    649,
    358,
    7945,
    499,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 3432,
    logprob: 1.0,
    bytes: Some(
        " today",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    2650,
    649,
    358,
    7945,
    499,
    3432,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 30,
    logprob: 1.0,
    bytes: Some(
        "?",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    2650,
    649,
    358,
    7945,
    499,
    3432,
    30,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 128009,
    logprob: 1.0,
    bytes: Some(
        "<|eot_id|>",
    ),
    top_logprobs: None,
}
2025-05-28T21:52:51.465783Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 10.00, Prefix cache hitrate 50.00%, 0 running, 0 waiting
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 9906,
    logprob: 1.0,
    bytes: Some(
        "Hello",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 0,
    logprob: 1.0,
    bytes: Some(
        "!",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 2650,
    logprob: 1.0,
    bytes: Some(
        " How",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    2650,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 649,
    logprob: 1.0,
    bytes: Some(
        " can",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    2650,
    649,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 358,
    logprob: 1.0,
    bytes: Some(
        " I",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    2650,
    649,
    358,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 7945,
    logprob: 1.0,
    bytes: Some(
        " assist",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    2650,
    649,
    358,
    7945,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 499,
    logprob: 1.0,
    bytes: Some(
        " you",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    2650,
    649,
    358,
    7945,
    499,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 3432,
    logprob: 1.0,
    bytes: Some(
        " today",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    2650,
    649,
    358,
    7945,
    499,
    3432,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 30,
    logprob: 1.0,
    bytes: Some(
        "?",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    2650,
    649,
    358,
    7945,
    499,
    3432,
    30,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 128009,
    logprob: 1.0,
    bytes: Some(
        "<|eot_id|>",
    ),
    top_logprobs: None,
}
2025-05-28T21:52:56.470522Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 10.00, Prefix cache hitrate 33.33%, 0 running, 0 waiting
^C
 ✘ matt@Matts-MacBook-Pro-2024  ~/Code/matthewhaynes/mistral.rs/mistralrs-server   master ±  cargo run --features metal '--' --port 1234 --paged-attn plain -m meta-llama/Llama-3.2-1B-Instruct
    Finished `dev` profile [optimized + debuginfo] target(s) in 0.13s
     Running `/Users/matt/Code/matthewhaynes/mistral.rs/target/debug/mistralrs-server --port 1234 --paged-attn plain -m meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:53:07.630667Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2025-05-28T21:53:07.630753Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-05-28T21:53:07.630813Z  INFO mistralrs_server: Model kind is: normal (no adapters)
2025-05-28T21:53:07.631007Z  INFO hf_hub: Using token file found "/Users/matt/.cache/huggingface/token"
2025-05-28T21:53:07.631413Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:53:07.631477Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:53:07.821719Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model.safetensors"]
2025-05-28T21:53:07.896532Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:53:08.046795Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:53:08.138120Z  INFO mistralrs_quant::utils::log: Automatic loader type determined to be `llama`
2025-05-28T21:53:08.138223Z  INFO mistralrs_core::pipeline::normal: Prompt chunk size is 1024.
2025-05-28T21:53:08.144127Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-05-28T21:53:08.152848Z  INFO mistralrs_core::pipeline::loaders: Using automatic device mapping parameters: text[max_seq_len: 4096, max_batch_size: 1].
2025-05-28T21:53:08.152941Z  INFO mistralrs_quant::utils::log: Model has 16 repeating layers.
2025-05-28T21:53:08.152984Z  INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-05-28T21:53:08.153029Z  INFO mistralrs_quant::utils::log: Layers 0-15: metal[4294968389] (36 GB)
2025-05-28T21:53:08.154436Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-05-28T21:53:08.154449Z  INFO mistralrs_core::pipeline::normal: Model config: Config { hidden_act: Silu, hidden_size: 2048, intermediate_size: 8192, vocab_size: 128256, num_hidden_layers: 16, num_attention_heads: 32, num_key_value_heads: 8, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 131072, rope_scaling: Some(Llama3RopeConfig { factor: 32.0, low_freq_factor: 1.0, high_freq_factor: 4.0, original_max_position_embeddings: 8192, rope_type: Llama3 }), quantization_config: None, tie_word_embeddings: true }
2025-05-28T21:53:08.155722Z  INFO mistralrs_core::pipeline::normal: Applying ISQ to None
2025-05-28T21:53:08.155969Z  INFO mistralrs_core::utils::varbuilder_utils: Loading model using mmap strategy.
2025-05-28T21:53:08.697797Z  INFO mistralrs_core::paged_attention: Allocating 128 MB for PagedAttention KV cache per GPU
2025-05-28T21:53:08.697808Z  INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 128 GPU blocks: available context length is 4096 tokens
2025-05-28T21:53:09.011410Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2025-05-28T21:53:09.021301Z  INFO mistralrs_server: Model loaded.
2025-05-28T21:53:09.021418Z  INFO mistralrs_core: Beginning dummy run.
2025-05-28T21:53:09.021672Z  INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    15339,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 11,
    logprob: 1.0,
    bytes: Some(
        ",",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    15339,
    11,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 1268,
    logprob: 1.0,
    bytes: Some(
        " how",
    ),
    top_logprobs: None,
}
2025-05-28T21:53:09.104622Z  INFO mistralrs_core: Dummy run completed in 0.083181958s.
2025-05-28T21:53:09.106300Z  INFO mistralrs_server: Serving on http://0.0.0.0:1234.
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 9906,
    logprob: 1.0,
    bytes: Some(
        "Hello",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 0,
    logprob: 1.0,
    bytes: Some(
        "!",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 2650,
    logprob: 1.0,
    bytes: Some(
        " How",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    2650,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 649,
    logprob: 1.0,
    bytes: Some(
        " can",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    2650,
    649,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 358,
    logprob: 1.0,
    bytes: Some(
        " I",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    2650,
    649,
    358,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 7945,
    logprob: 1.0,
    bytes: Some(
        " assist",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    2650,
    649,
    358,
    7945,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 499,
    logprob: 1.0,
    bytes: Some(
        " you",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    2650,
    649,
    358,
    7945,
    499,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 3432,
    logprob: 1.0,
    bytes: Some(
        " today",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    2650,
    649,
    358,
    7945,
    499,
    3432,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 30,
    logprob: 1.0,
    bytes: Some(
        "?",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    2650,
    649,
    358,
    7945,
    499,
    3432,
    30,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 128009,
    logprob: 1.0,
    bytes: Some(
        "<|eot_id|>",
    ),
    top_logprobs: None,
}
2025-05-28T21:53:14.026452Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 10.80, Prefix cache hitrate 0.00%, 0 running, 0 waiting
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 9906,
    logprob: 1.0,
    bytes: Some(
        "Hello",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 0,
    logprob: 1.0,
    bytes: Some(
        "!",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 0,
    logprob: 1.0,
    bytes: Some(
        "!",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    0,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 0,
    logprob: 1.0,
    bytes: Some(
        "!",
    ),
    top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
    128000,
    128000,
    128006,
    9125,
    128007,
    271,
    38766,
    1303,
    33025,
    2696,
    25,
    6790,
    220,
    2366,
    18,
    198,
    15724,
    2696,
    25,
    220,
    1591,
    11,
    3297,
    11,
    220,
    2366,
    20,
    271,
    128009,
    128006,
    882,
    128007,
    271,
    6151,
    0,
    128009,
    128006,
    78191,
    128007,
    271,
    9906,
    0,
    0,
    0,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
    token: 0,
    logprob: 1.0,
    bytes: Some(
        "!",
    ),
    top_logprobs: None,
}
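
The symptom above is easy to spot mechanically: the tail of the sampled sequence stops changing (token `0`, `"!"`, forever). A tiny illustrative check, plain Python and not mistral.rs code (the function name and window size are made up), that flags such a run:

```python
def looks_stuck(tokens, window=8):
    """Heuristic: flag a sequence whose last `window` sampled tokens
    are all the same token id, like the endless 0 ("!") run above."""
    if len(tokens) < window:
        return False
    return len(set(tokens[-window:])) == 1

# The healthy first request varies its tail:
print(looks_stuck([9906, 0, 2650, 649, 358, 7945, 499, 3432, 30]))  # → False
# The looping second request does not:
print(looks_stuck([9906] + [0] * 8))  # → True
```

Something like this could make an automated repro script fail fast instead of hanging on the second request.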

If I can chip in at all / if you have any pointers, just let me know!

Further testing notes

This actually only seems to affect Llama 3 (3.1 and 3.2 tested), and only under certain paged-attention configs. With no quantization the infinite loop occurs, but with isq8 it does not (works as expected); gguf hits the infinite loop too. Gemma 3 and Phi 3 were also tested and don't seem to be affected.

| model | quant | paged attention | works? | command |
| --- | --- | --- | --- | --- |
| llama 3.1 8b | none | no | yes | `cargo run --features metal '--' --port 1234 plain -m meta-llama/Llama-3.1-8B-Instruct` |
| | none | yes | no | `cargo run --features metal '--' --port 1234 --paged-attn plain -m meta-llama/Llama-3.1-8B-Instruct` |
| | isq8 | no | yes | `cargo run --features metal '--' --port 1234 --isq 8 plain -m meta-llama/Llama-3.1-8B-Instruct` |
| | isq8 | yes | yes | `cargo run --features metal '--' --port 1234 --isq 8 --paged-attn plain -m meta-llama/Llama-3.1-8B-Instruct` |
| llama 3.2 1b | none | no | yes | `cargo run --features metal '--' --port 1234 plain -m meta-llama/Llama-3.2-1B-Instruct` |
| | none | yes | no | `cargo run --features metal '--' --port 1234 --paged-attn plain -m meta-llama/Llama-3.2-1B-Instruct` |
| | isq8 | no | yes | `cargo run --features metal '--' --port 1234 --isq 8 plain -m meta-llama/Llama-3.2-1B-Instruct` |
| | isq8 | yes | yes | `cargo run --features metal '--' --port 1234 --isq 8 --paged-attn plain -m meta-llama/Llama-3.2-1B-Instruct` |
| | gguf | no | yes | `cargo run --features metal '--' --port 1234 gguf -m bartowski/Llama-3.2-1B-Instruct-GGUF -f Llama-3.2-1B-Instruct-Q4_K_M.gguf` |
| | gguf | yes | no | `cargo run --features metal '--' --port 1234 --paged-attn gguf -m bartowski/Llama-3.2-1B-Instruct-GGUF -f Llama-3.2-1B-Instruct-Q4_K_M.gguf` |
| llama 3.2 3b | none | no | yes | `cargo run --features metal '--' --port 1234 plain -m meta-llama/Llama-3.2-3B-Instruct` |
| | none | yes | no | `cargo run --features metal '--' --port 1234 --paged-attn plain -m meta-llama/Llama-3.2-3B-Instruct` |
| | isq8 | no | yes | `cargo run --features metal '--' --port 1234 --isq 8 plain -m meta-llama/Llama-3.2-3B-Instruct` |
| | isq8 | yes | yes | `cargo run --features metal '--' --port 1234 --isq 8 --paged-attn plain -m meta-llama/Llama-3.2-3B-Instruct` |
| gemma-3-4b | none | no | yes | `cargo run --features metal '--' --port 1234 vision-plain -m google/gemma-3-4b-it` |
| | none | yes | yes | `cargo run --features metal '--' --port 1234 --paged-attn vision-plain -m google/gemma-3-4b-it` |
| | isq8 | no | yes | `cargo run --features metal '--' --port 1234 --isq 8 vision-plain -m google/gemma-3-4b-it` |
| | isq8 | yes | yes | `cargo run --features metal '--' --port 1234 --isq 8 --paged-attn vision-plain -m google/gemma-3-4b-it` |
| phi-3-mini | none | no | yes | `cargo run --features metal '--' --port 1234 plain -m microsoft/Phi-3-mini-4k-instruct` |
| | none | yes | yes | `cargo run --features metal '--' --port 1234 --paged-attn plain -m microsoft/Phi-3-mini-4k-instruct` |
| | isq8 | no | yes | `cargo run --features metal '--' --port 1234 --isq 8 plain -m microsoft/Phi-3-mini-4k-instruct` |
| | isq8 | yes | yes | `cargo run --features metal '--' --port 1234 --isq 8 --paged-attn plain -m microsoft/Phi-3-mini-4k-instruct` |

Latest commit or version

ec9ee69
