Describe the bug
Edit: this only seems to be the case when using Llama 3.x, and only under certain configs. Further testing notes below.
When using --paged-attn, the sampler appears to hit an infinite loop, at least in my quick testing. This is a follow-up of #1376 (comment)
curl command
curl -X 'POST' \
'http://localhost:1234/v1/chat/completions' \
-H 'accept: */*' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta-llama/Llama-3.2-1B-Instruct",
"messages": [{
"role": "user",
"content": "hi!"
}]
}'
No Paged Attention
$ cargo run --features metal '--' --port 1234 plain -m meta-llama/Llama-3.2-1B-Instruct
When running without paged attention, subsequent requests work as expected.
Paged Attention
$ cargo run --features metal '--' --port 1234 --paged-attn plain -m meta-llama/Llama-3.2-1B-Instruct
When running with paged attention, the first request works, but the second request hangs. Canceling the request and then sending a third request causes a panic.
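For convenience, here's a minimal two-request repro loop as a sketch; it assumes the `reqwest` crate (with the `blocking` and `json` features) plus `serde_json`, and the server started with the command above:

```rust
// Hypothetical repro: send the same chat completion twice in a row.
// With --paged-attn, the second request never returns and hits the timeout.
use std::time::Duration;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::builder()
        .timeout(Duration::from_secs(30))
        .build()?;
    let body = serde_json::json!({
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "messages": [{ "role": "user", "content": "hi!" }]
    });
    for i in 1..=2 {
        // Request 1 succeeds; request 2 hangs until the timeout fires.
        let resp = client
            .post("http://localhost:1234/v1/chat/completions")
            .json(&body)
            .send()?;
        println!("request {i}: status {}", resp.status());
    }
    Ok(())
}
```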
mistral-server-sampler-no-pa.mp4 (video attachment)
mistral-server-sampler-pa.mp4 (video attachment)
Print debugging
I added two debugging statements (see the diff below) to show that the sampler picks the `!` token forever.
Also, I'm not sure if this is a red herring, but at the start of the second request the sequence shows a different set of tokens than without paged attention; the shorter context looks like just the tail of the prompt. After the first step the sequence matches, but it continues on forever.
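To see what those two contexts actually decode to, something like this sketch works (assuming the `tokenizers` crate and a local copy of the Llama 3.2 `tokenizer.json`; the first token array is abbreviated from the dumps below):

```rust
// Decode the two ctx_clone dumps to see what the sampler is conditioning on.
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tok = Tokenizer::from_file("tokenizer.json")?;

    // Second request without paged attention: the full chat-templated prompt.
    let no_pa: &[u32] = &[128000, 128006, 9125, 128007, 271, /* ... */ 78191, 128007, 271];
    // Second request with paged attention: only the tail of the prompt.
    let pa: &[u32] = &[271, 6151, 0, 128009, 128006, 78191, 128007, 271];

    // Keep special tokens visible so the chat-template markers show up.
    println!("no-pa ctx: {}", tok.decode(no_pa, false)?);
    println!("pa ctx:    {}", tok.decode(pa, false)?);
    Ok(())
}
```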
no paged attn second request sequence ctx
2025-05-28T21:52:46.463856Z INFO mistralrs_core::engine::logger: Throughput (T/s) 0.80, Prefix cache hitrate 0.00%, 0 running, 0 waiting
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 9906,
logprob: 1.0,
bytes: Some(
"Hello",
),
top_logprobs: None,
}
paged attn second request sequence ctx
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 9906,
logprob: 1.0,
bytes: Some(
"Hello",
),
top_logprobs: None,
}
diff
diff --git a/mistralrs-core/src/pipeline/sampling.rs b/mistralrs-core/src/pipeline/sampling.rs
index 05bf6883..b610cceb 100644
--- a/mistralrs-core/src/pipeline/sampling.rs
+++ b/mistralrs-core/src/pipeline/sampling.rs
@@ -323,6 +323,7 @@ pub async fn sample_and_add_toks(
for (sampled, seq) in std::iter::zip(sampled_vec, seqs.iter_mut()) {
let next_token = crate::handle_seq_error_stateaware_ok!(sampled, seq);
+ dbg!(&next_token);
let metadata = this.get_metadata();
let eos_tok = if disable_eos_stop {
@@ -352,6 +353,7 @@ pub async fn sample_sequence(
let sampler = seq.sampler();
let ctx_clone = seq.get_toks().to_vec();
+ dbg!(&ctx_clone);
let rng_clone = rng.clone();
let logits_clone = logits.clone();
let first_lobprobs_response = if use_async_pool {
full command output
matt@Matts-MacBook-Pro-2024 ~ /Users/matt/Code/matthewhaynes/mistral.rs/mistralrs-server
matt@Matts-MacBook-Pro-2024 ~/Code/matthewhaynes/mistral.rs/mistralrs-server master ± clear
matt@Matts-MacBook-Pro-2024 ~/Code/matthewhaynes/mistral.rs/mistralrs-server master ± git status
On branch master
Your branch is up to date with 'origin/master'.
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: ../mistralrs-core/src/pipeline/sampling.rs
no changes added to commit (use "git add" and/or "git commit -a")
matt@Matts-MacBook-Pro-2024 ~/Code/matthewhaynes/mistral.rs/mistralrs-server master ± git --no-pager log -1 --oneline
ec9ee690 (HEAD -> master, origin/master, origin/HEAD) Improved generate_uqff_card
matt@Matts-MacBook-Pro-2024 ~/Code/matthewhaynes/mistral.rs/mistralrs-server master ± git diff
matt@Matts-MacBook-Pro-2024 ~/Code/matthewhaynes/mistral.rs/mistralrs-server master ± cargo run --features metal '--' --port 1234 plain -m meta-llama/Llama-3.2-1B-Instruct
Finished `dev` profile [optimized + debuginfo] target(s) in 0.14s
Running `/Users/matt/Code/matthewhaynes/mistral.rs/target/debug/mistralrs-server --port 1234 plain -m meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:52:40.170867Z INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2025-05-28T21:52:40.170965Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-05-28T21:52:40.171029Z INFO mistralrs_server: Model kind is: normal (no adapters)
2025-05-28T21:52:40.171234Z INFO hf_hub: Using token file found "/Users/matt/.cache/huggingface/token"
2025-05-28T21:52:40.171659Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:52:40.171734Z INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:52:40.334680Z INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model.safetensors"]
2025-05-28T21:52:40.388299Z INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:52:40.524822Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:52:40.573661Z INFO mistralrs_quant::utils::log: Automatic loader type determined to be `llama`
2025-05-28T21:52:40.573774Z INFO mistralrs_core::pipeline::normal: Prompt chunk size is 1024.
2025-05-28T21:52:40.579251Z INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-05-28T21:52:40.589449Z INFO mistralrs_core::pipeline::loaders: Using automatic device mapping parameters: text[max_seq_len: 4096, max_batch_size: 1].
2025-05-28T21:52:40.589559Z INFO mistralrs_quant::utils::log: Model has 16 repeating layers.
2025-05-28T21:52:40.589600Z INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-05-28T21:52:40.589650Z INFO mistralrs_quant::utils::log: Layers 0-15: metal[4294968389] (36 GB)
2025-05-28T21:52:40.591081Z INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-05-28T21:52:40.591095Z INFO mistralrs_core::pipeline::normal: Model config: Config { hidden_act: Silu, hidden_size: 2048, intermediate_size: 8192, vocab_size: 128256, num_hidden_layers: 16, num_attention_heads: 32, num_key_value_heads: 8, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 131072, rope_scaling: Some(Llama3RopeConfig { factor: 32.0, low_freq_factor: 1.0, high_freq_factor: 4.0, original_max_position_embeddings: 8192, rope_type: Llama3 }), quantization_config: None, tie_word_embeddings: true }
2025-05-28T21:52:40.592352Z INFO mistralrs_core::pipeline::normal: Applying ISQ to None
2025-05-28T21:52:40.592594Z INFO mistralrs_core::utils::varbuilder_utils: Loading model using mmap strategy.
2025-05-28T21:52:41.450399Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2025-05-28T21:52:41.460379Z INFO mistralrs_server: Model loaded.
2025-05-28T21:52:41.460496Z INFO mistralrs_core: Beginning dummy run.
2025-05-28T21:52:41.460760Z INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
15339,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 11,
logprob: 1.0,
bytes: Some(
",",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
15339,
11,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 1268,
logprob: 1.0,
bytes: Some(
" how",
),
top_logprobs: None,
}
2025-05-28T21:52:41.542654Z INFO mistralrs_core: Dummy run completed in 0.082136333s.
2025-05-28T21:52:41.544387Z INFO mistralrs_server: Serving on http://0.0.0.0:1234.
2025-05-28T21:52:46.463856Z INFO mistralrs_core::engine::logger: Throughput (T/s) 0.80, Prefix cache hitrate 0.00%, 0 running, 0 waiting
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 9906,
logprob: 1.0,
bytes: Some(
"Hello",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 0,
logprob: 1.0,
bytes: Some(
"!",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 2650,
logprob: 1.0,
bytes: Some(
" How",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 649,
logprob: 1.0,
bytes: Some(
" can",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 358,
logprob: 1.0,
bytes: Some(
" I",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 7945,
logprob: 1.0,
bytes: Some(
" assist",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 499,
logprob: 1.0,
bytes: Some(
" you",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
499,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 3432,
logprob: 1.0,
bytes: Some(
" today",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
499,
3432,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 30,
logprob: 1.0,
bytes: Some(
"?",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
499,
3432,
30,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 128009,
logprob: 1.0,
bytes: Some(
"<|eot_id|>",
),
top_logprobs: None,
}
2025-05-28T21:52:51.465783Z INFO mistralrs_core::engine::logger: Throughput (T/s) 10.00, Prefix cache hitrate 50.00%, 0 running, 0 waiting
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 9906,
logprob: 1.0,
bytes: Some(
"Hello",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 0,
logprob: 1.0,
bytes: Some(
"!",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 2650,
logprob: 1.0,
bytes: Some(
" How",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 649,
logprob: 1.0,
bytes: Some(
" can",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 358,
logprob: 1.0,
bytes: Some(
" I",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 7945,
logprob: 1.0,
bytes: Some(
" assist",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 499,
logprob: 1.0,
bytes: Some(
" you",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
499,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 3432,
logprob: 1.0,
bytes: Some(
" today",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
499,
3432,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 30,
logprob: 1.0,
bytes: Some(
"?",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
499,
3432,
30,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 128009,
logprob: 1.0,
bytes: Some(
"<|eot_id|>",
),
top_logprobs: None,
}
2025-05-28T21:52:56.470522Z INFO mistralrs_core::engine::logger: Throughput (T/s) 10.00, Prefix cache hitrate 33.33%, 0 running, 0 waiting
^C
✘ matt@Matts-MacBook-Pro-2024 ~/Code/matthewhaynes/mistral.rs/mistralrs-server master ± cargo run --features metal '--' --port 1234 --paged-attn plain -m meta-llama/Llama-3.2-1B-Instruct
Finished `dev` profile [optimized + debuginfo] target(s) in 0.13s
Running `/Users/matt/Code/matthewhaynes/mistral.rs/target/debug/mistralrs-server --port 1234 --paged-attn plain -m meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:53:07.630667Z INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2025-05-28T21:53:07.630753Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-05-28T21:53:07.630813Z INFO mistralrs_server: Model kind is: normal (no adapters)
2025-05-28T21:53:07.631007Z INFO hf_hub: Using token file found "/Users/matt/.cache/huggingface/token"
2025-05-28T21:53:07.631413Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:53:07.631477Z INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:53:07.821719Z INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model.safetensors"]
2025-05-28T21:53:07.896532Z INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:53:08.046795Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-28T21:53:08.138120Z INFO mistralrs_quant::utils::log: Automatic loader type determined to be `llama`
2025-05-28T21:53:08.138223Z INFO mistralrs_core::pipeline::normal: Prompt chunk size is 1024.
2025-05-28T21:53:08.144127Z INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-05-28T21:53:08.152848Z INFO mistralrs_core::pipeline::loaders: Using automatic device mapping parameters: text[max_seq_len: 4096, max_batch_size: 1].
2025-05-28T21:53:08.152941Z INFO mistralrs_quant::utils::log: Model has 16 repeating layers.
2025-05-28T21:53:08.152984Z INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-05-28T21:53:08.153029Z INFO mistralrs_quant::utils::log: Layers 0-15: metal[4294968389] (36 GB)
2025-05-28T21:53:08.154436Z INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-05-28T21:53:08.154449Z INFO mistralrs_core::pipeline::normal: Model config: Config { hidden_act: Silu, hidden_size: 2048, intermediate_size: 8192, vocab_size: 128256, num_hidden_layers: 16, num_attention_heads: 32, num_key_value_heads: 8, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 131072, rope_scaling: Some(Llama3RopeConfig { factor: 32.0, low_freq_factor: 1.0, high_freq_factor: 4.0, original_max_position_embeddings: 8192, rope_type: Llama3 }), quantization_config: None, tie_word_embeddings: true }
2025-05-28T21:53:08.155722Z INFO mistralrs_core::pipeline::normal: Applying ISQ to None
2025-05-28T21:53:08.155969Z INFO mistralrs_core::utils::varbuilder_utils: Loading model using mmap strategy.
2025-05-28T21:53:08.697797Z INFO mistralrs_core::paged_attention: Allocating 128 MB for PagedAttention KV cache per GPU
2025-05-28T21:53:08.697808Z INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 128 GPU blocks: available context length is 4096 tokens
2025-05-28T21:53:09.011410Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2025-05-28T21:53:09.021301Z INFO mistralrs_server: Model loaded.
2025-05-28T21:53:09.021418Z INFO mistralrs_core: Beginning dummy run.
2025-05-28T21:53:09.021672Z INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
15339,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 11,
logprob: 1.0,
bytes: Some(
",",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
15339,
11,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 1268,
logprob: 1.0,
bytes: Some(
" how",
),
top_logprobs: None,
}
2025-05-28T21:53:09.104622Z INFO mistralrs_core: Dummy run completed in 0.083181958s.
2025-05-28T21:53:09.106300Z INFO mistralrs_server: Serving on http://0.0.0.0:1234.
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 9906,
logprob: 1.0,
bytes: Some(
"Hello",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 0,
logprob: 1.0,
bytes: Some(
"!",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 2650,
logprob: 1.0,
bytes: Some(
" How",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 649,
logprob: 1.0,
bytes: Some(
" can",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 358,
logprob: 1.0,
bytes: Some(
" I",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 7945,
logprob: 1.0,
bytes: Some(
" assist",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 499,
logprob: 1.0,
bytes: Some(
" you",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
499,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 3432,
logprob: 1.0,
bytes: Some(
" today",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
499,
3432,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 30,
logprob: 1.0,
bytes: Some(
"?",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
2650,
649,
358,
7945,
499,
3432,
30,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 128009,
logprob: 1.0,
bytes: Some(
"<|eot_id|>",
),
top_logprobs: None,
}
2025-05-28T21:53:14.026452Z INFO mistralrs_core::engine::logger: Throughput (T/s) 10.80, Prefix cache hitrate 0.00%, 0 running, 0 waiting
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 9906,
logprob: 1.0,
bytes: Some(
"Hello",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 0,
logprob: 1.0,
bytes: Some(
"!",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 0,
logprob: 1.0,
bytes: Some(
"!",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
0,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 0,
logprob: 1.0,
bytes: Some(
"!",
),
top_logprobs: None,
}
[mistralrs-core/src/pipeline/sampling.rs:356:5] &ctx_clone = [
128000,
128000,
128006,
9125,
128007,
271,
38766,
1303,
33025,
2696,
25,
6790,
220,
2366,
18,
198,
15724,
2696,
25,
220,
1591,
11,
3297,
11,
220,
2366,
20,
271,
128009,
128006,
882,
128007,
271,
6151,
0,
128009,
128006,
78191,
128007,
271,
9906,
0,
0,
0,
]
[mistralrs-core/src/pipeline/sampling.rs:326:9] &next_token = Logprobs {
token: 0,
logprob: 1.0,
bytes: Some(
"!",
),
top_logprobs: None,
}
If I can chip in at all / if you have any pointers, just let me know!
Further testing notes
This actually only seems to impact Llama 3 (3.1 and 3.2 tested), and only under certain configs with paged attention. For example, with no quantization the infinite loop occurs, but with ISQ8 it does not (works as expected); GGUF hits the infinite loop too. I tested Gemma 3 and Phi 3 and they don't seem to be impacted.
| model | quant | paged attention | works? | command |
|---|---|---|---|---|
| llama 3.1 8b | none | no | ✅ | `cargo run --features metal '--' --port 1234 plain -m meta-llama/Llama-3.1-8B-Instruct` |
| llama 3.1 8b | none | yes | ❌ | `cargo run --features metal '--' --port 1234 --paged-attn plain -m meta-llama/Llama-3.1-8B-Instruct` |
| llama 3.1 8b | isq8 | no | ✅ | `cargo run --features metal '--' --port 1234 --isq 8 plain -m meta-llama/Llama-3.1-8B-Instruct` |
| llama 3.1 8b | isq8 | yes | ✅ | `cargo run --features metal '--' --port 1234 --isq 8 --paged-attn plain -m meta-llama/Llama-3.1-8B-Instruct` |
| llama 3.2 1b | none | no | ✅ | `cargo run --features metal '--' --port 1234 plain -m meta-llama/Llama-3.2-1B-Instruct` |
| llama 3.2 1b | none | yes | ❌ | `cargo run --features metal '--' --port 1234 --paged-attn plain -m meta-llama/Llama-3.2-1B-Instruct` |
| llama 3.2 1b | isq8 | no | ✅ | `cargo run --features metal '--' --port 1234 --isq 8 plain -m meta-llama/Llama-3.2-1B-Instruct` |
| llama 3.2 1b | isq8 | yes | ✅ | `cargo run --features metal '--' --port 1234 --isq 8 --paged-attn plain -m meta-llama/Llama-3.2-1B-Instruct` |
| llama 3.2 1b | gguf | no | ✅ | `cargo run --features metal '--' --port 1234 gguf -m bartowski/Llama-3.2-1B-Instruct-GGUF -f Llama-3.2-1B-Instruct-Q4_K_M.gguf` |
| llama 3.2 1b | gguf | yes | ❌ | `cargo run --features metal '--' --port 1234 --paged-attn gguf -m bartowski/Llama-3.2-1B-Instruct-GGUF -f Llama-3.2-1B-Instruct-Q4_K_M.gguf` |
| llama 3.2 3b | none | no | ✅ | `cargo run --features metal '--' --port 1234 plain -m meta-llama/Llama-3.2-3B-Instruct` |
| llama 3.2 3b | none | yes | ❌ | `cargo run --features metal '--' --port 1234 --paged-attn plain -m meta-llama/Llama-3.2-3B-Instruct` |
| llama 3.2 3b | isq8 | no | ✅ | `cargo run --features metal '--' --port 1234 --isq 8 plain -m meta-llama/Llama-3.2-3B-Instruct` |
| llama 3.2 3b | isq8 | yes | ✅ | `cargo run --features metal '--' --port 1234 --isq 8 --paged-attn plain -m meta-llama/Llama-3.2-3B-Instruct` |
| gemma-3-4b | none | no | ✅ | `cargo run --features metal '--' --port 1234 vision-plain -m google/gemma-3-4b-it` |
| gemma-3-4b | none | yes | ✅ | `cargo run --features metal '--' --port 1234 --paged-attn vision-plain -m google/gemma-3-4b-it` |
| gemma-3-4b | isq8 | no | ✅ | `cargo run --features metal '--' --port 1234 --isq 8 vision-plain -m google/gemma-3-4b-it` |
| gemma-3-4b | isq8 | yes | ✅ | `cargo run --features metal '--' --port 1234 --isq 8 --paged-attn vision-plain -m google/gemma-3-4b-it` |
| phi-3-mini | none | no | ✅ | `cargo run --features metal '--' --port 1234 plain -m microsoft/Phi-3-mini-4k-instruct` |
| phi-3-mini | none | yes | ✅ | `cargo run --features metal '--' --port 1234 --paged-attn plain -m microsoft/Phi-3-mini-4k-instruct` |
| phi-3-mini | isq8 | no | ✅ | `cargo run --features metal '--' --port 1234 --isq 8 plain -m microsoft/Phi-3-mini-4k-instruct` |
| phi-3-mini | isq8 | yes | ✅ | `cargo run --features metal '--' --port 1234 --isq 8 --paged-attn plain -m microsoft/Phi-3-mini-4k-instruct` |