Mistral Server Pipeline Panic #1376

@matthewhaynesonline

Description

Describe the bug

Hi there, there seems to be a regression introduced sometime after c116ce42.

Here's the compare: https://github.com/EricLBuehler/mistral.rs/compare/c116ce42..2b56c10

When running the mistralrs-server at 2b56c10, the second request panics:

mistralrs-server-panic.mp4
thread '<unnamed>' panicked at mistralrs-core/src/pipeline/mod.rs:462:35:
Did not get any inputs. This is shocking.

let l = l.expect("Did not get any inputs. This is shocking.");
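For context, the expect fires when the step receives no inputs at all. Below is a minimal, self-contained sketch of that failure mode (an .expect() on a reduce over an empty iterator, which returns None); it only illustrates the shape of the panic site and is not the actual mistral.rs code:

// Illustration only: reducing an empty batch of scheduled sequences yields
// `None`, and the subsequent `.expect()` panics with the same message.
fn step(scheduled: Vec<Vec<u32>>) -> usize {
    let l = scheduled
        .iter()
        .map(|seq| seq.len())
        .reduce(|acc, len| acc.max(len)); // `None` when `scheduled` is empty

    let l = l.expect("Did not get any inputs. This is shocking.");
    l
}

fn main() {
    println!("{}", step(vec![vec![1, 2, 3]])); // ok: prints 3
    println!("{}", step(vec![]));              // panics like the report above
}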

output
matt@Matts-MacBook-Pro-2024  ~/Code/matthewhaynes/mistral.rs   master  git --no-pager log -1 --oneline
2b56c102 (HEAD -> master, origin/master, origin/HEAD, support-gemma-gguf) Include schemas needed for chatcompletions endpoint (#1353)
 
 matt@Matts-MacBook-Pro-2024  ~/Code/matthewhaynes/mistral.rs   master  cd mistralrs-server
 
 matt@Matts-MacBook-Pro-2024  ~/Code/matthewhaynes/mistral.rs/mistralrs-server   master  cargo build --release --features metal
    Finished `release` profile [optimized] target(s) in 0.54s
 
 matt@Matts-MacBook-Pro-2024  ~/Code/matthewhaynes/mistral.rs/mistralrs-server   master  cd ../
 
 matt@Matts-MacBook-Pro-2024  ~/Code/matthewhaynes/mistral.rs   master  ./target/release/mistralrs-server --port 8888 plain -m meta-llama/Llama-3.2-1B-Instruct
2025-05-26T19:58:11.322726Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2025-05-26T19:58:11.322894Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-05-26T19:58:11.322956Z  INFO mistralrs_server: Model kind is: normal (no adapters)
2025-05-26T19:58:11.323058Z  INFO hf_hub: Using token file found "/Users/matt/.cache/huggingface/token"
2025-05-26T19:58:11.324290Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-26T19:58:11.324385Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-26T19:58:11.505796Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model.safetensors"]
2025-05-26T19:58:11.582024Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-26T19:58:11.692240Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-26T19:58:11.825444Z  INFO mistralrs_quant::utils::log: Automatic loader type determined to be `llama`
2025-05-26T19:58:11.825508Z  INFO mistralrs_core::pipeline::normal: Prompt chunk size is 1024.
2025-05-26T19:58:12.505522Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-05-26T19:58:12.510890Z  INFO mistralrs_core::pipeline::loaders: Using automatic device mapping parameters: text[max_seq_len: 4096, max_batch_size: 1].
2025-05-26T19:58:12.510981Z  INFO mistralrs_quant::utils::log: Model has 16 repeating layers.
2025-05-26T19:58:12.510995Z  INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-05-26T19:58:12.510998Z  INFO mistralrs_quant::utils::log: Layers 0-15: metal[4294968389] (36 GB)
2025-05-26T19:58:12.511874Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-05-26T19:58:12.511903Z  INFO mistralrs_core::pipeline::normal: Model config: Config { hidden_act: Silu, hidden_size: 2048, intermediate_size: 8192, vocab_size: 128256, num_hidden_layers: 16, num_attention_heads: 32, num_key_value_heads: 8, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 131072, rope_scaling: Some(Llama3RopeConfig { factor: 32.0, low_freq_factor: 1.0, high_freq_factor: 4.0, original_max_position_embeddings: 8192, rope_type: Llama3 }), quantization_config: None, tie_word_embeddings: true }
2025-05-26T19:58:12.512490Z  INFO mistralrs_core::pipeline::normal: Applying ISQ to None
2025-05-26T19:58:12.512608Z  INFO mistralrs_core::utils::varbuilder_utils: Loading model using mmap strategy.
2025-05-26T19:58:13.507533Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2025-05-26T19:58:13.517271Z  INFO mistralrs_server: Model loaded.
2025-05-26T19:58:13.517364Z  INFO mistralrs_core: Beginning dummy run.
2025-05-26T19:58:13.518151Z  INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.
2025-05-26T19:58:15.524430Z  INFO mistralrs_core: Dummy run completed in 2.007560459s.
2025-05-26T19:58:15.525855Z  INFO mistralrs_server: Serving on http://0.0.0.0:8888.
2025-05-26T19:58:18.522281Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 0.80, Prefix cache hitrate 0.00%, 0 running, 0 waiting
2025-05-26T19:58:28.527891Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 10.00, Prefix cache hitrate 50.00%, 0 running, 0 waiting

thread '<unnamed>' panicked at mistralrs-core/src/pipeline/mod.rs:462:35:
Did not get any inputs. This is shocking.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
^C
 ✘ 
 matt@Matts-MacBook-Pro-2024  ~/Code/matthewhaynes/mistral.rs   master  RUST_BACKTRACE=1 ./target/release/mistralrs-server --port 8888 plain -m meta-llama/Llama-3.2-1B-Instruct
2025-05-26T19:58:48.115964Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2025-05-26T19:58:48.116009Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-05-26T19:58:48.116023Z  INFO mistralrs_server: Model kind is: normal (no adapters)
2025-05-26T19:58:48.116050Z  INFO hf_hub: Using token file found "/Users/matt/.cache/huggingface/token"
2025-05-26T19:58:48.116141Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-26T19:58:48.116184Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-26T19:58:48.265188Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model.safetensors"]
2025-05-26T19:58:48.342235Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-26T19:58:48.497978Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-26T19:58:48.566892Z  INFO mistralrs_quant::utils::log: Automatic loader type determined to be `llama`
2025-05-26T19:58:48.566919Z  INFO mistralrs_core::pipeline::normal: Prompt chunk size is 1024.
2025-05-26T19:58:48.572462Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-05-26T19:58:48.582439Z  INFO mistralrs_core::pipeline::loaders: Using automatic device mapping parameters: text[max_seq_len: 4096, max_batch_size: 1].
2025-05-26T19:58:48.582494Z  INFO mistralrs_quant::utils::log: Model has 16 repeating layers.
2025-05-26T19:58:48.582502Z  INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-05-26T19:58:48.582507Z  INFO mistralrs_quant::utils::log: Layers 0-15: metal[4294968389] (36 GB)
2025-05-26T19:58:48.584154Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-05-26T19:58:48.584194Z  INFO mistralrs_core::pipeline::normal: Model config: Config { hidden_act: Silu, hidden_size: 2048, intermediate_size: 8192, vocab_size: 128256, num_hidden_layers: 16, num_attention_heads: 32, num_key_value_heads: 8, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 131072, rope_scaling: Some(Llama3RopeConfig { factor: 32.0, low_freq_factor: 1.0, high_freq_factor: 4.0, original_max_position_embeddings: 8192, rope_type: Llama3 }), quantization_config: None, tie_word_embeddings: true }
2025-05-26T19:58:48.585135Z  INFO mistralrs_core::pipeline::normal: Applying ISQ to None
2025-05-26T19:58:48.585262Z  INFO mistralrs_core::utils::varbuilder_utils: Loading model using mmap strategy.
2025-05-26T19:58:49.439811Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2025-05-26T19:58:49.449003Z  INFO mistralrs_server: Model loaded.
2025-05-26T19:58:49.449051Z  INFO mistralrs_core: Beginning dummy run.
2025-05-26T19:58:49.449692Z  INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.
2025-05-26T19:58:49.540551Z  INFO mistralrs_core: Dummy run completed in 0.091494416s.
2025-05-26T19:58:49.541812Z  INFO mistralrs_server: Serving on http://0.0.0.0:8888.

thread '<unnamed>' panicked at mistralrs-core/src/pipeline/mod.rs:462:35:
Did not get any inputs. This is shocking.
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::option::expect_failed
   3: core::iter::adapters::try_process
   4: mistralrs_core::pipeline::Pipeline::step::{{closure}}
   5: mistralrs_core::engine::Engine::run::{{closure}}
   6: mistralrs_core::MistralRs::new::{{closure}}::{{closure}}::{{closure}}
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2025-05-26T19:58:54.455043Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 10.80, Prefix cache hitrate 66.67%, 1 running, 0 waiting
2025-05-26T19:58:58.932596Z  WARN mistralrs_core: Engine is dead, rebooting
2025-05-26T19:58:58.932635Z  INFO mistralrs_core: Successfully rebooted engine and updated sender + engine handler
2025-05-26T19:58:58.932921Z  INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.
2025-05-26T19:59:03.937805Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 10.00, Prefix cache hitrate 0.00%, 0 running, 0 waiting
^C
 ✘ 
 matt@Matts-MacBook-Pro-2024  ~/Code/matthewhaynes/mistral.rs   master  RUST_BACKTRACE=FULL ./target/release/mistralrs-server --port 8888 plain -m meta-llama/Llama-3.2-1B-Instruct
2025-05-26T19:59:13.516218Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2025-05-26T19:59:13.516262Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-05-26T19:59:13.516275Z  INFO mistralrs_server: Model kind is: normal (no adapters)
2025-05-26T19:59:13.516307Z  INFO hf_hub: Using token file found "/Users/matt/.cache/huggingface/token"
2025-05-26T19:59:13.516401Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-26T19:59:13.516447Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-26T19:59:13.641180Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model.safetensors"]
2025-05-26T19:59:13.714444Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-26T19:59:13.834169Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-26T19:59:13.980503Z  INFO mistralrs_quant::utils::log: Automatic loader type determined to be `llama`
2025-05-26T19:59:13.980528Z  INFO mistralrs_core::pipeline::normal: Prompt chunk size is 1024.
2025-05-26T19:59:13.984733Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-05-26T19:59:13.993146Z  INFO mistralrs_core::pipeline::loaders: Using automatic device mapping parameters: text[max_seq_len: 4096, max_batch_size: 1].
2025-05-26T19:59:13.993210Z  INFO mistralrs_quant::utils::log: Model has 16 repeating layers.
2025-05-26T19:59:13.993218Z  INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-05-26T19:59:13.993222Z  INFO mistralrs_quant::utils::log: Layers 0-15: metal[4294968389] (36 GB)
2025-05-26T19:59:13.994495Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-05-26T19:59:13.994519Z  INFO mistralrs_core::pipeline::normal: Model config: Config { hidden_act: Silu, hidden_size: 2048, intermediate_size: 8192, vocab_size: 128256, num_hidden_layers: 16, num_attention_heads: 32, num_key_value_heads: 8, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 131072, rope_scaling: Some(Llama3RopeConfig { factor: 32.0, low_freq_factor: 1.0, high_freq_factor: 4.0, original_max_position_embeddings: 8192, rope_type: Llama3 }), quantization_config: None, tie_word_embeddings: true }
2025-05-26T19:59:13.995394Z  INFO mistralrs_core::pipeline::normal: Applying ISQ to None
2025-05-26T19:59:13.995503Z  INFO mistralrs_core::utils::varbuilder_utils: Loading model using mmap strategy.
2025-05-26T19:59:14.834038Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2025-05-26T19:59:14.843804Z  INFO mistralrs_server: Model loaded.
2025-05-26T19:59:14.843864Z  INFO mistralrs_core: Beginning dummy run.
2025-05-26T19:59:14.844093Z  INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.
2025-05-26T19:59:14.928812Z  INFO mistralrs_core: Dummy run completed in 0.084942708s.
2025-05-26T19:59:14.929148Z  INFO mistralrs_server: Serving on http://0.0.0.0:8888.

thread '<unnamed>' panicked at mistralrs-core/src/pipeline/mod.rs:462:35:
Did not get any inputs. This is shocking.
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::option::expect_failed
   3: core::iter::adapters::try_process
   4: mistralrs_core::pipeline::Pipeline::step::{{closure}}
   5: mistralrs_core::engine::Engine::run::{{closure}}
   6: mistralrs_core::MistralRs::new::{{closure}}::{{closure}}::{{closure}}
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2025-05-26T19:59:19.844485Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 10.80, Prefix cache hitrate 66.67%, 1 running, 0 waiting
^C
 ✘ 
 matt@Matts-MacBook-Pro-2024  ~/Code/matthewhaynes/mistral.rs   master  RUST_BACKTRACE=full ./target/release/mistralrs-server --port 8888 plain -m meta-llama/Llama-3.2-1B-Instruct
2025-05-26T19:59:27.974382Z  INFO mistralrs_server: avx: false, neon: true, simd128: false, f16c: false
2025-05-26T19:59:27.974433Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-05-26T19:59:27.974448Z  INFO mistralrs_server: Model kind is: normal (no adapters)
2025-05-26T19:59:27.974476Z  INFO hf_hub: Using token file found "/Users/matt/.cache/huggingface/token"
2025-05-26T19:59:27.974579Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-26T19:59:27.974625Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-26T19:59:28.085527Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model.safetensors"]
2025-05-26T19:59:28.154177Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-26T19:59:28.448257Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-05-26T19:59:28.500695Z  INFO mistralrs_quant::utils::log: Automatic loader type determined to be `llama`
2025-05-26T19:59:28.500717Z  INFO mistralrs_core::pipeline::normal: Prompt chunk size is 1024.
2025-05-26T19:59:28.504446Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-05-26T19:59:28.512404Z  INFO mistralrs_core::pipeline::loaders: Using automatic device mapping parameters: text[max_seq_len: 4096, max_batch_size: 1].
2025-05-26T19:59:28.512460Z  INFO mistralrs_quant::utils::log: Model has 16 repeating layers.
2025-05-26T19:59:28.512468Z  INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-05-26T19:59:28.512473Z  INFO mistralrs_quant::utils::log: Layers 0-15: metal[4294968389] (36 GB)
2025-05-26T19:59:28.513859Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-05-26T19:59:28.513881Z  INFO mistralrs_core::pipeline::normal: Model config: Config { hidden_act: Silu, hidden_size: 2048, intermediate_size: 8192, vocab_size: 128256, num_hidden_layers: 16, num_attention_heads: 32, num_key_value_heads: 8, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 131072, rope_scaling: Some(Llama3RopeConfig { factor: 32.0, low_freq_factor: 1.0, high_freq_factor: 4.0, original_max_position_embeddings: 8192, rope_type: Llama3 }), quantization_config: None, tie_word_embeddings: true }
2025-05-26T19:59:28.514678Z  INFO mistralrs_core::pipeline::normal: Applying ISQ to None
2025-05-26T19:59:28.514787Z  INFO mistralrs_core::utils::varbuilder_utils: Loading model using mmap strategy.
2025-05-26T19:59:29.326964Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2025-05-26T19:59:29.336539Z  INFO mistralrs_server: Model loaded.
2025-05-26T19:59:29.336587Z  INFO mistralrs_core: Beginning dummy run.
2025-05-26T19:59:29.337210Z  INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.
2025-05-26T19:59:29.424327Z  INFO mistralrs_core: Dummy run completed in 0.087735458s.
2025-05-26T19:59:29.425554Z  INFO mistralrs_server: Serving on http://0.0.0.0:8888.

thread '<unnamed>' panicked at mistralrs-core/src/pipeline/mod.rs:462:35:
Did not get any inputs. This is shocking.
stack backtrace:
   0:        0x1045275cc - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h217270392019d164
   1:        0x1032a093c - core::fmt::write::he22fcab56bd3ec61
   2:        0x104525ae8 - std::io::Write::write_fmt::hb32eaafcfd249a19
   3:        0x104527484 - std::sys::backtrace::BacktraceLock::print::h115149c0b879e5c3
   4:        0x1045262ac - std::panicking::default_hook::ha0b223ccc4379930
   5:        0x1045258e0 - std::panicking::rust_panic_with_hook::h203f96c93e7ac62d
   6:        0x10455a32c - std::panicking::begin_panic_handler::{{closure}}::hcc8f653f753c0254
   7:        0x10455a29c - std::sys::backtrace::__rust_end_short_backtrace::h911de07218b69a6c
   8:        0x10455b280 - _rust_begin_unwind
   9:        0x1046ad2e0 - core::panicking::panic_fmt::h6a4014bec58fba4f
  10:        0x1046ad5e4 - core::option::expect_failed::h064f2cf84916882a
  11:        0x103c25cdc - core::iter::adapters::try_process::h15efef5839024646
  12:        0x103f83308 - mistralrs_core::pipeline::Pipeline::step::{{closure}}::h44952332cd29c76d
  13:        0x103ff60d0 - mistralrs_core::engine::Engine::run::{{closure}}::haa04be666b79b1a7
  14:        0x104006afc - mistralrs_core::MistralRs::new::{{closure}}::{{closure}}::{{closure}}::h79afb8c92fd88f28
  15:        0x103e9ea64 - std::sys::backtrace::__rust_begin_short_backtrace::h62234e08cf8beae7
  16:        0x104051610 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hdfdc2f031a50fd04
  17:        0x10455c6b4 - std::sys::pal::unix::thread::Thread::new::thread_start::h6d53b1b0c047a3b9
  18:        0x19783ec0c - __pthread_cond_wait
2025-05-26T19:59:34.342502Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 10.80, Prefix cache hitrate 66.67%, 1 running, 0 waiting
^C
 ✘ 

But c116ce42 does not panic:

mistralrs-server-no-panic.mp4
output
  matt@Matts-MacBook-Pro-2024  ~/Code/matthewhaynes/mistral.rs   master  git show c116ce42
 
 matt@Matts-MacBook-Pro-2024  ~/Code/matthewhaynes/mistral.rs   master  git checkout c116ce42
Note: switching to 'c116ce42'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at c116ce42 Don't use mmap on cuda (#1336)
 
 matt@Matts-MacBook-Pro-2024  ~/Code/matthewhaynes/mistral.rs  ➦ c116ce42  cd mistralrs-server
 
 matt@Matts-MacBook-Pro-2024  ~/Code/matthewhaynes/mistral.rs/mistralrs-server  ➦ c116ce42  cargo build --release --features metal
   Compiling num-complex v0.4.6
   Compiling mistralrs-quant v0.5.0 (/Users/matt/Code/matthewhaynes/mistral.rs/mistralrs-quant)
   Compiling mistralrs-paged-attn v0.5.0 (/Users/matt/Code/matthewhaynes/mistral.rs/mistralrs-paged-attn)
   Compiling mistralrs-core v0.5.0 (/Users/matt/Code/matthewhaynes/mistral.rs/mistralrs-core)
   Compiling pulp v0.18.22
   Compiling gemm-common v0.17.1
   Compiling gemm-f32 v0.17.1
   Compiling gemm-c32 v0.17.1
   Compiling gemm-c64 v0.17.1
   Compiling gemm-f64 v0.17.1
   Compiling gemm-f16 v0.17.1
   Compiling gemm v0.17.1
   Compiling candle-core v0.8.0 (https://github.com/EricLBuehler/candle.git?rev=cb2d8f5#cb2d8f59)
   Compiling candle-nn v0.8.0 (https://github.com/EricLBuehler/candle.git?rev=cb2d8f5#cb2d8f59)
   Compiling mistralrs-vision v0.5.0 (/Users/matt/Code/matthewhaynes/mistral.rs/mistralrs-vision)
   Compiling mistralrs-server v0.5.0 (/Users/matt/Code/matthewhaynes/mistral.rs/mistralrs-server)
    Finished `release` profile [optimized] target(s) in 1m 42s
Here is the example curl request:
curl -X 'POST' \
  'http://localhost:8888/v1/chat/completions' \
  -H 'accept: */*' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "meta-llama/Llama-3.2-1B-Instruct",
  "messages": [{
    "role": "user",
    "content": "hi!"
  }]
}'
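
In case it helps, here is a small Rust sketch of the same two-request flow. It assumes the server above is running on localhost:8888 and uses the reqwest blocking client plus serde_json (neither is part of this repo); per the logs above, it is the second request that hits the panic on 2b56c10:

// Reproduction sketch: send the same chat completion request twice.
// Requires the `reqwest` crate (blocking + json features) and `serde_json`.
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    let body = json!({
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "messages": [{ "role": "user", "content": "hi!" }]
    });

    // On 2b56c10 the first request succeeds; the second one triggers the
    // panic shown above and the engine is rebooted.
    for i in 1..=2 {
        let resp = client
            .post("http://localhost:8888/v1/chat/completions")
            .json(&body)
            .send()?;
        println!("request {i}: status {}", resp.status());
    }
    Ok(())
}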

I'd be happy to take a look if you have any thoughts.

Latest commit or version

2b56c10
