Closed
Labels: bug (Something isn't working)
Describe the bug
I have a simple program testing the time it takes for Qwen3 1.7B Q8 to generate a response:
use anyhow::Result;
use mistralrs::{
    GgufModelBuilder, PagedAttentionMetaBuilder, TextMessageRole, TextMessages,
};
use std::time::Instant;

#[tokio::main]
async fn main() -> Result<()> {
    println!("Loading Qwen3 model...");
    let model_load_start = Instant::now();

    // Load the GGUF model with paged attention enabled.
    let model = GgufModelBuilder::new(
        "C:/path/to/model/",
        vec!["Qwen3-1.7B-Q8_0.gguf"],
    )
    .with_logging()
    .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
    .build()
    .await?;
    let model_load_time = model_load_start.elapsed();
    println!("Model loaded successfully!");

    // "/no_think" disables Qwen3's thinking mode, so only the reply is generated.
    let messages = TextMessages::new().add_message(
        TextMessageRole::User,
        "hello /no_think",
    );

    println!("Sending request to model...");
    let response_start = Instant::now();

    // Send the chat request
    let response = model.send_chat_request(messages).await?;
    let response_time = response_start.elapsed();
    let total_time = model_load_start.elapsed();

    println!("\n--- Model Response ---");
    println!("{}", response.choices[0].message.content.as_ref().unwrap());

    println!("\n--- Timing Statistics ---");
    println!("Model loading time: {:.2?}", model_load_time);
    println!("Response generation time: {:.2?}", response_time);
    println!("Total time: {:.2?}", total_time);

    println!("\n--- Performance Stats ---");
    println!("Average prompt tokens per second: {:.2}", response.usage.avg_prompt_tok_per_sec);
    println!("Average completion tokens per second: {:.2}", response.usage.avg_compl_tok_per_sec);

    Ok(())
}

Here are the timing and performance stats from a CPU run (95%+ utilization; Ryzen 7 5800H):
--- Timing Statistics ---
Model loading time: 27.36s
Response generation time: 140.01s
Total time: 167.38s
--- Performance Stats ---
Average prompt tokens per second: 0.20
Average completion tokens per second: 0.21
Doing the same generation in llama.cpp takes <5s total (model loading and generation). After reviewing #903, where mistral.rs and llama.cpp show similar performance, I'm wondering whether this speed discrepancy is due to me doing something wrong.
Is there anything I can do to speed up these model load and generation times? Thank you!
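For reference, the same load without paged attention would look like the sketch below (identical builder calls, just dropping with_paged_attn), in case the paged-attention path is a factor on CPU. I haven't confirmed whether this changes the numbers:

use anyhow::Result;
use mistralrs::GgufModelBuilder;

#[tokio::main]
async fn main() -> Result<()> {
    // Same GGUF load as above, minus paged attention; the request and
    // timing code from the program above would be unchanged.
    let model = GgufModelBuilder::new(
        "C:/path/to/model/",
        vec!["Qwen3-1.7B-Q8_0.gguf"],
    )
    .with_logging()
    .build()
    .await?;
    let _ = model;
    Ok(())
}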
Latest commit or version
Running from the latest commit, pulled directly from GitHub.