
Unreasonably long model load / generation times #1505

@LukeSutor

Description

Describe the bug

I have a simple program that measures how long it takes Qwen3 1.7B Q8_0 to load and generate a response:

use anyhow::Result;
use mistralrs::{
    GgufModelBuilder, PagedAttentionMetaBuilder, TextMessageRole, TextMessages,
};
use std::time::Instant;

#[tokio::main]
async fn main() -> Result<()> {
    println!("Loading Qwen3 model...");
    let model_load_start = Instant::now();
    
    // Load the model from a local GGUF file; logging and paged attention are enabled below.
    let model = GgufModelBuilder::new(
        "C:/path/to/model/",
        vec!["Qwen3-1.7B-Q8_0.gguf"],
    )
    .with_logging()
    .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
    .build()
    .await?;

    let model_load_time = model_load_start.elapsed();
    println!("Model loaded successfully!");

    let messages = TextMessages::new().add_message(
        TextMessageRole::User,
        "hello /no_think",
    );

    println!("Sending request to model...");
    let response_start = Instant::now();
    
    // Send the chat request
    let response = model.send_chat_request(messages).await?;
    
    let response_time = response_start.elapsed();
    let total_time = model_load_start.elapsed();

    println!("\n--- Model Response ---");
    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    
    println!("\n--- Timing Statistics ---");
    println!("Model loading time: {:.2?}", model_load_time);
    println!("Response generation time: {:.2?}", response_time);
    println!("Total time: {:.2?}", total_time);
    
    println!("\n--- Performance Stats ---");
    println!("Average prompt tokens per second: {:.2}", response.usage.avg_prompt_tok_per_sec);
    println!("Average completion tokens per second: {:.2}", response.usage.avg_compl_tok_per_sec);

    Ok(())
}

Here are the timing and performance stats from running on CPU (Ryzen 7 5800H, 95%+ utilization):

--- Timing Statistics ---
Model loading time: 27.36s
Response generation time: 140.01s
Total time: 167.38s

--- Performance Stats ---
Average prompt tokens per second: 0.20
Average completion tokens per second: 0.21

The same generation with llama.cpp takes under 5 s total (model loading plus generation). After reviewing #903 and seeing how close mistral.rs and llama.cpp are in performance there, I'm wondering whether this speed discrepancy is caused by something I'm doing wrong.
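
In case it helps narrow things down, I'm also planning to time a second, identical request right after the first, to see whether the cost is a one-time warm-up or sustained low throughput. This is a rough, untested sketch that would go just before Ok(()) in the program above, reusing only the calls already shown there:

    // Untested sketch: time a second, identical request to separate any
    // one-time warm-up cost from steady-state generation throughput.
    let second_start = Instant::now();
    let second_response = model
        .send_chat_request(
            TextMessages::new().add_message(TextMessageRole::User, "hello /no_think"),
        )
        .await?;
    let second_time = second_start.elapsed();

    println!("Second response generation time: {:.2?}", second_time);
    println!(
        "Second-request completion tokens per second: {:.2}",
        second_response.usage.avg_compl_tok_per_sec
    );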

Is there anything I can do to speed up these model load and generation times? Thank you!

Latest commit or version

Running from the latest commit, pulled directly from GitHub.
