
Unreasonably long model load / generation times #1505

@LukeSutor

Description

Describe the bug

I have a simple program that measures how long it takes Qwen3 1.7B Q8_0 to load and generate a response:

use anyhow::Result;
use mistralrs::{
    GgufModelBuilder, PagedAttentionMetaBuilder, TextMessageRole, TextMessages,
};
use std::time::Instant;

#[tokio::main]
async fn main() -> Result<()> {
    println!("Loading Qwen3 model...");
    let model_load_start = Instant::now();
    
    // Load the model from a local GGUF file; logging and paged attention are enabled below.
    let model = GgufModelBuilder::new(
        "C:/path/to/model/",
        vec!["Qwen3-1.7B-Q8_0.gguf"],
    )
    .with_logging()
    .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
    .build()
    .await?;

    let model_load_time = model_load_start.elapsed();
    println!("Model loaded successfully!");

    let messages = TextMessages::new().add_message(
        TextMessageRole::User,
        "hello /no_think",
    );

    println!("Sending request to model...");
    let response_start = Instant::now();
    
    // Send the chat request
    let response = model.send_chat_request(messages).await?;
    
    let response_time = response_start.elapsed();
    let total_time = model_load_start.elapsed();

    println!("\n--- Model Response ---");
    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    
    println!("\n--- Timing Statistics ---");
    println!("Model loading time: {:.2?}", model_load_time);
    println!("Response generation time: {:.2?}", response_time);
    println!("Total time: {:.2?}", total_time);
    
    println!("\n--- Performance Stats ---");
    println!("Average prompt tokens per second: {:.2}", response.usage.avg_prompt_tok_per_sec);
    println!("Average completion tokens per second: {:.2}", response.usage.avg_compl_tok_per_sec);

    Ok(())
}

Here are the timing and performance stats from running on CPU (Ryzen 7 5800H, 95%+ utilization):

--- Timing Statistics ---
Model loading time: 27.36s
Response generation time: 140.01s
Total time: 167.38s

--- Performance Stats ---
Average prompt tokens per second: 0.20
Average completion tokens per second: 0.21

The same generation with llama.cpp takes under 5 s total (model loading plus generation). After reviewing #903 and seeing how close mistral.rs and llama.cpp are in performance there, I'm wondering whether this speed discrepancy is caused by something I'm doing wrong.
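
In case it helps narrow things down, I'm also planning to time a second, identical request right after the first, to see whether the cost is a one-time warm-up or sustained low throughput. This is a rough, untested sketch that would go just before Ok(()) in the program above, reusing only the calls already shown there:

    // Untested sketch: time a second, identical request to separate any
    // one-time warm-up cost from steady-state generation throughput.
    let second_start = Instant::now();
    let second_response = model
        .send_chat_request(
            TextMessages::new().add_message(TextMessageRole::User, "hello /no_think"),
        )
        .await?;
    let second_time = second_start.elapsed();

    println!("Second response generation time: {:.2?}", second_time);
    println!(
        "Second-request completion tokens per second: {:.2}",
        second_response.usage.avg_compl_tok_per_sec
    );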

Is there anything I can do to speed up these model load and generation times? Thank you!

Latest commit or version

Running from the latest commit, pulled directly from GitHub.
