
Conversation

@EricLBuehler (Owner) commented May 24, 2025

  • Prefix cacher handles pagedattention seqs
  • Updated sequence logic
  • Updated inputs processing

Currently working!

Summary by CodeRabbit

  • New Features

    • Added support for advanced block-based prefix caching, improving efficiency and cache hit rates for large sequences.
    • Enhanced sequence handling to manage both logical and physical token blocks, enabling more flexible memory management.
  • Improvements

    • Simplified and unified interfaces for processing paged attention metadata across all input processors, removing lifetime parameters for easier use.
    • Improved thread safety and concurrency by introducing shared, mutex-protected access to block engines and related data structures.
    • Refined block allocation logic to support prefilled physical blocks and more robust memory calculations.
  • Bug Fixes

    • Corrected block size and memory configuration handling for more reliable cache setup and allocation.
  • Refactor

    • Streamlined method and trait signatures related to block engines, schedulers, and input processors for consistency and maintainability.

@coderabbitai bot commented May 24, 2025

Warning

Rate limit exceeded

@EricLBuehler has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 20 minutes and 4 seconds before requesting another review.


📥 Commits

Reviewing files that changed from the base of the PR and between 01cedfa and 3784876.

📒 Files selected for processing (1)
  • docs/PAGED_ATTENTION.md (1 hunks)

Walkthrough

This update refactors the PagedAttention and block engine infrastructure to remove explicit lifetime annotations, replacing mutable references with thread-safe, shared ownership via Arc<Mutex<_>>. It introduces new block-based caching mechanisms, updates method signatures for concurrency, and enhances block management. The changes propagate through input processors, sequence management, schedulers, and cache handling, unifying interfaces and improving modularity.
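As a rough sketch of what this ownership change looks like in practice (stand-in types, assuming a tokio runtime with the macros and sync features; not the actual mistral.rs code):

use std::sync::Arc;
use tokio::sync::Mutex;

// Stand-in for the real block engine; only the shape of the ownership matters here.
struct BlockEngine { free_gpu_blocks: usize }

// Before the refactor this metadata held a `&'a mut BlockEngine`; afterwards it
// owns a shared handle and takes the lock only around the critical section.
struct PagedAttentionMeta {
    block_engine: Arc<Mutex<BlockEngine>>,
}

impl PagedAttentionMeta {
    async fn free_gpu_blocks(&self) -> usize {
        self.block_engine.lock().await.free_gpu_blocks
    }
}

#[tokio::main]
async fn main() {
    let engine = Arc::new(Mutex::new(BlockEngine { free_gpu_blocks: 16 }));
    // Several components can hold clones of the same engine handle.
    let meta = PagedAttentionMeta { block_engine: engine.clone() };
    assert_eq!(meta.free_gpu_blocks().await, 16);
}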

Changes

File(s) and change summary:

  • src/pipeline/inputs_processor.rs, src/pipeline/mod.rs, src/pipeline/speculative.rs, src/pipeline/speech.rs, src/vision_models/*/inputs_processor.rs: Removed lifetime parameters from PagedAttentionMeta and related method signatures; updated handling to use owned/shared types.
  • src/paged_attention/block_engine.rs, src/dummy_paged_attention/block_engine.rs: Added Debug, Clone, and Hash to LogicalTokenBlock; added accessors; enhanced allocation logic to support prefilled blocks; fixed CPU allocator flag.
  • src/paged_attention/block_engine_sequence.rs, src/dummy_paged_attention/block_engine_sequence.rs: Changed trait method to return slice of logical blocks; added method for taking prefilled physical blocks.
  • src/paged_attention/mod.rs, src/dummy_paged_attention/mod.rs: Publicly re-exported PhysicalTokenBlock; implemented cache configuration logic.
  • src/paged_attention/scheduler.rs, src/dummy_paged_attention/scheduler.rs: Wrapped BlockEngine in Arc<Mutex<_>>; updated all usages to lock for thread safety; updated method signatures for concurrency.
  • src/scheduler/mod.rs, src/scheduler/default_scheduler.rs: Changed scheduler trait methods to return owned or shared types instead of references; updated for thread safety.
  • src/engine/add_request.rs, src/engine/mod.rs: Updated prefix cache handling to support both normal and paged (block-based) cache variants; improved scheduler/block engine initialization.
  • src/sequence.rs: Added support for physical token blocks in paged attention metadata; added prefill methods; refactored token-to-block logic; updated block engine sequence trait implementation.
  • src/prefix_cacher.rs: Introduced block-based prefix caching; added new cache structures and lookup logic; updated cache manager to support both token and block-based caches.
  • src/diffusion_models/processor.rs: Updated method signatures to remove lifetimes from paged attention metadata.
  • src/vision_models/*/inputs_processor.rs: Updated input processor method signatures to accept owned paged attention metadata.
  • mistralrs-quant/src/metal_kernels/mod.rs, mistralrs-quant/src/utils/ops.rs: Replaced manual ceiling division with div_ceil; minor optimizations and signature simplification.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Scheduler
    participant BlockEngine
    participant PrefixCacheManager
    participant Sequence

    Client->>Scheduler: Add request (with tokens)
    Scheduler->>PrefixCacheManager: search_for_matching_cache(tokens)
    alt Block-based cache enabled
        PrefixCacheManager->>BlockEngine: Get logical/physical blocks
        PrefixCacheManager-->>Scheduler: Return MatchingCache::Paged
    else
        PrefixCacheManager-->>Scheduler: Return MatchingCache::Normal
    end
    Scheduler->>BlockEngine: allocate(mut Sequence)
    BlockEngine->>Sequence: assign blocks (may use prefilled)
    Scheduler->>Sequence: prefill_v2_normal or prefill_v2_paged
    Sequence-->>Scheduler: Sequence ready

Poem

In fields of memory, blocks align,
With mutex guards, their fates entwine.
No lifetimes chase the rabbits here—
Just shared Arc paths, all crystal clear!
Caches leap, and tokens bound,
Prefilled blocks are hopping 'round.
🐇✨ Rusty dreams, concurrency found!



@github-actions bot commented May 24, 2025

Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                3           62           53            0            9
 CSS                     1          428          366           12           50
 Dockerfile              1           41           22           10            9
 HTML                    1           58           46            4            8
 JavaScript              7         1221          915          169          137
 JSON                   12          107          106            0            1
 Makefile                1            6            5            0            1
 Python                 86         4046         3414          158          474
 Shell                   1           63           26           18           19
 Plain Text              3         3723            0         2413         1310
 TOML                   20          607          549           10           48
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       3            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               56         4932            0         3761         1171
 |- BASH                 9           99           96            0            3
 |- JSON                 1           12           12            0            0
 |- Python               7          121          109            0           12
 |- Rust                22          757          634            1          122
 |- TOML                 2           75           63            0           12
 (Total)                           5996          914         3762         1320
-------------------------------------------------------------------------------
 Rust                  386       130170       116052         2765        11353
 |- Markdown           178         2235           29         1996          210
 (Total)                         132405       116081         4761        11563
===============================================================================
 Total                 583       145485       121573         9322        14590
===============================================================================

@EricLBuehler marked this pull request as ready for review May 26, 2025 02:32
@EricLBuehler merged commit 9387241 into master May 26, 2025 (12 checks passed)
@EricLBuehler deleted the paged_attn_prefix_caching branch May 26, 2025 02:41

@coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (9)
mistralrs-core/src/paged_attention/block_engine.rs (1)

276-276: Fix typo in variable name.

-            for _logcical_idx in 0..seq.logical_token_blocks().len() {
+            for _logical_idx in 0..seq.logical_token_blocks().len() {
mistralrs-core/src/dummy_paged_attention/block_engine.rs (1)

276-276: Fix typo in variable name.

-            for _logcical_idx in 0..seq.logical_token_blocks().len() {
+            for _logical_idx in 0..seq.logical_token_blocks().len() {
mistralrs-core/src/dummy_paged_attention/mod.rs (3)

83-91: Confusing macro name - consider renaming for clarity.

The macro ctxt_to_blocks! calculates total bytes needed for a context length, not the number of blocks. This naming is misleading and could cause confusion.

Consider renaming to ctxt_to_bytes! or converting these macros to properly documented functions for better clarity:

/// Calculate the number of blocks that fit in the given memory size (in bytes)
fn memory_to_blocks(
    memory_bytes: usize,
    dtype_size: usize,
    block_size: usize,
    config: &dyn ModelConfigLike,
) -> usize {
    memory_bytes
        / dtype_size
        / block_size
        / config.num_kv_heads()
        / (config.k_head_dim().max(config.v_head_dim()))
        / config.num_layers()
        / 2
}

/// Calculate the total memory (in bytes) needed for a given context length
fn context_to_memory(
    context_len: usize,
    dtype_size: usize,
    config: &dyn ModelConfigLike,
) -> usize {
    context_len
        * dtype_size
        * config.num_kv_heads()
        * (config.k_head_dim().max(config.v_head_dim()))
        * config.num_layers()
        * 2
}

131-135: Remove commented-out code.

The commented code appears to be an incomplete implementation that should be removed to keep the codebase clean:

-    // // Cap at kv cache for max seq len
-    // let mem_for_toks =
-    //     ctxt_to_blocks!(config.max_seq_len(), dtype_size, block_size, config) / SIZE_IN_MB;
-    // let mem_gpu = min_mem_gpu.min(mem_for_toks);
     let mem_gpu = min_mem_gpu;

145-145: Split long log line for better readability.

-        info!("Using PagedAttention with block size {block_size} and {num_gpu_blocks} GPU blocks: available context length is {} tokens", num_gpu_blocks*block_size);
+        info!(
+            "Using PagedAttention with block size {block_size} and {num_gpu_blocks} GPU blocks: \
+             available context length is {} tokens",
+            num_gpu_blocks * block_size
+        );
mistralrs-core/src/sequence.rs (2)

517-552: New prefill methods properly implement normal and paged variants.

The implementation correctly distinguishes between normal and paged attention prefill scenarios. The builder pattern is well-suited for this use case.

Consider adding documentation comments to explain:

  • When each method should be used
  • The relationship between logical and physical blocks
  • The meaning of the offset parameter

638-661: Proper synchronization for block engine operations.

The method correctly handles the reallocation by:

  1. Clearing existing logical blocks
  2. Rebuilding them from the new tokens
  3. Using a single mutex lock for both free and allocate operations

Consider combining the free_sequence and allocate calls into a single reallocate method on the block engine to ensure atomicity and reduce the critical section duration.
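A minimal sketch of that idea, with stand-in types and assumed method signatures (not the actual block engine API):

// Sketch only: placeholder types; the real BlockEngine/Sequence APIs differ.
struct BlockEngine;
struct Sequence { id: usize }

impl BlockEngine {
    fn free_sequence(&mut self, _id: usize) { /* release this sequence's physical blocks */ }
    fn allocate(&mut self, _seq: &mut Sequence) { /* assign fresh physical blocks */ }

    // Hypothetical combined helper: both steps happen under one &mut self
    // borrow, so a caller holding the Arc<Mutex<BlockEngine>> lock performs
    // free + allocate atomically in a single critical section.
    fn reallocate(&mut self, seq: &mut Sequence) {
        self.free_sequence(seq.id);
        self.allocate(seq);
    }
}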

mistralrs-core/src/prefix_cacher.rs (2)

24-33: Consider hash collision handling.

While using DefaultHasher is reasonable for this use case, consider documenting or handling potential hash collisions, especially as the cache grows.

Consider using a cryptographic hash or adding collision detection/resolution logic if the cache is expected to grow significantly.
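One low-cost way to make collisions harmless, sketched here with placeholder types, is to keep the full key inside the entry and re-verify it on lookup:

use std::collections::HashMap;

// `Block` stands in for LogicalTokenBlock; `Entry` for the cached payload.
#[derive(Clone, PartialEq)]
struct Block { tokens: Vec<u32> }

struct Entry { blocks: Vec<Block>, payload: String }

#[derive(Default)]
struct Cache { map: HashMap<u64, Vec<Entry>> }

impl Cache {
    fn get(&self, hash: u64, blocks: &[Block]) -> Option<&Entry> {
        // Several keys may share one hash; compare the actual blocks to pick
        // the real match instead of trusting the hash alone.
        self.map.get(&hash)?.iter().find(|e| e.blocks == blocks)
    }
}

fn main() {
    let mut cache = Cache::default();
    let key = vec![Block { tokens: vec![1, 2, 3] }];
    cache.map.insert(42, vec![Entry { blocks: key.clone(), payload: "kv".into() }]);
    // A colliding hash with different blocks is rejected by the full comparison.
    assert!(cache.get(42, &[Block { tokens: vec![9] }]).is_none());
    assert!(cache.get(42, &key).is_some());
}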


238-272: Optimize block cache lookup performance.

The current implementation iterates through all cached entries to find the best match. For large caches, this could become a performance bottleneck.

Consider using a more efficient data structure like a trie or prefix tree for faster prefix matching, especially as the cache size grows.
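For illustration, a token-level trie along these lines would reduce the lookup to a single walk over the query tokens; all names here are illustrative, not part of the codebase:

use std::collections::HashMap;

#[derive(Default)]
struct TrieNode {
    children: HashMap<u32, TrieNode>,
    // Set when a cached prefix ends exactly at this node.
    cache_id: Option<usize>,
}

impl TrieNode {
    fn insert(&mut self, toks: &[u32], cache_id: usize) {
        let mut node = self;
        for &t in toks {
            node = node.children.entry(t).or_default();
        }
        node.cache_id = Some(cache_id);
    }

    // Walk the trie along the query tokens and return the deepest cached
    // prefix seen, in O(len(toks)) instead of O(#entries * len).
    fn longest_match(&self, toks: &[u32]) -> Option<(usize, usize)> {
        let mut node = self;
        let mut best = None;
        for (i, &t) in toks.iter().enumerate() {
            match node.children.get(&t) {
                Some(next) => {
                    node = next;
                    if let Some(id) = node.cache_id {
                        best = Some((id, i + 1)); // (cache entry, matched length)
                    }
                }
                None => break,
            }
        }
        best
    }
}

fn main() {
    let mut root = TrieNode::default();
    root.insert(&[1, 2, 3, 4], 0);
    root.insert(&[1, 2], 1);
    // Query [1, 2, 3, 9]: the deepest cached prefix is [1, 2] (entry 1, length 2).
    assert_eq!(root.longest_match(&[1, 2, 3, 9]), Some((1, 2)));
}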

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8e7a30f and 01cedfa.

📒 Files selected for processing (34)
  • mistralrs-core/src/diffusion_models/processor.rs (1 hunks)
  • mistralrs-core/src/dummy_paged_attention/block_engine.rs (8 hunks)
  • mistralrs-core/src/dummy_paged_attention/block_engine_sequence.rs (1 hunks)
  • mistralrs-core/src/dummy_paged_attention/mod.rs (4 hunks)
  • mistralrs-core/src/dummy_paged_attention/scheduler.rs (9 hunks)
  • mistralrs-core/src/engine/add_request.rs (2 hunks)
  • mistralrs-core/src/engine/mod.rs (2 hunks)
  • mistralrs-core/src/paged_attention/block_engine.rs (8 hunks)
  • mistralrs-core/src/paged_attention/block_engine_sequence.rs (1 hunks)
  • mistralrs-core/src/paged_attention/mod.rs (1 hunks)
  • mistralrs-core/src/paged_attention/scheduler.rs (9 hunks)
  • mistralrs-core/src/pipeline/inputs_processor.rs (11 hunks)
  • mistralrs-core/src/pipeline/mod.rs (2 hunks)
  • mistralrs-core/src/pipeline/speculative.rs (1 hunks)
  • mistralrs-core/src/pipeline/speech.rs (1 hunks)
  • mistralrs-core/src/prefix_cacher.rs (5 hunks)
  • mistralrs-core/src/scheduler/default_scheduler.rs (2 hunks)
  • mistralrs-core/src/scheduler/mod.rs (1 hunks)
  • mistralrs-core/src/sequence.rs (10 hunks)
  • mistralrs-core/src/vision_models/gemma3/inputs_processor.rs (1 hunks)
  • mistralrs-core/src/vision_models/idefics2/idefics2_input_processor.rs (1 hunks)
  • mistralrs-core/src/vision_models/idefics3/inputs_processor.rs (1 hunks)
  • mistralrs-core/src/vision_models/llama4/inputs_processor.rs (1 hunks)
  • mistralrs-core/src/vision_models/llava/llava_inputs_processor.rs (1 hunks)
  • mistralrs-core/src/vision_models/llava/llava_next_inputs_processor.rs (1 hunks)
  • mistralrs-core/src/vision_models/minicpmo/inputs_processor.rs (1 hunks)
  • mistralrs-core/src/vision_models/mistral3/inputs_processor.rs (1 hunks)
  • mistralrs-core/src/vision_models/mllama/inputs_processor.rs (1 hunks)
  • mistralrs-core/src/vision_models/phi3/phi3_inputs_processor.rs (1 hunks)
  • mistralrs-core/src/vision_models/phi4/inputs_processor.rs (1 hunks)
  • mistralrs-core/src/vision_models/qwen2_5_vl/inputs_processor.rs (1 hunks)
  • mistralrs-core/src/vision_models/qwen2vl/inputs_processor.rs (1 hunks)
  • mistralrs-quant/src/metal_kernels/mod.rs (7 hunks)
  • mistralrs-quant/src/utils/ops.rs (4 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (5)
mistralrs-core/src/scheduler/mod.rs (3)
mistralrs-core/src/dummy_paged_attention/scheduler.rs (3)
  • block_tables (381-383)
  • block_size (384-386)
  • block_engine (390-392)
mistralrs-core/src/paged_attention/scheduler.rs (3)
  • block_tables (381-383)
  • block_size (384-386)
  • block_engine (390-392)
mistralrs-core/src/scheduler/default_scheduler.rs (3)
  • block_tables (330-332)
  • block_size (333-335)
  • block_engine (337-339)
mistralrs-core/src/dummy_paged_attention/block_engine_sequence.rs (1)
mistralrs-core/src/sequence.rs (4)
  • blocks_to_add_new_tok (354-365)
  • take_physical_blocks_prefill (382-391)
  • get_id (367-369)
  • logical_token_blocks (371-380)
mistralrs-core/src/paged_attention/scheduler.rs (3)
mistralrs-core/src/paged_attention/block_engine.rs (7)
  • new (19-25)
  • new (167-183)
  • new (191-207)
  • new (235-243)
  • can_allocate (249-260)
  • block_size (27-29)
  • block_size (245-247)
mistralrs-core/src/dummy_paged_attention/scheduler.rs (8)
  • new (52-65)
  • block_engine (390-392)
  • _allocate (343-345)
  • _free (347-349)
  • block_tables (381-383)
  • block_size (384-386)
  • free_finished_sequence_groups (242-256)
  • free_finished_sequence_groups (387-389)
mistralrs-core/src/scheduler/mod.rs (4)
  • block_engine (65-65)
  • block_tables (63-63)
  • block_size (64-64)
  • free_finished_sequence_groups (60-60)
mistralrs-core/src/paged_attention/block_engine.rs (3)
mistralrs-core/src/dummy_paged_attention/scheduler.rs (2)
  • block_size (384-386)
  • new (52-65)
mistralrs-core/src/dummy_paged_attention/block_engine.rs (14)
  • block_size (27-29)
  • block_size (245-247)
  • num_tokens (31-33)
  • hash (57-59)
  • hash (110-112)
  • increment_refcount (71-73)
  • fmt (79-90)
  • new (19-25)
  • new (167-183)
  • new (191-207)
  • new (235-243)
  • can_allocate (249-260)
  • allocate (146-150)
  • allocate (262-281)
mistralrs-core/src/paged_attention/scheduler.rs (2)
  • block_size (384-386)
  • new (52-65)
mistralrs-core/src/sequence.rs (5)
mistralrs-core/src/dummy_paged_attention/block_engine_sequence.rs (2)
  • logical_token_blocks (9-9)
  • take_physical_blocks_prefill (7-7)
mistralrs-core/src/paged_attention/block_engine_sequence.rs (2)
  • logical_token_blocks (9-9)
  • take_physical_blocks_prefill (7-7)
mistralrs-core/src/paged_attention/block_engine.rs (6)
  • block_size (27-29)
  • block_size (245-247)
  • new (19-25)
  • new (167-183)
  • new (191-207)
  • new (235-243)
mistralrs-core/src/scheduler/mod.rs (2)
  • block_size (64-64)
  • block_engine (65-65)
mistralrs-core/src/engine/mod.rs (1)
  • new (95-143)
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: Clippy
  • GitHub Check: Test Suite (windows-latest, stable)
  • GitHub Check: Docs
  • GitHub Check: Test Suite (macOS-latest, stable)
  • GitHub Check: Test Suite (ubuntu-latest, stable)
  • GitHub Check: Check (windows-latest, stable)
🔇 Additional comments (57)
mistralrs-core/src/paged_attention/mod.rs (1)

14-14: LGTM! Clean API extension for PhysicalTokenBlock.

The addition of PhysicalTokenBlock to the public exports is consistent with the PR's objective of implementing prefix caching for PagedAttention sequences.

mistralrs-quant/src/utils/ops.rs (4)

923-923: Excellent refactoring to use idiomatic ceiling division.

Replacing manual ceiling division (size_sorted_axis + tn - 1) / tn with size_sorted_axis.div_ceil(tn) improves readability and reduces the chance of arithmetic errors.
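For reference, the two forms are equivalent for unsigned integers (usize::div_ceil has been available in std since Rust 1.73); a tiny standalone check with arbitrary values:

fn main() {
    let size_sorted_axis: usize = 1000;
    let tn: usize = 64;
    // Manual ceiling division vs. the std method; both give 16 here.
    assert_eq!((size_sorted_axis + tn - 1) / tn, size_sorted_axis.div_ceil(tn));
}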


934-934: Good use of div_ceil for cleaner block count calculation.

The ceiling division refactoring continues the pattern of using more idiomatic Rust methods for mathematical operations.


1030-1030: Consistent application of div_ceil refactoring.

The same improvement is applied to the Sort operation's metal_fwd implementation, maintaining consistency across the codebase.


1041-1041: Final ceiling division improvement completes the refactoring.

All instances of manual ceiling division in the Metal kernel dispatch logic have been properly updated to use the more expressive div_ceil method.

mistralrs-core/src/engine/mod.rs (3)

109-111: Simplified no_prefix_cache computation improves readability.

The refactored boolean logic is cleaner and easier to follow than the previous implementation.


121-123: Good refactoring to create scheduler once and extract block_engine.

Creating the scheduler once and then extracting the block_engine is more efficient than potentially creating multiple instances. The pattern of immediately calling block_engine() after creation is safe since there's no concurrent access at this point.


128-128: Proper shared ownership pattern for scheduler and block_engine.

Using the cloned scheduler reference and passing the extracted block_engine to PrefixCacheManagerV2 correctly implements the new architecture where the prefix cache manager needs access to the block engine for prefix caching support.

Also applies to: 135-135

mistralrs-core/src/vision_models/idefics3/inputs_processor.rs (1)

114-114: LGTM: Signature update aligns with architectural refactoring.

The removal of the lifetime parameter from PagedAttentionMeta<'_> to PagedAttentionMeta is consistent with the broader refactoring to use thread-safe shared ownership via Arc<Mutex<BlockEngine>>. This change supports better concurrency handling without affecting the method's internal logic.

mistralrs-core/src/engine/add_request.rs (2)

3-3: LGTM: Import supports the new prefix caching functionality.

The addition of MatchingCache import is necessary for the new cache matching logic below.


538-563: Excellent implementation of dual cache type support.

The prefix caching logic correctly handles both Normal and Paged cache types:

  1. Consistent handling: Both branches properly call seq.keep_num_images(images_to_keep) and log cache hits
  2. Appropriate methods: Uses prefill_v2_normal for normal caches and prefill_v2_paged for paged caches
  3. Fallback behavior: Returns the original sequence when no cache is found
  4. Parameter passing: Correctly passes cache-specific data (normal cache vs. logical/physical blocks)

This implementation enables flexible caching strategies while maintaining backward compatibility.
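As a rough, self-contained sketch of that dispatch (the enum payloads here are placeholders; only the variant names and the prefill_v2_normal/prefill_v2_paged split come from the PR):

// Stand-in enum mirroring the two cache hit variants; the payload types are
// placeholders, not the real cache/block structures.
enum MatchingCache {
    Normal { normal_cache: Vec<u8> },
    Paged { logical_blocks: usize, physical_blocks: usize },
}

fn describe(hit: Option<&MatchingCache>) -> &'static str {
    match hit {
        // Token-based hit: restore via Sequence::prefill_v2_normal.
        Some(MatchingCache::Normal { .. }) => "prefill_v2_normal",
        // Block-based hit: restore via Sequence::prefill_v2_paged.
        Some(MatchingCache::Paged { .. }) => "prefill_v2_paged",
        // No matching prefix: the original sequence is returned unchanged.
        None => "no prefill",
    }
}

fn main() {
    let hit = MatchingCache::Paged { logical_blocks: 4, physical_blocks: 4 };
    assert_eq!(describe(Some(&hit)), "prefill_v2_paged");
}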

mistralrs-core/src/vision_models/llava/llava_next_inputs_processor.rs (1)

96-96: LGTM: Consistent signature update across vision processors.

This change mirrors the same architectural update seen in other vision model input processors, removing the lifetime parameter from PagedAttentionMeta to support the new thread-safe shared ownership model. The consistency across all vision processors ensures uniform behavior.

mistralrs-core/src/vision_models/qwen2_5_vl/inputs_processor.rs (1)

126-126: LGTM: Completes the consistent signature refactoring.

This final signature update maintains consistency with the other vision model input processors, completing the transition from lifetime-bound PagedAttentionMeta<'_> to owned PagedAttentionMeta values. This supports the new Arc<Mutex<BlockEngine>> architecture for thread-safe block engine access.

mistralrs-core/src/diffusion_models/processor.rs (1)

72-72: LGTM: Lifetime parameter removal aligns with architectural refactor.

The removal of the lifetime parameter from PagedAttentionMeta is consistent with the broader refactor to use Arc<Mutex<BlockEngine>> for thread-safe shared ownership. Since this parameter is unused in diffusion models, the change is safe and improves API consistency.

mistralrs-core/src/pipeline/speech.rs (1)

122-122: LGTM: Consistent lifetime parameter removal.

The change removes the lifetime parameter from PagedAttentionMeta, aligning with the broader architectural refactor. Since speech models don't utilize paged attention (parameter is unused), this change is safe and maintains API consistency.

mistralrs-core/src/pipeline/mod.rs (2)

261-272: LGTM: Enum refactor supports new concurrency model.

The removal of the lifetime parameter from CacheBackendMetadata::PagedAttention variant aligns with the broader refactor to use Arc<Mutex<BlockEngine>> for thread-safe shared ownership. This change enables cleaner APIs and better concurrency support.


356-356: LGTM: Method signature consistent with lifetime removal.

The step method signature change to accept CacheBackendMetadata without lifetime parameters is consistent with the enum refactor and maintains API coherence across the pipeline.

mistralrs-core/src/scheduler/mod.rs (2)

63-63: LGTM: Ownership model improvement for BlockTables.

Changing the return type from a reference to an owned Option<BlockTables> simplifies the ownership model and eliminates lifetime complexity. The relevant code snippets show that implementations correctly return cloned BlockTables, maintaining consistency across the scheduler hierarchy.


65-65: LGTM: Thread-safe BlockEngine access enabled.

The change to return Option<Arc<Mutex<BlockEngine>>> instead of a mutable reference enables thread-safe shared ownership and concurrent access to the block engine. This architectural improvement supports the prefix caching functionality while maintaining safety guarantees.

Consideration: The Arc<Mutex<_>> pattern introduces some performance overhead compared to direct mutable references, but this trade-off enables safer concurrent access patterns required for the enhanced paged attention implementation.

mistralrs-core/src/vision_models/llama4/inputs_processor.rs (1)

140-140: LGTM: Consistent lifetime parameter removal.

The removal of the lifetime parameter from PagedAttentionMeta aligns with the broader refactor to use Arc<Mutex<BlockEngine>> for thread-safe shared ownership. This change is consistent across all vision model input processors and maintains the existing usage patterns.

mistralrs-core/src/paged_attention/block_engine_sequence.rs (3)

1-3: Good additions for prefix caching support.

The new imports support the enhanced trait methods that handle both logical and physical token blocks with proper reference counting.


7-7: Well-designed method for physical block prefill support.

The take_physical_blocks_prefill method properly supports the prefix caching mechanism by:

  • Using &mut self to ensure exclusive access when consuming prefilled blocks
  • Returning Option<Vec<Arc<PhysicalTokenBlock>>> for thread-safe shared ownership
  • Following the "take" naming convention for consuming operations

9-9: Improved method naming and return type.

The rename from get_logical_token_blocks to logical_token_blocks follows Rust naming conventions, and changing the return type from usize (count) to &[LogicalTokenBlock] (actual data) provides more useful access to the underlying logical blocks for caching operations.
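Putting the two methods together, the trait shape described above looks roughly like this (the trait name, stand-in block types, and example implementer are assumptions):

use std::sync::Arc;

struct LogicalTokenBlock;
struct PhysicalTokenBlock;

trait BlockEngineSequence {
    fn get_id(&self) -> usize;
    // Returns the sequence's logical blocks directly, rather than a count.
    fn logical_token_blocks(&self) -> &[LogicalTokenBlock];
    // Hands out any prefilled physical blocks exactly once.
    fn take_physical_blocks_prefill(&mut self) -> Option<Vec<Arc<PhysicalTokenBlock>>>;
}

struct Seq {
    id: usize,
    logical: Vec<LogicalTokenBlock>,
    physical_prefill: Option<Vec<Arc<PhysicalTokenBlock>>>,
}

impl BlockEngineSequence for Seq {
    fn get_id(&self) -> usize { self.id }
    fn logical_token_blocks(&self) -> &[LogicalTokenBlock] { &self.logical }
    // Option::take ensures the prefilled blocks can only be consumed once.
    fn take_physical_blocks_prefill(&mut self) -> Option<Vec<Arc<PhysicalTokenBlock>>> {
        self.physical_prefill.take()
    }
}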

mistralrs-core/src/vision_models/phi3/phi3_inputs_processor.rs (1)

83-83: LGTM: Consistent with broader refactor.

The removal of the lifetime parameter from PagedAttentionMeta matches the changes across other vision model input processors, maintaining consistency in the codebase refactor for prefix caching support.

mistralrs-core/src/vision_models/idefics2/idefics2_input_processor.rs (1)

145-145: LGTM: Completes consistent refactor across vision models.

This change matches the identical refactor applied to all other vision model input processors (llama4, phi3, etc.), ensuring consistency in removing lifetime parameters from PagedAttentionMeta to support the new thread-safe ownership model for prefix caching.

mistralrs-core/src/vision_models/minicpmo/inputs_processor.rs (1)

103-103: LGTM: Signature update aligns with architectural refactor.

The removal of the lifetime parameter from PagedAttentionMeta<'_> to PagedAttentionMeta is consistent with the broader refactor that transitions from borrowed references to owned, thread-safe shared ownership via Arc<Mutex<_>>. This change enables the new prefix caching functionality for PagedAttention sequences.

mistralrs-core/src/vision_models/gemma3/inputs_processor.rs (1)

91-91: LGTM: Consistent signature update for architectural refactor.

This change removes the lifetime parameter from PagedAttentionMeta, consistent with the transition to thread-safe shared ownership patterns throughout the codebase. The update enables the new prefix caching functionality for PagedAttention sequences.

mistralrs-core/src/vision_models/phi4/inputs_processor.rs (1)

83-83: LGTM: Signature update supports thread-safe design.

The removal of the lifetime parameter from PagedAttentionMeta<'_> to PagedAttentionMeta is part of the consistent architectural refactor across vision model input processors. This change supports the new thread-safe shared ownership model and enables prefix caching for PagedAttention sequences.

mistralrs-core/src/vision_models/mllama/inputs_processor.rs (1)

183-183: LGTM: Completes consistent architectural refactor across vision models.

This final signature update removes the lifetime parameter from PagedAttentionMeta, completing the consistent architectural refactor across all vision model input processors. The change enables the new thread-safe shared ownership design and prefix caching functionality for PagedAttention sequences.

mistralrs-core/src/pipeline/speculative.rs (1)

340-340: LGTM! Lifetime removal is consistent with the refactoring.

The removal of the lifetime parameter from CacheBackendMetadata aligns with the broader refactoring to use Arc<Mutex<_>> for thread-safe shared ownership instead of explicit lifetime annotations.

mistralrs-core/src/dummy_paged_attention/block_engine_sequence.rs (3)

1-3: LGTM! Necessary imports for the new trait methods.

The imports properly support the new Arc<PhysicalTokenBlock> return type and block types used in the trait.


7-7: Good addition for physical block management.

The take_physical_blocks_prefill method appropriately uses &mut self to transfer ownership of prefilled physical blocks, which aligns well with the prefix caching implementation.


9-9: Improved API design with direct slice access.

The rename from get_logical_token_blocks to logical_token_blocks follows Rust naming conventions, and returning &[LogicalTokenBlock] instead of usize provides direct access to the blocks rather than just a count, making the API more useful.

mistralrs-core/src/scheduler/default_scheduler.rs (2)

4-4: LGTM! Import supports the new return type.

The Arc import is required for the updated block_engine return type.


330-339: Correct implementation for scheduler without PagedAttention support.

The return type changes align with the Scheduler trait updates:

  • block_tables now returns owned Option<BlockTables> instead of a reference
  • block_engine returns Option<Arc<tokio::sync::Mutex<BlockEngine>>> for thread-safe access

The DefaultScheduler correctly returns None for both methods since it doesn't support PagedAttention.

mistralrs-core/src/vision_models/qwen2vl/inputs_processor.rs (1)

128-128: LGTM! Signature change aligns with architectural refactor.

This change removes the lifetime parameter from PagedAttentionMeta, transitioning from explicit lifetime management to the new Arc<Mutex<BlockEngine>> ownership model. The implementation logic remains unchanged, which is appropriate for this signature refactor that's part of the broader prefix caching for PagedAttention feature.

mistralrs-core/src/vision_models/mistral3/inputs_processor.rs (1)

95-95: LGTM! Consistent signature change across vision processors.

The removal of the lifetime parameter from PagedAttentionMeta is consistent with the architectural changes being made across all vision model input processors. The method implementation correctly preserves the existing logic while adopting the new ownership model.

mistralrs-core/src/vision_models/llava/llava_inputs_processor.rs (1)

89-89: LGTM! Completes the coordinated signature update.

This final signature change completes the systematic update across vision model input processors, removing the lifetime parameter from PagedAttentionMeta. The consistency across all three processors demonstrates excellent coordination in implementing the new mutex-based ownership model for the prefix caching feature.

mistralrs-quant/src/metal_kernels/mod.rs (7)

1219-1223: Good overflow protection!

The updated condition properly checks for potential overflow before multiplication, preventing integer overflow issues when tmp_grid_dims.width * stride_blocks could exceed u32::MAX.
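The exact kernel condition isn't reproduced here; the following standalone illustration (with arbitrary values) just shows the overflow-safe check pattern:

fn main() {
    let width: u32 = 100_000;
    let stride_blocks: u32 = 50_000;
    // Check the product in a wider type (or with checked_mul) before using it,
    // so width * stride_blocks exceeding u32::MAX is caught instead of wrapping.
    let fits = (width as u64) * (stride_blocks as u64) <= u32::MAX as u64;
    assert!(!fits); // 5_000_000_000 > u32::MAX, so this combination must be rejected
    assert_eq!(width.checked_mul(stride_blocks), None);
}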


1344-1351: LGTM! Simplified lifetime annotations.

The removal of the explicit lifetime parameter is correct since the method returns cloned Arc references rather than borrowed data, making the lifetime annotation unnecessary.


1477-1478: Good use of built-in method!

Using Rust's built-in div_ceil method is more idiomatic and reduces custom code.


1697-1700: Nice optimization!

Using push for single characters is more efficient than push_str.


1754-1756: Clean parameter passing!

Removing explicit references simplifies the code while maintaining the same functionality.


1999-2001: Idiomatic iterator usage!

Using skip with iterator methods is cleaner and more expressive than index-based loops.


2014-2014: Cleaner Arc handling!

The automatic deref coercion makes the explicit dereferencing unnecessary.

mistralrs-core/src/paged_attention/scheduler.rs (3)

47-47: Good architectural change for thread safety!

Wrapping BlockEngine in Arc<tokio::sync::Mutex<_>> enables safe concurrent access across async tasks. This is essential for the prefix caching functionality where multiple components need to access the block engine.

Also applies to: 58-62


110-111: Improved allocation pattern!

Taking a mutable reference to the sequence in _allocate avoids nested mutex locks and ensures proper sequence mutation during allocation.

Also applies to: 343-344


381-383: Correct trait implementation for the new concurrency model!

Returning cloned values is the right approach here:

  • BlockTables clone provides a consistent snapshot
  • Arc clone is cheap and enables shared ownership

This aligns well with the mutex-based concurrency model.

Also applies to: 390-392

mistralrs-core/src/dummy_paged_attention/scheduler.rs (1)

47-47: Consistent implementation with the main scheduler!

The dummy scheduler correctly mirrors all the concurrency changes from the main PagedAttentionScheduler, maintaining API compatibility and consistent behavior.

Also applies to: 58-62, 343-344, 381-383, 390-392

mistralrs-core/src/paged_attention/block_engine.rs (1)

199-199: Good catch on fixing the is_gpu flag!

This corrects an important bug where CPU allocator blocks were incorrectly marked as GPU blocks.

mistralrs-core/src/dummy_paged_attention/block_engine.rs (1)

199-199: Good catch on fixing the is_gpu flag!

This corrects an important bug where CPU allocator blocks were incorrectly marked as GPU blocks.

mistralrs-core/src/pipeline/inputs_processor.rs (4)

85-89: Good refactoring to remove lifetime parameters.

The change from lifetime-bound mutable references to Arc<tokio::sync::Mutex<BlockEngine>> improves the API ergonomics and enables thread-safe shared ownership of the block engine.


212-216: Verify the simplified sliding window calculation.

The logic for calculating start_idx has been significantly simplified: where it previously used complex conditionals, it is now just prompt_len.saturating_sub(sliding_window) or 0.

Please ensure this simplified logic correctly handles all sliding window scenarios, especially edge cases where the prompt length is less than the sliding window size.
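As a standalone illustration of how that simplified rule behaves at the edges (the helper signature here is an assumption; the real code carries more context):

fn main() {
    // Keep only the last `sliding_window` tokens of the prompt, or everything
    // when no window is configured.
    fn start_idx(prompt_len: usize, sliding_window: Option<usize>) -> usize {
        match sliding_window {
            Some(w) => prompt_len.saturating_sub(w),
            None => 0,
        }
    }

    assert_eq!(start_idx(10, Some(4)), 6); // window smaller than prompt
    assert_eq!(start_idx(3, Some(8)), 0);  // prompt shorter than window: no truncation
    assert_eq!(start_idx(5, None), 0);     // no sliding window configured
}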


641-641: Consistent API update.

The removal of lifetime parameters from PagedAttentionMeta is consistently applied throughout the trait implementation.


383-392:

❓ Verification inconclusive

Critical change in block position calculation needs verification.

The block position calculation has changed from start_pos to start_pos - seq.token_offset(). This accounts for the sequence token offset, which appears to be related to the prefix caching implementation.

Please verify that this calculation correctly handles all cases, especially:

  1. When token_offset is 0
  2. When token_offset is greater than start_pos (though this should be prevented by the sequence logic)
  3. The interaction with prefix caching where sequences may have pre-filled blocks

The enhanced error message is helpful for debugging.


#!/usr/bin/env bash
set -e

echo ">>> make_completion_chunk signature and context"
rg -n "fn make_completion_chunk" mistralrs-core/src/pipeline/inputs_processor.rs || true
echo
echo ">>> Lines 350-430 from inputs_processor.rs"
sed -n '350,430p' mistralrs-core/src/pipeline/inputs_processor.rs || true
echo
echo ">>> All occurrences of token_offset"
rg -n "token_offset" mistralrs-core/src || true
echo
echo ">>> Sequence struct definitions"
rg -n "struct Sequence" mistralrs-core/src || true
echo
echo ">>> Sequence impl blocks"
rg -n "impl Sequence" mistralrs-core/src || true

#!/usr/bin/env bash
set -e

echo ">>> Sequence struct, token_offset definition, and block_pos inputs in the pipeline module"
rg -n -C3 "struct Sequence" mistralrs-core/src/pipeline
rg -n -C3 "fn token_offset" mistralrs-core/src/pipeline
rg -n -C3 "start_pos" mistralrs-core/src/pipeline/inputs_processor.rs
rg -n -C3 "token_offset" mistralrs-core/src/pipeline/inputs_processor.rs

Please review the output to ensure:

  1. token_offset() returns a usize that is always ≤ start_pos.
  2. No unexpected underflow occurs when calculating block_pos = start_pos - seq.token_offset().
  3. Prefix caching scenarios maintain this invariant.

Please verify the new block‐position calculation with token offsets

The change subtracts seq.token_offset() from start_pos when computing block_pos. This is critical for prefix-caching correctness but risks underflow if the offset exceeds start_pos. Confirm that:

  • When token_offset == 0, behavior is unchanged.
  • Under no valid sequence can token_offset > start_pos.
  • Prefix-caching logic always pre-fills or bounds the sequence such that the subtraction cannot underflow.

If any of these invariants might be violated, add an explicit check or clamp to avoid panic/underflow.
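A standalone sketch of such a guard, assuming the values are plain usize as the discussion above suggests:

fn main() {
    // Defensive version of the subtraction discussed above: assert the
    // invariant in debug builds and clamp in release, so a violated invariant
    // cannot underflow and wrap around.
    fn block_pos(start_pos: usize, token_offset: usize) -> usize {
        debug_assert!(
            token_offset <= start_pos,
            "token_offset ({token_offset}) must not exceed start_pos ({start_pos})"
        );
        start_pos.saturating_sub(token_offset)
    }

    assert_eq!(block_pos(128, 0), 128); // offset of zero leaves the position unchanged
    assert_eq!(block_pos(128, 64), 64); // cached prefix shifts the position back
}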

mistralrs-core/src/sequence.rs (2)

97-119: Well-structured utility function for token block management.

The util_append_token_to_blocks function properly encapsulates the logic for appending tokens to logical blocks (a standalone sketch follows the list below). The implementation correctly:

  • Handles the case when no blocks exist
  • Appends to the last block if space is available
  • Creates a new block when the current one is full
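A standalone sketch of that append behavior, using a placeholder block type and an assumed function signature (the real LogicalTokenBlock and helper carry more state):

struct LogicalTokenBlock {
    tokens: Vec<u32>,
    block_size: usize,
}

impl LogicalTokenBlock {
    fn new(block_size: usize) -> Self {
        Self { tokens: Vec::new(), block_size }
    }
    fn is_full(&self) -> bool {
        self.tokens.len() == self.block_size
    }
}

fn util_append_token_to_blocks(tok: u32, blocks: &mut Vec<LogicalTokenBlock>, block_size: usize) {
    // Start a new block when there are no blocks yet or the last one is full.
    let needs_new_block = blocks.last().map_or(true, |last| last.is_full());
    if needs_new_block {
        blocks.push(LogicalTokenBlock::new(block_size));
    }
    // Safe to unwrap: a block was just pushed if the vector was empty.
    blocks.last_mut().unwrap().tokens.push(tok);
}

fn main() {
    let mut blocks = Vec::new();
    for t in 0..5u32 {
        util_append_token_to_blocks(t, &mut blocks, 2);
    }
    // 5 tokens with block_size 2 -> 3 logical blocks (2, 2, and 1 tokens).
    assert_eq!(blocks.len(), 3);
}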

371-391: Trait implementation properly supports new block management.

The changes to return a slice of logical blocks and the addition of take_physical_blocks_prefill align well with the prefix caching design. Using Option::take() ensures the physical blocks can only be consumed once.

Comment on lines +56 to +60
impl Hash for LogicalTokenBlock {
    fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
        self.tokens.hash(state);
    }
}

⚠️ Potential issue

Incomplete Hash implementation may cause issues.

The Hash implementation only considers the tokens field, ignoring block_size and num_tokens. This could lead to hash collisions for blocks with identical tokens but different sizes or token counts, potentially causing incorrect behavior in hash-based collections.

Consider including all fields that affect equality:

 impl Hash for LogicalTokenBlock {
     fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
         self.tokens.hash(state);
+        self.block_size.hash(state);
+        self.num_tokens.hash(state);
     }
 }

Comment on lines +265 to +273
let mut block_table = physical_blocks_prefill.clone();
for block in &mut block_table {
    block.deref_mut().refcount = 1;
}
let n_extra_blocks = seq.logical_token_blocks().len() - block_table.len();
for _ in 0..n_extra_blocks {
    block_table.push(self.gpu_allocator.allocate());
}
self.block_tables.insert(seq.get_id(), block_table.clone());

🛠️ Refactor suggestion

Remove redundant clone operation.

The block_table is cloned twice - once when creating it from physical_blocks_prefill (line 265) and again when inserting into block_tables (line 273). The second clone is unnecessary.

-            let mut block_table = physical_blocks_prefill.clone();
+            let mut block_table = physical_blocks_prefill;
             for block in &mut block_table {
                 block.deref_mut().refcount = 1;
             }
             let n_extra_blocks = seq.logical_token_blocks().len() - block_table.len();
             for _ in 0..n_extra_blocks {
                 block_table.push(self.gpu_allocator.allocate());
             }
-            self.block_tables.insert(seq.get_id(), block_table.clone());
+            self.block_tables.insert(seq.get_id(), block_table);

Comment on lines +56 to +60
impl Hash for LogicalTokenBlock {
    fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
        self.tokens.hash(state);
    }
}

⚠️ Potential issue

Incomplete Hash implementation may cause issues.

The Hash implementation only considers the tokens field, ignoring block_size and num_tokens. This could lead to hash collisions for blocks with identical tokens but different sizes or token counts.

Consider including all fields that affect equality:

 impl Hash for LogicalTokenBlock {
     fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
         self.tokens.hash(state);
+        self.block_size.hash(state);
+        self.num_tokens.hash(state);
     }
 }

Comment on lines +265 to +273
let mut block_table = physical_blocks_prefill.clone();
for block in &mut block_table {
    block.deref_mut().refcount = 1;
}
let n_extra_blocks = seq.logical_token_blocks().len() - block_table.len();
for _ in 0..n_extra_blocks {
    block_table.push(self.gpu_allocator.allocate());
}
self.block_tables.insert(seq.get_id(), block_table.clone());

🛠️ Refactor suggestion

Remove redundant clone operation.

The block_table is cloned twice - once from physical_blocks_prefill and again when inserting. The second clone is unnecessary.

-            let mut block_table = physical_blocks_prefill.clone();
+            let mut block_table = physical_blocks_prefill;
             for block in &mut block_table {
                 block.deref_mut().refcount = 1;
             }
             let n_extra_blocks = seq.logical_token_blocks().len() - block_table.len();
             for _ in 0..n_extra_blocks {
                 block_table.push(self.gpu_allocator.allocate());
             }
-            self.block_tables.insert(seq.get_id(), block_table.clone());
+            self.block_tables.insert(seq.get_id(), block_table);

Committable suggestion skipped: line range outside the PR's diff.


Comment on lines +280 to +298
let mut n_blocks = match_len.div_ceil(block_size);
n_blocks = n_blocks.min(logical_blocks.len());

if n_blocks == 0 {
    return Ok(None);
}

// Take the first n_blocks of both logical and physical blocks
let mut logical_prefix = logical_blocks[..n_blocks].to_vec();
let physical_prefix = physical_blocks[..n_blocks].to_vec();

// If the last reused block is full, reserve an extra empty block for new tokens
let new_toks = toks[match_len..].to_vec();
for _ in 0..new_toks.len().div_ceil(block_size) {
    logical_prefix.push(LogicalTokenBlock::new(block_size));
}
if logical_prefix.last().is_some_and(|last| last.is_full()) {
    logical_prefix.push(LogicalTokenBlock::new(block_size));
}

💡 Verification agent

❓ Verification inconclusive

Verify edge cases in block matching logic.

The block calculation logic handles several edge cases but needs verification:

  1. Line 280: div_ceil correctly rounds up for partial blocks
  2. Lines 293-298: The logic adds blocks for remaining tokens and ensures an empty block is available

🏁 Script executed:

#!/bin/bash
# Search for tests that verify the block matching logic
rg -A 10 "search_for_matching_cache.*test" --type rust
rg -A 10 "test.*prefix.*cache.*block" --type rust

Length of output: 108


Add missing tests for block matching edge cases

I didn’t find any existing tests covering the search_for_matching_cache / prefix-cacher logic. Please add unit tests in mistralrs-core/src/prefix_cacher.rs (or in a dedicated test module) to cover at least these scenarios:

  • match_len exactly divides block_size (no partial block)
  • match_len leaves a partial block (tests div_ceil rounding up)
  • match_len is zero (early return Ok(None))
  • remaining tokens fill exactly one block (no extra empty block added)
  • remaining tokens overflow into a new block (extra block added)
  • last reused block is full (ensures an additional empty block)

This will verify both the n_blocks calculation and the “reserve extra block” logic.


Comment on lines +118 to 147
    if let Some(block_engine) = &self.block_engine {
        let logical_token_blocks = seq.logical_token_blocks();
        let block_engine = get_mut_arcmutex!(block_engine);
        let block_table = &block_engine.block_tables[seq.id()];
        for block in block_table {
            block.deref_mut().increment_refcount();
        }

        let hashed_logical_blocks = hash_logical_blocks(logical_token_blocks);

        self.block_caches.insert(
            hashed_logical_blocks,
            BlockCacheElement {
                logical_blocks: logical_token_blocks.to_vec(),
                physical_blocks: block_table.clone(),
                image_hashes: seq.image_hashes().map(|x| x.to_vec()),
            },
        );
    } else {
        let cache = seq.normal_cache().to_vec();

        self.caches.insert(
            seq.get_toks().to_vec().into(),
            CacheElement {
                cache,
                image_hashes: seq.image_hashes().map(|x| x.to_vec()),
            },
        );
    }
}

⚠️ Potential issue

Ensure proper reference counting for cached blocks.

The method correctly increments reference counts when adding sequences to the cache. However, there's no corresponding decrement when cache entries are evicted.

The evict_caches and evict_all_caches methods only handle the token-based cache but don't clean up block_caches. This could lead to memory leaks as physical blocks' reference counts won't be decremented.

Add cleanup logic for block caches:

 pub fn evict_all_caches(&mut self) -> Result<usize> {
     let len = self.caches.len();
     self.caches.clear();
+    // Decrement refcounts and clear block caches
+    if let Some(block_engine) = &self.block_engine {
+        let block_engine = get_mut_arcmutex!(block_engine);
+        for (_, cache_elem) in self.block_caches.drain(..) {
+            for block in cache_elem.physical_blocks {
+                block.deref_mut().decrement_refcount();
+            }
+        }
+    }
     Ok(len)
 }

Jeadie added a commit to spiceai/mistral.rs that referenced this pull request Jul 14, 2025
* Fix handling of Metal fused attn head dims (EricLBuehler#1234)

* Fix handling of metal attn head dims

* Fix handling of gemma3 1b when images

* Tweak default for paged attn builder

* Support paged attn for vision model rust api (EricLBuehler#1235)

* [Breaking] Support setting HF cache path (EricLBuehler#1237)

* Add it internally

* Add the apis

* Support tool calling for DeepSeek models (EricLBuehler#1239)

* Support tool calling for deepseek models

* Format

* Fix deepseek

* Server image processing refactor and fixes (EricLBuehler#1244)

* Fix strict gemma3 case

* Accept multiple images in the content array

* Fix multiple images in one array ct

* Add it to the python api

* Typos

* Optimized CUDA RoPE kernels (EricLBuehler#1247)

* Add the kernels

* It works

* Works

* Buulds

* Typo fix (add_speial_tokens to add_special_tokens) (EricLBuehler#1246)

* Fix typo

* Update mistralrs.pyi

* Fixes for UQFF + distributed layers (EricLBuehler#1250)

* Fixes for uqff + distributed layers

* Typo

* Automatic agentic search integration (`web_search_options`) (EricLBuehler#1243)

* Add the tool

* Actually search

* Clippy

* Sort of works

* Remove some debuggers

* tweak

* Add some rules

* Works great

* Tweak 'system' prompt

* Update mistralrs-core/src/search/mod.rs

Co-authored-by: Copilot <[email protected]>

* Typo

* Add it to all the apis

* Add bert model for similarity reranking

* Typos

* Early detection of tools

* Alias max_tokens -> max_completion_tokens too

* Customizable bert model

* Flip the enabler around

* Add docs

* Update readme

* Typo

---------

Co-authored-by: Copilot <[email protected]>

* Format kernels (EricLBuehler#1251)

* Update readme

* Update readme

* Remove test

* Add quantize guards for uqff deserialize (EricLBuehler#1252)

* Refactor cuBLASlt-related code (EricLBuehler#1253)

* Centralize cublaslt into mistralrs-quant

* Use cublaslt in unquant layer

* Use beautiful trait constants for simpler code

* Move tests

* Dispatch to unquant for cublaslt

* Dispatch to unquant for cublaslt

* Fix feature

* Add convert_to_gptq script

* Update deps, bump pyo3 version (EricLBuehler#1259)

* Faster cuda FP8 performance (EricLBuehler#1257)

* Avoid fp8 sync

* Fix dtype

* Rust 1.86 clippy (EricLBuehler#1260)

* Rust 1.86 clippy

* Clippy

* Refactor engine arch (EricLBuehler#1262)

* Refactor engine add_request

* Don't recompile regex

* Clippy

* Revamped LoRA support - removing the Ordering system! (EricLBuehler#1263)

* Play with varbuilder lifetimes

* Merge lora weights

* Clippy

* Lora works

* Support multiple loras

* Cleanup, remove adapter activation

* Complete merge

* Fast Metal-specific quantization method: AFQ (EricLBuehler#1264)

* Add mlx quantized kernels

* Add mlx quantized kernels

* Kernel launcher

* Add AFQ isq quant and dequant

* Some quantmethod things

* Begin to implement the qmm caller

* Clippy

* Much faster

* Cache kernels

* Docs

* Clippy

* Add it to uqff

* Support prequantized models from MLX (EricLBuehler#1265)

* Refactor quantizedconfig

* Support AFQ prequantized

* Update docs

* Update docs

* Automatic ISQ to select fastest & most accurate method (EricLBuehler#1266)

* Automatic isq

* typo

* Doc

* Improved usage metrics (EricLBuehler#1267)

* Fix cuda

* Bump tokio from 1.44.1 to 1.44.2 (EricLBuehler#1270)

Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.44.1 to 1.44.2.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](tokio-rs/tokio@tokio-1.44.1...tokio-1.44.2)

---
updated-dependencies:
- dependency-name: tokio
  dependency-version: 1.44.2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Gather MM ops in mistralrs-quant (EricLBuehler#1272)

* Update the caller

* Wire things up

* Broadcase for afq gathermm

* Broadcase for afq gathermm

* Clippy

* Improve performance of deepseek models

* Typo fix

* BincountOp not used

* Implement Llama 4! (EricLBuehler#1268)

* Implement Llama 4

* Implement the main changes for the text model

* Make chunked mask

* Wire things up

* Add some EP

* Initial sketch of inputs processor

* Runs

* Progress

* all reduce moes

* It works!

* Some cleanup

* Faster moe block

* Add device map

* Make chunked matrix

* Fully working now!

* Reactivate cublaslt

* Fix shared mlp cublaslt

* Refactor to packed experts

* Complete merge

* It is a normal model now

* Fixes

* Set device for moe

* ISQ fixes

* Much faster sort kernel

* Faster loading!

* Faster loading!

* Fp8 cpu copy ops in candle backend

* Add the vision model

* Add mmproj layer

* Actually merge the inputs

* Sketch most of the image processor

* Add the rest of the image processor

* Implement the whole processor

* Add the loader

* Some fixes

* A batch of fixes

* Some fixes

* tmp

* Actually support isq

* Ok it works a bit

* Fix norm device

* It works

* A bit cleaner

* Support residul tensors

* Remove text loader

* Implement the device mapping system

* Fix auto device map

* Add examples

* Add model card

* Typo

* Remove superflous logging

* Fixes for Llama 4 UQFF loading (EricLBuehler#1275)

* Support sharding for UQFF (EricLBuehler#1276)

* Serialize sharded uqff files

* Loading

* Fix base64

* Fix bug for group-topk (group_limited_greedy) in deepseek models (EricLBuehler#1278)

* Support the DeepCoder model (EricLBuehler#1279)

* Add faq for metal not found

* Improved PagedAttn scheduling accuracy (EricLBuehler#1282)

* Scheduler ops by reference

* Ensure scheduler gets correct prompts

* Fix cuda build for copy_blocks

* Fixes for scheduling image seqs with pagedattn (EricLBuehler#1283)

* update to llguidance 0.7.16 (EricLBuehler#1284)

* update llguidance to 0.7.16 from crates.io; use ParserFactory

* add lark_llg.py example

* use new llguidance::Matcher APIs

* rework spec-decoding with llg

* more work on spec sampling

* check for parser stop

* fix clippy

* remove unneeded rollback

* update build_llg_factory to return Result

* Update dependencies (EricLBuehler#1286)

* Much faster image inputs processing (EricLBuehler#1289)

* Add more SDPA head dims for much faster SigLIP (EricLBuehler#1290)

* More sdpa head dims, faster vision models

* Move nonzero to above for faster metal synch

* Doc

* Update valid head dims

* Show throughput in interactive mode (EricLBuehler#1291)

* Update interactive mode throughput stats

* Accurate prompt t/s

* Accurate prompt t/s for usage

* Unify bitwise operations (EricLBuehler#1288)

* Unify bitwise ops

* Tests pass

* Fix cuda build

* Clippy

* Multimodal prefix caching support! (EricLBuehler#1298)

* Initial progress

* Support vision prefix caching

* Update docs

* Add multimodal data abstraction

* Interactive mode improvements (EricLBuehler#1299)

* More ergonomic image url parsing

* Add option to clear

* Add the Qwen 3 and Qwen 3 MoE models! (EricLBuehler#1285)

* Add qwen3 model

* Add enable_thinking

* Add initial qwen3 moe

* Add the moe model

* Format

* Fix order of norm

* Fix expert shapes

* Fix reverse

* Fix norm device for isq

* Fix nonzero when no nonzero

* Moe model runs

* Working qwen3 moe

* Add metal fp8 blockwise dequant

* Clean

* Typo

* Enable tool calling

* Streamlined ux

* Add some examples

* Add docs

* Fix dead link

* Remove interactive mode max_len

* Update QWEN3.md

* Hotfix for vision mode clear

* Revamped and streaming web search support (EricLBuehler#1301)

* Streaming web search

* Refactor a bit

* More refactoring

* Add some logging, parallelize some things

* Allow url

* Suppress warning, allow multi-turn searching

* Batch compute_similarities

* Cap content len

* Typos

* Doc

* Handle vision messages or different tool call prefixes (EricLBuehler#1302)

* Fix cuda

* Tune web search budget

* Simplify prefix cacher (EricLBuehler#1305)

* Use rustyline to handle non-ascii in interactive mode (EricLBuehler#1306)

`io::stdin().read_line()` cannot handle non-ASCII input, which caused a
crash when using backspace to delete non-ASCII characters.

Introduce rustyline in interactive mode to solve the problem; it can
also bring more editing features in the future.

Close EricLBuehler#1140

* Add more tools for automatic search (EricLBuehler#1307)

* Add interactive mode history

* Add a website extraction tool

* Pass toks by reference

* Optimize prompt chunking

* Fix CPU hogging in interactive mode (EricLBuehler#1309)

The log enabler should be checked after the sleep instead of in a busy
loop.

Since interactive mode always disables the token speed logger, this loop
constantly consumed 100% CPU.

* Add Metal precompilation support  (EricLBuehler#1311)

* Add metal precompilation for paged attn

* Add for mistralrs-quant

* Better constructor

* Dont always build

* Fix name for paged attn rebuild

* Reduce thrashing of Metal autorelease (EricLBuehler#1313)

* Reduce calls to autorelease

* Optimize clone_in_cache

* Refactor float8

* make `AdapterPaths` and `LoraAdapterPaths` public (EricLBuehler#1314)

Make `AdapterPaths` and `LoraAdapterPaths` public so `LocalModelPaths`
can be constructed outside of `mistralrs-core`.

* Refactor KV cache manager (EricLBuehler#1315)

* Refactor kv cache

* Refactor caches

* Fix some overflows

* Add `Audio` and `Speech` model categories (EricLBuehler#1317)

* add `Audio` to `ModelCategory`

* add `Speech` to `ModelCategory`

* fix to go back to PartialEq having an exhaustiveness check

* Remove has_conv2d from vision model API (EricLBuehler#1318)

* Unified/automatic flash attention enabler (EricLBuehler#1319)

* Remove from sdpa params

* Fix errors

* No warnings

* Log

* Clippy

* Fix cublaslt 4d mask (EricLBuehler#1320)

* Fix cublaslt 4d mask

* Clippy

* Keep caches on gpu

* Qwen VL models fixes (EricLBuehler#1322)

* Add some defaults

* Fix

* Fix one thing

* 2.5 vl works

* Use caching again

* Fix v2

* Move index inside loop

* Offset in ropeidx

* Default support for vision prefix caching is false

* Fixes for all vision models (EricLBuehler#1323)

* Fix phi input processor?

* Fix phi input processor

* Handle no_prefix_cache from pipeline

* Phi models confirmed 👍

* Fixed for phi inputs processors

* Fixed for phi4

* Llama 3 confirmed 😀

* Mistral 3 confirmed 😃

* Idefics 2/3 fixes

* Some fixes

* Remove unsafety

* Improved+faster LRU prefix cacher (EricLBuehler#1321)

* Show TTFT

* Use LRU prefix cacher

* Faster prefix cacher

* Inplace ISQ support and default to mmap (EricLBuehler#1277)

* Initial impl of immediate isq

* Immediate isq -> !loading_isq

* Varbuiler utils always using mmap!

* Log

* Add for packed experts

* Afq without copy

* Clarify

* Clippy

* Apple immediate isq

* Better logic for loading_isq

* Support showing ttft

* Rename

* Shared quantize guard

* Parallel progress bar

* Parallel loading for progress bars

* Actual ISQ support

* Conditional parallelism for NiceProgressBar

* Use conditional iterator

* Warn once

* Predicate for applying immediate isq

* Allow parallel

* Remove debug print

* Remove debug print

* Remove debug print

* Fix typos (EricLBuehler#1329)

* Fix Idefics 3 arch chat templating (EricLBuehler#1330)

* Update inputs merger

* Fix

* Better warning

* Better warning

* Better warning

* Nonzero ahead of time

* No f32

* Clippy

* Optimize get_logprobs

* Fix packed experts

* Update masking

* Use Sdpa in idefics3

* QuantMethod in idefics3 vision

* Remove a .contiguous

* Remove two space from PR comment (EricLBuehler#1331)

* Add automatic vision loader type (EricLBuehler#1332)

* Add automatic vision loader

* Remove references to --arch

* Update examples

* Add the Dia 1.6b TTS model! (EricLBuehler#1304)

* Add loading

* Add rope, mlp, most of attn

* Add encoder + encoder layer, decoder layer forwards

* Add decoder forwards

* Add prepare_audio_prompt

* prepare_generation mostly done

* Add a proper dia kvcache

* Add most of decoder_step

* Add the sampler

* Add the generation loop

* Wire things up

* Add speech pipeline

* Fixes

* Loads

* Some fixes

* f32

* Some progress

* Ok it runs upto dac decoding

* Add dac part loading

* Loads and runs at least

* Remove encodec

* Debugging

* Debugging

* Huh

* Complete merge

* Interactive

* Confirmed dac works at least

* Looks like encoder works

* Much progress

* Hmm

* Sampling

* Almost there

* Sampler

* Sampler

* Bf16 support

* Response

* Use it in interactive mode

* Fix oneshot

* Add openai api

* Add openai api

* Refactor loading

* Use naive sdpa for inplace

* Factor out

* Clippy

* Clippy

* Config

* Refactor config

* Metal clippy

* Fix t/s

* ISQ support

* Some fixes, nits

* Fix cuda

* Clippy

* Inhibit cublaslt for cuda

* Add server example

* Add python example

* Add rust api

* Add docs

* Update config.toml

* Fix .pyi

* Update readme

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* config.toml tweak

* update `llguidance` to `0.7.20` (EricLBuehler#1334)

Update `llguidance` from `0.7.16` to `0.7.20` so that it has guidance-ai/llguidance#172 which is a fix for building on GCC 15.

* Add model category <> messages check (EricLBuehler#1335)

* Verify model category matches the messages

* Add vision chat

* Fixes

* Add element-wise normalization check (EricLBuehler#1340)

* Fix streaming example print statement (EricLBuehler#1339)

* Fix normalization formula in comment (EricLBuehler#1338)

* Fix image_to_pixels to handle non-RGB images (EricLBuehler#1337)

* Fix typo in expect messages (EricLBuehler#1342)

* Don't use mmap on cuda (EricLBuehler#1336)

* No mmap on cuda

* Simplify streaming tool call logic

* Remove debug

* Support AWQ format models (EricLBuehler#1350)

* Support AWQ format models

* Clippy fix

* Fix uqff dummy layer ISQ application (EricLBuehler#1351)

* Disable immediate isq if write_uqff (EricLBuehler#1352)

* Fixes for UQFF loading on CUDA, ISQ pack factor (EricLBuehler#1354)

* Fix logic for uqff on cuda

* Updated pack_factor

* Refactor Option references for model paths (EricLBuehler#1347)

* refactor: use Option refs in model path helpers

* Format

* Add a script for server benchmarking (EricLBuehler#1355)

* Serde alias

* Fix

* Update for tie_word_embeddings

* Print running/waiting

* 30 users

* Update num_users

* Update dummy paged attn

* Optimized Metal qmv_fast path (EricLBuehler#1356)

* Compile with lto

* Tweak profiles

* New, fast sampler for Metal! (EricLBuehler#1327)

* Show TTFT

* Use LRU prefix cacher

* Faster prefix cacher

* A bit of gpu sampling

* Minp but cpu for now

* Metal fast cumsum impl

* Sampling with fast topp kernel

* Hmm not perfect

* Add metal sort kernels

* Tmp

* Add single block sort

* Add most of multi block sort, just need copy op

* Add copy kernels

* Expose kernels

* Add a test

* Ok it works

* Structure things

* Add caching

* Rename

* Cpu is default

* CUDA case

* Topk

* Refactor Option references for model paths (EricLBuehler#1347)

* refactor: use Option refs in model path helpers

* Format

* Add a script for server benchmarking (EricLBuehler#1355)

* Serde alias

* Fix

* Update for tie_word_embeddings

* Print running/waiting

* 30 users

* Update num_users

* Update dummy paged attn

* Optimized Metal qmv_fast path (EricLBuehler#1356)

* Compile with lto

* Tweak profiles

* Fix topk

* Penalties

* Add logits processor, clippy fixes

* Fix chat port

* Remove warning

* Fix chat port

* Fix metal parallel sampling (EricLBuehler#1357)

* Cpu if parallel for now

* Tweak bench script

* Add immediate isq predicates for qwen3 (EricLBuehler#1358)

* Add immediate isq predicates for qwen3

* Fix parsing of "parse_isq_value" depedent of device

* Typo

* Fix gemma3 logging

* Regressions fixes (EricLBuehler#1359)

* Fix regression for mmap

* Revert EricLBuehler#1321

* Refactored matching_cache impl

* Clippy

* Revamped and smaller readme (EricLBuehler#1360)

* Expandable detail sections

* Refactor using derivative model

* Tweak quick examples

* Update llama

* Update llama

* Supported accelerators is a table

* Update installation guides

* Tweak apis

* Remove --port in quick examples

* Add demo gif

* Add gif in readme

* Update demo gif

* Update demo gif

* Update demo gif

* Add gif in readme

* Add gif in readme

* Add a web chat app! (EricLBuehler#1362)

* Initial

* Markdown

* Copy code

* Add model loading sidebar

* Support vision models

* Tweak isq

* Links go to another page

* Clear when switch model

* Fix html tags

* Add image support!

* More then one images

* Fix

* Improved textarea

* Tab for switching between vision and text

* No paged attn for now

* Prettier format

* Multiple models at once

* Better switching, clearing ability

* Mobile support

* Inline markdown parser

* Update examples

* Typos

* Support specifying isq

* Fix mobile

* Fixes

* Fix button on mobile

* Image height is capped

* Thumbnail

* Fix rotating kv cache edge case

* Add drag and drop for images

* Small things

* Sidebar is frozen now

* Better listner

* Add readme

* Tweak readme

* Add chat history support to web chat app (EricLBuehler#1363)

* Add chat history

* Support renaming

* Start immediately with new chat

* Add timestamp

* Prettier chat list

* Style

* Delete chat

* Fix copy button

* Fix markdown rendering

* Store things in cache

* Store things in cache

* Refactor web chat, fix multichat image restore (EricLBuehler#1364)

* Fix multichat image restoration.

* Clippy

* Refactor

* Refactor frontent

* Fix repeated immediate isq init (EricLBuehler#1365)

* Add images_ref

* Add debug impl

* Fix the bug

* Tweak style of buttons

* Add a spinner

* Move spinner

* Tweak emoji

* Add gif

* Tweak initial gif

* Include vision tower tensors in Mistral3 UQFF (EricLBuehler#1366)

* Fix mistral 3 uqff resitdual tensors for vision

* Rolling shard creation for uqff files (EricLBuehler#1367)

* Fix occasional unstability during isq of afq (EricLBuehler#1368)

* Fix unstability during isq of afq

* Clippy

* Fix web chat installation

* Support web chat file uploading (EricLBuehler#1370)

* Web chat fixes

* Fix thumbnail in message, reuse blank chat

* Add file uploading support

* Fix scroll

* Allowed extensions

* Preserve files as literals

* Support multiple clients

* Add a stop button

* New cache dir

* New cache dir

* Fix

* Refactor

* Update readme

* Tweak drag-and-drop css

* Add speech generation support to the web chat! (EricLBuehler#1373)

* Initial speech gen support for web chat

* Tweak ui

* Update docs

* Prefix caching for PagedAttention! (EricLBuehler#1369)

* Exposing some things for logical token blocks

* Prefix cache manager has the scheduler

* Refactor

* Get logical and physical blocks into the prefix cacher

* Hash and cache

* Pass physical block prefill

* Allocation of prefilled block tables

* Temp

* Dont always use 2

* Hmm

* Hmm

* It mostly works

* Increment refcount

* Support images!

* Add to dummy paged attn

* Fix some clippy

* Clippy

* More checks

* Include EricLBuehler#1371, closes EricLBuehler#1371

* Typos

* Update docs

* Metal PagedAttention accuracy improvements (EricLBuehler#1374)

* Fix subtle bug

* Fix half sum bug

* Format metal paged attention

* Handle images in paged attn scheduler (EricLBuehler#1375)

* Include schemas needed for chatcompletions endpoint (EricLBuehler#1353)

* EricLBuehler#1326: WIP include schemas needed for chat completions endpoint

 Conflicts:
	Cargo.lock
	mistralrs-server/src/main.rs

* EricLBuehler#1326: WIP define utoipa as a workspace dep since core and server both need it

* EricLBuehler#1326: first draft of handling schemas that use Either

* EricLBuehler#1326: first draft of handling schema for Grammar

* EricLBuehler#1326: Add in other endpoints to API docs.

* EricLBuehler#1326: Adjust code comments

* EricLBuehler#1326: Implement coderabbitai suggestions

- EricLBuehler#1353 (review)
- EricLBuehler#1353 (comment)

* Fix constraints with metal sampler

* Revert EricLBuehler#1375

* Fix case where prefix cacher returns no toks (EricLBuehler#1377)

* Fix AFQ UQFF serialization

* Faster UQFF serialization (EricLBuehler#1379)

* Faster UQFF serialization

* Fix uqff gemma3

* Improve gemma3 auto loader names

* UQFF creation for AFQ on CPU support (EricLBuehler#1380)

* Add afq cpu quantize/dequantize

* Clippy

* Improved device for afq quantize

* Improved dtype handling for cpu afq (de)quantize

* Improved generate_uqff_card

* Add fused CPU attention kernel! (EricLBuehler#1382)

* Working

* Fix warnings

* Allow mask

* Support bf16, f16

* Handle striding

* Parallelized

* Add initial vector flash attn

* Avoid repeated allocations

* Tiled kv

* Apply some clippy

* Some small fixes

* Chunked vec_dot

* Clipy

* Use T::zero

* Refactor attention backends (EricLBuehler#1384)

* Refactor attention code

* Refactor attention code

* Move into backends

* Set macOS thread affinity for CPU attn (EricLBuehler#1385)

* Use lazylock

* Format

* Fix metal warn build

* Faster Qwen 3 MoE support on Metal (EricLBuehler#1387)

* Fix load

* Use afq gather qmm

* Well it runs

* It works

* Polish

* Fast and slow options

* Remove quantized.rs

* Polish some more

* Refactor

* Add isq

* Update load in parallel

* Support fp8

* Refactor for FusedExperts

* Clippy

* Handle pack factor when loading prequantized models

* Use f32 only in moe

* Avoid using f32 so much

* Avoid using f32 so much

* Fix PagedAttention block leaks (EricLBuehler#1388)

* Warn and ignore if ignored

* Fix a block allocation leak

* Update bench.py

* Fix double free in block engine

* Do not apply ISQ if loading a prequantized model

* Fix cuda build again (EricLBuehler#1389)

* Fix cuda build

* Fix

* Format

* Fixes for cuda docker

* Update dockerfiles

* Bump version to 0.6.0 (EricLBuehler#1390)

* Bump version to 0.6.0

* Remove lower_level api

* Make a static dir

* Update deps

* Fix routing for static handler in web chat

* Fewer .contiguous calls for qwen3 moe (EricLBuehler#1391)

* Allow speech models to accept batched inputs (EricLBuehler#1393)

* Allow speech models to accept batched inputs

* Clippy

* Ring distributed backend for heterogeneous TP (EricLBuehler#1238)

* Begin work on ring distributed backend for Metal

* Add the actual ring functionality

* It loads and kind of runs

* It works

* Optimize buffer allocation

* Avoid copy

* It works

* Add allgather

* Fix load

* Ping-pong

* Small things

* Add config json

* Allow different ip address

* Read config once

* Read config when appropriate

* Replicate requests

* Small fix

* Fix small compat with openai

* Clippy

* Update docs

* Add deepseek tool calling chat template

* Add auto loader for vision/text detection! (EricLBuehler#1402)

* Add auto loader for vision/text detection

* Build fixes

* Add model loader

* Update docs

* Format

* Create Mistral.rs Server Core Lib: `mistralrs-server-core` (EricLBuehler#1346)

* First draft of exposing mistral server routes as lib

* make arg struct fields pub

* Take base path so utoipa swagger route can properly redirect

* Expose swagger routes and make it configurable

* Add base path option for swagger docs

* More work on modularizing mistralrs server

* Sync fork (+1 squashed commit)
Squashed commits:
[169ae9e] Sync fork

* Adjust fn params to use refs / individual params instead of args

* Start breaking down controller actions into smaller pieces

* Continue refactoring

* Make mods pub so they can be used outside crate

* Allow chat completion streamer to take a callback so that you can get the complete response when finished

WIP (+3 squashed commits)
Squashed commits:
[0061d87] WIP
[c484d56] WIP
[16f8a60] WIP

* Sync fork

* Adjust callback type

* Remove throughput_log arg that was removed in 26afcc3

* Implement defaults for Args (and use for Clap)

* Small code formatting tweaks

* Rename callback to match SSE event and code clean up

* Sync fork

* WIP: first very rough draft of server core builder. Doesn't meet parity with old functional approach yet (slower / unstable?).

* Clean up (+4 squashed commits)
Squashed commits:
[e1cff387] Sync fork
[d8301025] WIP debugging
[1ea9f8c8] Sync fork
[4fe28cf5] WIP: debug function

* WIP server core builders

* Code clean up

* Add on_chunk callback

* Code clean up

* First draft of creating version of mistral-server that uses server-core

Code clean up (+1 squashed commit)
Squashed commits:
[adea1693]

* Sync fork

* Add helper methods to builder to make optional args more ergonomic (since .build validates params)

* Start adding docs

* Start cleaning up crates deps

* Example commit of mistral-server with implementing server-core

* Start addressing CodeRabbit feedback

* Fix comment typo

* Tweak doc blocks

* - Update type alias naming for clarity (MistralRs instead of Mistral)
- CodeRabbit, don't use eprintln for lib (use trace)
- Allow buffer size to be passed in and default to Constant
- Allow router body limit to be passed in and default to Constant
- Update doc examples

* Typo

* Address CoderRabbitAI feedback

* Support linear rope for llama3 (EricLBuehler#1408)

* Hotfix for loading

* Fix vllama4 uqff loading (EricLBuehler#1409)

* Fix vllama4 uqff loading

* Fix regex

* Fix regex

* Maybe a fix

* Gracefully handle receiver disconnects (EricLBuehler#1410)

* Handle receiver disconnects

* Format

* Fix Qwen3 MoE device mapping irregularities (EricLBuehler#1411)

* Fix bias

* Fix lm_head packing case

* Account for gate

* Fix head dim

* Fix interactive mode URL parsing (EricLBuehler#1412)

* fix url regex in vision interactive mode

* Fix regex

* Clippy

* Refactor auto device map (EricLBuehler#1413)

* Refactor auto device map

* Refactor a bit more

* Clippy

* Enable runtime sampling tweaks in interactive mode (EricLBuehler#1414)

* Document runtime sampling commands

* Fix readme

* Tweak

* Bounds checking

* Tweak temp bounds

* Send streaming tokens every time

* Gumbel sampling for fast sampler (EricLBuehler#1416)

* Improved handling for initialize_logging

* Improved CPU flash attention accuracy & performance (EricLBuehler#1417)

* Downcast correctly

* Operate internally in f32

* Avoid some casts and striding

* Prefetch

* Provide chat_templates to container users (EricLBuehler#1419)

Models often come without chat templates, requiring them to be mapped
from the source repository into a container for access by the
mistralrs-server.

Copy the templates from the build tree into the root of the image
to permit use via `--chat-template /chat_templates/something.json`

TODO:
  With the increase in quantized models and support for other
formats, the initial benchmark run during model load can be used
to qualify/select existing chat templates embedded into the binary
for models which do not come with any (to include output of the
functional failures in each test allowing users to modify the
ones already provided correctly to suit the model being loaded).

Co-authored-by: RageLtMan <rageltman [at] sempervictus>

* Faster cpu flash attn (EricLBuehler#1418)

* Faster cpu flash attn

* Prefetch

* Clippy

* Add some tests

* Add softcap tests

* Fix test_parse_image_url test

* Update tests

* Update tests

* Web search improvements (bm25, web chat) (EricLBuehler#1420)

* Fix web search blocking case

* Web search support in web chat

* Tweak ui

* Support fallback to bm25

* Clippy

* Reinject descriptions

* Propely handle consecutive searches (EricLBuehler#1421)

* Update extraction tool reinjection

* Looped

* Update docs (EricLBuehler#1422)

- lib.rs: clean up example var names and match logging change from EricLBuehler@201d6be
- server_builder: fix typo
- READMEs: link to crate docs

* Better tool call detection logic (EricLBuehler#1424)

* Add web search hook callbacks (EricLBuehler#1426)

* feat: add customizable search hook

* Move to builder

* Update docs

* Fix CUDA context switching, bind thread on CudaStorage drop (EricLBuehler#1428)

* Add CUDA context helper and use in Llama forward

* No flashparams?

* working

* Tweak

* Update to use dep

* conditionally build flash attention inputs (EricLBuehler#1429)

* Add AGENTS.md (EricLBuehler#1430)

* Support Qwen3 GGUF model (EricLBuehler#1432)

* Support QWen3 GGUF model

* Clippy fix

* cargo fmt

* Improved paged attn prefix caching (EricLBuehler#1434)

* Improved paged attn prefix caching

* Disable

* Clippy

* Temporary fix for qwen3 gguf tokenizer (EricLBuehler#1433)

* Temporary fix for qwen3 gguf tokenizer

* Typo fix

* Add tool callback support (EricLBuehler#1427)

* Add tool callback support

* Fixes

* Support named tool callbacks

* Update examples

* Update docs

* Clippy

* Centralize crate dependencies (EricLBuehler#1438)

* chore: centralize dependencies

* Format

* Fix bug in tokenizer created with gguf metadata (EricLBuehler#1440)

* Fix bug in tokenizer created with gguf metadata

* Clippy fix

* Update deps (EricLBuehler#1441)

* Small things

* Update deps

* Update deps

* Update breaking changes

* Doc fixes (EricLBuehler#1442)

* Mention uqff_maker

* Downgrade rustyline 16.0.0 -> 15.0.0 (EricLBuehler#1444)

* Add max_completion_tokens alias for server (EricLBuehler#1451)

* Audio input support (Phi 4 multimodal) (EricLBuehler#1448)

* Deps

* Add conformer

* Nemo loading

* Position embeds

* Load t5 attn bias

* Attn and feed forward

* Add conv module and glu pointwise

* Implement relative attn bias

* Add the forward methods

* Add encoder embedding

* Fix oproj

* Some loading

* Conformer loads!

* Fully loading speech stack

* Merger

* Dont need that

* First pass at audio processing

* Read samples

* Optional

* Small loading fix

* Runs but not correct yet

* Improved audio processing?

* Works with this

* Fix t5 attn bias

* It works!

* Comment

* Use some other crates

* Clippy

* Allow bf16 on metal

* Add prefix_audio

* Remove unused

* Typo

* User specified

* Add audio url parsing

* AudioProjectionMode -> InputMode

* Audio prefix caching

* Fix bug in audio prefix caching

* Support both at the same time!

* Tweak logging

* Support stereo

* Add mistralrs-audio

* Support batching

* Add server and rust api example

* Add python api

* Fix add_multimodal_message

* Fix unfold for conformer

* Streaming example

* Add web chat support

* Add modalities registry

* Fix offline cache issue for gguf models (EricLBuehler#1452)

* Add MCP server endpoints (EricLBuehler#1453)

* feat(server): add MCP server support

* Add mcp docs

* Add handle_list_tools_request

* Better launch, tool handling

* Tmp state

* Ok works

* Handle modalities

* Update docs

* Add ping

* Tweak temperature bounds, args

* MCP documentation pass (EricLBuehler#1455)

* Fix table

* Update mcp docs

* Improve readme header

* Improve readme header

* Integrate an MCP client (EricLBuehler#1456)

* Add builtin mcp client

* Use async loader

* Add headers

* Handle sse

* More flexible search request

* Add tool callbacks with tools, for mcp

* Add bearer token support

* Add websocket support

* Update docs

* Add python api

* Clippy

* Add http api, docs

* Tests pass

* Make these configs actually work

* Add docs

* Make mistralrs-mcp

* Refactor examples

* Update examples

* Add defaults

* Add defaults

* Add defaults

* Update docs

* Improved docs

* Add -y to npx usages

* Even better examples

* Update generate_wheels

* Update generate_wheels

* Update generate_wheels

* Fix Dockerfile.cuda-all

* Improve automatic tool call (EricLBuehler#1460)

* Improved auto tool call

* Add logging

* chore: `Dockerfile.cuda-all` configurable threads (EricLBuehler#1458)

* chore: `Dockerfile.cuda-all` - Merge `RUN` for `apt-get install` (EricLBuehler#1459)

* Add fallback definition for isnan (EricLBuehler#1463)

* chore: `Dockerfile` - Drop runtime rayon thread ENV (EricLBuehler#1465)

* chore: Dockerfile - Remove rayon threads env

* chore: Dockerfile - Improve formatting for `apt-get`

* Remove duplicate calls for api_dir_list (EricLBuehler#1474)

* Remove duplicate calls for api_dir_list

* Support local cache for api_dir_list

* Fix home folder for metal

* Capitalized

* Fix transient pyo3 dep (EricLBuehler#1478)

Co-authored-by: Eric Buehler <[email protected]>

* Fix objc dep with non macos (EricLBuehler#1480)

* Fix phi 3/4 + nccl issue (EricLBuehler#1481)

* Fix log

* Fix n kv heads

* Fix phi3.5 moe (EricLBuehler#1482)

* Fix phi3.5 moe accum device

* Fix again

* Fix again

* Support GLM4 model! (EricLBuehler#1437)

* Support GLM4 model

* Mention GLM4 model in ReadMe

* glm4 type hint

* Typo fix

* Fix unsupported chat_template function

* Clippy fix

* Refactor distributed backend (EricLBuehler#1484)

* Refactor distributed backend, check power of 2

* Fix compilation

* Cap metal paged attn kv allocation (EricLBuehler#1485)

* Better paged attn metal cap (EricLBuehler#1486)

* Better paged attn metal cap

* Small fix

* Comment

* Small fix

* Refactor

* Server core: consolidate and unify route handlers and API surface (EricLBuehler#1423)

* Start working on consolidating completion and chat_completion underlying implementations

* Move response channel to util mod for now (since it's used with streaming and non streaming)

* More work on consolidating completions and chat completions

* More WIP consolidation of server core handlers

* More WIP consolidation of server core handlers

* More WIP consolidation of server core handlers

* Update docs and restrict completion core visibility

* CodeRabbit feedback: remove logprobs warn from route handler since parse request also checks this

* Use consistent var name for completions mod

* Make route handler modules public API consistent (same fn names, etc.) and provide proxy fn that wrap core fns so core mod doesn't have to be pub
Make lib.rs example compile checked and update example

* Code formatting

* Typo

* Sync fork

* Sync fork

* Docs example fix

* Support qwen3 gguf (EricLBuehler#1488)

* Add qwen3 gguf

* Template fixup

* Make bos/eos token IDs optional (EricLBuehler#1493)

* Remove python deps from CUDA dockerfiles (EricLBuehler#1487)

* Handle noncontiguous v in naive_sdpa (EricLBuehler#1499)

Co-authored-by: Eric Buehler <[email protected]>

* Server Core: refactor Paged Attention configuration (EricLBuehler#1500)

* Use StorageModePrivate for Metal PA kv cache (EricLBuehler#1506)

* Fix OpenAI stream: emit field in tool-call deltas for schema compliance (EricLBuehler#1507)

* FP8 KV-cache quantization for PagedAttention (EricLBuehler#1400)

* Add most of paged attn kv quant

* It builds a bit

* All the functionality at least

* Small fix

* Add a scale

* Fix bf16 usage

* Make k_v_scale optional

* Collector

* Tweak collection

* Refactor

* Add to apis

* Add cuda impl

* Fix compilation

* Fixes

* Handle ENABLE_FP8

* Format

* Tweak

* Fix scaled_convert usage

* Fix cache_t size

* Fixed scale collection

* Actual fix

* Fix fp8 for CC<8

* Fix the usual String != &str bit (EricLBuehler#1483)

Co-authored-by: RageLtMan <rageltman [at] sempervictus>

* chore: `Dockerfile` - Drop runtime rayon thread ENV (EricLBuehler#1465)

* chore: Dockerfile - Remove rayon threads env

* chore: Dockerfile - Improve formatting for `apt-get`

* Remove duplicate calls for api_dir_list (EricLBuehler#1474)

* Remove duplicate calls for api_dir_list

* Support local cache for api_dir_list

* Fix home folder for metal

* Capitalized

* Fix transient pyo3 dep (EricLBuehler#1478)

Co-authored-by: Eric Buehler <[email protected]>

* Fix objc dep with non macos (EricLBuehler#1480)

* Fix phi 3/4 + nccl issue (EricLBuehler#1481)

* Fix log

* Fix n kv heads

* Fix phi3.5 moe (EricLBuehler#1482)

* Fix phi3.5 moe accum device

* Fix again

* Fix again

* Support GLM4 model! (EricLBuehler#1437)

* Support GLM4 model

* Mention GLM4 model in ReadMe

* glm4 type hint

* Typo fix

* Fix unsupported chat_template function

* Clippy fix

* Refactor distributed backend (EricLBuehler#1484)

* Refactor distributed backend, check power of 2

* Fix compilation

* Cap metal paged attn kv allocation (EricLBuehler#1485)

* Better paged attn metal cap (EricLBuehler#1486)

* Better paged attn metal cap

* Small fix

* Comment

* Small fix

* Refactor

* Server core: consolidate and unify route handlers and API surface (EricLBuehler#1423)

* Start working on consolidating completion and chat_completion underlying implementations

* Move response channel to util mod for now (since it's used with streaming and non streaming)

* More work on consolidating completions and chat completions

* More WIP consolidation of server core handlers

* More WIP consolidation of server core handlers

* More WIP consolidation of server core handlers

* Update docs and restrict completion core visibility

* CodeRabbit feedback: remove logprobs warn from route handler since parse request also checks this

* Use consistent var name for completions mod

* Make route handler modules public API consistent (same fn names, etc.) and provide proxy fn that wrap core fns so core mod doesn't have to be pub
Make lib.rs example compile checked and update example

* Code formatting

* Typo

* Sync fork

* Sync fork

* Docs example fix

* Support qwen3 gguf (EricLBuehler#1488)

* Add qwen3 gguf

* Template fixup

* Make bos/eos token IDs optional (EricLBuehler#1493)

* Remove python deps from CUDA dockerfiles (EricLBuehler#1487)

* Handle USE_FP8 for cuda

* Fix cuda warn

* Add readme

* Saturating sub in sequence state

---------

Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: RageLtMan <[email protected]>
Co-authored-by: Brennan Kinney <[email protected]>
Co-authored-by: Guoqing Bao <[email protected]>
Co-authored-by: Matthew Haynes <[email protected]>

* Validate model name in OpenAI API (EricLBuehler#1509)

* Validate model name in openai api

* Add docs, allow 'ignore'

* Updated examples for EricLBuehler#1509

* Fix mcp import in doc string (EricLBuehler#1510)

* Add multi-model support! (EricLBuehler#1512)

* Refactor MistralRs

* Working multi-model!

* Add mutli-model docs initially

* Update mistralrs-pyo3, mistralrs-bench, mistralrs

* Update apis for consistency

* API tweaks

* Logging tweaks

* Add examples, tweak cli

* Clearer pipeline id

* Fix config key semantics

* Format and clippy

* Tweak logging, fix example

* Clippy refactor

* Update examples

* Remove unused multi model docs

* Replace 'ignore' with 'default'

* Update docs

* Add stars label to readme (EricLBuehler#1513)

* Add CLAUDE.md

* Handle base_model.model case in lora (EricLBuehler#1514)

* Add thread_local! for engine-specific const/static (EricLBuehler#1517)

* Fix MCP doc test (EricLBuehler#1511)

* Allow disabling metal precompilation (EricLBuehler#1518)

* Allow disabling metal precompilation

* Simple preprocessor

* Simple docs

---------

Co-authored-by: Eric Buehler <[email protected]>

* Rust 1.88 clippy (EricLBuehler#1522)

* Rust 1.88 clippy

* Format

* Fix cuda warnings (EricLBuehler#1526)

* Avoid panic decoding tokens on error (EricLBuehler#1527)

* Split Marlin and Paged Attention kernels for faster build (EricLBuehler#1525)

* Split Marlin and Paged Attention kernels for faster build

* Typo fix

* chore: update llguidance (EricLBuehler#1535)

* chore: update llguidance

* chore: remove unused import

* Add the SmolLM3 model! (EricLBuehler#1501)

* Add model

* Update loader

* Fix llama config usage

* Docs

* Fix config no_rope_layers

* Fix tie_word_embeddings default

* Add chat template

* Embed the chat templates

* Fix embedding template

* enable_thinking default true

* Update examples

* XML tools for smollm3

* Add smollm3 docs

* Fix openai examples

* Clippy

---------

Co-authored-by: Eric Buehler <[email protected]>

* Add full Gemma 3n support! (EricLBuehler#1519)

* Add initial

* Loading for text model

* Add ple embeddings

* Add altup, laurel block

* Update rmsnorm

* Add mlp

* Update attn norm application

* Currently no kv shared

* Wire it up

* It runs

* Fix bf16

* Fix scaled embd

* Fixes for mean

* tmp

* Attn confirmed

* Fix target_magnitude

* Add shared kv

* Ok it works

* Remove npy

* Fix streaming

* Remove warnings

* Remove paged attn

* Refactor rope

* Add immediate isq

* Add vision & mproj

* Update image processor

* Vision merge runs, not correct

* Remove

* Add mobilenet v5

* Add multimodal vision embedding

* Fix load

* runs

* Fix gamma

* Works but just not vision tower

* It works!!

* Tweak

* Fix warnings

* Move vision tower

* Fix warn

* Update cache manager things

* Refactor

* Add audio model, it loads

* Add audio processing

* It runs at least

* tmp

* A bit better

* Audio works!!!!

* Fused attn in vision

* Clippy

* Update audio runner

* Optimized audio model

* Remove unused things

* Fix inputs processor bug

* Remove comments

* Clippy

* Small optimizations

* Format

* Correctly register modalities

* Add docs

* Update readme

* Runs there

* Fixed padding from Blaizzy/mlx-vlm#410

* Add better checks

* Fix sdpa n_kv_groups

* Vision encoder works!

* Rotate image

* Clippy

* Fix cuda loading

* Updated device mapper

* Fix overflow

* Fix dtype errors

* Refactor image/audio embeddings

* Fix metal

* Fix dtype mismatch

* Audio processing fixes

* Audio processing fixes

* Works

* Audio is good

* Fix boi/eoi too

* Embed the chat templates

* Better embedding accuracy in non f32

* More f32

* Support bf16 on metal

* Add more ISQ

* Fixed device map

* Clippy

* Gemma3n no paged attn

* Fix saturating sub

* Faster rmsnorm

* Use sdpa for vision model

* Fix ple bug

* Fix name

* Fix multiaudio

* Add matformer config loading

* Add docs

* Add support for matformer in auto device mapper

* Update docs

* Typos

* Tweak

* Tweak

* Fix multidevice

* Fix gemma3n text model auto device map

* Fix dims3

* Fix auto devic emap vision

* Non-metal keeps PLE on cpu

* Complete merge

* Vision dtype f16 -> f32

* Fix metal nm device

* Fix uqff

* Typos

* Reference uqff

* Fix tests

* Fix sequence length check (EricLBuehler#1546)

* update candle version (EricLBuehler#1545)

Co-authored-by: AlpineVibrations <[email protected]>

* add ios target to metal deps (EricLBuehler#1548)

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: edwko <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Guoqing Bao <[email protected]>
Co-authored-by: Michał Moskal <[email protected]>
Co-authored-by: Chen Mulong <[email protected]>
Co-authored-by: Steph Wolski <[email protected]>
Co-authored-by: omahs <[email protected]>
Co-authored-by: Viktor Szépe <[email protected]>
Co-authored-by: Matthew Haynes <[email protected]>
Co-authored-by: RageLtMan <[email protected]>
Co-authored-by: Brennan Kinney <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: Sbargaoui <[email protected]>
Co-authored-by: Gaétan Lepage <[email protected]>
Co-authored-by: Ammar Elsabe <[email protected]>
Co-authored-by: luke <[email protected]>
Co-authored-by: AlpineVibrations <[email protected]>
Co-authored-by: Michael Tissen <[email protected]>