Server core: consolidate and unify route handlers and API surface #1423
Conversation
Walkthrough

The changes refactor the chat completion, completions, image generation, and speech generation modules to use new generic base abstractions for streaming, error handling, and response processing. Several new core modules are introduced, exposing reusable types and helpers. Type aliases and function signatures are updated for consistency, and error handling is centralized and modularized.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant HTTPHandler
    participant CoreHelpers
    participant Model
    participant SSEStreamer
    Client->>HTTPHandler: Send request (chat/completions/image/speech)
    HTTPHandler->>CoreHelpers: Parse and validate request
    CoreHelpers->>Model: Send model request via channel
    Model-->>CoreHelpers: Send response(s) via channel
    CoreHelpers->>SSEStreamer: Create streamer with callbacks (if streaming)
    SSEStreamer-->>Client: Stream chunks (SSE) or send JSON/audio response
    CoreHelpers->>HTTPHandler: Handle errors using generic helpers
    HTTPHandler-->>Client: Return error response if needed
```
Code Metrics Report

```
===============================================================================
 Language            Files     Lines      Code  Comments    Blanks
===============================================================================
 C Header                3        62        53         0         9
 CSS                     1       473       408        14        51
 Dockerfile              1        42        23        10         9
 HTML                    1        78        64         5         9
 JavaScript              7      1397      1068       180       149
 JSON                   18       282       279         0         3
 Makefile                1         6         5         0         1
 Python                 92      4530      3813       185       532
 Shell                   1        63        26        18        19
 Plain Text              3      3723         0      2413      1310
 TOML                   23       844       777        11        56
 YAML                    2        21        19         2         0
-------------------------------------------------------------------------------
 Jupyter Notebooks       3         0         0         0         0
 |- Markdown             2        77        32        31        14
 |- Python               2       205       178         1        26
 (Total)                         282       210        32        40
-------------------------------------------------------------------------------
 Markdown               68      6102         0      4600      1502
 |- BASH                14       203       190         7         6
 |- JSON                 9       436       436         0         0
 |- Python              12       391       326        24        41
 |- Rust                29      1123       945        30       148
 |- TOML                 2        75        63         0        12
 (Total)                        8330      1960      4661      1709
-------------------------------------------------------------------------------
 Rust                  392    138341    122796      3175     12370
 |- Markdown           185      3956       284      3154       518
 (Total)                      142297    123080      6329     12888
===============================================================================
 Total                 616    155964    129331     10613     16020
===============================================================================
```
fa08ac0 to d8c0f7a (Compare)
mistralrs-server-core/src/util.rs (Outdated)

```rust
/// This constant defines the maximum number of response messages that can be buffered
/// in the channel before backpressure is applied. A larger buffer reduces the likelihood
/// of blocking but uses more memory.
pub const DEFAULT_CHANNEL_BUFFER_SIZE: usize = 10_000;
```
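For orientation, a minimal sketch of how a constant like this is typically consumed; this assumes a tokio mpsc channel, which the `Receiver<Response>` signatures quoted later in this review suggest, but the exact wiring here is illustrative:

```rust
use tokio::sync::mpsc;

// Illustrative only: a bounded channel sized by the constant above.
// Once 10_000 responses are queued, senders await (backpressure) until
// the receiver drains some messages.
fn make_response_channel<T>() -> (mpsc::Sender<T>, mpsc::Receiver<T>) {
    const DEFAULT_CHANNEL_BUFFER_SIZE: usize = 10_000;
    mpsc::channel(DEFAULT_CHANNEL_BUFFER_SIZE)
}
```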
Storing things in util mod for now, but not sure if this will be the best / final location
Sounds like an interesting idea! Let me know when this is ready for review 🚀
84dbd45 to 7c06b0b (Compare)
Will do! Still kind of a mess at the moment, and TBH I'm not sure if the consolidation is better or just adding indirection, but I still want to work on it a bit more. Also, I don't think anything in mistralrs-core proper would be touched (the pyo3 bindings forbid generics and such). Perf-wise I think the compiler will monomorphize away the generics, but I'm not sure if other allocations are creeping in, so I'd want to double-check / test to be sure once I think it's ready-ish.
bcf9f9d to 1b29479 (Compare)
1694dfc to 196135f (Compare)
Actionable comments posted: 1
🔭 Outside diff range comments (1)

mistralrs-server-core/src/speech_generation.rs (1)

182-186: ⚠️ Potential issue — Fix buffer capacity calculation for PCM encoding.

The buffer is allocated with capacity for `i64` elements but only stores `i16` values, wasting memory:

```diff
-let mut buf = Vec::with_capacity(samples.len() * std::mem::size_of::<i64>());
+let mut buf = Vec::with_capacity(samples.len() * std::mem::size_of::<i16>());
```
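To make the size arithmetic concrete, here is a hedged sketch of PCM16 encoding (the conversion details are an assumption; only the buffer-capacity point comes from the review):

```rust
// Each f32 sample becomes one little-endian i16 (2 bytes), so the byte
// buffer needs samples.len() * size_of::<i16>() bytes, not size_of::<i64>().
fn encode_pcm16(samples: &[f32]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(samples.len() * std::mem::size_of::<i16>());
    for sample in samples {
        let v = (sample.clamp(-1.0, 1.0) * i16::MAX as f32) as i16;
        buf.extend_from_slice(&v.to_le_bytes());
    }
    buf
}
```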
🧹 Nitpick comments (3)

mistralrs-server-core/src/handler_core.rs (1)

11-16: Consider making the default buffer size configurable via environment variable.

While 10,000 is a reasonable default, different deployments might have different memory constraints or throughput requirements. Consider allowing this to be configured, similar to the `KEEP_ALIVE_INTERVAL` pattern used elsewhere.

```diff
-pub const DEFAULT_CHANNEL_BUFFER_SIZE: usize = 10_000;
+pub const DEFAULT_CHANNEL_BUFFER_SIZE: usize = 10_000;
+
+/// Gets the channel buffer size from environment or returns default
+pub fn get_channel_buffer_size() -> usize {
+    std::env::var("CHANNEL_BUFFER_SIZE")
+        .ok()
+        .and_then(|s| s.parse().ok())
+        .unwrap_or(DEFAULT_CHANNEL_BUFFER_SIZE)
+}
```

mistralrs-server-core/src/completions.rs (1)

183-251: Consider removing the redundant logprobs warning.

The warning about logprobs at line 198 is redundant since the handler already returns a validation error for this case at lines 267-271. Consider removing the warning here to avoid duplicate log messages.

```diff
-    if oairequest.logprobs.is_some() {
-        warn!("Completion requests do not support logprobs.");
-    }
```
mistralrs-server-core/src/chat_completion.rs (1)

59-86: Performance-conscious use of zero-cost abstractions.

The use of type aliases over generic base types is a good approach that:
- Maintains API compatibility
- Leverages Rust's monomorphization for zero runtime overhead
- Reduces code duplication without performance penalties

This addresses the PR author's concerns about avoiding performance regressions.

Also applies to: 167-168
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- mistralrs-server-core/src/chat_completion.rs (9 hunks)
- mistralrs-server-core/src/completion_core.rs (1 hunks)
- mistralrs-server-core/src/completions.rs (5 hunks)
- mistralrs-server-core/src/handler_core.rs (1 hunks)
- mistralrs-server-core/src/image_generation.rs (3 hunks)
- mistralrs-server-core/src/lib.rs (1 hunks)
- mistralrs-server-core/src/speech_generation.rs (6 hunks)
- mistralrs-server-core/src/streaming.rs (1 hunks)
- mistralrs-server-core/src/types.rs (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
mistralrs-server-core/src/completions.rs (5)
- mistralrs-server-core/src/completion_core.rs (3): convert_stop_tokens (38-44), get_dry_sampling_params (47-65), handle_completion_error (27-34)
- mistralrs-server-core/src/handler_core.rs (5): create_response_channel (82-87), process_non_streaming_response (102-118), send_model_request (90-99), new (36-38), new (73-78)
- mistralrs-server-core/src/streaming.rs (2): create_streamer (45-62), get_keep_alive_interval (65-74)
- mistralrs-core/src/lib.rs (3): maybe_log_response (580-592), maybe_log_request (567-578), maybe_log_error (594-605)
- mistralrs-server-core/src/chat_completion.rs (1): match_responses (482-506)
🔇 Additional comments (24)
mistralrs-server-core/src/types.rs (1)
16-50: Well-designed generic callback abstractions.

The new callback type aliases provide clean abstractions for streaming response processing with proper thread safety bounds and comprehensive documentation. The generic design enables reuse across different completion types while maintaining type safety.
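For readers without the diff at hand, callback aliases along these lines would satisfy the bounds described; the names and exact signatures here are assumptions, not the crate's actual definitions:

```rust
// Hypothetical shapes: C is the chunk type, D the full response type.
// `on_chunk` may transform each chunk in flight; `on_done` observes all
// chunks collected once the stream finishes.
pub type OnChunkCallback<C> = Box<dyn Fn(C) -> C + Send + Sync>;
pub type OnDoneCallback<D> = Box<dyn Fn(&[D]) + Send + Sync>;
```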
mistralrs-server-core/src/completion_core.rs (3)
11-24: Well-structured generic completion responder enum.

The `BaseCompletionResponder` enum provides a comprehensive set of response variants that cover all necessary completion scenarios. The generic design enables reuse across different completion types while maintaining clear semantics for each response case.

26-34: Proper error handling with logging integration.

The error handling function correctly integrates with the MistralRs logging system and provides consistent error conversion. The use of `anyhow::Error` for error chaining maintains error context while providing a uniform interface.

46-65: Robust parameter construction with error propagation.

The dry sampling parameter construction properly handles optional parameters and uses the builder pattern from the core library. Error propagation through the `Result` type ensures issues are caught early in the request processing pipeline.

mistralrs-server-core/src/streaming.rs (3)
13-21: Clear streaming lifecycle state management.

The `DoneState` enum provides clear semantics for tracking the streaming response lifecycle. The three states appropriately represent the progression from active streaming through completion signaling to final termination.
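A plausible reconstruction of those three states (only `SendingDone` is confirmed by snippets quoted later in this review; the other variant names are assumptions):

```rust
// Sketch: lifecycle of an SSE completion stream.
enum DoneState {
    Running,     // actively forwarding model chunks to the client
    SendingDone, // chunks finished; emit the final `data: [DONE]` event next
    Done,        // terminal state; poll_next returns Poll::Ready(None)
}
```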
23-42: Comprehensive streaming abstraction with good documentation.

The `BaseStreamer` struct provides a well-designed foundation for SSE streaming with clear field documentation and appropriate generic parameters. The separation of concerns between chunk processing, state management, and callback handling supports maintainable streaming implementations.
64-74: Safe environment configuration with proper error handling.

The keep-alive interval configuration properly handles environment variable parsing with appropriate fallback behavior. The warning logging for parse failures provides good debugging information while maintaining system stability.
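A sketch of that pattern, assuming the `KEEP_ALIVE_INTERVAL` variable name mentioned elsewhere in this review; the constant name, default value, and units here are assumptions:

```rust
use std::env;

const DEFAULT_KEEP_ALIVE_INTERVAL_MS: u64 = 10_000; // assumed default

pub fn get_keep_alive_interval() -> u64 {
    env::var("KEEP_ALIVE_INTERVAL")
        .map(|val| {
            val.parse::<u64>().unwrap_or_else(|e| {
                // Log the parse failure and fall back rather than panic.
                tracing::warn!("Failed to parse KEEP_ALIVE_INTERVAL: {e}. Using default.");
                DEFAULT_KEEP_ALIVE_INTERVAL_MS
            })
        })
        .unwrap_or(DEFAULT_KEEP_ALIVE_INTERVAL_MS)
}
```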
mistralrs-server-core/src/image_generation.rs (5)
1-24: Module refactoring looks good!

The imports are well-organized and the module documentation clearly describes the purpose. Good use of the new shared abstractions from `handler_core`.

26-46: Well-structured response type implementation!

The `ImageGenerationResponder` enum and its `IntoResponse` implementation properly leverage the shared error handling utilities. The HTTP status codes are appropriate for each error type.

48-82: Clean request parsing implementation!

Good job making the function public and adding comprehensive documentation. The use of `SamplingParams::deterministic()` is appropriate for image generation requests.

84-114: Excellent handler refactoring!

The handler now follows a consistent pattern using the shared utilities from `handler_core`. The use of `process_non_streaming_response` is appropriate since image generation doesn't support streaming.

116-154: Well-implemented helper functions!

The error handling and response matching logic is clean and consistent. The `unreachable!()` calls appropriately document which response types are not expected in the image generation context.

mistralrs-server-core/src/handler_core.rs (3)
18-48: Excellent error handling abstractions!

The `ErrorToResponse` trait and `JsonError` struct provide a clean, consistent way to handle error responses across the codebase. Good implementation of the necessary traits.

50-79: Well-designed error types for different scenarios!

The `ModelErrorMessage` wrapper and the generic `BaseJsonModelError` provide appropriate abstractions for different error cases. The ability to include partial responses with errors is particularly useful for streaming scenarios.
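For illustration, a wrapper of this kind might look roughly like the following (field and method names are assumptions, not the crate's actual definition):

```rust
// Hypothetical: pairs an error message with whatever the model produced
// before failing, so streaming clients can still surface partial output.
pub struct BaseJsonModelError<T> {
    pub message: String,
    pub partial_response: T,
}

impl<T> BaseJsonModelError<T> {
    pub fn new(message: String, partial_response: T) -> Self {
        Self { message, partial_response }
    }
}
```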
81-118: Excellent utility functions for common operations!

The utility functions are well-designed:
- `create_response_channel` properly handles buffer size configuration
- `send_model_request` provides good error context
- `process_non_streaming_response` uses a clever generic design with function pointers for flexibility

These abstractions will significantly reduce code duplication across handlers; a sketch of the function-pointer pattern follows this list.
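A minimal reconstruction of that pattern, with placeholder types standing in for the crate's real `Response` and shared state (the signature is inferred from snippets quoted later in this review and may differ from the actual code):

```rust
use tokio::sync::mpsc::Receiver;

// Placeholders for the crate's real types (illustrative only).
type SharedState = std::sync::Arc<()>;
pub enum Response { Done, Error(String) }

pub async fn base_process_non_streaming_response<R>(
    rx: &mut Receiver<Response>,
    state: SharedState,
    match_fn: fn(SharedState, Response) -> R,
    error_handler: fn(SharedState, Box<dyn std::error::Error + Send + Sync + 'static>) -> R,
) -> R {
    match rx.recv().await {
        // A response arrived: let the caller-supplied matcher turn it into R.
        Some(response) => match_fn(state, response),
        // Channel closed without a response: route through the error handler.
        None => error_handler(state, "response channel closed unexpectedly".into()),
    }
}
```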
mistralrs-server-core/src/completions.rs (4)
1-81: Excellent callback design and documentation!

The callback types `CompletionOnChunkCallback` and `CompletionOnDoneCallback` provide great flexibility for stream processing. The documentation with practical examples is particularly helpful. Good use of type aliases to improve code readability.

83-156: Well-implemented streaming logic with callback support!

The `poll_next` implementation properly handles:
- State transitions with the `DoneState` enum
- Conditional chunk storage based on `on_done` callback presence
- Chunk modification via the `on_chunk` callback
- Proper error logging and stream termination

The optimization to only store chunks when needed is a nice touch.

158-181: Clean response type implementation using shared abstractions!

Good use of type aliases and the generic base types. The `JsonModelError` type appropriately handles cases where partial responses need to be included with error messages.
253-329: Excellent handler refactoring with proper SSE configuration!

The handler and helper functions are well-implemented:
- Consistent use of shared utilities from `handler_core`
- Proper SSE configuration with keep-alive intervals (see the sketch after this list)
- Clean response matching with appropriate `unreachable!()` calls
- Good separation of concerns between parsing, handling, and response creation
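A brief sketch of the keep-alive wiring in axum's SSE API (the streamer type itself is elided; this assumes the project uses axum's `Sse` response, as its route signatures suggest):

```rust
use std::time::Duration;
use axum::response::sse::{Event, KeepAlive, Sse};
use futures::TryStream;

// Attach a keep-alive interval so idle connections receive periodic
// comment events instead of timing out at proxies.
fn with_keep_alive<S>(stream: S, interval_ms: u64) -> Sse<S>
where
    S: TryStream<Ok = Event> + Send + 'static,
    S::Error: Into<axum::BoxError>,
{
    Sse::new(stream).keep_alive(KeepAlive::new().interval(Duration::from_millis(interval_ms)))
}
```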
mistralrs-server-core/src/speech_generation.rs (3)
1-2: LGTM! Clear module documentation added.

The addition of module-level documentation improves code clarity.

17-22: Good modularization with shared utilities.

The imports from the `handler_core` and `types` modules properly leverage the new shared abstractions, reducing code duplication across route handlers.

91-113: Excellent use of shared utilities for consistency.

The refactored endpoint handler properly leverages:
- `create_response_channel` for channel setup
- `send_model_request` for centralized request handling
- A consistent error handling pattern

This aligns well with the consolidation goals.
mistralrs-server-core/src/chat_completion.rs (2)
59-86: Excellent use of type aliases for API consistency.

The type aliases maintain backward compatibility while leveraging generic implementations. The comprehensive documentation for callback types is particularly helpful.

450-464: Clean refactoring with consistent error handling.

The endpoint handler now uses centralized utilities effectively:
- Consistent error handling via `handle_completion_error`
- Shared request sending logic via `send_model_request`
- Unified response processing with callback support

This significantly improves maintainability.
mistralrs-server-core/src/lib.rs (Outdated)

```diff
 pub mod completion_core;
 pub mod completions;
 pub mod handler_core;
 mod handlers;
-mod image_generation;
+pub mod image_generation;
 pub mod mistralrs_for_server_builder;
 pub mod mistralrs_server_router_builder;
 pub mod openai;
 pub mod openapi_doc;
-mod speech_generation;
+pub mod speech_generation;
 pub mod streaming;
```
🛠️ Refactor suggestion
Consider the implications of expanding the public API surface.
Making these internal modules public supports the consolidation goals and enables the documented usage patterns. However, this significantly expands the public API surface area, which increases maintenance commitments and backward compatibility obligations.
Ensure that the exposed APIs are stable and well-documented before release, as changes to public modules will be breaking changes for consumers.
🤖 Prompt for AI Agents
In mistralrs-server-core/src/lib.rs around lines 218 to 228, the modules are
being made public, which expands the public API surface. Review each module to
confirm its API is stable and well-documented before making it public. Add or
improve documentation comments for all public items and ensure the APIs are
finalized to avoid breaking changes for consumers after release.
Actionable comments posted: 1
🔭 Outside diff range comments (3)
mistralrs-server-core/src/completions.rs (1)
145-151: 🛠️ Refactor suggestion — Consider replacing unreachable!() with explicit error handling.

The match arms contain several `unreachable!()` calls which will cause panics if these variants become reachable due to future code changes. Consider using explicit error handling or logging instead.

```diff
-            Response::Done(_) => unreachable!(),
-            Response::CompletionDone(_) => unreachable!(),
-            Response::Chunk(_) => unreachable!(),
-            Response::ImageGeneration(_) => unreachable!(),
-            Response::ModelError(_, _) => unreachable!(),
-            Response::Speech { .. } => unreachable!(),
-            Response::Raw { .. } => unreachable!(),
+            Response::Done(_) => {
+                warn!("Unexpected Done response in completion stream");
+                Poll::Pending
+            }
+            Response::CompletionDone(_) => {
+                warn!("Unexpected CompletionDone response in completion stream");
+                Poll::Pending
+            }
+            Response::Chunk(_) => {
+                warn!("Unexpected Chunk response in completion stream");
+                Poll::Pending
+            }
+            Response::ImageGeneration(_) => {
+                warn!("Unexpected ImageGeneration response in completion stream");
+                Poll::Pending
+            }
+            Response::ModelError(_, _) => {
+                warn!("Unexpected ModelError response in completion stream");
+                Poll::Pending
+            }
+            Response::Speech { .. } => {
+                warn!("Unexpected Speech response in completion stream");
+                Poll::Pending
+            }
+            Response::Raw { .. } => {
+                warn!("Unexpected Raw response in completion stream");
+                Poll::Pending
+            }
```

mistralrs-server-core/src/chat_completion.rs (2)
153-159: 🛠️ Refactor suggestion — Same unreachable!() issue as in the completions module.

This streaming implementation has the same issue with multiple `unreachable!()` calls that could cause panics if the assumptions change. Apply the same approach as suggested for the completions module: replace the `unreachable!()` calls with proper error handling and logging.

498-504: 🛠️ Refactor suggestion — Same unreachable!() pattern needs addressing.

The `match_responses` function has the same issue with `unreachable!()` calls. For consistency and safety, apply the same error handling pattern suggested for the completions module's `match_responses` function.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- mistralrs-server-core/src/chat_completion.rs (9 hunks)
- mistralrs-server-core/src/completions.rs (5 hunks)
- mistralrs-server-core/src/lib.rs (4 hunks)
- mistralrs-server-core/src/types.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- mistralrs-server-core/src/types.rs
- mistralrs-server-core/src/lib.rs
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Clippy
- GitHub Check: Check (windows-latest, stable)
- GitHub Check: Docs
- GitHub Check: Test Suite (windows-latest, stable)
- GitHub Check: Test Suite (macOS-latest, stable)
- GitHub Check: Check (macOS-latest, stable)
- GitHub Check: Test Suite (ubuntu-latest, stable)
- GitHub Check: Check (ubuntu-latest, stable)
🔇 Additional comments (8)
mistralrs-server-core/src/completions.rs (4)
40-74: Well-documented type aliases with clear examples.

The type aliases for `CompletionOnChunkCallback` and `CompletionOnDoneCallback` are well-documented with clear examples and usage patterns. This provides a good developer experience for the callback system.

183-251: Approve the simplified parse_request function.

The refactoring successfully delegates complex logic to helper functions (`convert_stop_tokens`, `get_dry_sampling_params`), which improves readability and code reuse. The function structure remains clear and maintainable.

261-288: Good consolidation of common handler patterns.

The handler function effectively uses the shared utilities from `handler_core` (`create_response_channel`, `send_model_request`, `process_non_streaming_response`) and centralized error handling. This reduces code duplication across different completion types.

26-38: ✅ Verification successful

Verify that the new imported modules exist and are properly structured.

The refactoring introduces several new module imports (`completion_core`, `handler_core`, `streaming`, `types`). Ensure that these modules are correctly implemented and exported, especially since this appears to be part of a larger consolidation effort.
🏁 Script executed:

```bash
#!/bin/bash
# Description: Verify that the imported modules exist and are properly structured
# Expected: All imported modules should exist and export the expected items

echo "Checking for existence of imported modules..."
fd -t f -e rs completion_core.rs handler_core.rs streaming.rs types.rs

echo -e "\nChecking exports in completion_core.rs..."
if [ -f "mistralrs-server-core/src/completion_core.rs" ]; then
  rg -A 3 "pub.*convert_stop_tokens|pub.*get_dry_sampling_params|pub.*handle_completion_error|pub.*BaseCompletionResponder" mistralrs-server-core/src/completion_core.rs
else
  echo "completion_core.rs not found"
fi

echo -e "\nChecking exports in handler_core.rs..."
if [ -f "mistralrs-server-core/src/handler_core.rs" ]; then
  rg -A 3 "pub.*create_response_channel|pub.*send_model_request|pub.*process_non_streaming_response" mistralrs-server-core/src/handler_core.rs
else
  echo "handler_core.rs not found"
fi

echo -e "\nChecking exports in streaming.rs..."
if [ -f "mistralrs-server-core/src/streaming.rs" ]; then
  rg -A 3 "pub.*BaseStreamer|pub.*DoneState|pub.*create_streamer" mistralrs-server-core/src/streaming.rs
else
  echo "streaming.rs not found"
fi
```

Length of output: 3273
🏁 Script executed:

```bash
#!/bin/bash
# Verify handler_core exports BaseJsonModelError, ErrorToResponse, JsonError, ModelErrorMessage
echo "Checking exports in handler_core.rs for error types..."
if [ -f "mistralrs-server-core/src/handler_core.rs" ]; then
  rg -A 3 "pub.*BaseJsonModelError|pub.*ErrorToResponse|pub.*JsonError|pub.*ModelErrorMessage" mistralrs-server-core/src/handler_core.rs || echo "Some error types not found"
else
  echo "handler_core.rs not found"
fi

# Verify openai module exports CompletionRequest and Grammar
echo -e "\nChecking exports in openai.rs..."
if [ -f "mistralrs-server-core/src/openai.rs" ]; then
  rg -A 3 "pub.*CompletionRequest|pub.*Grammar" mistralrs-server-core/src/openai.rs || echo "CompletionRequest or Grammar not found"
else
  echo "openai.rs not found"
fi

# Verify types module exports state and callback types
echo -e "\nChecking exports in types.rs..."
if [ -f "mistralrs-server-core/src/types.rs" ]; then
  rg -A 3 "pub.*ExtractedMistralRsState|pub.*OnChunkCallback|pub.*OnDoneCallback|pub.*SharedMistralRsState" mistralrs-server-core/src/types.rs || echo "Some type aliases not found"
else
  echo "types.rs not found"
fi
```

Length of output: 2770
Imports Verified
All newly imported modules and symbols are present and correctly exported:
- completion_core: `convert_stop_tokens`, `get_dry_sampling_params`, `handle_completion_error`, `BaseCompletionResponder`
- handler_core: `create_response_channel`, `send_model_request`, `process_non_streaming_response`, `BaseJsonModelError`, `ErrorToResponse`, `JsonError`, `ModelErrorMessage`
- streaming: `BaseStreamer`, `DoneState`, `create_streamer`
- openai: `CompletionRequest`, `Grammar`
- types: `ExtractedMistralRsState`, `OnChunkCallback`, `OnDoneCallback`, `SharedMistralRsState`

No missing exports or structural issues detected.
mistralrs-server-core/src/chat_completion.rs (4)
24-40: Consistent modular import structure.

The import structure follows the same pattern as the completions module, using the new generic base abstractions. This demonstrates good consistency in the refactoring approach.

42-86: Well-structured type aliases with consistent documentation.

The type aliases for chat completion callbacks and streamers follow the same pattern as the completions module, maintaining consistency across the codebase. The documentation is clear and provides good examples.

197-434: Complex but well-structured parse_request function.

The `parse_request` function handles the complexity of chat messages, including image processing and various message formats. The refactoring to use helper functions (`convert_stop_tokens`, `get_dry_sampling_params`) maintains the same consolidation benefits while preserving the necessary complexity for chat handling.

444-465: Consistent handler implementation.

The `chatcompletions` handler follows the same pattern as the completions handler, using the shared utilities and centralized error handling. This demonstrates successful consolidation across different completion types.
```rust
Response::CompletionChunk(_) => unreachable!(),
Response::Chunk(_) => unreachable!(),
Response::Done(_) => unreachable!(),
Response::ModelError(_, _) => unreachable!(),
Response::ImageGeneration(_) => unreachable!(),
Response::Speech { .. } => unreachable!(),
Response::Raw { .. } => unreachable!(),
```
🛠️ Refactor suggestion
Replace unreachable!() calls in match_responses function.
Similar to the streaming implementation, this function contains multiple unreachable!() calls that could cause panics if the response variants become reachable in the future.
- Response::CompletionChunk(_) => unreachable!(),
- Response::Chunk(_) => unreachable!(),
- Response::Done(_) => unreachable!(),
- Response::ModelError(_, _) => unreachable!(),
- Response::ImageGeneration(_) => unreachable!(),
- Response::Speech { .. } => unreachable!(),
- Response::Raw { .. } => unreachable!(),
+ Response::CompletionChunk(_) => {
+ warn!("Unexpected CompletionChunk response in non-streaming context");
+ CompletionsResponder::InternalError("Unexpected response type".into())
+ }
+ Response::Chunk(_) => {
+ warn!("Unexpected Chunk response in completions context");
+ CompletionsResponder::InternalError("Unexpected response type".into())
+ }
+ Response::Done(_) => {
+ warn!("Unexpected Done response in completions context");
+ CompletionsResponder::InternalError("Unexpected response type".into())
+ }
+ Response::ModelError(_, _) => {
+ warn!("Unexpected ModelError response in completions context");
+ CompletionsResponder::InternalError("Unexpected response type".into())
+ }
+ Response::ImageGeneration(_) => {
+ warn!("Unexpected ImageGeneration response in completions context");
+ CompletionsResponder::InternalError("Unexpected response type".into())
+ }
+ Response::Speech { .. } => {
+ warn!("Unexpected Speech response in completions context");
+ CompletionsResponder::InternalError("Unexpected response type".into())
+ }
+ Response::Raw { .. } => {
+ warn!("Unexpected Raw response in completions context");
+ CompletionsResponder::InternalError("Unexpected response type".into())
+ }Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In mistralrs-server-core/src/completions.rs around lines 321 to 327, the
match_responses function uses unreachable!() for several Response variants,
which risks panics if those variants become reachable. Replace these
unreachable!() calls with proper error handling or logging to safely handle
unexpected variants without panicking, ensuring the function gracefully manages
all possible Response cases.
Want to clean up something real quick before review.
Actionable comments posted: 0
♻️ Duplicate comments (1)
mistralrs-server-core/src/completions.rs (1)
330-336: Replace unreachable!() calls with proper error handling.

As noted in previous reviews, these `unreachable!()` calls should be replaced with proper error handling to avoid potential panics if the response variants become reachable in the future.
🧹 Nitpick comments (3)
mistralrs-server-core/src/handler_core.rs (2)
11-16: Consider reducing the default channel buffer size.

A buffer size of 10,000 seems excessive and could lead to high memory usage with many concurrent requests. Consider a more conservative default (e.g., 100-1000) and make it configurable via environment variables, similar to `KEEP_ALIVE_INTERVAL`.

65-80: Consider implementing standard error traits for BaseJsonModelError.

The struct should implement the `std::fmt::Display` and `std::error::Error` traits for consistency with other error types. Also, consider whether `ErrorToResponse` should be implemented here or left to specific type aliases.

```diff
+impl<T> std::fmt::Display for BaseJsonModelError<T> {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        write!(f, "{}", self.message)
+    }
+}
+
+impl<T: std::fmt::Debug> std::error::Error for BaseJsonModelError<T> {}
```

mistralrs-server-core/src/chat_completion.rs (1)
513-519: Consider replacing unreachable!() calls with proper error handling.

Similar to the completions module, these `unreachable!()` calls could cause panics if the response variants become reachable. Consider logging warnings and returning appropriate error responses instead.

```diff
-            Response::Chunk(_) => unreachable!(),
-            Response::CompletionDone(_) => unreachable!(),
-            Response::CompletionModelError(_, _) => unreachable!(),
-            Response::CompletionChunk(_) => unreachable!(),
-            Response::ImageGeneration(_) => unreachable!(),
-            Response::Speech { .. } => unreachable!(),
-            Response::Raw { .. } => unreachable!(),
+            Response::Chunk(_) => {
+                tracing::warn!("Unexpected Chunk response in chat completion context");
+                ChatCompletionResponder::InternalError("Unexpected response type".into())
+            }
+            Response::CompletionDone(_) => {
+                tracing::warn!("Unexpected CompletionDone response in chat completion context");
+                ChatCompletionResponder::InternalError("Unexpected response type".into())
+            }
+            Response::CompletionModelError(_, _) => {
+                tracing::warn!("Unexpected CompletionModelError response in chat completion context");
+                ChatCompletionResponder::InternalError("Unexpected response type".into())
+            }
+            Response::CompletionChunk(_) => {
+                tracing::warn!("Unexpected CompletionChunk response in chat completion context");
+                ChatCompletionResponder::InternalError("Unexpected response type".into())
+            }
+            Response::ImageGeneration(_) => {
+                tracing::warn!("Unexpected ImageGeneration response in chat completion context");
+                ChatCompletionResponder::InternalError("Unexpected response type".into())
+            }
+            Response::Speech { .. } => {
+                tracing::warn!("Unexpected Speech response in chat completion context");
+                ChatCompletionResponder::InternalError("Unexpected response type".into())
+            }
+            Response::Raw { .. } => {
+                tracing::warn!("Unexpected Raw response in chat completion context");
+                ChatCompletionResponder::InternalError("Unexpected response type".into())
+            }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- mistralrs-server-core/src/chat_completion.rs (9 hunks)
- mistralrs-server-core/src/completions.rs (6 hunks)
- mistralrs-server-core/src/handler_core.rs (1 hunks)
- mistralrs-server-core/src/image_generation.rs (3 hunks)
- mistralrs-server-core/src/lib.rs (5 hunks)
- mistralrs-server-core/src/speech_generation.rs (6 hunks)
- mistralrs-server-core/src/streaming.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- mistralrs-server-core/src/streaming.rs
- mistralrs-server-core/src/lib.rs
- mistralrs-server-core/src/image_generation.rs
- mistralrs-server-core/src/speech_generation.rs
🧰 Additional context used
🧬 Code Graph Analysis (1)
mistralrs-server-core/src/completions.rs (5)
- mistralrs-server-core/src/completion_core.rs (3): convert_stop_tokens (38-44), get_dry_sampling_params (47-65), handle_completion_error (27-34)
- mistralrs-server-core/src/handler_core.rs (5): base_process_non_streaming_response (102-118), create_response_channel (82-87), send_model_request (90-99), new (36-38), new (73-78)
- mistralrs-server-core/src/streaming.rs (2): base_create_streamer (45-62), get_keep_alive_interval (65-74)
- mistralrs-core/src/lib.rs (3): maybe_log_response (580-592), maybe_log_request (567-578), maybe_log_error (594-605)
- mistralrs-server-core/src/chat_completion.rs (4): handle_error (467-472), create_streamer (475-486), process_non_streaming_response (489-494), match_responses (497-521)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Docs
- GitHub Check: Clippy
- GitHub Check: Check (macOS-latest, stable)
- GitHub Check: Check (windows-latest, stable)
- GitHub Check: Check (ubuntu-latest, stable)
- GitHub Check: Test Suite (macOS-latest, stable)
- GitHub Check: Test Suite (ubuntu-latest, stable)
- GitHub Check: Test Suite (windows-latest, stable)
🔇 Additional comments (14)
mistralrs-server-core/src/handler_core.rs (6)
18-26: Well-designed error conversion trait.

The `ErrorToResponse` trait provides a clean abstraction for converting errors to HTTP responses with appropriate status codes.

28-49: Clean implementation of JsonError struct.

The struct properly implements all necessary traits and follows Rust error handling best practices.

50-64: Good use of newtype pattern for ModelErrorMessage.

The newtype wrapper provides semantic clarity and proper error trait implementations.

82-87: Clean channel creation utility.

The function provides a simple interface for creating response channels with configurable buffer sizes.

89-99: Well-implemented request sending utility.

Good use of anyhow's context for meaningful error messages and proper async/await handling.

101-118: Excellent generic abstraction for non-streaming response processing.

The function provides a flexible and reusable pattern for handling responses with custom matching and error handling logic.
mistralrs-server-core/src/completions.rs (5)
26-38: Clean import organization reflecting the new modular architecture.

The imports are well-structured, showing clear separation between different core modules.

40-82: Excellent documentation and type alias definitions.

The type aliases provide clear abstractions and are well-documented with examples, improving API usability.

83-156: Well-implemented streaming logic with callback support.

The Stream implementation properly handles chunk processing, completion callbacks, and chunk storage when needed.

195-209: Good refactoring to use helper functions.

The use of the `convert_stop_tokens` and `get_dry_sampling_params` helper functions improves code modularity and reusability.

261-281: Clean and consistent handler implementation.

The refactored handler effectively uses the new helper functions, resulting in cleaner and more maintainable code.
mistralrs-server-core/src/chat_completion.rs (3)
42-86: Consistent refactoring pattern with completions module.

The type aliases and abstractions follow the same pattern as in completions.rs, ensuring consistency across the codebase.

344-346: Good extraction of image URL parsing to utility function.

The use of `parse_image_url` from the util module improves code reusability and maintainability.

466-495: Well-structured helper functions following consistent patterns.

The helper functions provide clean abstractions for error handling, streaming, and response processing, maintaining consistency with the completions module.
@coderabbitai review
✅ Actions performed

Review triggered.
3881315 to dfc215b (Compare)
Actionable comments posted: 0
🔭 Outside diff range comments (1)
mistralrs-server-core/src/speech_generation.rs (1)
180-186: ⚠️ Potential issue — Fix buffer capacity calculation in PCM encoding.

The buffer capacity calculation uses `size_of::<i64>()` but the actual data written is `i16` (2 bytes), causing over-allocation.

```diff
-    let mut buf = Vec::with_capacity(samples.len() * std::mem::size_of::<i64>());
+    let mut buf = Vec::with_capacity(samples.len() * std::mem::size_of::<i16>());
```
♻️ Duplicate comments (1)
mistralrs-server-core/src/completions.rs (1)
330-336: ⚠️ Potential issue — Replace unreachable!() calls in the match_responses function.

Similar to the streaming implementation, this function contains multiple `unreachable!()` calls that could cause panics if the response variants become reachable in the future. This issue was previously identified in past reviews.

```diff
-            Response::CompletionChunk(_) => unreachable!(),
-            Response::Chunk(_) => unreachable!(),
-            Response::Done(_) => unreachable!(),
-            Response::ModelError(_, _) => unreachable!(),
-            Response::ImageGeneration(_) => unreachable!(),
-            Response::Speech { .. } => unreachable!(),
-            Response::Raw { .. } => unreachable!(),
+            Response::CompletionChunk(_) => {
+                warn!("Unexpected CompletionChunk response in non-streaming context");
+                CompletionResponder::InternalError("Unexpected response type".into())
+            }
+            Response::Chunk(_) => {
+                warn!("Unexpected Chunk response in completions context");
+                CompletionResponder::InternalError("Unexpected response type".into())
+            }
+            Response::Done(_) => {
+                warn!("Unexpected Done response in completions context");
+                CompletionResponder::InternalError("Unexpected response type".into())
+            }
+            Response::ModelError(_, _) => {
+                warn!("Unexpected ModelError response in completions context");
+                CompletionResponder::InternalError("Unexpected response type".into())
+            }
+            Response::ImageGeneration(_) => {
+                warn!("Unexpected ImageGeneration response in completions context");
+                CompletionResponder::InternalError("Unexpected response type".into())
+            }
+            Response::Speech { .. } => {
+                warn!("Unexpected Speech response in completions context");
+                CompletionResponder::InternalError("Unexpected response type".into())
+            }
+            Response::Raw { .. } => {
+                warn!("Unexpected Raw response in completions context");
+                CompletionResponder::InternalError("Unexpected response type".into())
+            }
```
🧹 Nitpick comments (2)
mistralrs-server-core/src/handler_core.rs (1)
16-16: Consider reducing the default buffer size.

A buffer size of 10,000 seems excessive for most streaming scenarios and could lead to unnecessary memory usage. Consider reducing this to a more reasonable value like 1,000, or making it configurable based on expected load patterns.

```diff
-pub const DEFAULT_CHANNEL_BUFFER_SIZE: usize = 10_000;
+pub const DEFAULT_CHANNEL_BUFFER_SIZE: usize = 1_000;
```

mistralrs-server-core/src/chat_completion.rs (1)
88-164: Consider consolidating the Stream implementation pattern.

The local `impl futures::Stream for ChatCompletionStreamer` implementation appears to duplicate logic that could potentially be generalized. Since `ChatCompletionStreamer` is now a type alias to `BaseStreamer`, consider whether this streaming logic could be moved to a generic implementation on `BaseStreamer` itself. This would further align with the consolidation goals and reduce code duplication across different completion types.

```rust
// Consider if this logic could be moved to a generic impl on BaseStreamer
impl<R, C, D> futures::Stream for BaseStreamer<R, C, D>
where
    R: /* appropriate bounds */,
    C: /* appropriate bounds */,
    D: /* appropriate bounds */,
{
    // Generic streaming logic
}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- mistralrs-server-core/src/chat_completion.rs (9 hunks)
- mistralrs-server-core/src/completion_core.rs (1 hunks)
- mistralrs-server-core/src/completions.rs (6 hunks)
- mistralrs-server-core/src/handler_core.rs (1 hunks)
- mistralrs-server-core/src/image_generation.rs (3 hunks)
- mistralrs-server-core/src/lib.rs (5 hunks)
- mistralrs-server-core/src/speech_generation.rs (6 hunks)
- mistralrs-server-core/src/streaming.rs (1 hunks)
- mistralrs-server-core/src/types.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- mistralrs-server-core/src/types.rs
- mistralrs-server-core/src/lib.rs
- mistralrs-server-core/src/streaming.rs
- mistralrs-server-core/src/completion_core.rs
🧰 Additional context used
🧬 Code Graph Analysis (4)
mistralrs-server-core/src/speech_generation.rs (3)
- mistralrs-server-core/src/handler_core.rs (2): create_response_channel (82-87), send_model_request (90-99)
- mistralrs-server-core/src/image_generation.rs (3): parse_request (52-82), handle_error (111-118), match_responses (129-156)
- mistralrs-core/src/lib.rs (1): maybe_log_error (594-605)

mistralrs-server-core/src/completions.rs (5)
- mistralrs-server-core/src/completion_core.rs (3): convert_stop_tokens (38-44), get_dry_sampling_params (47-65), handle_completion_error (27-34)
- mistralrs-server-core/src/handler_core.rs (5): base_process_non_streaming_response (102-118), create_response_channel (82-87), send_model_request (90-99), new (36-38), new (73-78)
- mistralrs-server-core/src/streaming.rs (2): base_create_streamer (45-62), get_keep_alive_interval (65-74)
- mistralrs-core/src/lib.rs (3): maybe_log_response (580-592), maybe_log_request (567-578), maybe_log_error (594-605)
- mistralrs-server-core/src/chat_completion.rs (4): handle_error (467-472), create_streamer (475-486), process_non_streaming_response (489-494), match_responses (497-521)

mistralrs-server-core/src/image_generation.rs (5)
- mistralrs-server-core/src/speech_generation.rs (4): std (182-182), handle_error (116-123), process_non_streaming_response (126-140), match_responses (143-204)
- mistralrs-server-core/src/handler_core.rs (3): base_process_non_streaming_response (102-118), create_response_channel (82-87), send_model_request (90-99)
- mistralrs-server-core/src/completions.rs (3): handle_error (284-289), process_non_streaming_response (306-311), match_responses (314-338)
- mistralrs-server-core/src/chat_completion.rs (3): handle_error (467-472), process_non_streaming_response (489-494), match_responses (497-521)
- mistralrs-core/src/lib.rs (1): maybe_log_error (594-605)

mistralrs-server-core/src/chat_completion.rs (5)
- mistralrs-server-core/src/completion_core.rs (3): convert_stop_tokens (38-44), get_dry_sampling_params (47-65), handle_completion_error (27-34)
- mistralrs-server-core/src/handler_core.rs (5): base_process_non_streaming_response (102-118), create_response_channel (82-87), send_model_request (90-99), new (36-38), new (73-78)
- mistralrs-server-core/src/streaming.rs (2): base_create_streamer (45-62), get_keep_alive_interval (65-74)
- mistralrs-server-core/src/util.rs (1): parse_image_url (41-81)
- mistralrs-server-core/src/completions.rs (4): handle_error (284-289), create_streamer (292-303), process_non_streaming_response (306-311), match_responses (314-338)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Docs
- GitHub Check: Check (windows-latest, stable)
- GitHub Check: Test Suite (ubuntu-latest, stable)
- GitHub Check: Clippy
- GitHub Check: Check (macOS-latest, stable)
- GitHub Check: Test Suite (macOS-latest, stable)
- GitHub Check: Check (ubuntu-latest, stable)
- GitHub Check: Test Suite (windows-latest, stable)
🔇 Additional comments (21)
mistralrs-server-core/src/handler_core.rs (2)
18-48: Excellent abstraction design for error handling.

The `ErrorToResponse` trait and `JsonError` struct provide a clean, reusable approach to HTTP error response generation. The design promotes consistency across different handlers and reduces code duplication.

101-118: Well-designed generic response processing function.

The `base_process_non_streaming_response` function effectively abstracts the common pattern of receiving and processing model responses, using function pointers for flexible handling of different response types while maintaining type safety.

mistralrs-server-core/src/image_generation.rs (4)
1-24: Excellent consolidation of shared functionality.

The refactored imports demonstrate effective use of the new shared abstractions from `handler_core`. This consolidates error handling, channel management, and response processing into reusable components, reducing code duplication across handlers.

48-52: Good API design with public visibility and documentation.

Making `parse_request` public and adding comprehensive documentation improves the module's testability and reusability. The documentation clearly explains the function's role in request transformation.

96-108: Clean handler implementation using shared utilities.

The refactored handler effectively leverages shared utilities from `handler_core`, resulting in cleaner, more maintainable code. The error handling is now consistent with other handlers in the codebase.

128-156: Comprehensive response matching with proper error handling.

The `match_responses` function handles all relevant response variants appropriately, with proper logging for errors and responses. The `unreachable!()` calls are justified here, as these response types should not occur in the image generation context.

mistralrs-server-core/src/speech_generation.rs (3)
18-22: Excellent use of shared abstractions.

The refactored imports effectively leverage shared functionality from `handler_core`, promoting code reuse and consistency across speech generation handling.

46-50: Well-documented public API for request parsing.

Making `parse_request` public with comprehensive documentation enhances the module's testability and provides clear interface specifications for speech generation request transformation.

188-194: Proper WAV encoding implementation.

The WAV encoding logic correctly delegates to `speech_utils::write_pcm_as_wav` with appropriate parameters for rate and channels, ensuring proper audio format conversion.

mistralrs-server-core/src/completions.rs (5)
40-74: Excellent callback type definitions with comprehensive documentation.

The new callback type definitions for `CompletionOnChunkCallback` and `CompletionOnDoneCallback` provide clear interfaces for streaming customization. The documentation includes helpful examples that demonstrate proper usage patterns.

102-104: Enhanced streaming with proper callback integration.

The streaming logic now properly integrates completion callbacks, allowing for post-processing operations when the stream finishes. This enhancement improves the flexibility and extensibility of the completion system.

135-141: Well-implemented chunk processing with callbacks.

The chunk processing logic correctly applies optional chunk modification callbacks and stores chunks when a completion callback is configured. This design supports both real-time modification and post-stream analysis.

203-208: Improved request parsing using shared utilities.

The refactored request parsing effectively uses helper functions from `completion_core` for stop token conversion and dry sampling parameters, reducing code duplication and improving maintainability.

291-303: Clean streamer creation with configurable keep-alive.

The `create_streamer` function provides a clean interface for creating SSE streamers with optional callbacks and environment-configurable keep-alive intervals, improving both usability and operational flexibility.

mistralrs-server-core/src/chat_completion.rs (7)
25-40: LGTM: Clean modular imports align with consolidation goals.

The import reorganization successfully brings in the generic base abstractions from the `completion_core`, `handler_core`, and `streaming` modules, supporting the goal of reducing redundancy across route handlers.

59-59: LGTM: Type aliases provide a clean API while using generic implementations.

The type aliases like `ChatCompletionOnChunkCallback`, `ChatCompletionOnDoneCallback`, and `ChatCompletionStreamer` maintain a clean, domain-specific API while leveraging the generic `BaseStreamer` and callback types from the core modules.

Also applies to: 76-76, 82-86

167-168: LGTM: Consistent use of generic base responder types.

The `ChatCompletionResponder` type alias correctly uses `BaseCompletionResponder` with appropriate type parameters, maintaining consistency with the consolidation pattern.

Also applies to: 170-170

205-205: LGTM: Effective delegation to helper functions.

The replacement of local logic with calls to `convert_stop_tokens` and `get_dry_sampling_params` from `completion_core` successfully eliminates code duplication while maintaining the same functionality.

Also applies to: 376-381

344-344: LGTM: Consistent utility function usage.

The use of `parse_image_url` from the `util` module maintains consistency with the modular approach and eliminates potential code duplication.

448-448: LGTM: Clean delegation to generic handler functions.

The main handler function effectively uses the generic functions from `handler_core`:
- `create_response_channel` for channel creation
- `send_model_request` for request sending
- `create_streamer` and `process_non_streaming_response` for response handling

This successfully reduces redundancy while maintaining the same API contract.

Also applies to: 455-456, 460-460, 462-462

467-471: LGTM: Thin wrapper functions maintain a clean API.

The utility functions now serve as clean, domain-specific wrappers around the generic base implementations:
- `handle_error` delegates to `handle_completion_error`
- `create_streamer` delegates to `base_create_streamer`
- `process_non_streaming_response` delegates to `base_process_non_streaming_response`

This pattern successfully achieves consolidation while preserving a clean, typed API for chat completions.

Also applies to: 475-486, 489-494
dfc215b to 9654ad9 (Compare)
…ing implementations
…ming and non streaming)
…rse request also checks this
…) and provide proxy fn that wrap core fns so core mod doesn't have to be pub
Make lib.rs example compile checked and update example
9654ad9 to 541e3f9 (Compare)
Actionable comments posted: 1
🔭 Outside diff range comments (2)
mistralrs-server-core/src/chat_completion.rs (2)
153-159: ⚠️ Potential issue — Replace unreachable!() calls with proper error handling.

These `unreachable!()` calls pose the same risk as in the completions module and should be replaced with proper error handling.

```diff
-            Response::Done(_) => unreachable!(),
-            Response::CompletionDone(_) => unreachable!(),
-            Response::CompletionModelError(_, _) => unreachable!(),
-            Response::CompletionChunk(_) => unreachable!(),
-            Response::ImageGeneration(_) => unreachable!(),
-            Response::Speech { .. } => unreachable!(),
-            Response::Raw { .. } => unreachable!(),
+            Response::Done(_) => {
+                tracing::warn!("Unexpected Done response in chat completion streaming context");
+                self.done_state = DoneState::SendingDone;
+                Poll::Ready(Some(Ok(Event::default().data("Unexpected response type"))))
+            }
+            Response::CompletionDone(_) => {
+                tracing::warn!("Unexpected CompletionDone response in chat streaming context");
+                self.done_state = DoneState::SendingDone;
+                Poll::Ready(Some(Ok(Event::default().data("Unexpected response type"))))
+            }
+            Response::CompletionModelError(_, _) => {
+                tracing::warn!("Unexpected CompletionModelError response in chat streaming context");
+                self.done_state = DoneState::SendingDone;
+                Poll::Ready(Some(Ok(Event::default().data("Unexpected response type"))))
+            }
+            Response::CompletionChunk(_) => {
+                tracing::warn!("Unexpected CompletionChunk response in chat context");
+                self.done_state = DoneState::SendingDone;
+                Poll::Ready(Some(Ok(Event::default().data("Unexpected response type"))))
+            }
+            Response::ImageGeneration(_) => {
+                tracing::warn!("Unexpected ImageGeneration response in chat context");
+                self.done_state = DoneState::SendingDone;
+                Poll::Ready(Some(Ok(Event::default().data("Unexpected response type"))))
+            }
+            Response::Speech { .. } => {
+                tracing::warn!("Unexpected Speech response in chat context");
+                self.done_state = DoneState::SendingDone;
+                Poll::Ready(Some(Ok(Event::default().data("Unexpected response type"))))
+            }
+            Response::Raw { .. } => {
+                tracing::warn!("Unexpected Raw response in chat context");
+                self.done_state = DoneState::SendingDone;
+                Poll::Ready(Some(Ok(Event::default().data("Unexpected response type"))))
+            }
```

559-565: ⚠️ Potential issue — Address unreachable!() calls in match_responses.

Consistent with the completions module, these `unreachable!()` calls should be replaced with proper error handling to prevent potential panics.

```diff
-            Response::Chunk(_) => unreachable!(),
-            Response::CompletionDone(_) => unreachable!(),
-            Response::CompletionModelError(_, _) => unreachable!(),
-            Response::CompletionChunk(_) => unreachable!(),
-            Response::ImageGeneration(_) => unreachable!(),
-            Response::Speech { .. } => unreachable!(),
-            Response::Raw { .. } => unreachable!(),
+            Response::Chunk(_) => {
+                tracing::warn!("Unexpected Chunk response in non-streaming chat context");
+                ChatCompletionResponder::InternalError("Unexpected response type".into())
+            }
+            Response::CompletionDone(_) => {
+                tracing::warn!("Unexpected CompletionDone response in chat context");
+                ChatCompletionResponder::InternalError("Unexpected response type".into())
+            }
+            Response::CompletionModelError(_, _) => {
+                tracing::warn!("Unexpected CompletionModelError response in chat context");
+                ChatCompletionResponder::InternalError("Unexpected response type".into())
+            }
+            Response::CompletionChunk(_) => {
+                tracing::warn!("Unexpected CompletionChunk response in chat context");
+                ChatCompletionResponder::InternalError("Unexpected response type".into())
+            }
+            Response::ImageGeneration(_) => {
+                tracing::warn!("Unexpected ImageGeneration response in chat context");
+                ChatCompletionResponder::InternalError("Unexpected response type".into())
+            }
+            Response::Speech { .. } => {
+                tracing::warn!("Unexpected Speech response in chat context");
+                ChatCompletionResponder::InternalError("Unexpected response type".into())
+            }
+            Response::Raw { .. } => {
+                tracing::warn!("Unexpected Raw response in chat context");
+                ChatCompletionResponder::InternalError("Unexpected response type".into())
+            }
```
♻️ Duplicate comments (2)
mistralrs-server-core/src/completions.rs (2)
145-151: ⚠️ Potential issue — Replace unreachable!() calls with proper error handling.

Similar to the past review comments, these `unreachable!()` calls could cause panics if the response variants become reachable in the future due to changes in the codebase.

```diff
-            Response::Done(_) => unreachable!(),
-            Response::CompletionDone(_) => unreachable!(),
-            Response::Chunk(_) => unreachable!(),
-            Response::ImageGeneration(_) => unreachable!(),
-            Response::ModelError(_, _) => unreachable!(),
-            Response::Speech { .. } => unreachable!(),
-            Response::Raw { .. } => unreachable!(),
+            Response::Done(_) => {
+                tracing::warn!("Unexpected Done response in completion streaming context");
+                self.done_state = DoneState::SendingDone;
+                Poll::Ready(Some(Ok(Event::default().data("Unexpected response type"))))
+            }
+            Response::CompletionDone(_) => {
+                tracing::warn!("Unexpected CompletionDone response in streaming context");
+                self.done_state = DoneState::SendingDone;
+                Poll::Ready(Some(Ok(Event::default().data("Unexpected response type"))))
+            }
+            Response::Chunk(_) => {
+                tracing::warn!("Unexpected Chunk response in completions context");
+                self.done_state = DoneState::SendingDone;
+                Poll::Ready(Some(Ok(Event::default().data("Unexpected response type"))))
+            }
+            Response::ImageGeneration(_) => {
+                tracing::warn!("Unexpected ImageGeneration response in completions context");
+                self.done_state = DoneState::SendingDone;
+                Poll::Ready(Some(Ok(Event::default().data("Unexpected response type"))))
+            }
+            Response::ModelError(_, _) => {
+                tracing::warn!("Unexpected ModelError response in completions context");
+                self.done_state = DoneState::SendingDone;
+                Poll::Ready(Some(Ok(Event::default().data("Unexpected response type"))))
+            }
+            Response::Speech { .. } => {
+                tracing::warn!("Unexpected Speech response in completions context");
+                self.done_state = DoneState::SendingDone;
+                Poll::Ready(Some(Ok(Event::default().data("Unexpected response type"))))
+            }
+            Response::Raw { .. } => {
+                tracing::warn!("Unexpected Raw response in completions context");
+                self.done_state = DoneState::SendingDone;
+                Poll::Ready(Some(Ok(Event::default().data("Unexpected response type"))))
+            }
```
330-337: ⚠️ Potential issue — Address the unreachable!() calls in the match_responses function.

This issue was already raised in past review comments and remains unaddressed. These calls could cause panics if the response variants become reachable.
🧹 Nitpick comments (2)
mistralrs-server-core/src/handler_core.rs (2)
11-16: Consider reducing the default channel buffer size or making it configurable.

A buffer size of 10,000 is quite large and could lead to significant memory usage when multiple streams are active. Consider using a smaller default (e.g., 100-1000), or make it configurable via an environment variable, similar to the keep-alive interval pattern used elsewhere in the codebase.
```diff
-pub const DEFAULT_CHANNEL_BUFFER_SIZE: usize = 10_000;
+pub const DEFAULT_CHANNEL_BUFFER_SIZE: usize = 1_000;
```

Alternatively, make it configurable:
```rust
use std::env;

pub fn get_channel_buffer_size() -> usize {
    env::var("RESPONSE_CHANNEL_BUFFER_SIZE")
        .map(|val| {
            val.parse::<usize>().unwrap_or_else(|e| {
                tracing::warn!("Failed to parse RESPONSE_CHANNEL_BUFFER_SIZE: {}. Using default.", e);
                DEFAULT_CHANNEL_BUFFER_SIZE
            })
        })
        .unwrap_or(DEFAULT_CHANNEL_BUFFER_SIZE)
}
```
101-118: Consider using closures instead of function pointers for more flexibility.

The current implementation uses function pointers, which limits flexibility. Consider accepting closures to allow capturing variables from the surrounding scope.
```diff
-pub(crate) async fn base_process_non_streaming_response<R>(
+pub(crate) async fn base_process_non_streaming_response<R, F, E>(
     rx: &mut Receiver<Response>,
     state: SharedMistralRsState,
-    match_fn: fn(SharedMistralRsState, Response) -> R,
-    error_handler: fn(
-        SharedMistralRsState,
-        Box<dyn std::error::Error + Send + Sync + 'static>,
-    ) -> R,
+    match_fn: F,
+    error_handler: E,
 ) -> R
+where
+    F: FnOnce(SharedMistralRsState, Response) -> R,
+    E: FnOnce(SharedMistralRsState, Box<dyn std::error::Error + Send + Sync + 'static>) -> R,
 {
```
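The practical difference is that a plain `fn` pointer cannot capture its environment, while an `FnOnce` bound lets call sites close over request-scoped data. A self-contained toy illustrating the pattern (the names here are illustrative, not from the PR):

```rust
// fn-pointer version: only zero-capture functions coerce to `fn(i32) -> String`.
fn with_fn_pointer(handler: fn(i32) -> String) -> String {
    handler(7)
}

// Generic version: closures that capture surrounding state are accepted too.
fn with_closure<F: FnOnce(i32) -> String>(handler: F) -> String {
    handler(7)
}

fn main() {
    // A non-capturing closure works with both versions.
    let plain = with_fn_pointer(|x| format!("got {x}"));

    // A capturing closure only compiles with the generic version:
    let request_id = 42;
    // with_fn_pointer(move |x| format!("req {request_id}: {x}")); // error: expected fn pointer
    let captured = with_closure(move |x| format!("req {request_id}: got {x}"));

    println!("{plain} / {captured}");
}
```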
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- mistralrs-server-core/src/chat_completion.rs (10 hunks)
- mistralrs-server-core/src/completion_core.rs (1 hunks)
- mistralrs-server-core/src/completions.rs (6 hunks)
- mistralrs-server-core/src/handler_core.rs (1 hunks)
- mistralrs-server-core/src/image_generation.rs (3 hunks)
- mistralrs-server-core/src/lib.rs (5 hunks)
- mistralrs-server-core/src/speech_generation.rs (6 hunks)
- mistralrs-server-core/src/streaming.rs (1 hunks)
- mistralrs-server-core/src/types.rs (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- mistralrs-server-core/src/streaming.rs
🚧 Files skipped from review as they are similar to previous changes (4)
- mistralrs-server-core/src/types.rs
- mistralrs-server-core/src/lib.rs
- mistralrs-server-core/src/speech_generation.rs
- mistralrs-server-core/src/image_generation.rs
🧰 Additional context used
🧬 Code Graph Analysis (1)
mistralrs-server-core/src/completions.rs (5)
mistralrs-server-core/src/completion_core.rs (3)
- convert_stop_tokens (38-44)
- get_dry_sampling_params (47-65)
- handle_completion_error (27-34)

mistralrs-server-core/src/handler_core.rs (5)
- base_process_non_streaming_response (102-118)
- create_response_channel (82-87)
- send_model_request (90-99)
- new (36-38)
- new (73-78)

mistralrs-server-core/src/streaming.rs (2)
- base_create_streamer (45-62)
- get_keep_alive_interval (65-74)

mistralrs-core/src/lib.rs (3)
- maybe_log_response (682-694)
- maybe_log_request (669-680)
- maybe_log_error (696-707)

mistralrs-server-core/src/chat_completion.rs (4)
- handle_error (513-518)
- create_streamer (521-532)
- process_non_streaming_response (535-540)
- match_responses (543-567)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Docs
- GitHub Check: Clippy
- GitHub Check: Test Suite (windows-latest, stable)
- GitHub Check: Check (macOS-latest, stable)
- GitHub Check: Check (windows-latest, stable)
- GitHub Check: Test Suite (ubuntu-latest, stable)
- GitHub Check: Test Suite (macOS-latest, stable)
- GitHub Check: Check (ubuntu-latest, stable)
🔇 Additional comments (9)
mistralrs-server-core/src/handler_core.rs (2)
81-87: LGTM! The channel creation function is well-implemented, with a proper fallback to the default buffer size.

89-99: Well-implemented request sending function. Good use of context for meaningful error messages and proper error propagation.
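For readers without the diff open, here is a minimal sketch of the shape these two helpers take. This is an assumption-laden reconstruction from the line references above, not the committed code: the stub `Response`, `Request`, and `SharedMistralRsState` types stand in for the crate's real ones.

```rust
use anyhow::Context;
use tokio::sync::mpsc::{channel, Receiver, Sender};

// Stubs standing in for the crate's real types so the sketch compiles alone.
pub enum Response { Done(String) }
pub struct Request { pub prompt: String }
pub struct SharedMistralRsState { pub sender: Sender<Request> }

pub const DEFAULT_CHANNEL_BUFFER_SIZE: usize = 10_000;

/// Bounded MPSC channel the engine uses to hand responses back to the
/// HTTP handler; callers may override the buffer size.
pub fn create_response_channel(
    buffer_size: Option<usize>,
) -> (Sender<Response>, Receiver<Response>) {
    channel(buffer_size.unwrap_or(DEFAULT_CHANNEL_BUFFER_SIZE))
}

/// Forward a model request to the engine, attaching context so a closed
/// channel surfaces as a meaningful error rather than a bare `SendError`.
pub async fn send_request(state: &SharedMistralRsState, request: Request) -> anyhow::Result<()> {
    state
        .sender
        .send(request)
        .await
        .context("Failed to send request to the model pipeline")
}
```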
mistralrs-server-core/src/completion_core.rs (3)
11-24: Well-designed generic responder enum. The enum provides a clean abstraction over the different completion response types, with good documentation.

38-44: Clean stop token conversion implementation. The function correctly handles both single and multiple stop sequences.

47-65: Well-structured dry sampling parameters helper. Good use of pattern matching and proper error propagation.
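To make the shape of these helpers concrete, here is a hedged sketch. The `StopTokens` variants mirror the OpenAI-style request type, but the exact names (the `BaseCompletionResponder` variants, the internal `Seqs` representation) are assumptions based on the review context, not a copy of the committed code:

```rust
// OpenAI-style `stop` field: either one sequence or a list of them.
pub enum StopTokens {
    Multi(Vec<String>),
    Single(String),
}

// Internal representation used when building the model request (assumed).
pub enum InternalStopTokens {
    Seqs(Vec<String>),
}

/// Generic responder shared by the completion-style endpoints:
/// `R` is the non-streaming response body, `S` is the SSE streamer type.
pub enum BaseCompletionResponder<R, S> {
    Sse(S),
    Json(R),
    ModelError(String, R),
    InternalError(Box<dyn std::error::Error + Send + Sync + 'static>),
    ValidationError(Box<dyn std::error::Error + Send + Sync + 'static>),
}

/// Normalize the `stop` field: a single string and a list of strings
/// both collapse into one `Seqs` variant.
pub fn convert_stop_tokens(stop_seqs: Option<StopTokens>) -> Option<InternalStopTokens> {
    match stop_seqs {
        Some(StopTokens::Multi(seqs)) => Some(InternalStopTokens::Seqs(seqs)),
        Some(StopTokens::Single(seq)) => Some(InternalStopTokens::Seqs(vec![seq])),
        None => None,
    }
}
```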
mistralrs-server-core/src/completions.rs (2)
187-251: Excellent refactoring of the parse_request function. The use of the helper functions `convert_stop_tokens` and `get_dry_sampling_params` improves code readability and maintainability.
261-281: Clean and well-structured route handler. Excellent use of shared utilities for channel creation, request sending, and response processing. The handler is now more maintainable and consistent with other endpoints.
mistralrs-server-core/src/chat_completion.rs (2)
197-480: Good refactoring of parse_request with shared utilities. The use of the `convert_stop_tokens` and `get_dry_sampling_params` helper functions improves code consistency across modules.
490-510: Consistent and clean route handler implementation. Excellent consistency with the completions handler, using the same shared utilities and patterns.
```rust
pub(crate) fn handle_completion_error<R, S>(
    state: SharedMistralRsState,
    e: Box<dyn std::error::Error + Send + Sync + 'static>,
) -> BaseCompletionResponder<R, S> {
    let error = anyhow::Error::msg(e.to_string());
    MistralRs::maybe_log_error(state, &*error);
    BaseCompletionResponder::InternalError(error.into())
}
```
🛠️ Refactor suggestion
Preserve the original error instead of converting to string.
Converting the error to a string loses valuable debugging information like the error type and stack trace.
```diff
 pub(crate) fn handle_completion_error<R, S>(
     state: SharedMistralRsState,
     e: Box<dyn std::error::Error + Send + Sync + 'static>,
 ) -> BaseCompletionResponder<R, S> {
-    let error = anyhow::Error::msg(e.to_string());
-    MistralRs::maybe_log_error(state, &*error);
-    BaseCompletionResponder::InternalError(error.into())
+    MistralRs::maybe_log_error(state, &*e);
+    BaseCompletionResponder::InternalError(e)
 }
```
🤖 Prompt for AI Agents
In mistralrs-server-core/src/completion_core.rs around lines 27 to 34, the
current code converts the error to a string which loses the original error type
and stack trace. Modify the function to preserve the original error by wrapping
or converting it into an anyhow::Error without calling to_string(), so the full
error context is retained for logging and returning.
Actionable comments posted: 1
🔭 Outside diff range comments (3)
mistralrs-server-core/src/speech_generation.rs (1)
179-186: 🛠️ Refactor suggestion: Over-allocates the output buffer (`len * size_of::<i64>()`).

`Vec::with_capacity(samples.len() * size_of::<i64>())` reserves 8 bytes per sample, but the code subsequently pushes little-endian i16 values (2 bytes each). This inflates peak memory by 4× for large audio responses.

```diff
-let mut buf = Vec::with_capacity(samples.len() * std::mem::size_of::<i64>());
+let mut buf = Vec::with_capacity(samples.len() * std::mem::size_of::<i16>());
```

mistralrs-server-core/src/completions.rs (1)
80-156: ⚠️ Potential issue: A trait impl for a type alias will not compile.
You declare
```rust
pub type CompletionStreamer = BaseStreamer<…>;
impl futures::Stream for CompletionStreamer { … }
```

Rust forbids implementing traits for type aliases (E0643).
Either:
- Implement `Stream` for the concrete `BaseStreamer<…>` (generic impl), or
- Wrap the alias in a new-type struct and implement `Stream` for that struct.

Minimal fix:
```diff
-pub type CompletionStreamer = BaseStreamer<CompletionChunkResponse,
-    CompletionOnChunkCallback, CompletionOnDoneCallback>;
-
-impl futures::Stream for CompletionStreamer { … }
+pub struct CompletionStreamer(
+    pub BaseStreamer<
+        CompletionChunkResponse,
+        CompletionOnChunkCallback,
+        CompletionOnDoneCallback,
+    >,
+);
+
+impl futures::Stream for CompletionStreamer {
+    type Item = Result<Event, axum::Error>;
+    fn poll_next(
+        mut self: Pin<&mut Self>,
+        cx: &mut Context<'_>,
+    ) -> Poll<Option<Self::Item>> {
+        BaseStreamer::poll_next(Pin::new(&mut self.0), cx)
+    }
+}
```

Without this change the crate will fail to build. The same issue exists in chat_completion.rs.

mistralrs-server-core/src/chat_completion.rs (1)
82-164: ⚠️ Potential issue: Same type-alias / trait-impl compilation error as in completions.rs.

`ChatCompletionStreamer` is a `type` alias followed by an inherent `impl` and a `Stream` impl – both illegal for aliases. Apply the same new-type or generic-impl fix described for `CompletionStreamer`. Compilation will otherwise fail.
♻️ Duplicate comments (2)
mistralrs-server-core/src/completions.rs (1)
330-337: Repeated `unreachable!()` paths: prefer graceful handling.

This was flagged in an earlier review and is still present; unexpected variants will panic the server. Replace them with logged internal errors, as suggested previously.

mistralrs-server-core/src/chat_completion.rs (1)
559-566: Unhandled variants guarded by `unreachable!()` – an identical concern was noted previously; please replace them with logged internal errors to avoid runtime panics.
🧹 Nitpick comments (1)
mistralrs-server/src/mcp_server.rs (1)
11-13: Minor: stale module coupling.
`parse_request` is still imported from `chat_completion`, while `create_response_channel`/`send_request` moved to `handler_core`. Consider relocating `parse_request` next, so all low-level helpers reside in one module, reducing cross-crate coupling.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- mistralrs-server-core/src/chat_completion.rs (10 hunks)
- mistralrs-server-core/src/completions.rs (6 hunks)
- mistralrs-server-core/src/handler_core.rs (1 hunks)
- mistralrs-server-core/src/image_generation.rs (3 hunks)
- mistralrs-server-core/src/speech_generation.rs (6 hunks)
- mistralrs-server/src/mcp_server.rs (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- mistralrs-server-core/src/image_generation.rs
🧰 Additional context used
🧬 Code Graph Analysis (2)
mistralrs-server-core/src/speech_generation.rs (4)
- mistralrs-server-core/src/handler_core.rs (2)
  - create_response_channel (82-87)
  - send_request (90-99)
- mistralrs-server-core/src/image_generation.rs (3)
  - handle_error (111-118)
  - process_non_streaming_response (121-126)
  - match_responses (129-156)
- mistralrs-server-core/src/completions.rs (3)
  - handle_error (284-289)
  - process_non_streaming_response (306-311)
  - match_responses (314-338)
- mistralrs-core/src/lib.rs (1)
  - maybe_log_error (696-707)

mistralrs-server-core/src/chat_completion.rs (5)
- mistralrs-server-core/src/completion_core.rs (3)
  - convert_stop_tokens (38-44)
  - get_dry_sampling_params (47-65)
  - handle_completion_error (27-34)
- mistralrs-server-core/src/handler_core.rs (5)
  - base_process_non_streaming_response (102-118)
  - create_response_channel (82-87)
  - send_request (90-99)
  - new (36-38)
  - new (73-78)
- mistralrs-server-core/src/streaming.rs (2)
  - base_create_streamer (45-62)
  - get_keep_alive_interval (65-74)
- mistralrs-server-core/src/util.rs (2)
  - parse_audio_url (85-122)
  - parse_image_url (42-82)
- mistralrs-server-core/src/completions.rs (4)
  - handle_error (284-289)
  - create_streamer (292-303)
  - process_non_streaming_response (306-311)
  - match_responses (314-338)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Docs
- GitHub Check: Check (macOS-latest, stable)
- GitHub Check: Clippy
- GitHub Check: Test Suite (ubuntu-latest, stable)
- GitHub Check: Check (windows-latest, stable)
- GitHub Check: Check (ubuntu-latest, stable)
- GitHub Check: Test Suite (windows-latest, stable)
- GitHub Check: Test Suite (macOS-latest, stable)
```rust
/// Default buffer size for the response channel used in streaming operations.
///
/// This constant defines the maximum number of response messages that can be buffered
/// in the channel before backpressure is applied. A larger buffer reduces the likelihood
/// of blocking but uses more memory.
pub const DEFAULT_CHANNEL_BUFFER_SIZE: usize = 10_000;
```
🛠️ Refactor suggestion
Channel buffer size is excessive and hard-coded – make it configurable.
DEFAULT_CHANNEL_BUFFER_SIZE is set to 10_000, which allocates a bounded channel large enough to hold ten thousand full Response objects.
For high-volume streaming this may silently consume tens or hundreds of MB of memory per request. Either expose the size through configuration (env, CLI or State::config()), or document why this value is safe for all workloads.
```diff
-pub const DEFAULT_CHANNEL_BUFFER_SIZE: usize = 10_000;
+/// Default MPSC buffer size. Override via the
+/// `MISTRALRS_CHANNEL_BUFFER_SIZE` env-var if needed.
+pub const DEFAULT_CHANNEL_BUFFER_SIZE: usize =
+    option_env!("MISTRALRS_CHANNEL_BUFFER_SIZE")
+        .and_then(|v| v.parse().ok())
+        .unwrap_or(1_024);
```

Committable suggestion skipped: line range outside the PR's diff.
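One wrinkle with the const-based suggestion above: `str::parse` is not a `const fn`, so the `option_env!(...).and_then(|v| v.parse().ok())` chain cannot initialize a `const` as written. A lazily-initialized static is a compile-checked alternative; a minimal sketch, reusing the `MISTRALRS_CHANNEL_BUFFER_SIZE` name from the suggestion (requires Rust 1.80+ for `LazyLock`):

```rust
use std::sync::LazyLock;

/// Fallback when the env-var is unset or unparsable.
pub const DEFAULT_CHANNEL_BUFFER_SIZE: usize = 10_000;

/// Buffer size resolved once, on first use, from the environment.
pub static CHANNEL_BUFFER_SIZE: LazyLock<usize> = LazyLock::new(|| {
    std::env::var("MISTRALRS_CHANNEL_BUFFER_SIZE")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(DEFAULT_CHANNEL_BUFFER_SIZE)
});

fn main() {
    // Dereference the static wherever the channel is created, e.g.
    // `tokio::sync::mpsc::channel(*CHANNEL_BUFFER_SIZE)`.
    println!("buffer size: {}", *CHANNEL_BUFFER_SIZE);
}
```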
🤖 Prompt for AI Agents
In mistralrs-server-core/src/handler_core.rs around lines 11 to 17, the constant
DEFAULT_CHANNEL_BUFFER_SIZE is hard-coded to 10,000, which can cause excessive
memory usage. Modify the code to make this buffer size configurable via
environment variables, CLI arguments, or the application's configuration state.
Replace the hard-coded value with a configurable parameter and ensure the
default is set to a reasonable value if no configuration is provided.
Rebased off the MCP work and I think everything is still working as intended:

```shell
$ cargo run --features metal '--' --port 1234 --mcp-port 4321 --isq 8 --paged-attn plain -m meta-llama/Llama-3.2-1B-Instruct
Finished `dev` profile [optimized + debuginfo] target(s) in 0.55s
Running `mistral.rs/target/debug/mistralrs-server --port 1234 --mcp-port 4321 --isq 8 --paged-attn plain -m meta-llama/Llama-3.2-1B-Instruct`
2025-06-10T22:39:19.758760Z INFO mistralrs_server_core::mistralrs_for_server_builder: avx: false, neon: true, simd128: false, f16c: false
2025-06-10T22:39:19.758880Z INFO mistralrs_server_core::mistralrs_for_server_builder: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-06-10T22:39:19.758930Z INFO mistralrs_server_core::mistralrs_for_server_builder: Model kind is: normal (no adapters)
2025-06-10T22:39:19.759071Z INFO hf_hub: Using token file found ".cache/huggingface/token"
2025-06-10T22:39:19.759962Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-06-10T22:39:19.760364Z INFO mistralrs_core::pipeline::normal: Loading `config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-06-10T22:39:20.148423Z INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model.safetensors"]
2025-06-10T22:39:20.256741Z INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-06-10T22:39:20.572139Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `meta-llama/Llama-3.2-1B-Instruct`
2025-06-10T22:39:20.684308Z INFO mistralrs_quant::utils::log: Automatic loader type determined to be `llama`
2025-06-10T22:39:20.684405Z INFO mistralrs_core::pipeline::normal: Prompt chunk size is 1024.
2025-06-10T22:39:20.694059Z INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-06-10T22:39:20.704524Z INFO mistralrs_core::pipeline::loaders::auto_device_map: Using automatic device mapping parameters: text[max_seq_len: 4096, max_batch_size: 1].
2025-06-10T22:39:20.704581Z INFO mistralrs_quant::utils::log: Model has 16 repeating layers.
2025-06-10T22:39:20.704586Z INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-06-10T22:39:20.704658Z INFO mistralrs_quant::utils::log: Layers 0-15: metal[4294968389] (36 GB)
2025-06-10T22:39:20.706215Z INFO mistralrs_core::utils::normal: DType selected is BF16.
2025-06-10T22:39:20.706239Z INFO mistralrs_core::pipeline::normal: Model config: Config { hidden_act: Silu, hidden_size: 2048, intermediate_size: 8192, vocab_size: 128256, num_hidden_layers: 16, num_attention_heads: 32, num_key_value_heads: 8, rms_norm_eps: 1e-5, rope_theta: 500000.0, max_position_embeddings: 131072, rope_scaling: Some(Llama3RopeConfig { factor: 32.0, low_freq_factor: Some(1.0), high_freq_factor: Some(4.0), original_max_position_embeddings: Some(8192), rope_type: Llama3 }), quantization_config: None, tie_word_embeddings: true }
2025-06-10T22:39:20.707574Z INFO mistralrs_core::pipeline::normal: Applying ISQ to Some(AFQ8)
2025-06-10T22:39:20.707874Z INFO mistralrs_core::utils::varbuilder_utils: Loading model using mmap strategy.
2025-06-10T22:39:24.612934Z INFO mistralrs_core::paged_attention: Allocating 128 MB for PagedAttention KV cache per GPU
2025-06-10T22:39:24.612946Z INFO mistralrs_core::paged_attention: Using PagedAttention with block size 32 and 128 GPU blocks: available context length is 4096 tokens
2025-06-10T22:39:24.918844Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<|begin_of_text|>", eos_toks = "<|eot_id|>", "<|end_of_text|>", "<|eom_id|>", unk_tok = `None`
2025-06-10T22:39:24.927999Z INFO mistralrs_server_core::mistralrs_for_server_builder: Model loaded.
2025-06-10T22:39:24.928068Z INFO mistralrs_core: Pipeline input modalities are [📝 Text]
2025-06-10T22:39:24.928103Z INFO mistralrs_core: Pipeline output modalities are [📝 Text]
2025-06-10T22:39:24.928181Z INFO mistralrs_core: Beginning dummy run.
2025-06-10T22:39:24.928700Z INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.
2025-06-10T22:39:25.008699Z INFO mistralrs_core: Dummy run completed in 0.080513s.
2025-06-10T22:39:25.008720Z INFO mistralrs_server: MCP server listening on http://0.0.0.0:4321/mcp.
2025-06-10T22:39:25.008722Z INFO mistralrs_server: MCP protocol version is 2025-03-26.
2025-06-10T22:39:25.010932Z  INFO mistralrs_server: OpenAI-compatible server listening on http://0.0.0.0:1234.
```

```shell
curl -X POST http://localhost:4321/mcp \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"id": 3,
"method": "tools/call",
"params": {
"name": "chat",
"arguments": {
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Hello." }
],
"maxTokens": 50,
"temperature": 0.7
}
}
  }'
```

```json
{
"jsonrpc": "2.0",
"id": 3,
"result": {
"content": [{ "text": "How can I assist you today?", "type": "text" }]
}
}
```

```
2025-06-10T22:39:29.934289Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 0.80, Prefix cache hitrate 0.00%, 0 running, 0 waiting
2025-06-10T22:42:45.054440Z INFO mistralrs_core::engine::logger: Throughput (T/s) 9.40, Prefix cache hitrate 0.00%, 0 running, 0 waiting
2025-06-10T22:45:15.152949Z INFO mistralrs_core::engine::logger: Throughput (T/s) 22.60, Prefix cache hitrate 0.00%, 0 running, 0 waiting
2025-06-10T22:45:35.165804Z  INFO mistralrs_core::engine::logger: Throughput (T/s) 10.80, Prefix cache hitrate 0.00%,
```
@EricLBuehler no rush at all (keep cranking out that awesome MCP work!), but when you have a moment, what do you think of this PR? The intent is to unify the handlers approach(es) and make them reusable (like chat completions). Also, if you've got more server stuff coming, I can hold off on this / wait to rebase when there is a lull.

Overall, I think it's pretty solid, but I ended up calling a lot of things

The actual mistralrs-server args / flags / API / end user stuff remains unchanged (this is only a concern for anything that is directly using server-core).
Looks amazing, thank you!
Sorry, had little bandwidth over last week!
* Start working on consolidating completion and chat_completion underlying implementations
* Move response channel to util mod for now (since it's used with streaming and non streaming)
* More work on consolidating completions and chat completions
* More WIP consolidation of server core handlers
* More WIP consolidation of server core handlers
* More WIP consolidation of server core handlers
* Update docs and restrict completion core visibility
* CodeRabbit feedback: remove logprobs warn from route handler since parse request also checks this
* Use consistent var name for completions mod
* Make route handler modules public API consistent (same fn names, etc.) and provide proxy fn that wrap core fns so core mod doesn't have to be pub; make lib.rs example compile checked and update example
* Code formatting
* Typo
* Sync fork
* Sync fork
* Docs example fix
protected]> Co-authored-by: Sbargaoui <[email protected]> Co-authored-by: Gaétan Lepage <[email protected]> Co-authored-by: Ammar Elsabe <[email protected]> Co-authored-by: luke <[email protected]> Co-authored-by: AlpineVibrations <[email protected]> Co-authored-by: Michael Tissen <[email protected]>
While the handlers are fundamentally different, I think parts of the underlying route handler implementations can be consolidated, and the API surface can be made consistent (currently only chat completions gets the special treatment that allows it to be extended / reused).
The lowest-hanging fruit is anything that is completely identical across all of the handlers, but I think even pieces that are close but not quite the same can be consolidated through generics; see the sketch below.
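To make the generics idea concrete, here is a minimal sketch of the pattern, assuming a hypothetical `Endpoint` trait and toy request/response types (none of these names are the actual mistral.rs API, and the real handlers additionally deal with SSE streaming, channels, and OpenAI types): the per-endpoint differences live behind a trait, one generic core does the shared dispatch and error mapping, and thin public proxy functions per route keep the core module private.

```rust
use std::fmt::Display;

// Hypothetical trait: captures only what actually differs between
// endpoints, i.e. how a parsed request becomes a model call and what
// kind of response it produces.
trait Endpoint {
    type Request;
    type Response: Display;

    fn run(req: Self::Request) -> Result<Self::Response, String>;
}

struct ChatCompletions;
struct Completions;

impl Endpoint for ChatCompletions {
    type Request = Vec<String>; // chat messages
    type Response = String;

    fn run(req: Self::Request) -> Result<Self::Response, String> {
        Ok(format!("chat reply to {} message(s)", req.len()))
    }
}

impl Endpoint for Completions {
    type Request = String; // raw prompt
    type Response = String;

    fn run(req: Self::Request) -> Result<Self::Response, String> {
        Ok(format!("completion for a {}-byte prompt", req.len()))
    }
}

// The shared core: dispatch and error mapping are written once here
// instead of being copy-pasted into every route handler.
fn handle_core<E: Endpoint>(req: E::Request) -> String {
    match E::run(req) {
        Ok(resp) => format!("200 OK: {resp}"),
        Err(e) => format!("500 Internal Server Error: {e}"),
    }
}

// Thin public proxies give every route module the same-shaped API
// without forcing the core module itself to be `pub`.
pub fn chat_completions(messages: Vec<String>) -> String {
    handle_core::<ChatCompletions>(messages)
}

pub fn completions(prompt: String) -> String {
    handle_core::<Completions>(prompt)
}

fn main() {
    println!("{}", chat_completions(vec!["hi".into()]));
    println!("{}", completions("Tell me a story".into()));
}
```

The same shape should extend to streaming: the trait could grow an associated chunk type plus an on-chunk callback, which is roughly what a shared SSE streamer would need in order to serve all of the routes.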
Pre Merge Tasks
Summary by CodeRabbit
Refactor
New Features