276 changes: 274 additions & 2 deletions Cargo.lock

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions Cargo.toml
@@ -160,6 +160,7 @@ mime_guess = "2.0.5"
include_dir = "0.7.4"
http = "1.3.1"
hyper = "1.6.0"
rust-mcp-sdk = { version = "0.4.2", default-features = false, features = ["server", "hyper-server", "2025_03_26"] }

💡 Verification agent

🧩 Analysis chain

Consider documenting the purpose of the date-based feature flag.

The feature flag "2025_03_26" appears to be date-based, which could make future maintenance challenging. Consider adding a comment explaining what this feature represents, or check whether a more semantic feature name is available.


🌐 Web query:

What does the "2025_03_26" feature flag mean in rust-mcp-sdk version 0.4.2?

💡 Result:

The "2025_03_26" feature flag in the rust-mcp-sdk version 0.4.2 likely corresponds to a specific version of the Model Context Protocol (MCP) schema. In the Rust MCP ecosystem, feature flags are often used to select different versions of the MCP schema, allowing developers to work with various schema versions as needed.

For instance, the rust-mcp-schema crate provides multiple schema versions, including "2024_11_05" and "draft". To use a specific schema version, you can enable the corresponding feature in your Cargo.toml file. While the "2025_03_26" version isn't explicitly listed in the available versions, it's possible that this feature flag is intended to select that specific schema version.

To utilize the "2025_03_26" schema version, you would typically add the following to your Cargo.toml:

```toml
[dependencies]
rust-mcp-schema = { version = "0.1.0", features = ["2025_03_26"] }
```

This configuration instructs Cargo to compile the rust-mcp-schema crate with the "2025_03_26" feature enabled, thereby selecting that specific schema version.

It's important to note that the availability of the "2025_03_26" schema version depends on its inclusion in the rust-mcp-schema crate. If this version isn't listed among the available features, it may not be supported or released yet. In such cases, you might need to consult the official MCP documentation or the maintainers of the rust-mcp-schema crate for further information.

For more details on the rust-mcp-schema crate and its available schema versions, you can visit the GitHub repository: (github.com)


Document the MCP schema version feature flag

The feature flag "2025_03_26" in rust-mcp-sdk = { version = "0.4.2", … } selects the Model Context Protocol schema v2025-03-26. To improve maintainability:

  • Add an inline comment in Cargo.toml explaining that this flag pins the MCP schema to version 2025-03-26.
  • Confirm that the rust-mcp-schema crate exposes a 2025_03_26 feature; if it doesn’t, coordinate with its maintainers or choose an available schema version.
🤖 Prompt for AI Agents
In Cargo.toml at line 163, the feature flag "2025_03_26" used in the rust-mcp-sdk
dependency selects the MCP schema version 2025-03-26 but lacks documentation.
Add an inline comment next to this feature flag explaining that it pins the
Model Context Protocol schema to version 2025-03-26 for clarity and
maintainability. Also verify that the rust-mcp-schema crate supports this
feature flag; if not, coordinate with its maintainers or select a supported
schema version.
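
For illustration, a minimal sketch of what that inline comment could look like; the mapping of `2025_03_26` to the 2025-03-26 MCP schema revision is the reviewer's assumption and should be confirmed against the features actually exposed by `rust-mcp-schema`:

```toml
# "2025_03_26" pins the Model Context Protocol schema to the 2025-03-26
# revision re-exported by rust-mcp-sdk; bump this feature deliberately when
# moving to a newer MCP spec.
rust-mcp-sdk = { version = "0.4.2", default-features = false, features = ["server", "hyper-server", "2025_03_26"] }
```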

bindgen_cuda = { git = "https://github.com/guoqingbao/bindgen_cuda.git", version = "0.1.6" }
rubato = "0.16.2"
rustfft = "6.3.0"
11 changes: 11 additions & 0 deletions README.md
@@ -127,6 +127,7 @@ Please submit requests for new models [here](https://github.com/EricLBuehler/mis
- [Rust API](https://ericlbuehler.github.io/mistral.rs/mistralrs/) & [Python API](mistralrs-pyo3/API.md)
- [Automatic device mapping](docs/DEVICE_MAPPING.md) (multi-GPU, CPU)
- [Chat templates](docs/CHAT_TOK.md) & tokenizer auto-detection
- [MCP protocol](docs/MCP.md) for structured, realtime tool calls

2. **Performance**
- CPU acceleration (MKL, AVX, [NEON](docs/DEVICE_MAPPING.md#arm-neon), [Accelerate](docs/DEVICE_MAPPING.md#apple-accelerate))
@@ -184,6 +185,16 @@ OpenAI API compatible API server
- [Example](examples/server/chat.py)
- [Use or extend the server in other axum projects](https://ericlbuehler.github.io/mistral.rs/mistralrs_server_core/)

### MCP Protocol

Serve the same models over the open [MCP](docs/MCP.md) (Model Context Protocol) in parallel to the HTTP API:

```bash
./mistralrs-server --mcp-port 4321 plain -m Qwen/Qwen3-4B
```

See the [docs](docs/MCP.md) for feature flags, examples and limitations.


### Llama Index integration

3 changes: 3 additions & 0 deletions docs/HTTP.md
@@ -4,6 +4,9 @@ Mistral.rs provides a lightweight OpenAI API compatible HTTP server based on [ax

The API consists of the following endpoints. They can be viewed in your browser interactively by going to `http://localhost:<port>/docs`.

> ℹ️ Besides the HTTP endpoints described below, `mistralrs-server` can also expose the same functionality via the **MCP protocol**.
> Enable it with `--mcp-port <port>` and see [MCP.md](MCP.md) for details.

## Additional object keys

To support additional features, we have extended the completion and chat completion request objects. Both have the same keys added:
100 changes: 100 additions & 0 deletions docs/MCP.md
@@ -0,0 +1,100 @@
# MCP protocol support

`mistralrs-server` can serve **MCP (Model Context Protocol)** traffic next to the regular OpenAI-compatible HTTP interface.
MCP is an open, tool-based protocol that lets clients interact with models through structured *tool calls* instead of free-form HTTP routes.
Under the hood the server uses [`rust-mcp-sdk`](https://crates.io/crates/rust-mcp-sdk) and exposes a single tool called **`chat`** that mirrors the behaviour of the `/v1/chat/completions` endpoint.

---

## 1. Building

Support for MCP is compiled in by default because the workspace enables the `server` and `hyper-server` features of `rust-mcp-sdk`.
When compiling the `mistralrs-server` crate on its own, outside the workspace, enable the `mcp-server` Cargo feature manually:

```bash
cargo build -p mistralrs-server --release --features "mcp-server"
```

## 2. Running

Start the normal HTTP server and add the `--mcp-port` flag to spin up an MCP server on a separate port:

```bash
# --port:     OpenAI-compatible HTTP API
# --mcp-port: MCP protocol endpoint (SSE over HTTP)
./target/release/mistralrs-server \
    --port 1234 \
    --mcp-port 4321 \
    plain -m mistralai/Mistral-7B-Instruct-v0.3
```

*`--mcp-port` is independent of `--port` – you can run the HTTP and MCP servers on entirely separate ports, or omit `--port` when you only need MCP.*

The server prints an extra line such as

```
MCP - listening on http://0.0.0.0:4321
```

## 3. Capabilities announced to clients

At start-up the MCP handler advertises the following `InitializeResult` (abridged):

```jsonc
{
"server_info": { "name": "mistralrs", "version": "<crate-version>" },
"protocol_version": "2025-03-26", // latest spec version from rust-mcp-sdk
"instructions": "use tool 'chat'",
"capabilities": {
"tools": {}
}
}
```

Only one tool is currently exposed:

| tool | description |
|------|------------------------------------------------------|
| `chat` | Wraps the OpenAI `/v1/chat/completions` endpoint. |

## 4. Calling the `chat` tool

Clients send a [`CallToolRequest`](https://docs.rs/rust-mcp-schema/latest/rust_mcp_schema/struct.CallToolRequest.html) event where `params.name` is `"chat"` and `params.arguments` contains a standard MCP [`CreateMessageRequest`](https://docs.rs/rust-mcp-schema/latest/rust_mcp_schema/struct.CreateMessageRequest.html).

Example request (sent as SSE `POST /mcp/stream` or via the convenience helpers in `rust-mcp-sdk`):

```jsonc
{
"kind": "callToolRequest",
"id": "123",
"params": {
"name": "chat",
"arguments": {
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"messages": [
{ "role": "user", "content": "Explain Rust ownership." }
]
}
}
}
```

The response is a `CallToolResult` event whose `content` array contains a single `TextContent` item with the assistant response.

```jsonc
{
"kind": "callToolResult",
"id": "123",
"content": [
{ "type": "text", "text": "Rust’s ownership system ..." }
]
}
```

Error cases are mapped to `CallToolError` with `is_error = true`.
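
For comparison, a hypothetical error payload could look like the sketch below; the exact field names depend on how `rust-mcp-sdk` serializes `CallToolError`, so treat this as illustrative rather than a wire-format guarantee:

```jsonc
{
  "kind": "callToolResult",
  "id": "123",
  "is_error": true,
  "content": [
    { "type": "text", "text": "ModelError: <description from the engine>" }
  ]
}
```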

## 5. Limitations & future work

- Only synchronous, single-shot requests are supported right now.
- Streaming responses (`partialCallToolResult`) are not yet implemented.
- No authentication layer is provided – run the MCP port behind a reverse proxy if you need auth.

Contributions to extend MCP coverage (streaming, more tools, auth hooks) are welcome!
1 change: 1 addition & 0 deletions docs/README.md
@@ -38,6 +38,7 @@
- [Sampling](SAMPLING.md)
- [TOML selector](TOML_SELECTOR.md)
- [Tool calling](TOOL_CALLING.md)
- [MCP protocol](MCP.md)

## Cross-device inference
- [Device mapping](DEVICE_MAPPING.md)
3 changes: 3 additions & 0 deletions mistralrs-server/Cargo.toml
@@ -29,6 +29,8 @@ serde.workspace = true
serde_json.workspace = true
tokio.workspace = true
tracing.workspace = true
rust-mcp-sdk.workspace = true
async-trait.workspace = true

[features]
cuda = ["mistralrs-core/cuda", "mistralrs-server-core/cuda"]
@@ -43,3 +45,4 @@ accelerate = ["mistralrs-core/accelerate", "mistralrs-server-core/accelerate"]
mkl = ["mistralrs-core/mkl", "mistralrs-server-core/mkl"]
nccl = ["mistralrs-core/nccl", "mistralrs-server-core/nccl"]
ring = ["mistralrs-core/ring", "mistralrs-server-core/ring"]
mcp-server = ["rust-mcp-sdk/server", "rust-mcp-sdk/hyper-server"]
23 changes: 22 additions & 1 deletion mistralrs-server/src/main.rs
@@ -10,6 +10,7 @@ use mistralrs_server_core::{

mod interactive_mode;
use interactive_mode::interactive_mode;
mod mcp_server;

#[derive(Parser)]
#[command(version, about, long_about = None)]
@@ -134,6 +135,10 @@ struct Args {
/// Enable thinking for interactive mode and models that support it.
#[arg(long = "enable-thinking")]
enable_thinking: bool,

/// Port to serve MCP protocol on
#[arg(long)]
mcp_port: Option<u16>,
}

fn parse_token_source(s: &str) -> Result<TokenSource, String> {
@@ -188,7 +193,10 @@ async fn main() -> Result<()> {
// Needs to be after the .build call as that is where the daemon waits.
let setting_server = if !args.interactive_mode {
let port = args.port.expect("Interactive mode was not specified, so expected port to be specified. Perhaps you forgot `-i` or `--port`?");
let ip = args.serve_ip.unwrap_or_else(|| "0.0.0.0".to_string());
let ip = args
.serve_ip
.clone()
.unwrap_or_else(|| "0.0.0.0".to_string());

// Create listener early to validate address before model loading
let listener = tokio::net::TcpListener::bind(format!("{ip}:{port}")).await?;
@@ -197,6 +205,19 @@
None
};

if let Some(port) = args.mcp_port {
let host = args
.serve_ip
.clone()
.unwrap_or_else(|| "0.0.0.0".to_string());
let mcp_server = mcp_server::create_mcp_server(mistralrs.clone(), host, port);
tokio::spawn(async move {
if let Err(e) = mcp_server.start().await {
eprintln!("MCP server error: {e}");
}
});
}

let app = MistralRsServerRouterBuilder::new()
.with_mistralrs(mistralrs)
.build()
95 changes: 95 additions & 0 deletions mistralrs-server/src/mcp_server.rs
@@ -0,0 +1,95 @@
use async_trait::async_trait;
use rust_mcp_sdk::{
mcp_server::{hyper_server, HyperServerOptions, ServerHandler},
schema::{
schema_utils::CallToolError, CallToolRequest, CallToolResult, CallToolResultContentItem,
Implementation, InitializeResult, ServerCapabilities, ServerCapabilitiesTools, TextContent,
LATEST_PROTOCOL_VERSION,
},
};
use std::io;

use mistralrs_server_core::{
chat_completion::{create_response_channel, parse_request},
types::SharedMistralRsState,
};

pub struct MistralMcpHandler {
pub state: SharedMistralRsState,
}

#[async_trait]
impl ServerHandler for MistralMcpHandler {
async fn handle_call_tool_request(
&self,
request: CallToolRequest,
_runtime: &dyn rust_mcp_sdk::McpServer,
) -> std::result::Result<CallToolResult, CallToolError> {
if request.params.name != "chat" {
return Err(CallToolError::unknown_tool(request.params.name));
}
let args = request.params.arguments.into();
let req: rust_mcp_sdk::schema::CreateMessageRequest =
serde_json::from_value(args).map_err(|e| CallToolError::new(io::Error::other(e)))?;
// Translate to ChatCompletionRequest
let chat_req: mistralrs_server_core::openai::ChatCompletionRequest =
serde_json::from_value(serde_json::to_value(req).unwrap())
.map_err(CallToolError::new)?;

🛠️ Refactor suggestion

Optimize the request conversion to avoid unnecessary serialization.

The current implementation serializes to JSON and then deserializes, which is inefficient. Consider implementing a direct conversion method or using a more efficient mapping approach.

```diff
-        let args = request.params.arguments.into();
-        let req: rust_mcp_sdk::schema::CreateMessageRequest =
-            serde_json::from_value(args).map_err(|e| CallToolError::new(io::Error::other(e)))?;
-        // Translate to ChatCompletionRequest
-        let chat_req: mistralrs_server_core::openai::ChatCompletionRequest =
-            serde_json::from_value(serde_json::to_value(req).unwrap())
-                .map_err(CallToolError::new)?;
+        let args = request.params.arguments.into();
+        let req: rust_mcp_sdk::schema::CreateMessageRequest =
+            serde_json::from_value(args).map_err(|e| CallToolError::new(io::Error::other(e)))?;
+        // TODO: Implement direct conversion from CreateMessageRequest to ChatCompletionRequest
+        // to avoid the overhead of JSON serialization/deserialization
+        let chat_req: mistralrs_server_core::openai::ChatCompletionRequest =
+            serde_json::from_value(serde_json::to_value(req).unwrap())
+                .map_err(CallToolError::new)?;
```

Would you like me to help implement a direct conversion method between these request types?

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```diff
-        let req: rust_mcp_sdk::schema::CreateMessageRequest =
-            serde_json::from_value(args).map_err(|e| CallToolError::new(io::Error::other(e)))?;
-        // Translate to ChatCompletionRequest
-        let chat_req: mistralrs_server_core::openai::ChatCompletionRequest =
-            serde_json::from_value(serde_json::to_value(req).unwrap())
-                .map_err(CallToolError::new)?;
+        let args = request.params.arguments.into();
+        let req: rust_mcp_sdk::schema::CreateMessageRequest =
+            serde_json::from_value(args).map_err(|e| CallToolError::new(io::Error::other(e)))?;
+        // TODO: Implement direct conversion from CreateMessageRequest to ChatCompletionRequest
+        // to avoid the overhead of JSON serialization/deserialization
+        let chat_req: mistralrs_server_core::openai::ChatCompletionRequest =
+            serde_json::from_value(serde_json::to_value(req).unwrap())
+                .map_err(CallToolError::new)?;
```
🤖 Prompt for AI Agents
In mistralrs-server/src/mcp_server.rs around lines 32 to 37, the code
inefficiently converts CreateMessageRequest to ChatCompletionRequest by
serializing to JSON and then deserializing. To fix this, implement a direct
conversion method such as From or TryFrom trait between these types or manually
map fields from CreateMessageRequest to ChatCompletionRequest without
intermediate JSON serialization.


let (tx, mut rx) = create_response_channel(None);
let (request, _is_streaming) = parse_request(chat_req, self.state.clone(), tx)
.await
.map_err(|e| CallToolError::new(io::Error::other(e.to_string())))?;
mistralrs_server_core::chat_completion::send_request(&self.state, request)
.await
.map_err(|e| CallToolError::new(io::Error::other(e.to_string())))?;
match rx.recv().await {
Some(mistralrs_core::Response::Done(resp)) => {
let content = resp
.choices
.iter()
.filter_map(|c| c.message.content.clone())
.collect::<Vec<_>>()
.join("\n");
Ok(CallToolResult {
content: vec![CallToolResultContentItem::TextContent(TextContent::new(
content, None,
))],
is_error: None,
meta: None,
})
}
Some(mistralrs_core::Response::ModelError(msg, _)) => {
Err(CallToolError::new(io::Error::other(msg)))
}
Some(_) | None => Err(CallToolError::new(io::Error::other("no response"))),
}

💡 Verification agent

🛠️ Refactor suggestion

🧩 Analysis chain

Review the response handling for completeness and structure preservation.

The current implementation has potential issues:

  1. It only receives one response, which might miss subsequent streaming responses
  2. Joining multiple choice contents with newlines loses the original response structure

Consider preserving the response structure or documenting the expected behavior.


🏁 Script executed:

```bash
#!/bin/bash
# Check if streaming responses are used elsewhere in the codebase
rg -A 5 "Response::Chunk|is_streaming.*true" --type rust
```

Length of output: 10753


Ensure mcp_server handles streaming responses and preserves choice structure

The current match in mistralrs-server/src/mcp_server.rs (lines 46–66) only handles Response::Done and treats any other variant (including streaming chunks) as an error. Since the core and examples use Response::Chunk for streaming, this will break any tool calls configured for streaming:

  • Streaming tool responses will hit the Some(_) arm and return a “no response” error.
  • Joining all choice contents with "\n" flattens multiple messages and loses per-choice metadata.

Please update this handler to either:

  • Accumulate Response::Chunk variants (e.g. buffer incoming chunk.choices until a terminal signal) and then emit a single CallToolResult (a sketch follows below), or
  • Forward each chunk as it arrives (e.g. streaming CallToolResultContentItem::TextContent), or
  • Clearly document that mcp_server does not support streaming tool responses and ensure the engine never emits chunks for this endpoint.
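
As a non-authoritative sketch of the first option (buffering chunks until a terminal signal), here is what the receive loop in `handle_call_tool_request` could become. The chunk-side field names (`choices`, `delta.content`, `finish_reason`) are assumed to mirror the OpenAI-style types used elsewhere in mistralrs and must be verified against `mistralrs_core` before adopting this:

```rust
// Hedged sketch: accumulate streamed chunks into a single CallToolResult.
// `Response::Chunk`, `choice.delta.content` and `choice.finish_reason` are
// assumed names; check them against the real mistralrs_core definitions.
let mut buffered = String::new();
loop {
    match rx.recv().await {
        // Non-streaming requests still resolve with a single `Done`.
        Some(mistralrs_core::Response::Done(resp)) => {
            let content = resp
                .choices
                .iter()
                .filter_map(|c| c.message.content.clone())
                .collect::<Vec<_>>()
                .join("\n");
            buffered.push_str(&content);
            break;
        }
        // Streaming requests deliver incremental deltas; stop once any
        // choice reports a finish reason.
        Some(mistralrs_core::Response::Chunk(chunk)) => {
            let mut finished = false;
            for choice in &chunk.choices {
                if let Some(text) = &choice.delta.content {
                    buffered.push_str(text);
                }
                finished |= choice.finish_reason.is_some();
            }
            if finished {
                break;
            }
        }
        Some(mistralrs_core::Response::ModelError(msg, _)) => {
            return Err(CallToolError::new(io::Error::other(msg)));
        }
        Some(_) | None => return Err(CallToolError::new(io::Error::other("no response"))),
    }
}
Ok(CallToolResult {
    content: vec![CallToolResultContentItem::TextContent(TextContent::new(
        buffered, None,
    ))],
    is_error: None,
    meta: None,
})
```

Forwarding chunks as they arrive would instead require emitting multiple results per request, which depends on `rust-mcp-sdk` exposing a streaming-capable result type.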

}
}

pub fn create_mcp_server(
state: SharedMistralRsState,
host: String,
port: u16,
) -> rust_mcp_sdk::mcp_server::HyperServer {
let server_details = InitializeResult {
server_info: Implementation {
name: "mistralrs".to_string(),
version: env!("CARGO_PKG_VERSION").to_string(),
},
capabilities: ServerCapabilities {
tools: Some(ServerCapabilitiesTools { list_changed: None }),
..Default::default()
},
meta: None,
instructions: Some("use tool 'chat'".to_string()),
protocol_version: LATEST_PROTOCOL_VERSION.to_string(),
};
let handler = MistralMcpHandler { state };
let opts = HyperServerOptions {
host,
port,
..Default::default()
};
hyper_server::create_server(server_details, handler, opts)
}