
Conversation

@brrr (Contributor) commented Mar 10, 2025

Complete: qwen2_5_vl model
Add: examples for qwen2_5_vl
Fix: for deterministic sampling, top_k SHOULD be Some(1) rather than None (see the sketch below)
Fix: UnquantLinear forward error when the device is CUDA but batch_matmul is not used
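To illustrate why deterministic sampling maps to top_k = Some(1) rather than None, here is a minimal standalone sketch (plain Rust, not the mistral.rs sampler): with Some(1) the top-k filter leaves exactly one candidate, so any subsequent random draw can only return the argmax token, while None disables the filter and leaves every token eligible.

fn top_k_candidates(logits: &[f32], top_k: Option<usize>) -> Vec<usize> {
    // Token indices sorted by descending logit.
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].partial_cmp(&logits[a]).unwrap());
    if let Some(k) = top_k {
        idx.truncate(k.min(idx.len())); // keep only the k best candidates
    }
    idx
}

fn main() {
    let logits = [0.1f32, 2.3, -1.0, 0.7];
    assert_eq!(top_k_candidates(&logits, Some(1)), vec![1]); // only the argmax survives
    assert_eq!(top_k_candidates(&logits, None).len(), 4);    // every token stays eligible
}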

EricLBuehler and others added 7 commits March 10, 2025 18:07
Todo: set_use_matmul_via_f16(true) from "pipeline/inputs_processor" causes a significant loss of precision,
which is hard to track down during subsequent debugging.
In any case, globally setting matmul precision MAY not be an ideal solution.
For now, change the precision back in mistralrs-core/src/vision_models/qwen2_5_vl/inputs_processor.rs

Qwen2_5vl feature is functional, start to clean up code

Add examples for lower_level_qwen2_5vl

Fix: for deterministic sampling, top_k SHOULD be Some(1) rather than None

Clean code

Rebase

Clean code

Fix cuda
github-actions bot commented Mar 10, 2025

Code Metrics Report
  ===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           34           29            0            5
 Dockerfile              1           41           22           10            9
 JSON                   12          105          104            0            1
 Makefile                1            6            5            0            1
 Python                 71         3026         2622           81          323
 Shell                   1           58           22           18           18
 Plain Text              3         3723            0         2413         1310
 TOML                   19          530          492            2           36
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               49         4041            0         3069          972
 |- BASH                 6          103          100            0            3
 |- JSON                 1           12           12            0            0
 |- Python               7          121          109            0           12
 |- Rust                16          549          464            0           85
 |- TOML                 2           75           63            0           12
 (Total)                           4901          748         3069         1084
-------------------------------------------------------------------------------
 Rust                  327       107133        95921         2118         9094
 |- Markdown           157         1803           25         1639          139
 (Total)                         108936        95946         3757         9233
===============================================================================
 Total                 492       118718        99236         7713        11769
===============================================================================
  

@MonolithFoundation commented

Would like to see this merged in; we are facing Qwen2.5 VL issues as well. @EricLBuehler

By the way, could we make some commonly used modules inside mistralrs_core public? That way, not everyone using mistralrs would need to clone the source and modify it. It would be easier to maintain a separate repo hosting various customized models, powered by mistralrs.

@EricLBuehler (Owner) left a comment

Hi @brrr - thanks for the great work! I haven't done any testing yet, but it looks good at an initial glance. Could you please fix the cargo fmt and cargo clippy issues?

@brrr (Contributor, Author) commented Mar 11, 2025

As @MonolithFoundation mentioned, making some mistralrs-core structures public MAY help with porting new models. During the process of porting Qwen 2.5 VL, I made some of them pub(crate) and wrote a simple model loader and pipeline, but discarded them before the final push. That's because:

  1. Most of the time, downstream code SHOULD NOT use core internal structures directly, but rather lifetime hooks with some kind of context. Separate repos that use the internal structures directly MAY face lots of breaking changes whenever upstream changes, making them hard to maintain.
  2. Besides lifetime hooks and context, we SHOULD also provide a scaffold which MAY include tracing and performance metrics (like OpenTelemetry) and evals to test each hook function's input/output (tensor shape, precision, even comparing against a persisted tensor such as an npy file exported from Transformers).
  3. "Conversion Kits": we SHOULD provide OEM kits including prefix cache, KV cache, attention implementations, etc., but also a simple way to let downstream change some parts easily. For example, an agentic framework with a built-in LLM inference engine MAY want to use its own prefix cache and KV cache implementations.

I think this will be the key factor in driving massive adoption of mistral.rs, on top of "Blazingly fast".
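As a concrete illustration of the eval idea in point 2, here is a hypothetical helper (the name and the npy-loading step are illustrative, not an existing mistral.rs utility) that checks a hook's output against a reference tensor exported from Transformers by comparing shape and maximum absolute difference:

fn check_against_reference(
    name: &str,
    ours: (&[f32], &[usize]),      // (data, shape) produced by the hook under test
    reference: (&[f32], &[usize]), // (data, shape) loaded from the exported npy file
    tol: f32,
) -> Result<(), String> {
    if ours.1 != reference.1 {
        return Err(format!("{name}: shape mismatch {:?} vs {:?}", ours.1, reference.1));
    }
    // Maximum absolute elementwise difference.
    let max_diff = ours
        .0
        .iter()
        .zip(reference.0)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    if max_diff > tol {
        return Err(format!("{name}: max abs diff {max_diff} exceeds tolerance {tol}"));
    }
    Ok(())
}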

@EricLBuehler (Owner) left a comment

Hi @brrr!

I tested this out and it works well! I added a few (relatively minor) comments which should be addressed before merge; otherwise everything looks great.

let image = image::load_from_memory(&bytes)?;

//force map all layers to gpu
let device_mapper = DeviceMapSetting::Map(DeviceMapMetadata::from_num_device_layers(vec![
@EricLBuehler (Owner):

By default, automatic device mapping will be used to optimally spread layers across multiple GPUs/CPU, while preserving space for activations. Can you please remove this?

@brrr (Contributor, Author) commented Mar 12, 2025

> By default, automatic device mapping will be used to optimally spread layers across multiple GPUs/CPU, while preserving space for activations. Can you please remove this?

Done.

And one more thing: after removing the manual device mapping, I wanted to test mixed CPU and GPU (Metal) inference. Without the Accelerate feature, matmul will use f16, which MAY cause NaN errors. I intended to use clamp_for_f16 to prevent that from happening, but I don't think it's elegant. @EricLBuehler could you please consider a comprehensive refactoring of this code? In particular, when Apple releases a new chip that supports native fp8 and the raw precision is fp8, this will still be a problem.

#[cfg(not(feature = "accelerate"))]
{
    if a.device().is_cpu() {
        let original_dtype = a.dtype();
        return a
            .to_dtype(DType::F16)?
            .matmul(&b.to_dtype(DType::F16)?)?
            .to_dtype(original_dtype);
    } else if !get_use_matmul_via_f16() {
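
The guard mentioned above boils down to the following idea; this is a minimal standalone sketch of a clamp_for_f16-style helper (not the mistral.rs implementation): clamp f32 activations into f16's representable range before downcasting, so values beyond roughly 65504 do not become infinities and later turn into NaNs during matmul.

fn clamp_for_f16(values: &mut [f32]) {
    const F16_MAX: f32 = 65504.0; // largest finite f16 value
    for v in values.iter_mut() {
        *v = v.clamp(-F16_MAX, F16_MAX);
    }
}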

@MonolithFoundation commented

@EricLBuehler @brrr PR #1185 ports the needed components to public, and we have written a third-party model zoo including Qwen2.5 VL; it works well: https://github.com/lucasjinreal/Namors

It's far easier to write customized models with this modification.

@EricLBuehler (Owner) commented

> As @MonolithFoundation mentioned, making some mistralrs-core structures public MAY help with porting new models. During the process of porting Qwen 2.5 VL, I made some of them pub(crate) and wrote a simple model loader and pipeline, but discarded them before the final push. That's because:
>
>   1. Most of the time, downstream code SHOULD NOT use core internal structures directly, but rather lifetime hooks with some kind of context. Separate repos that use the internal structures directly MAY face lots of breaking changes whenever upstream changes, making them hard to maintain.
>   2. Besides lifetime hooks and context, we SHOULD also provide a scaffold which MAY include tracing and performance metrics (like OpenTelemetry) and evals to test each hook function's input/output (tensor shape, precision, even comparing against a persisted tensor such as an npy file exported from Transformers).
>   3. "Conversion Kits": we SHOULD provide OEM kits including prefix cache, KV cache, attention implementations, etc., but also a simple way to let downstream change some parts easily. For example, an agentic framework with a built-in LLM inference engine MAY want to use its own prefix cache and KV cache implementations.
>
> I think this will be the key factor in driving massive adoption of mistral.rs, on top of "Blazingly fast".

I think I mostly agree with this. #1185 is a good step, and I'll evaluate if it's reasonable to export that functionality - I feel that we already export too much from mistralrs-core. Perhaps we could design a system for trait-based hooks into the loading and pipeline process, exporting the utilities that we have access to but nothing internal.

  1. "Convertion Kits": SHOULD provide OEM Kits including prefix cache, KV cache, attention implement... but also provide a simple way to let downstream change some parts easily

I think this could also be done with the trait-based hook system. Similar to what we already have with CustomLogitsProcessor!
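
To make the idea concrete, here is a rough, hypothetical sketch of what such a trait-based hook surface could look like; every name below is illustrative, and none of these traits exist in mistral.rs today:

use std::sync::Arc;

/// Context handed to hooks; kept opaque so mistralrs-core internals can evolve freely.
pub struct PipelineContext {
    pub model_id: String,
}

/// Lifetime hooks into the loading and pipeline process.
pub trait PipelineHook: Send + Sync {
    fn on_load(&self, _ctx: &PipelineContext) {}
    fn on_prefill_start(&self, _ctx: &PipelineContext) {}
    fn on_decode_step(&self, _ctx: &PipelineContext, _step: usize) {}
}

/// Swappable prefix-cache policy, in the spirit of CustomLogitsProcessor.
pub trait PrefixCachePolicy: Send + Sync {
    fn should_cache(&self, prompt_tokens: &[u32]) -> bool;
}

/// Downstream crates register implementations instead of reaching into internals.
pub struct PipelineBuilder {
    hooks: Vec<Arc<dyn PipelineHook>>,
    prefix_cache: Option<Arc<dyn PrefixCachePolicy>>,
}

impl PipelineBuilder {
    pub fn new() -> Self {
        Self { hooks: Vec::new(), prefix_cache: None }
    }
    pub fn with_hook(mut self, hook: Arc<dyn PipelineHook>) -> Self {
        self.hooks.push(hook);
        self
    }
    pub fn with_prefix_cache(mut self, policy: Arc<dyn PrefixCachePolicy>) -> Self {
        self.prefix_cache = Some(policy);
        self
    }
}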

@EricLBuehler (Owner) commented

@brrr the changes you made look good. There was an internal API that had a breaking change in #1190, which is causing these errors. Could you please take a look?

@brrr (Contributor, Author) commented Mar 12, 2025

> @brrr the changes you made look good. There was an internal API that had a breaking change in #1190, which is causing these errors. Could you please take a look?

Done.

@EricLBuehler (Owner) left a comment

Thank you!

@EricLBuehler EricLBuehler merged commit 939f674 into EricLBuehler:master Mar 12, 2025
12 checks passed
Jeadie added a commit to spiceai/mistral.rs that referenced this pull request Apr 20, 2025
* Refactor NCCL device mappers (EricLBuehler#1172)

* Bump ring from 0.17.11 to 0.17.13 (EricLBuehler#1179)

Bumps [ring](https://github.com/briansmith/ring) from 0.17.11 to 0.17.13.
- [Changelog](https://github.com/briansmith/ring/blob/main/RELEASES.md)
- [Commits](https://github.com/briansmith/ring/commits)

---
updated-dependencies:
- dependency-name: ring
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* DSV3/R1 fixes (EricLBuehler#1173)

* DSv3 fixes

* Just save the progress

* Fix launch of blockwise fp8 dequant

* It actually works

* Async ops

* Optimize non-mla with cat

* Fix non-cuda build

* Update build

* Add more CUDA_CHECK

* Works really now

* Working fully now with pagedattn

* Format everything

* Fix diffusion device mapping (EricLBuehler#1187)

* Internal abstraction for distributed op (EricLBuehler#1188)

* Make Sequence::set_toks more safe (EricLBuehler#1190)

* Fix CI tests out of storage (EricLBuehler#1191)

* Internal abstraction for distributed op (EricLBuehler#1189)

* Fix build_cuda_all.yaml CI (EricLBuehler#1193)

* Support tensor parallelism for vision models! (EricLBuehler#1194)

* Refactor distributed mapper prep

* Support vision model TP

* Update docs

* Add vision model TP for mllama

* Always pass _USE_MATH_DEFINES for CUDA (EricLBuehler#1195)

* Always pass _USE_MATH_DEFINES

* Cargo.lock

* Remove matmul via f16 framework (EricLBuehler#1196)

* Remove API for matmul_via_f16 (EricLBuehler#1197)

* Add UQFF text/vision model API (EricLBuehler#1198)

* Add UQFF text/vision model API

* Typos

* Implement Qwen 2.5 VL! (EricLBuehler#1184)

* Implement Qwen 2.5 VL

* Reverse window index select

* Switch to rmsnorm

* Warn

* Fix config, loads now

* Fixes

* Complete qwen2_5vl feature

Todo: set_use_matmul_via_f16(true) from "pipeline/inputs_processor" causes a significant loss of precision,
which is hard to track down during subsequent debugging.
In any case, globally setting matmul precision MAY not be an ideal solution.
For now, change the precision back in mistralrs-core/src/vision_models/qwen2_5_vl/inputs_processor.rs

Qwen2_5vl feature is functional, start to clean up code

Add examples for lower_level_qwen2_5vl

Fix: for deterministic sampling, top_k SHOULD be Some(1) rather than None

Clean code

Rebase

Clean code

Fix cuda

* Fix Rustfmt and Clippy issues

* Clean code

* Merge branch 'main'

---------

Co-authored-by: Eric Buehler <[email protected]>

* Implement Gemma 3 (text only)! (EricLBuehler#1201)

* Add config

* Add the text model

* Add inputs processor, loads/runs now

* It works!

* Add to APIs

* Implement Gemma 3 vision support! (EricLBuehler#1202)

* Add vision support for Gemma 3

* Implement image preprocessor and processor

* It works, kind of

* It works great

* Mask must be contiguous

* Update docs

* Format

* Manually fixup sentencepiece detok (EricLBuehler#1204)

* More vision models with TP (EricLBuehler#1200)

* More models for tp

* Fix clippy

* Fix topology link in the docs (EricLBuehler#1205)

* Gemma3 1b support and optimized rotating cache (EricLBuehler#1206)

* Support text-only gemma3

* Add rotating kv cache

* Do not preallocate rotating kv cache

* Improve rotating kv cache, prefix cacher system (EricLBuehler#1207)

* Improve rotating kv cache set_len and more intelligent prefix cacher v2

* Remove prefix cacher v1

* Better handling for kvcache set_len (EricLBuehler#1208)

* Fix gemma3 vision device in isq

* Update deps and use rand 0.9 (EricLBuehler#1210)

* Fix flash-attn v3 build

* Update hf hub dep, add initial blockwise fp8 GEMM tests (EricLBuehler#1212)

* Update hf_hub dep to not require openssl and add tests

* Update deps

* Fixes

* Undo 'fix' from clippy

* Ok maybe finally fix it

* Growable RotatingKvCache and fixes for Phi-4 mini (EricLBuehler#1215)

* Fixes for phi4 mini

* Fix causal mask

* Growable rotating kv cache

* Fix clippy

* Use docker build for x86 pyo3 wheels

* Fix cuda warn

* Vision model pagedattn fixes (EricLBuehler#1217)

* Gemma 3 cuda fixes

* Fix pagedattn bug

* Clippy

* Small fix for rotating cache?

* Add pydantic schema examples! (EricLBuehler#1219)

* Sliding window attention fixes (EricLBuehler#1220)

* Initial fixes for sliding window

* Fix swa, still without prefix cache

* Ok finally it works

* Handle multiple eos toks

* adapt to rig crate as client (EricLBuehler#1214)

* adapt to rig crate as client

* adapt to rig crate as client

* Implement Mistral 3! (EricLBuehler#1221)

* Add vision model and load language model

* Implement the mmproj and patch merger!

* Remove plot

* Reshaping patch embeds with image sizes, make block attn mask

* Add the inputs merging and forward

* Basic loader, a bunch of todos still

* Add the inputs processor

* Clippy

* Some fixes

* It works!

* Implement for the automatic device mapping

* ISQ support for the vision model too

* Docs

* Fused Metal SDPA with masking! (EricLBuehler#1225)

* Metal SDPA with masking

* Much faster quantization on metal!

* Check if actually metal

* Materialize the mask

* Fix cuda

* Format

* Send [DONE] SSE chunk per openai spec (EricLBuehler#1226)

* Fix handling of device when compiled for but disabled nccl (EricLBuehler#1227)

* Fix nccl blocking case (EricLBuehler#1228)

* Native Llama, Mistral Small 3.1, Mistral Nemo, Hermes 2 Pro, Hermes 3 tool calling! (EricLBuehler#1229)

* Llama model tool calling support

* Llama tool calling works

* Nice tool calling support

* Tool calling working with Mistral 3

* Support hermes

* Mistral nemo support

* Update server tool calling example

* OpenAI API compatibility fixes (EricLBuehler#1230)

* Content itself is optional

* Only provide tool calls if they are not empty

* Add response_format support

* Fix response-format

* Fix json_schema.py example

* [Breaking] Automatic server logging (EricLBuehler#1231)

* Add logger for server

* Clippy

* Tweak

* Configurable

* Format

* Remove simple_tool_calling.py as deprecated

* Use default stream for flash attn (EricLBuehler#1232)

* More accurate throughput logging

* Bump version to 0.5.0 (EricLBuehler#1233)

* Fix handling of Metal fused attn head dims (EricLBuehler#1234)

* Fix handling of metal attn head dims

* Fix handling of gemma3 1b when images

* Tweak default for paged attn builder

* Support paged attn for vision model rust api (EricLBuehler#1235)

* [Breaking] Support setting HF cache path (EricLBuehler#1237)

* Add it internally

* Add the apis

* Support tool calling for DeepSeek models (EricLBuehler#1239)

* Support tool calling for deepseek models

* Format

* Fix deepseek

* Server image processing refactor and fixes (EricLBuehler#1244)

* Fix strict gemma3 case

* Accept multiple images in the content array

* Fix multiple images in one array ct

* Add it to the python api

* Typos

* Optimized CUDA RoPE kernels (EricLBuehler#1247)

* Add the kernels

* It works

* Works

* Builds

* Typo fix (add_speial_tokens to add_special_tokens) (EricLBuehler#1246)

* Fix typo

* Update mistralrs.pyi

* Fixes for UQFF + distributed layers (EricLBuehler#1250)

* Fixes for uqff + distributed layers

* Typo

* Automatic agentic search integration (`web_search_options`) (EricLBuehler#1243)

* Add the tool

* Actually search

* Clippy

* Sort of works

* Remove some debuggers

* tweak

* Add some rules

* Works great

* Tweak 'system' prompt

* Update mistralrs-core/src/search/mod.rs

Co-authored-by: Copilot <[email protected]>

* Typo

* Add it to all the apis

* Add bert model for similarity reranking

* Typos

* Early detection of tools

* Alias max_tokens -> max_completion_tokens too

* Customizable bert model

* Flip the enabler around

* Add docs

* Update readme

* Typo

---------

Co-authored-by: Copilot <[email protected]>

* Format kernels (EricLBuehler#1251)

* Update readme

* Update readme

* Remove test

* Add quantize guards for uqff deserialize (EricLBuehler#1252)

* Refactor cuBLASlt-related code (EricLBuehler#1253)

* Centralize cublaslt into mistralrs-quant

* Use cublaslt in unquant layer

* Use beautiful trait constants for simpler code

* Move tests

* Dispatch to unquant for cublaslt

* Dispatch to unquant for cublaslt

* Fix feature

* Add convert_to_gptq script

* Update deps, bump pyo3 version (EricLBuehler#1259)

* Faster cuda FP8 performance (EricLBuehler#1257)

* Avoid fp8 sync

* Fix dtype

* Rust 1.86 clippy (EricLBuehler#1260)

* Rust 1.86 clippy

* Clippy

* Refactor engine arch (EricLBuehler#1262)

* Refactor engine add_request

* Don't recompile regex

* Clippy

* Revamped LoRA support - removing the Ordering system! (EricLBuehler#1263)

* Play with varbuilder lifetimes

* Merge lora weights

* Clippy

* Lora works

* Support multiple loras

* Cleanup, remove adapter activation

* Complete merge

* Fast Metal-specific quantization method: AFQ (EricLBuehler#1264)

* Add mlx quantized kernels

* Add mlx quantized kernels

* Kernel launcher

* Add AFQ isq quant and dequant

* Some quantmethod things

* Begin to implement the qmm caller

* Clippy

* Much faster

* Cache kernels

* Docs

* Clippy

* Add it to uqff

* Support prequantized models from MLX (EricLBuehler#1265)

* Refactor quantizedconfig

* Support AFQ prequantized

* Update docs

* Update docs

* Automatic ISQ to select fastest & most accurate method (EricLBuehler#1266)

* Automatic isq

* typo

* Doc

* Improved usage metrics (EricLBuehler#1267)

* Fix cuda

* Bump tokio from 1.44.1 to 1.44.2 (EricLBuehler#1270)

Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.44.1 to 1.44.2.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](tokio-rs/tokio@tokio-1.44.1...tokio-1.44.2)

---
updated-dependencies:
- dependency-name: tokio
  dependency-version: 1.44.2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Gather MM ops in mistralrs-quant (EricLBuehler#1272)

* Update the caller

* Wire things up

* Broadcast for afq gathermm

* Broadcast for afq gathermm

* Clippy

* Improve performance of deepseek models

* Typo fix

* BincountOp not used

* Implement Llama 4! (EricLBuehler#1268)

* Implement Llama 4

* Implement the main changes for the text model

* Make chunked mask

* Wire things up

* Add some EP

* Initial sketch of inputs processor

* Runs

* Progress

* all reduce moes

* It works!

* Some cleanup

* Faster moe block

* Add device map

* Make chunked matrix

* Fully working now!

* Reactivate cublaslt

* Fix shared mlp cublaslt

* Refactor to packed experts

* Complete merge

* It is a normal model now

* Fixes

* Set device for moe

* ISQ fixes

* Much faster sort kernel

* Faster loading!

* Faster loading!

* Fp8 cpu copy ops in candle backend

* Add the vision model

* Add mmproj layer

* Actually merge the inputs

* Sketch most of the image processor

* Add the rest of the image processor

* Implement the whole processor

* Add the loader

* Some fixes

* A batch of fixes

* Some fixes

* tmp

* Actually support isq

* Ok it works a bit

* Fix norm device

* It works

* A bit cleaner

* Support residual tensors

* Remove text loader

* Implement the device mapping system

* Fix auto device map

* Add examples

* Add model card

* Typo

* Remove superfluous logging

* Fixes for Llama 4 UQFF loading (EricLBuehler#1275)

* Support sharding for UQFF (EricLBuehler#1276)

* Serialize sharded uqff files

* Loading

* Fix base64

* Fix bug for group-topk (group_limited_greedy) in deepseek models (EricLBuehler#1278)

* Support the DeepCoder model (EricLBuehler#1279)

* Add faq for metal not found

* updates from candle

* fixes

* relax tokio

* make AdapterPaths, LoraAdapterPaths public

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: brrr <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: Etienne Balit <[email protected]>
Co-authored-by: benliao <[email protected]>
Co-authored-by: edwko <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Guoqing Bao <[email protected]>