Commit 91a5ad7
Updates from EricLBuehler/mistralrs (#27)
* Refactor NCCL device mappers (EricLBuehler#1172)
* Bump ring from 0.17.11 to 0.17.13 (EricLBuehler#1179)
Bumps [ring](https://github.com/briansmith/ring) from 0.17.11 to 0.17.13.
- [Changelog](https://github.com/briansmith/ring/blob/main/RELEASES.md)
- [Commits](https://github.com/briansmith/ring/commits)
---
updated-dependencies:
- dependency-name: ring
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* DSV3/R1 fixes (EricLBuehler#1173)
* DSv3 fixes
* Just save the progress
* Fix launch of blockwise fp8 dequant
* It actually works
* Async ops
* Optimize non-mla with cat
* Fix non-cuda build
* Update build
* Add more CUDA_CHECK
* Works really now
* Working fully now with pagedattn
* Format everything
* Fix diffusion device mapping (EricLBuehler#1187)
* Internal abstraction for distributed op (EricLBuehler#1188)
* Make Sequence::set_toks safer (EricLBuehler#1190)
* Fix CI tests out of storage (EricLBuehler#1191)
* Internal abstraction for distributed op (EricLBuehler#1189)
* Fix build_cuda_all.yaml CI (EricLBuehler#1193)
* Support tensor parallelism for vision models! (EricLBuehler#1194)
* Refactor distributed mapper prep
* Support vision model TP
* Update docs
* Add vision model TP for mllama
* Always pass _USE_MATH_DEFINES for CUDA (EricLBuehler#1195)
* Always pass _USE_MATH_DEFINES
* Cargo.lock
* Remove matmul via f16 framework (EricLBuehler#1196)
* Remove API for matmul_via_f16 (EricLBuehler#1197)
* Add UQFF text/vision model API (EricLBuehler#1198)
* Add UQFF text/vision model API
* Typos
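A hedged sketch of what the new UQFF API from EricLBuehler#1198 enables, based on the uqff/uqff_vision examples in this commit's file tree; the builder name, method chain, model ID, and file name below are assumptions, not verbatim PR contents:

```rust
use anyhow::Result;
use mistralrs::{TextMessageRole, TextMessages, UqffTextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // Load a prequantized UQFF artifact directly; the ID and file are placeholders.
    let model = UqffTextModelBuilder::new(
        "EricB/Example-Model-UQFF",      // hypothetical model ID
        vec!["example-q4k.uqff".into()], // hypothetical UQFF file(s)
    )
    .into_inner()
    .build()
    .await?;

    let messages = TextMessages::new().add_message(TextMessageRole::User, "Hello!");
    let response = model.send_chat_request(messages).await?;
    println!("{:?}", response.choices[0].message.content);
    Ok(())
}
```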
* Implement Qwen 2.5 VL! (EricLBuehler#1184)
* Implement Qwen 2.5 VL
* Reverse window index select
* Switch to rmsnorm
* Warn
* Fix config, loads now
* Fixes
* Complete qwen2_5vl feature
Todo: set_use_matmul_via_f16(true) from "pipeline/inputs_processor" causes a significant loss of precision.
It's hard to track down during subsequent debugging.
In any case, globally setting matmul precision MAY not be an ideal solution.
For now, change the precision back in mistralrs-core/src/vision_models/qwen2_5_vl/inputs_processor.rs
Qwen2_5vl feature is functional, starting code cleanup
Add examples for lower_level_qwen2_5vl
Fix: for deterministic sampling, top k SHOULD be Some(1) rather than None
Clean code
Rebase
Clean code
Fix cuda
* Fix Rustfmt and Clippy issues
* Clean code
* Merge branch 'main'
---------
Co-authored-by: Eric Buehler <[email protected]>
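The top-k note above deserves a concrete illustration. This is a generic sketch, not the mistral.rs sampler: with top_k = None the full distribution stays in play, so a nominally deterministic run can still drift, while Some(1) collapses the candidate set to the argmax:

```rust
/// Indices of the k highest-logit tokens; k = None keeps every token.
fn top_k_indices(logits: &[f32], k: Option<usize>) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].total_cmp(&logits[a])); // descending by logit
    match k {
        Some(k) => idx.into_iter().take(k).collect(), // Some(1) => greedy argmax
        None => idx, // the entire vocabulary remains a candidate
    }
}
```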
* Implement Gemma 3 (text only)! (EricLBuehler#1201)
* Add config
* Add the text model
* Add inputs processor, loads/runs now
* It works!
* Add to APIs
* Implement Gemma 3 vision support! (EricLBuehler#1202)
* Add vision support for Gemma 3
* Implement image preprocessor and processor
* It works, kind of
* It works great
* Mask must be contiguous
* Update docs
* Format
* Manually fixup sentencepiece detok (EricLBuehler#1204)
* More vision models with TP (EricLBuehler#1200)
* More models for tp
* Fix clippy
* Fix topology link in the docs (EricLBuehler#1205)
* Gemma3 1b support and optimized rotating cache (EricLBuehler#1206)
* Support text-only gemma3
* Add rotating kv cache
* Do not preallocate rotating kv cache
* Improve rotating kv cache, prefix cacher system (EricLBuehler#1207)
* Improve rotating kv cache set_len and more intelligent prefix cacher v2
* Remove prefix cacher v1
* Better handling for kvcache set_len (EricLBuehler#1208)
* Fix gemma3 vision device in isq
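A generic sketch of the rotating-cache idea behind these PRs (not the actual mistral.rs implementation): sliding-window attention only ever reads the last `window` tokens' K/V, so once the buffer fills, new entries rotate over the oldest slot instead of growing the cache:

```rust
/// Ring-buffer KV cache for a sliding attention window.
struct RotatingCache<T> {
    buf: Vec<T>,
    window: usize,
    next: usize, // slot the next token's K/V will occupy once full
}

impl<T> RotatingCache<T> {
    fn new(window: usize) -> Self {
        Self { buf: Vec::with_capacity(window), window, next: 0 }
    }

    fn push(&mut self, kv: T) {
        if self.buf.len() < self.window {
            self.buf.push(kv); // still filling: plain append
        } else {
            self.buf[self.next] = kv; // full: overwrite the oldest entry
        }
        self.next = (self.next + 1) % self.window;
    }
}
```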
* Update deps and use rand 0.9 (EricLBuehler#1210)
* Fix flash-attn v3 build
* Update hf hub dep, add initial blockwise fp8 GEMM tests (EricLBuehler#1212)
* Update hf_hub dep to not require openssl and add tests
* Update deps
* Fixes
* Undo 'fix' from clippy
* Ok maybe finally fix it
* Growable RotatingKvCache and fixes for Phi-4 mini (EricLBuehler#1215)
* Fixes for phi4 mini
* Fix causal mask
* Growable rotating kv cache
* Fix clippy
* Use docker build for x86 pyo3 wheels
* Fix cuda warn
* Vision model pagedattn fixes (EricLBuehler#1217)
* Gemma 3 cuda fixes
* Fix pagedattn bug
* Clippy
* Small fix for rotating cache?
* Add pydantic schema examples! (EricLBuehler#1219)
* Sliding window attention fixes (EricLBuehler#1220)
* Initial fixes for sliding window
* Fix swa, still without prefix cache
* Ok finally it works
* Handle multiple eos toks
* adapt to rig crate as client (EricLBuehler#1214)
* adapt to rig crate as client
* adapt to rig crate as client
* Implement Mistral 3! (EricLBuehler#1221)
* Add vision model and load language model
* Implement the mmproj and patch merger!
* Remove plot
* Reshaping patch embeds with image sizes, make block attn mask
* Add the inputs merging and forward
* Basic loader, a bunch of todos still
* Add the inputs processor
* Clippy
* Some fixes
* It works!
* Implement for the automatic device mapping
* ISQ support for the vision model too
* Docs
* Fused Metal SDPA with masking! (EricLBuehler#1225)
* Metal SDPA with masking
* Much faster quantization on metal!
* Check if actually metal
* Materialize the mask
* Fix cuda
* Format
* Send [DONE] SSE chunk per openai spec (EricLBuehler#1226)
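For reference, the OpenAI streaming spec this PR follows ends every chat-completion stream with a literal `[DONE]` sentinel after the final JSON chunk (payloads abbreviated here):

```
data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```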
* Fix handling of device when compiled for but disabled nccl (EricLBuehler#1227)
* Fix nccl blocking case (EricLBuehler#1228)
* Native Llama, Mistral Small 3.1, Mistral Nemo, Hermes 2 Pro, Hermes 3 tool calling! (EricLBuehler#1229)
* Llama model tool calling support
* Llama tool calling works
* Nice tool calling support
* Tool calling working with Mistral 3
* Support hermes
* Mistral nemo support
* Update server tool calling example
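These model-native formats sit behind the standard OpenAI-style `tools` field; a minimal request sketch (model name and function are placeholders) that mistral.rs translates into each model's own tool-call template:

```json
{
  "model": "mistralrs",
  "messages": [{"role": "user", "content": "What's the weather in Boston?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}
```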
* OpenAI API compatibility fixes (EricLBuehler#1230)
* Content itself is optional
* Only provide tool calls if they are not empty
* Add response_format support
* Fix response-format
* Fix json_schema.py example
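The `response_format` addition follows the OpenAI structured-output shape; a request sketch (which variants mistral.rs accepts beyond this is not spelled out in the PR):

```json
{
  "model": "mistralrs",
  "messages": [{"role": "user", "content": "Give me a city as JSON."}],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "city",
      "schema": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "country": {"type": "string"}
        },
        "required": ["name", "country"]
      }
    }
  }
}
```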
* [Breaking] Automatic server logging (EricLBuehler#1231)
* Add logger for server
* Clippy
* Tweak
* Configurable
* Format
* Remove simple_tool_calling.py as deprecated
* Use default stream for flash attn (EricLBuehler#1232)
* More accurate throughput logging
* Bump version to 0.5.0 (EricLBuehler#1233)
* Fix handling of Metal fused attn head dims (EricLBuehler#1234)
* Fix handling of metal attn head dims
* Fix handling of gemma3 1b when images
* Tweak default for paged attn builder
* Support paged attn for vision model rust api (EricLBuehler#1235)
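A hedged sketch of what this enables, mirroring the existing text-model builder; the type and method names follow the mistralrs crate of this era and should be treated as assumptions:

```rust
use anyhow::Result;
use mistralrs::{PagedAttentionMetaBuilder, VisionLoaderType, VisionModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // Model ID and loader type are illustrative placeholders.
    let _model = VisionModelBuilder::new("google/gemma-3-4b-it", VisionLoaderType::Gemma3)
        .with_paged_attn(|| PagedAttentionMetaBuilder::default().build())?
        .build()
        .await?;
    Ok(())
}
```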
* [Breaking] Support setting HF cache path (EricLBuehler#1237)
* Add it internally
* Add the apis
* Support tool calling for DeepSeek models (EricLBuehler#1239)
* Support tool calling for deepseek models
* Format
* Fix deepseek
* Server image processing refactor and fixes (EricLBuehler#1244)
* Fix strict gemma3 case
* Accept multiple images in the content array
* Fix multiple images in one array ct
* Add it to the python api
* Typos
* Optimized CUDA RoPE kernels (EricLBuehler#1247)
* Add the kernels
* It works
* Works
* Builds
* Typo fix (add_speial_tokens to add_special_tokens) (EricLBuehler#1246)
* Fix typo
* Update mistralrs.pyi
* Fixes for UQFF + distributed layers (EricLBuehler#1250)
* Fixes for uqff + distributed layers
* Typo
* Automatic agentic search integration (`web_search_options`) (EricLBuehler#1243)
* Add the tool
* Actually search
* Clippy
* Sort of works
* Remove some debuggers
* tweak
* Add some rules
* Works great
* Tweak 'system' prompt
* Update mistralrs-core/src/search/mod.rs
Co-authored-by: Copilot <[email protected]>
* Typo
* Add it to all the apis
* Add bert model for similarity reranking
* Typos
* Early detection of tools
* Alias max_tokens -> max_completion_tokens too
* Customizable bert model
* Flip the enabler around
* Add docs
* Update readme
* Typo
---------
Co-authored-by: Copilot <[email protected]>
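On the wire this rides on an OpenAI-style `web_search_options` field; a minimal request sketch, where the empty object defers to the defaults (including the customizable BERT reranker mentioned above):

```json
{
  "model": "mistralrs",
  "messages": [
    {"role": "user", "content": "What is the latest mistral.rs release?"}
  ],
  "web_search_options": {}
}
```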
* Format kernels (EricLBuehler#1251)
* Update readme
* Update readme
* Remove test
* Add quantize guards for uqff deserialize (EricLBuehler#1252)
* Refactor cuBLASlt-related code (EricLBuehler#1253)
* Centralize cublaslt into mistralrs-quant
* Use cublaslt in unquant layer
* Use beautiful trait constants for simpler code
* Move tests
* Dispatch to unquant for cublaslt
* Dispatch to unquant for cublaslt
* Fix feature
* Add convert_to_gptq script
* Update deps, bump pyo3 version (EricLBuehler#1259)
* Faster cuda FP8 performance (EricLBuehler#1257)
* Avoid fp8 sync
* Fix dtype
* Rust 1.86 clippy (EricLBuehler#1260)
* Rust 1.86 clippy
* Clippy
* Refactor engine arch (EricLBuehler#1262)
* Refactor engine add_request
* Don't recompile regex
* Clippy
* Revamped LoRA support - removing the Ordering system! (EricLBuehler#1263)
* Play with varbuilder lifetimes
* Merge lora weights
* Clippy
* Lora works
* Support multiple loras
* Cleanup, remove adapter activation
* Complete merge
* Fast Metal-specific quantization method: AFQ (EricLBuehler#1264)
* Add mlx quantized kernels
* Add mlx quantized kernels
* Kernel launcher
* Add AFQ isq quant and dequant
* Some quantmethod things
* Begin to implement the qmm caller
* Clippy
* Much faster
* Cache kernels
* Docs
* Clippy
* Add it to uqff
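A hedged sketch of opting into AFQ from the Rust API; `IsqType::AFQ4` is an assumption based on this PR's additions, and since AFQ is Metal-only this would error on other backends:

```rust
use anyhow::Result;
use mistralrs::{IsqType, TextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    // Quantize in place at load time with the new Metal AFQ method.
    // The model ID is an illustrative placeholder.
    let _model = TextModelBuilder::new("microsoft/Phi-3.5-mini-instruct")
        .with_isq(IsqType::AFQ4) // assumed variant name
        .build()
        .await?;
    Ok(())
}
```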
* Support prequantized models from MLX (EricLBuehler#1265)
* Refactor quantizedconfig
* Support AFQ prequantized
* Update docs
* Update docs
* Automatic ISQ to select fastest & most accurate method (EricLBuehler#1266)
* Automatic isq
* typo
* Doc
* Improved usage metrics (EricLBuehler#1267)
* Fix cuda
* Bump tokio from 1.44.1 to 1.44.2 (EricLBuehler#1270)
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.44.1 to 1.44.2.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](tokio-rs/tokio@tokio-1.44.1...tokio-1.44.2)
---
updated-dependencies:
- dependency-name: tokio
dependency-version: 1.44.2
dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Gather MM ops in mistralrs-quant (EricLBuehler#1272)
* Update the caller
* Wire things up
* Broadcast for afq gathermm
* Broadcast for afq gathermm
* Clippy
* Improve performance of deepseek models
* Typo fix
* BincountOp not used
* Implement Llama 4! (EricLBuehler#1268)
* Implement Llama 4
* Implement the main changes for the text model
* Make chunked mask
* Wire things up
* Add some EP
* Initial sketch of inputs processor
* Runs
* Progress
* all reduce moes
* It works!
* Some cleanup
* Faster moe block
* Add device map
* Make chunked matrix
* Fully working now!
* Reactivate cublaslt
* Fix shared mlp cublaslt
* Refactor to packed experts
* Complete merge
* It is a normal model now
* Fixes
* Set device for moe
* ISQ fixes
* Much faster sort kernel
* Faster loading!
* Faster loading!
* Fp8 cpu copy ops in candle backend
* Add the vision model
* Add mmproj layer
* Actually merge the inputs
* Sketch most of the image processor
* Add the rest of the image processor
* Implement the whole processor
* Add the loader
* Some fixes
* A batch of fixes
* Some fixes
* tmp
* Actually support isq
* Ok it works a bit
* Fix norm device
* It works
* A bit cleaner
* Support residual tensors
* Remove text loader
* Implement the device mapping system
* Fix auto device map
* Add examples
* Add model card
* Typo
* Remove superfluous logging
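The "Make chunked mask" step above refers to Llama 4's chunked attention; a generic sketch of such a mask (not the mistral.rs kernel): position i may attend to position j only if j is causal and falls in the same fixed-size chunk:

```rust
/// Chunked causal mask: entry [i][j] is true when token i may attend to token j.
fn chunked_causal_mask(seq_len: usize, chunk: usize) -> Vec<Vec<bool>> {
    (0..seq_len)
        .map(|i| {
            (0..seq_len)
                .map(|j| j <= i && i / chunk == j / chunk) // causal AND same chunk
                .collect()
        })
        .collect()
}
```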
* Fixes for Llama 4 UQFF loading (EricLBuehler#1275)
* Support sharding for UQFF (EricLBuehler#1276)
* Serialize sharded uqff files
* Loading
* Fix base64
* Fix bug for group-topk (group_limited_greedy) in deepseek models (EricLBuehler#1278)
* Support the DeepCoder model (EricLBuehler#1279)
* Add faq for metal not found
* updates from candle
* fixes
* relax tokio
* make AdapterPaths, LoraAdapterPaths public
---------
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: brrr <[email protected]>
Co-authored-by: Eric Buehler <[email protected]>
Co-authored-by: Etienne Balit <[email protected]>
Co-authored-by: benliao <[email protected]>
Co-authored-by: edwko <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Guoqing Bao <[email protected]>
File tree (319 files changed: +29,402 −10,676 lines)
- .github/workflows
- chat_templates
- docs
- examples
- python
- server
- mistralrs-bench
- src
- mistralrs-core
- src
- cublaslt
- cuda
- diffusion_models
- clip
- flux
- t5
- dummy_paged_attention
- embedding
- engine
- gguf
- lora
- models
- paged_attention
- pipeline
- loaders
- scheduler
- search
- tools
- utils
- vision_models
- gemma3
- idefics2
- idefics3
- llama4
- llava
- llava_llm
- minicpmo
- mistral3
- mllama
- phi3
- phi4
- qwen2_5_vl
- qwen2vl
- xlora_models
- mistralrs-paged-attn
- src
- cuda
- attention
- backend
- metal/kernels
- mistralrs-pyo3
- src
- mistralrs-quant
- kernels
- bitsandbytes
- blockwise_fp8
- gptq
- hqq
- marlin
- ops
- rotary
- src
- afq
- bitsandbytes
- blockwise_fp8
- cublaslt
- distributed
- dummy
- fp8
- gguf
- gptq
- hqq
- lora
- metal_kernels
- rotary
- unquantized
- utils
- mistralrs-server
- src
- mistralrs-vision/src
- mistralrs
- examples
- custom_logits_processor
- gemma3
- llama4
- llguidance
- lora_activation
- lora
- lower_level
- anymoe_lora
- anymoe
- batching
- custom_logits_processor
- gemma2
- gguf_locally
- grammar
- idefics2
- isq
- llava_next
- llava
- lora
- mixture_of_quant_experts
- paged_attn
- phi3_5_moe
- phi3v
- quantized
- qwen2_5vl
- simple
- topology
- xlora
- mistral3
- perplexity
- qwen2_5vl
- tools_llama_8b
- tools
- uqff_vision
- uqff
- web_search
- src
- orderings
- scripts