Merged
300 commits
6ec265d
Fix handling of Metal fused attn head dims (#1234)
EricLBuehler Mar 24, 2025
11bcd69
Tweak default for paged attn builder
EricLBuehler Mar 24, 2025
8069f90
Support paged attn for vision model rust api (#1235)
EricLBuehler Mar 24, 2025
5a8f36c
[Breaking] Support setting HF cache path (#1237)
EricLBuehler Mar 26, 2025
b903faf
Support tool calling for DeepSeek models (#1239)
EricLBuehler Mar 26, 2025
01ee538
Server image processing refactor and fixes (#1244)
EricLBuehler Mar 27, 2025
0442d5b
Optimized CUDA RoPE kernels (#1247)
EricLBuehler Mar 27, 2025
d7cb787
Typo fix (add_speial_tokens to add_special_tokens) (#1246)
edwko Mar 27, 2025
8b656a9
Fixes for UQFF + distributed layers (#1250)
EricLBuehler Mar 29, 2025
2c9da34
Automatic agentic search integration (`web_search_options`) (#1243)
EricLBuehler Mar 29, 2025
aecf0fb
Format kernels (#1251)
EricLBuehler Mar 29, 2025
de8c675
Update readme
EricLBuehler Mar 29, 2025
4207edb
Update readme
EricLBuehler Mar 29, 2025
e8ff90d
Remove test
EricLBuehler Mar 29, 2025
944ae5e
Add quantize guards for uqff deserialize (#1252)
EricLBuehler Mar 29, 2025
dc861ee
Refactor cuBLASlt-related code (#1253)
EricLBuehler Mar 30, 2025
b7e17f4
Add convert_to_gptq script
EricLBuehler Mar 30, 2025
efbef3a
Update deps, bump pyo3 version (#1259)
EricLBuehler Apr 3, 2025
4c8fbf2
Faster cuda FP8 performance (#1257)
EricLBuehler Apr 3, 2025
e43d602
Rust 1.86 clippy (#1260)
EricLBuehler Apr 3, 2025
ecf23b0
Refactor engine arch (#1262)
EricLBuehler Apr 4, 2025
b286f3e
Revamped LoRA support - removing the Ordering system! (#1263)
EricLBuehler Apr 4, 2025
07dafc0
Fast Metal-specific quantization method: AFQ (#1264)
EricLBuehler Apr 5, 2025
d0e45ce
Support prequantized models from MLX (#1265)
EricLBuehler Apr 5, 2025
fac37b3
Automatic ISQ to select fastest & most accurate method (#1266)
EricLBuehler Apr 5, 2025
0fd8e40
Improved usage metrics (#1267)
EricLBuehler Apr 5, 2025
7be249d
Fix cuda
EricLBuehler Apr 5, 2025
3953f9f
Bump tokio from 1.44.1 to 1.44.2 (#1270)
dependabot[bot] Apr 8, 2025
f3a73c3
Gather MM ops in mistralrs-quant (#1272)
EricLBuehler Apr 8, 2025
4317618
Improve performance of deepseek models
guoqingbao Apr 10, 2025
b267630
Typo fix
guoqingbao Apr 10, 2025
1897d2a
BincountOp not used
guoqingbao Apr 10, 2025
72c58a2
Implement Llama 4! (#1268)
EricLBuehler Apr 13, 2025
a1f1523
Remove superflous logging
EricLBuehler Apr 13, 2025
63387e2
Fixes for Llama 4 UQFF loading (#1275)
EricLBuehler Apr 13, 2025
fd2456c
Support sharding for UQFF (#1276)
EricLBuehler Apr 13, 2025
cece61c
Fix base64
EricLBuehler Apr 13, 2025
9f88ed3
Fix bug for group-topk (group_limited_greedy) in deepseek models (#1278)
guoqingbao Apr 14, 2025
5ed0e7e
Support the DeepCoder model (#1279)
EricLBuehler Apr 14, 2025
98ce4ff
Add faq for metal not found
EricLBuehler Apr 14, 2025
e11c264
Improved PagedAttn scheduling accuracy (#1282)
EricLBuehler Apr 16, 2025
864faaf
Fix cuda build for copy_blocks
EricLBuehler Apr 17, 2025
3cd1bb1
Fixes for scheduling image seqs with pagedattn (#1283)
EricLBuehler Apr 18, 2025
dd3cdfa
update to llguidance 0.7.16 (#1284)
mmoskal Apr 18, 2025
5a6fd03
Update dependencies (#1286)
EricLBuehler Apr 18, 2025
dfbb183
Much faster image inputs processing (#1289)
EricLBuehler Apr 20, 2025
630b65c
Add more SDPA head dims for much faster SigLIP (#1290)
EricLBuehler Apr 20, 2025
26afcc3
Show throughput in interactive mode (#1291)
EricLBuehler Apr 20, 2025
7882cba
Accurate prompt t/s for usage
EricLBuehler Apr 20, 2025
d7d1209
Unify bitwise operations (#1288)
EricLBuehler Apr 25, 2025
a4aa0b5
Multimodal prefix caching support! (#1298)
EricLBuehler Apr 28, 2025
010f6c8
Interactive mode improvements (#1299)
EricLBuehler Apr 28, 2025
e4a8222
Add the Qwen 3 and Qwen 3 MoE models! (#1285)
EricLBuehler Apr 29, 2025
c933fde
Fix dead link
EricLBuehler Apr 29, 2025
c40c680
Remove interactive mode max_len
EricLBuehler Apr 29, 2025
0a1d36e
Update QWEN3.md
EricLBuehler Apr 29, 2025
c3aa58e
Hotfix for vision mode clear
EricLBuehler Apr 29, 2025
d4617c3
Revamped and streaming web search support (#1301)
EricLBuehler Apr 29, 2025
2f12595
Handle vision messages or different tool call prefixes (#1302)
EricLBuehler Apr 29, 2025
2025fbe
Fix cuda
EricLBuehler Apr 30, 2025
d2e6c03
Tune web search budget
EricLBuehler Apr 30, 2025
6ce88b3
Simplify prefix cacher (#1305)
EricLBuehler Apr 30, 2025
15cc7a8
Use rustyline to handle non-ascii in interactive mode (#1306)
beeender May 1, 2025
a63da3c
Add more tools for automatic search (#1307)
EricLBuehler May 1, 2025
af6f04c
Fix CPU hogging in interactive mode (#1309)
beeender May 2, 2025
8f1f1ec
Add Metal precompilation support (#1311)
EricLBuehler May 3, 2025
7bff8bb
Reduce thrashing of Metal autorelease (#1313)
EricLBuehler May 3, 2025
b17455c
make `AdapterPaths` and `LoraAdapterPaths` public (#1314)
Slowki May 6, 2025
ca5794d
Refactor KV cache manager (#1315)
EricLBuehler May 6, 2025
0e650dc
Add `Audio` and `Speech` model categories (#1317)
Slowki May 7, 2025
7a794da
Remove has_conv2d from vision model API (#1318)
EricLBuehler May 7, 2025
99ea36c
Unified/automatic flash attention enabler (#1319)
EricLBuehler May 7, 2025
ebd50e3
Fix cublaslt 4d mask (#1320)
EricLBuehler May 8, 2025
11b2718
Keep caches on gpu
EricLBuehler May 8, 2025
e1672b7
Qwen VL models fixes (#1322)
EricLBuehler May 9, 2025
0b540ea
Fixes for all vision models (#1323)
EricLBuehler May 9, 2025
0521cd5
Improved+faster LRU prefix cacher (#1321)
EricLBuehler May 11, 2025
4243e84
Inplace ISQ support and default to mmap (#1277)
EricLBuehler May 13, 2025
a8ab6c8
Remove debug print
EricLBuehler May 13, 2025
f1bf0f8
Remove debug print
EricLBuehler May 13, 2025
34c50a4
Remove debug print
EricLBuehler May 13, 2025
e76a71c
Fix typos (#1329)
omahs May 13, 2025
9a228b8
Fix Idefics 3 arch chat templating (#1330)
EricLBuehler May 13, 2025
becbdd6
Remove two space from PR comment (#1331)
szepeviktor May 14, 2025
380da23
Add automatic vision loader type (#1332)
EricLBuehler May 14, 2025
cb2231c
Add the Dia 1.6b TTS model! (#1304)
EricLBuehler May 16, 2025
641f166
update `llguidance` to `0.7.20` (#1334)
Slowki May 16, 2025
dc87009
Add model category <> messages check (#1335)
EricLBuehler May 17, 2025
fa39bf3
Add element-wise normalization check (#1340)
EricLBuehler May 17, 2025
566e171
Fix streaming example print statement (#1339)
EricLBuehler May 17, 2025
4fea0f5
Fix normalization formula in comment (#1338)
EricLBuehler May 17, 2025
5d5b622
Fix image_to_pixels to handle non-RGB images (#1337)
EricLBuehler May 17, 2025
43eff96
Fix typo in expect messages (#1342)
EricLBuehler May 17, 2025
c116ce4
Don't use mmap on cuda (#1336)
EricLBuehler May 17, 2025
ec43205
Support AWQ format models (#1350)
guoqingbao May 19, 2025
a9a4c99
Fix uqff dummy layer ISQ application (#1351)
EricLBuehler May 19, 2025
c351ee6
Disable immediate isq if write_uqff (#1352)
EricLBuehler May 19, 2025
e23a25e
Fixes for UQFF loading on CUDA, ISQ pack factor (#1354)
EricLBuehler May 20, 2025
6c0b453
Refactor Option references for model paths (#1347)
EricLBuehler May 20, 2025
30c9bb8
Add a script for server benchmarking (#1355)
EricLBuehler May 21, 2025
64a0b75
Optimized Metal qmv_fast path (#1356)
EricLBuehler May 21, 2025
60579ef
Compile with lto
EricLBuehler May 21, 2025
c89aa3c
Tweak profiles
EricLBuehler May 21, 2025
a97de2b
New, fast sampler for Metal! (#1327)
EricLBuehler May 21, 2025
87a3eb6
Remove warning
EricLBuehler May 21, 2025
a63a8a3
Fix chat port
EricLBuehler May 21, 2025
504401f
Fix metal parallel sampling (#1357)
EricLBuehler May 22, 2025
cb939a8
Add immediate isq predicates for qwen3 (#1358)
EricLBuehler May 22, 2025
287c870
Fix gemma3 logging
EricLBuehler May 23, 2025
3681a9f
Regressions fixes (#1359)
EricLBuehler May 23, 2025
44e4535
Revamped and smaller readme (#1360)
EricLBuehler May 23, 2025
0e02779
Add a web chat app! (#1362)
EricLBuehler May 23, 2025
e5bedab
Add chat history support to web chat app (#1363)
EricLBuehler May 23, 2025
33e25d6
Refactor web chat, fix multichat image restore (#1364)
EricLBuehler May 24, 2025
8d0aca7
Fix repeated immediate isq init (#1365)
EricLBuehler May 24, 2025
2c9deae
Add gif
EricLBuehler May 24, 2025
7e36413
Tweak initial gif
EricLBuehler May 24, 2025
7f8ad73
Include vision tower tensors in Mistral3 UQFF (#1366)
EricLBuehler May 24, 2025
84eb046
Fix mistral 3 uqff resitdual tensors for vision
EricLBuehler May 24, 2025
49c512e
Rolling shard creation for uqff files (#1367)
EricLBuehler May 24, 2025
910bf12
Fix occasional unstability during isq of afq (#1368)
EricLBuehler May 24, 2025
85fe519
Fix web chat installation
EricLBuehler May 24, 2025
8194b41
Support web chat file uploading (#1370)
EricLBuehler May 25, 2025
8e7a30f
Add speech generation support to the web chat! (#1373)
EricLBuehler May 25, 2025
9387241
Prefix caching for PagedAttention! (#1369)
EricLBuehler May 26, 2025
58df07e
Metal PagedAttention accuracy improvements (#1374)
EricLBuehler May 26, 2025
9b30e5b
Format metal paged attention
EricLBuehler May 26, 2025
4571f4a
Handle images in paged attn scheduler (#1375)
EricLBuehler May 26, 2025
2b56c10
Include schemas needed for chatcompletions endpoint (#1353)
matthewhaynesonline May 26, 2025
0006787
Fix constraints with metal sampler
EricLBuehler May 26, 2025
30a08e3
Revert #1375
EricLBuehler May 26, 2025
45ccd26
Fix case where prefix cacher returns no toks (#1377)
EricLBuehler May 27, 2025
6c63203
Fix AFQ UQFF serialization
EricLBuehler May 27, 2025
50b805c
Faster UQFF serialization (#1379)
EricLBuehler May 27, 2025
bdb5e8b
Improve gemma3 auto loader names
EricLBuehler May 27, 2025
29a30cc
UQFF creation for AFQ on CPU support (#1380)
EricLBuehler May 27, 2025
6e811f2
Improved device for afq quantize
EricLBuehler May 27, 2025
9d72f7d
Improved dtype handling for cpu afq (de)quantize
EricLBuehler May 28, 2025
ec9ee69
Improved generate_uqff_card
EricLBuehler May 28, 2025
f690f9c
Add fused CPU attention kernel! (#1382)
EricLBuehler May 29, 2025
15648b9
Refactor attention backends (#1384)
EricLBuehler May 29, 2025
032a567
Set macOS thread affinity for CPU attn (#1385)
EricLBuehler May 29, 2025
0069792
Use lazylock
EricLBuehler May 29, 2025
ecbe897
Format
EricLBuehler May 29, 2025
68b9986
Fix metal warn build
EricLBuehler May 29, 2025
abb6185
Faster Qwen 3 MoE support on Metal (#1387)
EricLBuehler May 30, 2025
b95c8ec
Fix PagedAttention block leaks (#1388)
EricLBuehler May 30, 2025
722e46e
Fix double free in block engine
EricLBuehler May 30, 2025
4a23791
Do not apply ISQ if loading a prequantized model
EricLBuehler May 30, 2025
5cebfee
Fix cuda build again (#1389)
EricLBuehler May 30, 2025
7064b83
Update dockerfiles
EricLBuehler May 30, 2025
542a500
Bump version to 0.6.0 (#1390)
EricLBuehler May 30, 2025
698a943
Fix routing for static handler in web chat
EricLBuehler May 30, 2025
4f255b5
Fewer .contiguous calls for qwen3 moe (#1391)
EricLBuehler May 30, 2025
ddcaca1
Allow speech models to accept batched inputs (#1393)
EricLBuehler May 31, 2025
ed198b2
Ring distributed backend for heterogeneous TP (#1238)
EricLBuehler May 31, 2025
5fc5a15
Add deepseek tool calling chat template
EricLBuehler Jun 1, 2025
b4a0a2f
Add auto loader for vision/text detection! (#1402)
EricLBuehler Jun 2, 2025
9478d2c
Create Mistral.rs Server Core Lib: `mistralrs-server-core` (#1346)
matthewhaynesonline Jun 3, 2025
31b0e8b
Support linear rope for llama3 (#1408)
EricLBuehler Jun 3, 2025
4d60531
Hotfix for loading
EricLBuehler Jun 3, 2025
190ad20
Fix vllama4 uqff loading (#1409)
EricLBuehler Jun 3, 2025
b0af3ad
Gracefully handle receiver disconnects (#1410)
EricLBuehler Jun 3, 2025
7c819c8
Fix Qwen3 MoE device mapping irregularities (#1411)
EricLBuehler Jun 3, 2025
3281d67
Fix interactive mode URL parsing (#1412)
EricLBuehler Jun 3, 2025
1037ac3
Refactor auto device map (#1413)
EricLBuehler Jun 3, 2025
f93bec1
Enable runtime sampling tweaks in interactive mode (#1414)
EricLBuehler Jun 3, 2025
32126d3
Send streaming tokens every time
EricLBuehler Jun 3, 2025
26fc5c9
Gumbel sampling for fast sampler (#1416)
EricLBuehler Jun 3, 2025
201d6be
Improved handling for initialize_logging
EricLBuehler Jun 3, 2025
d92bfca
Improved CPU flash attention accuracy & performance (#1417)
EricLBuehler Jun 3, 2025
4e156ad
Provide chat_templates to container users (#1419)
sempervictus Jun 3, 2025
6547156
Faster cpu flash attn (#1418)
EricLBuehler Jun 3, 2025
57d6e12
Web search improvements (bm25, web chat) (#1420)
EricLBuehler Jun 3, 2025
ecb6907
Propely handle consecutive searches (#1421)
EricLBuehler Jun 4, 2025
d6c227e
Update docs (#1422)
matthewhaynesonline Jun 4, 2025
c53d346
Better tool call detection logic (#1424)
EricLBuehler Jun 4, 2025
39673eb
Add web search hook callbacks (#1426)
EricLBuehler Jun 4, 2025
9989719
Fix CUDA context switching, bind thread on CudaStorage drop (#1428)
EricLBuehler Jun 4, 2025
8d13759
conditionally build flash attention inputs (#1429)
EricLBuehler Jun 4, 2025
328dea1
Add AGENTS.md (#1430)
EricLBuehler Jun 4, 2025
8612b92
Support Qwen3 GGUF model (#1432)
guoqingbao Jun 5, 2025
d37db05
Improved paged attn prefix caching (#1434)
EricLBuehler Jun 5, 2025
77436b8
Clippy
EricLBuehler Jun 5, 2025
cfd1e89
Temporary fix for qwen3 gguf tokenizer (#1433)
guoqingbao Jun 5, 2025
d345011
Add tool callback support (#1427)
EricLBuehler Jun 6, 2025
3d1b29b
Centralize crate dependencies (#1438)
EricLBuehler Jun 6, 2025
15b7228
Fix bug in tokenizer created with gguf metadata (#1440)
guoqingbao Jun 6, 2025
072cda3
Update deps (#1441)
EricLBuehler Jun 6, 2025
c21c505
Doc fixes (#1442)
EricLBuehler Jun 6, 2025
9597fd4
Mention uqff_maker
EricLBuehler Jun 6, 2025
2532852
Downgrade rustyline 16.0.0 -> 15.0.0 (#1444)
EricLBuehler Jun 6, 2025
2cb0a3e
Add max_completion_tokens alias for server (#1451)
EricLBuehler Jun 8, 2025
fff2665
Audio input support (Phi 4 multimodal) (#1448)
EricLBuehler Jun 9, 2025
ea1d5d6
Fix offline cache issue for gguf models (#1452)
guoqingbao Jun 9, 2025
988280f
Add MCP server endpoints (#1453)
EricLBuehler Jun 10, 2025
8c05fd0
Tweak temperature bounds, args
EricLBuehler Jun 10, 2025
feefd40
MCP documentation pass (#1455)
EricLBuehler Jun 10, 2025
a7bb740
Improve readme header
EricLBuehler Jun 10, 2025
d7384b8
Improve readme header
EricLBuehler Jun 10, 2025
f1ad6ae
Integrate an MCP client (#1456)
EricLBuehler Jun 10, 2025
8a53c71
Update generate_wheels
EricLBuehler Jun 10, 2025
9cabf9f
Update generate_wheels
EricLBuehler Jun 10, 2025
3410183
Update generate_wheels
EricLBuehler Jun 10, 2025
a5c4eda
Fix Dockerfile.cuda-all
EricLBuehler Jun 10, 2025
39e0ff5
Improve automatic tool call (#1460)
EricLBuehler Jun 11, 2025
30859cd
chore: `Dockerfile.cuda-all` configurable threads (#1458)
polarathene Jun 11, 2025
fa647ab
chore: `Dockerfile.cuda-all` - Merge `RUN` for `apt-get install` (#1459)
polarathene Jun 11, 2025
e5f0f04
Add fallback definition for isnan (#1463)
EricLBuehler Jun 11, 2025
f3b1afa
chore: `Dockerfile` - Drop runtime rayon thread ENV (#1465)
polarathene Jun 12, 2025
37a55f9
Remove duplicate calls for api_dir_list (#1474)
guoqingbao Jun 18, 2025
d5e80a8
Fix transient pyo3 dep (#1478)
EricLBuehler Jun 18, 2025
e2c0822
Fix objc dep with non macos (#1480)
EricLBuehler Jun 18, 2025
4608202
Fix phi 3/4 + nccl issue (#1481)
EricLBuehler Jun 18, 2025
badbe10
Fix phi3.5 moe (#1482)
EricLBuehler Jun 19, 2025
210061f
Support GLM4 model! (#1437)
guoqingbao Jun 19, 2025
408c888
Refactor distributed backend (#1484)
EricLBuehler Jun 19, 2025
c8ab1f1
Cap metal paged attn kv allocation (#1485)
EricLBuehler Jun 19, 2025
0feb38c
Better paged attn metal cap (#1486)
EricLBuehler Jun 19, 2025
f13db3b
Server core: consolidate and unify route handlers and API surface (#1…
matthewhaynesonline Jun 19, 2025
1901d2d
Support qwen3 gguf (#1488)
EricLBuehler Jun 19, 2025
bc5581a
Make bos/eos token IDs optional (#1493)
EricLBuehler Jun 19, 2025
d7577dd
Remove python deps from CUDA dockerfiles (#1487)
EricLBuehler Jun 20, 2025
2aa89c3
Handle noncontiguous v in naive_sdpa (#1499)
EricLBuehler Jun 21, 2025
f38567a
Server Core: refactor Paged Attention configuration (#1500)
matthewhaynesonline Jun 21, 2025
a1220c6
Use StorageModePrivate for Metal PA kv cache (#1506)
EricLBuehler Jun 22, 2025
c9d0a0e
Fix OpenAI stream: emit field in tool-call deltas for schema complian…
Sbargaoui Jun 23, 2025
1ad6488
FP8 KV-cache quantization for PagedAttention (#1400)
EricLBuehler Jun 23, 2025
d8bbbe9
Validate model name in OpenAI API (#1509)
EricLBuehler Jun 23, 2025
aa4b218
Updated examples for #1509
EricLBuehler Jun 23, 2025
c25d1db
Fix mcp import in doc string (#1510)
GaetanLepage Jun 24, 2025
272f7ee
Add multi-model support! (#1512)
EricLBuehler Jun 24, 2025
620117e
Add stars label to readme (#1513)
EricLBuehler Jun 25, 2025
4ae689b
Add CLAUDE.md
EricLBuehler Jun 25, 2025
e46bf86
Handle base_model.model case in lora (#1514)
EricLBuehler Jun 25, 2025
9e33c8f
Add thread_local! for engine-specific const/static (#1517)
EricLBuehler Jun 25, 2025
5fbf607
Fix MCP doc test (#1511)
GaetanLepage Jun 25, 2025
359f99c
Allow disabling metal precompilation (#1518)
EricLBuehler Jun 26, 2025
30d1cce
Rust 1.88 clippy (#1522)
EricLBuehler Jun 26, 2025
ea3f517
Fix cuda warnings (#1526)
EricLBuehler Jun 28, 2025
d025ebd
Avoid panic decoding tokens on error (#1527)
EricLBuehler Jun 29, 2025
d38a7e1
Split Marlin and Paged Attention kernels for faster build (#1525)
guoqingbao Jul 2, 2025
8a4faf3
chore: update llguidance (#1535)
ammar-elsabe Jul 4, 2025
85dcfbe
Add the SmolLM3 model! (#1501)
EricLBuehler Jul 8, 2025
70c7f86
Add full Gemma 3n support! (#1519)
EricLBuehler Jul 9, 2025
60a530c
Fix sequence length check (#1546)
EricLBuehler Jul 10, 2025
d256806
update candle version (#1545)
AlpineVibrations Jul 10, 2025
103eb3f
add ios target to metal deps (#1548)
rubiktubik Jul 10, 2025
5915c4c
Merge branch 'master' into jeadie/25-07-08/updaste
Jeadie Jul 11, 2025
a940796
Merge remote-tracking branch 'origin/master' into jeadie/25-07-08/upd…
Jeadie Jul 11, 2025
17 changes: 11 additions & 6 deletions .cargo/config.toml
Original file line number Diff line number Diff line change
@@ -1,12 +1,17 @@
[target.x86_64-unknown-linux-gnu]
[build]
rustflags = ["-C", "target-cpu=native"]

[target.aarch64-apple-darwin]
[build]
rustflags = ["-C", "target-cpu=native"]
rustflags = [
"-C", "target-cpu=native",
"-C", "target-feature=+aes,+sha2,+fp16",
]

[target.x86_64-apple-darwin]
rustflags = [
"-C", "target-cpu=native",
"-C", "target-feature=-avx,-avx2",
]

[target.wasm32-unknown-unknown]
rustflags = ["-C", "target-feature=+simd128"]

[target.x86_64-apple-darwin]
rustflags = ["-C", "target-feature=-avx,-avx2"]
4 changes: 1 addition & 3 deletions .github/workflows/analysis.yaml
@@ -30,9 +30,7 @@ jobs:
const codeReport = `
<details>
<summary>${uniqueIdentifier}</summary>
<pre>
${tokeiOutput}
</pre>
<pre>${tokeiOutput}</pre>
</details>
`;

18 changes: 8 additions & 10 deletions .github/workflows/build_cuda_all.yaml
@@ -1,6 +1,6 @@
name: deploy_cuda_docker
name: build_cuda_all

# gh workflow run deploy_cuda_docker
# gh workflow run build_cuda_all
# This also runs on release deploy
on:
workflow_dispatch:
@@ -16,7 +16,7 @@ jobs:
matrix:
compute_capability: [75, 80, 86, 89, 90]
fail-fast: false
runs-on: [ubuntu-latest]
runs-on: ubuntu-latest

permissions:
contents: write
@@ -59,14 +59,14 @@ jobs:
uses: docker/metadata-action@v5
with:
images: |
ghcr.io/${{ github.repository_owner }}/$(basename ${{ github.repository }})
ghcr.io/${{ github.repository }}
flavor: |
latest=false
tags: |
type=semver,pattern=cuda-${{matrix.compute_capability}}-{{version}}
type=semver,pattern=cuda-${{matrix.compute_capability}}-{{major}}.{{minor}}
type=raw,value=cuda-${{matrix.compute_capability}}-sha-${{ steps.slug.outputs.short_sha }}
type=raw,value=cuda-${{matrix.compute_capability}}-sha-${{ github.sha }}
type=semver,pattern=cuda-${{ matrix.compute_capability }}-{{ version }}
type=semver,pattern=cuda-${{ matrix.compute_capability }}-{{ major }}.{{ minor }}
type=raw,value=cuda-${{ matrix.compute_capability }}-sha-${{ steps.slug.outputs.short_sha }}
type=raw,value=cuda-${{ matrix.compute_capability }}-sha-${{ github.sha }}
- name: Build and push Docker image
id: build-and-push-cuda
uses: docker/build-push-action@v6
@@ -80,5 +80,3 @@ jobs:
build-args: |
CUDA_COMPUTE_CAP=${{matrix.compute_capability}}
cache-from: type=local,src=/tmp/.buildx-cache


2 changes: 2 additions & 0 deletions .gitignore
@@ -4,3 +4,5 @@
*.a
.DS_Store
.idea
mistral.rs/
mistralrs-web-chat/cache
10 changes: 8 additions & 2 deletions .typos.toml
@@ -9,12 +9,18 @@ extend-ignore-identifiers-re = [
"_thw",
"thr",
"nd",
"uneeded"
"uneeded",
"tese",
"seperable",
"Seperable",
"setp",
"cna",
]

[files]
extend-exclude = [
"mistralrs-pyo3/pdoc/*",
"examples/server/phi3_duckduckgo_mistral.rs.ipynb",
"calibration_data/*"
"calibration_data/*",
"mistralrs-web-chat/static*"
]
147 changes: 147 additions & 0 deletions AGENTS.md
@@ -0,0 +1,147 @@
<!-- AGENTS.md: Guidance for AI agents to navigate, build, test, and contribute to this repository -->
# AGENTS

This file provides instructions for AI agents to understand the layout of the `mistral.rs` repository, run builds/tests, and follow project conventions.

## Repository Structure

- `/mistralrs/` : Main Rust crate (text & multimodal inference API)
- `/mistralrs-core/` : Core inference logic and tensor operations (text models)
- `/mistralrs-vision/` : Vision inference support (image-based inputs & vision-enabled models)
- `/mistralrs-quant/` : Quantization support (ISQ, GGUF, GPTQ, AWQ, FP8, HQQ, etc.)
- `/mistralrs-paged-attn/`: PagedAttention implementation
- `/mistralrs-pyo3/` : Python bindings (PyO3)
- `/mistralrs-server/` : CLI & OpenAI-compatible HTTP server (subcommands: run/vision-plain, diffusion, speech)
- `/mistralrs-server-core/`: Shared server core logic
- `/mistralrs-web-chat/` : Web chat application (static assets & backend integration)
- `/mistralrs-bench/` : Benchmarking tools
- `/docs/` : Markdown documentation for models, features, and guides
- `/examples/` : Usage examples (Rust, Python, server samples, notebooks)
- `/chat_templates/` : Chat formatting templates (JSON/Jinja)
- `/scripts/` : Utility scripts (e.g., AWQ conversion)

## Feature Organization

Mistral.rs supports multiple model types and advanced features via dedicated crates and CLI subcommands:

- **Text Inference**
- Crate: `mistralrs-core` (low-level ops), `mistralrs` (API wrapper)
- CLI: `run` / `plain` subcommand in `mistralrs-server`
- Docs: `docs/SAMPLING.md`, `docs/TOOL_CALLING.md`
- **Vision Models**
- Crate: `mistralrs-vision`
- CLI: `vision-plain` subcommand
- Docs: `docs/VISION_MODELS.md`, `docs/IMAGEGEN_MODELS.md`, `docs/IMATRIX.md`
- **Diffusion Models**
- CLI: `diffusion` subcommand
- Docs: `docs/FLUX.md`
- **Speech Models**
- CLI: `speech` subcommand
- Docs: `docs/DIA.md`
- **Quantization & ISQ**
- Crate: `mistralrs-quant`
- Docs: `docs/QUANTS.md`, `docs/ISQ.md`
- Conversion Script: `scripts/convert_awq_marlin.py`
- **Paged Attention**
- Crate: `mistralrs-paged-attn`
- Docs: `docs/PAGED_ATTENTION.md`
- **Adapters & LoRA/X-LoRA**
- Docs: `docs/ADAPTER_MODELS.md`, `docs/LORA_XLORA.md`
- **Mixture of Experts (AnyMoE)**
- Docs: `docs/ANYMOE.md`

## Building

1. Install Rust via rustup (Rust 2021 edition).
2. Choose optional features (e.g., `cuda`, `flash-attn`, `cudnn`, `metal`, `mkl`, `accelerate`).
3. Build the entire workspace:
```bash
cargo build --workspace --release --features "<features>"
```
4. Or build/install only the server binary:
```bash
cargo build --release --package mistralrs-server --features "<features>"
cargo install --path mistralrs-server --features "<features>"
```

## Models

When integrating a new model, make sure it respects all of the `VarBuilder` `.pp` calls. In Candle, a `VarBuilder` maintains an internal path vector that acts like a "current working directory" for model weights; every call to `pp("sub")` (an alias for `push_prefix`) clones the builder and appends `sub`, so successive calls accumulate a dotted prefix such as `transformer.h.0` while leaving the original builder untouched. When you eventually call `get(...)`, Candle joins that prefix with the tensor name (`prefix + "." + name`) and looks it up in the checkpoint backend, producing keys that exactly match the dot-separated names emitted by PyTorch's `state_dict`/`named_parameters`, which means PyTorch-trained weights can be loaded without any renaming. This lets you recreate the PyTorch module tree in Rust by "walking" it: e.g. `vb.pp("word_embeddings")` grabs `word_embeddings.*`, while a chain like `vb.pp("encoder").pp("layers").pp(i.to_string())` targets keys such as `encoder.layers.0.*`, exactly as shown in community tutorials porting Transformers models to Candle. As one maintainer put it, the prefix system lets you "cd" around the parameter hierarchy, giving a lightweight namespace mechanism that keeps Candle fully compatible with PyTorch naming conventions while remaining ergonomic to use.

You should also look for a model.safetensors.index.json file for the model at hand to verify correct structure.
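The prefix mechanism described above can be sketched in a few lines. This is a minimal stand-in, not the real `candle-nn` API: `PrefixPath` here plays the role of `VarBuilder`'s internal path vector, and `key` shows the dotted lookup key that `get` would construct.

```rust
// Minimal sketch (NOT the real candle-nn API) of how VarBuilder's `pp`
// prefixing works: each call clones the path and appends one segment,
// and a lookup joins the accumulated prefix with the tensor name.
#[derive(Clone)]
struct PrefixPath {
    path: Vec<String>,
}

impl PrefixPath {
    fn root() -> Self {
        Self { path: Vec::new() }
    }

    /// Like `VarBuilder::pp`: returns a new builder with `sub` appended,
    /// leaving `self` untouched.
    fn pp(&self, sub: impl ToString) -> Self {
        let mut path = self.path.clone();
        path.push(sub.to_string());
        Self { path }
    }

    /// The dotted key a `get(name)` lookup would use against the checkpoint.
    fn key(&self, name: &str) -> String {
        if self.path.is_empty() {
            name.to_string()
        } else {
            format!("{}.{}", self.path.join("."), name)
        }
    }
}

fn main() {
    let vb = PrefixPath::root();
    // Walking the module tree mirrors PyTorch's state_dict naming:
    let layer = vb.pp("encoder").pp("layers").pp(0);
    println!("{}", layer.key("weight")); // encoder.layers.0.weight
    // The original builder is untouched by the chained `pp` calls:
    println!("{}", vb.pp("word_embeddings").key("weight")); // word_embeddings.weight
}
```

Because `pp` clones rather than mutates, a model constructor can hand sub-builders to each submodule without worrying about shared state.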

## Testing

- Core test suite (requires HF token for some tests):
```bash
export HF_TOKEN=<your_token> # or TESTS_HF_TOKEN for CI parity
cargo test -p mistralrs-core -p mistralrs-quant -p mistralrs-vision
```
- Run all tests across workspace (may skip some crates without tests):
```bash
cargo test --workspace
```

You should *always* run `cargo check` (alias `cargo c`) before returning, to make sure the code compiles. If it does not compile, keep making edits until it does.

Avoid returning TODOs.

## Formatting & Linting

- Format all Rust code:
```bash
cargo fmt --all
make fmt # also formats Python/CUDA/C++ files via ruff, clang-format
```
- Lint with Clippy:
```bash
cargo clippy --workspace --tests --examples -- -D warnings
```

## Documentation

- Generate Rust docs for all crates:
```bash
cargo doc --workspace
```
- Preview at `target/doc/` or publish to GitHub Pages as configured.
- Refer to `/docs/` for in-depth markdown guides (e.g., DEVICE_MAPPING.md, TOOL_CALLING.md).

## Examples

- Rust examples: `mistralrs/examples/`
- Python examples: `examples/python/`
- Server samples: `examples/server/`
- Run Python scripts:
```bash
python3 examples/python/<script>.py
```
- Run server/CLI:
```bash
./target/release/mistralrs-server -i <mode> -m <model> [options]
```

## CI Parity

The CI pipeline is defined in `.github/workflows/ci.yml` and includes:
- `cargo check` for all targets
- `cargo test` on core crates
- `cargo fmt -- --check`
- `cargo clippy -D warnings`
- `cargo doc`
- Typos check (`crate-ci/typos`)

## Contribution Conventions

- Follow Rust 2021 idioms, keep code minimal and focused.
- Update `/docs/` and examples when adding features or breaking changes.
- Add tests and examples for new functionality.
- Commit messages should be clear and follow conventional style where possible.
```
feat(crate): describe new feature
fix(crate): describe bug fix
docs: update docs for ...
```

---
*This AGENTS.md file is intended solely to improve AI-driven assistance and does not affect runtime behavior.*
122 changes: 122 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,122 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

mistral.rs is a blazing-fast LLM inference engine written in Rust. It supports text, vision, image generation, and speech models with multiple APIs (Rust, Python, OpenAI HTTP, MCP).

## Essential Commands

### Building
```bash
# Basic release build
cargo build --release

# With CUDA support (Linux)
cargo build --release --features "cuda flash-attn cudnn"

# With Metal support (macOS)
cargo build --release --features metal

# Install server binary
cargo install --path mistralrs-server --features <features>
```

### Testing & Quality
```bash
# Run core tests
cargo test -p mistralrs-core -p mistralrs-quant -p mistralrs-vision

# Format code (uses rustfmt, ruff, clang-format)
make fmt

# Check formatting
cargo fmt --all -- --check

# Run clippy
cargo clippy --workspace --tests --examples -- -D warnings
```

### Running Models
```bash
# Run interactive mode with plain model
cargo run --release --features <features> -- -i plain -m <model_id> -a <arch>

# Run with GGUF quantized model
cargo run --release --features <features> -- -i gguf -f <file> -t <tokenizer>

# Run server
cargo run --release --features <features> -- --port 1234 <model_args>
```

## Models

When integrating a new model, make sure it respects all of the `VarBuilder` `.pp` calls. In Candle, a `VarBuilder` maintains an internal path vector that acts like a "current working directory" for model weights; every call to `pp("sub")` (an alias for `push_prefix`) clones the builder and appends `sub`, so successive calls accumulate a dotted prefix such as `transformer.h.0` while leaving the original builder untouched. When you eventually call `get(...)`, Candle joins that prefix with the tensor name (`prefix + "." + name`) and looks it up in the checkpoint backend, producing keys that exactly match the dot-separated names emitted by PyTorch's `state_dict`/`named_parameters`, which means PyTorch-trained weights can be loaded without any renaming. This lets you recreate the PyTorch module tree in Rust by "walking" it: e.g. `vb.pp("word_embeddings")` grabs `word_embeddings.*`, while a chain like `vb.pp("encoder").pp("layers").pp(i.to_string())` targets keys such as `encoder.layers.0.*`, exactly as shown in community tutorials porting Transformers models to Candle. As one maintainer put it, the prefix system lets you "cd" around the parameter hierarchy, giving a lightweight namespace mechanism that keeps Candle fully compatible with PyTorch naming conventions while remaining ergonomic to use.

You should also look for a model.safetensors.index.json file for the model at hand to verify correct structure.

## Architecture Overview

### Workspace Structure
- `mistralrs-core/` - Core inference engine, model implementations, pipelines
- `mistralrs-server/` - CLI binary entry point
- `mistralrs-server-core/` - HTTP server routing, OpenAI API implementation
- `mistralrs-pyo3/` - Python bindings (PyO3)
- `mistralrs/` - High-level Rust API
- `mistralrs-vision/` - Vision model support
- `mistralrs-quant/` - Quantization implementations (ISQ, GGUF, GPTQ, etc.)
- `mistralrs-paged-attn/` - PagedAttention implementation
- `mistralrs-audio/` - Audio processing
- `mistralrs-mcp/` - Model Context Protocol client
- `mistralrs-bench/` - Benchmarking tools

### Key Design Patterns

1. **Pipeline Architecture**: All models implement the `Pipeline` trait in `mistralrs-core/src/pipeline/mod.rs`. Different model types (Plain, GGUF, GGML, Vision) have their own pipeline implementations.

2. **Model Loading**: Models are loaded through `Loader` traits that handle different formats and quantizations. See `mistralrs-core/src/loader.rs`.

3. **Request Handling**: The server uses message passing with `MistralRs` struct managing a background thread pool. Requests flow through `mistralrs-core/src/engine/mod.rs`.

4. **Device Management**: Automatic and manual device mapping for multi-GPU setups handled in `mistralrs-core/src/device_map.rs`.
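The request-handling pattern above (a background engine thread fed by a channel, with a per-request reply channel) can be sketched with `std::sync::mpsc`. The `Request` type and the "inference" body here are hypothetical placeholders; the real types live in `mistralrs-core/src/engine/mod.rs` and are considerably richer.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical request type: a prompt plus a channel to send the result on.
struct Request {
    prompt: String,
    respond_to: mpsc::Sender<String>,
}

fn main() {
    let (tx, rx) = mpsc::channel::<Request>();

    // Background "engine" thread: drain requests, run (fake) inference,
    // and reply on each request's own channel.
    let engine = thread::spawn(move || {
        for req in rx {
            let output = format!("completion for: {}", req.prompt);
            let _ = req.respond_to.send(output);
        }
    });

    // Caller side: send a request and block on the reply.
    let (resp_tx, resp_rx) = mpsc::channel();
    tx.send(Request {
        prompt: "hello".into(),
        respond_to: resp_tx,
    })
    .unwrap();
    println!("{}", resp_rx.recv().unwrap());

    drop(tx); // closing the last sender ends the engine's receive loop
    engine.join().unwrap();
}
```

The per-request reply channel is what lets many callers share one engine without the engine knowing who they are; dropping the request sender is the natural shutdown signal.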

### Adding New Features

When adding new model architectures:
1. Implement the model in `mistralrs-core/src/models/`
2. Add pipeline support in `mistralrs-core/src/pipeline/`
3. Update model detection in `mistralrs-core/src/pipeline/normal.rs`
4. Add architecture enum variant in `mistralrs-core/src/lib.rs`
5. Update CLI args in `mistralrs-server/src/main.rs`

When adding new quantization methods:
1. Implement in `mistralrs-quant/src/`
2. Add to quantization loading logic in pipelines
3. Update documentation in `docs/QUANTIZATION.md`

### Important Files to Know

- `mistralrs-core/src/engine/mod.rs` - Main engine orchestration
- `mistralrs-core/src/pipeline/mod.rs` - Pipeline trait and common logic
- `mistralrs-server-core/src/routes.rs` - HTTP API endpoints
- `mistralrs-pyo3/src/lib.rs` - Python API entry point
- `mistralrs/examples/` - Usage examples for Rust API

### Testing Approach

You should *always* run `cargo check` (alias `cargo c`) before returning, to make sure the code compiles. If it does not compile, keep making edits until it does.

Avoid returning TODOs.

- Unit tests are colocated with source files
- Integration tests in `tests/` directories
- Use `cargo test -p <crate>` to test specific components
- Python tests require building and installing the package first

### Common Pitfalls

1. **Feature Flags**: Many features are gated behind Cargo features. Always check what features are needed for your use case.
2. **Device Indices**: CUDA device selection uses 0-based indexing
3. **Chat Templates**: Models may need specific chat templates - check `chat_templates/` directory
4. **Quantization**: Different quantization methods have different hardware requirements