Releases · vllm-project/vllm-spyre
v1.4.0 - Initial Prefix Caching Support
This Release:
- Adds an experimental implementation of prefix caching (see the sketch after this list)
- Adds support for vllm v0.11.1 and updates the default locked version
- Upgrades the AFTU dev dependency lock to v0.5.0 to support chunked prefill tests on spyre hardware
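As a hedged illustration of trying the experimental feature, here is a minimal sketch assuming the plugin honors vLLM's standard `enable_prefix_caching` engine argument (unverified for vllm-spyre; the model name is a placeholder):

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching is vLLM's standard engine flag; whether the Spyre
# plugin wires it up exactly this way is an assumption.
llm = LLM(
    model="ibm-granite/granite-3.1-8b-instruct",  # placeholder model
    enable_prefix_caching=True,
)

# Requests that share a long common prefix can reuse cached KV blocks,
# skipping recomputation of the shared prompt portion.
shared = "You are a helpful assistant. " * 50
outputs = llm.generate(
    [shared + "Question one?", shared + "Question two?"],
    SamplingParams(max_tokens=16),
)
```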
What's Changed
- ✨ vllm support for 0.11.1 release by @joerunde in #546
- 🔥 remove prints by @joerunde in #594
- ⬆️ aftu to v0.5.0 by @tjohnson31415 in #595
- 🔥 Limit CI usage by @joerunde in #596
- 🐛 Fix main CI by @joerunde in #598
- ✨ Add prefix caching by @maxdebayser in #586
- 📝 add doc section on VLLM_SPYRE_REQUIRE_PRECOMPILED_DECODERS by @tjohnson31415 in #593
Full Changelog: v1.3.0...v1.4.0
v1.3.0
This release adds support for chunked prefill for non-quantized models, which can be enabled with `VLLM_SPYRE_USE_CHUNKED_PREFILL=1 VLLM_SPYRE_USE_CB=1` (see the sketch below).
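A minimal offline sketch of launching with chunked prefill enabled, using the env vars above; the model name is a placeholder and the offline `vllm.LLM` entrypoint is assumed (an OpenAI-compatible server works the same way, with the env vars prefixed to the command):

```python
import os

# Enable continuous batching plus chunked prefill on the Spyre backend,
# per the release notes above. Must be set before vLLM is imported.
os.environ["VLLM_SPYRE_USE_CB"] = "1"
os.environ["VLLM_SPYRE_USE_CHUNKED_PREFILL"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="ibm-granite/granite-3.1-8b-instruct")  # placeholder model
outputs = llm.generate(["Hello, Spyre!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```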
What's Changed
- feat: chunked prefill spyre model runner by @wallashss in #552
- [tests] cleanup: remove temporary hack by @yannicks1 in #555
- ChunkedPrefillSpyreScheduler: No Interleaving by @sducouedic in #554
- fix: left padding of prompts less than chunk size by @wallashss in #557
- feat: left padding from model runner to scheduler by @wallashss in #559
- [CP] rewrite scheduler constraints for chunked prefill (🐛 fix) by @yannicks1 in #560
- [CB] remove decode/prefill prioritization heuristic by @yannicks1 in #561
- Bugfix: padding block cannot be reused with chunked prefill by @sducouedic in #563
- [CP] scheduler constraints typo by @yannicks1 in #565
- test: add maybe_xfail for quantized micro static batch logprobs checks by @tjohnson31415 in #566
- [CP] fix empty model runner output by @yannicks1 in #570
- [CB] remove env var VLLM_SPYRE_ENABLE_PREFILL_OPTIMIZATION by @yannicks1 in #562
- [CP] optimal chunked prefill scheduler constraints by @yannicks1 in #564
- [CP] Simplify code by @maxdebayser in #572
- [CB] tighten constraint max model length decode sequences by @yannicks1 in #573
- Set default chunk size to 4k for granite 3 8b TP4 by @tjohnson31415 in #571
- Interleave chunked prefills with single decoding steps by @sducouedic in #558
- feat/fix: add finish_requests to handle removal from ongoing_prefills by @tjohnson31415 in #577
- fix: check only decoding requests in _satisfies_last_chunk_constraints by @tjohnson31415 in #576
- tests: include chunked prefill on existing tests by @wallashss in #574
- Add step tests for chunked prefill by @maxdebayser in #575
- docs: chunked prefill updated documentation by @wallashss in #578
- [Docs] Prep and publish GH Pages doc by @rafvasq in #579
- [Docs] Update GH artifact versions by @rafvasq in #581
- [Docs] Add workflow files to docs action triggers by @rafvasq in #582
- [Docs] Avoid multiple artifacts by @rafvasq in #583
- fix test_compare_graphs_chunked_prefill by @tjohnson31415 in #580
- [Docs] Use mkdocs gh-deploy by @rafvasq in #584
- [Docs] Update links to documentation by @rafvasq in #587
- [PC] Refactor CB model runner to use vLLMs block pool by @maxdebayser in #585
- [Docs] Add note about move by @rafvasq in #589
- test: a few test configuration updates to have chunked prefill tests pass on Spyre by @tjohnson31415 in #588
- Update default granite 8b chunk size to 1024 by @tjohnson31415 in #592
- fix time logging and other small things by @yannicks1 in #590
Full Changelog: v1.2.3...v1.3.0
v1.2.3
Includes a change required to support Torch >= 2.8
What's Changed
- fix: use sampling during warmup and disable backed_size_oblivious after model compilation by @tjohnson31415 in #551
Full Changelog: v1.2.2...v1.2.3
v1.2.2
v1.2.1
v1.2.0
- ✨ Adds custom GoldenTokenInjector LogitsProcessor for evaluating model quality (see the sketch after this list)
- ✨ Initial Granite 4 model support
- 🐛 Fixes a bug where `min_tokens` was not behaving properly (it forced longer sequences than desired)
- 🐛 Fixes a bug in handling `top_k` that could crash the server
- 📝 Adds a `runtime_config_validator` to check for and warn about unsupported model configurations that may not work
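For context on what golden-token injection does, here is a minimal, illustrative sketch written as a classic `(token_ids, logits)` logits-processor callable; the actual GoldenTokenInjector in vllm-spyre may be structured differently, and all names here are hypothetical:

```python
import torch

def make_golden_token_injector(golden_ids: list[int]):
    """Hypothetical sketch: force each generation step to a predetermined
    "golden" token so output quality can be evaluated deterministically."""
    def inject(token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
        step = len(token_ids)  # tokens generated so far for this request
        if step < len(golden_ids):
            # Mask every logit except the golden token for this step.
            mask = torch.full_like(logits, float("-inf"))
            mask[golden_ids[step]] = 0.0
            logits = logits + mask
        return logits
    return inject
```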
What's Changed
- update and expand online example to continuous batching by @yannicks1 in #517
- refact: removed unnecessary logits processor by @wallashss in #520
- test: update tests to use golden token injection by @wallashss in #510
- Fix test model revision usage by @prashantgupta24 in #522
- [ppc64le] Update ppc64le dependencies by @Daniel-Schenker in #524
- [CI] Enable model revisions in GHA test by @ckadner in #523
- Manage supported model configurations by @ckadner in #445
- 📜 Add documentation and diagrams on the plugin architecture by @maxdebayser in #530
- ♻️ Simplify env var overrides and add tests by @joerunde in #525
- [Docs] Add arch doc to dev guide view by @rafvasq in #534
- Granite4 2b & 3b support by @yannicks1 in #496
- 🐛 Fix fp8 model name check with quantization check by @gkumbhat in #535
- 📝 add supported torch versions by @joerunde in #528
- add e5-multilingual to known configurations by @maxdebayser in #533
- fix: logits processor state at each step by @wallashss in #544
- fix crashes with the usage of top_k by @tjohnson31415 in #543
- feat: improve golden token injection by @maxdebayser in #540
- Update links to granite FP8 model by @ckadner in #539
- fix: min_tokens > 1 causes long generation with continuous batching by @tjohnson31415 in #545
Full Changelog: v1.1.0...v1.2.0
v1.1.0
- ⬆️ Adds support for vllm v0.11.0
- 🔥 Drops support for vllm v0.10.1.1
- ✨ Writes performance metrics to file when `VLLM_SPYRE_PERF_METRIC_LOGGING_ENABLED` is set (see the sketch after this list)
- 🐛 Fixes a bug where incorrect logits processors were applied to requests under load
- 🐛 Fixes a bug where `/chat/completions` required a user-specified `max_tokens` param to function
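A hedged sketch of enabling the new metrics logging; the expected value ("1") is an assumption:

```python
import os

# Enable the debug performance logger; the exact value the flag expects
# ("1") is an assumption. Set this before importing vLLM so the Spyre
# backend picks it up at engine startup.
os.environ["VLLM_SPYRE_PERF_METRIC_LOGGING_ENABLED"] = "1"
```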
What's Changed
- fix: unbatch removals of requests from input_batch by @tjohnson31415 in #511
- 🐛 fixup more tests to use the default max model length by @joerunde in #512
- ✨ Add vLLM 0.11.0 support by @joerunde in #513
- [CB] consistent max context length by @yannicks1 in #514
- [docs] rephrase comment about continuous batching configuration by @yannicks1 in #518
- [CB] set new_tokens to max value given the constraints by @yannicks1 in #516
- ✨ add debug perf logger by @joerunde in #515
Full Changelog: v1.0.2...v1.1.0
v1.0.2 patch - test fixes only
This release contains fixes for our test suites so they run with the full granite 8b models and remain compatible with post-1.0 versions of the spyre runtime stack
What's Changed
- feat: golden token injector logits processor by @wallashss in #478
- 🐛 fixup full_model marker by @joerunde in #507
- 🐛 use 512 tokens instead of 256 by @joerunde in #509
Full Changelog: v1.0.1...v1.0.2
v1.0.1 Bugfix Release
This Release:
- Fixes a bug where cancelling multiple in-flight requests could crash the vllm server
- Fixes a bug where granite-3.x-8b models were not detected correctly, leading to `VLLM_SPYRE_REQUIRE_PRECOMPILED_DECODERS` not functioning properly
- Fixes a bug where the number of processors was not detected correctly for setting threading configs. `VLLM_SPYRE_NUM_CPUS` is now available as a manual override to set the number of CPU cores available to vllm (see the sketch after this list)
- Fixes a bug where attempting to run pooling models in continuous batching mode would crash, instead of defaulting to static batching
- Fixes a bug where the lower bound of FMS was not properly specified
- Disables prompt logprobs completely because the feature is still broken
- Updates the "simple compile backend" to `inductor` to align with vLLM
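A hedged sketch of the new CPU override; the value shown is illustrative only:

```python
import os

# Override the CPU count vLLM uses for threading configuration when
# automatic detection misreports it (e.g. under container cgroup limits).
# The value "8" is illustrative; set this before importing vLLM.
os.environ["VLLM_SPYRE_NUM_CPUS"] = "8"
```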
What's Changed
- disable prompt logprobs by @yannicks1 in #486
- [docs] update docs continuous batching by @yannicks1 in #485
- 🐛 correct fms lower bound by @joerunde in #493
- 🎨 scheduler: make holdback queue a local variable by @yannicks1 in #465
- [CB] 🐛 fix padding of position ids by @yannicks1 in #495
- [s390x] Update s390x dependencies by @nikheal2 in #494
- fix: logits processors for CB by @wallashss in #484
- [fp8] fix cb scheduler step tests by @yannicks1 in #491
- 🔥 remove auto-marked xfail for fp8, include fp8 tests by default, add xfail manually by @prashantgupta24 in #490
- feat: add VLLM_SPYRE_NUM_CPUS and psutil to help with cpu checks by @tjohnson31415 in #487
- 🐛 implement better checking for granite by @joerunde in #500
- Better Error Handling for attempts to run CB with pooling models by @gmarinho2 in #476
- 🔥 remove unused test parametrizations by @joerunde in #505
- 🔧 Update default simple compile backend by @joerunde in #506
Full Changelog: v1.0.0...v1.0.1
🎉 vllm-spyre v1.0.0 🎉
This release of vllm-spyre is compatible with the 1.0.0 version of the spyre runtime stack.
See the docs for a list of supported models and configurations
Supported Features:
- ⚡⚡⚡ Production-ready continuous batching (with `VLLM_SPYRE_USE_CB=1`) for a GPU-like user experience (see the sketch after this list)
- 🤓 Accurate text generation results with continuous batching for contexts up to 32k
- 🤏 Support for FP8-quantized models
- 🥅 Support for enforcing pre-compiled model graphs with `VLLM_SPYRE_REQUIRE_PRECOMPILED_DECODERS=1`
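A minimal sketch combining the two flags above; the model name is a placeholder, and the env vars must be set before vLLM is imported:

```python
import os

# Continuous batching for a GPU-like user experience, and refuse to run
# decoders whose pre-compiled graphs are not available.
os.environ["VLLM_SPYRE_USE_CB"] = "1"
os.environ["VLLM_SPYRE_REQUIRE_PRECOMPILED_DECODERS"] = "1"

from vllm import LLM

llm = LLM(model="ibm-granite/granite-3.1-8b-instruct")  # placeholder model
```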
Known Issues:
- The container image for this release does not have the correct v1.0 spyre runtime stack installed and will not function properly; the containerfile is still for demonstration purposes only
- Logits processors (custom and built-in) are not applied to the first generated token (prefill phase). Users might have incorrect results for the sampling params `min_p`, `logit_bias`, and `min_tokens`.
- It is possible to crash the server with an IndexError and a stack trace pointing at `logits[self.logits_slice] = -float("inf")` when sending and cancelling batches of requests with certain parameters; see #492
- The lower bound for ibm-fms is wrong; it should be <= 1.4.0. The lockfile contains a valid set of dependencies. See #493
- For reranker models with the sendnn backend, the output scores can be up to 15% different compared with sentence-transformers inference on GPU or CPU.
What's Changed
- [CB] 🧹 moving VLLM_SPYRE_MAX_WAITING_TIME_SECONDS to dev branch by @yannicks1 in #459
- [fp8] fix tests: increase ISCLOSE_ABS_TOL_QUANTIZATION by @yannicks1 in #460
- Fix dimension of tensor passed to transformer classifier by @maxdebayser in #458
- [CB][FP8] throw error for batch size 1 by @yannicks1 in #467
- fix: tests for graph comparison with FP8 by @wallashss in #462
- Add Sampling Params tests by @gmarinho2 in #379
- ⬆️ bump vllm lower bound to 0.10.1.1 by @prashantgupta24 in #468
- feat: enable custom logits processors by @wallashss in #473
- 🐛 override flex_hdma_p2psize by @joerunde in #475
- test: restored test_swap_decode_programs_for_cb with 32K context by @wallashss in #474
- [Tests] Enable up to 32k by @rafvasq in #472
- ⬆️ bump vllm upper bound to support 0.10.2 by @prashantgupta24 in #463
- get the token_type_ids from pooling params by @maxdebayser in #480
- disable transformers pooler by @maxdebayser in #481
- 🔥 rip out VLLM_SPYRE_TEST_BACKEND_LIST by @prashantgupta24 in #482
- Document supported model configurations by @ckadner in #479
- Disable compilation catalog by @gkumbhat in #471
- 🐛 use eager compile by @joerunde in #488
- [CB] optimization only return last block of prefill logits by @yannicks1 in #464
- [high prio] enable VLLM_SPYRE_ENABLE_PREFILL_OPTIMIZATION by default by @yannicks1 in #477
- fix: custom logits processor by @wallashss in #489
Full Changelog: v0.9.4...v1.0.0