Releases · vllm-project/vllm-spyre
v1.4.0 - Initial Prefix Caching Support
This Release:
- Adds an experimental implementation of prefix caching (see the sketch after this list)
- Adds support for vllm v0.11.1 and updates the default locked version
- Upgrades the AFTU dev dependency lock to v0.5.0 to support chunked prefill tests on spyre hardware
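As a hedged illustration of trying the experimental feature, here is a minimal sketch assuming the plugin honors vLLM's standard `enable_prefix_caching` engine argument (unverified for vllm-spyre; the model name is a placeholder):

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching is vLLM's standard engine flag; whether the Spyre
# plugin wires it up exactly this way is an assumption.
llm = LLM(
    model="ibm-granite/granite-3.1-8b-instruct",  # placeholder model
    enable_prefix_caching=True,
)

# Requests that share a long common prefix can reuse cached KV blocks,
# skipping recomputation of the shared prompt portion.
shared = "You are a helpful assistant. " * 50
outputs = llm.generate(
    [shared + "Question one?", shared + "Question two?"],
    SamplingParams(max_tokens=16),
)
```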
What's Changed
- ✨ vllm support for 0.11.1 release by @joerunde in #546
- 🔥 remove prints by @joerunde in #594
- ⬆️ aftu to v0.5.0 by @tjohnson31415 in #595
- 🔥 Limit CI usage by @joerunde in #596
- 🐛 Fix main CI by @joerunde in #598
- ✨ Add prefix caching by @maxdebayser in #586
- 📝 add doc section on VLLM_SPYRE_REQUIRE_PRECOMPILED_DECODERS by @tjohnson31415 in #593
Full Changelog: v1.3.0...v1.4.0
v1.3.0
This release adds support for chunked prefill for non-quantized models, which can be enabled with `VLLM_SPYRE_USE_CHUNKED_PREFILL=1 VLLM_SPYRE_USE_CB=1` (see the sketch below).
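A minimal offline sketch of launching with chunked prefill enabled, using the env vars above; the model name is a placeholder and the offline `vllm.LLM` entrypoint is assumed (an OpenAI-compatible server works the same way, with the env vars prefixed to the command):

```python
import os

# Enable continuous batching plus chunked prefill on the Spyre backend,
# per the release notes above. Must be set before vLLM is imported.
os.environ["VLLM_SPYRE_USE_CB"] = "1"
os.environ["VLLM_SPYRE_USE_CHUNKED_PREFILL"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="ibm-granite/granite-3.1-8b-instruct")  # placeholder model
outputs = llm.generate(["Hello, Spyre!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```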
What's Changed
- feat: chunked prefill spyre model runner by @wallashss in #552
- [tests] cleanup: remove temporary hack by @yannicks1 in #555
- ChunkedPrefillSpyreScheduler: No Interleaving by @sducouedic in #554
- fix: left padding of prompts less than chunk size by @wallashss in #557
- feat: left padding from model runner to scheduler by @wallashss in #559
- [CP] rewrite scheduler constraints for chunked prefill (🐛 fix) by @yannicks1 in #560
- [CB] remove decode/prefill prioritization heuristic by @yannicks1 in #561
- Bugfix: padding block cannot be reused with chunked prefill by @sducouedic in #563
- [CP] scheduler constraints typo by @yannicks1 in #565
- test: add maybe_xfail for quantized micro static batch logprobs checks by @tjohnson31415 in #566
- [CP] fix empty model runner output by @yannicks1 in #570
- [CB] remove env var VLLM_SPYRE_ENABLE_PREFILL_OPTIMIZATION by @yannicks1 in #562
- [CP] optimal chunked prefill scheduler constraints by @yannicks1 in #564
- [CP] Simplify code by @maxdebayser in #572
- [CB] tighten constraint max model length decode sequences by @yannicks1 in #573
- Set default chunk size to 4k for granite 3 8b TP4 by @tjohnson31415 in #571
- Interleave chunked prefills with single decoding steps by @sducouedic in #558
- feat/fix: add finish_requests to handle removal from ongoing_prefills by @tjohnson31415 in #577
- fix: check only decoding requests in _satisfies_last_chunk_constraints by @tjohnson31415 in #576
- tests: include chunked prefill on existing tests by @wallashss in #574
- Add step tests for chunked prefill by @maxdebayser in #575
- docs: chunked prefill updated documentation by @wallashss in #578
- [Docs] Prep and publish GH Pages doc by @rafvasq in #579
- [Docs] Update GH artifact versions by @rafvasq in #581
- [Docs] Add workflow files to docs action triggers by @rafvasq in #582
- [Docs] Avoid multiple artifacts by @rafvasq in #583
- fix test_compare_graphs_chunked_prefill by @tjohnson31415 in #580
- [Docs] Use mkdocs gh-deploy by @rafvasq in #584
- [Docs] Update links to documentation by @rafvasq in #587
- [PC] Refactor CB model runner to use vLLMs block pool by @maxdebayser in #585
- [Docs] Add note about move by @rafvasq in #589
- test: a few test configuration updates to have chunked prefill tests pass on Spyre by @tjohnson31415 in #588
- Update default granite 8b chunk size to 1024 by @tjohnson31415 in #592
- fix time logging and other small things by @yannicks1 in #590
Full Changelog: v1.2.3...v1.3.0
v1.2.3
Includes a change required to support Torch >= 2.8
What's Changed
- fix: use sampling during warmup and disable backed_size_oblivious after model compilation by @tjohnson31415 in #551
Full Changelog: v1.2.2...v1.2.3
v1.2.2
v1.2.1
v1.2.0
- ✨ Adds custom GoldenTokenInjector LogitsProcessor for evaluating model quality (see the sketch after this list)
- ✨ Initial Granite 4 model support
- 🐛 Fixes a bug where `min_tokens` was not behaving properly (it forced longer sequences than desired)
- 🐛 Fixes a bug in handling `top_k` that could crash the server
- 📝 Adds a `runtime_config_validator` to check for and warn about unsupported model configurations that may not work
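For context on what golden-token injection does, here is a minimal, illustrative sketch written as a classic `(token_ids, logits)` logits-processor callable; the actual GoldenTokenInjector in vllm-spyre may be structured differently, and all names here are hypothetical:

```python
import torch

def make_golden_token_injector(golden_ids: list[int]):
    """Hypothetical sketch: force each generation step to a predetermined
    "golden" token so output quality can be evaluated deterministically."""
    def inject(token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
        step = len(token_ids)  # tokens generated so far for this request
        if step < len(golden_ids):
            # Mask every logit except the golden token for this step.
            mask = torch.full_like(logits, float("-inf"))
            mask[golden_ids[step]] = 0.0
            logits = logits + mask
        return logits
    return inject
```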
What's Changed
- update and expand online example to continuous batching by @yannicks1 in #517
- refact: removed unnecessary logits processor by @wallashss in #520
- test: update tests to use golden token injection by @wallashss in #510
- Fix test model revision usage by @prashantgupta24 in #522
- [ppc64le] Update ppc64le dependencies by @Daniel-Schenker in #524
- [CI] Enable model revisions in GHA test by @ckadner in #523
- Manage supported model configurations by @ckadner in #445
- 📜 Add documentation and diagrams on the plugin architecture by @maxdebayser in #530
- ♻️ Simplify env var overrides and add tests by @joerunde in #525
- [Docs] Add arch doc to dev guide view by @rafvasq in #534
- Granite4 2b & 3b support by @yannicks1 in #496
- 🐛 Fix fp8 model name check with quantization check by @gkumbhat in #535
- 📝 add supported torch versions by @joerunde in #528
- add e5-multilingual to known configurations by @maxdebayser in #533
- fix: logits processor state at each step by @wallashss in #544
- fix crashes with the usage of top_k by @tjohnson31415 in #543
- feat: improve golden token injection by @maxdebayser in #540
- Update links to granite FP8 model by @ckadner in #539
- fix: min_tokens > 1 causes long generation with continuous batching by @tjohnson31415 in #545
Full Changelog: v1.1.0...v1.2.0
v1.1.0
- ⬆️ Adds support for vllm v0.11.0
- 🔥 Drops support for vllm v0.10.1.1
- ✨ Writes performance metrics to file when `VLLM_SPYRE_PERF_METRIC_LOGGING_ENABLED` is set (see the sketch after this list)
- 🐛 Fixes a bug where incorrect logits processors were applied to requests under load
- 🐛 Fixes a bug where `/chat/completions` required a user-specified `max_tokens` param to function
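A hedged sketch of enabling the new metrics logging; the expected value ("1") is an assumption:

```python
import os

# Enable the debug performance logger; the exact value the flag expects
# ("1") is an assumption. Set this before importing vLLM so the Spyre
# backend picks it up at engine startup.
os.environ["VLLM_SPYRE_PERF_METRIC_LOGGING_ENABLED"] = "1"
```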
What's Changed
- fix: unbatch removals of requests from input_batch by @tjohnson31415 in #511
- 🐛 fixup more tests to use the default max model length by @joerunde in #512
- ✨ Add vLLM 0.11.0 support by @joerunde in #513
- [CB] consistent max context length by @yannicks1 in #514
- [docs] rephrase comment about continuous batching configuration by @yannicks1 in #518
- [CB] set new_tokens to max value given the constraints by @yannicks1 in #516
- ✨ add debug perf logger by @joerunde in #515
Full Changelog: v1.0.2...v1.1.0
v1.0.2 patch - test fixes only
This release contains fixes for our test suites so they run with the full granite 8b models and remain compatible with post-1.0 versions of the spyre runtime stack
What's Changed
- feat: golden token injector logits processor by @wallashss in #478
- 🐛 fixup full_model marker by @joerunde in #507
- 🐛 use 512 tokens instead of 256 by @joerunde in #509
Full Changelog: v1.0.1...v1.0.2
v1.0.1 Bugfix Release
This Release:
- Fixes a bug where cancelling multiple in-flight requests could crash the vllm server
- Fixes a bug where granite-3.x-8b models were not detected correctly, leading to `VLLM_SPYRE_REQUIRE_PRECOMPILED_DECODERS` not functioning properly
- Fixes a bug where the number of processors was not detected correctly for setting threading configs. `VLLM_SPYRE_NUM_CPUS` is now available as a manual override to set the number of CPU cores available to vllm (see the sketch after this list)
- Fixes a bug where attempting to run pooling models in continuous batching mode would crash, instead of defaulting to static batching
- Fixes a bug where the lower bound of FMS was not properly specified
- Disables prompt logprobs completely because the feature is still broken
- Updates the "simple compile backend" to `inductor` to align with vLLM
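A hedged sketch of the new CPU override; the value shown is illustrative only:

```python
import os

# Override the CPU count vLLM uses for threading configuration when
# automatic detection misreports it (e.g. under container cgroup limits).
# The value "8" is illustrative; set this before importing vLLM.
os.environ["VLLM_SPYRE_NUM_CPUS"] = "8"
```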
What's Changed
- disable prompt logprobs by @yannicks1 in #486
- [docs] update docs continuous batching by @yannicks1 in #485
- 🐛 correct fms lower bound by @joerunde in #493
- 🎨 scheduler: make holdback queue a local variable by @yannicks1 in #465
- [CB] 🐛 fix padding of position ids by @yannicks1 in #495
- [s390x] Update s390x dependencies by @nikheal2 in #494
- fix: logits processors for CB by @wallashss in #484
- [fp8] fix cb scheduler step tests by @yannicks1 in #491
- 🔥 remove auto-marked xfail for fp8, include fp8 tests by default, add xfail manually by @prashantgupta24 in #490
- feat: add VLLM_SPYRE_NUM_CPUS and psutil to help with cpu checks by @tjohnson31415 in #487
- 🐛 implement better checking for granite by @joerunde in #500
- Better Error Handling for attempts to run CB with pooling models by @gmarinho2 in #476
- 🔥 remove unused test parametrizations by @joerunde in #505
- 🔧 Update default simple compile backend by @joerunde in #506
Full Changelog: v1.0.0...v1.0.1
🎉 vllm-spyre v1.0.0 🎉
This release of vllm-spyre is compatible with the 1.0.0 version of the spyre runtime stack.
See the docs for a list of supported models and configurations
Supported Features:
- ⚡⚡⚡ Production-ready continuous batching (with `VLLM_SPYRE_USE_CB=1`) for a GPU-like user experience (see the sketch after this list)
- 🤓 Accurate text generation results with continuous batching for contexts up to 32k
- 🤏 Support for FP8-quantized models
- 🥅 Support for enforcing pre-compiled model graphs with `VLLM_SPYRE_REQUIRE_PRECOMPILED_DECODERS=1`
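A minimal sketch combining the two flags above; the model name is a placeholder, and the env vars must be set before vLLM is imported:

```python
import os

# Continuous batching for a GPU-like user experience, and refuse to run
# decoders whose pre-compiled graphs are not available.
os.environ["VLLM_SPYRE_USE_CB"] = "1"
os.environ["VLLM_SPYRE_REQUIRE_PRECOMPILED_DECODERS"] = "1"

from vllm import LLM

llm = LLM(model="ibm-granite/granite-3.1-8b-instruct")  # placeholder model
```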
Known Issues:
- The container image for this release does not have the correct v1.0 spyre runtime stack installed and will not function properly; the containerfile is still for demonstration purposes only
- Logits processors (custom and built-in) are not applied to the first generated token (prefill phase). Users might have incorrect results for the sampling params `min_p`, `logit_bias`, and `min_tokens`.
- It is possible to crash the server with an IndexError and a stack trace pointing at `logits[self.logits_slice] = -float("inf")` when sending and cancelling batches of requests with certain parameters; see #492
- The lower bound for ibm-fms is wrong; it should be <= 1.4.0. The lockfile contains a valid set of dependencies. See #493
- For reranker models with the sendnn backend, the output scores can be up to 15% different compared with sentence-transformers inference on GPU or CPU.
What's Changed
- [CB] 🧹 moving VLLM_SPYRE_MAX_WAITING_TIME_SECONDS to dev branch by @yannicks1 in #459
- [fp8] fix tests: increase ISCLOSE_ABS_TOL_QUANTIZATION by @yannicks1 in #460
- Fix dimension of tensor passed to transformer classifier by @maxdebayser in #458
- [CB][FP8] throw error for batch size 1 by @yannicks1 in #467
- fix: tests for graph comparison with FP8 by @wallashss in #462
- Add Sampling Params tests by @gmarinho2 in #379
- ⬆️ bump vllm lower bound to 0.10.1.1 by @prashantgupta24 in #468
- feat: enable custom logits processors by @wallashss in #473
- 🐛 override flex_hdma_p2psize by @joerunde in #475
- test: restored test_swap_decode_programs_for_cb with 32K context by @wallashss in #474
- [Tests] Enable up to 32k by @rafvasq in #472
- ⬆️ bump vllm upper bound to support 0.10.2 by @prashantgupta24 in #463
- get the token_type_ids from pooling params by @maxdebayser in #480
- disable transformers pooler by @maxdebayser in #481
- 🔥 rip out VLLM_SPYRE_TEST_BACKEND_LIST by @prashantgupta24 in #482
- Document supported model configurations by @ckadner in #479
- Disable compilation catalog by @gkumbhat in #471
- 🐛 use eager compile by @joerunde in #488
- [CB] optimization only return last block of prefill logits by @yannicks1 in #464
- [high prio] enable VLLM_SPYRE_ENABLE_PREFILL_OPTIMIZATION by default by @yannicks1 in #477
- fix: custom logits processor by @wallashss in #489
Full Changelog: v0.9.4...v1.0.0