
Conversation

@aurickq
Contributor

@aurickq aurickq commented Sep 26, 2025

Purpose

This PR adds Suffix Decoding (https://arxiv.org/abs/2411.04975) as a new speculative decoding method in vLLM; a minimal usage sketch follows the list below. Suffix Decoding is a dynamic n-gram matching method that:

  1. Uses suffix trees to generate speculative tokens quickly using branch frequency counts.
  2. Can keep a history of prior model responses, which tends to work very well with repetitive agentic use cases.
  3. Can be dynamically updated with newly generated tokens, with FIFO eviction of older requests.
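
A minimal usage sketch, assuming the opt-in flow described later in this thread (install arctic-inference, then pass a `{"method": "suffix", ...}` speculative config). The model name is a placeholder and the offline-API parameter names are an assumption, not the authoritative merged interface:

```python
# Hedged sketch: enabling suffix decoding via the offline LLM API.
# The speculative_config keys mirror the CLI flag quoted later in this
# thread; treat them as assumptions rather than the final interface.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    speculative_config={
        "method": "suffix",
        "num_speculative_tokens": 5,
    },
)

outputs = llm.generate(
    ["Refactor this function to use a list comprehension ..."],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```

The equivalent server-side form is the `--speculative_config '{"method": "suffix", "num_speculative_tokens": 5}'` flag quoted in the vllm-ascend PR description further down the thread.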

Test Plan

  • Benchmark Suffix Decoding against the current ngram speculator.
  • Write and run unit tests
  • Documentation

Test Result

Benchmarks on Specbench and Blazedit are below (on H200). Suffix Decoding beats ngram in pretty much all cases. In practice, we have seen larger speedups for real user interactions and agentic requests, since they tend to exhibit more output repetition than these benchmark datasets.

Script for benchmark reproduction: benchmark.sh

Specbench

Time per output token (ms)

| method | spec_len | concurrency 1 | concurrency 4 | concurrency 16 | concurrency 64 |
|---|---|---|---|---|---|
| suffix (w/ cache) | 5 | 4.4 | 4.64 | 5.85 | 10.55 |
| suffix (w/ cache) | 12 | 4.39 | 4.63 | 5.85 | 10.66 |
| suffix (w/ cache) | 32 | 4.39 | 4.63 | 5.82 | 10.67 |
| suffix (w/o cache) | 5 | 4.74 | 5.06 | 6.16 | 10.67 |
| suffix (w/o cache) | 12 | 4.73 | 5.02 | 6.15 | 10.76 |
| suffix (w/o cache) | 32 | 4.76 | 5.05 | 6.2 | 10.73 |
| ngram [5, 5] | 5 | 5.6 | 5.84 | 6.94 | 11.07 |
| ngram [5, 5] | 12 | 5.58 | 5.8 | 6.89 | 11.19 |
| ngram [5, 5] | 32 | 5.59 | 5.82 | 7.04 | 11.83 |
| ngram [3, 5] | 5 | 5.21 | 5.5 | 6.61 | 10.66 |
| ngram [3, 5] | 12 | 5.16 | 5.44 | 6.59 | 11.15 |
| ngram [3, 5] | 32 | 5.18 | 5.52 | 6.87 | 13.37 |

Total drafted tokens

| method | spec_len | concurrency 1 | concurrency 4 | concurrency 16 | concurrency 64 |
|---|---|---|---|---|---|
| suffix (w/ cache) | 5 | 68790 | 69238 | 68795 | 68452 |
| suffix (w/ cache) | 12 | 71154 | 71655 | 70952 | 71446 |
| suffix (w/ cache) | 32 | 71154 | 71378 | 71531 | 71283 |
| suffix (w/o cache) | 5 | 48012 | 48139 | 48081 | 48164 |
| suffix (w/o cache) | 12 | 50043 | 50258 | 50326 | 50282 |
| suffix (w/o cache) | 32 | 50043 | 49761 | 50466 | 49928 |
| ngram [5, 5] | 5 | 12460 | 12615 | 12610 | 12590 |
| ngram [5, 5] | 12 | 26268 | 26307 | 26673 | 26629 |
| ngram [5, 5] | 32 | 65293 | 65338 | 64615 | 64327 |
| ngram [3, 5] | 5 | 31606 | 31826 | 31608 | 31460 |
| ngram [3, 5] | 12 | 69535 | 69035 | 68498 | 68005 |
| ngram [3, 5] | 32 | 172779 | 169136 | 169809 | 170677 |

Total accepted tokens

| method | spec_len | concurrency 1 | concurrency 4 | concurrency 16 | concurrency 64 |
|---|---|---|---|---|---|
| suffix (w/ cache) | 5 | 18537 | 18727 | 18461 | 18437 |
| suffix (w/ cache) | 12 | 18609 | 18781 | 18614 | 18780 |
| suffix (w/ cache) | 32 | 18609 | 18751 | 18852 | 18654 |
| suffix (w/o cache) | 5 | 15401 | 15486 | 15377 | 15534 |
| suffix (w/o cache) | 12 | 15442 | 15637 | 15628 | 15558 |
| suffix (w/o cache) | 32 | 15442 | 15361 | 15669 | 15568 |
| ngram [5, 5] | 5 | 4757 | 4812 | 4794 | 4741 |
| ngram [5, 5] | 12 | 5046 | 5208 | 5046 | 5179 |
| ngram [5, 5] | 32 | 5149 | 5219 | 5203 | 5109 |
| ngram [3, 5] | 5 | 9278 | 9260 | 9288 | 9242 |
| ngram [3, 5] | 12 | 9857 | 9678 | 9722 | 9782 |
| ngram [3, 5] | 32 | 10040 | 9856 | 10011 | 9975 |

Blazedit

Time per output token (ms)

| method | spec_len | concurrency 1 | concurrency 4 | concurrency 16 | concurrency 64 |
|---|---|---|---|---|---|
| suffix (w/ cache) | 5 | 2.13 | 2.44 | 3.27 | 5.96 |
| suffix (w/ cache) | 12 | 1.77 | 2.04 | 2.84 | 5.77 |
| suffix (w/ cache) | 32 | 1.82 | 2.01 | 2.88 | 5.63 |
| suffix (w/o cache) | 5 | 2.22 | 2.44 | 3.31 | 5.99 |
| suffix (w/o cache) | 12 | 1.89 | 2.09 | 2.88 | 5.62 |
| suffix (w/o cache) | 32 | 1.91 | 2.11 | 2.85 | 5.63 |
| ngram [5, 5] | 5 | 2.75 | 3.05 | 3.99 | 6.66 |
| ngram [5, 5] | 12 | 2.41 | 2.68 | 3.51 | 6.23 |
| ngram [5, 5] | 32 | 2.23 | 2.51 | 3.55 | 7.46 |
| ngram [3, 5] | 5 | 2.44 | 2.69 | 3.57 | 6.18 |
| ngram [3, 5] | 12 | 2.05 | 2.31 | 3.11 | 6.03 |
| ngram [3, 5] | 32 | 1.86 | 2.22 | 3.33 | 8.13 |

Total drafted tokens

| method | spec_len | concurrency 1 | concurrency 4 | concurrency 16 | concurrency 64 |
|---|---|---|---|---|---|
| suffix (w/ cache) | 5 | 161067 | 164646 | 163410 | 171591 |
| suffix (w/ cache) | 12 | 188892 | 185344 | 186202 | 179407 |
| suffix (w/ cache) | 32 | 188892 | 181810 | 185943 | 184837 |
| suffix (w/o cache) | 5 | 149045 | 152582 | 153911 | 153363 |
| suffix (w/o cache) | 12 | 173522 | 174035 | 178302 | 171757 |
| suffix (w/o cache) | 32 | 173522 | 167821 | 178697 | 171921 |
| ngram [5, 5] | 5 | 122885 | 124817 | 123925 | 116898 |
| ngram [5, 5] | 12 | 164000 | 168710 | 177866 | 169000 |
| ngram [5, 5] | 32 | 305025 | 303489 | 303603 | 316235 |
| ngram [3, 5] | 5 | 146892 | 146052 | 152542 | 143307 |
| ngram [3, 5] | 12 | 223238 | 231225 | 228872 | 231770 |
| ngram [3, 5] | 32 | 432295 | 434561 | 456818 | 433020 |

Total accepted tokens

| method | spec_len | concurrency 1 | concurrency 4 | concurrency 16 | concurrency 64 |
|---|---|---|---|---|---|
| suffix (w/ cache) | 5 | 104448 | 107678 | 105853 | 112648 |
| suffix (w/ cache) | 12 | 119902 | 114103 | 116161 | 109788 |
| suffix (w/ cache) | 32 | 119902 | 113189 | 114991 | 115209 |
| suffix (w/o cache) | 5 | 101846 | 105089 | 106614 | 104780 |
| suffix (w/o cache) | 12 | 114345 | 114439 | 117194 | 112100 |
| suffix (w/o cache) | 32 | 114345 | 109405 | 117543 | 110273 |
| ngram [5, 5] | 5 | 89233 | 91067 | 90410 | 85974 |
| ngram [5, 5] | 12 | 94002 | 95939 | 101547 | 97922 |
| ngram [5, 5] | 32 | 102083 | 103021 | 104049 | 106095 |
| ngram [3, 5] | 5 | 95830 | 96248 | 98966 | 94171 |
| ngram [3, 5] | 12 | 103658 | 106170 | 106975 | 110182 |
| ngram [3, 5] | 32 | 110953 | 110166 | 113630 | 111404 |

Older Results (before optimizing)

refactor-bench (out=1024)

Results are mean TPOT (ms)

| method | spec_len | concurrency 1 | concurrency 4 | concurrency 16 | concurrency 64 |
|---|---|---|---|---|---|
| suffix (w/ cache) | 5 | 2.15 | 3.68 | 9.02 | 26.64 |
| suffix (w/ cache) | 12 | 1.91 | 3.36 | 8.56 | 26.32 |
| suffix (w/ cache) | 32 | 1.81 | 3.22 | 8.58 | 26.78 |
| suffix (w/o cache) | 5 | 2.35 | 3.92 | 9.2 | 26.78 |
| suffix (w/o cache) | 12 | 2.13 | 3.65 | 8.92 | 26.68 |
| suffix (w/o cache) | 32 | 2.04 | 3.56 | 8.98 | 27.77 |
| ngram | 5 | 2.99 | 4.7 | 10.41 | 28.62 |
| ngram | 12 | 2.68 | 4.41 | 9.85 | 28.66 |
| ngram | 32 | 2.58 | 4.32 | 10.57 | 32.63 |

spec-bench (out=256)

Results are mean TPOT (ms)

| method | spec_len | concurrency 1 | concurrency 4 | concurrency 16 | concurrency 64 |
|---|---|---|---|---|---|
| suffix (w/ cache) | 5 | 4.27 | 4.67 | 6.17 | 12.03 |
| suffix (w/ cache) | 12 | 4.26 | 4.71 | 6.2 | 12.11 |
| suffix (w/ cache) | 32 | 4.28 | 4.73 | 6.17 | 12.27 |
| suffix (w/o cache) | 5 | 4.63 | 5.09 | 6.38 | 11.68 |
| suffix (w/o cache) | 12 | 4.63 | 5.1 | 6.37 | 11.62 |
| suffix (w/o cache) | 32 | 4.62 | 5.06 | 6.35 | 11.66 |
| ngram | 5 | 5.38 | 5.7 | 6.77 | 10.98 |
| ngram | 12 | 5.37 | 5.67 | 6.76 | 10.99 |
| ngram | 32 | 5.37 | 5.73 | 6.87 | 11.76 |

@mergify

mergify bot commented Sep 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @aurickq.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 26, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request integrates Suffix Decoding from Arctic Inference as a new speculative decoding method. The changes are well-structured, adding new configuration options, validation, and the core logic for proposing draft tokens and managing the suffix cache. My review identifies a potential type inconsistency in the token sequences passed to the arctic-inference library, which could lead to runtime errors. I've suggested a fix to ensure consistency.

@simon-mo
Collaborator

@codex review

@simon-mo
Collaborator

note to reviewers:

  • We discussed with the Snowflake team that importing from arctic-inference is an acceptable path forward, and the team is committed to maintaining it as a separate library.
  • Please focus on code quality, interfaces, UX, etc.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@keyboardAnt

keyboardAnt commented Sep 26, 2025

@aurickq, thanks for your awesome contribution, the results look good!

Suffix decoding outperforms n-gram at out=1024, but falls behind at out=256 with concurrency=64 (+5.8% in the best case). Any idea why?

@aurickq
Contributor Author

aurickq commented Sep 28, 2025

@aurickq, thanks for your awesome contribution, the results look good!

Suffix decoding outperforms n-gram at out=1024, but falls behind at out=256 with concurrency=64 (+5.8% in the best case). Any idea why?

The out=1024 and out=256 runs also use two different datasets, so they might not be directly comparable. Other than that, when concurrency is high and the number of output tokens is low (e.g. 256), request completion time becomes dominated by mixed-prefill batches, which drag up the mean TPOT metric. So it makes sense that in these cases the performance of suffix and ngram approaches each other.

As for why suffix becomes a little worse than ngram for spec_bench at out=256 and concurrency=64, here is my guess: the SpecBench dataset is more open-ended (higher entropy, less repetition) than refactor-benchmark, so we would already expect both suffix and ngram to perform worse on it. The benchmark is also small (400-500 examples), so suffix decoding might not have built a sufficiently large cache to accurately predict the next tokens. Consistent with that, the benchmarks show suffix decoding actually performs better in this setting when the cache is disabled.

I have some ideas for solving this latter issue when the cached data is sparse, which I might later implement and contribute as a "suffix v2" method, if it works.

@Neo9061

Neo9061 commented Sep 29, 2025

Thanks a lot for the contribution @aurickq ! A few questions.

  1. In your benchmarking, when the cache is enabled, does that refer to the global tree? What training data are you using to construct the global tree?
  2. Can we add an option to make the global tree static, built from some offline training data? As explained in the other thread, this would be very useful for multi-tenant requests. Plan to merge Suffix decoding into vLLM mainline? snowflakedb/ArcticInference#171 (comment)
  3. Can your PR work with the hybrid PR [Spec Decode][Hybrid] Add ngram-eagle SD method #24344, which enables n-gram and EAGLE together, so that we can combine suffix decoding with EAGLE?
  4. For the comparison between suffix decoding w/o cache and n-gram, why do you think suffix decoding w/o cache works better than n-gram? In my understanding, they are almost equivalent when suffix decoding does not use the global cache. One reason I can think of is the dynamic drafting length suffix decoding has over n-gram.

@aurickq
Contributor Author

aurickq commented Sep 29, 2025

@Neo9061

  1. "w/ cache" means using the global suffix tree, and "w/o cache" means not using the global suffix tree (setting suffix_decoding_max_cached_requests = 0. The per-prompt suffix trees are used in both cases. In these benchmarks, the only requests being cached are the earlier requests in the same benchmark. The performance would probably be much better in a more realistic setting when more requests can be cached over a longer period of time.
  2. I think this is a good idea, but I would like to address this in a follow-up PR once the core suffix speculation is enabled. It could use more input from the community on interface design, like what's the best format to read the "static" cache.
  3. The current PR doesn't consider hybrid speculation yet, would also be good to add in the future.
  4. Yeah they are "almost" equivalent except for suffix decoding's frequency stats and scoring mechanism. For each speculation length, suffix decoding can speculate up to that many tokens but can also speculate less if there is no probable continuation to save on verification costs. It also means that out of several possible continuations, suffix decoding can choose the most "frequent" one to maximize the probability of acceptance.
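
To make the scoring idea above concrete, here is a rough conceptual sketch of frequency-guided drafting from a suffix-tree node. It is not the arctic-inference implementation; the `Node` structure, `min_prob` threshold, and function name are illustrative only:

```python
# Conceptual sketch: pick the most frequent continuation at each branch and
# stop early when no child is probable enough to justify verification cost.
from dataclasses import dataclass, field


@dataclass
class Node:
    count: int = 0                                              # times this suffix was observed
    children: dict[int, "Node"] = field(default_factory=dict)   # next token -> child node


def draft_tokens(node: Node, max_spec_tokens: int, min_prob: float = 0.3) -> list[int]:
    draft: list[int] = []
    while node.children and len(draft) < max_spec_tokens:
        # Choose the most frequent continuation at this branch.
        token, child = max(node.children.items(), key=lambda kv: kv[1].count)
        # Stop early if even the best branch is not probable enough; this is
        # what allows the draft to be shorter than max_spec_tokens.
        if node.count == 0 or child.count / node.count < min_prob:
            break
        draft.append(token)
        node = child
    return draft
```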

@mergify mergify bot added the ci/build label Sep 29, 2025
@Jialin
Collaborator

Jialin commented Oct 25, 2025

Rebased. Could someone help trigger CI?

@aurickq Could you try to address the DCO and doc build failure first?

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 25, 2025
@aurickq
Contributor Author

aurickq commented Oct 27, 2025

@aurickq Could you try to address the DCO and doc build failure first?

Fixed the doc failure. For DCO, in the past I've avoided addressing it since it leaks my personal email publicly :) (not sure if this part has changed)

@simon-mo simon-mo merged commit 2c19d96 into vllm-project:main Nov 3, 2025
88 of 91 checks passed
zhaozuy pushed a commit to zhaozuy/vllm that referenced this pull request Nov 4, 2025
omerpaz95 pushed a commit to omerpaz95/vllm that referenced this pull request Nov 4, 2025
@ggg-s

ggg-s commented Nov 6, 2025

@aurickq Why would you use the parameter --no-enable-prefix-caching?

juliendenize pushed a commit to juliendenize/vllm that referenced this pull request Nov 6, 2025
zWaNg3 added a commit to fangyuchu/vllm that referenced this pull request Nov 7, 2025
ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
@keyboardAnt

keyboardAnt commented Nov 12, 2025

Here are two quick suggestions for validating your hypotheses on why suffix decoding is 5.8% slower even in the best case at out=256 with concurrency=64:

@keyboardAnt you are right to keep pushing on this ;). After some profiling, I found the main reason is actually the overhead from suffix decoding itself. At concurrency=64, the total cache update time per step is ~0.4ms and the total speculation time per step is ~1.5ms. To verify, I set suffix_decoding_max_tree_depth to 32 instead of 64, which reduces the cost of both the update and the speculation, and it eliminated the gap vs ngram at concurrency=64.

In hindsight, this is not very surprising since these suffix decoding operations still all happen on a single CPU thread. For speculation, it should be straightforward to parallelize across multiple threads. Parallelizing the cache update is a bit harder; a better solution there is probably to let it run asynchronously so it can be overlapped with the forward pass (this requires speculating from a cache that is stale by one decoding step, but that should be inconsequential).

These would be easy wins on top of this initial implementation.

@aurickq, as you know, a key limitation of existing speculative decoding methods is the large-batch setting. This is especially true for drafting methods that require additional GPU FLOPs (like EAGLE).

It seems like suffix decoding has the potential to unlock speedups for larger batches. However, your profiling confirmed that the tree-update latency (a CPU-only procedure) scales with the batch size. This scaling behavior reduces the overall efficiency of suffix decoding and ultimately confines the method to smaller batches.

To unlock larger batches, we could implement multithreading by sharding the tree and assigning a separate lock to each shard. There are multiple ways to shard, but an effective approach would keep shard sizes roughly balanced (low variance).

Wdyt?
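
As a rough illustration of this sharding proposal, here is a minimal sketch in which the tree is sharded on the first token of each update and each shard has its own lock. The class name, sharding key, and toy dict-based trie (without frequency counts) are hypothetical, not part of vLLM or arctic-inference:

```python
# Hypothetical sketch of a sharded suffix cache with per-shard locks.
import threading


class ShardedSuffixCache:
    def __init__(self, num_shards: int = 16):
        self._shards = [dict() for _ in range(num_shards)]       # per-shard toy trie roots
        self._locks = [threading.Lock() for _ in range(num_shards)]

    def _shard_id(self, first_token: int) -> int:
        # Illustrative sharding key: the first token of the update.
        return first_token % len(self._shards)

    def update(self, tokens: list[int]) -> None:
        if not tokens:
            return
        sid = self._shard_id(tokens[0])
        with self._locks[sid]:            # only this shard is blocked during the update
            node = self._shards[sid]
            for tok in tokens:
                node = node.setdefault(tok, {})   # extend the toy trie
```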

@aurickq
Contributor Author

aurickq commented Nov 14, 2025

@aurickq Why would you use this parameter --no-enable-prefix-caching ?

Just wanted to be able to run multiple benchmarks in a row without restarting the server. Otherwise, the results will be skewed by the caching.
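
For concreteness, a hedged sketch of the equivalent offline setting; `enable_prefix_caching=False` mirrors the `--no-enable-prefix-caching` server flag discussed above, and the model name is a placeholder:

```python
# Disable prefix caching so back-to-back benchmark runs are not skewed by
# prompts being served from the prefix cache.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_prefix_caching=False,
    speculative_config={"method": "suffix", "num_speculative_tokens": 5},
)
```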

@aurickq
Contributor Author

aurickq commented Nov 14, 2025

@aurickq, as you know, a key limitation of existing speculative decoding methods is the large-batch setting. This is especially true for drafting methods that require additional GPU FLOPs (like EAGLE).

It seems like suffix decoding has the potential to unlock speedups for larger batches. However, your profiling confirmed that the tree-update latency (a CPU-only procedure) scales with the batch size. This scaling behavior reduces the overall efficiency of suffix decoding and ultimately confines the method to smaller batches.

To unlock larger batches, we could implement multithreading by sharding the tree and assigning a separate lock to each shard. There are multiple ways to shard, but an effective approach would keep shard sizes roughly balanced (low variance).

Wdyt?

There are two parts - the speculations and the updates.

For the speculations, parallelizing it is a good idea, since it does not modify the tree. I would love to see this done.

For the updates, parallelizing is harder because it will modify the tree. My preference is to avoid any parallelization scheme that involves locking. It will make the tree implementation a lot harder to maintain and update in the future if we need to promise that it is thread-safe.

I think a simpler way to handle the updates is to make them run asynchronously in the background, but still on a single thread, so that they can be overlapped with the rest of the decoding loop. My feeling is that this is good enough to completely hide the update overhead, even at much higher concurrencies (but this needs to be confirmed).
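
A minimal sketch of that asynchronous single-writer idea, assuming hypothetical `update`/`speculate` methods on the cache object (this is not the vLLM scheduler or the arctic-inference API):

```python
# Illustrative sketch: overlap the suffix-cache update with the model forward
# pass, then wait for it before speculating, so the tree is never read and
# written at the same time.
from concurrent.futures import Future, ThreadPoolExecutor


class OverlappedSuffixCache:
    def __init__(self, cache):
        self._cache = cache
        self._executor = ThreadPoolExecutor(max_workers=1)  # single writer thread
        self._pending: Future | None = None

    def update_async(self, request_id: str, new_tokens: list[int]) -> None:
        # Kick off the update; it runs while the GPU forward pass executes.
        self._pending = self._executor.submit(self._cache.update, request_id, new_tokens)

    def speculate(self, request_id: str, context: list[int], max_spec_tokens: int) -> list[int]:
        # Wait for any in-flight update so reads never race with the writer.
        if self._pending is not None:
            self._pending.result()
            self._pending = None
        return self._cache.speculate(request_id, context, max_spec_tokens)
```

In the decode loop, `update_async` would be called just before launching the forward pass and `speculate` after it returns, so the update cost is hidden behind the GPU work while staying at most one decoding step stale.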

devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Dec 1, 2025
### What this PR does / why we need it?
This PR integrates Suffix Decoding (https://arxiv.org/abs/2411.04975) from vLLM (vllm-project/vllm#25784).

Suffix Decoding is a dynamic n-gram matching method that:

1. Uses suffix trees to generate speculative tokens quickly using branch frequency counts.
2. Can keep a history of prior model responses, which tends to work very well with repetitive agentic use cases.
3. Can be dynamically updated with newly generated tokens, with FIFO eviction of older requests.

### Does this PR introduce _any_ user-facing change?
This feature is opt-in and remains seamless for users who do not require suffix speculative decoding.

For users who wish to enable it, they must first install arctic-inference:

```
pip install arctic-inference
```

After installation, suffix speculative decoding can be enabled with the following speculative config:

```
--speculative_config '{"method": "suffix", "num_speculative_tokens": 5}'
```

### How was this patch tested?
This PR is currently being tested on vLLM main:vllm-project/vllm@83f478b with PR vllm-project/vllm#25784.

In our previous testing, suffix decoding achieved a 13%-30% throughput
improvement over n-gram on the sonnet dataset, tested on vllm-ascend
v0.9.1 with concurrency ranging from 2 to 40.

- vLLM version: v0.11.2

---------

Signed-off-by: fluctlux <[email protected]>