[0.11.0][Bug Fix] Fixes ngram spec decode bug introduced by vllm #3817
Conversation
Signed-off-by: Icey <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request aims to fix a bug in the n-gram speculative decoding by refactoring to a batched approach, aligning with an upstream vLLM change. However, the implementation introduces a critical bug where it uses stale sequence lengths when proposing draft tokens, which will likely prevent n-gram matching from working correctly. I've provided a suggestion to fix this by tracking and using the updated sequence lengths, which aligns with the correct implementation in upstream vLLM.
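To illustrate the failure mode, here is a minimal, self-contained sketch (the `find_ngram_match` helper and the toy token IDs are hypothetical, not the project's proposer code): an n-gram lookup that only sees the sequence up to the stale length matches against a tail that is one token behind the true end of the sequence, so its draft no longer lines up with the position actually being decoded.

import numpy as np

def find_ngram_match(tokens: np.ndarray, n: int, k: int) -> list[int]:
    """Toy n-gram lookup: find an earlier occurrence of the last `n` tokens
    and propose the `k` tokens that followed that occurrence."""
    if len(tokens) < n + 1:
        return []
    tail = tokens[-n:]
    for start in range(len(tokens) - n - 1, -1, -1):
        if np.array_equal(tokens[start:start + n], tail):
            return tokens[start + n:start + n + k].tolist()
    return []

# Context "... 7 8 9 ... 7 8", and the model has just sampled 9 again.
token_ids_cpu = np.array([1, 2, 7, 8, 9, 3, 4, 7, 8, 9])
stale_len = 9     # length before the newly sampled token was appended
updated_len = 10  # length after appending it

print(find_ngram_match(token_ids_cpu[:stale_len], n=2, k=2))
# -> [9, 3]: the draft begins with the token that was already sampled,
#    because the matcher never saw it.
print(find_ngram_match(token_ids_cpu[:updated_len], n=2, k=2))
# -> [3, 4]: with the updated length, the draft continues past the newest token.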
+ valid_ngram_requests = []
  for i, sampled_ids in enumerate(valid_sampled_token_ids):
      num_sampled_ids = len(sampled_ids)
      if not num_sampled_ids:
          # Skip speculative decoding.
          draft_token_ids.append([])
          continue

      # Skip requests that require top-p, top-k, etc.
      req_id = self.runner.input_batch.req_ids[i]
      if req_id in self.runner.input_batch.spec_decode_unsupported_reqs:
          draft_token_ids.append([])
          continue

      # Add sampled_token_ids to token_ids_cpu.
      num_tokens = self.runner.input_batch.num_tokens_no_spec[i]
      if num_tokens >= self.runner.input_batch.max_model_len:
          # Skip requests that have already reached the max model length.
          continue

      start_idx = self.runner.input_batch.num_tokens_no_spec[i]
      end_idx = start_idx + num_sampled_ids
      self.runner.input_batch.token_ids_cpu[
          i, start_idx:end_idx] = sampled_ids
-     drafter_output = self.propose(
-         self.runner.input_batch.token_ids_cpu[i, :end_idx])
-     if drafter_output is None or len(drafter_output) == 0:
-         draft_token_ids.append([])
-     else:
-         draft_token_ids.append(drafter_output.tolist())
- return draft_token_ids
+     valid_ngram_requests.append(i)

+ draft_token_ids = self.batch_propose(
+     len(valid_sampled_token_ids),
+     valid_ngram_requests,
+     self.runner.input_batch.num_tokens_no_spec,
+     self.runner.input_batch.token_ids_cpu,
+ )

+ return draft_token_ids
The batch_propose method is called with self.runner.input_batch.num_tokens_no_spec, which contains the sequence lengths before the newly sampled tokens are appended. This causes the propose method to look for n-grams in an outdated sequence, missing the most recent token. This will likely cause the n-gram matching to fail and prevent any speculative tokens from being proposed.
The fix is to calculate the new sequence lengths after appending the sampled tokens and pass these new lengths to batch_propose. This aligns with the corrected implementation in upstream vLLM.
valid_ngram_requests = []
new_num_tokens = self.runner.input_batch.num_tokens_no_spec.copy()
for i, sampled_ids in enumerate(valid_sampled_token_ids):
    num_sampled_ids = len(sampled_ids)
    if not num_sampled_ids:
        continue
    req_id = self.runner.input_batch.req_ids[i]
    if req_id in self.runner.input_batch.spec_decode_unsupported_reqs:
        continue
    num_tokens = self.runner.input_batch.num_tokens_no_spec[i]
    if num_tokens >= self.runner.input_batch.max_model_len:
        # Skip requests that have already reached the max model length.
        continue
    start_idx = self.runner.input_batch.num_tokens_no_spec[i]
    end_idx = start_idx + num_sampled_ids
    self.runner.input_batch.token_ids_cpu[
        i, start_idx:end_idx] = sampled_ids
    new_num_tokens[i] = end_idx
    valid_ngram_requests.append(i)
draft_token_ids = self.batch_propose(
    len(valid_sampled_token_ids),
    valid_ngram_requests,
    new_num_tokens,
    self.runner.input_batch.token_ids_cpu,
)
return draft_token_ids
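For context, here is a rough sketch of what a batched proposer along these lines might do with the lengths it receives; this is an assumption for illustration, not vllm-ascend's actual batch_propose implementation. Each valid request's token buffer is sliced up to the supplied length before the n-gram search, which is why passing the pre-append lengths silently drops the newest token.

import numpy as np

def batch_propose_sketch(num_requests, valid_ngram_requests,
                         num_tokens, token_ids_cpu, propose_fn):
    """Hypothetical batched wrapper: requests not in `valid_ngram_requests`
    get an empty draft; valid requests are proposed from the prefix
    token_ids_cpu[i, :num_tokens[i]]."""
    draft_token_ids = [[] for _ in range(num_requests)]
    for i in valid_ngram_requests:
        # If num_tokens[i] was captured before the sampled tokens were
        # appended, the newest token(s) never reach the n-gram matcher.
        prefix = token_ids_cpu[i, :num_tokens[i]]
        drafted = propose_fn(prefix)
        draft_token_ids[i] = [int(t) for t in drafted] if drafted is not None else []
    return draft_token_ids

# Toy usage: one valid request whose buffer holds 5 real tokens.
tokens = np.zeros((1, 8), dtype=np.int64)
tokens[0, :5] = [1, 2, 3, 1, 2]
print(batch_propose_sketch(1, [0], [5], tokens, lambda p: p[-1:]))  # [[2]]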
Is this a cherry-pick? If yes, please mention it in the commit message or title.
This PR fixes the issue introduced upstream, but ngram still has unresolved accuracy issues, so I'm unsure whether this PR can be merged.
What this PR does / why we need it?
[Bug Fix] Fixes the ngram spec decode bug introduced by upstream vLLM in vllm-project/vllm#24986
Does this PR introduce any user-facing change?
N/A
How was this patch tested?