
Conversation

Collaborator

@wxsIcey wxsIcey commented Oct 28, 2025

What this PR does / why we need it?

[Bug Fix] Fixes the ngram spec decode bug introduced by the upstream vLLM change vllm-project/vllm#24986

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

from vllm import LLM, SamplingParams


def main():
    prompts = [
        "The future of AI is",
    ]

    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    # Create an LLM with ngram speculative decoding enabled.
    llm = LLM(
        model="LLM-Research/Meta-Llama-3.1-8B-Instruct",
        tensor_parallel_size=1,
        speculative_config={
            "method": "ngram",
            "num_speculative_tokens": 5,  # propose at most 5 tokens per step
            "prompt_lookup_max": 4,  # match n-grams of up to 4 tokens
        },
        enforce_eager=True,
    )

    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    print(f"Outputs: {outputs}")
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    main()

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@wxsIcey wxsIcey added the ready (read for review) and ready-for-test (start test by label for PR) labels on Oct 28, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to fix a bug in the n-gram speculative decoding by refactoring to a batched approach, aligning with an upstream vLLM change. However, the implementation introduces a critical bug where it uses stale sequence lengths when proposing draft tokens, which will likely prevent n-gram matching from working correctly. I've provided a suggestion to fix this by tracking and using the updated sequence lengths, which aligns with the correct implementation in upstream vLLM.
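
To make the failure mode concrete, here is a toy illustration of n-gram prompt lookup in plain Python (not the vLLM or vLLM Ascend implementation): the drafter matches the current suffix of the sequence against earlier context, so a stale length that drops the most recently sampled token changes the suffix and can make the lookup miss.

# Toy n-gram prompt lookup, for illustration only (not vLLM code).
def ngram_propose(token_ids, n=2, k=3):
    """Return up to k tokens that previously followed the last n-token suffix."""
    suffix = tuple(token_ids[-n:])
    # Scan backwards for an earlier occurrence of the suffix.
    for start in range(len(token_ids) - n - 1, -1, -1):
        if tuple(token_ids[start:start + n]) == suffix:
            return token_ids[start + n:start + n + k]
    return []

history = [1, 2, 3, 4, 1, 2]            # the newest sampled token is 2
print(ngram_propose(history))           # [3, 4, 1] -- suffix (1, 2) was seen before
print(ngram_propose(history[:-1]))      # []        -- a stale length drops the 2,
                                        # suffix (4, 1) never occurred, no proposal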

Comment on lines +42 to +71
+        valid_ngram_requests = []
         for i, sampled_ids in enumerate(valid_sampled_token_ids):
             num_sampled_ids = len(sampled_ids)
             if not num_sampled_ids:
                 # Skip speculative decoding.
-                draft_token_ids.append([])
                 continue

             # Skip requests that require top-p, top-k, etc.
             req_id = self.runner.input_batch.req_ids[i]
             if req_id in self.runner.input_batch.spec_decode_unsupported_reqs:
-                draft_token_ids.append([])
                 continue

             # Add sampled_token_ids to token_ids_cpu.
             num_tokens = self.runner.input_batch.num_tokens_no_spec[i]
             if num_tokens >= self.runner.input_batch.max_model_len:
                 # Skip requests that have already reached the max model length.
                 continue

             start_idx = self.runner.input_batch.num_tokens_no_spec[i]
             end_idx = start_idx + num_sampled_ids
             self.runner.input_batch.token_ids_cpu[
                 i, start_idx:end_idx] = sampled_ids
-            drafter_output = self.propose(
-                self.runner.input_batch.token_ids_cpu[i, :end_idx])
-            if drafter_output is None or len(drafter_output) == 0:
-                draft_token_ids.append([])
-            else:
-                draft_token_ids.append(drafter_output.tolist())
-        return draft_token_ids
+
+            valid_ngram_requests.append(i)
+
+        draft_token_ids = self.batch_propose(
+            len(valid_sampled_token_ids),
+            valid_ngram_requests,
+            self.runner.input_batch.num_tokens_no_spec,
+            self.runner.input_batch.token_ids_cpu,
+        )
+
+        return draft_token_ids
Contributor


critical

The batch_propose method is called with self.runner.input_batch.num_tokens_no_spec, which contains the sequence lengths before the newly sampled tokens are appended. This causes the propose method to look for n-grams in an outdated sequence, missing the most recent token. This will likely cause the n-gram matching to fail and prevent any speculative tokens from being proposed.

The fix is to calculate the new sequence lengths after appending the sampled tokens and pass these new lengths to batch_propose. This aligns with the corrected implementation in upstream vLLM.

        valid_ngram_requests = []
        new_num_tokens = self.runner.input_batch.num_tokens_no_spec.copy()
        for i, sampled_ids in enumerate(valid_sampled_token_ids):
            num_sampled_ids = len(sampled_ids)
            if not num_sampled_ids:
                continue

            req_id = self.runner.input_batch.req_ids[i]
            if req_id in self.runner.input_batch.spec_decode_unsupported_reqs:
                continue

            num_tokens = self.runner.input_batch.num_tokens_no_spec[i]
            if num_tokens >= self.runner.input_batch.max_model_len:
                # Skip requests that have already reached the max model length.
                continue

            start_idx = self.runner.input_batch.num_tokens_no_spec[i]
            end_idx = start_idx + num_sampled_ids
            self.runner.input_batch.token_ids_cpu[
                i, start_idx:end_idx] = sampled_ids
            new_num_tokens[i] = end_idx

            valid_ngram_requests.append(i)

        draft_token_ids = self.batch_propose(
            len(valid_sampled_token_ids),
            valid_ngram_requests,
            new_num_tokens,
            self.runner.input_batch.token_ids_cpu,
        )

        return draft_token_ids

@wangxiyuan
Collaborator

is this a cherry-pick? If yes, please mention it in the commit message or title

Collaborator Author

wxsIcey commented Nov 5, 2025

> is this a cherry-pick? If yes, please mention it in the commit message or title

This PR fixes the issue introduced upstream, but ngram spec decode still has unresolved accuracy issues, so I'm not sure whether this PR can be merged.

@wxsIcey wxsIcey closed this Nov 10, 2025
