
[Feature] [Spec decode]: Combine chunked prefill with speculative decoding #5016

@cadedaniel

🚀 The feature, motivation and pitch

Speculative decoding can achieve 50%+ latency reduction, but in vLLM it can suffer under the throughput-optimized default scheduling strategy, which eagerly prioritizes prefills. Chunked prefill is a recent feature in vLLM that mitigates this by spreading prefill work across many decode batches. We can combine chunked prefill with speculative decoding's dynamic speculation length to get the best of both worlds.
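To make the idea concrete, here is a minimal sketch of how such a combined scheduler step could split a fixed token budget between a prefill chunk and speculative decode tokens. This is a hypothetical illustration, not vLLM's actual scheduler API; all names (`plan_step`, `token_budget`, `prefill_chunk`) are invented for this example.

```python
def plan_step(num_decode_seqs, pending_prefill_tokens,
              token_budget=256, max_spec_len=4, prefill_chunk=64):
    """Return (prefill_tokens, spec_len) for one hypothetical scheduler step.

    Illustrative only: splits a fixed per-step token budget between one
    chunk of prefill work and speculative decode tokens, shrinking the
    speculation length as less budget remains for decodes.
    """
    # Take at most one prefill chunk per step so decodes keep running.
    prefill_tokens = min(pending_prefill_tokens, prefill_chunk, token_budget)
    remaining = token_budget - prefill_tokens
    if num_decode_seqs == 0:
        return prefill_tokens, 0
    # Each decode sequence verifies 1 bonus token plus spec_len speculated
    # tokens; pick the largest speculation length that fits the leftover budget.
    per_seq = remaining // num_decode_seqs
    spec_len = max(0, min(max_spec_len, per_seq - 1))
    return prefill_tokens, spec_len
```

With 8 decode sequences and pending prefill, this sketch keeps the full speculation length; with 100 decode sequences, the per-sequence budget collapses and speculation is disabled for the step, which is the "dynamic speculation length" behavior the pitch refers to.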

This is a complex task that requires some design; if you're interested, please reach out.

Alternatives

No response

Additional context

cc @LiuXiaoxuanPKU @comaniac @rkooo567
