
[Feature] [Spec decode]: Combine chunked prefill with speculative decoding #5016

@cadedaniel

🚀 The feature, motivation and pitch

Speculative decoding can achieve 50%+ latency reduction, but in vLLM it can suffer under the throughput-optimized default scheduling strategy, which eagerly prioritizes prefills. Chunked prefill is a recent feature in vLLM that mitigates this by spreading prefill work across many decode batches. We can combine chunked prefill with speculative decoding's dynamic speculation length to get the best of both worlds.
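To make the idea concrete, here is a minimal sketch of how such a combined scheduler step could split a fixed token budget between a prefill chunk and speculative decode tokens. This is a hypothetical illustration, not vLLM's actual scheduler API; all names (`plan_step`, `token_budget`, `prefill_chunk`) are invented for this example.

```python
def plan_step(num_decode_seqs, pending_prefill_tokens,
              token_budget=256, max_spec_len=4, prefill_chunk=64):
    """Return (prefill_tokens, spec_len) for one hypothetical scheduler step.

    Illustrative only: splits a fixed per-step token budget between one
    chunk of prefill work and speculative decode tokens, shrinking the
    speculation length as less budget remains for decodes.
    """
    # Take at most one prefill chunk per step so decodes keep running.
    prefill_tokens = min(pending_prefill_tokens, prefill_chunk, token_budget)
    remaining = token_budget - prefill_tokens
    if num_decode_seqs == 0:
        return prefill_tokens, 0
    # Each decode sequence verifies 1 bonus token plus spec_len speculated
    # tokens; pick the largest speculation length that fits the leftover budget.
    per_seq = remaining // num_decode_seqs
    spec_len = max(0, min(max_spec_len, per_seq - 1))
    return prefill_tokens, spec_len
```

With 8 decode sequences and pending prefill, this sketch keeps the full speculation length; with 100 decode sequences, the per-sequence budget collapses and speculation is disabled for the step, which is the "dynamic speculation length" behavior the pitch refers to.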

This is a complex task that requires some design; if you're interested, please reach out.

Alternatives

No response

Additional context

cc @LiuXiaoxuanPKU @comaniac @rkooo567
