🚀 The feature, motivation and pitch
Speculative decoding can achieve 50%+ latency reduction, but in vLLM it can suffer under the throughput-optimized default scheduling strategy, which eagerly prioritizes prefills. Chunked prefill is recent work in vLLM that mitigates this by spreading prefill work across many decode batches. We can combine chunked prefill with speculative decoding's dynamic speculation length to get the best of both worlds.
This is a complex task that requires some design; if you're interested, please reach out.
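To make the idea concrete, here is a minimal, hypothetical sketch (not vLLM's actual scheduler API) of how a fixed per-step token budget might be split between chunked-prefill tokens and speculative-decode tokens, shrinking the speculation length as prefill pressure grows. The function name `plan_step` and all parameters are illustrative assumptions, not part of vLLM.

```python
# Hypothetical sketch: dynamic speculation length under a chunked-prefill
# token budget. This is NOT vLLM's scheduler; names and signatures are
# invented for illustration only.

from dataclasses import dataclass


@dataclass
class StepPlan:
    prefill_tokens: int       # chunk of pending prefill work to run this step
    speculation_length: int   # proposal length k for speculative decoding


def plan_step(token_budget: int,
              num_decode_seqs: int,
              pending_prefill_tokens: int,
              max_speculation_length: int = 5) -> StepPlan:
    """Split a fixed per-step token budget between decode and prefill.

    Each decode sequence costs (1 + k) tokens to verify k speculative
    tokens, so k shrinks from its maximum until the leftover budget can
    absorb a prefill chunk (or until no prefill work is pending).
    """
    for k in range(max_speculation_length, -1, -1):
        decode_cost = num_decode_seqs * (1 + k)
        if decode_cost <= token_budget:
            leftover = token_budget - decode_cost
            chunk = min(leftover, pending_prefill_tokens)
            if chunk > 0 or pending_prefill_tokens == 0:
                return StepPlan(prefill_tokens=chunk, speculation_length=k)
    # Budget too tight even for plain decode: spend it all on prefill.
    return StepPlan(prefill_tokens=min(token_budget, pending_prefill_tokens),
                    speculation_length=0)


if __name__ == "__main__":
    # 512-token budget, 32 decode sequences, 4096 prefill tokens queued.
    print(plan_step(512, 32, 4096))
```

The design choice this sketch illustrates: rather than treating prefill and speculation as mutually exclusive, the scheduler trades speculation length against prefill chunk size within one batch, so decodes keep their latency win while prefills still make steady progress.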
Alternatives
No response
Additional context
No response