docs: chunked prefill updated documentation #578
Conversation
Signed-off-by: Wallas Santos <[email protected]>
👋 Hi! Thank you for contributing to vLLM support on Spyre. Now you are good to go 🚀
Signed-off-by: Wallas Santos <[email protected]>
rafvasq left a comment:
Just a few nits! LGTM otherwise 👍
docs/user_guide/configuration.md (Outdated)
> Chunked prefill is a technique that improves ITL (Inference Time Latency) in continuous batching mode when large prompts need to be prefetched. Without it, these large prefills can negatively impact the performance of ongoing decodes. In essence, chunked prefill divides incoming prompts into smaller segments and processes them incrementally, allowing the system to balance prefill work with active decoding tasks.
> For configuration and tuning guidance, see the [vLLM official documentation on chunked prefill](https://docs.vllm.ai/en/latest/configuration/optimization/#chunked-prefill.).
nit: I don't think it breaks the link, but the trailing period shouldn't be inside the URL.
Suggested change:
- For configuration and tuning guidance, see the [vLLM official documentation on chunked prefill](https://docs.vllm.ai/en/latest/configuration/optimization/#chunked-prefill.).
+ For configuration and tuning guidance, see the [vLLM official documentation on chunked prefill](https://docs.vllm.ai/en/latest/configuration/optimization/#chunked-prefill).
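The paragraph quoted above describes the mechanism only in prose. A minimal sketch of the idea, assuming a hypothetical `split_prompt_into_chunks` helper (not part of vLLM or vLLM-Spyre) and the `max_num_batched_tokens` chunk size discussed later in this review:

```python
def split_prompt_into_chunks(prompt_token_ids: list[int],
                             max_num_batched_tokens: int) -> list[list[int]]:
    """Illustrative only: divide a long prompt into fixed-size chunks so that
    prefill work can be interleaved with ongoing decode steps."""
    return [
        prompt_token_ids[i:i + max_num_batched_tokens]
        for i in range(0, len(prompt_token_ids), max_num_batched_tokens)
    ]

# A 10,000-token prompt with a 4,096-token chunk size is prefilled in three
# steps (4096 + 4096 + 1808) instead of one large prefill that stalls decodes.
chunks = split_prompt_into_chunks(list(range(10_000)), 4096)
print([len(c) for c in chunks])  # [4096, 4096, 1808]
```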
docs/user_guide/configuration.md (Outdated)
> As in vLLM, the `max_num_batched_tokens` parameter controls how chunks are formed. However, because current versions of vLLM-Spyre cannot prefill and decode within the same engine step, `max_num_batched_tokens` specifies the chunk size, whereas in upstream vLLM it represents a shared token budget for both prefills and decodes.
> This parameter should be tuned according to your infrastructure. For convenience, when using the model `ibm-granite/granite-3.3-8b-instruct` with `tp=4`, vLLM-Spyre automatically sets max_num_batched_tokens to `4096`, a value known to produce good results.
Suggested change:
- This parameter should be tuned according to your infrastructure. For convenience, when using the model `ibm-granite/granite-3.3-8b-instruct` with `tp=4`, vLLM-Spyre automatically sets max_num_batched_tokens to `4096`, a value known to produce good results.
+ This parameter should be tuned according to your infrastructure. For convenience, when using the model `ibm-granite/granite-3.3-8b-instruct` with `tp=4`, vLLM-Spyre automatically sets `max_num_batched_tokens` to `4096`, a value known to produce good results.
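Where the automatic 4096 default for `ibm-granite/granite-3.3-8b-instruct` at `tp=4` does not apply, the value can be passed explicitly. A minimal sketch using the offline `LLM` entry point; it is not taken from the PR and assumes the Spyre environment variables discussed later in this conversation are already set:

```python
from vllm import LLM

# On vLLM-Spyre this value is the prefill chunk size; on upstream vLLM it is a
# shared per-step token budget for prefills and decodes.
llm = LLM(
    model="ibm-granite/granite-3.3-8b-instruct",
    tensor_parallel_size=4,
    max_num_batched_tokens=4096,  # the automatic default for this model at tp=4
)
```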
docs/user_guide/configuration.md (Outdated)
> ## Chunked Prefill
> Chunked prefill is a technique that improves ITL (Inference Time Latency) in continuous batching mode when large prompts need to be prefetched. Without it, these large prefills can negatively impact the performance of ongoing decodes. In essence, chunked prefill divides incoming prompts into smaller segments and processes them incrementally, allowing the system to balance prefill work with active decoding tasks.
Suggested change:
- Chunked prefill is a technique that improves ITL (Inference Time Latency) in continuous batching mode when large prompts need to be prefetched. Without it, these large prefills can negatively impact the performance of ongoing decodes. In essence, chunked prefill divides incoming prompts into smaller segments and processes them incrementally, allowing the system to balance prefill work with active decoding tasks.
+ Chunked prefill is a technique that improves inference time latency (ITL) in continuous batching mode when large prompts need to be prefetched. Without it, these large prefills can negatively impact the performance of ongoing decodes. In essence, chunked prefill divides incoming prompts into smaller segments and processes them incrementally, allowing the system to balance prefill work with active decoding tasks.
It's Inter-Token Latency, that was close! hahah.
I fixed it and put it in a style similar to the one you suggested.
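Since the review settled on the term Inter-Token Latency, here is a small illustrative sketch (not from the PR; the helper and the timestamps are invented) of how ITL is typically computed from per-token arrival times, and why one long un-chunked prefill of another request shows up in it:

```python
def inter_token_latency(token_timestamps: list[float]) -> float:
    """Mean gap between consecutive generated tokens, in seconds. A large
    un-chunked prefill of another request appears as a spike in these gaps."""
    gaps = [t1 - t0 for t0, t1 in zip(token_timestamps, token_timestamps[1:])]
    return sum(gaps) / len(gaps)

# Example: a 150 ms stall caused by a large prefill inflates the mean ITL.
print(inter_token_latency([0.00, 0.02, 0.04, 0.19, 0.21]))  # 0.0525 s
```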
docs/user_guide/configuration.md (Outdated)
> In the vLLM v1 engine, this feature is enabled by default. In vLLM-Spyre, however, users must explicitly enable it by setting the environment variable `VLLM_SPYRE_USE_CHUNKED_PREFILL=1`.
> !!! note
> Chunked prefill requires continuous batching to be enabled by setting: `VLLM_SPYRE_USE_CB=1`.
Suggested change:
- Chunked prefill requires continuous batching to be enabled by setting: `VLLM_SPYRE_USE_CB=1`.
+ Chunked prefill requires continuous batching to be enabled by setting `VLLM_SPYRE_USE_CB=1`.
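A minimal sketch of enabling the two environment variables named above. Both variable names come from the documentation text being reviewed; the use of `os.environ` rather than the shell, and setting them before the engine is constructed, are assumptions:

```python
import os

# Set before the engine is created (assumption): both variables are read by
# the vLLM-Spyre plugin, per the documentation text quoted above.
os.environ["VLLM_SPYRE_USE_CB"] = "1"               # continuous batching (required)
os.environ["VLLM_SPYRE_USE_CHUNKED_PREFILL"] = "1"  # opt in to chunked prefill on Spyre

# The engine can then be constructed as in the earlier LLM() sketch.
```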
docs/user_guide/configuration.md (Outdated)
> !!! note
> Chunked prefill requires continuous batching to be enabled by setting: `VLLM_SPYRE_USE_CB=1`.
> As in vLLM, the `max_num_batched_tokens` parameter controls how chunks are formed. However, because current versions of vLLM-Spyre cannot prefill and decode within the same engine step, `max_num_batched_tokens` specifies the chunk size, whereas in upstream vLLM it represents a shared token budget for both prefills and decodes.
Maybe also mention that in vLLM-Spyre we currently only prefill a single prompt at a time with continuous batching.
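To illustrate the constraint discussed in this thread, a toy sketch. It is not the real vLLM-Spyre scheduler: the `ToyRequest` class and the priority rule are invented, and only the behavior that a step is either one prefill chunk for a single request or one decode step for the whole running batch, never both, reflects the text above:

```python
from dataclasses import dataclass, field

@dataclass
class ToyRequest:
    remaining_prompt: list[int]                      # prompt tokens not yet prefilled
    generated: list[int] = field(default_factory=list)

def toy_engine_step(waiting: list[ToyRequest], running: list[ToyRequest],
                    chunk_size: int) -> str:
    """Toy illustration only: one engine step does either a single prefill
    chunk for a single request, or one decode for the whole running batch."""
    if waiting:
        req = waiting[0]
        req.remaining_prompt = req.remaining_prompt[chunk_size:]
        if not req.remaining_prompt:                 # prompt fully prefilled
            running.append(waiting.pop(0))
        return "prefill step (one request, one chunk)"
    for req in running:
        req.generated.append(0)                      # placeholder decoded token
    return "decode step (whole running batch)"

waiting = [ToyRequest(remaining_prompt=list(range(10_000)))]
running: list[ToyRequest] = []
for _ in range(5):
    print(toy_engine_step(waiting, running, chunk_size=4096))
# Three prefill-only steps, then decode-only steps.
```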
docs/user_guide/configuration.md (Outdated)
> As in vLLM, the `max_num_batched_tokens` parameter controls how chunks are formed. However, because current versions of vLLM-Spyre cannot prefill and decode within the same engine step, `max_num_batched_tokens` specifies the chunk size, whereas in upstream vLLM it represents a shared token budget for both prefills and decodes.
> This parameter should be tuned according to your infrastructure. For convenience, when using the model `ibm-granite/granite-3.3-8b-instruct` with `tp=4`, vLLM-Spyre automatically sets max_num_batched_tokens to `4096`, a value known to produce good results.
"good results" is a bit ambiguous. 4K will maximize hardware utilization but won't necessarily maximize prefix cache hits. This setting should ideally stay between 1K and 4K and must be a multiple of the block size (64).
Thanks for the suggestions, good catch on the multiple of 64.
Q: But is prefix caching based on chunk size? Should it be based on just blocks?
No, it has to be based on chunk size as well.
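Following the tuning advice in this thread, a hypothetical helper (not part of vLLM-Spyre; the name, the inclusive 1024 to 4096 interpretation of "1K and 4K", and the round-down behavior are assumptions) that keeps a chosen `max_num_batched_tokens` in the suggested range and aligns it to the 64-token block size:

```python
SPYRE_BLOCK_SIZE = 64  # block size mentioned in the review comment above

def check_chunk_size(max_num_batched_tokens: int) -> int:
    """Illustrative helper: enforce the suggested 1K to 4K range and round
    down to a multiple of the block size."""
    if not 1024 <= max_num_batched_tokens <= 4096:
        raise ValueError("suggested range is 1024 to 4096 tokens")
    return (max_num_batched_tokens // SPYRE_BLOCK_SIZE) * SPYRE_BLOCK_SIZE

print(check_chunk_size(2000))  # 1984, the nearest lower multiple of 64
```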
maxdebayser left a comment:
Thanks! This looks good already. I just requested some small additions.
Signed-off-by: Wallas Santos <[email protected]>
maxdebayser left a comment:
LGTM
Description
Updated the user guide and the supported features page in the vLLM-Spyre documentation.