
Conversation

@wallashss
Collaborator

Description

Updated the user guide and supported features pages in the vllm-spyre documentation.

@wallashss wallashss requested a review from rafvasq as a code owner November 25, 2025 14:14
@github-actions

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR can't be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

uv sync --frozen --group lint --active --inexact

Or this can be done with pip:

uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

Signed-off-by: Wallas Santos <[email protected]>
Collaborator

@rafvasq rafvasq left a comment


Just a few nits! LGTM otherwise 👍


Chunked prefill is a technique that improves ITL (Inference Time Latency) in continuous batching mode when large prompts need to be prefetched. Without it, these large prefills can negatively impact the performance of ongoing decodes. In essence, chunked prefill divides incoming prompts into smaller segments and processes them incrementally, allowing the system to balance prefill work with active decoding tasks.

For configuration and tuning guidance, see the [vLLM official documentation on chunked prefill](https://docs.vllm.ai/en/latest/configuration/optimization/#chunked-prefill.).
Collaborator


nit: I don't think it breaks the link

Suggested change
For configuration and tuning guidance, see the [vLLM official documentation on chunked prefill](https://docs.vllm.ai/en/latest/configuration/optimization/#chunked-prefill.).
For configuration and tuning guidance, see the [vLLM official documentation on chunked prefill](https://docs.vllm.ai/en/latest/configuration/optimization/#chunked-prefill).


As in vLLM, the `max_num_batched_tokens` parameter controls how chunks are formed. However, because current versions of vLLM-Spyre cannot prefill and decode within the same engine step, `max_num_batched_tokens` specifies the chunk size, whereas in upstream vLLM it represents a shared token budget for both prefills and decodes.

This parameter should be tuned according to your infrastructure. For convenience, when using the model `ibm-granite/granite-3.3-8b-instruct` with `tp=4`, vLLM-Spyre automatically sets max_num_batched_tokens to `4096`, a value known to produce good results.
Collaborator


Suggested change
This parameter should be tuned according to your infrastructure. For convenience, when using the model `ibm-granite/granite-3.3-8b-instruct` with `tp=4`, vLLM-Spyre automatically sets max_num_batched_tokens to `4096`, a value known to produce good results.
This parameter should be tuned according to your infrastructure. For convenience, when using the model `ibm-granite/granite-3.3-8b-instruct` with `tp=4`, vLLM-Spyre automatically sets `max_num_batched_tokens` to `4096`, a value known to produce good results.
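
As a back-of-the-envelope illustration of the chunk-size behavior described above (this is plain arithmetic, not vLLM-Spyre scheduler code, and the function name is made up), a prompt longer than `max_num_batched_tokens` is prefilled over several engine steps:

```python
# Illustrative sketch only: how a long prompt splits into prefill chunks when
# max_num_batched_tokens acts as the chunk size. Not vLLM-Spyre scheduler code.
def prefill_chunks(prompt_len: int, max_num_batched_tokens: int) -> list[int]:
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        step = min(remaining, max_num_batched_tokens)
        chunks.append(step)
        remaining -= step
    return chunks

print(prefill_chunks(10_000, 4096))  # [4096, 4096, 1808] -> three prefill steps
```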


## Chunked Prefill

Chunked prefill is a technique that improves ITL (Inference Time Latency) in continuous batching mode when large prompts need to be prefetched. Without it, these large prefills can negatively impact the performance of ongoing decodes. In essence, chunked prefill divides incoming prompts into smaller segments and processes them incrementally, allowing the system to balance prefill work with active decoding tasks.
Collaborator


Suggested change
Chunked prefill is a technique that improves ITL (Inference Time Latency) in continuous batching mode when large prompts need to be prefetched. Without it, these large prefills can negatively impact the performance of ongoing decodes. In essence, chunked prefill divides incoming prompts into smaller segments and processes them incrementally, allowing the system to balance prefill work with active decoding tasks.
Chunked prefill is a technique that improves inference time latency (ITL) in continuous batching mode when large prompts need to be prefetched. Without it, these large prefills can negatively impact the performance of ongoing decodes. In essence, chunked prefill divides incoming prompts into smaller segments and processes them incrementally, allowing the system to balance prefill work with active decoding tasks.

Collaborator Author


It's Inter-Token Latency, that was close! hahah.

I fixed it and put it in a similar style to the one you suggested.
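
For readers not familiar with the term: inter-token latency is the gap between consecutive output tokens. A toy illustration with made-up timestamps, just to show what a large prefill does to those gaps:

```python
# Toy illustration of inter-token latency (ITL); the timestamps are invented.
token_times = [0.80, 0.88, 0.96, 1.30, 1.38]  # seconds at which each output token arrived
gaps = [b - a for a, b in zip(token_times, token_times[1:])]
print([round(g, 2) for g in gaps])        # [0.08, 0.08, 0.34, 0.08] -- the 0.34 s gap is a
                                          # decode step delayed by a large prefill
print(round(sum(gaps) / len(gaps), 3))    # mean ITL ~= 0.145 s
```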

In the vLLM v1 engine, this feature is enabled by default. In vLLM-Spyre, however, users must explicitly enable it by setting the environment variable `VLLM_SPYRE_USE_CHUNKED_PREFILL=1`.

!!! note
Chunked prefill requires continuous batching to be enabled by setting: `VLLM_SPYRE_USE_CB=1`.
Collaborator


Suggested change
Chunked prefill requires continuous batching to be enabled by setting: `VLLM_SPYRE_USE_CB=1`.
Chunked prefill requires continuous batching to be enabled by setting `VLLM_SPYRE_USE_CB=1`.
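
For reference, a minimal offline-inference sketch combining the settings discussed in this thread. Only the two environment variables and the granite/tp=4 example come from the documentation above; the rest assumes the standard vLLM Python entrypoint behaves the same on Spyre:

```python
import os

# Chunked prefill requires continuous batching (both flags documented above).
os.environ["VLLM_SPYRE_USE_CB"] = "1"
os.environ["VLLM_SPYRE_USE_CHUNKED_PREFILL"] = "1"

from vllm import LLM, SamplingParams  # import after setting the env vars

# tp=4 and the 4096-token chunk size mirror the granite-3.3-8b example above.
llm = LLM(
    model="ibm-granite/granite-3.3-8b-instruct",
    tensor_parallel_size=4,
    max_num_batched_tokens=4096,
)
out = llm.generate(["Tell me about IBM Spyre."], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```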

!!! note
Chunked prefill requires continuous batching to be enabled by setting: `VLLM_SPYRE_USE_CB=1`.

As in vLLM, the `max_num_batched_tokens` parameter controls how chunks are formed. However, because current versions of vLLM-Spyre cannot prefill and decode within the same engine step, `max_num_batched_tokens` specifies the chunk size, whereas in upstream vLLM it represents a shared token budget for both prefills and decodes.
Collaborator


Maybe also mention that in vllm-spyre we currently only prefill a single prompt with continuous batching.


As in vLLM, the `max_num_batched_tokens` parameter controls how chunks are formed. However, because current versions of vLLM-Spyre cannot prefill and decode within the same engine step, `max_num_batched_tokens` specifies the chunk size, whereas in upstream vLLM it represents a shared token budget for both prefills and decodes.

This parameter should be tuned according to your infrastructure. For convenience, when using the model `ibm-granite/granite-3.3-8b-instruct` with `tp=4`, vLLM-Spyre automatically sets max_num_batched_tokens to `4096`, a value known to produce good results.
Collaborator


"good results" is a bit ambiguous. 4K will maximize hardware utilization but won't necessarily maximize prefix cache hits. This setting should ideally stay between 1K and 4K and must be a multiple of the block size (64).

Collaborator Author


Thanks for the suggestions, good catch on the multiple of 64.
Q: But is prefix caching based on chunk size? Should it be based on just blocks?

Collaborator


No, it has to be based on chunk size as well.
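
To make the tuning guidance from this thread concrete, here is a small check (the 64-token block size and the 1K-4K range come from the comment above; the helper is hypothetical, not part of vLLM-Spyre):

```python
# Hypothetical helper reflecting the guidance in this thread: keep
# max_num_batched_tokens a multiple of the 64-token block size and
# ideally between 1024 and 4096.
BLOCK_SIZE = 64

def check_max_num_batched_tokens(value: int) -> None:
    if value % BLOCK_SIZE != 0:
        raise ValueError(f"{value} is not a multiple of the block size ({BLOCK_SIZE})")
    if not 1024 <= value <= 4096:
        print(f"warning: {value} is outside the suggested 1K-4K range")

check_max_num_batched_tokens(2048)  # passes: multiple of 64, within the suggested range
```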

Collaborator

@maxdebayser maxdebayser left a comment


Thanks! This looks good already. I just requested some small additions.

Signed-off-by: Wallas Santos <[email protected]>
Collaborator

@maxdebayser maxdebayser left a comment


LGTM

@wallashss wallashss merged commit 344d700 into main Nov 25, 2025
19 checks passed
@wallashss wallashss deleted the wallas-cp-docs branch November 25, 2025 18:31