
Tkv limit bug #954

Merged

joerunde merged 8 commits into torch-spyre:main from Daniel-Schenker:tkv_limit_bug
May 4, 2026

Conversation

@Daniel-Schenker
Collaborator

Description

This PR fixes a bug in the scheduler code where requests that would eventually exceed the max batch tkv limit could be admitted into an ongoing batch. It includes a new test case that reproduces the error and the fix for it.
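At a high level, the fix accounts for the new request's eventual token length, rounded up for padding, before admitting it. A minimal sketch of that reasoning, with illustrative names (the real logic lives in sendnn_inference/v1/core/scheduler.py):

def fits_tkv_limit(new_req_tkv: int, max_tokens: int, batch_max_tkv: int,
                   batch_size: int, tkv_limit: int) -> bool:
    # Effective token length the new request will reach once all of its
    # output tokens have been generated
    new_req_max_tkv = new_req_tkv + max_tokens - 1
    # Rounded up to the end of the block to account for padding
    new_req_max_tkv = round_up_to_block_size(new_req_max_tkv)
    # The tkv is shared across the batch, so the limit must hold for the
    # enlarged batch at its eventual maximum tkv
    return max(batch_max_tkv, new_req_max_tkv) * (batch_size + 1) <= tkv_limit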

Related Issues

Test Plan

Run the test file with: pytest tests/v1/worker/test_scheduler_tkv_limits.py -sv
The second test case will pass with the changes to sendnn_inference/v1/core/scheduler.py included in this PR, and will fail without them.

Checklist

  • I have read the contributing guidelines
  • My code follows the project's code style (run bash format.sh)
  • I have added tests for my changes (if applicable)
  • I have updated the documentation (if applicable)
  • My commits include a Signed-off-by: line (DCO compliance)

@github-actions

👋 Hi! Thank you for contributing.
Just a reminder: make sure that your code passes all the linting checks, otherwise your PR can't be merged. To do so, run ./format.sh.
Now you are good to go 🚀.

We also recommend installing prek and configuring it to check your code before every local commit.

@Daniel-Schenker
Collaborator Author

Initial proposed changes are here. More than happy to address any comments and to add or change code and comments if anything is unclear.

Comment thread: sendnn_inference/v1/core/scheduler.py (outdated)
# Helper function to round up to the nearest block size
# Uses bitwise alignment for better performance
block_size = SpyrePlatform.get_block_size()
return (n + block_size - 1) & ~(block_size - 1)
Collaborator


I'm not sure we should make this generic, because the bitwise ops only work if the block size is a power of 2. What about hardcoding 64 here, and sticking an assert in the calling code that block_size == 64?

I don't feel strongly about this, since we never expect the block size to change; just a slight preference to not leave a foot-gun lying around.
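Something like this, as a sketch of the suggestion (the assert placement is illustrative):

def round_up_to_block_size(n: int) -> int:
    # Round up to the nearest multiple of 64; the bit trick is only valid
    # because 64 is a power of two
    return (n + 63) & ~63

# In the calling code:
assert SpyrePlatform.get_block_size() == 64, \
    "bitwise rounding assumes a block size of 64"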

Collaborator Author


Addressed

monkeypatch.setenv("VLLM_DT_MAX_BATCH_TKV_LIMIT", "131072")

# Define prompt lengths for first set of requests
prompt_lengths_1 = [
Collaborator


[1024] * 16?

or [1018] + [1024] * 15 if the first 1018 is relevant

Collaborator Author


Addressed. I also realized that the second set of prompt lengths was identical to the first; the only difference between the two sets of requests is the max tokens, so I consolidated that as well.

Comment thread: sendnn_inference/v1/core/scheduler.py (outdated)

# Compute the effective token length of the new request
new_req_max_tkv = new_req_tkv + request.max_tokens - 1
# Rounded up to the nearest block size
Collaborator


Should we mention here (and below) that we're rounding up to the end of the block to account for padding?

Collaborator Author


Addressed


@pytest.mark.cpu
@pytest.mark.chunked_prefill
def test_scheduler_tkv_limits_ongoing_batch(monkeypatch: pytest.MonkeyPatch):
Collaborator


A high-level overview of what's going on here would be great.

IIUC this situation is:

  • We first schedule a 16x8k batch that would fully fill the 128k TKV limit (see the arithmetic check after this list)
  • We then inject a batch of smaller requests partway through processing, which should be able to schedule only because they are guaranteed to finish processing just before the TKV is long enough to overrun the limit with the larger batch size
  • This flexes the logic for injecting shorter requests into a running batch, which is not tested by the other test in this file
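For concreteness, the arithmetic behind the first bullet, using the VLLM_DT_MAX_BATCH_TKV_LIMIT of 131072 set in the test:

# 16 requests, each reaching a tkv of 8192 (8k), exactly fill the limit
assert 16 * 8192 == 131072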

Collaborator Author


Added

@Daniel-Schenker
Collaborator Author

Initial comments have been addressed. I have re-confirmed that the new test case fails without scheduler.py changes and passes with changes.

Collaborator

@yannicks1 left a comment


To me it looks like this bug was (at least partially) introduced in #913 (please correct me if I am wrong). Prior to #913 we passed n_blocks, which carried information about the new request, into the function and did the calculation based on it. While we fixed an edge case in #913, the new approach no longer used n_blocks, so the new request was not considered.

With this PR we consider the new request again while still fixing the edge case bug.

def round_up_to_block_size(n: int) -> int:
    # Helper function to round up to the nearest block size
    # Uses bitwise alignment for better performance
    return (n + 63) & ~63
Collaborator


spicy stuff 🌶️
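For the record, a quick worked example of why the masking is correct for a power-of-two block size:

# (n + 63) & ~63 adds 63, then clears the low six bits:
#   n = 100: 100 + 63 = 163 = 0b10100011; 163 & ~63 = 0b10000000 = 128
#   n = 128: 128 + 63 = 191; 191 & ~63 = 128 (already aligned, unchanged)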

Comment thread: sendnn_inference/v1/core/scheduler.py (outdated)
Collaborator


n_blocks is unused, please remove

Collaborator Author


Removed

@yannicks1
Collaborator

> Initial comments have been addressed. I have re-confirmed that the new test case fails without scheduler.py changes and passes with changes.

Could you quickly run your newly added test (the one that captured the new failure) against a pre-#913 commit? That would help me understand better...

Signed-off-by: Daniel Schenker <[email protected]>
Comment thread: sendnn_inference/v1/core/scheduler.py (outdated)
Collaborator


this needs to go too ...

Collaborator Author


Removed

Signed-off-by: Daniel Schenker <[email protected]>
Collaborator

@joerunde left a comment


ship it!

@joerunde joerunde merged commit 395d367 into torch-spyre:main May 4, 2026
11 checks passed