Skip to content

[WIP] fix check_batch_tkv_limit_cp conservatively#961

Merged
sducouedic merged 1 commit intotorch-spyre:mainfrom
yannicks1:tkv-batch-check-safety-margin
May 8, 2026
Merged

[WIP] fix check_batch_tkv_limit_cp conservatively#961
sducouedic merged 1 commit intotorch-spyre:mainfrom
yannicks1:tkv-batch-check-safety-margin

Conversation

@yannicks1
Copy link
Copy Markdown
Collaborator

added an extra block of slack (left-padding can push a sequence's runtime tkv up to one block past the scheduler's estimate when the batch re-aligns on admission) for both new and decode requests.

…nt for all edge cases

Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 8, 2026

👋 Hi! Thank you for contributing.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, run ./format.sh.
Now you are good to go 🚀.

We also recommend installing prek and configuring it to check your code before every local commit.

@yannicks1
Copy link
Copy Markdown
Collaborator Author

bot:bench
NUM_PROMPTS=1000
MAX_RUN_TIME=36000
IGNORE_EOS=1
CUSTOM_OUTPUT_LEN=-1
MAX_CONCURRENT=4

@yannicks1
Copy link
Copy Markdown
Collaborator Author

bot:bench
NUM_PROMPTS=1000
MAX_RUN_TIME=36000
IGNORE_EOS=1
CUSTOM_OUTPUT_LEN=-1
MAX_CONCURRENT=8

@yannicks1
Copy link
Copy Markdown
Collaborator Author

bot:test

@yannicks1
Copy link
Copy Markdown
Collaborator Author

bot:bench
NUM_PROMPTS=1000
MAX_RUN_TIME=36000
IGNORE_EOS=1
CUSTOM_OUTPUT_LEN=-1
MAX_CONCURRENT=8

new_req_max_tkv = round_up_to_block_size(new_req_tkv + request.max_tokens - 1)
# Extra block of slack: left-padding can push a sequence's runtime tkv up to
# one block past the scheduler's estimate when the batch re-aligns on admission.
new_req_max_tkv += self.block_size
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dang, yeah my reasoning on only needing to round up was that the padding only happens when a request actually gets to the next block, so if our math is correct here then you'd only get to the next block if some request was going to end inside it, in which case that request would round up and cover all the cases of other requests getting padded to somewhere within that block.

But clearly that reasoning isn't sound 🙃

Copy link
Copy Markdown
Collaborator

@joerunde joerunde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ship it

@joerunde joerunde marked this pull request as ready for review May 8, 2026 17:30
@sducouedic sducouedic merged commit ce90aab into torch-spyre:main May 8, 2026
17 of 18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants