
Conversation

@qthequartermasterman qthequartermasterman commented Sep 26, 2025

Purpose

PR #24278 introduced casting tensors to the CPU inside MsgpackEncoder before serializing. Although this did not introduce any measurable performance regression at the time, casting between devices can be very expensive, and doing so every time a tensor is sent between processes has the potential to introduce major performance regressions. Tensors that will be serialized with Msgpack should instead be explicitly cast to CPU before any encoding, as recommended by @njhill.

This is a spiritual successor to #22962, which performs a similar cast to CPU in the OpenAI-compatible API. This change ensures that ALL prompt embeds tensors are cast to CPU before being processed, even when they are submitted in offline mode.
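The idea above can be sketched in a few lines. This is a minimal illustration, not the actual vLLM code; the helper name `prepare_prompt_embeds` is hypothetical, but the core point is real PyTorch behavior: `.cpu()` returns the tensor itself when it is already in CPU memory, so making the cast explicit at the input boundary is cheap for the common case and makes the device-to-host copy visible for the GPU case.

```python
import torch


def prepare_prompt_embeds(prompt_embeds: torch.Tensor) -> torch.Tensor:
    """Hypothetical input-preprocessing helper: move prompt embeds to CPU
    once, up front, so later msgpack serialization never has to perform an
    implicit device-to-host copy."""
    # Tensor.cpu() is a no-op (returns self) when the tensor already lives
    # on the CPU, so the explicit cast costs nothing in offline/CPU flows.
    return prompt_embeds.cpu()


# A CPU tensor passes through unchanged; a CUDA tensor would be copied here,
# at a point the caller can see, rather than deep inside the encoder.
embeds = torch.randn(4, 8)
out = prepare_prompt_embeds(embeds)
assert out.device.type == "cpu"
```

With this in place, the encoder can assume CPU tensors and the transfer cost (if any) is paid exactly once, where the request is preprocessed.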

Test Plan

No new tests are needed. I have a few local scripts that hit vLLM with prompt embeds across thousands of requests from different devices. Those local scripts all pass.

Test Result

Local relevant tests are passing. Pending CI.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

…den performance regressions

Signed-off-by: Andrew Sansom <[email protected]>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to prevent hidden performance regressions by explicitly moving tensors to the CPU before serialization, rather than relying on an implicit cast within the MsgpackEncoder. The change correctly adds a .cpu() call for prompt_embeds in the input preprocessing stage. However, reverting the .cpu() call in MsgpackEncoder._encode_tensor creates a risk for other types of tensor inputs, such as those from multi-modal data, which may not be on the CPU. This could lead to runtime errors. I've added a critical comment to address this by making the CPU requirement explicit with a check.
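The reviewer's suggestion can be sketched as follows. This is an illustrative snippet under assumptions, not vLLM's actual `MsgpackEncoder._encode_tensor`: the function name `encode_tensor` and the byte-level encoding are placeholders, and the point is only the explicit CPU check that fails loudly instead of silently copying.

```python
import torch


def encode_tensor(t: torch.Tensor) -> bytes:
    """Hypothetical encoder helper: require CPU tensors instead of casting,
    so any device-to-host transfer stays visible to (and paid by) callers."""
    if t.device.type != "cpu":
        raise ValueError(
            "msgpack tensor encoding expects CPU tensors; call .cpu() "
            "before serialization.")
    # numpy() shares memory with the CPU tensor; tobytes() copies once.
    return t.numpy().tobytes()


data = encode_tensor(torch.ones(2, 2))  # 4 float32 values -> 16 bytes
```

A GPU tensor (e.g. from multi-modal inputs that were never routed through the prompt-embeds preprocessing) would now raise immediately at the encoder boundary rather than triggering a hidden copy or a downstream serialization error.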

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) September 26, 2025 04:03
@DarkLight1337 DarkLight1337 added this to the v0.11.0 milestone Sep 26, 2025
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 26, 2025
@vllm-bot vllm-bot merged commit e84e073 into vllm-project:main Sep 26, 2025
45 of 50 checks passed
pdasigi pushed a commit to pdasigi/vllm that referenced this pull request Oct 2, 2025
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
…idden performance regressions (#25738)

Signed-off-by: Andrew Sansom <[email protected]>
Signed-off-by: yewentao256 <[email protected]>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
…idden performance regressions (vllm-project#25738)

Signed-off-by: Andrew Sansom <[email protected]>
Signed-off-by: xuebwang-amd <[email protected]>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1
