fix: revert cast to cpu in MsgpackEncoder._encode_tensor to avoid hidden performance regressions
#25738
Conversation
fix: revert cast to cpu in MsgpackEncoder._encode_tensor to avoid hidden performance regressions
Signed-off-by: Andrew Sansom <[email protected]>
Code Review
This pull request aims to prevent hidden performance regressions by explicitly moving tensors to the CPU before serialization, rather than relying on an implicit cast within the MsgpackEncoder. The change correctly adds a .cpu() call for prompt_embeds in the input preprocessing stage. However, reverting the .cpu() call in MsgpackEncoder._encode_tensor creates a risk for other types of tensor inputs, such as those from multi-modal data, which may not be on the CPU. This could lead to runtime errors. I've added a critical comment to address this by making the CPU requirement explicit with a check.
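As an illustration of that suggestion, a minimal sketch of an explicit device check is below. The helper name is hypothetical and not vLLM's actual implementation; it only shows the shape of a fail-fast check.

```python
# Hypothetical sketch of the reviewer's suggestion: fail fast on
# non-CPU tensors instead of paying a silent device-to-host copy on
# every serialization. `check_cpu_before_encode` is an illustrative
# name, not code from vLLM.
import torch

def check_cpu_before_encode(tensor: torch.Tensor) -> torch.Tensor:
    if tensor.device.type != "cpu":
        raise ValueError(
            "MsgpackEncoder expects CPU tensors; got a tensor on "
            f"{tensor.device}. Move it with tensor.cpu() at the call site.")
    return tensor
```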
Purpose
PR #24278 introduced casting tensors in MsgpackEncoder to the CPU before serializing. Although this did not introduce any performance regression at the time, casting between devices can be very expensive, and doing so every time a tensor is sent between processes has the potential to introduce major performance regressions. Tensors that will be serialized with Msgpack should be explicitly cast to CPU before any encoding, as recommended by @njhill. This is a spiritual successor to #22962, which does similar casting to CPU in the OpenAI-compatible API. This PR ensures that ALL prompt embeds tensors are cast to CPU before being processed, even if they are submitted in offline mode.
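A minimal sketch of that approach is below, assuming a hypothetical preprocessing helper (`preprocess_prompt_embeds` is not vLLM's actual function name):

```python
# Minimal sketch: cast prompt embeds to CPU once at input-preprocessing
# time, so the encoder never performs a hidden device-to-host copy per
# message. `preprocess_prompt_embeds` is a hypothetical name.
import torch

def preprocess_prompt_embeds(prompt_embeds: torch.Tensor) -> torch.Tensor:
    # torch.Tensor.cpu() returns the tensor itself when it already lives
    # on CPU, so the transfer cost is only paid when actually needed.
    return prompt_embeds.cpu()
```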
Test Plan
No new tests are needed. I have a few local scripts that I use to hit vLLM with thousands of prompt-embeds requests originating from different devices; those scripts all pass. A hedged sketch of such a script is shown below.
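For illustration only, a minimal version of such a smoke test might look like the sketch below. The `enable_prompt_embeds` flag and the `{"prompt_embeds": ...}` prompt shape are assumptions about the vLLM API and may differ across versions.

```python
# Hedged sketch of a local smoke test: submit prompt embeds that live on
# different devices and confirm requests still succeed. The
# `enable_prompt_embeds` flag and the {"prompt_embeds": ...} prompt form
# are assumptions and may vary by vLLM version.
import torch
from vllm import LLM

llm = LLM(model="facebook/opt-125m", enable_prompt_embeds=True)

hidden_size = 768  # assumed hidden size for the model above
cpu_embeds = torch.randn(16, hidden_size)
prompts = [{"prompt_embeds": cpu_embeds}]
if torch.cuda.is_available():
    # With this PR, GPU-resident embeds are cast to CPU during input
    # preprocessing rather than inside MsgpackEncoder._encode_tensor.
    prompts.append({"prompt_embeds": cpu_embeds.cuda()})

outputs = llm.generate(prompts)
for out in outputs:
    print(out.outputs[0].text)
```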
Test Result
Local relevant tests are passing. Pending CI.