[BugFix] Add support for loading prompt embeds tensors serialized on unavailable devices and sparse tensors #22962
Conversation
Signed-off-by: Andrew Sansom <[email protected]>
Signed-off-by: Andrew Sansom <[email protected]>
Signed-off-by: Andrew Sansom <[email protected]>
Code Review
This pull request effectively addresses two important issues with prompt embeds: handling tensors serialized on unavailable devices and supporting sparse tensors. The changes in serving_engine.py are clean and directly solve these problems by mapping loaded tensors to the CPU and converting them to a dense format. The addition of a property-based test using hypothesis is a great way to ensure the robustness of this functionality. I've found a minor issue in the new test that could cause it to fail for valid inputs, which I've commented on. Overall, this is a solid contribution.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Signed-off-by: Andrew Sansom <[email protected]>
DarkLight1337
left a comment
LGTM, thanks
import pybase64
import pytest
import regex as re
import torch
this is probably causing the fork issue
Thanks! I think it wasn't the torch import, but investigating your comment did help me better isolate the issue! Instead, hypothesis-torch had a side effect that was initializing CUDA during test collection. Because it ships a hypothesis plugin, this side effect of registering strategies for certain types occurred any time pytest was invoked, regardless of whether the collected tests depended on hypothesis or hypothesis-torch.
I submitted a patch to hypothesis-torch, but it didn't seem to fully resolve the issue for the failing tests when run locally. I realized, though, that even if that issue were resolved, simply trying to instantiate a CUDA tensor in this test would be enough to re-initialize CUDA and cause the failures in any tests that ran alongside these. So I decided to drop hypothesis-torch altogether for generating arbitrary tensors, as well as testing against non-CPU tensors.
The test is weaker than I'd like, but side effects make life difficult. 😢
We'll see if CI is happy after this. Thanks for taking a look.
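For context, a tiny illustration (written for this discussion, not code from the PR) of why collection-time CUDA initialization is so disruptive: once the pytest parent process initializes CUDA, fork-based subprocess tests can no longer use it safely.

```python
# Illustrative sketch only; not code from this PR.
import torch

# At clean collection time nothing has touched CUDA yet.
print(torch.cuda.is_initialized())  # False

if torch.cuda.is_available():
    # Merely constructing a CUDA tensor (as a plugin side effect or inside a
    # test) is enough to initialize CUDA in the parent process...
    _ = torch.zeros(1, device="cuda")
    print(torch.cuda.is_initialized())  # True

    # ...after which any test that forks a subprocess and touches CUDA fails
    # with "Cannot re-initialize CUDA in forked subprocess".
```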
Signed-off-by: Andrew Sansom <[email protected]>
Head branch was pushed to by a user without write access
Test is failing, PTAL
Signed-off-by: Andrew Sansom <[email protected]>
Head branch was pushed to by a user without write access
Purpose
While working on the prototype for #22124, I discovered two types of tensor inputs for Prompt Embeds that can cause downstream failures: tensors serialized on unavailable devices and sparse tensors.
(1) If you serialize a prompt embeds tensor on, say, a MacBook under 'mps' and then send it to vLLM today, vLLM will attempt to deserialize it onto the 'mps' device, which in general will not be available on machines running vLLM. By forcing all deserialized tensors onto the CPU in torch.load, we can better control those tensors; they will eventually be cast to the correct device later in the engine anyway.
(2) Because prompt_embeds take more memory than token IDs, transmission can be slower. One way to get around this is to use sparse tensors, for models/inputs where that makes sense. Because sparse tensors are never cast to dense tensors in vLLM today, both the v0 engine and the prototype I'm working on in the v1 engine will choke at various points.
This PR is mostly a superset of existing behavior. The only currently working behavior that is affected is when a non-CPU device is used for both saving and loading. In that case, the tensor will now be loaded onto the CPU (which does not have a significant runtime performance impact) and then cast to the appropriate device in the engine, whereas before it would be loaded onto the same device it was serialized on and then cast.
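To make the intended behavior concrete, here is a rough sketch of the loading path. The function name, base64 handling, and weights_only usage are assumptions for illustration, not the actual serving_engine.py change: deserialize onto the CPU via map_location, then densify any sparse layout.

```python
# Rough sketch; names and details are assumptions, not the exact vLLM code.
import base64
import io

import torch


def load_prompt_embeds(encoded: str) -> torch.Tensor:
    buffer = io.BytesIO(base64.b64decode(encoded))
    # map_location="cpu" keeps torch.load from restoring the tensor onto the
    # (possibly unavailable) device it was serialized on, e.g. "mps".
    tensor = torch.load(buffer, map_location="cpu", weights_only=True)
    # Convert sparse layouts to dense so downstream engine code only ever
    # sees dense CPU tensors; the engine casts to the right device later.
    if tensor.layout != torch.strided:
        tensor = tensor.to_dense()
    return tensor
```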
Test Plan
Add a property-based test using hypothesis (already a test dependency) and hypothesis-torch (a new test dependency that provides torch strategies for hypothesis) which attempts to load arbitrary tensors, including tensors of arbitrary sparse/dense layouts and arbitrary devices. The test asserts that the deserialized tensor equals the original, is on the CPU, and is dense.
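For illustration, a minimal round-trip property test in this spirit might look like the sketch below (assumed names; per the review discussion above, the committed test ultimately avoids hypothesis-torch and non-CPU tensors):

```python
# Sketch only; not the committed test.
import io

import torch
from hypothesis import given, strategies as st


@given(
    data=st.lists(
        st.floats(allow_nan=False, allow_infinity=False, width=32),
        min_size=1,
        max_size=64,
    ),
    make_sparse=st.booleans(),
)
def test_prompt_embeds_round_trip(data, make_sparse):
    original = torch.tensor(data, dtype=torch.float32)
    serialized = original.to_sparse() if make_sparse else original

    buffer = io.BytesIO()
    torch.save(serialized, buffer)
    buffer.seek(0)

    loaded = torch.load(buffer, map_location="cpu", weights_only=True)
    if loaded.layout != torch.strided:
        loaded = loaded.to_dense()

    # The deserialized tensor must be on the CPU, dense, and equal to the
    # original values.
    assert loaded.device == torch.device("cpu")
    assert loaded.layout == torch.strided
    assert torch.equal(loaded, original)
```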
Test Result
The new test passes in my local environment. If CI passes, this should be good to go.
(Optional) Documentation Update
None needed, I think.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.