Conversation

@qthequartermasterman qthequartermasterman commented Aug 15, 2025

Purpose

While working on the prototype for #22124, I discovered two types of tensor inputs for Prompt Embeds that can cause downstream failures: tensors serialized on unavailable devices and sparse tensors.

(1) If you serialize a prompt embeds tensor on, say, a MacBook under 'mps' and send it to vLLM today, vLLM will attempt to deserialize it onto the 'mps' device, which in general will not be available on the machines running vLLM. By forcing all deserialized tensors onto the CPU in torch.load, we can better control those tensors; they will eventually be cast to the correct device later in the engine anyway.
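A minimal sketch of the idea (illustrative helper name, not the exact vLLM code path): force every deserialized tensor onto the CPU regardless of the device it was saved from.

```python
import io

import torch


def load_prompt_embeds(serialized: bytes) -> torch.Tensor:
    # map_location="cpu" lets a tensor saved on e.g. "mps" or "cuda" be
    # loaded on a host that lacks that device; the engine moves it to the
    # correct device later.
    return torch.load(io.BytesIO(serialized), map_location="cpu")
```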

(2) Because prompt_embeds take more memory than token ids, transmission can be slower. One way around this is to use sparse tensors, for models/inputs where that makes sense. Because sparse tensors are never cast to dense tensors anywhere in vLLM today, both the v0 engine and the prototype I'm working on for the v1 engine will choke on them at various points.
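As a rough illustration (a sketch assuming a mostly-zero embedding, not the PR's actual code): a client can ship a sparse COO tensor to cut payload size, and the serving layer densifies it after loading so the rest of the engine only ever sees dense tensors.

```python
import io

import torch

# Client side: a mostly-zero prompt embeds tensor shrinks a lot as sparse COO.
embeds = torch.zeros(4, 4096)
embeds[0, :8] = 1.0
payload = io.BytesIO()
torch.save(embeds.to_sparse(), payload)

# Server side: load onto the CPU, then densify before the engine sees it.
loaded = torch.load(io.BytesIO(payload.getvalue()), map_location="cpu")
if loaded.layout != torch.strided:
    loaded = loaded.to_dense()
assert torch.equal(loaded, embeds)
```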

This PR is mostly a superset of existing behavior. The only currently working behavior that changes is when a non-CPU device is used for both saving and loading. In that case, the tensor is now loaded onto the CPU (which has no significant runtime performance impact) and then cast to the appropriate device in the engine, whereas before it would be loaded onto the device it was serialized on and then cast.

Test Plan

Add a property-based test using hypothesis (already a test dependency) and hypothesis-torch (a new test dependency that provides torch strategies for hypothesis) that attempts to load arbitrary tensors, including tensors with arbitrary sparse/dense layouts on arbitrary devices. The test asserts that the deserialized tensor equals the original, is on the CPU, and is dense.
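A simplified sketch of the property under test (the real test draws tensors via hypothesis/hypothesis-torch strategies; `encode_prompt_embeds`/`decode_prompt_embeds` below are hypothetical stand-ins for the serving-layer encode/decode path):

```python
import io

import torch
from hypothesis import given, strategies as st


def encode_prompt_embeds(t: torch.Tensor) -> bytes:
    buf = io.BytesIO()
    torch.save(t, buf)
    return buf.getvalue()


def decode_prompt_embeds(data: bytes) -> torch.Tensor:
    t = torch.load(io.BytesIO(data), map_location="cpu")
    return t.to_dense() if t.layout != torch.strided else t


@given(st.lists(st.floats(-1.0, 1.0, width=32), min_size=1, max_size=64))
def test_prompt_embeds_round_trip(values):
    original = torch.tensor(values)
    restored = decode_prompt_embeds(encode_prompt_embeds(original))
    assert restored.device.type == "cpu"
    assert restored.layout == torch.strided
    assert torch.equal(restored, original)
```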

Test Result

The new test passes in my local environment. If CI passes, this should be good to go.

(Optional) Documentation Update

None needed, I think.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

@qthequartermasterman
Contributor Author

@DarkLight1337


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses two important issues with prompt embeds: handling tensors serialized on unavailable devices and supporting sparse tensors. The changes in serving_engine.py are clean and directly solve these problems by mapping loaded tensors to the CPU and converting them to a dense format. The addition of a property-based test using hypothesis is a great way to ensure the robustness of this functionality. I've found a minor issue in the new test that could cause it to fail for valid inputs, which I've commented on. Overall, this is a solid contribution.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀


@DarkLight1337 DarkLight1337 left a comment

LGTM, thanks

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) August 15, 2025 12:38
@github-actions github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label Aug 15, 2025
@qthequartermasterman qthequartermasterman changed the title [FIX] Add support for loading prompt embeds tensors serialized on unavailable devices and sparse tensors [BugFix] Add support for loading prompt embeds tensors serialized on unavailable devices and sparse tensors Aug 15, 2025
import pybase64
import pytest
import regex as re
import torch
Collaborator

this is probably causing the fork issue

@qthequartermasterman qthequartermasterman Aug 15, 2025

Thanks! I think it wasn't the torch import, but investigating your comment did help me better isolate the issue. Instead, hypothesis-torch had a side effect that initialized CUDA during test collection. Because it ships a hypothesis plugin, this side effect of registering strategies for certain types occurred any time pytest was invoked, regardless of whether the collected tests depended on hypothesis or hypothesis-torch.

I submitted a patch to hypothesis-torch, but it didn't seem to fully resolve the issue for the failing tests when run locally. I realized, though, that even if that issue were resolved, simply instantiating a CUDA tensor in this test would be enough to reinitialize CUDA and cause failures in any tests that ran alongside these. So I decided to drop hypothesis-torch altogether for generating arbitrary tensors, as well as testing against non-CPU tensors.
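(A hypothetical illustration of the hazard described above, not part of this PR: once anything creates the CUDA context at import or collection time, fork-based test workers break, which a conftest-style guard could surface early.)

```python
import torch


def assert_cuda_not_initialized() -> None:
    # torch.cuda.is_initialized() flips to True as soon as the CUDA context
    # exists, e.g. because a plugin built a CUDA tensor during collection.
    assert not torch.cuda.is_initialized(), (
        "CUDA was initialized during test collection; fork-based test "
        "workers will fail afterwards."
    )
```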

The test is weaker than I'd like, but side effects make life difficult. 😢

We'll see if CI is happy after this. Thanks for taking a look.

auto-merge was automatically disabled August 15, 2025 21:13

Head branch was pushed to by a user without write access

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) August 15, 2025 22:05
@DarkLight1337
Member

Test is failing, PTAL

auto-merge was automatically disabled August 16, 2025 03:59

Head branch was pushed to by a user without write access

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) August 16, 2025 04:31
@DarkLight1337 DarkLight1337 merged commit 78863f8 into vllm-project:main Aug 16, 2025
41 checks passed
666even666 pushed a commit to 666even666/vllm that referenced this pull request Aug 18, 2025
…unavailable devices and sparse tensors (vllm-project#22962)

Signed-off-by: Andrew Sansom <[email protected]>
Signed-off-by: Yiwen Chen <[email protected]>
divakar-amd pushed a commit to divakar-amd/vllm_upstream that referenced this pull request Aug 20, 2025
…unavailable devices and sparse tensors (vllm-project#22962)

Signed-off-by: Andrew Sansom <[email protected]>
djmmoss pushed a commit to djmmoss/vllm that referenced this pull request Aug 21, 2025
…unavailable devices and sparse tensors (vllm-project#22962)

Signed-off-by: Andrew Sansom <[email protected]>
Signed-off-by: Duncan Moss <[email protected]>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
…unavailable devices and sparse tensors (vllm-project#22962)

Signed-off-by: Andrew Sansom <[email protected]>
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
…unavailable devices and sparse tensors (vllm-project#22962)

Signed-off-by: Andrew Sansom <[email protected]>
Signed-off-by: Xiao Yu <[email protected]>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
…unavailable devices and sparse tensors (vllm-project#22962)

Signed-off-by: Andrew Sansom <[email protected]>

Labels

ci/build · frontend · ready (ONLY add when PR is ready to merge/full CI is needed)
