[BugFix] Add support for loading prompt embeds tensors serialized on unavailable devices and sparse tensors #22962
Conversation
Signed-off-by: Andrew Sansom <[email protected]>
Signed-off-by: Andrew Sansom <[email protected]>
Signed-off-by: Andrew Sansom <[email protected]>
Code Review
This pull request effectively addresses two important issues with prompt embeds: handling tensors serialized on unavailable devices and supporting sparse tensors. The changes in serving_engine.py are clean and directly solve these problems by mapping loaded tensors to the CPU and converting them to a dense format. The addition of a property-based test using hypothesis is a great way to ensure the robustness of this functionality. I've found a minor issue in the new test that could cause it to fail for valid inputs, which I've commented on. Overall, this is a solid contribution.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Signed-off-by: Andrew Sansom <[email protected]>
DarkLight1337
left a comment
LGTM, thanks
import pybase64
import pytest
import regex as re
import torch
this is probably causing the fork issue
Thanks! I think it wasn't the torch import, but investigating your comment did help me better isolate the issue! Instead, hypothesis-torch had a side effect that was initializing CUDA during test collection. Because it ships a hypothesis plugin, this side effect of registering strategies for certain types occurred any time pytest was invoked, regardless of whether the collected tests depended on hypothesis or hypothesis-torch.
I submitted a patch to hypothesis-torch, but it didn't seem to fully resolve the issue for the failing tests when run locally. I realized, though, that even if that issue were resolved, simply trying to instantiate a CUDA tensor in this test would be enough to re-initialize CUDA and cause the failures in any tests that ran alongside these. So I decided to drop hypothesis-torch altogether for generating arbitrary tensors, as well as testing against non-CPU tensors.
The test is weaker than I'd like, but side effects make life difficult. 😢
We'll see if CI is happy after this. Thanks for taking a look.
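For context, a tiny illustration (written for this discussion, not code from the PR) of why collection-time CUDA initialization is so disruptive: once the pytest parent process initializes CUDA, fork-based subprocess tests can no longer use it safely.

```python
# Illustrative sketch only; not code from this PR.
import torch

# At clean collection time nothing has touched CUDA yet.
print(torch.cuda.is_initialized())  # False

if torch.cuda.is_available():
    # Merely constructing a CUDA tensor (as a plugin side effect or inside a
    # test) is enough to initialize CUDA in the parent process...
    _ = torch.zeros(1, device="cuda")
    print(torch.cuda.is_initialized())  # True

    # ...after which any test that forks a subprocess and touches CUDA fails
    # with "Cannot re-initialize CUDA in forked subprocess".
```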
Signed-off-by: Andrew Sansom <[email protected]>
Head branch was pushed to by a user without write access
Test is failing, PTAL
Signed-off-by: Andrew Sansom <[email protected]>
Head branch was pushed to by a user without write access
Purpose
While working on the prototype for #22124, I discovered two types of tensor inputs for Prompt Embeds that can cause downstream failures: tensors serialized on unavailable devices and sparse tensors.
(1) If you serialize a prompt embeds tensor on, say, a MacBook under 'mps' and then send it to vLLM today, vLLM will attempt to deserialize it onto the 'mps' device, which in general will not be available on machines running vLLM. By forcing all deserialized tensors onto the CPU in torch.load, we can better control those tensors; they will eventually be cast to the correct device later in the engine anyway.
(2) Because prompt_embeds take more memory than token IDs, transmission can be slower. One way to get around this is to use sparse tensors, for models/inputs where that makes sense. Because sparse tensors are never cast to dense tensors in vLLM today, both the v0 engine and the prototype I'm working on in the v1 engine will choke at various points.
This PR is mostly a superset of existing behavior. The only currently working behavior that is affected is when a non-CPU device is used for both saving and loading. In that case, the tensor will now be loaded onto the CPU (which does not have a significant runtime performance impact) and then cast to the appropriate device in the engine, whereas before it would be loaded onto the same device it was serialized on and then cast.
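To make the intended behavior concrete, here is a rough sketch of the loading path. The function name, base64 handling, and weights_only usage are assumptions for illustration, not the actual serving_engine.py change: deserialize onto the CPU via map_location, then densify any sparse layout.

```python
# Rough sketch; names and details are assumptions, not the exact vLLM code.
import base64
import io

import torch


def load_prompt_embeds(encoded: str) -> torch.Tensor:
    buffer = io.BytesIO(base64.b64decode(encoded))
    # map_location="cpu" keeps torch.load from restoring the tensor onto the
    # (possibly unavailable) device it was serialized on, e.g. "mps".
    tensor = torch.load(buffer, map_location="cpu", weights_only=True)
    # Convert sparse layouts to dense so downstream engine code only ever
    # sees dense CPU tensors; the engine casts to the right device later.
    if tensor.layout != torch.strided:
        tensor = tensor.to_dense()
    return tensor
```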
Test Plan
Add a property-based test using hypothesis (already a test dependency) and hypothesis-torch (a new test dependency that provides torch strategies for hypothesis) which attempts to load arbitrary tensors, including tensors of arbitrary sparse/dense layouts and arbitrary devices. The test asserts that the deserialized tensor equals the original, is on the CPU, and is dense.
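For illustration, a minimal round-trip property test in this spirit might look like the sketch below (assumed names; per the review discussion above, the committed test ultimately avoids hypothesis-torch and non-CPU tensors):

```python
# Sketch only; not the committed test.
import io

import torch
from hypothesis import given, strategies as st


@given(
    data=st.lists(
        st.floats(allow_nan=False, allow_infinity=False, width=32),
        min_size=1,
        max_size=64,
    ),
    make_sparse=st.booleans(),
)
def test_prompt_embeds_round_trip(data, make_sparse):
    original = torch.tensor(data, dtype=torch.float32)
    serialized = original.to_sparse() if make_sparse else original

    buffer = io.BytesIO()
    torch.save(serialized, buffer)
    buffer.seek(0)

    loaded = torch.load(buffer, map_location="cpu", weights_only=True)
    if loaded.layout != torch.strided:
        loaded = loaded.to_dense()

    # The deserialized tensor must be on the CPU, dense, and equal to the
    # original values.
    assert loaded.device == torch.device("cpu")
    assert loaded.layout == torch.strided
    assert torch.equal(loaded, original)
```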
Test Result
The new test passes in my local environment. If CI passes, this should be good to go.
(Optional) Documentation Update
None needed, I think.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.