fix: revert cast to cpu in MsgpackEncoder._encode_tensor to avoid hidden performance regressions
#25738
Conversation
fix: revert cast to cpu in MsgpackEncoder._encode_tensor to avoid hidden performance regressions
Signed-off-by: Andrew Sansom <[email protected]>
Code Review
This pull request aims to prevent hidden performance regressions by explicitly moving tensors to the CPU before serialization, rather than relying on an implicit cast within the MsgpackEncoder. The change correctly adds a .cpu() call for prompt_embeds in the input preprocessing stage. However, reverting the .cpu() call in MsgpackEncoder._encode_tensor creates a risk for other types of tensor inputs, such as those from multi-modal data, which may not be on the CPU. This could lead to runtime errors. I've added a critical comment to address this by making the CPU requirement explicit with a check.
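As an illustration of that suggestion, a minimal sketch of an explicit device check is below. The helper name is hypothetical and not vLLM's actual implementation; it only shows the shape of a fail-fast check.

```python
# Hypothetical sketch of the reviewer's suggestion: fail fast on
# non-CPU tensors instead of paying a silent device-to-host copy on
# every serialization. `check_cpu_before_encode` is an illustrative
# name, not code from vLLM.
import torch

def check_cpu_before_encode(tensor: torch.Tensor) -> torch.Tensor:
    if tensor.device.type != "cpu":
        raise ValueError(
            "MsgpackEncoder expects CPU tensors; got a tensor on "
            f"{tensor.device}. Move it with tensor.cpu() at the call site.")
    return tensor
```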
Purpose
PR #24278 introduced casting tensors in MsgpackEncoder to the CPU before serializing. Although this did not introduce any performance regression at the time, casting between devices can be very expensive, and doing so every time a tensor is sent between processes has the potential to introduce major performance regressions. Tensors that will be serialized with Msgpack should be explicitly cast to CPU before any encoding, as recommended by @njhill. This is a spiritual successor to #22962, which does similar casting to CPU in the OpenAI-compatible API. This PR ensures that ALL prompt embeds tensors are cast to CPU before being processed, even if they are submitted in offline mode.
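A minimal sketch of that approach is below, assuming a hypothetical preprocessing helper (`preprocess_prompt_embeds` is not vLLM's actual function name):

```python
# Minimal sketch: cast prompt embeds to CPU once at input-preprocessing
# time, so the encoder never performs a hidden device-to-host copy per
# message. `preprocess_prompt_embeds` is a hypothetical name.
import torch

def preprocess_prompt_embeds(prompt_embeds: torch.Tensor) -> torch.Tensor:
    # torch.Tensor.cpu() returns the tensor itself when it already lives
    # on CPU, so the transfer cost is only paid when actually needed.
    return prompt_embeds.cpu()
```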
Test Plan
No new tests are needed. I have a few local scripts that I use to hit vLLM with thousands of prompt-embeds requests originating from different devices; those scripts all pass. A hedged sketch of such a script is shown below.
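For illustration only, a minimal version of such a smoke test might look like the sketch below. The `enable_prompt_embeds` flag and the `{"prompt_embeds": ...}` prompt shape are assumptions about the vLLM API and may differ across versions.

```python
# Hedged sketch of a local smoke test: submit prompt embeds that live on
# different devices and confirm requests still succeed. The
# `enable_prompt_embeds` flag and the {"prompt_embeds": ...} prompt form
# are assumptions and may vary by vLLM version.
import torch
from vllm import LLM

llm = LLM(model="facebook/opt-125m", enable_prompt_embeds=True)

hidden_size = 768  # assumed hidden size for the model above
cpu_embeds = torch.randn(16, hidden_size)
prompts = [{"prompt_embeds": cpu_embeds}]
if torch.cuda.is_available():
    # With this PR, GPU-resident embeds are cast to CPU during input
    # preprocessing rather than inside MsgpackEncoder._encode_tensor.
    prompts.append({"prompt_embeds": cpu_embeds.cuda()})

outputs = llm.generate(prompts)
for out in outputs:
    print(out.outputs[0].text)
```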
Test Result
Local relevant tests are passing. Pending CI.