[Bugfix][Frontend] Fixed issue where requests with duplicate request IDs might be sent to EngineCore simultaneously #15326
|
Thanks for your contribution! I agree that this is a race condition. Appreciate you digging in |
vllm/v1/engine/output_processor.py
Can we call this something more descriptive? get_parent_and_children_reqs?
It should probably also reflect the fact that the parent request is being removed.
> the fact that the parent request is being removed

Yes. Do you have any good suggestions? How about try_pop_parent?
|
Thanks a ton! I reviewed the implementation in detail and you have fixed the problem! Just left some minor comments about naming the functions and comments. Ping me on slack when this is ready! |
|
Thanks for this @hidva, I agree with @robertgshaw2-redhat's comments. However, I was already thinking it might be more robust to have the engine return finished notifications for all requests, including those whose abort is initiated from the front-end process. Currently it just stops sending any outputs for these but we could change it so that there will be a terminating RequestOutput with "aborted" finish_reason in these cases. Then we can clean up the output processor request states based on these responses rather than the current logic that's a bit disjoint. Another reason to do this is that in addition to the leak that you pointed out, there may still be a bug where such aborted requests aren't captured properly in the metrics, because |
|
Apologies for the delay; I was on vacation until now. I will continue to follow up on this PR. |
However, there are indeed some scenarios where only the frontend can notify the engine to stop outputting, such as the presence of a stop string or when the client disconnects. If we let the engine return finished notifications for all requests, how should the engine be aware of such external conditions like client disconnection?
Yes, we should add a call to In other words, after we introduced the concepts of aborted requests and finished requests, we also introduced two interfaces: |
|
Thanks @hidva just to be clear, I think this PR would be good to merge in its current form but that we should consider a follow-on to address the other things I mentioned.
The front-end would still initiate the aborts in the same way, i.e. for client disconnection and stop strings. It's just that the engine would now be guaranteed to subsequently return a final RequestOutput for these with aborted finish reason (this will require a change in the engine of course).
Regardless of the idempotence I think that it would be nice if we always do the cleanup when receiving the final response for a given request, irrespective of how it was terminated. |
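
To make the follow-on idea concrete, here is a minimal sketch of "always clean up on the final response". It is a toy model, not vLLM code: `ToyOutput`, `ToyOutputProcessor`, and the `"aborted"` finish reason string are assumptions made for illustration. The point is that an abort only marks the request, and the state is dropped only once the engine's terminating output arrives, however the request ended.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyOutput:
    """Hypothetical stand-in for an engine output for one request."""
    request_id: str
    finish_reason: Optional[str] = None  # None while the request is still running


@dataclass
class ToyOutputProcessor:
    """Toy frontend-side tracker; not the real OutputProcessor."""
    request_states: Dict[str, str] = field(default_factory=dict)

    def add_request(self, request_id: str) -> None:
        if request_id in self.request_states:
            raise ValueError(f"Request id {request_id} already running.")
        self.request_states[request_id] = "running"

    def abort_request(self, request_id: str) -> None:
        # Only mark the request; keep tracking it until the engine confirms.
        if request_id in self.request_states:
            self.request_states[request_id] = "aborting"

    def process_output(self, output: ToyOutput) -> None:
        # Single cleanup path: any final output (stop, length, aborted, ...)
        # removes the state, irrespective of how the request was terminated.
        if output.finish_reason is not None:
            self.request_states.pop(output.request_id, None)
```

With this shape, a second `add_request` for the same ID keeps failing until the engine's final output for the first one has been processed, so the frontend never forgets a request the engine still holds.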
|
@njhill Is there anything else that needs to be done for this PR? Also, I'm not sure why the two tests are failing. |
|
@hidva it seems that the test is hanging. Could you try merging in the latest main again? It's possible that it's a side-effect of the changes. |
…IDs might be sent to EngineCore simultaneously Signed-off-by: 盏一 <[email protected]>
|
@njhill Could you help me rerun the Entrypoints test? It seems like a fluke, and I don't have the necessary permissions. |
|
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you! |
|
This pull request has merge conflicts that must be resolved before it can be |
|
@hidva apologies, could you merge in the latest main branch. Hopefully that should also resolve the test failure. |
Since vllm-project#9550 and vllm-project#10968 we support clients supplying a custom request ID. The motivation for this is that it can be very helpful when you need to correlate vLLM logs with logs of a related service.

Since the request ID is used ubiquitously across vLLM as a unique key, it is obviously problematic if we ever have multiple in-flight requests using the same client-provided request ID. We saw this happening recently when `vllm serve bench` started including a request ID and the request IDs from multiple concurrent instances caused collisions. See vllm-project#27723

We try to guard against request ID collisions currently in the frontend in OutputProcessor:

```
def add_request(...):
    if request_id in self.request_states:
        raise ValueError(f"Request id {request_id} already running.")
```

however, this is not always effective:

1) We can have abort race conditions where a request is no longer tracked by the frontend, but still not completed in the engine. See vllm-project#15326 for an attempt to fix this.
2) With P/D, a request will continue to be tracked by the prefill engine long after the prefill request has been completed in the frontend, while we wait for the decode side to fetch the KV blocks.

Let's instead ensure we use a unique request ID internally, even when a client provides a custom request ID. We can do this simply by prepending a short random prefix, given that we already add a prefix to the client-provided ID.

A full 32 character random UUID would be overkill as a prefix, so how many random characters would be sufficient? 8 hex characters gives us 32 bits of entropy, or 16^8 possible prefixes.

Using the collision probability approximation from https://preshing.com/20110504/hash-collision-probabilities: if N = 16^8 and k is the number of generated prefixes, then the probability of collision is approximately (k^2)/(2N). So if a client somehow caused vLLM to hold 10k requests that reuse the same client-provided ID, there would be a 1.16% chance of collision:

```
>>> N = 16**8
>>> k = 10_000
>>> (k**2)/(2*N)
0.011641532182693481
```

That seems [super good enough](https://hownot2.com/products/hownot2-super-good-enough-t-shirt).

Signed-off-by: Mark McLoughlin <[email protected]>
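
The estimate above can be reproduced with a short sketch. The helper names `collision_probability` and `make_internal_request_id`, and the `-` separator, are hypothetical and chosen only for illustration; the commit message does not show the actual implementation. The approximation k^2/(2N) is only meaningful while k is much smaller than the square root of N.

```python
import secrets

PREFIX_HEX_CHARS = 8  # 8 hex characters = 32 bits of entropy


def collision_probability(k: int, hex_chars: int = PREFIX_HEX_CHARS) -> float:
    """Birthday approximation k^2 / (2N), with N = 16^hex_chars possible prefixes."""
    n = 16 ** hex_chars
    return (k ** 2) / (2 * n)


def make_internal_request_id(client_request_id: str) -> str:
    """Hypothetical helper: prepend a random hex prefix to a client-provided ID."""
    # token_hex(n) returns 2*n hex characters, so 4 bytes gives 8 characters.
    prefix = secrets.token_hex(PREFIX_HEX_CHARS // 2)
    return f"{prefix}-{client_request_id}"


if __name__ == "__main__":
    print(collision_probability(10_000))        # chance with 10k reused client IDs
    print(make_internal_request_id("my-req-1"))  # random prefix differs per call
```

Under these assumptions, two requests carrying the same client-provided ID would almost always receive different internal IDs, while the client-provided part stays visible for log correlation.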
|
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you! |
Currently, vLLM allows users to send duplicate request IDs. At the same time, numerous modules in EngineCore use request IDs as dictionary keys, such as `KVCacheManager.req_to_blocks`. This relies on the assumption that the Frontend always aborts a request before adding a new one with the same request ID.

Currently, `AsyncLLM` ensures that a duplicate request ID must first be aborted before it can be added again, through the sequence `AsyncLLM._add_request` -> `OutputProcessor.add_request`.

We can easily simulate the potential bug by enlarging the possible time window with an `await asyncio.sleep(13)` inserted at the BUG point.

To fix this issue, we categorized completed requests into two types, handled by `handle_abort_reqs` and `_handle_finished_reqs`, and ensured that the scope of request visibility in the Frontend always includes the scope of request visibility in EngineCore.
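
The race and the visibility invariant can be illustrated with a toy asyncio model. This is a sketch under stated assumptions, not the PR's code: `ToyEngine`, `ToyFrontend`, and the `forget_before_engine_abort` flag are hypothetical, and the short `asyncio.sleep` stands in for the enlarged window at the BUG point. When the frontend forgets a request before the engine has dropped it, a new request with the same ID reaches the engine while the old one is still there.

```python
import asyncio
from typing import Dict, List, Set


class ToyEngine:
    """Toy stand-in for EngineCore: keeps per-request state keyed by request ID."""

    def __init__(self) -> None:
        self.req_to_blocks: Dict[str, list] = {}
        self.duplicates_seen: List[str] = []

    def add(self, request_id: str) -> None:
        if request_id in self.req_to_blocks:
            self.duplicates_seen.append(request_id)  # two live requests, one key
        self.req_to_blocks[request_id] = []

    def abort(self, request_id: str) -> None:
        self.req_to_blocks.pop(request_id, None)


class ToyFrontend:
    def __init__(self, engine: ToyEngine, forget_before_engine_abort: bool) -> None:
        self.engine = engine
        self.request_states: Set[str] = set()
        self.forget_before_engine_abort = forget_before_engine_abort

    async def add_request(self, request_id: str) -> None:
        if request_id in self.request_states:
            raise ValueError(f"Request id {request_id} already running.")
        self.request_states.add(request_id)
        self.engine.add(request_id)

    async def abort(self, request_id: str) -> None:
        if self.forget_before_engine_abort:
            # Buggy order: frontend visibility ends before engine visibility.
            self.request_states.discard(request_id)
            await asyncio.sleep(0.01)  # stand-in for the window at the BUG point
            self.engine.abort(request_id)
        else:
            # Safe order: frontend keeps tracking until the engine has let go.
            await asyncio.sleep(0.01)
            self.engine.abort(request_id)
            self.request_states.discard(request_id)


async def run(forget_before_engine_abort: bool) -> List[str]:
    engine = ToyEngine()
    frontend = ToyFrontend(engine, forget_before_engine_abort)
    await frontend.add_request("r1")
    abort_task = asyncio.create_task(frontend.abort("r1"))
    await asyncio.sleep(0)  # let the abort start and reach its sleep
    try:
        await frontend.add_request("r1")  # client reuses the same ID
    except ValueError:
        pass  # rejected because the frontend still tracks "r1"
    await abort_task
    return engine.duplicates_seen


if __name__ == "__main__":
    print(asyncio.run(run(True)))   # duplicate reaches the toy engine
    print(asyncio.run(run(False)))  # duplicate is rejected in the frontend
```

In the buggy ordering the late abort also removes the state that now belongs to the new request, which is why keeping the frontend's view a superset of the engine's view matters.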