Add health check, make async Engine more robust #3015

Merged
Yard1 merged 8 commits into vllm-project:main from Yard1:add_health_check
Mar 4, 2024

Conversation

@Yard1 Yard1 commented Feb 23, 2024

Collaborator

For production use cases, we want to be able to detect Engine failures, especially ones that can happen silently (e.g., due to NCCL timeouts). This PR adds a health check method (currently it only checks the health of Ray workers) and makes the async Engine more robust by adding a timeout for each iteration and improving error reporting.

@Yard1 Yard1 requested review from simon-mo and zhuohan123 February 26, 2024 20:39
@zhuohan123 zhuohan123 self-assigned this Mar 2, 2024

@zhuohan123 zhuohan123 left a comment

Member

Thanks for the contribution! In general LGTM. Left some small questions.

Comment thread vllm/engine/llm_engine.py Outdated
Comment thread vllm/engine/async_llm_engine.py Outdated
Comment thread vllm/engine/async_llm_engine.py Outdated
Comment on lines +42 to +44
finally:
    if exception:
        error_callback(exception)

Member

We raise errors in both the try branch and the except branch, so what does the finally block do here?

Collaborator Author

We want to run the error callback even after we re-raise an exception in the except block.

Member

I think you could just do this in the except block before re-raising, though (things will still run in the same order).
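The two shapes discussed in this thread can be compared with a small sketch. The function names, the messages, and the `observe` helper are illustrative assumptions, not the PR's code; the point is only that the callback fires before the re-raised error reaches the caller in both versions.

```python
def with_finally(work, error_callback):
    """Callback in `finally`, guarded by a captured exception."""
    exception = None
    try:
        work()
        # Reaching this point means the task ended when it should not have.
        raise RuntimeError("task finished unexpectedly")
    except Exception as e:
        exception = e
        raise RuntimeError("wrapped") from e
    finally:
        # Runs even though the except block re-raised.
        if exception:
            error_callback(exception)


def with_except_only(work, error_callback):
    """Callback invoked in `except`, just before re-raising."""
    try:
        work()
        raise RuntimeError("task finished unexpectedly")
    except Exception as e:
        error_callback(e)
        raise RuntimeError("wrapped") from e


def observe(fn, work):
    """Record what the callback saw and what the caller finally received."""
    seen = []
    try:
        fn(work, lambda exc: seen.append(type(exc).__name__))
    except RuntimeError as exc:
        seen.append("raised:" + str(exc))
    return seen


def failing():
    raise ValueError("boom")


if __name__ == "__main__":
    print(observe(with_finally, failing))
    print(observe(with_except_only, failing))
```

Because every path through the `try` body raises, the `finally` guard is only ever true after the `except` block has run, so the second form is simpler without changing what the caller observes.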

Comment thread vllm/engine/async_llm_engine.py Outdated
Comment on lines +174 to +178
async def wait_for_new_requests(self, clear: bool):
    if not self.has_new_requests():
        await self.new_requests_event.wait()
    if clear:
        self.new_requests_event.clear()

Member

Why don't we always clear this flag?

Suggested change
- async def wait_for_new_requests(self, clear: bool):
-     if not self.has_new_requests():
-         await self.new_requests_event.wait()
-     if clear:
-         self.new_requests_event.clear()
+ async def wait_for_new_requests(self):
+     if not self.has_new_requests():
+         await self.new_requests_event.wait()
+     self.new_requests_event.clear()

Member

Also, what's the reason behind this change? Why do we need to move the clear call from get_new_and_finished_requests to here?

Collaborator Author

Yes, we can always clear it.

The reason is to ensure the event is cleared as soon as we have new requests.

Yard1 and others added 4 commits March 4, 2024 11:00
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>

@njhill njhill left a comment

Member

Thanks @Yard1, this looks great.

Comment thread vllm/engine/async_llm_engine.py Outdated

Comment on lines +175 to +177
if not self.has_new_requests():
    await self.new_requests_event.wait()
self.new_requests_event.clear()

Member

Suggestion to only clear before waiting

Suggested change
- if not self.has_new_requests():
-     await self.new_requests_event.wait()
- self.new_requests_event.clear()
+ if not self.has_new_requests():
+     self.new_requests_event.clear()
+     if not self.has_new_requests():
+         await self.new_requests_event.wait()

Collaborator Author

Hmm, can you explain why we should do it like that?

Member

Just to avoid flip-flopping the event: it only needs to be cleared when you're actually about to wait on it. But I guess with Python/asyncio it doesn't matter anyway.

Collaborator Author

Yeah I think it should be fine
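The claim that the ordering does not matter under asyncio can be checked with a small sketch. Coroutines only yield control at an `await`, so nothing can add a request between the synchronous check and the `clear()`. The `Tracker` class, both wait functions, and `drive` are hypothetical names for this example, and the nesting of the "clear before waiting" variant is one reading of the suggestion rather than a confirmed copy of it.

```python
import asyncio
from collections import deque


class Tracker:
    """Hypothetical minimal tracker: a pending queue plus an event."""

    def __init__(self):
        self.pending = deque()
        self.event = asyncio.Event()

    def add(self, item):
        self.pending.append(item)
        self.event.set()

    def has_new(self):
        return bool(self.pending)


async def wait_clear_after(t):
    # Wait if nothing is pending, then always clear.
    if not t.has_new():
        await t.event.wait()
    t.event.clear()


async def wait_clear_before(t):
    # Clear only when about to wait, re-checking before blocking.
    if not t.has_new():
        t.event.clear()
        if not t.has_new():
            await t.event.wait()


async def drive(wait_fn):
    """Feed three items to a consumer that uses `wait_fn` between drains."""
    t = Tracker()
    served = []

    async def consumer():
        while len(served) < 3:
            await wait_fn(t)
            while t.pending:
                served.append(t.pending.popleft())

    task = asyncio.create_task(consumer())
    for i in range(3):
        await asyncio.sleep(0)  # let the consumer run up to its next await
        t.add(i)
    await asyncio.wait_for(task, timeout=1.0)
    return served


if __name__ == "__main__":
    print(asyncio.run(drive(wait_clear_after)))
    print(asyncio.run(drive(wait_clear_before)))
```

With preemptive threads the re-check after `clear()` would guard against a request arriving between the two statements; in a single event loop that window does not exist, which is why both variants serve the same items.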

@Yard1 Yard1 enabled auto-merge (squash) March 4, 2024 21:44
@Yard1 Yard1 merged commit ff578ca into vllm-project:main Mar 4, 2024
dtransposed pushed a commit to afeldman-nm/vllm that referenced this pull request Mar 26, 2024
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
3 participants