Add health check, make async Engine more robust#3015
Conversation
zhuohan123
left a comment
There was a problem hiding this comment.
Thanks for the contribution! In general LGTM. Left some small questions.
| finally: | ||
| if exception: | ||
| error_callback(exception) |
There was a problem hiding this comment.
We raise errors in both try branch and except branch. Then what does the finally here do?
There was a problem hiding this comment.
we want to run the error callback even after we re-raise an exception in except
There was a problem hiding this comment.
I think you could just do this in the except block though before re-raising (things will still run in the same order)
| async def wait_for_new_requests(self, clear: bool): | ||
| if not self.has_new_requests(): | ||
| await self.new_requests_event.wait() | ||
| if clear: | ||
| self.new_requests_event.clear() |
There was a problem hiding this comment.
Why don't we always clear this flag?
| async def wait_for_new_requests(self, clear: bool): | |
| if not self.has_new_requests(): | |
| await self.new_requests_event.wait() | |
| if clear: | |
| self.new_requests_event.clear() | |
| async def wait_for_new_requests(self): | |
| if not self.has_new_requests(): | |
| await self.new_requests_event.wait() | |
| self.new_requests_event.clear() |
There was a problem hiding this comment.
Also, what's the reason behind this change? Why do we need to move the clear call from get_new_and_finished_requests to here?
There was a problem hiding this comment.
Yes, we can always clear it.
The reason is to ensure the event is cleared as soon as we have new requests
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
| finally: | ||
| if exception: | ||
| error_callback(exception) |
There was a problem hiding this comment.
I think you could just do this in the except block though before re-raising (things will still run in the same order)
| if not self.has_new_requests(): | ||
| await self.new_requests_event.wait() | ||
| self.new_requests_event.clear() |
There was a problem hiding this comment.
Suggestion to only clear before waiting
| if not self.has_new_requests(): | |
| await self.new_requests_event.wait() | |
| self.new_requests_event.clear() | |
| if not self.has_new_requests(): | |
| self.new_requests_event.clear() | |
| if not self.has_new_requests(): | |
| await self.new_requests_event.wait() |
There was a problem hiding this comment.
Hmm can you explain why we should do it like that?
There was a problem hiding this comment.
Just to avoid flip-flopping the event - it only needs to be cleared when you're actually about to wait on it. But I guess with python/asyncio it doesn't matter anyway.
There was a problem hiding this comment.
Yeah I think it should be fine
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
For production usecases, we want to be able to detect Engine failures, especially ones that can happen silently (eg. due to NCCL timeouts). This PR adds a health check method (currently only checking the health of Ray workers) and makes the Async engine more robust by adding a timeout for each iteration as well as better error reporting.