[recipe] feat: Add sleep/wakeup mode for gen rm vllm service and add tqdm showing process #2739
Conversation
…g the process of the request of the rm
Code Review
This pull request introduces a sleep/wakeup mechanism for the vLLM service to manage GPU resources effectively, along with a `tqdm` progress bar for better user experience during batch processing. The changes are well-intentioned. My review focuses on improving the robustness of the new functionality: a critical issue regarding potential resource leaks that should be addressed with a `try...finally` block, and a high-severity issue with the use of `assert` for runtime checks, which should be replaced with proper exception handling. I also recommend reverting a change to `zip` to maintain strict checking of input data, preventing potential silent errors.
```diff
 with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
     futures = []
-    for data_source, solution_str, ground_truth, extra_info in zip(
-        data_sources, solution_strs, ground_truths, extra_infos, strict=True
+    for data_source, solution_str, ground_truth, extra_info, index in zip(
+        data_sources, solution_strs, ground_truths, extra_infos, indexes, strict=False
     ):
-        future = executor.submit(compute_score, data_source, solution_str, ground_truth, extra_info)
+        future = executor.submit(compute_score, data_source, solution_str, ground_truth, extra_info, index)
+        time.sleep(0.001 * random.random())
         futures.append(future)

-    results = [future.result() for future in futures]
+    for future in tqdm.tqdm(as_completed(futures), total=len(futures)):
+        results.append(future.result())
+    results = sorted(results, key=lambda x: x[-1], reverse=False)
+    results = [result[0] for result in results]

+if SERVER_BACKEND == "VLLM" and USE_OFFLOAD:
+    vllm_execute_method("sleep")
```
The current implementation calls `vllm_execute_method("wake_up")` before processing and `vllm_execute_method("sleep")` after. If an exception occurs during batch processing (e.g., inside the `with ThreadPoolExecutor` block), the `sleep` call will be skipped, leaving the model loaded in GPU memory and causing a resource leak.
To ensure the `sleep` method is always called to free up resources, wrap the processing logic in a `try...finally` block.
Example:
```python
try:
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        # ... existing logic to submit and process futures ...
    # ... existing logic to sort and prepare results ...
finally:
    if SERVER_BACKEND == "VLLM" and USE_OFFLOAD:
        vllm_execute_method("sleep")
```

```python
assert task in ["sleep", "wake_up"], f"Invalid task: {task}"
url_root = BASE_URL
response = requests.post(url_root + "/" + task)
assert response.status_code == 200
```
Using `assert` for runtime checks is not robust: assertions can be disabled with Python's `-O` flag, which would remove these crucial validations. For runtime error handling and input validation, prefer explicit conditional checks that raise appropriate exceptions.
The `requests` library provides `response.raise_for_status()`, which is an idiomatic way to check for HTTP errors and gives more context than a generic `AssertionError`.
```diff
-assert task in ["sleep", "wake_up"], f"Invalid task: {task}"
-url_root = BASE_URL
-response = requests.post(url_root + "/" + task)
-assert response.status_code == 200
+if task not in {"sleep", "wake_up"}:
+    raise ValueError(f"Invalid task: {task}")
+url_root = BASE_URL
+response = requests.post(f"{url_root}/{task}")
+response.raise_for_status()
```
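Put together, the helper would look roughly like this. This is a minimal sketch, not the merged code: `BASE_URL`, the task names, and the function name `vllm_execute_method` come from the PR, while the default URL, docstring, and module layout are assumptions.

```python
import requests

# Assumed default; in the recipe, BASE_URL points at the vLLM server that
# exposes the POST /sleep and /wake_up endpoints this helper calls.
BASE_URL = "http://localhost:8000"


def vllm_execute_method(task: str) -> None:
    """Offload ("sleep") or reload ("wake_up") the vLLM reward model service."""
    if task not in {"sleep", "wake_up"}:
        raise ValueError(f"Invalid task: {task}")
    response = requests.post(f"{BASE_URL}/{task}")
    # raise_for_status() raises requests.HTTPError carrying the status code
    # and URL, which is far more actionable than a bare AssertionError.
    response.raise_for_status()
```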
```diff
-for data_source, solution_str, ground_truth, extra_info in zip(
-    data_sources, solution_strs, ground_truths, extra_infos, strict=True
+for data_source, solution_str, ground_truth, extra_info, index in zip(
+    data_sources, solution_strs, ground_truths, extra_infos, indexes, strict=False
```
The `zip` function is called with `strict=False`, while the previous version of this code used `strict=True`. `strict=True` is safer: it ensures that all input iterables have the same length, raising a `ValueError` if they don't. Switching to `strict=False` can hide bugs where the input lists have mismatched lengths by silently truncating to the shortest list.
It is recommended to revert to `strict=True` to enforce data integrity; a short demonstration follows the suggestion below.
```diff
-data_sources, solution_strs, ground_truths, extra_infos, indexes, strict=False
+data_sources, solution_strs, ground_truths, extra_infos, indexes, strict=True
```
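A minimal illustration (not from the PR) of what `strict` changes when the inputs get out of sync:

```python
solution_strs = ["a", "b", "c"]
indexes = [0, 1]  # one element short, e.g. due to an upstream bug

# strict=False silently truncates to the shortest iterable ('c' is dropped):
list(zip(solution_strs, indexes, strict=False))  # [('a', 0), ('b', 1)]

# strict=True surfaces the mismatch immediately:
list(zip(solution_strs, indexes, strict=True))
# ValueError: zip() argument 2 is shorter than argument 1
```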
For an external reward service, I think it should always use separate GPU resources?
Yes, it’s generally best practice to use dedicated GPU resources for the external reward service when resources are plentiful. However, when GPU resources are constrained, especially considering that many GPUs allocated for training might be idle during reward model scoring, we prefer to utilize those idle GPU resources for the reward service.
This PR seems to break the genrm test. @none0663 @wuxibin89 Surprisingly, the PR itself didn't test this. For #2794 to solve all CI faults in main, we may revert this first and add some robustness checks.
…tqdm showing process (volcengine#2739)
…and add tqdm showing process" (volcengine#2813) Reverts volcengine#2739. For volcengine#2794 to solve all CI faults.
What does this PR do?
Add a sleep/wakeup mode for the gen RM vLLM service and a tqdm progress bar showing request progress.
This capability is particularly beneficial when the model server shares resources with a training workload on the same machine. It allows the reward model service to be temporarily offloaded (to free up GPU memory) during intensive training sessions and reloaded when the service is required again.
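Sketching the whole flow with the review suggestions applied: names such as `SERVER_BACKEND`, `USE_OFFLOAD`, `MAX_WORKERS`, `compute_score`, and `vllm_execute_method` come from the recipe, while the `try...finally` wrapping and `strict=True` follow the review above, so this is an illustration rather than the merged code.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

import tqdm


def compute_score_batched(data_sources, solution_strs, ground_truths, extra_infos):
    indexes = list(range(len(solution_strs)))
    if SERVER_BACKEND == "VLLM" and USE_OFFLOAD:
        vllm_execute_method("wake_up")  # reload the reward model onto the GPU
    results = []
    try:
        with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
            futures = []
            for data_source, solution_str, ground_truth, extra_info, index in zip(
                data_sources, solution_strs, ground_truths, extra_infos, indexes, strict=True
            ):
                futures.append(
                    executor.submit(
                        compute_score, data_source, solution_str, ground_truth, extra_info, index
                    )
                )
                time.sleep(0.001 * random.random())  # small jitter so requests don't land at once
            # as_completed() yields futures as they finish, which lets tqdm
            # show real progress instead of blocking on the slowest request.
            for future in tqdm.tqdm(as_completed(futures), total=len(futures)):
                results.append(future.result())
    finally:
        # Always offload, even if a request fails, so GPU memory is freed
        # for the co-located training workload.
        if SERVER_BACKEND == "VLLM" and USE_OFFLOAD:
            vllm_execute_method("sleep")
    # Completion order is arbitrary; the trailing index restores submit order.
    results.sort(key=lambda x: x[-1])
    return [result[0] for result in results]
```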