
Conversation

@none0663
Contributor

What does this PR do?

Add a sleep/wake-up mode for the gen-RM vLLM service, and add a tqdm progress bar for batch scoring.

This capability is particularly beneficial when the model server shares resources with a training workload on the same machine. It allows the reward model service to be temporarily offloaded (to free up GPU memory) during intensive training sessions and reloaded when the service is required again.
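For illustration, here is a minimal sketch of the offload flow this change enables. It uses the `vllm_execute_method`, `SERVER_BACKEND`, and `USE_OFFLOAD` names from the diff; `score_batch` and `prompts` are hypothetical stand-ins for the batch reward-scoring call and its inputs:

```python
# Minimal sketch of the sleep/wake-up offload flow (score_batch and prompts are hypothetical).
if SERVER_BACKEND == "VLLM" and USE_OFFLOAD:
    vllm_execute_method("wake_up")  # reload the reward model onto the GPU
try:
    rewards = score_batch(prompts)  # run batch scoring while the model is resident
finally:
    if SERVER_BACKEND == "VLLM" and USE_OFFLOAD:
        vllm_execute_method("sleep")  # offload weights to free GPU memory for training
```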


@gemini-code-assist bot left a comment


Code Review

This pull request introduces a sleep/wakeup mechanism for the vLLM service to manage GPU resources effectively, along with a tqdm progress bar for better user experience during batch processing. The changes are well-intentioned. My review focuses on improving the robustness of the new functionality. I've pointed out a critical issue regarding potential resource leaks that should be addressed with a try...finally block, and a high-severity issue with the use of assert for runtime checks which should be replaced with proper exception handling. I also recommend reverting a change to zip to maintain strict checking of input data, preventing potential silent errors.

Comment on lines 117 to 132
  with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
      futures = []
-     for data_source, solution_str, ground_truth, extra_info in zip(
-         data_sources, solution_strs, ground_truths, extra_infos, strict=True
+     for data_source, solution_str, ground_truth, extra_info, index in zip(
+         data_sources, solution_strs, ground_truths, extra_infos, indexes, strict=False
      ):
-         future = executor.submit(compute_score, data_source, solution_str, ground_truth, extra_info)
+         future = executor.submit(compute_score, data_source, solution_str, ground_truth, extra_info, index)
+         time.sleep(0.001 * random.random())
          futures.append(future)

-     results = [future.result() for future in futures]
+     for future in tqdm.tqdm(as_completed(futures), total=len(futures)):
+         results.append(future.result())
+     results = sorted(results, key=lambda x: x[-1], reverse=False)
+     results = [result[0] for result in results]

+ if SERVER_BACKEND == "VLLM" and USE_OFFLOAD:
+     vllm_execute_method("sleep")

critical

The current implementation calls vllm_execute_method("wake_up") before processing and vllm_execute_method("sleep") after. If an exception occurs during batch processing (e.g., inside the with ThreadPoolExecutor block), the sleep call will be skipped. This will leave the model loaded in GPU memory, causing a resource leak.

To ensure the sleep method is always called to free up resources, you should wrap the processing logic in a try...finally block.

Example:

    try:
        with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
            # ... existing logic to submit and process futures ...
            # ... existing logic to sort and prepare results ...
    finally:
        if SERVER_BACKEND == "VLLM" and USE_OFFLOAD:
            vllm_execute_method("sleep")
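Note that the `with ThreadPoolExecutor(...)` block already joins its worker threads on exit, even when an exception propagates, so the `finally` clause only needs to guarantee the `sleep` call itself.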

Comment on lines +51 to +54
assert task in ["sleep", "wake_up"], f"Invalid task: {task}"
url_root = BASE_URL
response = requests.post(url_root + "/" + task)
assert response.status_code == 200

high

Using assert for runtime checks is not robust because assertions can be disabled with Python's -O flag, which would remove these crucial validations. For runtime error handling and input validation, it's better to use explicit conditional checks and raise appropriate exceptions.

The requests library provides response.raise_for_status() which is an idiomatic way to check for HTTP errors and will provide more context than a generic AssertionError.

Suggested change
- assert task in ["sleep", "wake_up"], f"Invalid task: {task}"
- url_root = BASE_URL
- response = requests.post(url_root + "/" + task)
- assert response.status_code == 200
+ if task not in {"sleep", "wake_up"}:
+     raise ValueError(f"Invalid task: {task}")
+ url_root = BASE_URL
+ response = requests.post(f"{url_root}/{task}")
+ response.raise_for_status()
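As a usage note, `raise_for_status()` raises `requests.HTTPError` with the status code and URL in its message, which gives callers far more context in logs than a bare `AssertionError`.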

- for data_source, solution_str, ground_truth, extra_info in zip(
-     data_sources, solution_strs, ground_truths, extra_infos, strict=True
+ for data_source, solution_str, ground_truth, extra_info, index in zip(
+     data_sources, solution_strs, ground_truths, extra_infos, indexes, strict=False

high

The zip function is called with strict=False. The previous version of this code used strict=True. Using strict=True is safer as it ensures that all input iterables have the same length, raising a ValueError if they don't. Switching to strict=False can hide potential bugs where input lists have mismatched lengths by silently truncating to the shortest list.

It is recommended to revert to strict=True to enforce data integrity.
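For a concrete illustration of the difference, using toy lists rather than the PR's data:

```python
list(zip([1, 2, 3], ["a", "b"], strict=False))  # [(1, 'a'), (2, 'b')]: the third item is silently dropped
list(zip([1, 2, 3], ["a", "b"], strict=True))   # ValueError: zip() argument 2 is shorter than argument 1
```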

Suggested change
- data_sources, solution_strs, ground_truths, extra_infos, indexes, strict=False
+ data_sources, solution_strs, ground_truths, extra_infos, indexes, strict=True

@wuxibin89
Collaborator

For an external reward service, I think it should always use separate GPU resources?

@none0663
Contributor Author

> For an external reward service, I think it should always use separate GPU resources?

Yes, it’s generally best practice to use dedicated GPU resources for the external reward service when resources are plentiful. However, when GPU resources are constrained, especially considering that many GPUs allocated for training might be idle during reward model scoring, we prefer to utilize those idle GPU resources for the reward service.

@wuxibin89 wuxibin89 changed the title [recipe] add: Add sleep/wakeup mode for gen rm vllm service and add tqdm showing process [recipe] feat: Add sleep/wakeup mode for gen rm vllm service and add tqdm showing process Jul 29, 2025
@wuxibin89 wuxibin89 merged commit 76298ad into volcengine:main Jul 29, 2025
7 of 8 checks passed
@ETOgaosion
Copy link
Collaborator

This PR seems to break the genrm test. @none0663 @wuxibin89

Surprisingly, CI itself didn't test this.

For #2794 to solve all CI faults in main, we may revert this first and add some robustness checks.

ETOgaosion added a commit that referenced this pull request Jul 30, 2025
…and add tqdm showing process" (#2813)

Reverts #2739

For #2794 to solve all CI faults.
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jul 31, 2025
…tqdm showing process (volcengine#2739)
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jul 31, 2025
Juniper1021 pushed a commit to Juniper1021/verl that referenced this pull request Aug 7, 2025
…tqdm showing process (volcengine#2739)
Juniper1021 pushed a commit to Juniper1021/verl that referenced this pull request Aug 7, 2025
whatadayG pushed a commit to whatadayG/verl that referenced this pull request Sep 5, 2025
…tqdm showing process (volcengine#2739)