
Conversation

@echo-rain echo-rain (Contributor) commented Aug 20, 2025

What does this PR do?

This PR builds on PR#3055 and further supports asynchronous reward model calculation in the agent loop, which previously supported only asynchronous reward function calculation.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots, evaluation results, etc.

API and Usage Example

To use this feature, add the following configuration items to the startup script:

    reward_model.enable_resource_pool=True 
    reward_model.n_gpus_per_node=1 
    reward_model.nnodes=1 
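
For reference, a minimal sketch of how these settings might be interpreted when building a dedicated reward model resource pool; the OmegaConf usage and the pool-spec shape are illustrative assumptions, not the PR's actual wiring:

```python
from omegaconf import OmegaConf

# Illustrative config fragment matching the overrides above.
config = OmegaConf.create({
    "reward_model": {"enable_resource_pool": True, "n_gpus_per_node": 1, "nnodes": 1}
})

if config.reward_model.enable_resource_pool:
    if config.reward_model.n_gpus_per_node <= 0 or config.reward_model.nnodes <= 0:
        raise ValueError("reward_model pool requires n_gpus_per_node > 0 and nnodes > 0")
    # One bundle of n_gpus_per_node GPUs per node, nnodes nodes in total,
    # kept separate from the actor/rollout pool.
    reward_pool_spec = [config.reward_model.n_gpus_per_node] * config.reward_model.nnodes
    print(reward_pool_spec)  # [1] -> a single node contributing 1 GPU
```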

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

  • Read the Contribute Guide.
  • Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
  • Add / Update the documentation.
  • Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
  • Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)

@CLAassistant commented Aug 20, 2025

CLA assistant check
All committers have signed the CLA.

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces asynchronous reward model calculation within the agent loop. It achieves this by passing the reward model worker group (rm_wg) down to the AgentLoopWorker and RewardManagerWorker. A new reward_wrapper method is introduced to handle the asynchronous scoring. The changes also include support for a dedicated resource pool for the reward model, improving resource management. The overall implementation is consistent and correctly enables the new asynchronous workflow. My review includes one high-severity suggestion to improve configuration validation by raising ValueError instead of using assert, since assertions can be disabled in production environments.
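
For readers who want a feel for the flow described above, here is a rough, hypothetical sketch of an async reward wrapper; the names rm_wg and reward_wrapper come from the review summary, but the class layout, method signatures, and compute_rm_score call are illustrative assumptions rather than the PR's actual code:

```python
import asyncio

class RewardManagerWorker:
    """Hypothetical sketch: scores finished rollouts without blocking the agent loop."""

    def __init__(self, rm_wg, reward_fn):
        self.rm_wg = rm_wg          # reward model worker group handle (assumed)
        self.reward_fn = reward_fn  # existing async rule-based reward function

    async def reward_wrapper(self, sample):
        # Run the reward function and the model-based score concurrently.
        fn_score, model_score = await asyncio.gather(
            self.reward_fn(sample),
            self._model_score(sample),
        )
        # How the two scores are combined is up to the reward manager; sum is
        # used here purely for illustration.
        return fn_score + model_score

    async def _model_score(self, sample):
        # Offload the synchronous worker-group call to a thread so other
        # agent-loop coroutines keep running while the reward model computes.
        return await asyncio.to_thread(self.rm_wg.compute_rm_score, sample)
```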

Comment on lines 174 to 175
assert config.reward_model.n_gpus_per_node > 0, "config.reward_model.n_gpus_per_node must be greater than 0"
assert config.reward_model.nnodes > 0, "config.reward_model.nnodes must be greater than 0"

Severity: high

Using assert for validating user configuration can be risky, as assertions can be disabled with Python's -O (optimize) flag. This could lead to silent failures or unexpected behavior in production environments if the configuration is invalid. It's safer to raise a ValueError for configuration errors to ensure the validation is always performed.

Suggested change

Remove:

    assert config.reward_model.n_gpus_per_node > 0, "config.reward_model.n_gpus_per_node must be greater than 0"
    assert config.reward_model.nnodes > 0, "config.reward_model.nnodes must be greater than 0"

Replace with:

    if config.reward_model.n_gpus_per_node <= 0:
        raise ValueError("config.reward_model.n_gpus_per_node must be greater than 0")
    if config.reward_model.nnodes <= 0:
        raise ValueError("config.reward_model.nnodes must be greater than 0")

@echo-rain echo-rain force-pushed the main_dev branch 2 times, most recently from 24357d6 to 248e084 Compare August 20, 2025 13:16
@echo-rain echo-rain changed the title [WIP][rollout] feat: Added asynchronous reward model calculation in agent loop [WIP][BREAKING][rollout] feat: Added asynchronous reward model calculation in agent loop Aug 21, 2025
@echo-rain echo-rain changed the title [WIP][BREAKING][rollout] feat: Added asynchronous reward model calculation in agent loop [BREAKING][rollout] feat: Added asynchronous reward model calculation in agent loop Aug 22, 2025
@wuxibin89 (Collaborator)

There's a big difference between reward model as a service and reward model as a worker group.

Reward model as a service: requests are sent to an OpenAI-compatible HTTP server via the chat completion API, e.g. a vLLM or SGLang server. These HTTP servers accept individual requests and batch them automatically, so in the agent loop we can send a sequence to the reward service as soon as its generation finishes.

However, a reward model as a worker group lacks automatic request batching, so individual requests sent from the agent loop are processed serially, which is very inefficient.
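
To make the contrast concrete, a minimal sketch of the two dispatch patterns follows; the endpoint, payload, and compute_rm_score batch call are assumptions for illustration only:

```python
import asyncio
import aiohttp

# Reward model as a service: each finished sequence can be scored immediately;
# the OpenAI-compatible server batches concurrent requests on its own.
async def score_via_service(session: aiohttp.ClientSession, messages: list[dict]) -> float:
    resp = await session.post(
        "http://reward-server:8000/v1/chat/completions",  # assumed endpoint
        json={"model": "reward-model", "messages": messages},
    )
    data = await resp.json()
    # Assumes the reward server returns a numeric score as the message content.
    return float(data["choices"][0]["message"]["content"])

# Reward model as a worker group: no automatic batching, so per-sample calls
# would be processed serially; samples must be collected and scored as a batch.
async def score_via_worker_group(rm_wg, samples: list) -> list[float]:
    # compute_rm_score is assumed to accept a batch and return per-sample scores.
    return await asyncio.to_thread(rm_wg.compute_rm_score, samples)
```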

@echo-rain (Contributor, Author)

> There's a big difference between reward model as a service and reward model as a worker group.
>
> Reward model as a service: requests are sent to an OpenAI-compatible HTTP server via the chat completion API, e.g. a vLLM or SGLang server. These HTTP servers accept individual requests and batch them automatically, so in the agent loop we can send a sequence to the reward service as soon as its generation finishes.
>
> However, a reward model as a worker group lacks automatic request batching, so individual requests sent from the agent loop are processed serially, which is very inefficient.

You are right. In the next commit I will add an executor that performs the real batched execution; processing a single request will only register it with the executor.
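
A minimal sketch of the kind of batching executor described here, assuming the worker group exposes a batch scoring call; the class name, batch size, and timeout are illustrative:

```python
import asyncio

class BatchRewardExecutor:
    """Hypothetical sketch: registers single requests and scores them as one batch."""

    def __init__(self, rm_wg, max_batch_size: int = 64, max_wait_s: float = 0.05):
        self.rm_wg = rm_wg
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self._pending: list[tuple[object, asyncio.Future]] = []
        self._lock = asyncio.Lock()

    async def submit(self, sample) -> float:
        # Register the request; the result arrives once the batch is flushed.
        fut = asyncio.get_running_loop().create_future()
        async with self._lock:
            self._pending.append((sample, fut))
            if len(self._pending) == 1:
                asyncio.create_task(self._flush_after_timeout())
            if len(self._pending) >= self.max_batch_size:
                await self._flush()
        return await fut

    async def _flush_after_timeout(self):
        await asyncio.sleep(self.max_wait_s)
        async with self._lock:
            await self._flush()

    async def _flush(self):
        # Caller must hold self._lock.
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        samples = [sample for sample, _ in batch]
        # Assumed batch API on the reward model worker group.
        scores = await asyncio.to_thread(self.rm_wg.compute_rm_score, samples)
        for (_, fut), score in zip(batch, scores):
            fut.set_result(score)
```

With something like this, each finished rollout in the agent loop would simply await executor.submit(sample), and the executor decides when the real batched compute_rm_score call happens.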

@echo-rain (Contributor, Author)

> There's a big difference between reward model as a service and reward model as a worker group.
>
> Reward model as a service: requests are sent to an OpenAI-compatible HTTP server via the chat completion API, e.g. a vLLM or SGLang server. These HTTP servers accept individual requests and batch them automatically, so in the agent loop we can send a sequence to the reward service as soon as its generation finishes.
>
> However, a reward model as a worker group lacks automatic request batching, so individual requests sent from the agent loop are processed serially, which is very inefficient.

The latest commit adds dynamic batching for reward model inference. While verifying performance data, we found that using a separate resource pool for the reward model in the on-policy scenario results in low compute resource utilization.

To address this, one potential solution is to give rollout and reward mutually exclusive resource pools, with the actor sharing the compute resources of both. However, the current resource pool isolation mechanism does not appear easy to modify to support this, so it is not implemented in this PR.

@wuxibin89 wuxibin89 merged commit 844c929 into volcengine:main Sep 2, 2025
57 of 58 checks passed
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Sep 4, 2025
… in agent loop (volcengine#3152)

cczitong123 pushed a commit to cczitong123/verl that referenced this pull request Sep 5, 2025
… in agent loop (volcengine#3152)

DDVD233 pushed a commit to DDVD233/mirl that referenced this pull request Sep 5, 2025
… in agent loop (volcengine#3152)

WncFht pushed a commit to WncFht/verl that referenced this pull request Oct 10, 2025
… in agent loop (volcengine#3152)

masoudhashemi pushed a commit to masoudhashemi/verl that referenced this pull request Oct 19, 2025
… in agent loop (volcengine#3152)

techkang pushed a commit to techkang/verl that referenced this pull request Oct 31, 2025
… in agent loop (volcengine#3152)
