[recipe] fix: Fix a Typo in One_Step_Off_Policy and Add async of Generative Reward Model in Response Generation #3369
Conversation
…into the async of response generation, so that reward generation starts immediately for each individual response as it is generated. This is very useful when the reward model is a Generative Reward Model (GRM).
Code Review
This pull request introduces two main changes: a typo fix in verl/workers/fsdp_workers.py and the addition of asynchronous processing for the Generative Reward Model (GRM) in recipe/one_step_off_policy/ray_trainer.py. The typo fix is correct. The asynchronous GRM evaluation is a good performance improvement, parallelizing reward computation for generated responses. My review includes one high-severity comment regarding a potential bug and inefficiency in how extra reward information is combined, with a suggested fix.
    combined_extras_dict = {}
    if extras_list and extras_list[0]:
        for key in extras_list[0].keys():
            combined_extras_dict[key] = [d[key] for d in extras_list if key in d]
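To make the issue concrete, here is a minimal, self-contained reproduction of the snippet above on hypothetical sample data (the `extras_list` contents are invented for illustration): a key that appears only in a later dictionary is silently dropped, because only the first dictionary's keys are ever inspected.

```python
# Hypothetical sample data: "latency" appears only in the second dict.
extras_list = [
    {"score": 0.9},
    {"score": 0.7, "latency": 1.2},
]

combined_extras_dict = {}
if extras_list and extras_list[0]:
    for key in extras_list[0].keys():
        combined_extras_dict[key] = [d[key] for d in extras_list if key in d]

# "latency" is lost: only keys from extras_list[0] were considered.
print(combined_extras_dict)  # {'score': [0.9, 0.7]}
```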
The current implementation for combining extras_list into combined_extras_dict has a potential bug and is inefficient. It only considers keys from the first dictionary in extras_list, ignoring any unique keys present in other dictionaries. Additionally, it iterates through extras_list for each key, which is inefficient for large lists.
A more robust and efficient approach would be to iterate through each dictionary once and collect all key-value pairs.
Suggested change:

    # Before
    combined_extras_dict = {}
    if extras_list and extras_list[0]:
        for key in extras_list[0].keys():
            combined_extras_dict[key] = [d[key] for d in extras_list if key in d]

    # After
    combined_extras_dict = {}
    for extras in extras_list:
        if extras:
            for key, value in extras.items():
                combined_extras_dict.setdefault(key, []).append(value)
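Running the suggested version on the same kind of hypothetical data shows the difference: each dictionary is visited exactly once, and every key is preserved, including keys absent from the first dictionary.

```python
# Hypothetical sample data, as in the review discussion.
extras_list = [
    {"score": 0.9},
    {"score": 0.7, "latency": 1.2},
]

# Suggested merge: one pass over the list, collecting all keys.
combined_extras_dict = {}
for extras in extras_list:
    if extras:
        for key, value in extras.items():
            combined_extras_dict.setdefault(key, []).append(value)

print(combined_extras_dict)  # {'score': [0.9, 0.7], 'latency': [1.2]}
```

Note that `dict.setdefault(key, [])` creates the list on first sight of a key, so no pre-scan of the keys is needed.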
@ji-huazhong Hey Huazhong, would you mind helping me review this pull request and letting me know if there is anything to fix? Thank you in advance.
…umentation
- Updated README.md and various requirements files for clarity and accuracy.
- Enhanced setup.py for better installation processes.
- Modified multiple GitHub workflows to improve CI/CD processes and ensure compatibility with recent changes.
- Refined documentation across several sections to provide clearer guidance and examples.
- Adjusted Dockerfiles and scripts for better performance and compatibility with new models and configurations.

This commit aims to streamline the development process and enhance user experience with updated documentation and improved workflows.
…rative Reward Model in Response Generation (volcengine#3369)

Fix a typo in verl/workers/fsdp_workers.py:
original code: if self.model_config.generation_config is not None
updated code: if self.generation_config is not None

Add async of generative reward model (GRM): Because calls to the generative reward model are slow, it is unreasonable to wait for all responses to be generated before sending them to the GRM for evaluation. So I added async handling to start GRM evaluation as soon as each individual response finishes generating.

Co-authored-by: zhichao (jimmy) <[email protected]>
Fix a typo in verl/workers/fsdp_workers.py:
original code: if self.model_config.generation_config is not None
updated code: if self.generation_config is not None
Add async of generative reward model (GRM):
Because calls to the generative reward model are slow, it is unreasonable to wait for all responses to be generated before sending them to the GRM for evaluation. So I added async handling to start GRM evaluation as soon as each individual response finishes generating.
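The pattern described above can be sketched with `asyncio`. This is a minimal illustration, not the trainer's actual implementation: `generate_response` and `grm_evaluate` are hypothetical stand-ins for the rollout and GRM calls (the real code in recipe/one_step_off_policy/ray_trainer.py dispatches to Ray workers). The key point is that each response's reward evaluation starts as soon as that response finishes, overlapping with generation of the other responses.

```python
import asyncio

# Hypothetical stand-in for the rollout worker's per-prompt generation.
async def generate_response(prompt: str) -> str:
    await asyncio.sleep(0.01)  # simulate generation latency
    return f"response to {prompt}"

# Hypothetical stand-in for the slow generative reward model call.
async def grm_evaluate(response: str) -> float:
    await asyncio.sleep(0.02)  # simulate slow GRM latency
    return float(len(response))

async def generate_and_score(prompt: str) -> float:
    # Score this response immediately after it is generated,
    # instead of waiting for the whole batch to finish first.
    response = await generate_response(prompt)
    return await grm_evaluate(response)

async def main(prompts):
    # One task per prompt; generation and GRM calls overlap across prompts.
    return await asyncio.gather(*(generate_and_score(p) for p in prompts))

rewards = asyncio.run(main(["p1", "p2", "p3"]))
print(rewards)
```

With sequential batching, total time is roughly (all generation) + (all GRM calls); with this per-response pipelining, GRM evaluation of early responses hides behind the remaining generation.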