
Conversation

@ZhichaoWang970201
Contributor

Fix a typo in verl/workers/fsdp_workers.py:
original code: if self.model_config.generation_config is not None
updated code: if self.generation_config is not None

Add async evaluation for the generative reward model (GRM):
Because the generative reward model is slow to call, it is wasteful to wait for all responses to be generated before sending them to the GRM for evaluation. This change kicks off GRM evaluation as soon as each individual response finishes generating.

…into the async of response generation so that, for each individual response generated, reward generation starts at once. This is especially useful when the reward model is a Generative Reward Model (GRM).
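The per-response async pattern described above can be sketched with asyncio. All names below are illustrative stand-ins, not verl's actual API: generation and the GRM call are simulated with sleeps.

```python
import asyncio

async def generate_response(prompt: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for slow LLM response generation
    return f"response to {prompt}"

async def grm_evaluate(response: str) -> float:
    await asyncio.sleep(0.02)  # stand-in for a slow generative reward model call
    return float(len(response))

async def generate_and_score(prompt: str) -> float:
    response = await generate_response(prompt)
    # Reward evaluation for this response starts immediately,
    # overlapping with generation/scoring of the other prompts.
    return await grm_evaluate(response)

async def main(prompts):
    return await asyncio.gather(*(generate_and_score(p) for p in prompts))

rewards = asyncio.run(main(["p1", "p2", "p3"]))
print(rewards)
```

With gather, the slow GRM call for one response overlaps with work on the others, instead of all GRM calls starting only after the whole batch has been generated.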
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces two main changes: a typo fix in verl/workers/fsdp_workers.py and the addition of asynchronous processing for the Generative Reward Model (GRM) in recipe/one_step_off_policy/ray_trainer.py. The typo fix is correct. The asynchronous GRM evaluation is a good performance improvement, parallelizing reward computation for generated responses. My review includes one high-severity comment regarding a potential bug and inefficiency in how extra reward information is combined, with a suggested fix.

Comment on lines +383 to +386
combined_extras_dict = {}
if extras_list and extras_list[0]:
for key in extras_list[0].keys():
combined_extras_dict[key] = [d[key] for d in extras_list if key in d]

Severity: high

The current implementation for combining extras_list into combined_extras_dict has a potential bug and is inefficient. It only considers keys from the first dictionary in extras_list, ignoring any unique keys present in other dictionaries. Additionally, it iterates through extras_list for each key, which is inefficient for large lists.

A more robust and efficient approach would be to iterate through each dictionary once and collect all key-value pairs.

Suggested change
combined_extras_dict = {}
if extras_list and extras_list[0]:
for key in extras_list[0].keys():
combined_extras_dict[key] = [d[key] for d in extras_list if key in d]
combined_extras_dict = {}
for extras in extras_list:
if extras:
for key, value in extras.items():
combined_extras_dict.setdefault(key, []).append(value)
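The difference the review describes can be illustrated with a small, hypothetical input in which a later dictionary carries a key the first one lacks: the original combiner silently drops it, while the suggested one keeps it.

```python
# Hypothetical extras: the second dict has a "fmt" key absent from the first.
extras_list = [{"acc": 1.0}, {"acc": 0.5, "fmt": 0.9}]

def combine_original(extras_list):
    # Original logic: only keys from the first dict are considered.
    combined = {}
    if extras_list and extras_list[0]:
        for key in extras_list[0].keys():
            combined[key] = [d[key] for d in extras_list if key in d]
    return combined

def combine_suggested(extras_list):
    # Suggested logic: single pass over every dict, collecting all keys.
    combined = {}
    for extras in extras_list:
        if extras:
            for key, value in extras.items():
                combined.setdefault(key, []).append(value)
    return combined

print(combine_original(extras_list))   # {'acc': [1.0, 0.5]} -- 'fmt' is lost
print(combine_suggested(extras_list))  # {'acc': [1.0, 0.5], 'fmt': [0.9]}
```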

@ZhichaoWang970201
Contributor Author

@ji-huazhong Hey Huazhong, would you mind helping me review this pull request and letting me know if there is anything to fix? Thank you in advance.

…umentation

- Updated README.md and various requirements files for clarity and accuracy.
- Enhanced setup.py for better installation processes.
- Modified multiple GitHub workflows to improve CI/CD processes and ensure compatibility with recent changes.
- Refined documentation across several sections to provide clearer guidance and examples.
- Adjusted Dockerfiles and scripts for better performance and compatibility with new models and configurations.

This commit aims to streamline the development process and enhance user experience with updated documentation and improved workflows.
@vermouth1992 vermouth1992 changed the title Fix a Typo in One_Step_Off_Policy and Add async of Generative Reward Model in Response Generation [recipe] fix: Fix a Typo in One_Step_Off_Policy and Add async of Generative Reward Model in Response Generation Sep 26, 2025
@vermouth1992 vermouth1992 merged commit 377bbb8 into volcengine:main Sep 26, 2025
32 of 33 checks passed
masoudhashemi pushed a commit to masoudhashemi/verl that referenced this pull request Oct 19, 2025
…rative Reward Model in Response Generation (volcengine#3369)

Fix a typo in verl/workers/fsdp_workers.py: 
    original code: if self.model_config.generation_config is not None
    updated code: if self.generation_config is not None

Add async evaluation for the generative reward model (GRM):
Because the generative reward model is slow to call, it is wasteful
to wait for all responses to be generated before sending them to the
GRM for evaluation. This change kicks off GRM evaluation as soon as
each individual response finishes generating.

---------

Co-authored-by: zhichao (jimmy) <[email protected]>
techkang pushed a commit to techkang/verl that referenced this pull request Oct 31, 2025
mtian8 pushed a commit to mtian8/verl that referenced this pull request Nov 1, 2025
wangboxiong320 pushed a commit to wangboxiong320/verl that referenced this pull request Nov 1, 2025
