[recipe] fix: Fix a Typo in One_Step_Off_Policy and Add async of Generative Reward Model in Response Generation #3369
Conversation
…into the async of response generation, so that reward generation starts immediately for each individual response as it is generated. This is very useful when the reward model is a Generative Reward Model (GRM).
Code Review
This pull request introduces two main changes: a typo fix in verl/workers/fsdp_workers.py and the addition of asynchronous processing for the Generative Reward Model (GRM) in recipe/one_step_off_policy/ray_trainer.py. The typo fix is correct. The asynchronous GRM evaluation is a good performance improvement, parallelizing reward computation for generated responses. My review includes one high-severity comment regarding a potential bug and inefficiency in how extra reward information is combined, with a suggested fix.
    combined_extras_dict = {}
    if extras_list and extras_list[0]:
        for key in extras_list[0].keys():
            combined_extras_dict[key] = [d[key] for d in extras_list if key in d]
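To make the issue concrete, here is a minimal, self-contained reproduction of the snippet above on hypothetical sample data (the `extras_list` contents are invented for illustration): a key that appears only in a later dictionary is silently dropped, because only the first dictionary's keys are ever inspected.

```python
# Hypothetical sample data: "latency" appears only in the second dict.
extras_list = [
    {"score": 0.9},
    {"score": 0.7, "latency": 1.2},
]

combined_extras_dict = {}
if extras_list and extras_list[0]:
    for key in extras_list[0].keys():
        combined_extras_dict[key] = [d[key] for d in extras_list if key in d]

# "latency" is lost: only keys from extras_list[0] were considered.
print(combined_extras_dict)  # {'score': [0.9, 0.7]}
```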
The current implementation for combining extras_list into combined_extras_dict has a potential bug and is inefficient. It only considers keys from the first dictionary in extras_list, ignoring any unique keys present in other dictionaries. Additionally, it iterates through extras_list for each key, which is inefficient for large lists.
A more robust and efficient approach would be to iterate through each dictionary once and collect all key-value pairs.
Suggested change:

    # Before
    combined_extras_dict = {}
    if extras_list and extras_list[0]:
        for key in extras_list[0].keys():
            combined_extras_dict[key] = [d[key] for d in extras_list if key in d]

    # After
    combined_extras_dict = {}
    for extras in extras_list:
        if extras:
            for key, value in extras.items():
                combined_extras_dict.setdefault(key, []).append(value)
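Running the suggested version on the same kind of hypothetical data shows the difference: each dictionary is visited exactly once, and every key is preserved, including keys absent from the first dictionary.

```python
# Hypothetical sample data, as in the review discussion.
extras_list = [
    {"score": 0.9},
    {"score": 0.7, "latency": 1.2},
]

# Suggested merge: one pass over the list, collecting all keys.
combined_extras_dict = {}
for extras in extras_list:
    if extras:
        for key, value in extras.items():
            combined_extras_dict.setdefault(key, []).append(value)

print(combined_extras_dict)  # {'score': [0.9, 0.7], 'latency': [1.2]}
```

Note that `dict.setdefault(key, [])` creates the list on first sight of a key, so no pre-scan of the keys is needed.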
@ji-huazhong Hey Huazhong, would you mind helping me review this pull request and letting me know if there is anything to fix? Thank you in advance.
…umentation
- Updated README.md and various requirements files for clarity and accuracy.
- Enhanced setup.py for better installation processes.
- Modified multiple GitHub workflows to improve CI/CD processes and ensure compatibility with recent changes.
- Refined documentation across several sections to provide clearer guidance and examples.
- Adjusted Dockerfiles and scripts for better performance and compatibility with new models and configurations.

This commit aims to streamline the development process and enhance user experience with updated documentation and improved workflows.
…rative Reward Model in Response Generation (volcengine#3369)

Fix a typo in verl/workers/fsdp_workers.py:
original code: if self.model_config.generation_config is not None
updated code: if self.generation_config is not None

Add async of generative reward model (GRM): Because calls to the generative reward model are slow, it is unreasonable to wait for all responses to be generated before sending them to the GRM for evaluation. So I added async handling to start GRM evaluation as soon as each individual response finishes generating.

Co-authored-by: zhichao (jimmy) <[email protected]>
Fix a typo in verl/workers/fsdp_workers.py:
original code: if self.model_config.generation_config is not None
updated code: if self.generation_config is not None
Add async of generative reward model (GRM):
Because calls to the generative reward model are slow, it is unreasonable to wait for all responses to be generated before sending them to the GRM for evaluation. So I added async handling to start GRM evaluation as soon as each individual response finishes generating.
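The pattern described above can be sketched with `asyncio`. This is a minimal illustration, not the trainer's actual implementation: `generate_response` and `grm_evaluate` are hypothetical stand-ins for the rollout and GRM calls (the real code in recipe/one_step_off_policy/ray_trainer.py dispatches to Ray workers). The key point is that each response's reward evaluation starts as soon as that response finishes, overlapping with generation of the other responses.

```python
import asyncio

# Hypothetical stand-in for the rollout worker's per-prompt generation.
async def generate_response(prompt: str) -> str:
    await asyncio.sleep(0.01)  # simulate generation latency
    return f"response to {prompt}"

# Hypothetical stand-in for the slow generative reward model call.
async def grm_evaluate(response: str) -> float:
    await asyncio.sleep(0.02)  # simulate slow GRM latency
    return float(len(response))

async def generate_and_score(prompt: str) -> float:
    # Score this response immediately after it is generated,
    # instead of waiting for the whole batch to finish first.
    response = await generate_response(prompt)
    return await grm_evaluate(response)

async def main(prompts):
    # One task per prompt; generation and GRM calls overlap across prompts.
    return await asyncio.gather(*(generate_and_score(p) for p in prompts))

rewards = asyncio.run(main(["p1", "p2", "p3"]))
print(rewards)
```

With sequential batching, total time is roughly (all generation) + (all GRM calls); with this per-response pipelining, GRM evaluation of early responses hides behind the remaining generation.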