-
Notifications
You must be signed in to change notification settings - Fork 2.4k
[trainer] fix: avoid loading duplicated custom reward function to fix issue #3150 #3404
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request correctly addresses issue #3150 by preventing the duplicate loading of a custom reward function module. The change checks sys.modules before importing the module from a file, which resolves pickling errors with ProcessPoolExecutor. I've added one suggestion to further improve the robustness of this fix by ensuring the specified reward function is validated in the module, whether it's newly loaded or was pre-existing.
|
Could you comment on Gemini's review? |
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
Hi, I think the suggestion made by Gemini would make the code more robust. |
What does this PR do?
This PR attempts to fix issue #3150, where custom reward function cannot be used with
reward_model.reward_manager=prime.The main cause of issue #3150 is that when using
verl.trainer.ppo.reward.get_custom_reward_fnto createreward_fnandval_reward_fnfor theRayPPOTrainer, the custom reward code file is loaded twice under the same module name "custom_module".This results in the actual reward functions
reward_fn.compute_scoreandval_reward_fn.compute_scorehaving the same module-function name (i.e., "custom_module") but differentid()values in memory, and onlyval_reward_fn.compute_scorehas the correct, latestid()address.Therefore, when using the "prime" reward manager, since it tries to use
concurrent.futures.ProcessPoolExecutorto pickle the reward functions for multiprocessing, theProcessPoolExecutorfinds thatreward_fn.compute_scorehas an incorrectid()compared to its global "custom_module"'s recordedid(), and thus fails to picklereward_fn.compute_score.The solution for this bug is to simply avoid loading duplicate modules from the same custom file in
verl.trainer.ppo.reward.get_custom_reward_fn.Checklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
API and Usage Example
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)