Implement Dual-Clip PPO Algorithm #784
Conversation
verl/trainer/config/ppo_trainer.yaml
```yaml
grad_clip: 1.0
clip_ratio: 0.2
use_dual_clip: False # add Dual-clip PPO from https://arxiv.org/pdf/1912.09729
clip_ratio_c: 3i # lower bound of the value for Dual-clip PPO from https://arxiv.org/pdf/1912.09729
```
typo here?
Fixed the typo in 73c3349.
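For reference, a sketch of what the corrected fragment presumably looks like, assuming the intended value was the numeric lower bound 3.0 (the exact committed values are in 73c3349):

```yaml
grad_clip: 1.0
clip_ratio: 0.2
# lower bound c for Dual-clip PPO (https://arxiv.org/pdf/1912.09729); must be > 1
clip_ratio_c: 3.0
```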
I guess the lower clip should be applied by default; could you remove the `use_dual_clip` option? Thanks
Fixed in this commit.
Fixed the Dual-Clip implementation bug after referring to the code from PPO dual clip and PPOxFamily: the dual-clip lower bound is only applied when advantages < 0.
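A minimal NumPy sketch of that behavior (the function name and signature are illustrative, not verl's actual API): the lower bound `c * A` is only taken when the advantage is negative, so the positive-advantage branch is the standard clipped PPO objective.

```python
import numpy as np

def dual_clip_ppo_objective(log_ratio, advantages, clip_eps=0.2, clip_c=3.0):
    """Per-token dual-clip PPO objective (hypothetical helper, not verl's API)."""
    ratio = np.exp(log_ratio)
    surr1 = ratio * advantages
    surr2 = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    clipped = np.minimum(surr1, surr2)  # standard PPO clipped objective
    # lower-bound the objective by c * A, but only where A < 0
    dual_clipped = np.maximum(clipped, clip_c * advantages)
    return np.where(advantages < 0, dual_clipped, clipped)
```

With a very large ratio and a negative advantage, the objective is bounded below by `clip_c * A` instead of growing arbitrarily negative; positive advantages are untouched.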
Could you please fix the format?
I have already run the
Could you help rebase onto main? The format issue was fixed in the main branch.
Force-pushed from 203ef0a to 7fa176d.
Done, please rerun the CI test.
Add the [Dual-Clip PPO](https://arxiv.org/pdf/1912.09729) algorithm to enhance the current PPO implementation. Dual-Clip PPO applies a lower bound to the policy objective when the advantage is negative, so that the loss, even when multiplied by a very large ratio, does not exceed that bound. The concept is illustrated in the figure below:

<img width="626" alt="Clipboard_Screenshot_1743047374" src="https://github.com/user-attachments/assets/93952edc-30c8-477e-bc3d-4770fabe55b8" />

So the final PPO loss is:

<img width="624" alt="Clipboard_Screenshot_1743047410" src="https://github.com/user-attachments/assets/5900490b-f64a-4bde-87d6-8359615b3337" />

This adjustment modifies the final PPO loss calculation, which could improve training stability and performance in certain scenarios. I believe integrating this feature could provide significant benefits, and I look forward to feedback on this suggestion.
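For reference, the dual-clip objective from the paper can be written out as follows, using $r$ for the probability ratio, $A$ for the advantage, $\epsilon$ for the clip range, and $c > 1$ for the lower-bound coefficient:

```latex
L(r, A) =
\begin{cases}
\min\bigl(rA,\ \operatorname{clip}(r,\, 1-\epsilon,\, 1+\epsilon)\, A\bigr), & A \ge 0 \\[4pt]
\max\Bigl(\min\bigl(rA,\ \operatorname{clip}(r,\, 1-\epsilon,\, 1+\epsilon)\, A\bigr),\ cA\Bigr), & A < 0
\end{cases}
```

When $A < 0$, the outer $\max$ with $cA$ prevents the objective from becoming arbitrarily negative as $r$ grows; when $A \ge 0$, it reduces to the standard clipped PPO objective.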